Topic: A web crawler written in Ruby
This post has been voted hidden.
Posted: 2009-12-09
Last edited: 2010-11-10
Run it with:

    ruby Crawl.rb 2 1000 http://www-cs.stanford.edu/People

where the arguments are: 2 -> max_depth, 1000 -> max_pages, and http://www-cs.stanford.edu/People -> start URL. Results are written to the text file emails_md[max_depth]_mp[max_pages]_[URL].txt.

Apparently I can't delete the post, so I had pulled the code out before it could get voted hidden again, heh. On second thought I'm pasting it back, otherwise even I won't be able to find it later, haha.

```ruby
require 'open-uri'
require 'thread' # Mutex lives here on the Ruby 1.8 of the day

# Run it like this:
#   ruby Crawl.rb 2 1000 http://www-cs.stanford.edu/People

# Regexps
$link_regexp    = /href\=\"[^\"]*\"/
$email_regexp_1 = /mailto\:[^\@]*\@[^\"]*[\"]/ # mailto:xx@xxxx"
$email_regexp_2 = /[\>][^\<]*\@[^\>]*[\<]/     # >xx@xx<
$before_at      = /[a-zA-Z0-9]+[_?a-zA-Z0-9]+/
$after_at       = /[a-zA-Z]+[-?a-zA-Z]*\.+[a-zA-Z]+/
$email_regexp   = /#{$before_at}\@#{$after_at}/ # xx@xx.xx

# Command-line arguments
if ARGV == nil || ARGV.length < 3
  puts '-- Command --'
  puts 'ruby Crawl.rb 2 1000 http://www-cs.stanford.edu/People'
  puts 'help: 2->max_depth, 1000->max_pages, http://www-cs.stanford.edu/People->url'
  exit(0)
end
$url        = ARGV[2]
$max_depth  = ARGV[0].to_i
$max_pages  = ARGV[1].to_i
$fname       = "emails_md#{$max_depth}_mp#{$max_pages}_#{$url.gsub(/[\/\:]/, '_')}.txt"
$fname_links = "links_md#{$max_depth}_mp#{$max_pages}_#{$url.gsub(/[\/\:]/, '_')}.txt" # declared but never written
$thread_num = 10
$debug      = true

$mutex         = Mutex.new # one shared lock; a fresh Mutex per thread synchronizes nothing
$links_stack   = []        # FIFO queue of [depth, url] pairs
$links_crawled = []        # urls already fetched
$emails        = []        # unique addresses found so far

class Crawl
  def initialize url, depth
    @url = url
    @url = @url.slice(0, @url.length - 1) while @url[-1, 1] == '/'
    @depth = depth
    begin
      @html = URI.open(@url).read # plain open() on the Ruby 1.8 of the day
    rescue
      @html = '' # unreachable page: treat as empty
    end
  end

  # Queue every acceptable link on the page together with its depth.
  def get_links
    @html.scan($link_regexp) do |match|
      u = Util.format_url(match, @url)
      if u != nil && !$links_crawled.include?(u) && $links_stack.rassoc(u) == nil
        $links_stack.push [@depth, u]
      end
    end
  end

  # Report every new address to the screen and the output file.
  def get_emails
    (@html.scan($email_regexp_1) + @html.scan($email_regexp_2)).each do |match|
      match = Util.format_email(match)
      if match != nil && !$emails.include?(match)
        $emails.push match
        msg = match + ', ' + @url
        puts msg
        Util.write($fname, msg + "\r\n")
      end
    end
  end
end

class Util
  # Turn a matched href attribute into an absolute url, or nil if unwanted.
  def Util.format_url url, f_url
    f_url = f_url.gsub(/www\-/, '') # drop 'www-' from the base url
    url = url[6, url.length - 7]    # strip the 6-char href=" prefix and the closing quote
    # exclude css, js and anchor links (e.g. .../faculty#Regular%20Faculty)
    return nil if Util.exclude(url) == nil || url.include?('#')
    if url[0, 4] != 'http' # relative path: resolve against the base url
      url = url.slice(1, url.length - 1) while url.index('/') == 0
      url = f_url + '/' + url
    end
    url
  end

  # Strip the markup around an address; return it downcased, or nil if invalid.
  def Util.format_email email
    # sub, not delete: String#delete removes characters, so the original
    # delete('mailto:') also ate every m, a, i, l, t, o and ':' in the address
    email = email.sub(/\Amailto\:/, '').delete('><"').strip
    String($email_regexp.match(email)) == email ? email.downcase : nil
  end

  # Append msg to the file fname.
  def Util.write fname, msg
    File.open(fname, 'a') { |file| file << msg }
  end

  # Skip asset urls by extension.
  def Util.exclude str
    ['css', 'js', 'pdf', 'jpg'].each do |e|
      index = e.length + 1
      return nil if str.length > index && str[-index, index] == '.' + e
    end
    str
  end
end

$count = 1
0.upto($max_depth) do |i|
  puts '~~depth->' + String(i) if $debug
  if i == 0 # seed the queue from the start page
    c = Crawl.new($url, i + 1)
    c.get_links
    c.get_emails
    $links_crawled.push $url
  end
  # breadth first: links queued while crawling depth i carry the tag i+1, and
  # the whole level is drained before the loop moves down to the next depth
  while $links_stack.length != 0
    if $debug
      puts '~~count->' + String($count) + ',stack->' + String($links_stack.length) +
           ',crawled->' + String($links_crawled.length) +
           ',total->' + String($links_crawled.length + $links_stack.length)
      $count = $count + 1
    end
    # past the page budget or at the last level, only read emails off queued pages
    emails_only = ($links_crawled.length + $links_stack.length) > $max_pages || i == $max_depth
    # head of the FIFO belongs to a deeper level: this level is finished
    break if !emails_only && $links_stack.first[0] != i + 1
    threads = []
    ts = $links_stack.length >= $thread_num ? $thread_num : $links_stack.length
    ts.times do
      threads << Thread.new do
        link = nil
        $mutex.synchronize do
          link = $links_stack.shift
          if !emails_only && link != nil && link[0] != i + 1
            $links_stack.unshift link # deeper level: put it back for the next pass
            link = nil
          end
        end
        next if link == nil
        c = Crawl.new(link[1], i + 2) # fetch outside the lock
        $mutex.synchronize do
          c.get_links unless emails_only
          c.get_emails
          $links_crawled.push link[1]
        end
      end
    end
    threads.each { |t| t.join }
  end
end
```
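A quick way to see what the three email patterns do together: the two scan patterns pull candidate spans out of raw HTML (a mailto: attribute and a >text< element body), and $email_regexp then has to match the cleaned-up remainder exactly. Here is a minimal sketch of that pipeline; the HTML fragment and addresses are made up for illustration:

```ruby
before_at = /[a-zA-Z0-9]+[_?a-zA-Z0-9]+/
after_at  = /[a-zA-Z]+[-?a-zA-Z]*\.+[a-zA-Z]+/
email_ok  = /#{before_at}\@#{after_at}/

# Invented sample page: one mailto link, one address in element text.
html = '<a href="mailto:Alice@example.edu">mail me</a> <td>bob@example.org</td>'

candidates = html.scan(/mailto\:[^\@]*\@[^\"]*[\"]/) + # mailto:xx@xxxx"
             html.scan(/[\>][^\<]*\@[^\>]*[\<]/)       # >xx@xx<

emails = candidates.map { |m|
  m = m.sub(/\Amailto\:/, '').delete('><"').strip # same cleanup as Util.format_email
  String(email_ok.match(m)) == m ? m.downcase : nil
}.compact.uniq

p emails # => ["alice@example.edu", "bob@example.org"]
```

One quirk worth knowing, as far as I can read $after_at: it admits only a single dot group, so an address at a multi-label host such as someone@cs.stanford.edu never matches as a whole string and is silently dropped.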
Posted: 2009-12-11

```ruby
$link_regexp = /href\=\"[^\"]*(\")$/
```
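The bare reply doesn't say what this change is meant to fix. Two things to note before trying it: with a capture group, String#scan yields the captured `"` rather than the whole attribute, and the `$` anchor limits matches to hrefs whose closing quote ends a line. For reference, this is what the original pattern produces on a made-up fragment (sample markup invented for illustration):

```ruby
link_regexp = /href\=\"[^\"]*\"/

# Invented sample fragment.
html = '<a href="/People/faculty">Faculty</a> <a href="main.css">x</a>'

p html.scan(link_regexp)
# => ["href=\"/People/faculty\"", "href=\"main.css\""]

# The crawler then slices off the 6-char 'href="' prefix and the closing
# quote before resolving the path (see Util.format_url above):
p html.scan(link_regexp).map { |m| m[6, m.length - 7] }
# => ["/People/faculty", "main.css"]
```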
Posted: 2009-12-12

Heh, a bit frustrating that this already has six "hide" votes. Couldn't you cut me some slack? "Newbie post" votes I could accept; apparently everyone doing Ruby on JavaEye is an expert.
Posted: 2009-12-12

Very good!!! Though I didn't understand it!!
Posted: 2009-12-12

How does it achieve breadth-first traversal?
Posted: 2009-12-12

I don't know how to cast a "hide" vote yet, but I am learning Ruby, and I admire the OP's spirit of sharing. Learning!!
Posted: 2009-12-12
Last edited: 2009-12-12

So that's how it works.
Posted: 2009-12-13

I was about ready to just delete this post. Disheartening.

qiaoakai wrote: Very good!!! Though I didn't understand it!!

Heh, is the code really that badly written?

wangfsec wrote: How does it achieve breadth-first traversal?

It uses a queue.

hc_face wrote: I don't know how to cast a "hide" vote yet, but I am learning Ruby, and I admire the OP's spirit of sharing. Learning!!

Good thing you don't know how, or I'd have one more hide vote. Phew!

tianyuzhu wrote: So that's how it works.

So what exactly was the insight, friend?
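Picking up the "uses a queue" answer above: every discovered link is pushed to the back of $links_stack tagged with its depth, and pages are shifted from the front, so an entire level is processed before anything one level deeper, which is exactly breadth-first order. A stripped-down sketch of the same idea; the toy link graph is invented so the example runs without the network:

```ruby
# Toy link graph standing in for real pages (invented for illustration).
PAGES = {
  'A' => ['B', 'C'],
  'B' => ['D'],
  'C' => ['D', 'E'],
  'D' => [],
  'E' => []
}

def bfs_crawl start, max_depth
  queue   = [[0, start]] # FIFO of [depth, page] pairs, like $links_stack
  visited = [start]
  until queue.empty?
    depth, page = queue.shift        # take from the front = FIFO
    puts "depth #{depth}: #{page}"
    next if depth == max_depth       # depth limit: harvest but don't expand
    PAGES[page].each do |link|
      unless visited.include?(link)  # mirrors the include?/rassoc checks
        visited << link
        queue.push [depth + 1, link] # append to the back
      end
    end
  end
end

bfs_crawl 'A', 2
# depth 0: A
# depth 1: B
# depth 1: C
# depth 2: D
# depth 2: E
```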