Web Scraping

Hooopo

Blog category:

Ruby

Ruby Git HP Firebug UML

Firebug + Httpanalyzer + Hpricot + retry...
When necessary, add Iconv and net/http.

This combination seems quite powerful.


2010-03-11 10:02
Comments (23)
Category: Programming Languages

#23 Hooopo 2011-10-04

http://chunyemen.org/archives/557

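#22 Hooopo 2011-08-09

Lately I've found that many sites return charset gb2312 while the page contains characters outside the GB2312 character set. Really annoying.
Now I treat anything labeled gb2312 as gb18030, and the world went quiet.
You can also use string.encode("UTF-8", :undef => :replace, :replace => "?", :invalid => :replace)
That loses some characters, but at least it doesn't raise an exception.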
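
The gb2312-as-gb18030 trick can be sketched roughly as follows. This is a minimal illustration, not the author's code: `CHARSET_SUPERSETS` and `effective_charset` are hypothetical names, and the sample bytes are generated locally rather than fetched from a real site.

```ruby
# Hypothetical lookup table: declared charset => wider charset to decode with.
CHARSET_SUPERSETS = { "gb2312" => "GB18030" }.freeze

# Return the charset to actually decode with, given what the page declares.
def effective_charset(declared)
  CHARSET_SUPERSETS.fetch(declared.to_s.strip.downcase, declared)
end

# Simulate bytes arriving from a page that claims gb2312 but contains
# U+9555, a character commonly cited as missing from GB2312.
raw  = "\u9555".encode("GB18030").b
text = raw.force_encoding(effective_charset("gb2312")).encode("UTF-8")
```

Decoding with the wider charset keeps the character intact, whereas the `:replace` options would turn it into "?".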

#21 Hooopo 2011-07-20

str.encode("UTF-8", :invalid => :replace, :undef => :replace, :replace => "?")

Handles the "invalid byte sequence in UTF-8" error in Ruby 1.9.2.

#20 marshluca 2010-05-08

I'd recommend one too: SelectorGadget. It's very powerful.

#19 Hooopo 2010-04-22

Fix for the Hpricot or Nokogiri "stack level too deep" (SystemStackError) bug:
http://dalibornasevic.com/posts/5-ruby-stack-level-too-deep-systemstackerror

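#18 Hooopo 2010-04-19

Replacing HTML entities such as &nbsp; and numeric character references:

require 'hpricot'

# Convert named (&nbsp;) and decimal (&#160;) entities into UTF-8 characters.
def html_unescape(s)
  return s unless s
  s.gsub(/&(\w+|#[0-9]+);/) do |match|
    number = case match
             when /&(\w+);/
               Hpricot::NamedCharacters[$1]
             when /&#([0-9]+);/
               $1.to_i
             end

    # Leave unknown or unencodable entities untouched.
    number ? ([number].pack('U') rescue match) : match
  end
end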

#17 Hooopo 2010-04-19

Hooopo wrote:

Ran into a strange character.

>> a.each_byte do |b|
?> p b
>> end
194
160
=> " "
>> a.unpack("U*")
=> [160]
>>

Finally tracked it down. The character is nbsp (the non-breaking space, code point 160)... argh.
Hpricot ships a lookup table for these:

>> Hpricot::NamedCharacters
=> {"yacute"=>253, "Chi"=>935, "image"=>8465, "nbsp"=>160, "there4"=>8756, "euml"=>235, "piv"=>982, "ne"=>8800, "ograve"=>242, "zwj"=>8205, "THORN"=>222, "Atilde"=>195, "igrave"=>236, "sub"=>8834, "raquo"=>187, "hearts"=>9829, "cedil"=>184, "ni"=>8715, "asymp"=>8776, "rArr"=>8658, "aring"=>229, "Uacute"=>218, "perp"=>8869, "empty"=>8709, "ndash"=>8211, "acirc"=>226, "ordf"=>170, "ccedil"=>231, "sbquo"=>8218, "sube"=>8838, "zwnj"=>8204, "uuml"=>252, "Kappa"=>922, "lArr"=>8656, "macr"=>175, "phi"=>966, "beta"=>946, "exist"=>8707, "ucirc"=>251, "sdot"=>8901, "Sigma"=>931, "alpha"=>945, "pound"=>163, "Yacute"=>221, "sum"=>8721, "ge"=>8805, "scaron"=>353, "Psi"=>936, "AElig"=>198, "ordm"=>186, "crarr"=>8629, "Oslash"=>216, "Igrave"=>204, "real"=>8476, "clubs"=>9827, "oline"=>8254, "sup"=>8835, "Beta"=>914, "delta"=>948, "nsub"=>8836, "iuml"=>239, "theta"=>952, "nu"=>957, "frac12"=>189, "Auml"=>196, "alefsym"=>8501, "Ecirc"=>202, "amp"=>38, "frac14"=>188, "circ"=>710, "sect"=>167, "Omicron"=>927, "eta"=>951, "Nu"=>925, "Scaron"=>352, "Icirc"=>206, "Ccedil"=>199, "Prime"=>8243, "hArr"=>8660, "emsp"=>8195, "oacute"=>243, "rceil"=>8969, "iacute"=>237, "oslash"=>248, "iquest"=>191, "diams"=>9830, "epsilon"=>949, "larr"=>8592, "apos"=>39, "micro"=>181, "Ouml"=>214, "yen"=>165, "dagger"=>8224, "not"=>172, "Egrave"=>200, "cent"=>162, "frasl"=>8260, "rsquo"=>8217, "omicron"=>959, "uml"=>168, "eth"=>240, "curren"=>164, "Aring"=>197, "copy"=>169, "Epsilon"=>917, "spades"=>9824, "tilde"=>732, "Oacute"=>211, "para"=>182, "minus"=>8722, "trade"=>8482, "Iacute"=>205, "lang"=>9001, "otilde"=>245, "gt"=>62, "aelig"=>230, "pi"=>960, "rsaquo"=>8250, "Acirc"=>194, "ouml"=>246, "kappa"=>954, "Rho"=>929, "lambda"=>955, "darr"=>8595, "shy"=>173, "notin"=>8713, "chi"=>967, "ETH"=>208, "Tau"=>932, "szlig"=>223, "Pi"=>928, "Gamma"=>915, "cup"=>8746, "euro"=>8364, "uArr"=>8657, "agrave"=>224, "prod"=>8719, "egrave"=>232, "Alpha"=>913, "rfloor"=>8971, "Upsilon"=>933, "rarr"=>8594, "Otilde"=>213, 
"lfloor"=>8970, "Uuml"=>220, "brvbar"=>166, "ocirc"=>244, "bull"=>8226, "Lambda"=>923, "mu"=>956, "Zeta"=>918, "Dagger"=>8225, "Theta"=>920, "Agrave"=>192, "ecirc"=>234, "weierp"=>8472, "upsilon"=>965, "equiv"=>8801, "lrm"=>8206, "Mu"=>924, "hellip"=>8230, "rang"=>9002, "icirc"=>238, "le"=>8804, "quot"=>34, "oplus"=>8853, "zeta"=>950, "OElig"=>338, "Phi"=>934, "rlm"=>8207, "Omega"=>937, "permil"=>8240, "upsih"=>978, "ugrave"=>249, "thinsp"=>8201, "frac34"=>190, "thorn"=>254, "psi"=>968, "auml"=>228, "ensp"=>8194, "times"=>215, "prop"=>8733, "otimes"=>8855, "supe"=>8839, "part"=>8706, "aacute"=>225, "iota"=>953, "iexcl"=>161, "lceil"=>8968, "deg"=>176, "reg"=>174, "loz"=>9674, "cap"=>8745, "cong"=>8773, "and"=>8743, "nabla"=>8711, "harr"=>8596, "Yuml"=>376, "lsquo"=>8216, "lsaquo"=>8249, "sigmaf"=>962, "gamma"=>947, "eacute"=>233, "Eta"=>919, "isin"=>8712, "acute"=>180, "Iota"=>921, "rdquo"=>8221, "ang"=>8736, "mdash"=>8212, "sigma"=>963, "fnof"=>402, "atilde"=>227, "Ucirc"=>219, "Euml"=>203, "forall"=>8704, "Iuml"=>207, "Ugrave"=>217, "Ograve"=>210, "divide"=>247, "infin"=>8734, "lt"=>60, "ntilde"=>241, "oelig"=>339, "lowast"=>8727, "ldquo"=>8220, "Delta"=>916, "int"=>8747, "Ocirc"=>212, "or"=>8744, "sup1"=>185, "Aacute"=>193, "Eacute"=>201, "xi"=>958, "sup2"=>178, "plusmn"=>177, "rho"=>961, "yuml"=>255, "tau"=>964, "sup3"=>179, "laquo"=>171, "sim"=>8764, "bdquo"=>8222, "thetasym"=>977, "prime"=>8242, "uarr"=>8593, "uacute"=>250, "dArr"=>8659, "middot"=>183, "Xi"=>926, "omega"=>969, "Ntilde"=>209, "radic"=>8730}
>> Hpricot::NamedCharacters["nbsp"]
=> 160

Nokogiri has one as well:

>> p Nokogiri::HTML::NamedCharacters["nbsp"]
160
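
Once the character is identified, cleaning it up is straightforward. The sketch below is only an illustration: `NBSP` and `squish_nbsp` are hypothetical names, and replacing the non-breaking space with a plain space is one possible policy. As far as I know, `String#strip` leaves U+00A0 alone, which is why it is replaced first.

```ruby
# U+00A0 is encoded in UTF-8 as the two bytes 194 160 seen in the irb session.
NBSP = "\u00A0"

# Hypothetical helper: turn non-breaking spaces into ordinary spaces,
# then trim the ends of the string.
def squish_nbsp(text)
  text.gsub(NBSP, " ").strip
end

cleaned = squish_nbsp("\u00A0price\u00A0100\u00A0")
```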

#16 kaka2008 2010-04-15

hooopo, could you post a complete example of fetching a page with open-uri? I'd like to take a look.

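#15 Hooopo 2010-04-15

Getting the charset:

# Try each common spelling of the http-equiv value and keep the first match.
meta = ["Content-Type", "Content-type", "content-type"].map { |c| doc.at("meta[@http-equiv='#{c}']") }.compact.first

content_type = meta["content"] if meta.is_a?(Hpricot::Elem)
# Guard against pages that have no matching meta tag.
charset = content_type[/charset=([\w-]+)/i, 1] if content_type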
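
When no parser is at hand, the declared charset can also be pulled out of the raw bytes with a regular expression. This is a rough sketch under stated assumptions: `sniff_charset` is a hypothetical helper, it only looks at the first 4096 bytes, and it ignores the HTTP Content-Type header entirely.

```ruby
# Hypothetical helper: find a charset declared in a <meta> tag of raw HTML.
# Returns a lowercase charset name, or nil when none is declared.
def sniff_charset(html)
  # String#b yields a binary copy, so undecoded bytes cannot break matching.
  head = html.b[0, 4096]
  name = head[/<meta[^>]+charset\s*=\s*["']?([\w-]+)/i, 1]
  name && name.downcase
end

sniff_charset('<meta http-equiv="Content-Type" content="text/html; charset=gb2312">')
```

The same pattern also covers the shorter `<meta charset="...">` form.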

#14 Hooopo 2010-04-12

Found it... phew!
http://www.simplemachines.org/community/index.php?topic=165420.0

#13 Hooopo 2010-04-12

>> a.is_binary_data?
=> true

What on earth is this?

#12 Hooopo 2010-04-12

Ran into a strange character.

>> a.each_byte do |b|
?> p b
>> end
194
160
=> " "
>> a.unpack("U*")
=> [160]
>>

#11 Hooopo 2010-04-09

A segmentation fault occurs when an empty stream is passed.

Description: http://github.com/hpricot/hpricot/issues#issue/6

The fix is simple: avoid passing an empty stream.
Instead of open("http://...") use open("http://..").read
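
One way to make that habit explicit is a small guard that always hands the parser a string and skips parsing when nothing came back. This is a sketch only: `read_body` is a hypothetical helper, and returning nil for an empty body is an assumption rather than anything from the Hpricot issue.

```ruby
require 'stringio'

# Hypothetical helper: accept an IO-like object or a string and return the
# body as a string, or nil when there is nothing to parse.
def read_body(source)
  body = source.respond_to?(:read) ? source.read : source.to_s
  (body.nil? || body.empty?) ? nil : body
end

html = read_body(StringIO.new("<p>hi</p>"))
# Parse only when html is non-nil, for example: doc = Hpricot(html) if html
```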

#10 kaka2008 2010-04-09

For now open-uri is enough for me, haha.

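#9 Hooopo 2010-04-07

Hooopo wrote:

I wrote an open with exception handling...

require 'open-uri'

def safe_open(url, retries = 5, sleeep = 0.42, headers = {})
  begin
    # On Ruby older than 2.5, call open(url, headers) instead of URI.open.
    URI.open(url, headers).read
  rescue StandardError, Timeout::Error, SystemCallError, Errno::ECONNREFUSED # some exceptions are not StandardError
    puts $!
    retries -= 1

    if retries > 0
      sleep sleeep and retry
    else
      # TODO: logging
      # TODO: after repeated failed fetches, record the URL in a log
    end
  end
end

You need to require 'timeout' first.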

#8 Hooopo 2010-04-07

fix hpricot_scan.so: undefined method `downcase' for nil:NilClass error:

http://github.com/hpricot/hpricot/issues#issue/8

1. Download hgwr's hpricot (git://github.com/hgwr/hpricot.git)
git clone git://github.com/hgwr/hpricot.git
2. Uninstall the previous hpricot
sudo gem uninstall hpricot
3. Build the gem from the gemspec and install it
cd hpricot_path
sudo gem build hpricot.gemspec
sudo gem install hpricot-0.8.gem

#7 Hooopo 2010-04-03

This gem can also be used to handle encodings:
http://www.iteye.com/topic/565606#1435314

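#6 Hooopo 2010-04-03

Encoding handling:

require 'hpricot'
require 'iconv' # part of the standard library up to Ruby 1.9; a separate gem since Ruby 2.0

def force_utf8(html)
  doc = Hpricot.parse(html)
  begin
    # Try each common spelling of the http-equiv value and keep the first match.
    meta = ["Content-Type", "Content-type", "content-type"].map { |c| doc.at("meta[@http-equiv='#{c}']") }.compact.first
    charset = meta["content"][/charset=([\w-]+)/i, 1]
  rescue
    puts $!
  end
  charset ||= "gbk" # fall back when no charset is declared

  if charset.downcase != "utf-8"
    html = Iconv.conv("UTF-8//IGNORE", "#{charset}//IGNORE", html).sub(/charset=[\w-]+/im, "charset=utf-8")
  end

  html
end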
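
On Rubies where Iconv is no longer bundled, roughly the same conversion can be done with `String#encode`. The following is a sketch, not a drop-in replacement: this `force_utf8` takes the charset as an argument instead of parsing it from the page, and the `fallback:` keyword is an assumption introduced for illustration.

```ruby
# Sketch: convert raw HTML bytes to UTF-8 and update the declared charset.
# `charset` is whatever the page declared; `fallback` is used when Ruby
# does not recognise that name.
def force_utf8(html, charset, fallback: "GBK")
  encoding = begin
    Encoding.find(charset.to_s)
  rescue ArgumentError
    Encoding.find(fallback)
  end

  utf8 = html.dup.force_encoding(encoding)
             .encode("UTF-8", invalid: :replace, undef: :replace, replace: "?")
             .scrub("?") # also covers input that was already labeled UTF-8

  # Keep the page's own declaration consistent with its new encoding.
  utf8.sub(/charset=[\w-]+/i, "charset=utf-8")
end

page = "\u4E2D\u6587 charset=gbk".encode("GBK").b
force_utf8(page, "gbk")
```

Unlike `//IGNORE`, which drops undecodable bytes silently, this version marks them with "?", so damaged spots stay visible in the output.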

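#5 Hooopo 2010-04-03

I wrote an open with exception handling...

require 'open-uri'
require 'timeout'

def safe_open(url, retries = 5, sleeep = 0.42, headers = {})
  begin
    # On Ruby older than 2.5, call open(url, headers) instead of URI.open.
    URI.open(url, headers).read
  rescue StandardError, Timeout::Error, SystemCallError, Errno::ECONNREFUSED # some exceptions are not StandardError
    puts $!
    retries -= 1

    if retries > 0
      sleep sleeep and retry
    else
      # TODO: logging
      # TODO: after repeated failed fetches, record the URL in a log
    end
  end
end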
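
The retry logic can be separated from the fetching so that it is reusable and testable without a network. This is a minimal sketch: `with_retries` and its keyword arguments are hypothetical names, and returning nil after the last failure mirrors what safe_open appears to do.

```ruby
# Hypothetical helper: run the block, retrying on the listed exception
# classes up to `attempts` times with `delay` seconds between tries.
# Returns the block's value, or nil once every attempt has failed.
def with_retries(attempts: 5, delay: 0.42, on: [StandardError])
  tries = 0
  begin
    tries += 1
    yield tries
  rescue *on => e
    if tries < attempts
      sleep delay
      retry
    end
    warn "giving up after #{tries} attempts: #{e.class}: #{e.message}"
    nil
  end
end

# Example with a block that fails twice before succeeding.
calls = 0
result = with_retries(attempts: 3, delay: 0) do
  calls += 1
  raise IOError, "flaky" if calls < 3
  "ok"
end
```

A fetch would then look like `with_retries { URI.open(url).read }`, with open-uri required first.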

#4 Hooopo 2010-03-17

Hmm, for now hp (Hpricot) is enough for me. I'll switch to Nokogiri if I run into problems.
