candyania

Hpricot - Fetching Web Page Data


The rdoc and examples on how to use Hpricot, reposted from http://rdoc.info/projects/hpricot/hpricot. I will translate it into Chinese when I have time.

Hpricot, Read Any HTML

Hpricot is a fast, flexible HTML parser written in C. It's designed to be very accommodating (like Tanaka Akira's HTree) and to have a very helpful library (like the one some JavaScript libs, such as jQuery and Prototype, give you). The XPath and CSS parser, in fact, is based on John Resig's jQuery.

Also, Hpricot can be handy for reading broken XML files, since many of the same techniques can be used. If a quote is missing, Hpricot tries to figure it out. If tags overlap, Hpricot works on sorting them out. You know, that sort of thing.

Please read this entire document before making assumptions about how this software works.

An Overview

Let's clear up what Hpricot is.

  • Hpricot is a standalone library. It requires no other libraries. Just Ruby!
  • While priding itself on speed, Hpricot works hard to sort out bad HTML and pays a small penalty in order to get that right. So that's slightly more important to me than speed.
  • If you can see it in Firefox, then Hpricot should parse it. That's how it should be! Let me know the minute it's otherwise.
  • Primarily, Hpricot is used for reading HTML and tries to sort out troubled HTML by having some idea of what good HTML is. Some people still like to use Hpricot for XML reading, but remember to use the Hpricot::XML() method for that!

The Hpricot Kingdom

First, here are all the links you need to know:

  • http://wiki.github.com/hpricot/hpricot is the Hpricot wiki and http://github.com/hpricot/hpricot/issues is the bug tracker. Go there for news and recipes and patches. It's the center of activity.
  • http://github.com/hpricot/hpricot is the main Git repository for Hpricot. You can get the latest code there.
  • See COPYING for the terms of this software. (Spoiler: it's absolutely free.)

If you have any trouble, don't hesitate to contact the author. As always, I'm not going to say "Use at your own risk" because I don't want this library to be risky. If you trip on something, I'll share the liability by repairing things as quickly as I can. Your responsibility is to report the inadequacies.

Installing Hpricot

You may get the latest stable version from Rubyforge. Win32 binaries, Java binaries (for JRuby), and source gems are available.

$ gem install hpricot

An Hpricot Showcase

We're going to run through a big pile of examples to get you jump-started. Many of these examples are also found at http://wiki.github.com/hpricot/hpricot/hpricot-basics, in case you want to add some of your own.

Loading Hpricot Itself

You have probably got the gem, right? To load Hpricot:

require 'rubygems'
require 'hpricot'

If you've installed the plain source distribution, go ahead and just:

require 'hpricot'

Load an HTML Page

The Hpricot() method takes a string or any IO object and loads the contents into a document object.

doc = Hpricot("<p>A simple <b>test</b> string.</p>")

To load from a file, just get the stream open:

doc = open("index.html") { |f| Hpricot(f) }

To load from a web URL, use open-uri, which comes with Ruby:

require 'open-uri'
# Ruby 3.0 and later no longer accept URLs in a bare open(); URI.open exists since Ruby 2.5.
doc = URI.open("http://qwantz.com/") { |f| Hpricot(f) }

Hpricot uses an internal buffer to parse the file, so the IO will stream properly and large documents won't be loaded into memory all at once. However, the parsed document object will be present in memory, in its entirety.

Search for Elements

Use Doc.search:

doc.search("//p[@class='posted']")
#=> #<Hpricot::Elements[{p ...}, {p ...}]>

Doc.search can take an XPath or CSS expression. In the above example, all paragraph (<p>) elements which have a class attribute of "posted" are grabbed.

A shortcut is to use the divisor:

(doc/"p.posted")
#=> #<Hpricot::Elements[{p ...}, {p ...}]>

Finding Just One Element

If you're looking for a single element, the at method will return the first element matched by the expression. In this case, you'll get back the element itself rather than the Hpricot::Elements array.

doc.at("body")['onload']

The above code will find the body tag and give you back the onload attribute. This is the most common reason to use the element directly: when reading and writing HTML attributes.

Fetching the Contents of an Element

Just as with browser scripting, the inner_html property can be used to get the inner contents of an element.

(doc/"#elementID").inner_html
#=> "..contents.."

If your expression matches more than one element, you'll get back the contents of all the matched elements. So you may want to use first to be sure you get back only one.

(doc/"#elementID").first.inner_html
#=> "..contents.."

Fetching the HTML for an Element

If you want the HTML for the whole element (not just the contents), use to_html:

(doc/"#elementID").to_html
#=> "<div id='elementID'>...</div>"

Looping

All searches return a set of Hpricot::Elements. Go ahead and loop through them like you would an array.

(doc/"p/a/img").each do |img|
  puts img.attributes['class']
end

Continuing Searches

Searches can be continued from a collection of elements, in order to search deeper.

# find all paragraphs.
elements = doc.search("/html/body//p")
# continue the search by finding any images within those paragraphs.
(elements/"img")
#=> #<Hpricot::Elements[{img ...}, {img ...}]>

Searches can also be continued by searching within container elements.

require 'pp' # only needed on Ruby older than 2.5, where pp is not loaded automatically
# find all images within paragraphs.
doc.search("/html/body//p").each do |para|
  puts "== Found a paragraph =="
  pp para

  imgs = para.search("img")
  if imgs.any?
    puts "== Found #{imgs.length} images inside =="
  end
end

Of course, the most succinct ways to do the above are using CSS or XPath.

# the xpath version
(doc/"/html/body//p//img")
# the css version
(doc/"html > body p img")
# ..or symbols work, too!
(doc/:html/:body/:p/:img)

Looping Edits

You may certainly edit objects from within your search loops. Then, when you spit out the HTML, the altered elements will show.

(doc/"span.entryPermalink").each do |span|
  span.attributes['class'] = 'newLinks'
end
puts doc

This changes all span.entryPermalink elements to span.newLinks. Keep in mind that there are often more convenient ways of doing this, such as the set method:

(doc/"span.entryPermalink").set(:class => 'newLinks')

Figuring Out Paths

Every element can tell you its unique path (either XPath or CSS) to get to the element from the root tag.

The css_path method:

doc.at("div > div:nth(1)").css_path
  #=> "div > div:nth(1)" 
doc.at("#header").css_path
  #=> "#header" 

Or, the xpath method:

doc.at("div > div:nth(1)").xpath
  #=> "/div/div:eq(1)" 
doc.at("#header").xpath
  #=> "//div[@id='header']"

Hpricot Fixups

When loading HTML documents, you have a few settings that can make Hpricot more or less intense about how it gets involved.

:fixup_tags

Really, there are so many ways to clean up HTML and your intentions may be to keep the HTML as-is. So Hpricot's default behavior is to keep things flexible: it makes sure all the tags are opened and closed, but ignores any validation problems.

As of Hpricot 0.4, there's a new :fixup_tags option which will attempt to shift the document's tags to meet XHTML 1.0 Strict.

doc = open("index.html") { |f| Hpricot f, :fixup_tags => true }

This doesn't quite meet the XHTML 1.0 Strict standard; it just tries to follow the rules a bit better. For example, if Hpricot finds a paragraph in a link, it's going to move the paragraph below the link, or up and out of other elements where paragraphs don't belong.

If an unknown element is found, it is ignored. Again, :fixup_tags.

:xhtml_strict

So, let's go beyond just trying to fix the hierarchy. The :xhtml_strict option really tries to force the document to be an XHTML 1.0 Strict document. Even at the cost of removing elements that get in the way.

doc = open("index.html") { |f| Hpricot f, :xhtml_strict => true }

What measures does :xhtml_strict take?

  1. Shift elements into their proper containers just like :fixup_tags.
  2. Remove unknown elements.
  3. Remove unknown attributes.
  4. Remove illegal content.
  5. Alter the doctype to XHTML 1.0 Strict.

Hpricot.XML()

The last option is the :xml option, which makes some slight variations on the standard mode. The main difference is that :xml mode won't try to output tags which are friendlier for browsers. For example, if an opening and closing br tag is found, XML mode won't try to turn that into an empty element.

XML mode also doesn't downcase the tags and attributes for you. So pay attention to case, friends.

The primary way to use Hpricot's XML mode is to call the Hpricot.XML method:

doc = open("http://redhanded.hobix.com/index.xml") do |f|
  Hpricot.XML(f)
end
