Reposted from http://rdoc.info/projects/hpricot/hpricot: the rdoc and examples on how to use Hpricot. I'll translate it into Chinese when I have time.

Hpricot, Read Any HTML

Hpricot is a fast, flexible HTML parser written in C. It's designed to be very accommodating (like Tanaka Akira's HTree) and to have a very helpful library (like the ones some JavaScript libs, such as jQuery and Prototype, give you). The XPath and CSS parser, in fact, is based on John Resig's jQuery.

Also, Hpricot can be handy for reading broken XML files, since many of the same techniques can be used. If a quote is missing, Hpricot tries to figure it out. If tags overlap, Hpricot works on sorting them out. You know, that sort of thing.

Please read this entire document before making assumptions about how this software works.

An Overview

Let's clear up what Hpricot is.

The Hpricot Kingdom

First, here are all the links you need to know:

If you have any trouble, don't hesitate to contact the author. As always, I'm not going to say "Use at your own risk" because I don't want this library to be risky. If you trip on something, I'll share the liability by repairing things as quickly as I can. Your responsibility is to report the inadequacies.

Installing Hpricot

You may get the latest stable version from RubyForge. Win32 binaries, Java binaries (for JRuby), and source gems are available.

    $ gem install hpricot

An Hpricot Showcase

We're going to run through a big pile of examples to get you jump-started. Many of these examples are also found at http://wiki.github.com/hpricot/hpricot/hpricot-basics, in case you want to add some of your own.

Loading Hpricot Itself

You have probably got the gem, right? To load Hpricot:

    require 'rubygems'
    require 'hpricot'

If you've installed the plain source distribution, go ahead and just:

    require 'hpricot'

Load an HTML Page

The Hpricot() method takes a string or any IO object and loads the contents into a document object.

    doc = Hpricot("<p>A simple <b>test</b> string.</p>")

To load from a file, just get the stream open:

    doc = open("index.html") { |f| Hpricot(f) }

To load from a web URL, use open-uri, which comes with Ruby:

    require 'open-uri'
    doc = open("http://qwantz.com/") { |f| Hpricot(f) }

Hpricot uses an internal buffer to parse the file, so the IO will stream properly and large documents won't be loaded into memory all at once. However, the parsed document object will be present in memory, in its entirety.

Search for Elements

Use Doc.search:

    doc.search("//p[@class='posted']")
    #=> #<Hpricot::Elements[{p ...}, {p ...}]>

Doc.search can take an XPath or CSS expression. In the above example, all paragraph elements are grabbed which have a class attribute of "posted".

A shortcut is to use the divisor:

    (doc/"p.posted")
    #=> #<Hpricot::Elements[{p ...}, {p ...}]>

Finding Just One Element

If you're looking for a single element, the at method will return the first element matched by the expression. In this case, you'll get back the element itself rather than the Hpricot::Elements array.

    doc.at("body")['onload']

The above code will find the body tag and give you back the onload attribute. This is the most common reason to use the element directly: when reading and writing HTML attributes.

Fetching the Contents of an Element

Just as with browser scripting, the inner_html property can be used to get the inner contents of an element.

    (doc/"#elementID").inner_html
    #=> "..contents.."

If your expression matches more than one element, you'll get back the contents of all the matched elements. So you may want to use first to be sure you get back only one.

    (doc/"#elementID").first.inner_html
    #=> "..contents.."

Fetching the HTML for an Element

If you want the HTML for the whole element (not just the contents), use to_html:

    (doc/"#elementID").to_html
    #=> "<div id='elementID'>...</div>"

Looping

All searches return a set of Hpricot::Elements. Go ahead and loop through them like you would an array.

    (doc/"p/a/img").each do |img|
      puts img.attributes['class']
    end

Continuing Searches

Searches can be continued from a collection of elements, in order to search deeper.

    # find all paragraphs.
    elements = doc.search("/html/body//p")
    # continue the search by finding any images within those paragraphs.
    (elements/"img")
    #=> #<Hpricot::Elements[{img ...}, {img ...}]>

Searches can also be continued by searching within container elements.

    require 'pp'

    # find all images within paragraphs.
    doc.search("/html/body//p").each do |para|
      puts "== Found a paragraph =="
      pp para

      imgs = para.search("img")
      if imgs.any?
        puts "== Found #{imgs.length} images inside =="
      end
    end

Of course, the most succinct ways to do the above are using CSS or XPath.

    # the xpath version
    (doc/"/html/body//p//img")
    # the css version
    (doc/"html > body > p img")
    # ..or symbols work, too!
    (doc/:html/:body/:p/:img)

Looping Edits

You may certainly edit objects from within your search loops. Then, when you spit out the HTML, the altered elements will show.

    (doc/"span.entryPermalink").each do |span|
      span.attributes['class'] = 'newLinks'
    end
    puts doc

This changes all span.entryPermalink elements to span.newLinks. Keep in mind that there are often more convenient ways of doing this, such as the set method:

    (doc/"span.entryPermalink").set(:class => 'newLinks')

Figuring Out Paths

Every element can tell you its unique path (either XPath or CSS) to get to the element from the root tag.

The css_path method:

    doc.at("div > div:nth(1)").css_path
    #=> "div > div:nth(1)"
    doc.at("#header").css_path
    #=> "#header"

Or, the xpath method:

    doc.at("div > div:nth(1)").xpath
    #=> "/div/div:eq(1)"
    doc.at("#header").xpath
    #=> "//div[@id='header']"

Hpricot Fixups

When loading HTML documents, you have a few settings that can make Hpricot more or less intense about how it gets involved.

:fixup_tags

Really, there are so many ways to clean up HTML and your intentions may be to keep the HTML as-is. So Hpricot's default behavior is to keep things flexible: it makes sure to open and close all the tags, but ignores any validation problems.

As of Hpricot 0.4, there's a new :fixup_tags option which will attempt to shift the document's tags to meet XHTML 1.0 Strict.

    doc = open("index.html") { |f| Hpricot f, :fixup_tags => true }

This doesn't quite meet the XHTML 1.0 Strict standard; it just tries to follow the rules a bit better. For example, if Hpricot finds a paragraph in a link, it's going to move the paragraph below the link, or up and out of other elements where paragraphs don't belong.

If an unknown element is found, it is ignored. Again, :fixup_tags.

:xhtml_strict

So, let's go beyond just trying to fix the hierarchy. The :xhtml_strict option really tries to force the document to be an XHTML 1.0 Strict document, even at the cost of removing elements that get in the way.

    doc = open("index.html") { |f| Hpricot f, :xhtml_strict => true }

What measures does :xhtml_strict take?

Hpricot.XML()

The last option is the :xml option, which makes some slight variations on the standard mode. The main difference is that :xml mode won't try to output tags which are friendlier for browsers. For example, if an opening and closing br tag is found, XML mode won't try to turn that into an empty element.

XML mode also doesn't downcase the tags and attributes for you. So pay attention to case, friends.

The primary way to use Hpricot's XML mode is to call the Hpricot.XML method:

    doc = open("http://redhanded.hobix.com/index.xml") do |f|
      Hpricot.XML(f)
    end
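
The showcase leans heavily on the "divisor" shortcut, including chains such as (doc/:html/:body/:p/:img). As a rough illustration of why such chains read the way they do, here is a minimal pure-Ruby sketch of a `/` operator that delegates to search on both a single node and a set of nodes. It is not Hpricot's implementation: ToyNode and ToyNodeSet are hypothetical names invented for this example, and the matching is by tag name only (no CSS or XPath).

```ruby
# Toy model of the "/" search shortcut. ToyNode and ToyNodeSet are
# hypothetical classes for illustration only; they are not part of Hpricot.
class ToyNodeSet < Array
  # Search inside every node of the set and merge the matches.
  def search(name)
    ToyNodeSet.new(flat_map { |node| node.search(name) })
  end

  # The "divisor" is just another spelling of search.
  def /(name)
    search(name)
  end
end

class ToyNode
  attr_reader :name, :children

  def initialize(name, children = [])
    @name = name.to_s
    @children = children
  end

  # Collect every descendant whose tag name matches, in document order.
  # Accepts a String or a Symbol, which is why symbols can be chained.
  def search(name)
    matches = ToyNodeSet.new
    children.each do |child|
      matches << child if child.name == name.to_s
      matches.concat(child.search(name))
    end
    matches
  end

  def /(name)
    search(name)
  end
end

# Build a tiny document: one image inside a paragraph, one inside a div.
img1 = ToyNode.new(:img)
img2 = ToyNode.new(:img)
doc = ToyNode.new(:document, [
  ToyNode.new(:html, [
    ToyNode.new(:body, [
      ToyNode.new(:p, [img1]),
      ToyNode.new(:div, [img2])
    ])
  ])
])

# Each "/" returns a set that itself responds to "/", so the chain keeps going.
found = doc/:html/:body/:p/:img
puts found.length
```

Because every step returns a collection that answers to the same operator, each `/` narrows the previous result. In this toy version only the image inside the paragraph survives the chain.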