- 浏览: 1661161 次
- 性别:
- 来自: 北京
-
文章分类
- 全部博客 (405)
- C/C++ (16)
- Linux (60)
- Algorithm (41)
- ACM (8)
- Ruby (39)
- Ruby on Rails (6)
- FP (2)
- Java SE (39)
- Java EE (6)
- Spring (11)
- Hibernate (1)
- Struts (1)
- Ajax (5)
- php (2)
- Data/Web Mining (20)
- Search Engine (19)
- NLP (2)
- Machine Learning (23)
- R (0)
- Database (10)
- Data Structure (6)
- Design Pattern (16)
- Hadoop (2)
- Browser (0)
- Firefox plugin/XPCOM (8)
- Eclise development (5)
- Architecture (1)
- Server (1)
- Cache (6)
- Code Generation (3)
- Open Source Tool (5)
- Develope Tools (5)
- 读书笔记 (7)
- 备忘 (4)
- 情感 (4)
- Others (20)
- python (0)
最新评论
-
532870393:
请问下,这本书是基于Hadoop1还是Hadoop2?
Hadoop in Action简单笔记(一) -
dongbiying:
不懂呀。。
十大常用数据结构 -
bing_it:
...
使用Spring MVC HandlerExceptionResolver处理异常 -
一别梦心:
按照上面的执行,文件确实是更新了,但是还是找不到kernel, ...
virtualbox 4.08安装虚机Ubuntu11.04增强功能失败解决方法 -
dsjt:
楼主spring 什么版本,我的3.1 ,xml中配置 < ...
使用Spring MVC HandlerExceptionResolver处理异常
Jsoup是一个Java的HTML解析器,提供了非常方便的抽取和操作HTML文档方法,可以结合DOM,CSS和Jquery类似的方法来定位和得到节点的信息。
有着和Jquery一样强大的select和pipeline的API。
我们以从58同城网抽取租房信息为例,来说明如何使用它:
Selector overview
* tagname: find elements by tag, e.g. a
* ns|tag: find elements by tag in a namespace, e.g. fb|name finds <fb:name> elements
* #id: find elements by ID, e.g. #logo
* .class: find elements by class name, e.g. .masthead
* [attribute]: elements with attribute, e.g. [href]
* [^attr]: elements with an attribute name prefix, e.g. [^data-] finds elements with HTML5 dataset attributes
* [attr=value]: elements with attribute value, e.g. [width=500]
* [attr^=value], [attr$=value], [attr*=value]: elements with attributes that start with, end with, or contain the value, e.g. [href*=/path/]
* [attr~=regex]: elements with attribute values that match the regular expression; e.g. img[src~=(?i)\.(png|jpe?g)]
* *: all elements, e.g. *
Selector combinations
* el#id: elements with ID, e.g. div#logo
* el.class: elements with class, e.g. div.masthead
* el[attr]: elements with attribute, e.g. a[href]
* Any combination, e.g. a[href].highlight
* ancestor child: child elements that descend from ancestor, e.g. .body p finds p elements anywhere under a block with class "body"
* parent > child: child elements that descend directly from parent, e.g. div.content > p finds p elements; and body > * finds the direct children of the body tag
* siblingA + siblingB: finds sibling B element immediately preceded by sibling A, e.g. div.head + div
* siblingA ~ siblingX: finds sibling X element preceded by sibling A, e.g. h1 ~ p
* el, el, el: group multiple selectors, find unique elements that match any of the selectors; e.g. div.masthead, div.logo
Pseudo selectors
* :lt(n): find elements whose sibling index (i.e. its position in the DOM tree relative to its parent) is less than n; e.g. td:lt(3)
* :gt(n): find elements whose sibling index is greater than n; e.g. div p:gt(2)
* :eq(n): find elements whose sibling index is equal to n; e.g. form input:eq(1)
* :has(seletor): find elements that contain elements matching the selector; e.g. div:has(p)
* :not(selector): find elements that do not match the selector; e.g. div:not(.logo)
* :contains(text): find elements that contain the given text. The search is case-insensitive; e.g. p:contains(jsoup)
* :containsOwn(text): find elements that directly contain the given text
* :matches(regex): find elements whose text matches the specified regular expression; e.g. div:matches((?i)login)
* :matchesOwn(regex): find elements whose own text matches the specified regular expression
* Note that the above indexed pseudo-selectors are 0-based, that is, the first element is at index 0, the second at 1, etc
更多的信息可以参考[http://jsoup.org/|http://jsoup.org/]
有着和Jquery一样强大的select和pipeline的API。
我们以从58同城网抽取租房信息为例,来说明如何使用它:
package test import org.jsoup.nodes.Document import java.util.HashMap import org.jsoup.Jsoup /** * Author: fuliang * http://fuliang.iteye.com */ class HouseEntry(var title: String,var link: String,var price: Integer, var houseType: String, var date: String){ override def toString(): String = { return String.format("title: %s\tlink:%s\tprice:%d\thouseType:%s\tdate:%s\n", title,link,price,houseType,date); } } class HouseRentCrawler{ def crawl(url: String,keyword: String,lowRange: Int,highRange: Int): List[HouseEntry] = { var doc = fetch(url,keyword,lowRange,highRange); return extract(doc); } private def fetch(url:String,keyword: String,lowRange: Int,highRange: Int): Document = { var params = new HashMap[String,String](); params.put("final","1"); params.put("jump","2"); params.put("searchtype","3"); params.put("key",keyword); params.put("MinPrice",lowRange + "_" + highRange); return Jsoup.connect(url).data(params) .userAgent("Mozilla") .timeout(10000) .get(); } private def extract(doc: Document): List[HouseEntry] = { val elements = doc.select("#infolist > tr:not(.dev)"); var houseEntries = List[HouseEntry](); for(val i <- 0 until elements.size()){ val entry = elements.get(i); val fields = entry.select("td"); val title = fields.get(0).text(); val link = fields.get(0).select("a[class=t]").attr("href"); val price = fields.get(1).text().toInt; val houseType = fields.get(2).text(); val date = fields.get(3).text(); val houseEntry = new HouseEntry(title,link,price,houseType,date); houseEntries ::= houseEntry; } return houseEntries; } } object HouseRentCrawler{ def main(args: Array[String]) { val url = "http://bj.58.com/zufang"; val crawler = new HouseRentCrawler(); val houseEntries = crawler.crawl(url,"智学苑",2000,3500); for(val entry <- houseEntries){ println(entry); } } }
Selector overview
* tagname: find elements by tag, e.g. a
* ns|tag: find elements by tag in a namespace, e.g. fb|name finds <fb:name> elements
* #id: find elements by ID, e.g. #logo
* .class: find elements by class name, e.g. .masthead
* [attribute]: elements with attribute, e.g. [href]
* [^attr]: elements with an attribute name prefix, e.g. [^data-] finds elements with HTML5 dataset attributes
* [attr=value]: elements with attribute value, e.g. [width=500]
* [attr^=value], [attr$=value], [attr*=value]: elements with attributes that start with, end with, or contain the value, e.g. [href*=/path/]
* [attr~=regex]: elements with attribute values that match the regular expression; e.g. img[src~=(?i)\.(png|jpe?g)]
* *: all elements, e.g. *
Selector combinations
* el#id: elements with ID, e.g. div#logo
* el.class: elements with class, e.g. div.masthead
* el[attr]: elements with attribute, e.g. a[href]
* Any combination, e.g. a[href].highlight
* ancestor child: child elements that descend from ancestor, e.g. .body p finds p elements anywhere under a block with class "body"
* parent > child: child elements that descend directly from parent, e.g. div.content > p finds p elements; and body > * finds the direct children of the body tag
* siblingA + siblingB: finds sibling B element immediately preceded by sibling A, e.g. div.head + div
* siblingA ~ siblingX: finds sibling X element preceded by sibling A, e.g. h1 ~ p
* el, el, el: group multiple selectors, find unique elements that match any of the selectors; e.g. div.masthead, div.logo
Pseudo selectors
* :lt(n): find elements whose sibling index (i.e. its position in the DOM tree relative to its parent) is less than n; e.g. td:lt(3)
* :gt(n): find elements whose sibling index is greater than n; e.g. div p:gt(2)
* :eq(n): find elements whose sibling index is equal to n; e.g. form input:eq(1)
* :has(seletor): find elements that contain elements matching the selector; e.g. div:has(p)
* :not(selector): find elements that do not match the selector; e.g. div:not(.logo)
* :contains(text): find elements that contain the given text. The search is case-insensitive; e.g. p:contains(jsoup)
* :containsOwn(text): find elements that directly contain the given text
* :matches(regex): find elements whose text matches the specified regular expression; e.g. div:matches((?i)login)
* :matchesOwn(regex): find elements whose own text matches the specified regular expression
* Note that the above indexed pseudo-selectors are 0-based, that is, the first element is at index 0, the second at 1, etc
更多的信息可以参考[http://jsoup.org/|http://jsoup.org/]
发表评论
-
Lucene 索引格式
2013-06-25 20:11 0索引结构: 索引层次 ... -
计算广告学
2012-08-12 13:53 0计算广告学一: 1、核 ... -
《Lucene in Action》简单笔记
2011-12-22 09:19 0第一章 Meet Lucene -
Information Retrieval Resources
2011-04-07 16:40 1411Information Retrieval Resource ... -
常见文件类型识别
2010-09-22 20:09 11924根据文件的后缀名识别文件类型并不准确,可以使用文件的头信息进行 ... -
(zz)信息检索领域资料整理
2010-06-05 13:05 3190A Guide to Information Retrieva ... -
Introduce to Inforamtion Retrieval读书笔记(2)
2009-10-31 13:02 1968The term vocabulary and posting ... -
Introduce to Inforamtion Retrieval读书笔记(1)
2009-10-25 23:49 2070很好的一本书,介绍的非常全面,看了很久了,还没有看完,刚看完前 ... -
Query Log Mining notes
2009-10-02 18:08 1283Enhancing Efficiency of Search ... -
百度搜索的一些高级语法
2009-08-27 20:06 19591.title语法 就是在title ... -
Hadoop好书推荐:Hadoop The Definitive Guide
2009-08-16 22:49 3657第一本详细介绍Hadoop的书籍,从网上下来看了几章,作者是H ... -
Java开源搜索引擎[收藏]
2008-04-24 00:09 2917Egothor Egothor是一个用Java编写的开 ... -
分享一本斯坦福的信息检索的教材
2008-01-04 23:59 2483斯坦福的信息检索的教材,还没出版,先分享一下电子版原稿. 对于 ... -
分享一本搜索引擎的电子书
2007-12-29 19:42 2556还没有来得及看,但搜索引擎的书不是很好找,先放上,希望对大家能 ... -
分享一个Nutch入门学习的资料
2007-12-18 20:49 4284分享一个Nutch入门学习的资料,感觉写的还不错. -
搜索引擎Nutch源代码研究之一 网页抓取(4)
2007-12-17 22:37 8413今天来看看Nutch如何Parse网页的: Nutch使用了两 ... -
[转]MAP/REDUCE:Google和Nutch实现异同及其他
2007-12-15 19:21 3014设计要素 nutch包含以下几个部分: 辅助类 Log:记载运 ... -
Nutch源代码学习小小总结一下
2007-12-15 19:13 4490我现在看得源码主要是网页抓取部分,这部分相对比较容易。我首先定 ... -
搜索引擎Nutch源代码研究之一 网页抓取(3)
2007-12-15 16:39 4608今天我们看看Nutch网页抓取,所用的几种数据结构: 主要涉及 ... -
搜索引擎Nutch源代码研究之一 网页抓取(2)
2007-12-15 00:36 5587今天我们来看看Nutch的源代码中的protocol-h ...
相关推荐
python学习资源
jfinal-undertow 用于开发、部署由 jfinal 开发的 web 项目
基于Andorid的音乐播放器项目设计(国外开源)实现源码,主要针对计算机相关专业的正在做毕设的学生和需要项目实战练习的学习者,也可作为课程设计、期末大作业。
python学习资源
python学习资源
python学习一些项目和资源
【毕业设计】java-springboot+vue家具销售平台实现源码(完整前后端+mysql+说明文档+LunW).zip
HTML+CSS+JavaScarip开发的前端网页源代码
python学习资源
【毕业设计】java-springboot-vue健身房信息管理系统源码(完整前后端+mysql+说明文档+LunW).zip
成绩管理系统C/Go。大学生期末小作业,指针实现,C语言版本(ANSI C)和Go语言版本
1_基于大数据的智能菜品个性化推荐与点餐系统的设计与实现.docx
【毕业设计】java-springboot-vue交流互动平台实现源码(完整前后端+mysql+说明文档+LunW).zip
内容概要:本文主要探讨了在高并发情况下如何设计并优化火车票秒杀系统,确保系统的高性能与稳定性。通过对比分析三种库存管理模式(下单减库存、支付减库存、预扣库存),强调了预扣库存结合本地缓存及远程Redis统一库存的优势,同时介绍了如何利用Nginx的加权轮询策略、MQ消息队列异步处理等方式降低系统压力,保障交易完整性和数据一致性,防止超卖现象。 适用人群:具有一定互联网应用开发经验的研发人员和技术管理人员。 使用场景及目标:适用于电商、票务等行业需要处理大量瞬时并发请求的业务场景。其目标在于通过合理的架构规划,实现在高峰期保持平台的稳定运行,保证用户体验的同时最大化销售额。 其他说明:文中提及的技术细节如Epoll I/O多路复用模型以及分布式系统中的容错措施等内容,对于深入理解大规模并发系统的构建有着重要指导意义。
基于 OpenCV 和 PyTorch 的深度车牌识别
【毕业设计-java】springboot-vue教学资料管理系统实现源码(完整前后端+mysql+说明文档+LunW).zip
此数据集包含有关出租车行程的详细信息,包括乘客人数、行程距离、付款类型、车费金额和行程时长。它可用于各种数据分析和机器学习应用程序,例如票价预测和乘车模式分析。
把代码放到Word中,通过开发工具——Visual Basic——插入模块,粘贴在里在,把在硅基流动中申请的API放到VBA代码中。在Word中,选择一个问题,运行这个DeepSeekV3的宏就可以实现在线问答
【毕业设计】java-springboot+vue机动车号牌管理系统实现源码(完整前后端+mysql+说明文档+LunW).zip
【毕业设计】java-springboot-vue交通管理在线服务系统的开发源码(完整前后端+mysql+说明文档+LunW).zip