
Using jsoup

ordinary (Beijing) · Posted: 2011-04-20 · Last edited: 2011-04-20

jsoup is a Java HTML parser that can parse HTML directly from a URL or from HTML text. It offers a very convenient API for extracting and manipulating data through DOM traversal, CSS selectors, and jQuery-like methods. See http://jsoup.org/

Main features of jsoup:

- Parse HTML from a URL, a file, or a string
- Find and extract data using DOM traversal or CSS selectors
- Manipulate HTML elements, attributes, and text

jsoup is released under the MIT license, so it can be used safely in commercial projects.

Download and installation

With Maven, add the following to pom.xml:

    <dependency>
      <!-- jsoup HTML parser library @ http://jsoup.org/ -->
      <groupId>org.jsoup</groupId>
      <artifactId>jsoup</artifactId>
      <version>1.5.2</version>
    </dependency>

Parsing HTML with jsoup

Parsing the HTML behind a URL:

    Document doc = Jsoup.connect("http://example.com")
            .data("query", "Java")
            .userAgent("Mozilla")
            .cookie("auth", "token")
            .timeout(3000)
            .post();

Parsing from a file:

    File input = new File("/tmp/input.html");
    Document doc = Jsoup.parse(input, "UTF-8", "http://example.com/");

Similar to the JavaScript DOM API, jsoup provides the following methods:

- `getElementById(String id)`: get an element by id
- `getElementsByTag(String tag)`: get elements by tag name
- `getElementsByClass(String className)`: get elements by class
- `getElementsByAttribute(String key)`: get elements that have the given attribute

It also provides methods for getting sibling elements: `siblingElements()`, `firstElementSibling()`, `lastElementSibling()`, `nextElementSibling()`, `previousElementSibling()`

Use the following methods to read or set an element's data:

- `attr(String key)`: get an attribute value
- `attr(String key, String value)`: set an attribute value
- `attributes()`: get all attributes
- `id()`, `className()`, `classNames()`: get the id and class values
- `text()`: get the text content
- `text(String value)`: set the text content
- `html()`: get the inner HTML
- `html(String value)`: set the inner HTML
- `outerHtml()`: get the outer HTML (the element itself plus its contents)
- `data()`: get the data content
- `tag()` and `tagName()`: get the tag and the tag name

The following methods are provided for manipulating HTML:

- `append(String html)`, `prepend(String html)`
- `appendText(String text)`, `prependText(String text)`
- `appendElement(String tagName)`, `prependElement(String tagName)`
- `html(String value)`

Working with HTML through jQuery-like selectors:

    File input = new File("/tmp/input.html");
    Document doc = Jsoup.parse(input, "UTF-8", "http://example.com/");

    Elements links = doc.select("a[href]");           // a with href
    Elements pngs = doc.select("img[src$=.png]");     // img with src ending .png
    Element masthead = doc.select("div.masthead").first(); // div with class=masthead
    Elements resultLinks = doc.select("h3.r > a");    // direct a after h3

The supported selectors include:

- `tagname`: find elements by tag
- `ns|tag`: find elements by tag within a namespace
- `#id`: find an element by id
- `.class`: find elements by class
- `[attribute]`: find elements that have the attribute
- `[^attr]`: find elements with an attribute name starting with attr
- `[attr=value]`: attribute value equals value
- `[attr^=value]`, `[attr$=value]`, `[attr*=value]`: attribute value starts with, ends with, or contains value
- `[attr~=regex]`: attribute value matches the regular expression
- `*`: all elements
- `el#id`: element combined with id
- `el.class`: element combined with class
- `el[attr]`: element combined with attribute
- `ancestor child`: child elements that descend from ancestor

and so on.

An example that fetches a site's title, description, and the images it contains:

    // Assumed context (not shown in the original post): this method lives in a class that has
    // a logger `log`, a String[] IMAGE_TYPE_ARRAY of image suffixes, and a collection `imgSrcs`.
    // Imports needed: java.io.IOException, org.jsoup.Jsoup, org.jsoup.nodes.Document,
    // org.jsoup.nodes.Element, org.jsoup.select.Elements, and StringUtils from Apache Commons Lang.
    public void parse(String urlStr) {
        Document doc;
        try {
            doc = Jsoup.connect(urlStr)
                    .userAgent("Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN; rv:1.9.2.15)") // set the User-Agent
                    .timeout(5000) // set the connection timeout (milliseconds)
                    .get();
        } catch (IOException e) {
            // Also covers MalformedURLException, SocketTimeoutException and UnknownHostException:
            // all are IOException subclasses and the original handled them identically.
            log.error(e);
            return;
        }

        System.out.println(doc.title());

        Element head = doc.head();
        Elements metas = head.select("meta");
        for (Element meta : metas) {
            String content = meta.attr("content");
            // Skip pages whose declared content type is not HTML.
            if ("content-type".equalsIgnoreCase(meta.attr("http-equiv"))
                    && !StringUtils.startsWith(content, "text/html")) {
                log.debug(urlStr);
                return;
            }
            if ("description".equalsIgnoreCase(meta.attr("name"))) {
                System.out.println(content);
            }
        }

        Element body = doc.body();
        for (Element img : body.getElementsByTag("img")) {
            String imageUrl = img.attr("abs:src"); // get the absolute URL
            int query = imageUrl.indexOf("?");
            if (query > 0) {
                imageUrl = imageUrl.substring(0, query); // drop the query string
            }
            for (String suffix : IMAGE_TYPE_ARRAY) {
                if (StringUtils.endsWithIgnoreCase(imageUrl, suffix)) {
                    imgSrcs.add(imageUrl);
                    break;
                }
            }
        }
    }

The key point here is how to get the absolute URL of an image or link. As in the example, `String imageUrl = img.attr("abs:src");` does it: prefix the attribute name with `abs:` and jsoup returns the absolute URL. To see why, look at the source:

    // First checks whether the attribute map contains the key and, if so, returns its value
    // directly; otherwise checks whether the key starts with "abs:".
    public String attr(String attributeKey) {
        Validate.notNull(attributeKey);

        if (hasAttr(attributeKey))
            return attributes.get(attributeKey);
        else if (attributeKey.toLowerCase().startsWith("abs:"))
            return absUrl(attributeKey.substring("abs:".length()));
        else return "";
    }

Next, the `absUrl` method, which shows how the absolute URL is built:

    /**
     * Get an absolute URL from a URL attribute that may be relative (i.e. an <code><a href></code> or
     * <code><img src></code>).
     * <p/>
     * E.g.: <code>String absUrl = linkEl.absUrl("href");</code>
     * <p/>
     * If the attribute value is already absolute (i.e. it starts with a protocol, like
     * <code>http://</code> or <code>https://</code> etc), and it successfully parses as a URL, the attribute is
     * returned directly. Otherwise, it is treated as a URL relative to the element's {@link #baseUri}, and made
     * absolute using that.
     * <p/>
     * As an alternate, you can use the {@link #attr} method with the <code>abs:</code> prefix, e.g.:
     * <code>String absUrl = linkEl.attr("abs:href");</code>
     *
     * @param attributeKey The attribute key
     * @return An absolute URL if one could be made, or an empty string (not null) if the attribute was missing or
     * could not be made successfully into a URL.
     * @see #attr
     * @see java.net.URL#URL(java.net.URL, String)
     */
    public String absUrl(String attributeKey) {
        Validate.notEmpty(attributeKey);

        String relUrl = attr(attributeKey);
        if (!hasAttr(attributeKey)) {
            return ""; // nothing to make absolute with
        } else {
            URL base;
            try {
                try {
                    base = new URL(baseUri);
                } catch (MalformedURLException e) {
                    // the base is unsuitable, but the attribute may be abs on its own, so try that
                    URL abs = new URL(relUrl);
                    return abs.toExternalForm();
                }
                // workaround: java resolves '//path/file + ?foo' to '//path/?foo', not '//path/file?foo' as desired
                if (relUrl.startsWith("?"))
                    relUrl = base.getPath() + relUrl;
                URL abs = new URL(base, relUrl);
                return abs.toExternalForm();
            } catch (MalformedURLException e) {
                return "";
            }
        }
    }

Notice: Copyright of ITeye articles belongs to the author and is protected by law. Reproduction without the author's written permission is prohibited.
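The resolution step in `absUrl` can be tried in isolation. The following is a minimal sketch using only `java.net.URL`; the class `AbsUrlSketch`, the helper `resolve`, and the example URLs are hypothetical and not part of jsoup. It mirrors the quoted logic, including the fallback for an unusable base and the `?query` workaround.

```java
import java.net.MalformedURLException;
import java.net.URL;

public class AbsUrlSketch {
    // Hypothetical helper: resolves relUrl against baseUri the same way the quoted absUrl does.
    static String resolve(String baseUri, String relUrl) {
        try {
            URL base;
            try {
                base = new URL(baseUri);
            } catch (MalformedURLException e) {
                // The base is unusable, but the value may already be absolute on its own.
                return new URL(relUrl).toExternalForm();
            }
            // Same workaround as in the quoted source: keep the file part for "?query" links.
            if (relUrl.startsWith("?")) {
                relUrl = base.getPath() + relUrl;
            }
            return new URL(base, relUrl).toExternalForm();
        } catch (MalformedURLException e) {
            return ""; // could not be made into a URL
        }
    }

    public static void main(String[] args) {
        String base = "http://example.com/news/index.html";
        System.out.println(resolve(base, "images/logo.png"));  // http://example.com/news/images/logo.png
        System.out.println(resolve(base, "/images/logo.png")); // http://example.com/images/logo.png
        System.out.println(resolve(base, "?page=2"));          // http://example.com/news/index.html?page=2
        System.out.println(resolve("", "img.png").isEmpty());  // true: no base and not absolute
    }
}
```

This is why the base URI passed to `Jsoup.parse(input, "UTF-8", "http://example.com/")` matters: without a usable base, relative `src` and `href` values resolve to an empty string.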
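To tie together the parsing, selector, and manipulation calls from the post, here is a small sketch that parses an in-memory HTML string instead of a live URL. The class name `JsoupBasicsSketch`, the HTML snippet, and the base URI `http://example.com/` are made up for illustration, and the sketch assumes jsoup is on the classpath.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class JsoupBasicsSketch {
    // Hypothetical page used only for this illustration.
    static final String HTML =
            "<html><head><title>Demo</title></head><body>"
            + "<div id=\"main\" class=\"masthead\">"
            + "<a href=\"/about\">About</a>"
            + "<img src=\"img/logo.png\"><img src=\"img/photo.jpg\">"
            + "</div></body></html>";

    static Document parse() {
        // The second argument is the base URI used to resolve relative links.
        return Jsoup.parse(HTML, "http://example.com/");
    }

    public static void main(String[] args) {
        Document doc = parse();
        System.out.println(doc.title());

        // CSS selector plus attribute access, raw and absolute.
        Element link = doc.select("a[href]").first();
        System.out.println(link.attr("href"));
        System.out.println(link.attr("abs:href"));

        // Attribute suffix selector: only the .png image matches.
        Elements pngs = doc.select("img[src$=.png]");
        System.out.println(pngs.size());

        // DOM-style lookup and manipulation.
        Element main = doc.getElementById("main");
        main.appendElement("p").text("added by appendElement");
        System.out.println(main.select("p").text());
    }
}
```

Parsing from a string keeps the example independent of network access; swapping `parse()` for a `Jsoup.connect(...).get()` call should leave the rest unchanged.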

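1202 (Tianjin) · Posted: 2011-08-29

Hello, I have recently been learning jsoup and ran into a problem: I wanted to use div.div.m_l_hot_im to select the div whose class is m_l_hot_im, but the result also included the divs with class div.m_l_hot_iml and div.m_l_hot_imr, that is, everything starting with div.m_l_hot_im. Is this a bug?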

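ordinary (Beijing) · Posted: 2011-08-30

1202 wrote: Hello, I have recently been learning jsoup and ran into a problem: I wanted to use div.div.m_l_hot_im to select the div whose class is m_l_hot_im, but the result also included the divs with class div.m_l_hot_iml and div.m_l_hot_imr, that is, everything starting with div.m_l_hot_im. Is this a bug?

Could you post your code so we can take a look?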
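Without the original code it is hard to say what happened, but the expected selector behavior can be sketched. The example below assumes a current jsoup release on the classpath and has not been checked against 1.5.2; the class name `ClassSelectorSketch` and the HTML are hypothetical, with the class names borrowed from the question. A `.class` selector should match only the exact class token, whereas a prefix attribute selector such as `[class^=...]` matches every class value starting with that text, which would give the result described in the question.

```java
import org.jsoup.Jsoup;
import org.jsoup.select.Elements;

public class ClassSelectorSketch {
    // Hypothetical markup with three similarly named classes.
    static final String HTML =
            "<div class=\"m_l_hot_im\">a</div>"
            + "<div class=\"m_l_hot_iml\">b</div>"
            + "<div class=\"m_l_hot_imr\">c</div>";

    static Elements select(String query) {
        return Jsoup.parse(HTML).select(query);
    }

    public static void main(String[] args) {
        // Exact class token: expected to match only the first div.
        System.out.println(select("div.m_l_hot_im").size());
        // Prefix match on the class attribute: expected to match all three divs.
        System.out.println(select("div[class^=m_l_hot_im]").size());
        // Exact attribute value: expected to match only the first div.
        System.out.println(select("div[class=m_l_hot_im]").size());
    }
}
```

If the first query returns more than one element on a given version, comparing it with the `[class=...]` form is a quick way to narrow down whether the selector string or the library is at fault.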
