Showdown - Java HTML Parsing Comparison
February 2, 2008 at 4:10 pm · Filed under Java
I had to do some HTML parsing today, but unfortunately most HTML on the web is not well-formed like the markup created here at Lumidant. Missing end tags and other broken syntax throw a wrench into the works. Luckily, others have already addressed this issue, many times over in fact, which leaves many wondering which solution to implement.
Once you parse HTML, you can do some cool stuff with it like transform it or extract some information. For that reason it is sometimes used for screen scraping. So, to test the parsing libraries, I decided to do exactly that and see if I could parse the HTML well enough to extract links from it using an XQuery. The contenders were NekoHTML, HtmlCleaner, TagSoup, and jTidy. I know that there are many others I could have chosen from as well, but this seemed to be a good sampling and there’s only so much time in the day. I also chose 10 URLs to parse. Being a true Clevelander I picked the sites of a number of local attractions. I’m right near all of the stadiums, so the Quicken Loans Arena website was my first target. I sometimes jokingly refer to my city as the “Mistake on the Lake” and the pure awfulness of the HTML from my city did not fail me. The ten URLs I chose are:
http://www.theqarena.com
http://cleveland.indians.mlb.com
http://www.clevelandbrowns.com
http://www.cbgarden.org
http://www.clemetzoo.com
http://www.cmnh.org
http://www.clevelandart.org
http://www.mocacleveland.org
http://www.glsc.org
http://www.rockhall.com
I gave each library an InputStream created from a URL (referred to as urlIS in the code samples below) and expected an org.w3c.dom.Node in return once the parse operation was completed. I implemented each library in its own class, each extending an AbstractScraper that implements a Scraper interface I created. This was a design tip fresh in my mind from reading my all-time favorite technical book: Effective Java by Josh Bloch. The implementation-specific code for each library is below:
NekoHTML:
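// Assumes org.cyberneko.html.parsers.DOMParser and org.xml.sax.InputSource
final DOMParser parser = new DOMParser();
try {
    // Parse the stream, then read back the DOM that the parser built
    parser.parse(new InputSource(urlIS));
    document = parser.getDocument();
} catch (SAXException e) {
    e.printStackTrace();
} catch (IOException e) {
    e.printStackTrace();
}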
TagSoup:
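// Assumes org.ccil.cowan.tagsoup.Parser; SAX2DOM is presumably Xalan's SAX-to-DOM handler
final Parser parser = new Parser();
SAX2DOM sax2dom = null;
try {
    // TagSoup emits SAX events, so a content handler collects them into a DOM
    sax2dom = new SAX2DOM();
    parser.setContentHandler(sax2dom);
    parser.setFeature(Parser.namespacesFeature, false);
    parser.parse(new InputSource(urlIS));
} catch (Exception e) {
    e.printStackTrace();
}
// sax2dom stays null if its constructor threw, so guard against a NullPointerException
if (sax2dom != null) {
    document = sax2dom.getDOM();
}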
jTidy:
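// Assumes org.w3c.tidy.Tidy
final Tidy tidy = new Tidy();
tidy.setQuiet(true);
tidy.setShowWarnings(false);
// Ask for output even when jTidy reports errors in the markup
tidy.setForceOutput(true);
// A null OutputStream means only the DOM is wanted, not the tidied markup
document = tidy.parseDOM(urlIS, null);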
HtmlCleaner:
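// Assumes org.htmlcleaner.HtmlCleaner, using the API as it existed when this was written
final HtmlCleaner cleaner = new HtmlCleaner(urlIS);
try {
    // Clean the markup first, then build a DOM from the cleaned result
    cleaner.clean();
    document = cleaner.createDOM();
} catch (Exception e) {
    e.printStackTrace();
}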
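For anyone who wants to picture the Scraper and AbstractScraper arrangement mentioned earlier, here is a minimal sketch of how it could look. It is not my original code: the method names (scrape, parse, name), the StrictXmlScraper class, and the stream helper are all assumptions made for illustration. The stand-in implementation uses the JDK's strict XML parser, so it only copes with well-formed markup, which is exactly the limitation the libraries above are meant to remove.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Node;

public class ScraperSketch {

    /** Hypothetical contract: turn an HTML stream into a DOM node. */
    public interface Scraper {
        Node scrape(InputStream html);
    }

    /** Hypothetical skeleton: shared error handling, one hook per library. */
    public abstract static class AbstractScraper implements Scraper {
        @Override
        public final Node scrape(InputStream html) {
            try {
                return parse(html);
            } catch (Exception e) {
                // Report the failure and signal it with null instead of rethrowing.
                System.err.println(name() + " failed: " + e);
                return null;
            }
        }

        /** Library-specific parsing goes here. */
        protected abstract Node parse(InputStream html) throws Exception;

        protected String name() {
            return getClass().getSimpleName();
        }
    }

    /** Stand-in implementation: the JDK's XML parser, well-formed input only. */
    public static final class StrictXmlScraper extends AbstractScraper {
        @Override
        protected Node parse(InputStream html) throws Exception {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(html);
        }
    }

    /** Demo helper: wraps a string in an InputStream, standing in for urlIS. */
    public static InputStream stream(String markup) {
        return new ByteArrayInputStream(markup.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        Scraper scraper = new StrictXmlScraper();
        Node ok = scraper.scrape(stream("<html><body><a href=\"x\">x</a></body></html>"));
        Node broken = scraper.scrape(stream("<html><body><a href=x>x</body>"));
        System.out.println("well-formed parsed: " + (ok != null));
        System.out.println("broken parsed: " + (broken != null));
    }
}
```

With a layout like this, each library-specific snippet above would become the body of one parse method, and the code that counts links never needs to know which library produced the Node.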
Finally, to judge each library's ability to parse the HTML, I ran the XQuery “//a” to grab all the <a> tags from the document. The only one of these parsing libraries I had used before was jTidy. It was able to extract the links from 5 of the 10 documents. However, the clear winner was HtmlCleaner. It was the only library to successfully clean all 10 documents. Most of the others were not able to make it past even the very first link I provided, which was to the Quicken Loans Arena site. HtmlCleaner’s full results:
Found 87 links at http://www.theqarena.com/
Found 156 links at http://cleveland.indians.mlb.com/
Found 96 links at http://www.clevelandbrowns.com/
Found 106 links at http://www.cbgarden.org/
Found 70 links at http://www.clemetzoo.com/
Found 23 links at http://www.cmnh.org/site/
Found 27 links at http://www.clevelandart.org/
Found 51 links at http://www.mocacleveland.org/
Found 27 links at http://www.glsc.org/
Found 90 links at http://www.rockhall.com/
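The “//a” query is also a plain XPath expression, so the link count can be reproduced against any org.w3c.dom.Node with the JDK's javax.xml.xpath package. The sketch below is an assumption about how such a check could be written, not the code I actually ran. The class and method names (LinkCounter, countLinks, parseWellFormed) are made up for the example, and the demo input is parsed with the JDK's own XML parser in place of one of the HTML libraries.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class LinkCounter {

    /** Counts the elements matched by the XPath expression //a under the given node. */
    public static int countLinks(Node root) {
        try {
            NodeList links = (NodeList) XPathFactory.newInstance()
                    .newXPath()
                    .evaluate("//a", root, XPathConstants.NODESET);
            return links.getLength();
        } catch (Exception e) {
            throw new IllegalStateException("XPath evaluation failed", e);
        }
    }

    /** Demo helper only: parses a well-formed snippet with the JDK's XML parser. */
    public static Node parseWellFormed(String markup) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(markup)));
        } catch (Exception e) {
            throw new IllegalStateException("Not well-formed: " + markup, e);
        }
    }

    public static void main(String[] args) {
        Node doc = parseWellFormed(
                "<html><body><a href=\"/a\">A</a><p><a href=\"/b\">B</a></p></body></html>");
        // prints "Found 2 links"
        System.out.println("Found " + countLinks(doc) + " links");
    }
}
```

One thing worth keeping in mind with this approach: XPath name tests are case-sensitive, so a DOM whose elements are reported as A instead of a yields zero matches for “//a” even when the parse itself succeeded.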
One disclaimer: I did not go out of my way to tune any of these libraries. Some of them have additional options that might have improved their results. I did not wade through the documentation to figure out what those options were and simply used the plain vanilla incantations. HtmlCleaner seems to offer everything I need and was quick and easy to implement.
Reposted from:
http://www.benmccann.com/dev-blog/java-html-parsing-library-comparison/