Jericho HTML Parser
Jericho HTML Parser is a simple but powerful java library allowing analysis and manipulation of parts of an HTML document, including some common server-side tags, while reproducing verbatim any unrecognised or invalid HTML. It also provides high-level HTML form manipulation functions.
It is an open source library released under both the Eclipse Public License (EPL) and GNU Lesser General Public License (LGPL). You are therefore free to use it in commercial applications subject to the terms detailed in either one of these licence documents.
The javadocs provide comprehensive documentation of the entire API, as well as being a very useful reference on aspects of HTML and XML in general.
Visit the SourceForge.net project page at http://sourceforge.net/projects/jerichohtml/ for downloads and support.
You can also rate the project highly at http://freshmeat.net/projects/jerichohtml/
Release notes for each version can be found in a file called release.txt in the project root directory.
Features
The library distinguishes itself from other HTML parsers with the following major features:
- The presence of badly formatted HTML does not interfere with the parsing of the rest of the document, which makes the library ideal for use with "real-world" HTML that chokes other parsers.
- ASP, JSP, PSP, PHP and Mason server tags are explicitly recognised by the parser. This means that normal HTML is still parsed properly even if there are server tags inside them, which is common for example when dynamically setting element attributes.
- It is neither an event nor tree based parser, but rather uses a combination of simple text search, efficient tag recognition and a tag position cache. The text of the whole source document is first loaded into memory, and each search operation then scans only the relevant segments for the characters it needs.
- Compared to a tree based parser such as DOM, the memory and resource requirements can be far better if only small sections of the document need to be parsed or modified. Incorrect or badly formatted HTML can easily be ignored, unlike tree based parsers which must identify every node in the document from top to bottom.
- Compared to an event based parser such as SAX, the interface is on a much higher level and more intuitive, and a tree representation of the document element hierarchy is easily created if required.
- The begin and end positions in the source document of all parsed segments are accessible, allowing modification of only selected segments of the document without having to reconstruct the entire document from a tree.
- The row and column number of each position in the source document is easily accessible.
- Provides a simple but comprehensive interface for the analysis and manipulation of HTML form controls, including the extraction and population of initial values, and conversion to read-only or data display modes. Analysis of the form controls also allows data received from the form to be stored and presented in an appropriate manner.
- Custom tag types can be easily defined and registered for recognition by the parser.
- Built-in functionality to format HTML source code that indents elements according to their depth in the document element hierarchy.
- Built-in functionality to render HTML markup with simple text formatting.
- Built-in functionality to extract all text from HTML markup, suitable for feeding into a text search engine such as Apache Lucene.
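The position-based approach described in the features above (plain text search, begin and end offsets for every segment, modification of selected segments only, and row/column lookup) can be illustrated with a small standard-library sketch. This is not Jericho code and does not use its API: the class `TagPositionSketch`, its methods, and the deliberately naive regular expression are all hypothetical names made up for this illustration, and a real parser has to handle comments, server tags and quoted `>` characters that this sketch ignores.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Illustrative only: a toy position-tracking tag search, not part of the Jericho API. */
final class TagPositionSketch {

    /** A located start tag with its begin (inclusive) and end (exclusive) offsets in the source. */
    static final class TagSpan {
        final String name;
        final int begin;
        final int end;

        TagSpan(String name, int begin, int end) {
            this.name = name;
            this.begin = begin;
            this.end = end;
        }
    }

    // Naive start-tag pattern (an assumption of this sketch): '<', a tag name, anything up to the next '>'.
    private static final Pattern START_TAG =
            Pattern.compile("<([A-Za-z][A-Za-z0-9]*)\\b[^>]*>");

    /** Finds all start tags with the given name by scanning the text; no tree is built. */
    static List<TagSpan> findStartTags(String source, String name) {
        List<TagSpan> result = new ArrayList<>();
        Matcher m = START_TAG.matcher(source);
        while (m.find()) {
            if (m.group(1).equalsIgnoreCase(name)) {
                result.add(new TagSpan(m.group(1), m.start(), m.end()));
            }
        }
        return result;
    }

    /** Replaces one span and copies everything else verbatim, including invalid markup. */
    static String replaceSpan(String source, TagSpan span, String replacement) {
        return source.substring(0, span.begin) + replacement + source.substring(span.end);
    }

    /** Returns the 1-based {row, column} of a character offset in the source. */
    static int[] rowColumn(String source, int pos) {
        int row = 1;
        int col = 1;
        for (int i = 0; i < pos; i++) {
            if (source.charAt(i) == '\n') {
                row++;
                col = 1;
            } else {
                col++;
            }
        }
        return new int[] {row, col};
    }

    public static void main(String[] args) {
        // Badly formed input: <p> is closed by </b>, and the second <p> is never closed.
        String html = "<p>one</b>\n<img src=\"a.png\">\n<p>two";
        TagSpan img = findStartTags(html, "img").get(0);
        int[] rc = rowColumn(html, img.begin);
        System.out.println("img at offsets " + img.begin + "-" + img.end
                + ", row " + rc[0] + ", column " + rc[1]);
        System.out.println(replaceSpan(html, img, "<img src=\"b.png\">"));
    }
}
```

Because only the selected span is rewritten, the mismatched `</b>` and the unclosed `<p>` in the sample input pass through unchanged, which is the behaviour a tree-based parser would have trouble reproducing.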
Sample Programs
The samples/console directory in the download package contains sample programs for performing common tasks and demonstrating the functionality of the library. The .bat files can be run directly on an MS-Windows operating system, or the following syntax can be used on a UNIX based operating system from the samples/console directory:
java -classpath classes:../lib/jericho-html-x.x.jar ProgramName
where x.x is the current release number and ProgramName is the name of the sample program to run. (On UNIX the classpath entries are separated by a colon; the semicolon is the separator used on MS-Windows.)
The following sample programs are available:
| Program | Description |
| --- | --- |
| ConvertStyleSheets.java | Demonstrates how to detect all external style sheets and place them inline into the document. |
| DisplayAllElements.java | Demonstrates the behaviour of the library when retrieving all elements from a document containing a mix of normal HTML, different types of server tags, and badly formatted HTML. |
| ExtractText.java | Demonstrates the use of the TextExtractor class that extracts all of the text from a document, as well as the title, description, keywords and links. |
| FindSpecificTags.java | Demonstrates how to search for tags with a specified name, in a specified namespace, or special tags such as document type declarations, XML declarations, XML processing instructions, common server tags, PHP tags, Mason tags, and HTML comments. |
| FormControlDisplayCharacteristics.java | Demonstrates setting the display characteristics of individual form controls. This allows a control to be disabled, removed, or replaced with a plain text representation of its value (display value). The new document is written to a file called NewForm.html. |
| FormFieldCSVOutput.java | Demonstrates the use of the FormFields.getColumnValues(Map) method to store form data in a .CSV file, automatically creating separate columns for fields that can contain multiple values (such as checkboxes). The output is written to a file called FormData.csv. |
| FormFieldList.java | Demonstrates the use of the Segment.findFormFields() method to list all form fields and their associated controls in a document. |
| FormFieldSetValues.java | Demonstrates setting the values of form controls, which is best done via the FormFields object. The new document is written to a file called NewForm.html. |
| FormatSource.java | Demonstrates the use of the SourceFormatter class that formats HTML source by laying out each non-inline-level element on a new line with an appropriate indent. Also known as a "source beautifier". |
| RenderToText.java | Demonstrates the use of the Renderer class that performs a simple text rendering of HTML markup, similar to the way Mozilla Thunderbird and other email clients provide an automatic conversion of HTML content to text in their alternative MIME encoding of emails. |
| Encoding.java | Demonstrates the use of the EncodingDetector class and how to determine the encoding of a source document. |
| SplitLongLines.java | Demonstrates how to reformat a document so that lines exceeding a certain number of characters are split into multiple lines. |
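The idea behind the FormFieldCSVOutput sample, one CSV column per possible value of a multi-value field, can be sketched with the standard library alone. The following is not the library's implementation and does not call FormFields.getColumnValues(Map): the class `MultiValueCsvSketch`, the `field.value` column naming and the `true`/`false` cell values are assumptions chosen for this illustration, and no CSV escaping is performed.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Illustrative only: flattens multi-value form fields into fixed CSV columns. */
final class MultiValueCsvSketch {

    /**
     * Builds the header row. A field mapped to an empty list is treated as a
     * free-text field with one column; a field with predefined values (for
     * example a checkbox group) gets one column per value, named "field.value"
     * (a naming convention assumed for this sketch).
     */
    static List<String> columnLabels(Map<String, List<String>> predefinedValues) {
        List<String> labels = new ArrayList<>();
        for (Map.Entry<String, List<String>> e : predefinedValues.entrySet()) {
            if (e.getValue().isEmpty()) {
                labels.add(e.getKey());
            } else {
                for (String value : e.getValue()) {
                    labels.add(e.getKey() + "." + value);
                }
            }
        }
        return labels;
    }

    /** Builds one data row in the same column order as columnLabels. */
    static List<String> columnValues(Map<String, List<String>> predefinedValues,
                                     Map<String, List<String>> submitted) {
        List<String> row = new ArrayList<>();
        for (Map.Entry<String, List<String>> e : predefinedValues.entrySet()) {
            List<String> got = submitted.getOrDefault(e.getKey(), Collections.<String>emptyList());
            if (e.getValue().isEmpty()) {
                row.add(got.isEmpty() ? "" : got.get(0));
            } else {
                for (String value : e.getValue()) {
                    row.add(String.valueOf(got.contains(value)));
                }
            }
        }
        return row;
    }

    public static void main(String[] args) {
        // Hypothetical form: a text field "name" and a checkbox group "colour".
        Map<String, List<String>> fields = new LinkedHashMap<>();
        fields.put("name", Collections.<String>emptyList());
        fields.put("colour", Arrays.asList("red", "green", "blue"));

        Map<String, List<String>> submitted = new LinkedHashMap<>();
        submitted.put("name", Arrays.asList("Ann"));
        submitted.put("colour", Arrays.asList("red", "blue"));

        System.out.println(String.join(",", columnLabels(fields)));
        System.out.println(String.join(",", columnValues(fields, submitted)));
    }
}
```

Fixing the columns from the form definition rather than from each submission keeps every row the same width, so submissions with different checkbox selections still line up in the same file.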
Building
The build and sample files are implemented as DOS .bat files only, because I wanted to avoid the need to install Ant for such a simple library. Sorry to all the UNIX users for the inconvenience.
On the Drawing Board...
- Ability to generate a JDOM document, making it a JTidy alternative
- Online interactive sample programs - please let me know if you are willing to host the FormatSource.jsp page on your web server
- .NET (DotNet) version if enough interest is shown (register your interest via the forums)
Alternative HTML Parsers
This package was originally written in the latter half of 2002. At that time I evaluated 6 other parsers, none of which were capable of achieving my aims. Most couldn't reproduce a typical HTML document without change, none could reproduce a source document containing badly formatted or non-HTML components without change, and none provided a means to track the positions of nodes in the source text. A list of these parsers and a brief description follows, but please note that I have not revised this analysis since before this package was written. Please let me know if there are any errors.
- JavaCC HTML Parser by Quiotix Corporation (http://www.quiotix.com/downloads/html-parser/)
GNU GPL licence, expensive licence fee to use in commercial application. Does not support document structure (parses into a flat node stream).
- Demonstrational HTML 3.2 parser bundled with JavaCC.
Virtually useless.
- JTidy (http://jtidy.sourceforge.net/)
Supports document structure, but by its very nature it "tidies" up anything it doesn't like in the source document. At first glance it looks like the positions of nodes in the source are accessible, at least in protected start and end fields in the Node class, but these are pointers into a different buffer and are of no use.
- javax.swing.text.html.parser.Parser
Comes standard in the JDK. Supports document structure. Does not track the positions of nodes in the source text, but can be easily modified to do so (although I am not sure of the legal implications of such modifications). Requires a DTD to function, but only comes with an HTML 3.2 DTD, which is unsuitable. Even if an HTML 4.01 DTD were found, the parser itself might need tweaking to cater for the new element types. The DTD needs to be in the format of a "bdtd" file, which is a binary format used only by Sun in this parser implementation. I have found many requests for a 4.01 bdtd file in newsgroups etc. on the web, but they all remain unanswered. Building it from scratch is not so easy.
- Kizna HTML Parser v1.1 (http://htmlparser.sourceforge.net/)
GNU LGPL licence. Version 1.1 was very simple without support for document structure. I have since revisited this project at SourceForge (early 2004), where version 1.4 is now available. There are now two separate libraries, one with and one without document structure support. It claims to now also be capable of reproducing source text verbatim.
- CyberNeko HTML Parser (http://www.apache.org/~andyc/neko/doc/html/index.html)
Apache-style licence. Supports document structure. Based on the very popular Xerces XML parser. At the time of evaluation this parser didn't regenerate the source accurately enough.