Official Demo Documentation for Jericho HTML Parser

Jericho HTML Parser

Jericho HTML Parser is a simple but powerful Java library allowing analysis and manipulation of parts of an HTML document, including some common server-side tags, while reproducing verbatim any unrecognised or invalid HTML. It also provides high-level HTML form manipulation functions.

It is an open source library released under both the Eclipse Public License (EPL) and GNU Lesser General Public License (LGPL). You are therefore free to use it in commercial applications subject to the terms detailed in either one of these licence documents.

The javadocs provide comprehensive documentation of the entire API, as well as being a very useful reference on aspects of HTML and XML in general.

Visit the SourceForge.net project page at http://sourceforge.net/projects/jerichohtml/ for downloads and support.

You can also rate the project highly at http://freshmeat.net/projects/jerichohtml/

Release notes for each version can be found in a file called release.txt in the project root directory.

Features

The library distinguishes itself from other HTML parsers with the following major features:

  • The presence of badly formatted HTML does not interfere with the parsing of the rest of the document, which makes the library ideal for use with "real-world" HTML that chokes other parsers.
  • ASP, JSP, PSP, PHP and Mason server tags are explicitly recognised by the parser. This means that normal HTML is still parsed properly even if there are server tags inside them, which is common for example when dynamically setting element attributes.
  • It is neither an event nor tree based parser, but rather uses a combination of simple text search, efficient tag recognition and a tag position cache. The text of the whole source document is first loaded into memory, and then only the relevant segments searched for the relevant characters of each search operation.
  • Compared to a tree based parser such as DOM, the memory and resource requirements can be far better if only small sections of the document need to be parsed or modified. Incorrect or badly formatted HTML can easily be ignored, unlike tree based parsers which must identify every node in the document from top to bottom.
  • Compared to an event based parser such as SAX, the interface is on a much higher level and more intuitive, and a tree representation of the document element hierarchy is easily created if required.
  • The begin and end positions in the source document of all parsed segments are accessible, allowing modification of only selected segments of the document without having to reconstruct the entire document from a tree.
  • The row and column number of each position in the source document is easily accessible.
  • Provides a simple but comprehensive interface for the analysis and manipulation of HTML form controls, including the extraction and population of initial values, and conversion to read-only or data display modes. Analysis of the form controls also allows data received from the form to be stored and presented in an appropriate manner.
  • Custom tag types can be easily defined and registered for recognition by the parser.
  • Built-in functionality to format HTML source code that indents elements according to their depth in the document element hierarchy.
  • Built-in functionality to render HTML markup with simple text formatting.
  • Built-in functionality to extract all text from HTML markup, suitable for feeding into a text search engine such as Apache Lucene.
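The position-based approach described in the features above can be illustrated with a minimal sketch in plain Java. It does not use the Jericho API, and it is not how the library is implemented: the names PositionSketch, Span, findStartTags, replaceSpans and rowColumn are hypothetical and exist only for this illustration. It assumes '\n' line breaks and no '>' characters inside attribute values. The sketch finds start tags by simple text search, reports the row and column of each, and rewrites only those segments while copying everything else verbatim, including the badly formatted `</b` in the sample input.

```java
import java.util.ArrayList;
import java.util.List;

final class PositionSketch {
    /** Hypothetical helper: a half-open [begin, end) range of the source text. */
    static final class Span {
        final int begin;
        final int end;
        Span(int begin, int end) { this.begin = begin; this.end = end; }
    }

    /** Finds start tags with the given name by plain text search (case-insensitive). */
    static List<Span> findStartTags(String source, String name) {
        List<Span> spans = new ArrayList<>();
        for (int i = source.indexOf('<'); i >= 0; i = source.indexOf('<', i + 1)) {
            int afterName = i + 1 + name.length();
            if (afterName >= source.length()) break; // no room left for a complete tag
            if (!source.regionMatches(true, i + 1, name, 0, name.length())) continue;
            char next = source.charAt(afterName);
            // The name must end here, so that "<b" does not match "<br" or "<body".
            if (next != '>' && next != '/' && !Character.isWhitespace(next)) continue;
            int close = source.indexOf('>', afterName);
            if (close < 0) break; // unterminated tag: nothing more to find
            spans.add(new Span(i, close + 1));
            i = close; // resume searching after this tag
        }
        return spans;
    }

    /** Replaces the given spans (sorted, non-overlapping); all other text is copied unchanged. */
    static String replaceSpans(String source, List<Span> spans, String replacement) {
        StringBuilder out = new StringBuilder();
        int last = 0;
        for (Span s : spans) {
            out.append(source, last, s.begin).append(replacement);
            last = s.end;
        }
        return out.append(source, last, source.length()).toString();
    }

    /** Returns {row, column}, both 1-based, for a character position in the source. */
    static int[] rowColumn(String source, int pos) {
        int row = 1;
        int lineStart = 0;
        for (int i = 0; i < pos; i++) {
            if (source.charAt(i) == '\n') {
                row++;
                lineStart = i + 1;
            }
        }
        return new int[] { row, pos - lineStart + 1 };
    }

    public static void main(String[] args) {
        // Sample input with an unclosed "</b" end tag that is left untouched.
        String html = "<p>One<br>\n<b>two</b <i>oops\n<BR/>three";
        List<Span> brs = findStartTags(html, "br");
        for (Span s : brs) {
            int[] rc = rowColumn(html, s.begin);
            System.out.println("br at row " + rc[0] + ", column " + rc[1]);
        }
        System.out.println(replaceSpans(html, brs, "<br />"));
    }
}
```

Because only the recorded begin and end positions are rewritten, nothing has to be rebuilt from a tree, and markup the search never matched passes through unchanged.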

Sample Programs

The samples/console directory in the download package contains sample programs for performing common tasks and demonstrating the functionality of the library. The .bat files can be run directly on a MS-Windows operating system, or the following syntax can be used on a UNIX based operating system from the samples/console directory:

java -classpath classes:../lib/jericho-html-x.x.jar ProgramName

where x.x is the current release number and ProgramName is the name of the sample program to run.

The following sample programs are available:

  • ConvertStyleSheets.java: Demonstrates how to detect all external style sheets and place them inline into the document.
  • DisplayAllElements.java: Demonstrates the behaviour of the library when retrieving all elements from a document containing a mix of normal HTML, different types of server tags, and badly formatted HTML.
  • ExtractText.java: Demonstrates the use of the TextExtractor class that extracts all of the text from a document, as well as the title, description, keywords and links.
  • FindSpecificTags.java: Demonstrates how to search for tags with a specified name, in a specified namespace, or special tags such as document type declarations, XML declarations, XML processing instructions, common server tags, PHP tags, Mason tags, and HTML comments.
  • FormControlDisplayCharacteristics.java: Demonstrates setting the display characteristics of individual form controls. This allows a control to be disabled, removed, or replaced with a plain text representation of its value (display value). The new document is written to a file called NewForm.html.
  • FormFieldCSVOutput.java: Demonstrates the use of the FormFields.getColumnValues(Map) method to store form data in a .CSV file, automatically creating separate columns for fields that can contain multiple values (such as checkboxes). The output is written to a file called FormData.csv.
  • FormFieldList.java: Demonstrates the use of the Segment.findFormFields() method to list all form fields and their associated controls in a document.
  • FormFieldSetValues.java: Demonstrates setting the values of form controls, which is best done via the FormFields object. The new document is written to a file called NewForm.html.
  • FormatSource.java: Demonstrates the use of the SourceFormatter class that formats HTML source by laying out each non-inline-level element on a new line with an appropriate indent. Also known as a "source beautifier".
  • RenderToText.java: Demonstrates the use of the Renderer class that performs a simple text rendering of HTML markup, similar to the way Mozilla Thunderbird and other email clients provide an automatic conversion of HTML content to text in their alternative MIME encoding of emails.
  • Encoding.java: Demonstrates the use of the EncodingDetector class and how to determine the encoding of a source document.
  • SplitLongLines.java: Demonstrates how to reformat a document so that lines exceeding a certain number of characters are split into multiple lines.

Building

The build and sample files are implemented as DOS .bat files only. This is because I wanted to avoid the need to install Ant for such a simple library. Sorry to all the UNIX users for the inconvenience.

On the Drawing Board...

  • Ability to generate a JDOM document, making it a JTidy alternative
  • Online interactive sample programs - please let me know if you are willing to host the FormatSource.jsp page on your web server
  • .NET (DotNet) version if enough interest is shown (register your interest via the forums)

Alternative HTML Parsers

This package was originally written in the latter half of 2002. At that time I evaluated 6 other parsers, none of which were capable of achieving my aims. Most couldn't reproduce a typical HTML document without change, none could reproduce a source document containing badly formatted or non-HTML components without change, and none provided a means to track the positions of nodes in the source text. A list of these parsers and a brief description follows, but please note that I have not revised this analysis since before this package was written. Please let me know if there are any errors.

  • JavaCC HTML Parser by Quiotix Corporation (http://www.quiotix.com/downloads/html-parser/)
    GNU GPL licence, expensive licence fee to use in commercial application. Does not support document structure (parses into a flat node stream).
  • Demonstrational HTML 3.2 parser bundled with JavaCC. Virtually useless.
  • JTidy (http://jtidy.sourceforge.net/)
    Supports document structure, but by its very nature it "tidies" up anything it doesn't like in the source document. At first glance it looks like the positions of nodes in the source are accessible, at least in protected start and end fields in the Node class, but these are pointers into a different buffer and are of no use.
  • javax.swing.text.html.parser.Parser
    Comes standard in the JDK. Supports document structure. Does not track the positions of nodes in the source text, but can be easily modified to do so (although I am not sure of the legal implications of modifications). Requires a DTD to function, but only comes with an HTML 3.2 DTD, which is unsuitable. Even if an HTML 4.01 DTD were found, the parser itself might need tweaking to cater for the new element types. The DTD needs to be in the format of a "bdtd" file, which is a binary format used only by Sun in this parser implementation. I have found many requests for a 4.01 bdtd file in newsgroups etc. on the web, but they all remain unanswered. Building it from scratch is not so easy.
  • Kizna HTML Parser v1.1 (http://htmlparser.sourceforge.net/)
    GNU LGPL licence. Version 1.1 was very simple without support for document structure. I have since revisited this project on SourceForge (early 2004), where version 1.4 is now available. There are now two separate libraries, one with and one without document structure support. It now also claims to be capable of reproducing source text verbatim.
  • CyberNeko HTML Parser (http://www.apache.org/~andyc/neko/doc/html/index.html)
    Apache-style licence. Supports document structure. Based on the very popular Xerces XML parser. At the time of evaluation this parser didn't regenerate the source accurately enough.