HTML Parser

http://htmlparser.sourceforge.net/

HTML Parser is a Java library used to parse HTML in either a linear or nested fashion. Primarily used for transformation or extraction, it features filters, visitors, custom tags and easy-to-use JavaBeans. It is a fast, robust and well-tested package.
Welcome to the homepage of HTMLParser, a super-fast real-time parser for real-world HTML. What has attracted most developers to HTMLParser is its simplicity of design, its speed and its ability to handle streaming real-world HTML.

The two fundamental use-cases that are handled by the parser are extraction and transformation (the synthesis use-case, where HTML pages are created from scratch, is better handled by other tools closer to the source of data). While prior versions concentrated on data extraction from web pages, Version 1.4 of the HTMLParser has substantial improvements in the area of transforming web pages, with simplified tag creation and editing, and verbatim toHtml() method output.

In general, to use the HTMLParser you will need to be able to write code in the Java programming language. Although some example programs are provided that may be useful as they stand, it's more than likely you will need (or want) to create your own programs or modify the ones provided to match your intended application.

To use the library, you will need to add either htmllexer.jar or htmlparser.jar to your classpath when compiling and running. The htmllexer.jar provides low-level access to generic string, remark and tag nodes on the page in a linear, flat, sequential manner. The htmlparser.jar, which includes the classes found in htmllexer.jar, provides access to a page as a sequence of nested differentiated tags containing string, remark and other tag nodes. So where the output from calls to the lexer nextNode() method might be:

    <html>
    <head>
    <title>
    "Welcome"
    </title>
    </head>
    <body>
    etc...
   
The output from the parser NodeIterator would nest the tags as children of the <html>, <head> and other nodes (here represented by indentation):


    <html>
        <head>
            <title>
                "Welcome"
                </title>
            </head>
        <body>
            etc...
   
The parser attempts to balance opening tags with ending tags to present the structure of the page, while the lexer simply spits out nodes. If your application requires only modest structural knowledge of the page, and is primarily concerned with individual, isolated nodes, you should consider using the lightweight lexer. But if your application requires knowledge of the nested structure of the page, for example processing tables, you will probably want to use the full parser.
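The difference between the flat and the nested view can be sketched in plain Java without the library. The following is only an illustration of the idea, not HTMLParser's implementation or API: the class FlatVsNestedSketch and its methods lex and nest are hypothetical names, the tokenizer assumes well-formed markup, and void elements such as <br> are not handled.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical illustration only; these are not HTMLParser classes.
final class FlatVsNestedSketch {

    /** Splits markup into a flat sequence of tag and text tokens. */
    static List<String> lex(String html) {
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < html.length()) {
            int end;
            if (html.charAt(pos) == '<') {
                // A tag runs up to and including the next '>'.
                int close = html.indexOf('>', pos);
                end = (close < 0) ? html.length() : close + 1;
            } else {
                // Text runs up to the next '<'.
                int open = html.indexOf('<', pos);
                end = (open < 0) ? html.length() : open;
            }
            String token = html.substring(pos, end).trim();
            if (!token.isEmpty()) {
                tokens.add(token);
            }
            pos = end;
        }
        return tokens;
    }

    /** Indents each token by the number of currently open tags. */
    static List<String> nest(List<String> tokens) {
        List<String> lines = new ArrayList<>();
        Deque<String> open = new ArrayDeque<>();
        for (String token : tokens) {
            boolean isTag = token.startsWith("<");
            boolean isEndTag = token.startsWith("</");
            // End tags are shown at the depth of their element's children.
            String shown = isTag ? token : "\"" + token + "\"";
            lines.add("    ".repeat(open.size()) + shown);
            if (isEndTag) {
                if (!open.isEmpty()) {
                    open.pop();
                }
            } else if (isTag && !token.endsWith("/>")) {
                open.push(token);
            }
        }
        return lines;
    }

    public static void main(String[] args) {
        String html = "<html><head><title>Welcome</title></head><body></body></html>";
        List<String> tokens = lex(html);
        tokens.forEach(System.out::println);        // flat, lexer-style view
        nest(tokens).forEach(System.out::println);  // nested, parser-style view
    }
}
```

The stack in nest is what "balancing" amounts to in this sketch: an opening tag deepens the indentation until its end tag is seen. A real parser also has to cope with missing or misordered end tags, which this sketch does not attempt.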

Extraction
Extraction encompasses all the information retrieval programs that are not meant to preserve the source page. This covers uses like:
  • text extraction, for example as input to text search engine databases
  • link extraction, for crawling through web pages or harvesting email addresses
  • screen scraping, for programmatic data input from web pages
  • resource extraction, collecting images or sound
  • a browser front end, the preliminary stage of page display
  • link checking, ensuring links are valid
  • site monitoring, checking for page differences beyond simplistic diffs
There are several facilities in the HTMLParser codebase to help with extraction, including filters, visitors and JavaBeans.
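As a rough illustration of the link extraction use-case, the sketch below pulls href values out of anchor tags and discards the rest of the page. It deliberately uses only java.util.regex rather than the HTMLParser filters, visitors or beans, so the class LinkExtractionSketch and the method extractLinks are hypothetical names. A regular expression is a weak stand-in for a real parser: this one assumes quoted attribute values and does no entity decoding.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only; not part of the HTMLParser library.
final class LinkExtractionSketch {

    // Matches href="..." or href='...' inside an <a ...> tag.
    // Unquoted attribute values are not handled.
    private static final Pattern HREF = Pattern.compile(
            "<a\\b[^>]*?\\bhref\\s*=\\s*([\"'])(.*?)\\1",
            Pattern.CASE_INSENSITIVE);

    /** Returns the href values of all anchor tags, in document order. */
    static List<String> extractLinks(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            links.add(m.group(2));
        }
        return links;
    }

    public static void main(String[] args) {
        String html = "<p><a href=\"http://example.com/a\">A</a> "
                + "<A class='x' HREF='/b'>B</A></p>";
        extractLinks(html).forEach(System.out::println);
    }
}
```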

Transformation
Transformation includes all processing where the input and the output are HTML pages. Some examples are:
  • URL rewriting, modifying some or all links on a page
  • site capture, moving content from the web to local disk
  • censorship, removing offending words and phrases from pages
  • HTML cleanup, correcting erroneous pages
  • ad removal, excising URLs referencing advertising
  • conversion to XML, moving existing web pages to XML
During or after reading in a page, operations on the nodes can accomplish many transformation tasks "in place", which can then be output with the toHtml() method. Depending on the purpose of your application, you will probably want to look into node decorators, visitors, or custom tags in conjunction with the PrototypicalNodeFactory.
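To make the URL rewriting use-case concrete, here is a minimal sketch that rewrites anchor href values and leaves every other character of the page untouched, which is the property the verbatim toHtml() output is meant to give you. It does not use HTMLParser: UrlRewriteSketch and rewriteLinks are hypothetical names, the regular expression assumes quoted attribute values, and the example.com base URL is made up for the demonstration.

```java
import java.util.function.UnaryOperator;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only; not part of the HTMLParser library.
final class UrlRewriteSketch {

    // Group 1: everything up to the opening quote, group 2: the quote,
    // group 3: the URL itself.
    private static final Pattern HREF = Pattern.compile(
            "(<a\\b[^>]*?\\bhref\\s*=\\s*)([\"'])(.*?)\\2",
            Pattern.CASE_INSENSITIVE);

    /** Applies rewrite to every anchor href and copies all other text as-is. */
    static String rewriteLinks(String html, UnaryOperator<String> rewrite) {
        Matcher m = HREF.matcher(html);
        StringBuilder out = new StringBuilder();
        while (m.find()) {
            String replacement = m.group(1) + m.group(2)
                    + rewrite.apply(m.group(3)) + m.group(2);
            // quoteReplacement keeps '$' and '\' in URLs from being interpreted.
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        String html = "<a href=\"/docs\">Docs</a> and <img src=\"/logo.png\">";
        // Make site-relative links absolute against an assumed base URL.
        System.out.println(rewriteLinks(html,
                url -> url.startsWith("/") ? "http://example.com" + url : url));
    }
}
```

Passing the rewrite rule as a function keeps the sketch usable for the other link-oriented tasks in the list, such as ad removal, by changing only the lambda.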