HtmlCleaner API
Create cleaner instance:
HtmlCleaner() | Create cleaner with default tag information provider.
HtmlCleaner(ITagInfoProvider) | Create cleaner with custom tag information provider.
Set cleaner properties in order to tune its behavior:
Set cleaner transformations:
CleanerTransformations() | Create collection of transformations.
TagTransformation(String, String, boolean) | Create single tag transformation.
CleanerTransformations. | Add tag transformation to transformations collection.
TagTransformation. | Specify attribute transformation for the tag transformation.
HtmlCleaner. | Set cleaner transformations.
Clean HTML with instance of HtmlCleaner:
class HtmlCleaner: clean(String) | Clean HTML that comes from various sources.
Search cleaned DOM and modify its structure:
class TagNode: getAttributeByName(String) | Work with node (tag) attributes.
class TagNode: getChildTagList() | Find and modify nodes.
HtmlCleaner.setInnerHtml(TagNode, String) | Cleans given portion of HTML and stores it in specified tag node.
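As a rough illustration of cleaning a document and then reading attributes from the resulting tree, here is a minimal sketch. It assumes the HtmlCleaner JAR is on the classpath; the class name `LinkLister` and the method `hrefs` are made up for this example, and `getElementsByName(String, boolean)` is a `TagNode` method that is not listed in the table above.

```java
import org.htmlcleaner.HtmlCleaner;
import org.htmlcleaner.TagNode;

import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class, not part of HtmlCleaner.
public class LinkLister {

    /** Cleans the given HTML and returns the href value of every <a> element. */
    public static List<String> hrefs(String html) {
        try {
            HtmlCleaner cleaner = new HtmlCleaner();   // default tag information provider
            TagNode root = cleaner.clean(html);        // root of the cleaned tree

            List<String> result = new ArrayList<>();
            // Second argument true: search the whole subtree, not only direct children.
            for (TagNode anchor : root.getElementsByName("a", true)) {
                String href = anchor.getAttributeByName("href");
                if (href != null) {
                    result.add(href);
                }
            }
            return result;
        } catch (Exception e) {
            // Some HtmlCleaner versions declare a checked exception on clean(String).
            throw new IllegalStateException("Could not clean HTML", e);
        }
    }
}
```

Calling `LinkLister.hrefs(html)` on a page fragment should return the link targets in document order, even when the input markup is not well formed.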
Serialize DOM nodes:
SimpleXmlSerializer(CleanerProperties) | Create various kinds of XML serializers.
class XmlSerializer: writeXmlToStream(TagNode, OutputStream, String) | Serialize node to different outputs.
DomSerializer.createDOM(TagNode) | Create common DOM objects out of cleaned HTML.
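Once cleaned HTML is available as a standard `org.w3c.dom.Document`, it can be queried with the JDK's own XPath support. The sketch below uses only the JDK: the document is parsed from a well-formed string that stands in for HtmlCleaner output, and the class `DomQuery` with its methods `parse` and `texts` are names invented for this illustration.

```java
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class; it does not call HtmlCleaner itself.
public class DomQuery {

    /** Parses well-formed XML text (a stand-in for serialized cleaner output). */
    public static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    /** Returns the text content of every node matched by the XPath expression. */
    public static List<String> texts(Document doc, String expression) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList nodes = (NodeList) xpath.evaluate(expression, doc, XPathConstants.NODESET);
            List<String> result = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                result.add(nodes.item(i).getTextContent());
            }
            return result;
        } catch (Exception e) {
            throw new IllegalStateException("Could not evaluate " + expression, e);
        }
    }
}
```

In a real program the `Document` would presumably come from `DomSerializer.createDOM(TagNode)` rather than from `parse`; the XPath part would stay the same.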
Providing custom tag info set
HtmlCleaner implements a default HTML tag set and rules for balancing tags, similar to browser behavior. However, the user is free to implement the ITagInfoProvider interface, or to extend one of its implementations, in order to provide a custom tag info set.
The easiest way to do that is to write an XML configuration file that describes all tags and their dependencies, and to use ConfigFileTagProvider like this:
    HtmlCleaner cleaner = new HtmlCleaner(new ConfigFileTagProvider(myConfigFile));
Perhaps the best starting point is the default tag ruleset description file. It is the basis for DefaultTagProvider.
For example, someone may not like the rule that an implicit TBODY is inserted before TR in an HTML table. To remove it, find the <tag name="tr"... element in the XML and remove tbody from the req-enclosing-tags section.
Setting cleaner transformations
The following code snippet demonstrates how to set the transformations from the example:
    ...
    HtmlCleaner cleaner = new HtmlCleaner(...);
    ...
    CleanerTransformations transformations = new CleanerTransformations();

    TagTransformation tt = new TagTransformation("cfoutput");
    transformations.addTransformation(tt);

    tt = new TagTransformation("c:block", "div", false);
    transformations.addTransformation(tt);

    tt = new TagTransformation("font", "span", true);
    tt.addAttributeTransformation("size");
    tt.addAttributeTransformation("face");
    tt.addAttributeTransformation("style", "${style};font-family=${face};font-size=${size};");
    transformations.addTransformation(tt);
    ...
    cleaner.setTransformations(transformations);
    ...
    TagNode node = cleaner.clean(...);
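The style template above combines values of the source tag's attributes through `${name}` placeholders. To make that idea concrete, here is a JDK-only sketch of how such a template could be expanded. It is not HtmlCleaner's implementation: the class `AttributeTemplate` and the method `expand` are hypothetical, and replacing a missing attribute with an empty string is an assumption.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration of ${name} expansion; not HtmlCleaner code.
public class AttributeTemplate {

    // Matches ${name} and captures the attribute name.
    private static final Pattern PLACEHOLDER = Pattern.compile("\\$\\{([^}]+)\\}");

    /**
     * Replaces each ${name} in the template with the value of the source
     * attribute of that name, or with "" if the attribute is absent (assumed).
     */
    public static String expand(String template, Map<String, String> sourceAttributes) {
        Matcher matcher = PLACEHOLDER.matcher(template);
        StringBuffer out = new StringBuffer();
        while (matcher.find()) {
            String value = sourceAttributes.getOrDefault(matcher.group(1), "");
            // quoteReplacement keeps '$' and '\' in attribute values literal.
            matcher.appendReplacement(out, Matcher.quoteReplacement(value));
        }
        matcher.appendTail(out);
        return out.toString();
    }
}
```

Under these assumptions, a source tag such as `<font face="Arial" size="2">` with no `style` attribute would give the new `style` value `;font-family=Arial;font-size=2;`.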