HtmlCleaner CleanerProperties Parameter Configuration - macken - ITeye Blog

Parameter	Default	Explanation
advancedXmlEscape	true	If true, an ampersand (&) that begins a valid XML character sequence (&XXX;) is left as is instead of being escaped to &amp;XXX;.
transResCharsToNCR	false	If true, reserved XML sequences (&amp;, &quot;, &apos;, &lt;, &gt;) are serialized as their numeric character references (&#38;, &#34;, &#39;, &#60;, &#62;). This parameter has an effect only if advancedXmlEscape is set to true.
translateSpecialEntities	true	If true, special HTML entities (e.g. &permil;, &times;) are replaced with the Unicode characters they represent (‰, ×). This does not include &amp;, &lt;, &gt;, &quot;, &apos;.
transSpecialEntitiesToNCR	false	If true, special HTML entities (e.g. &Alpha;) are serialized as their numeric character references (&#913;). This parameter has an effect only if translateSpecialEntities is set to true.
recognizeUnicodeChars	true	If true, HTML characters given by their codes in the form &#XXXX; are replaced with the actual Unicode characters (e.g. &#1078; is replaced with ж).
useCdata	true	If true, HtmlCleaner will treat SCRIPT and STYLE tag contents as CDATA sections, or otherwise it will be regarded as ordinary text (special characters will be escaped).
omitUnknownTags	false	Tells whether to skip (ignore) unknown tags during cleanup.
treatUnknTagsAsContent	false	Tells whether to treat unknown tags as ordinary content, i.e. <something...> will be transformed to &lt;something...&gt;. This attribute is applicable only if omitUnknownTags is set to false.
omitDeprTags	false	Tells whether to skip (ignore) deprecated HTML tags during cleanup.
treatDeprTagsAsContent	false	Tells whether to treat deprecated tags as ordinary content, i.e. <font...> will be transformed to &lt;font...&gt;. This attribute is applicable only if omitDeprTags is set to false.
omitComments	false	Tells whether to skip HTML comments.
omitXmlDeclaration	false	Tells whether or not to put XML declaration line at the beginning of the resulting XML.
omitDoctypeDeclaration	true	Tells whether to skip the DOCTYPE declaration found in the source document. If the HTML document being cleaned does not contain one, none is placed in the result anyway.
omitXmlnsAttributes	false	This flag is deprecated since version 1.3; namespacesAware should be used instead.
omitEnvelope	false	Tells whether to remove the open and close tags of the node being serialized. This parameter was introduced in HtmlCleaner 2.2 to replace omitHtmlEnvelope. If set to true, serialization skips the node's own open and close tags and outputs only its children.
useEmptyElementTags	true	Specifies how to serialize tags with an empty body: if true, compact notation (<xxx/>) is used, otherwise <xxx></xxx>.
allowMultiWordAttributes	true	Tells parser whether to allow attribute values consisting of multiple words or not. If true, attribute att="a b c" will stay like it is, and if false parser will split this into att="a" b="b" c="c" (this is default browsers' behaviour).
allowHtmlInsideAttributes	false	Tells parser whether to allow html tags inside attribute values. For example, when this flag is set att="here is <a href='xxxx'>link</a>" will stay like it is, and if not, parser will end attribute value after "here is". This flag makes sense only if allowMultiWordAttributes is set as well.
ignoreQuestAndExclam	true	Tells parser whether to completely ignore tags that have form <?TAGNAME....> or <!TAGNAME....>. This way some HTML/XML processing instructions may be omitted from the resulting xml.
namespacesAware	true	If true, namespace prefixes found during parsing will be preserved and all necessary XML namespace declarations will be added to the root element. If false, all namespace prefixes and all xmlns namespace declarations will be stripped.
hyphenReplacement	=	XML doesn't allow double hyphen sequence (--) inside comments. This parameter tells which replacement to use for it when double hyphen is encountered during parsing.
pruneTags	empty string	Comma-separated list of tags that will be completely removed (with all nested elements) from the XML tree after parsing. For example, if pruneTags is "script,style", the resulting XML will not contain scripts and styles.
booleanAtts	self	Tells cleaner what value to give to boolean attributes, like checked, selected and similar. Allowed values are self - value of attribute is the same as attribute name (checked = "checked"), empty - attribute value is empty string (checked = "") and true - value of attribute is "true" (checked = "true").
nodeByXpath		XPath expression used to select the first node that is going to be serialized instead of the whole HTML document. For example, if this parameter is set to //table[1], only the first table in the document will be serialized.
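
A numeric character reference, as used by transResCharsToNCR, transSpecialEntitiesToNCR and recognizeUnicodeChars, is simply the decimal Unicode code point wrapped in "&#" and ";". The sketch below illustrates that mapping using only the Java standard library. It is not HtmlCleaner code and makes no claim about how the library implements escaping; the names NcrSketch, toNcr and escapeReserved are hypothetical.

```java
// Illustrative sketch only: NOT HtmlCleaner's implementation.
// NcrSketch, toNcr and escapeReserved are hypothetical names.
class NcrSketch {

    /** Builds the decimal numeric character reference for one code point. */
    static String toNcr(int codePoint) {
        return "&#" + codePoint + ";";
    }

    /** Replaces the five reserved XML characters (& " ' < >) with decimal NCRs. */
    static String escapeReserved(String text) {
        StringBuilder out = new StringBuilder();
        text.codePoints().forEach(cp -> {
            if (cp == '&' || cp == '"' || cp == '\'' || cp == '<' || cp == '>') {
                out.append(toNcr(cp));
            } else {
                out.appendCodePoint(cp);
            }
        });
        return out.toString();
    }

    public static void main(String[] args) {
        // Reserved characters: & -> 38, " -> 34, < -> 60
        System.out.println(escapeReserved("a < b & \"c\"")); // a &#60; b &#38; &#34;c&#34;
        // Greek capital alpha (U+0391) and Cyrillic zhe (U+0436)
        System.out.println(toNcr('\u0391')); // &#913;
        System.out.println(toNcr('\u0436')); // &#1078;
    }
}
```

The code points printed here match the values quoted in the table: 38, 34, 39, 60 and 62 for the reserved characters, 913 for &Alpha; and 1078 for ж.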

2012-07-06 15:34