Filtering HTML with HtmlCleaner

eimhee

Blog category:

JAVA

HTML, programming, XML, Web, .net

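I used to filter HTML with HTMLParser, but found that HTMLParser sometimes cannot parse malformed HTML, and it does not support XPath.

Later, on the site of the open-source crawler Web-Harvest, I found HtmlCleaner, which converts HTML documents into structured XML documents. Other tools of this kind already exist, but HtmlCleaner handles almost all HTML conversions and is under 30 KB, which is to its credit.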

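1. HtmlCleaner's document object model now has functions for handling nodes and attributes, so searching or editing before serialization is easy.
2. It provides basic XPath support on the HtmlCleaner DOM.
3. Parsing yields a lightweight document object that can easily be converted to a standard DOM or JDOM document, or serialized to XML in several ways (compact, pretty-printed).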
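
The first two points can be sketched as follows. This is a minimal illustration, not code from the original post: the class name `TagNodeXPathSketch`, the helper `textsByXPath`, and the sample HTML are hypothetical. It assumes the HtmlCleaner jar is on the classpath and uses `HtmlCleaner.clean(String)`, `TagNode.evaluateXPath(String)`, and `TagNode.getText()`. Some older HtmlCleaner releases declare `IOException` on `clean(String)`, so the helper declares it as well.

```java
import org.htmlcleaner.HtmlCleaner;
import org.htmlcleaner.TagNode;
import org.htmlcleaner.XPatherException;

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class TagNodeXPathSketch {

    // Cleans an HTML string and returns the trimmed text of every
    // element matched by the XPath expression on HtmlCleaner's own DOM.
    public static List<String> textsByXPath(String html, String xpath)
            throws IOException, XPatherException {
        TagNode root = new HtmlCleaner().clean(html);
        List<String> texts = new ArrayList<String>();
        for (Object match : root.evaluateXPath(xpath)) {
            // evaluateXPath returns Object[]; element matches are TagNode instances.
            if (match instanceof TagNode) {
                texts.add(((TagNode) match).getText().toString().trim());
            }
        }
        return texts;
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical sample input with two sibling divs.
        String html = "<div class='news'><p>first</p></div>"
                + "<div class='ad'><p>skip me</p></div>";
        System.out.println(textsByXPath(html, "//div[@class='news']//p"));
    }
}
```

HtmlCleaner's XPath support is only basic, so more complex expressions may still need a full XPath engine applied to the converted document.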

After conversion, the document can be processed with JDOM or dom4j.

package com.citgee.webclip;

import org.htmlcleaner.*;

import java.net.*;
import java.io.*;
import java.util.*;

import org.jdom.*;
//import org.jdom.output.*;
import org.jdom.contrib.helpers.XPathHelper;
import org.jdom.filter.Filter;
import org.jdom.output.Format;
import org.jdom.output.XMLOutputter;
import org.jdom.xpath.XPath;

public class WebClipUtils {
 
 public static Document getDocumentByURL(String url,String charset) throws MalformedURLException, IOException{
  HtmlCleaner htmlCleaner = new HtmlCleaner();
  CleanerProperties props = htmlCleaner.getProperties();
  TagNode node = htmlCleaner.clean(new URL(url),charset);
  JDomSerializer jdomSerializer = new JDomSerializer(props,true);
  Document doc = jdomSerializer.createJDom(node);
  return doc;
 }
 
 public static List<Element> getElementsByTagName(Document doc,String tagName){
  List<Element> eleList = new ArrayList<Element>();
  buildList(doc.getRootElement(),tagName,eleList);
  return eleList;
 }
 
 private static void buildList(Element rootEle,String tagName,List<Element> eleList){
  if(rootEle.getName().equals(tagName)){
   eleList.add(rootEle);
  }
  List list = rootEle.getChildren();
  for(Iterator iter = list.iterator();iter.hasNext();){
   Element ele = (Element)iter.next();
   buildList(ele,tagName,eleList);
  }
 }
 
 public static void printElement(Element ele) throws IOException{
  XMLOutputter outputer = new XMLOutputter();
  Format format = outputer.getFormat();
  format.setEncoding("GB2312");
  outputer.setFormat(format);
  outputer.output(ele, System.out);
 }
 
 
 public static void main(String[] args) throws Exception{
  HtmlCleaner htmlCleaner = new HtmlCleaner();
  
  CleanerProperties props = htmlCleaner.getProperties();
  
  
//  TagNode node = htmlCleaner.clean(new URL("http://www.baidu.com"));
  TagNode node = htmlCleaner.clean(new URL("http://www.huanqiu.com"),"UTF-8");
  
//  XmlSerializer xmlSerializer = new PrettyXmlSerializer(props);
//  StringWriter writer = new StringWriter();
//  xmlSerializer.writeXml(node, writer, "GB2312");
//  System.out.println(writer.toString());
  
  JDomSerializer jdomSerializer = new JDomSerializer(props,true);
  Document doc = jdomSerializer.createJDom(node);
  
  Element rootEle = doc.getRootElement();
  
  System.out.println(XPathHelper.getPathString(rootEle));
  final String tagName = "div";
  List<Element> list = getElementsByTagName(doc,tagName);
  System.out.println(list.size());
  Iterator iter = list.iterator();
  while (iter.hasNext()) {
   Element ele = (Element) iter.next();
   System.out.println();
   System.out.println("*****************************************");
   System.out.println(XPathHelper.getPathString(ele));
   System.out.println("*****************************************");
   printElement(ele);
  }
  

  
 }
}
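
If JDOM or dom4j is not available, the step of processing the converted XML can be approximated with the JDK alone. The sketch below swaps in the standard `javax.xml.parsers` and `javax.xml.xpath` APIs in place of JDOM. It assumes the cleaned document is already available as well-formed XML text. The class name `CleanedXmlQuerySketch`, the helper `select`, and the sample XML are hypothetical.

```java
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

public class CleanedXmlQuerySketch {

    // Parses well-formed XML text and returns the text content
    // of every node matched by the XPath expression.
    public static List<String> select(String xml, String xpath) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        NodeList nodes = (NodeList) XPathFactory.newInstance()
                .newXPath()
                .evaluate(xpath, doc, XPathConstants.NODESET);
        List<String> result = new ArrayList<String>();
        for (int i = 0; i < nodes.getLength(); i++) {
            result.add(nodes.item(i).getTextContent());
        }
        return result;
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical sample standing in for XML produced from cleaned HTML.
        String xml = "<html><body><div class=\"news\"><p>first</p><p>second</p></div>"
                + "<div class=\"ad\"><p>skip</p></div></body></html>";
        System.out.println(select(xml, "//div[@class='news']/p")); // prints [first, second]
    }
}
```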









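import java.io.File;
import java.io.IOException;

import org.htmlcleaner.CleanerProperties;
import org.htmlcleaner.HtmlCleaner;
import org.htmlcleaner.PrettyXmlSerializer;
import org.htmlcleaner.TagNode;

// Cleans a local HTML file and writes the result as pretty-printed XML.
public class HtmlClean {  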
  
    public void cleanHtml(String htmlurl, String xmlurl) {  
        try {  
            long start = System.currentTimeMillis();  
  
            HtmlCleaner cleaner = new HtmlCleaner();  
            CleanerProperties props = cleaner.getProperties();  
            props.setUseCdataForScriptAndStyle(true);  
            props.setRecognizeUnicodeChars(true);  
            props.setUseEmptyElementTags(true);  
            props.setAdvancedXmlEscape(true);  
            props.setTranslateSpecialEntities(true);  
            props.setBooleanAttributeValues("empty");  
  
            TagNode node = cleaner.clean(new File(htmlurl));  
  
            System.out.println("parse time (ms): " + (System.currentTimeMillis() - start));  
  
            new PrettyXmlSerializer(props).writeXmlToFile(node, xmlurl);  
  
            System.out.println("total time (ms): " + (System.currentTimeMillis() - start));  
        } catch (IOException e) {  
            e.printStackTrace();  
        }  
    }  
}


2010-04-02 16:23

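#6 SE_XiaoFeng 2013-04-15

Reading this left me rather dizzy.

#5 tianhewulei 2010-04-15

NekoHTML is actually pretty good too...

#4 eimhee 2010-04-15

xiaoyiz wrote:

I don't think this is necessary... htmlcleaner itself supports lookups with XPath. Parsing the result again with JDOM, dom4j, SAX and the like seems redundant...

For example, to find a <div class='weatherYubaoBox'><p>many other nodes</p></div>:
Object[] weatherBoxInfoObjects = node.evaluateXPath("//div[@class='weatherYubaoBox']//table");

htmlcleaner's XPath support is incomplete.

#3 xiaoyiz 2010-04-14

#2 whaosoft 2010-04-12

Is this really easy to use?

#1 sdh5724 2010-04-04

How is the performance? How many MB of HTML can it process per second?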
