HTML parser selection test

chen4w


Blog categories: HTML, OpenSource, CMS, project management

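Content management systems (CMS) often need to merge the summaries of a site's channels into the cover page of the parent channel. Introducing an HTML parser lets you manipulate HTML content in a structured way, which makes extracting and restructuring page content easier.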
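As a rough illustration of the summary-merging use case, the sketch below copies a summary element from a channel page into a container on the parent cover page using only the JDK's DOM API. The class names `summary` and `summaries`, the helper names, and the sample markup are all hypothetical, and the JDK's XML parser is used only because the sample input is well-formed; real pages would go through one of the HTML parsers compared in this post.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class SummaryMerge {
    // Parses well-formed markup with the JDK XML parser (assumption: tidy input).
    static Document parse(String markup) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(markup)));
        } catch (Exception e) {
            throw new IllegalStateException("could not parse markup", e);
        }
    }

    // Returns the first element whose class attribute equals cls, or null.
    static Element firstByClass(Document doc, String cls) {
        NodeList all = doc.getElementsByTagName("*");
        for (int i = 0; i < all.getLength(); i++) {
            Element e = (Element) all.item(i);
            if (cls.equals(e.getAttribute("class"))) {
                return e;
            }
        }
        return null;
    }

    // Copies the channel's summary element into the cover's container element.
    static void mergeSummary(Document cover, Document channel) {
        Element summary = firstByClass(channel, "summary");   // hypothetical class name
        Element target = firstByClass(cover, "summaries");    // hypothetical class name
        if (summary == null || target == null) {
            return;
        }
        Node copy = cover.importNode(summary, true); // deep copy owned by the cover document
        target.appendChild(copy);
    }

    public static void main(String[] args) {
        Document cover = parse("<html><body><div class=\"summaries\"></div></body></html>");
        Document channel = parse(
                "<html><body><p class=\"summary\">Channel news</p><p>Full text</p></body></html>");
        mergeSummary(cover, channel);
        System.out.println(firstByClass(cover, "summaries").getTextContent());
    }
}
```

`importNode` is needed because a DOM node belongs to the document that created it and cannot be appended directly to another document.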
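The following link lists related open-source Java projects:
http://www.open-open.com/30.htm
Based on user comments, htmlcleaner, htmlparser, and nekohtml were shortlisted.
The attached HTML file serves as the test case: the common getElementsByTagName (or the closest equivalent each library offers) is used to extract the body,
and getElementById is used to fetch the script whose id is 'content6'.
The test code is as follows: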

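Java code

    import java.io.File;
    import java.io.IOException;
    import org.cyberneko.html.parsers.DOMParser;
    import org.htmlcleaner.HtmlCleaner;
    import org.htmlparser.Parser;
    import org.htmlparser.filters.HasAttributeFilter;
    import org.htmlparser.filters.TagNameFilter;
    import org.htmlparser.util.NodeList;
    import org.htmlparser.util.ParserException;
    import org.w3c.dom.Document;
    import org.xml.sax.SAXException;

    // TPPath is the base path of the test pages; it is defined elsewhere in the test class.

    // nekohtml: parse into a W3C DOM, then use the standard DOM lookups.
    public static String Neko(String path) throws SAXException, IOException {
        DOMParser parser = new DOMParser();
        parser.parse(TPPath + path);
        Document doc = parser.getDocument();
        org.w3c.dom.NodeList nl = doc.getElementsByTagName("body");
        System.out.println(printNode(nl.item(0)));
        System.out.println("----------------------------------------");
        // getElementById returns null when no element matches, so guard before printing
        org.w3c.dom.Element n = doc.getElementById("content6");
        System.out.println(n == null ? "content6 not found" : printNode(n));
        return "";
    }

    // htmlparser: no DOM here; filters stand in for the two lookups.
    public static String htmlparser(String path) throws ParserException {
        // one of several constructors
        Parser p = new Parser(TPPath + path);
        NodeList nl = p.parse(new TagNameFilter("body"));
        System.out.println(nl.elementAt(0).toHtml());
        System.out.println("----------------------------------------");
        // a second parser instance is created for the second pass over the page
        Parser p2 = new Parser(TPPath + path);
        NodeList nl2 = p2.parse(new HasAttributeFilter("id", "content6"));
        System.out.println(nl2.size() > 0 ? nl2.elementAt(0).toHtml() : "content6 not found");
        return "";
    }

    // htmlcleaner: clean the file, then work on the resulting W3C DOM.
    public static String htmlcleaner(String path) throws Exception {
        // one of several constructors
        HtmlCleaner cleaner = new HtmlCleaner(new File(TPPath + path));
        org.w3c.dom.Document doc = cleaner.createDOM();
        org.w3c.dom.NodeList nl = doc.getElementsByTagName("body");
        System.out.println(printNode(nl.item(0)));
        System.out.println("----------------------------------------");
        org.w3c.dom.Element n = doc.getElementById("content6");
        System.out.println(n == null ? "content6 not found" : printNode(n));
        return "";
    }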
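One thing worth keeping in mind with the getElementById test: a DOM implementation only resolves an id if it knows which attribute is an ID, so the call can return null even though the element is present. The sketch below, which uses only the JDK's own XML parser and a hypothetical helper named `findById`, shows a tree walk that can serve as a fallback. It is an illustration under those assumptions, not part of the original test code, and the sample markup is invented.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

final class FindById {
    // Parses well-formed markup with the JDK XML parser (assumption: tidy input).
    static Document parse(String markup) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(markup)));
        } catch (Exception e) {
            throw new IllegalStateException("could not parse markup", e);
        }
    }

    // Depth-first search for an element whose "id" attribute equals id; null if absent.
    static Element findById(Node node, String id) {
        if (node.getNodeType() == Node.ELEMENT_NODE) {
            Element e = (Element) node;
            if (id.equals(e.getAttribute("id"))) {
                return e;
            }
        }
        for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
            Element hit = findById(c, id);
            if (hit != null) {
                return hit;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        Document doc = parse(
                "<html><body><script id=\"content6\">var a = 1;</script></body></html>");
        // May be null here: nothing declares "id" as an ID attribute for this document.
        Element byApi = doc.getElementById("content6");
        Element byWalk = findById(doc, "content6");
        System.out.println("getElementById: " + (byApi == null ? "null" : byApi.getTagName()));
        System.out.println("tree walk: " + (byWalk == null ? "null" : byWalk.getTagName()));
    }
}
```

Whether getElementById works out of the box depends on the parser that built the DOM, which is part of what the test in this post is checking.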

A helper method for printing DOM nodes:

Java code

    // Serializes a DOM node and its subtree back to markup (no escaping is applied).
    public static String printNode(Node node) {
        if (node == null) return "";
        short type = node.getNodeType();
        // Text and CDATA nodes carry their content in the node value.
        if (type == Node.TEXT_NODE || type == Node.CDATA_SECTION_NODE) {
            return node.getNodeValue();
        }
        StringBuilder sbuf = new StringBuilder();
        String nn = node.getNodeName();
        boolean btag = (type == Node.ELEMENT_NODE); // only elements get start and end tags
        if (btag) {
            sbuf.append("<").append(nn);
            NamedNodeMap attrs = node.getAttributes();
            for (int i = 0, len = attrs.getLength(); i < len; i++) {
                Node attr = attrs.item(i);
                sbuf.append(" ").append(attr.getNodeName())
                    .append("=\"").append(attr.getNodeValue()).append("\"");
            }
            sbuf.append(">");
        }
        // Recurse into every child; text children are emitted by the branch above.
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            sbuf.append(printNode(child));
        }
        if (btag) {
            sbuf.append("</").append(nn).append(">");
        }
        return sbuf.toString();
    }
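As an alternative to a hand-written printer, the JDK's identity `Transformer` can serialize a DOM node directly. The sketch below is only an illustration: the class name `NodeSerializer`, its helpers, and the sample markup are hypothetical, and it uses the default XML output method, so the result is XML-style markup rather than exactly what a browser would consider HTML.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

final class NodeSerializer {
    // Parses well-formed markup with the JDK XML parser (assumption: tidy input).
    static Document parse(String markup) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(markup)));
        } catch (Exception e) {
            throw new IllegalStateException("could not parse markup", e);
        }
    }

    // Serializes the node and its subtree using an identity transform.
    static String serialize(Node node) {
        try {
            Transformer t = TransformerFactory.newInstance().newTransformer();
            // Leave out the <?xml ...?> header so only the node's own markup is returned.
            t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            t.transform(new DOMSource(node), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("could not serialize node", e);
        }
    }

    public static void main(String[] args) {
        Document doc = parse("<p>Hello <b>world</b></p>");
        System.out.println(serialize(doc.getDocumentElement()));
    }
}
```

Unlike the helper above, the transformer escapes special characters in text and attribute values, which matters if the output is parsed again.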

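Test results:
                   getElementsByTagName        getElementById
htmlcleaner        threw java.lang.NoSuchFieldError: fRecognizedFeatures
htmlparser         logic error: when a string inside the script contained "</b>", the parser mistook that </b> for the end of the script
nekohtml           pass                        pass

(NoSuchFieldError is a linkage error, so the htmlcleaner failure may point to conflicting library versions on the classpath rather than to its parsing logic.)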

nekohtml was selected.
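The htmlparser failure can be pictured with a small string-scanning sketch. This is not htmlparser's actual implementation (the test only shows the symptom); the two methods below are hypothetical and simply contrast a rule that stops at the first "</" with a rule that stops only at "</script". The sample script is invented.

```java
import java.util.Locale;

final class ScriptEndDemo {
    // Index just past the '>' that closes the opening <script ...> tag.
    private static int bodyStart(String html) {
        return html.indexOf('>', html.indexOf("<script")) + 1;
    }

    // Naive rule: the first "</" after the opening tag ends the script body.
    static String naiveBody(String html) {
        int start = bodyStart(html);
        int end = html.indexOf("</", start);
        return html.substring(start, end);
    }

    // Browser-like rule: only "</script" (case-insensitive) ends the script body.
    static String scriptBody(String html) {
        int start = bodyStart(html);
        int end = html.toLowerCase(Locale.ROOT).indexOf("</script", start);
        return html.substring(start, end);
    }

    public static void main(String[] args) {
        String html = "<script id=\"content6\">var s = \"<b>bold</b>\";</script>";
        System.out.println("naive:        " + naiveBody(html));
        System.out.println("browser-like: " + scriptBody(html));
    }
}
```

With the naive rule the script body is cut off inside the string literal, which matches the kind of truncation reported for htmlparser above.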

Attachment: 藏品.rar (57.6 KB)


2007-07-10 13:23

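#1 chen4w 2007-08-14

Data pulled directly from database tables may not be in a usable form. Look at the news summaries on sina: the title lengths line up so neatly that it would be hard to achieve without manual editing.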
