Lucene: Indexing HTML Documents

deepfuture


1. Most web documents are written in HTML.

2. This example uses the following HTML document:

<html>

<head>

<title>

Laptop power supplies are available in First class only

</title>

</head>

<body>

<h1>code,write,fly</h1>

</body>

</html>

3. Using JTidy

JTidy is the Java port of Tidy, written by Andy Quick.

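import java.io.File;
import java.io.FileInputStream;
import java.io.InputStream;

import org.apache.lucene.document.Field;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.w3c.dom.Text;
import org.w3c.tidy.Tidy;

// DocumentHandler and DocumentHandlerException are not shown in this article;
// they are assumed to be defined elsewhere and available on the classpath.
public class JTidyHTMLHandler implements DocumentHandler {

    // Takes an InputStream representing the HTML document
    public org.apache.lucene.document.Document getDocument(InputStream is)
            throws DocumentHandlerException {
        Tidy tidy = new Tidy();
        tidy.setQuiet(true);
        tidy.setShowWarnings(false);

        // Parse the InputStream representing the HTML document
        org.w3c.dom.Document root = tidy.parseDOM(is, null);
        Element rawDoc = root.getDocumentElement();

        org.apache.lucene.document.Document doc =
                new org.apache.lucene.document.Document();

        String title = getTitle(rawDoc); // get the title
        String body = getBody(rawDoc);   // get all text between <body> and </body>

        if ((title != null) && (!title.equals(""))) {
            doc.add(Field.Text("title", title));
        }
        if ((body != null) && (!body.equals(""))) {
            doc.add(Field.Text("body", body));
        }
        return doc;
    }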

    protected String getTitle(Element rawDoc) {
        if (rawDoc == null) {
            return null;
        }
        String title = "";
        NodeList children = rawDoc.getElementsByTagName("title");
        if (children.getLength() > 0) { // get the text of the first <title> tag
            Element titleElement = (Element) children.item(0);
            Text text = (Text) titleElement.getFirstChild();
            if (text != null) {
                title = text.getData();
            }
        }
        return title;
    }

    protected String getBody(Element rawDoc) {
        if (rawDoc == null) {
            return null;
        }
        String body = "";
        // Get a reference to the <body> tag
        NodeList children = rawDoc.getElementsByTagName("body");
        if (children.getLength() > 0) {
            // Extract all text between <body> and </body>
            body = getText(children.item(0));
        }
        return body;
    }

    protected String getText(Node node) {
        NodeList children = node.getChildNodes();
        StringBuffer sb = new StringBuffer();
        // Extract the text from every element beneath the given Node
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            switch (child.getNodeType()) {
                case Node.ELEMENT_NODE:
                    sb.append(getText(child));
                    sb.append(" ");
                    break;
                case Node.TEXT_NODE:
                    sb.append(((Text) child).getData());
                    break;
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws Exception {
        JTidyHTMLHandler handler = new JTidyHTMLHandler();
        org.apache.lucene.document.Document doc =
                handler.getDocument(new FileInputStream(new File(args[0])));
        System.out.println(doc);
    }
}
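The recursive text extraction used by the JTidy handler can be tried without JTidy or Lucene. The following is only a minimal sketch under stated assumptions: it uses the JDK's built-in XML parser in place of JTidy (so the input must already be well-formed), and the class name `DomTextSketch`, the `SAMPLE` constant, and the helpers `textOf` and `parse` are hypothetical names introduced for illustration.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

class DomTextSketch {
    // Compact, well-formed HTML in the shape of the sample from section 2
    // (hypothetical constant, used only for this sketch).
    static final String SAMPLE =
        "<html><head><title>Laptop power supplies are available in First class only</title></head>"
        + "<body><h1>code,write,fly</h1></body></html>";

    // Recursively collects the text beneath a node, adding a space after each element.
    static String getText(Node node) {
        StringBuilder sb = new StringBuilder();
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            if (child.getNodeType() == Node.ELEMENT_NODE) {
                sb.append(getText(child)).append(' ');
            } else if (child.getNodeType() == Node.TEXT_NODE) {
                sb.append(child.getNodeValue());
            }
        }
        return sb.toString();
    }

    // Returns the trimmed text of the first element with the given tag name, or "" if absent.
    static String textOf(Element root, String tag) {
        NodeList found = root.getElementsByTagName(tag);
        return found.getLength() > 0 ? getText(found.item(0)).trim() : "";
    }

    // Parses well-formed markup with the JDK XML parser (an assumption: JTidy is not used here).
    static Element parse(String xhtml) throws Exception {
        Document dom = DocumentBuilderFactory.newInstance().newDocumentBuilder()
            .parse(new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)));
        return dom.getDocumentElement();
    }

    public static void main(String[] args) throws Exception {
        Element root = parse(SAMPLE);
        System.out.println("title=" + textOf(root, "title"));
        System.out.println("body=" + textOf(root, "body"));
    }
}
```

In a real handler the two strings would become the "title" and "body" fields of the Lucene document; the sketch stops at extraction so that it runs on a plain JDK.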

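4. Using NekoHTML

NekoHTML is a simple HTML scanner and tag balancer that lets programmers parse HTML documents and access them through standard XML interfaces. The parser scans HTML files and repairs many of the common mistakes that authors make when writing HTML.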
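To see why a tag balancer matters, it helps to feed real-world HTML to a strict XML parser. The sketch below is an illustration only: it uses the JDK's XML parser rather than NekoHTML, and the class name `StrictParseSketch` and method `parsesAsXml` are hypothetical. It shows that markup with an unclosed `<br>` and unclosed `<p>` tags is rejected outright, which is the kind of input a tag balancer is meant to repair before the DOM is built.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class StrictParseSketch {
    // Returns true if the markup parses as well-formed XML, false if the parser rejects it.
    static boolean parsesAsXml(String markup) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        // DefaultHandler rethrows fatal errors without printing them to stderr.
        builder.setErrorHandler(new DefaultHandler());
        try {
            builder.parse(new InputSource(new StringReader(markup)));
            return true;
        } catch (SAXException e) {
            return false;
        }
    }

    public static void main(String[] args) throws Exception {
        // Balanced markup: accepted by a strict parser.
        System.out.println(parsesAsXml("<html><body><p>one</p></body></html>"));
        // Typical hand-written HTML with unclosed tags: rejected.
        System.out.println(parsesAsXml("<html><body><p>one<br><p>two</body></html>"));
    }
}
```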

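import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

import org.apache.html.dom.HTMLDocumentImpl;
import org.apache.lucene.document.Field;
import org.cyberneko.html.parsers.DOMFragmentParser;
import org.w3c.dom.DocumentFragment;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// DocumentHandler and DocumentHandlerException are not shown in this article;
// they are assumed to be defined elsewhere and available on the classpath.
public class NekoHTMLHandler implements DocumentHandler {

    // NekoHTML's DOM parser for HTML
    private DOMFragmentParser parser = new DOMFragmentParser();

    public org.apache.lucene.document.Document getDocument(InputStream is)
            throws DocumentHandlerException {
        DocumentFragment node = new HTMLDocumentImpl().createDocumentFragment();
        try {
            parser.parse(new InputSource(is), node);
        } catch (IOException e) {
            throw new DocumentHandlerException("Cannot parse HTML document", e);
        } catch (SAXException e) {
            throw new DocumentHandlerException("Cannot parse HTML document", e);
        }

        org.apache.lucene.document.Document doc =
                new org.apache.lucene.document.Document();

        // Extract and store the text of the title element
        StringBuffer sb = new StringBuffer();
        getText(sb, node, "title");
        String title = sb.toString();

        // Clear the StringBuffer
        sb.setLength(0);

        // Extract all text from the DOM Node
        getText(sb, node);
        String text = sb.toString();

        if ((title != null) && (!title.equals(""))) {
            doc.add(Field.Text("title", title));
        }
        if ((text != null) && (!text.equals(""))) {
            doc.add(Field.Text("body", text));
        }
        return doc;
    }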

    // Extract all text beneath the given DOM Node
    private void getText(StringBuffer sb, Node node) {
        if (node.getNodeType() == Node.TEXT_NODE) {
            sb.append(node.getNodeValue());
        }
        NodeList children = node.getChildNodes();
        if (children != null) {
            int len = children.getLength();
            for (int i = 0; i < len; i++) {
                getText(sb, children.item(i));
            }
        }
    }

    // Extract all text of the first element with the given name beneath the Node;
    // returns true as soon as a matching element has been found
    private boolean getText(StringBuffer sb, Node node, String element) {
        if (node.getNodeType() == Node.ELEMENT_NODE) {
            if (element.equalsIgnoreCase(node.getNodeName())) {
                getText(sb, node);
                return true;
            }
        }
        NodeList children = node.getChildNodes();
        if (children != null) {
            int len = children.getLength();
            for (int i = 0; i < len; i++) {
                if (getText(sb, children.item(i), element)) {
                    return true;
                }
            }
        }
        return false;
    }

    public static void main(String[] args) throws Exception {
        NekoHTMLHandler handler = new NekoHTMLHandler();
        org.apache.lucene.document.Document doc =
                handler.getDocument(new FileInputStream(new File(args[0])));
        System.out.println(doc);
    }
}
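The NekoHTML handler finds the title with a depth-first search that stops at the first element whose name matches, ignoring case. The sketch below isolates that search so it can be run on a plain JDK. It is an illustration under stated assumptions: the JDK XML parser stands in for NekoHTML, and the names `FirstElementTextSketch`, `collectText`, and `firstText` are hypothetical. The upper-case tags in the usage mimic a parser that reports element names in upper case, which is why the comparison ignores case.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class FirstElementTextSketch {
    // Appends every text node beneath (and including) the given node.
    static void collectText(StringBuilder sb, Node node) {
        if (node.getNodeType() == Node.TEXT_NODE) {
            sb.append(node.getNodeValue());
        }
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            collectText(sb, children.item(i));
        }
    }

    // Depth-first search: collects the text of the first matching element and stops.
    static boolean collectText(StringBuilder sb, Node node, String element) {
        if (node.getNodeType() == Node.ELEMENT_NODE
                && element.equalsIgnoreCase(node.getNodeName())) {
            collectText(sb, node);
            return true;
        }
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            if (collectText(sb, children.item(i), element)) {
                return true; // short-circuit: later matches are never visited
            }
        }
        return false;
    }

    // Parses well-formed markup and returns the text of the first matching element, or "".
    static String firstText(String markup, String element) throws Exception {
        Document dom = DocumentBuilderFactory.newInstance().newDocumentBuilder()
            .parse(new InputSource(new StringReader(markup)));
        StringBuilder sb = new StringBuilder();
        collectText(sb, dom, element);
        return sb.toString();
    }

    public static void main(String[] args) throws Exception {
        String markup = "<HTML><HEAD><TITLE>first</TITLE></HEAD>"
            + "<BODY><TITLE>second</TITLE><P>x</P></BODY></HTML>";
        System.out.println(firstText(markup, "title")); // prints "first"
    }
}
```

Returning a boolean lets each level of the recursion abandon its remaining siblings once a match is found, so a stray second `<title>` further down the tree never leaks into the indexed title.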


2009-12-24 15:14