Lucene: Building a Document-Indexing Framework

deepfuture

Blog category: Search engines

Tags: framework, lucene, EXT, work

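1. Java classes that make up the file-indexing framework

1) DocumentHandler: defines the getDocument(InputStream) method, which every document parser implements.

2) DocumentHandlerException: the checked exception that document parsers throw when they hit an error.

3) FileHandler: defines the getDocument(File) method, which ExtensionFileHandler implements.

4) FileHandlerException: the checked exception thrown by concrete classes that implement the FileHandler interface.

5) ExtensionFileHandler: implements FileHandler and sits in front of the individual DocumentHandler implementations. It looks at the extension of the file passed to getDocument(File) and invokes the matching parser, so each document type is handled by its own parser.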
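The parser-side contract from the list above can be sketched without Lucene on the classpath. This is only an illustration under stated assumptions: SimpleDoc is a hypothetical stand-in for Lucene's Document, and PlainTextHandler and parseBody are made-up names that do not appear in the original framework.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

class DocumentHandlerSketch {

    /** Hypothetical stand-in for Lucene's Document: a bag of named fields. */
    static class SimpleDoc {
        final Map<String, String> fields = new LinkedHashMap<>();
    }

    /** Checked exception a parser throws when it cannot parse its input. */
    static class DocumentHandlerException extends Exception {
        DocumentHandlerException(String message, Throwable cause) {
            super(message, cause);
        }
    }

    /** Contract implemented by every document parser. */
    interface DocumentHandler {
        SimpleDoc getDocument(InputStream is) throws DocumentHandlerException;
    }

    /** Illustrative parser: stores the whole stream as a single "body" field. */
    static class PlainTextHandler implements DocumentHandler {
        @Override
        public SimpleDoc getDocument(InputStream is) throws DocumentHandlerException {
            try {
                SimpleDoc doc = new SimpleDoc();
                doc.fields.put("body", new String(is.readAllBytes(), StandardCharsets.UTF_8));
                return doc;
            } catch (IOException e) {
                throw new DocumentHandlerException("Cannot read text stream", e);
            }
        }
    }

    /** Parses an in-memory string and returns the resulting "body" field. */
    static String parseBody(String text) {
        InputStream is = new ByteArrayInputStream(text.getBytes(StandardCharsets.UTF_8));
        try {
            return new PlainTextHandler().getDocument(is).fields.get("body");
        } catch (DocumentHandlerException e) {
            // Not expected for an in-memory stream; surfaced unchecked for the demo.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(parseBody("hello lucene"));
    }
}
```

A real parser would build Lucene fields instead of filling a map, but the shape of the contract (stream in, document out, one checked exception) stays the same.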

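2. The FileHandler interface

import java.io.File;
import org.apache.lucene.document.Document;

public interface FileHandler {

    // Turns a file into a Lucene Document, or reports why it could not.
    Document getDocument(File file) throws FileHandlerException;
}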
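FileHandlerException is named in the class list but its source is not shown. A minimal sketch, assuming it is an ordinary checked exception that carries a message and the underlying cause (which is how the framework's catch blocks construct it); the describe helper is hypothetical and only there to show the cause being preserved.

```java
import java.io.FileNotFoundException;

class FileHandlerExceptionSketch {

    /** Checked exception; keeps the low-level cause for diagnostics. */
    static class FileHandlerException extends Exception {
        FileHandlerException(String message, Throwable cause) {
            super(message, cause);
        }
    }

    /** Formats an exception together with the type of its direct cause, if any. */
    static String describe(Exception e) {
        Throwable cause = e.getCause();
        return cause == null
            ? e.getMessage()
            : e.getMessage() + " [" + cause.getClass().getSimpleName() + "]";
    }

    public static void main(String[] args) {
        Exception e = new FileHandlerException(
            "File not found: /tmp/missing.txt",
            new FileNotFoundException("/tmp/missing.txt"));
        System.out.println(describe(e));
    }
}
```

Wrapping the cause rather than discarding it lets a caller log one framework-level exception type while still seeing what actually went wrong.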

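3. The ExtensionFileHandler class

ExtensionFileHandler implements the FileHandler interface.

import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.util.Properties;
import org.apache.lucene.document.Document;

public class ExtensionFileHandler implements FileHandler {

    private Properties handlerProps;

    public ExtensionFileHandler(Properties props) throws IOException {
        handlerProps = props; // maps file extensions to parser class names
    }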

    public Document getDocument(File file) throws FileHandlerException {
        String name = file.getName();
        int dotIndex = name.lastIndexOf('.'); // last dot, so "a.tar.gz" yields "gz"
        if ((dotIndex > 0) && (dotIndex < name.length() - 1)) {
            String ext = name.substring(dotIndex + 1); // extract the file extension
            String handlerClassName = handlerProps.getProperty(ext); // look up the parser class name
            if (handlerClassName == null) {
                return null; // no parser is registered for this extension
            }
            try {
                Class handlerClass = Class.forName(handlerClassName);
                DocumentHandler handler = (DocumentHandler) handlerClass.newInstance();
                return handler.getDocument(new FileInputStream(file)); // pass the file to the parser
            }
            catch (ClassNotFoundException e) {
                throw new FileHandlerException("Cannot create instance of: " + handlerClassName, e);
            }
            catch (InstantiationException e) {
                throw new FileHandlerException("Cannot create instance of: " + handlerClassName, e);
            }
            catch (IllegalAccessException e) {
                throw new FileHandlerException("Cannot create instance of: " + handlerClassName, e);
            }
            catch (FileNotFoundException e) {
                throw new FileHandlerException("File not found: " + file.getAbsolutePath(), e);
            }
            catch (DocumentHandlerException e) {
                throw new FileHandlerException("Document cannot be handled: " + file.getAbsolutePath(), e);
            }
        }
        return null; // no usable extension
    }

    public static void main(String[] args) throws Exception {
        if (args.length < 2) {
            usage();
            System.exit(0);
        }
        Properties props = new Properties();
        props.load(new FileInputStream(args[0])); // load the properties file
        ExtensionFileHandler fileHandler = new ExtensionFileHandler(props);
        Document doc = fileHandler.getDocument(new File(args[1]));
    }

    private static void usage() {
        System.err.println("USAGE: java " + ExtensionFileHandler.class.getName()
            + " /path/to/properties /path/to/document");
    }
}
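The lookup step at the heart of ExtensionFileHandler (file name, then extension, then parser class name) can be tried on its own. The sketch below is illustrative only: the com.example class names in the sample mapping are hypothetical, and taking the text after the last dot is an assumption about how names such as archive.tar.gz should be treated.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Properties;

class ExtensionLookupSketch {

    /** Returns the text after the last dot, or null when there is no usable extension. */
    static String extensionOf(String fileName) {
        int dot = fileName.lastIndexOf('.');
        if (dot <= 0 || dot == fileName.length() - 1) {
            return null; // no dot, leading dot only (".bashrc"), or trailing dot ("file.")
        }
        return fileName.substring(dot + 1);
    }

    /** Looks up the parser class name for a file name; null when nothing is mapped. */
    static String handlerClassFor(Properties props, String fileName) {
        String ext = extensionOf(fileName);
        return ext == null ? null : props.getProperty(ext);
    }

    /** Sample mapping in properties-file syntax; the class names are hypothetical. */
    static Properties sampleMapping() {
        Properties props = new Properties();
        try {
            props.load(new StringReader(
                "txt = com.example.PlainTextHandler\n"
                + "html = com.example.HtmlHandler\n"));
        } catch (IOException e) {
            throw new IllegalStateException(e); // not expected for an in-memory reader
        }
        return props;
    }

    public static void main(String[] args) {
        Properties props = sampleMapping();
        for (String name : new String[] {"notes.txt", "index.html", "photo.jpg", "README"}) {
            System.out.println(name + " -> " + handlerClassFor(props, name));
        }
    }
}
```

Keeping the mapping in a properties file means a new document type can be supported by adding one line and one parser class, without touching the dispatch code.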

4. FileIndexer ties the components together: it recursively walks a file-system directory tree and indexes every file it finds.

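import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.util.Date;
import java.util.Properties;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.SimpleAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class FileIndexer {

    protected FileHandler fileHandler;

    public FileIndexer(Properties props) throws IOException {
        fileHandler = new ExtensionFileHandler(props); // use ExtensionFileHandler as the FileHandler implementation
    }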

    // Indexes a single file, or recurses into a directory.
    public void index(IndexWriter writer, File file) throws FileHandlerException {
        if (file.canRead()) {
            if (file.isDirectory()) { // recurse into readable directories
                String[] files = file.list();
                if (files != null) {
                    for (int i = 0; i < files.length; i++) {
                        index(writer, new File(file, files[i]));
                    }
                }
            }
            else {
                System.out.println("Indexing " + file);
                try {
                    Document doc = fileHandler.getDocument(file); // hand the file to ExtensionFileHandler
                    if (doc != null) {
                        writer.addDocument(doc); // add the returned Lucene document to the index
                    }
                    else {
                        System.err.println("Cannot handle " + file.getAbsolutePath() + "; skipping");
                    }
                }
                catch (IOException e) {
                    System.err.println("Cannot index " + file.getAbsolutePath()
                        + "; skipping (" + e.getMessage() + ")");
                }
            }
        }
    }

    public static void main(String[] args) throws Exception {
        if (args.length < 3) {
            usage();
            System.exit(0);
        }
        Properties props = new Properties();
        props.load(new FileInputStream(args[0])); // read the properties file named on the command line
        // FSDirectory.getDirectory and this IndexWriter constructor belong to the pre-3.0 Lucene API.
        Directory dir = FSDirectory.getDirectory(args[2], true);
        Analyzer analyzer = new SimpleAnalyzer();
        IndexWriter writer = new IndexWriter(dir, analyzer, true); // open the index, creating it anew
        FileIndexer indexer = new FileIndexer(props); // create the FileIndexer instance
        long start = new Date().getTime();
        indexer.index(writer, new File(args[1])); // first call to index; recursion starts here
        writer.optimize(); // optimize the index
        writer.close(); // close the index writer
        long end = new Date().getTime();
        System.out.println();
        IndexReader reader = IndexReader.open(dir);
        // summary for the user
        System.out.println("Documents indexed: " + reader.numDocs());
        System.out.println("Total time: " + (end - start) + " ms");
        reader.close();
    }

    private static void usage() {
        System.err.println("USAGE: java " + FileIndexer.class.getName()
            + " /path/to/properties /path/to/file/or/directory /path/to/index");
    }
}
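The recursive walk that drives the index method can be exercised without Lucene. The following is a minimal sketch, not the framework's code: collect gathers files into a list where FileIndexer would hand each one to the FileHandler, and demoCount builds a throwaway directory tree purely for demonstration. Both names are hypothetical.

```java
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;
import java.util.ArrayList;
import java.util.List;

class RecursiveWalkSketch {

    /** Adds every readable non-directory file at or below 'file' to 'out'. */
    static void collect(File file, List<File> out) {
        if (!file.canRead()) {
            return; // skip unreadable entries, as the indexer does
        }
        if (file.isDirectory()) {
            String[] names = file.list(); // null if the directory cannot be listed
            if (names != null) {
                for (String name : names) {
                    collect(new File(file, name), out);
                }
            }
        } else {
            out.add(file); // an indexer would parse and index the file here
        }
    }

    /** Builds root/a.txt and root/sub/b.txt in a temp directory and counts the files found. */
    static int demoCount() {
        try {
            File root = Files.createTempDirectory("walk-sketch").toFile();
            File sub = new File(root, "sub");
            File a = new File(root, "a.txt");
            File b = new File(sub, "b.txt");
            sub.mkdir();
            a.createNewFile();
            b.createNewFile();
            // Registered parent-first; deleteOnExit removes entries in reverse order.
            root.deleteOnExit();
            sub.deleteOnExit();
            a.deleteOnExit();
            b.deleteOnExit();
            List<File> found = new ArrayList<>();
            collect(root, found);
            return found.size();
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("Files found: " + demoCount());
    }
}
```

Checking the result of list() for null matters in practice: it returns null rather than throwing when a directory cannot be listed.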

2009-12-24 13:24