[A First Look at Lucene 3.0] Index Creation (1): The IndexWriter

Heart.X.Raid


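The "Lucene Index Creation" series works from the source code to walk through index creation, one of Lucene's two main functions, and explains the information retrieval techniques it relies on. Because Lucene's indexing code is multithreaded, we skip the details of its threading wherever possible so that the indexing process itself stays clear.

1.1 Preparation

We build an index over the file name, path, and content of three English documents, 1.txt, 2.txt, and 3.txt, in the content directory. The code is as follows: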

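import java.io.File;
import java.io.FileReader;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

// Location where the index files are stored
File indexDir = new File("./index");
// Location of the document collection to be indexed
File docDir = new File("./content");
// Create the index writer (the core object); true means create a new index or overwrite an existing one
IndexWriter standardWriter = new IndexWriter(FSDirectory.open(indexDir), new StandardAnalyzer(Version.LUCENE_CURRENT), true, IndexWriter.MaxFieldLength.LIMITED);
// Do not use the compound file format (compound files are the default)
standardWriter.setUseCompoundFile(false);
// Index the relevant information of every document in the collection
for (File file : docDir.listFiles()) {
    // Lucene's document structure
    Document doc = new Document();
    // File name: stored, searchable, not tokenized
    String fileName = file.getName().substring(0, file.getName().indexOf("."));
    doc.add(new Field("name", fileName, Field.Store.YES, Field.Index.NOT_ANALYZED));
    // File path: stored, searchable, not tokenized
    String filePath = file.getPath();
    doc.add(new Field("path", filePath, Field.Store.YES, Field.Index.NOT_ANALYZED));
    // File content: tokenized and searchable, read through a Reader
    doc.add(new Field("content", new FileReader(file)));
    // Index the Document with the index writer
    standardWriter.addDocument(doc);
}
// Optimize the index, then close the writer, which writes the index files to disk
standardWriter.optimize();
standardWriter.close();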

Here the IndexWriter creates the index. The Document and Field classes (see "Document/Field: Organizing the Data Source") represent the three documents as follows:

Document | Field1 (name) | Field2 (path)      | Field3 (content)
doc1     | 1             | e:\\content\\1.txt | The lucene is a good IR. I hope I can lean well.
doc2     | 2             | e:\\content\\2.txt | You know it's difficult to learn Lucene well.
doc3     | 3             | e:\\content\\3.txt | Maybe is very hard. I must do it well.
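The name column above is produced by cutting the file name at its first dot. With substring(0, indexOf(".")), a file name that contains no dot makes indexOf return -1 and substring throw a StringIndexOutOfBoundsException. As a minimal sketch, assuming a hypothetical helper named baseName (it is not part of Lucene or of the original code), the same step can be written so that it tolerates such names. Note that it cuts at the last dot rather than the first, which is a deliberate difference from the original.

```java
final class FileNames {
    private FileNames() {}

    // Hypothetical helper: returns the file name without its last extension.
    static String baseName(String fileName) {
        int dot = fileName.lastIndexOf('.');
        // No dot at all, or only a leading dot: keep the whole name.
        return dot <= 0 ? fileName : fileName.substring(0, dot);
    }

    public static void main(String[] args) {
        System.out.println(baseName("1.txt"));          // 1
        System.out.println(baseName("notes"));          // notes
        System.out.println(baseName("archive.tar.gz")); // archive.tar
    }
}
```

For the three sample files the result is the same as in the table: 1, 2, and 3.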

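1.2 The IndexWriter

An IndexWriter object creates and maintains an index and produces segments. It uses the DocumentsWriter class to build the index data for the documents it is given, and the SegmentMerger class to merge multiple segments.

1.2.1 Constructor

The org.apache.lucene.index package contains Lucene's index-creation code, and its IndexWriter class is Lucene's index writer. Below is one of IndexWriter's constructors.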
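To make the idea of segments and merging concrete, here is a toy sketch in plain Java. It does not use Lucene and does not reflect Lucene's on-disk format or the actual SegmentMerger algorithm. The class ToySegments, its methods, and the simple tokenization are assumptions made for illustration only. Each "segment" is a small inverted index (term to document ids), and merging concatenates the posting lists of segments whose document ids are in increasing order.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class ToySegments {
    private ToySegments() {}

    // Builds one toy segment: a sorted map from term to the ids of the documents containing it.
    static Map<String, List<Integer>> buildSegment(int firstDocId, List<String> docs) {
        Map<String, List<Integer>> postings = new TreeMap<>();
        for (int i = 0; i < docs.size(); i++) {
            int docId = firstDocId + i;
            // Simplistic tokenization: lowercase, split on anything but letters, digits, apostrophes.
            for (String token : docs.get(i).toLowerCase().split("[^a-z0-9']+")) {
                if (token.isEmpty()) continue;
                List<Integer> ids = postings.computeIfAbsent(token, k -> new ArrayList<>());
                // Record each document at most once per term.
                if (ids.isEmpty() || ids.get(ids.size() - 1) != docId) ids.add(docId);
            }
        }
        return postings;
    }

    // Merges segments that are passed in increasing document id order.
    static Map<String, List<Integer>> merge(List<Map<String, List<Integer>>> segments) {
        Map<String, List<Integer>> merged = new TreeMap<>();
        for (Map<String, List<Integer>> segment : segments) {
            for (Map.Entry<String, List<Integer>> e : segment.entrySet()) {
                merged.computeIfAbsent(e.getKey(), k -> new ArrayList<>()).addAll(e.getValue());
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        // The three sample documents from section 1.1, split across two segments.
        Map<String, List<Integer>> s0 = buildSegment(0, Arrays.asList(
                "The lucene is a good IR. I hope I can lean well.",
                "You know it's difficult to learn Lucene well."));
        Map<String, List<Integer>> s1 = buildSegment(2, Arrays.asList(
                "Maybe is very hard. I must do it well."));
        Map<String, List<Integer>> merged = merge(Arrays.asList(s0, s1));
        System.out.println(merged.get("well"));   // [0, 1, 2]
        System.out.println(merged.get("lucene")); // [0, 1]
    }
}
```

The point of the sketch is only the division of labour: documents are first turned into small independent indexes, and a separate step combines them into one.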

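/**
 * d      the Directory in which the index files are stored
 * a      the Analyzer, used mainly for tokenization
 * create true: create a new index or overwrite an existing one;
 *        false: append to an existing index
 * mfl    the maximum number of terms indexed per Field
 */
public IndexWriter(Directory d, Analyzer a, boolean create, MaxFieldLength mfl)
        throws CorruptIndexException, LockObtainFailedException, IOException {
    init(d, a, create, null, mfl.getLimit(), null, null);
}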

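Directory is discussed later, and Analyzer is covered in "Lucene Analyzers: Analyzer". The create flag determines whether a new index is created or an existing one is opened. The default limit, MaxFieldLength.LIMITED, is 10,000 terms. As explained in "Document/Field: Organizing the Data Source", a Field can represent any attribute of the data source; if a Field's value is very large, indexing it can exhaust memory. It is therefore necessary to cap the number of terms indexed for each Field.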
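The effect of a maximum field length can be sketched without Lucene. The following is a toy illustration, assuming a hypothetical method indexedTerms and whitespace tokenization in place of a real Analyzer: only the first maxFieldLength terms of a value are kept, and the rest are dropped without an error.

```java
import java.util.ArrayList;
import java.util.List;

final class FieldLengthLimit {
    private FieldLengthLimit() {}

    // Hypothetical stand-in for analysis: split on whitespace, keep at most maxFieldLength terms.
    static List<String> indexedTerms(String value, int maxFieldLength) {
        List<String> terms = new ArrayList<>();
        for (String token : value.trim().split("\\s+")) {
            if (token.isEmpty()) continue;
            if (terms.size() >= maxFieldLength) break; // remaining terms are not indexed
            terms.add(token);
        }
        return terms;
    }

    public static void main(String[] args) {
        String content = "You know it's difficult to learn Lucene well.";
        System.out.println(indexedTerms(content, 3));             // [You, know, it's]
        System.out.println(indexedTerms(content, 10000).size());  // 8
    }
}
```

A consequence worth keeping in mind is that a term occurring only beyond the limit cannot be found by a search, so the limit trades recall on very long fields for bounded memory use.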

1.2.2 The indexing method: addDocument(Document)

This method adds a document to the index. Let us look at its source code:

public void addDocument(Document doc) throws CorruptIndexException, IOException {
    addDocument(doc, analyzer);
}

public void addDocument(Document doc, Analyzer analyzer) throws CorruptIndexException, IOException {
    ensureOpen();
    boolean doFlush = false;
    boolean success = false;
    try {
        try {
            // Build the index data with DocumentsWriter
            doFlush = docWriter.addDocument(doc, analyzer);
            success = true;
        } finally {
            if (!success) {
                if (infoStream != null)
                    message("hit exception adding document");
                synchronized (this) {
                    // Delete any files left behind by the aborted document
                    if (docWriter != null) {
                        final Collection<String> files = docWriter.abortedFiles();
                        if (files != null)
                            deleter.deleteNewFiles(files);
                    }
                }
            }
        }
        if (doFlush)
            flush(true, false, false);
    } catch (OutOfMemoryError oom) {
        handleOOM(oom, "addDocument");
    }
}

Here docWriter.addDocument(doc, analyzer) hands the Document to the DocumentsWriter class, which builds its index data. DocumentsWriter is covered in detail in "Index Creation (2): The DocumentWriter Processing Flow".
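The control flow of addDocument, a success flag checked in a finally block followed by an optional flush, can be tried out in isolation. The sketch below is plain Java and only imitates the shape of the method. ToyWriter, its fields, and the "partial file" bookkeeping are hypothetical stand-ins for IndexWriter, DocumentsWriter.abortedFiles(), and the file deleter.

```java
import java.util.ArrayList;
import java.util.List;

final class ToyWriter {
    private final int maxBufferedDocs;
    private final List<String> buffered = new ArrayList<>();
    private final List<String> partialFiles = new ArrayList<>();
    private int flushCount = 0;

    ToyWriter(int maxBufferedDocs) {
        this.maxBufferedDocs = maxBufferedDocs;
    }

    void addDocument(String doc) {
        boolean doFlush = false;
        boolean success = false;
        try {
            doFlush = bufferDocument(doc);
            success = true;
        } finally {
            if (!success) {
                // Cleanup branch: discard whatever the failed add left behind.
                synchronized (this) {
                    partialFiles.clear();
                }
            }
        }
        if (doFlush) flush();
    }

    // Returns true when the buffer is full and should be flushed.
    private boolean bufferDocument(String doc) {
        String tmp = "tmp-" + buffered.size();
        partialFiles.add(tmp); // pretend a temporary file was opened
        if (doc == null) throw new IllegalArgumentException("doc must not be null");
        buffered.add(doc);
        partialFiles.remove(tmp); // the document was fully written
        return buffered.size() >= maxBufferedDocs;
    }

    private void flush() {
        flushCount++;
        buffered.clear();
    }

    int flushCount() { return flushCount; }
    int bufferedCount() { return buffered.size(); }
    int partialFileCount() { return partialFiles.size(); }

    public static void main(String[] args) {
        ToyWriter writer = new ToyWriter(2);
        writer.addDocument("doc1");
        try {
            writer.addDocument(null); // fails, so the finally block cleans up
        } catch (IllegalArgumentException e) {
            System.out.println("partial files after failure: " + writer.partialFileCount());
        }
        writer.addDocument("doc2"); // buffer is now full, so a flush is triggered
        System.out.println("flushes: " + writer.flushCount());
    }
}
```

The flag is used instead of a catch block so that any failure, whatever its type, triggers the cleanup while the original exception still propagates to the caller unchanged.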


2010-04-07 19:11