Lucene Index Merging and Incremental Indexing

isiqi



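With Lucene, you can make full use of the machine's hardware resources during indexing to improve indexing efficiency. When you need to index a large number of files, you will notice that the bottleneck is writing the index files to disk. To address this, Lucene keeps a buffer in memory. But how do we control that buffer? Fortunately, Lucene's IndexWriter class provides three parameters for adjusting the size of the buffer and how often index files are written to disk.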

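1. Merge factor (mergeFactor)

This parameter determines how many documents can be stored in one Lucene index segment and how often the segments on disk are merged into a larger segment. For example, if the merge factor is 10, then once 10 documents have accumulated in memory they must all be written to a new index segment on disk. And once the number of segments on disk reaches 10, those 10 segments are merged into a single new segment. The default value is 10, which is a poor fit when a very large number of documents has to be indexed. For batch indexing, a larger value gives better indexing performance.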
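The rule described above can be played through without Lucene. What follows is only a toy model, assuming exactly the behaviour stated in the paragraph (flush every mergeFactor documents, merge whenever the newest mergeFactor segments have the same size); the class and method names are made up for illustration and this is not Lucene's actual merge code.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of the merge rule described in the text; not Lucene's implementation. */
public class MergeFactorSketch {

    /**
     * Returns the sizes (in documents) of the on-disk segments, oldest first,
     * after docCount documents have been added.
     */
    public static List<Integer> simulate(int docCount, int mergeFactor) {
        List<Integer> segments = new ArrayList<>();
        int buffered = 0; // documents still held in memory
        for (int i = 0; i < docCount; i++) {
            buffered++;
            if (buffered == mergeFactor) {
                segments.add(buffered); // flush the buffer as a new segment
                buffered = 0;
                mergeCascade(segments, mergeFactor);
            }
        }
        if (buffered > 0) {
            segments.add(buffered); // final flush when the writer is closed
        }
        return segments;
    }

    /** Merges the newest mergeFactor segments for as long as they all have equal size. */
    private static void mergeCascade(List<Integer> segments, int mergeFactor) {
        while (segments.size() >= mergeFactor) {
            int from = segments.size() - mergeFactor;
            int firstSize = segments.get(from);
            int total = 0;
            boolean sameSize = true;
            for (int i = from; i < segments.size(); i++) {
                int size = segments.get(i);
                sameSize &= (size == firstSize);
                total += size;
            }
            if (!sameSize) {
                return; // nothing to merge at this level yet
            }
            // Replace the mergeFactor equal-sized segments with one merged segment.
            segments.subList(from, segments.size()).clear();
            segments.add(total);
        }
    }

    public static void main(String[] args) {
        System.out.println("mergeFactor=10:  " + simulate(1234, 10));
        System.out.println("mergeFactor=100: " + simulate(1234, 100));
    }
}
```

In this model a larger merge factor means fewer flushes and fewer merge passes for the same number of documents, which is the intuition behind choosing a larger value for batch indexing.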

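2. Minimum merge documents (minMergeDocs)

This parameter also affects indexing performance. It determines how many documents must accumulate in memory before they are written to disk. The default value is 10; if you have enough memory, setting it as large as practical can noticeably improve indexing performance.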

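3. Maximum merge documents (maxMergeDocs)

This parameter sets the maximum number of documents in a single index segment. Its default value is Integer.MAX_VALUE. Setting it to a large value can improve indexing efficiency and search speed; since the default is already the largest int value, there is usually no need to change it.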

import java.io.File;
import java.io.FileReader;
import java.io.Reader;
import java.util.Date;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

/**
 * This class demonstrates how to improve the indexing performance
 * by adjusting the parameters provided by IndexWriter.
 */
public class AdvancedTextFileIndexer {
    public static void main(String[] args) throws Exception {
        // fileDir is the directory that contains the text files to be indexed
        File fileDir = new File("C:\\files_to_index");

        // indexDir is the directory that hosts Lucene's index files
        File indexDir = new File("C:\\luceneIndex");
        Analyzer luceneAnalyzer = new StandardAnalyzer();
        File[] textFiles = fileDir.listFiles();
        long startTime = new Date().getTime();

        int mergeFactor = 10;
        int minMergeDocs = 10;
        int maxMergeDocs = Integer.MAX_VALUE;

        // true = create a new index, overwriting any existing one
        IndexWriter indexWriter = new IndexWriter(indexDir, luceneAnalyzer, true);
        // These are public fields in the early Lucene releases this example
        // targets; later releases replaced them with setter methods.
        indexWriter.mergeFactor = mergeFactor;
        indexWriter.minMergeDocs = minMergeDocs;
        indexWriter.maxMergeDocs = maxMergeDocs;

        // Add documents to the index
        for (int i = 0; i < textFiles.length; i++) {
            if (textFiles[i].isFile() && textFiles[i].getName().endsWith(".txt")) {
                Reader textReader = new FileReader(textFiles[i]);
                Document document = new Document();
                // "content" is tokenized from the reader; "path" is stored as an untokenized key
                document.add(Field.Text("content", textReader));
                document.add(Field.Keyword("path", textFiles[i].getPath()));
                indexWriter.addDocument(document);
            }
        }

        // Merge all segments into one, then release the index
        indexWriter.optimize();
        indexWriter.close();
        long endTime = new Date().getTime();

        System.out.println("MergeFactor: " + indexWriter.mergeFactor);
        System.out.println("MinMergeDocs: " + indexWriter.minMergeDocs);
        System.out.println("MaxMergeDocs: " + indexWriter.maxMergeDocs);
        System.out.println("Document number: " + textFiles.length);
        System.out.println("Time consumed: " + (endTime - startTime) + " milliseconds");
    }
}

Check whether the segments file exists in the index directory.
If it exists, index incrementally.
Otherwise, rebuild the index.
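The check above could be sketched roughly as follows. This is a minimal illustration, not the author's code: the class, enum, and method names are invented, and matching any file whose name starts with "segments" is an assumption made so that both the plain `segments` file of early Lucene releases and the `segments_N` files of later ones are recognized. Lucene releases of that era also provide their own `IndexReader.indexExists` check, which is probably the better choice when it is available.

```java
import java.io.File;

/** Hypothetical helper that picks between incremental indexing and a full rebuild. */
public class IndexModeChooser {

    public enum Mode { INCREMENTAL, REBUILD }

    /** True if indexDir contains a file whose name starts with "segments" (an assumption). */
    public static boolean hasSegmentsFile(File indexDir) {
        String[] names = indexDir.list(); // null when indexDir is not a readable directory
        if (names == null) {
            return false;
        }
        for (String name : names) {
            if (name.startsWith("segments")) {
                return true;
            }
        }
        return false;
    }

    /** Incremental indexing if an index appears to exist, otherwise a rebuild. */
    public static Mode chooseMode(File indexDir) {
        return hasSegmentsFile(indexDir) ? Mode.INCREMENTAL : Mode.REBUILD;
    }

    public static void main(String[] args) {
        File indexDir = new File(args.length > 0 ? args[0] : "C:\\luceneIndex");
        System.out.println(indexDir + " -> " + chooseMode(indexDir));
    }
}
```

With the writer used earlier in this post, the result would decide the third constructor argument: true (create) for a rebuild, false (open the existing index) for incremental indexing.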

When rebuilding the index:
simply iterate over the content to be indexed and add each item as a new document.

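When indexing incrementally:
Check whether the primary key [the file name, including its full path] already exists in the index.
  If it exists:
    Compare [the file's modification time] with the [modification time] stored in the index.
    If they differ:
      Delete the old entry for that file from the index.
      Add a new index entry for the document.
  Otherwise:
    Add an index entry for the document.

Drawback: this cannot detect files that have been deleted, at least when someone removes a file from a directory by hand.
If the program itself deletes the file, this is of course not a problem: it only needs to delete the corresponding entry from the index as well.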
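The decision steps above can be sketched in plain Java. This is only an illustration under stated assumptions: two maps from full path to modification time stand in for "what the index has stored" and "what is currently on disk", and all class, field, and method names are hypothetical. In a real indexer the lookups and deletions would go through the Lucene index, keyed on the path field. The third list goes one step beyond the procedure in the text: comparing the indexed keys against the files on disk is one possible way to catch manually deleted files.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Plain-Java sketch of the incremental indexing decision; the maps stand in for index lookups. */
public class IncrementalPlanSketch {

    /** The work an incremental run would have to do. */
    public static final class Plan {
        public final List<String> toAdd = new ArrayList<>();     // key not in the index yet
        public final List<String> toReindex = new ArrayList<>(); // delete old entry, then add
        public final List<String> toDelete = new ArrayList<>();  // indexed, but gone from disk
    }

    /**
     * @param indexed path -> modification time stored in the index
     * @param onDisk  path -> current modification time of the file
     */
    public static Plan plan(Map<String, Long> indexed, Map<String, Long> onDisk) {
        Plan plan = new Plan();
        // TreeMap copies only make the output order deterministic.
        for (Map.Entry<String, Long> file : new TreeMap<>(onDisk).entrySet()) {
            Long storedTime = indexed.get(file.getKey());
            if (storedTime == null) {
                plan.toAdd.add(file.getKey());
            } else if (!storedTime.equals(file.getValue())) {
                plan.toReindex.add(file.getKey());
            }
            // Same modification time: the indexed entry is still current.
        }
        for (String path : new TreeMap<>(indexed).keySet()) {
            if (!onDisk.containsKey(path)) {
                plan.toDelete.add(path);
            }
        }
        return plan;
    }

    public static void main(String[] args) {
        Map<String, Long> indexed = new TreeMap<>();
        indexed.put("C:\\files_to_index\\a.txt", 100L);
        indexed.put("C:\\files_to_index\\b.txt", 200L);
        Map<String, Long> onDisk = new TreeMap<>();
        onDisk.put("C:\\files_to_index\\b.txt", 250L);
        onDisk.put("C:\\files_to_index\\c.txt", 300L);

        Plan p = plan(indexed, onDisk);
        System.out.println("add: " + p.toAdd);
        System.out.println("reindex: " + p.toReindex);
        System.out.println("delete: " + p.toDelete);
    }
}
```

Keeping the planning step separate from the Lucene calls makes the logic easy to check in isolation; the cost is that the stored modification times have to be read out of the index first.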
