Lucene Study Notes, Part 4: Analysis of the Lucene Indexing Process (1)

forfuture1978


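Besides writing terms (Term) into the inverted lists and finally into Lucene's index files, the Lucene indexing process also includes tokenization (Analyzer) and segment merging (merge segments). This article does not cover those two parts; they will be analysed in later articles.

Many blogs and articles describe the Lucene indexing process. I recommend searching online for "Annotated Lucene" (its Chinese title appears to be 《Lucene源码剖析》), which is very good.

The best way to really understand how Lucene builds its index files is to step through the code in a debugger and read the code alongside the article. That gives you the most detailed and accurate picture of the indexing process (descriptions always deviate somewhat, but code does not lie), and it also lets you learn some of Lucene's well-crafted implementations and reuse them in later work. Lucene is, after all, one of the better open-source projects.

Since Lucene has been upgraded to 3.0.0, the indexing process described here is that of Lucene 3.0.0.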

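I. Architecture of the Indexing Process

Indexing in Lucene 3.0 is a rather complex process. Information is spread across different objects that analyse, process, and write it. To support multithreading, each thread creates its own set of similarly structured objects, and to improve efficiency some of these object sets are reused, which makes the indexing process more complex still.

In essence, indexing means passing a document through the indexing chain shown in the figure below. Each node in the chain handles a different part of the document's information, and once the document has passed through the whole chain it has been fully processed. We call this initial chain the basic indexing chain.

To support multithreading, so that several threads can process documents concurrently, each thread builds its own indexing chain and can therefore work independently. This per-thread chain, built on top of the basic indexing chain, we call the thread indexing chain. Each node of the thread indexing chain is created by calling addThread on the corresponding node of the basic indexing chain.

For efficiency, and because the same field is processed in much the same way and needs roughly the same buffers, a thread does not recreate a whole series of objects for every document it handles; it reuses them. Each field therefore also gets its own indexing chain, which we call the field indexing chain. Each node of the field indexing chain is created by calling addField on the corresponding node of the thread indexing chain.

Once the documents have been processed, all the information has to be written to the index files. Writing to the index files is synchronized rather than multithreaded, and it too proceeds along the basic indexing chain, writing each part of the information in turn.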
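The three levels described above can be sketched in plain Java. This is a minimal illustration of the pattern only, not Lucene's classes: ChainNode, ThreadNode, FieldNode, and the word-count "processing" are hypothetical stand-ins, and the sketch assumes flush() is called only while no indexing thread is active.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Basic chain node (hypothetical): one shared instance, used when flushing. */
final class ChainNode {
    final String name;
    final List<ThreadNode> perThread = new ArrayList<>();

    ChainNode(String name) { this.name = name; }

    /** Creates the per-thread counterpart of this node. */
    synchronized ThreadNode addThread() {
        ThreadNode t = new ThreadNode(this);
        perThread.add(t);
        return t;
    }

    /** Single-threaded flush: walks every per-thread, per-field buffer and resets it. */
    synchronized int flush() {
        int total = 0;
        for (ThreadNode t : perThread) {
            for (FieldNode f : t.perField.values()) {
                total += f.termCount;
                f.termCount = 0;
            }
        }
        return total;
    }
}

/** Thread chain node (hypothetical): owned by exactly one indexing thread. */
final class ThreadNode {
    final ChainNode parent;
    final Map<String, FieldNode> perField = new HashMap<>();

    ThreadNode(ChainNode parent) { this.parent = parent; }

    /** Creates the per-field node on first use and reuses it for later documents. */
    FieldNode addField(String fieldName) {
        return perField.computeIfAbsent(fieldName, FieldNode::new);
    }
}

/** Field chain node (hypothetical): buffers state for one field name. */
final class FieldNode {
    final String fieldName;
    int termCount;

    FieldNode(String fieldName) { this.fieldName = fieldName; }

    /** Stand-in for real analysis: counts whitespace-separated words. */
    void process(String text) {
        termCount += text.split("\\s+").length;
    }
}
```

Each thread touches only its own ThreadNode and FieldNode objects, so document processing needs no locking; only addThread and flush go through the shared basic node.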

The rest of this article analyses the process in detail.

II. The Indexing Process in Detail

1. Creating the IndexWriter object

Code:

IndexWriter writer = new IndexWriter(FSDirectory.open(INDEX_DIR), new StandardAnalyzer(Version.LUCENE_CURRENT), true, IndexWriter.MaxFieldLength.LIMITED);

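An IndexWriter object mainly holds the following information:

For indexing documents
- Directory directory; the index directory.
- Analyzer analyzer; the analyzer (tokenizer).
- Similarity similarity = Similarity.getDefault(); affects the normalization factor part of scoring. A document's score has two parts: one is computed at index time and is independent of the query, the other is computed at search time and depends on the query.
- SegmentInfos segmentInfos = new SegmentInfos(); holds the segment information, which corresponds almost one-to-one with the contents of segments_N.
- IndexFileDeleter deleter; despite its name, this object does not delete documents; it manages index files.
- Lock writeLock; only one IndexWriter may be open on an index directory at a time, hence the lock.
- Set segmentsToOptimize = new HashSet(); the segments currently being optimized. When optimize is called, all current segments are added to this Set; segments created afterwards do not take part in that optimization.
For merging segments (described in detail in the article on segment merging)
- SegmentInfos localRollbackSegmentInfos;
- HashSet mergingSegments = new HashSet();
- MergePolicy mergePolicy = new LogByteSizeMergePolicy(this);
- MergeScheduler mergeScheduler = new ConcurrentMergeScheduler();
- LinkedList pendingMerges = new LinkedList();
- Set runningMerges = new HashSet();
- List mergeExceptions = new ArrayList();
- long mergeGen;
For index integrity, consistency, and transactional behaviour
- SegmentInfos rollbackSegmentInfos; after an IndexWriter has added or deleted documents, commit can be called to write the changes to the index files, or rollback can be called to discard the changes made since the last commit.
- SegmentInfos localRollbackSegmentInfos; used mainly when merging other index directories into this one. It keeps the original segment information so that the operation can be rolled back if it fails halfway.
Configuration
- long writeLockTimeout; the timeout for obtaining the lock. A timeout indicates that the index directory is already open in another IndexWriter.
- int termIndexInterval; the same as indexInterval in the tii and tis files.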
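The role of rollbackSegmentInfos can be illustrated with a small sketch. It is a simplified model under stated assumptions, not IndexWriter's implementation: SegmentListSketch and its methods are hypothetical names, segments are represented only by their names, and files on disk, deletions, and the writer's lifecycle are ignored.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical model: a live segment list plus a snapshot taken at the last commit. */
final class SegmentListSketch {
    private List<String> segments = new ArrayList<>();          // current, uncommitted view
    private List<String> rollbackSegments = new ArrayList<>();  // state as of the last commit

    void addSegment(String name) {
        segments.add(name);
    }

    /** Makes the current state the new rollback point. */
    void commit() {
        rollbackSegments = new ArrayList<>(segments);
    }

    /** Discards everything added since the last commit. */
    void rollback() {
        segments = new ArrayList<>(rollbackSegments);
    }

    /** Returns a copy so callers cannot modify the internal list. */
    List<String> segments() {
        return new ArrayList<>(segments);
    }
}
```

Keeping a separate copy of the committed state is what makes rollback cheap: nothing has to be undone step by step, the live list is simply replaced by the snapshot.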

About the information held in the SegmentInfos object:

When the index directory looks as in the figure, the SegmentInfos object is as follows:

segmentInfos    SegmentInfos (id=37)
    capacityIncrement    0
    counter    3
    elementCount    3
    elementData    Object[10] (id=68)
        [0]    SegmentInfo (id=166)
            delCount    0
            delGen    -1
            diagnostics    HashMap (id=170)
            dir    SimpleFSDirectory (id=171)
            docCount    2
            docStoreIsCompoundFile    false
            docStoreOffset    -1
            docStoreSegment    null
            files    ArrayList (id=173)
            hasProx    true
            hasSingleNormFile    true
            isCompoundFile    1
            name    "_0"
            normGen    null
            preLockless    false
            sizeInBytes    635
        [1]    SegmentInfo (id=168)
            delCount    0
            delGen    -1
            diagnostics    HashMap (id=177)
            dir    SimpleFSDirectory (id=171)
            docCount    2
            docStoreIsCompoundFile    false
            docStoreOffset    -1
            docStoreSegment    null
            files    ArrayList (id=178)
            hasProx    true
            hasSingleNormFile    true
            isCompoundFile    1
            name    "_1"
            normGen    null
            preLockless    false
            sizeInBytes    635
        [2]    SegmentInfo (id=169)
            delCount    0
            delGen    -1
            diagnostics    HashMap (id=180)
            dir    SimpleFSDirectory (id=171)
            docCount    2
            docStoreIsCompoundFile    false
            docStoreOffset    -1
            docStoreSegment    null
            files    ArrayList (id=214)
            hasProx    true
            hasSingleNormFile    true
            isCompoundFile    1
            name    "_2"
            normGen    null
            preLockless    false
            sizeInBytes    635
    generation    4
    lastGeneration    4
    modCount    3
    pendingSegnOutput    null
    userData    HashMap (id=146)
    version    1263044890832

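About IndexFileDeleter:

It is not used to delete documents; it manages index files.
Adding and deleting documents and merging segments produce many new files and make old files obsolete, so the files need to be managed.
A file that is due for deletion may still be in use, however, so a reference count is kept for each file, and the file is deleted only when its count drops to zero.
The following example shows how IndexFileDeleter counts references to files and how files are added and deleted.

(1) When the IndexWriter is created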

IndexWriter writer = new IndexWriter(FSDirectory.open(indexDir), new StandardAnalyzer(Version.LUCENE_CURRENT), true, IndexWriter.MaxFieldLength.LIMITED);
writer.setMergeFactor(3);

The index directory is as shown in the figure.

The reference counts are as follows:

refCounts    HashMap (id=101)
    size    1
    table    HashMap$Entry[16] (id=105)
        [8]    HashMap$Entry (id=110)
            key    "segments_1"
            value    IndexFileDeleter$RefCount (id=38)
                count    1

(2) When the first segment is added

indexDocs(writer, docDir);
writer.commit();

The files generated first are not in compound format,

so the reference counts are as follows:

refCounts    HashMap (id=101)
    size    9
    table    HashMap$Entry[16] (id=105)
        [1]    HashMap$Entry (id=129)
            key    "_0.tis"
            value    IndexFileDeleter$RefCount (id=138)
                count    1
        [3]    HashMap$Entry (id=130)
            key    "_0.fnm"
            value    IndexFileDeleter$RefCount (id=141)
                count    1
        [4]    HashMap$Entry (id=134)
            key    "_0.tii"
            value    IndexFileDeleter$RefCount (id=142)
                count    1
        [8]    HashMap$Entry (id=135)
            key    "_0.frq"
            value    IndexFileDeleter$RefCount (id=143)
                count    1
        [10]    HashMap$Entry (id=136)
            key    "_0.fdx"
            value    IndexFileDeleter$RefCount (id=144)
                count    1
        [13]    HashMap$Entry (id=139)
            key    "_0.prx"
            value    IndexFileDeleter$RefCount (id=145)
                count    1
        [14]    HashMap$Entry (id=140)
            key    "_0.fdt"
            value    IndexFileDeleter$RefCount (id=146)
                count    1

They are then merged into a compound file, which is added to the reference counts:

refCounts    HashMap (id=101)
    size    10
    table    HashMap$Entry[16] (id=105)
        [1]    HashMap$Entry (id=129)
            key    "_0.tis"
            value    IndexFileDeleter$RefCount (id=138)
                count    1
        [2]    HashMap$Entry (id=154)
            key    "_0.cfs"
            value    IndexFileDeleter$RefCount (id=155)
                count    1
        [3]    HashMap$Entry (id=130)
            key    "_0.fnm"
            value    IndexFileDeleter$RefCount (id=141)
                count    1
        [4]    HashMap$Entry (id=134)
            key    "_0.tii"
            value    IndexFileDeleter$RefCount (id=142)
                count    1
        [8]    HashMap$Entry (id=135)
            key    "_0.frq"
            value    IndexFileDeleter$RefCount (id=143)
                count    1
        [10]    HashMap$Entry (id=136)
            key    "_0.fdx"
            value    IndexFileDeleter$RefCount (id=144)
                count    1
        [13]    HashMap$Entry (id=139)
            key    "_0.prx"
            value    IndexFileDeleter$RefCount (id=145)
                count    1
        [14]    HashMap$Entry (id=140)
            key    "_0.fdt"
            value    IndexFileDeleter$RefCount (id=146)
                count    1

IndexFileDeleter.decRef() is then used to delete the files [_0.nrm, _0.tis, _0.fnm, _0.tii, _0.frq, _0.fdx, _0.prx, _0.fdt]:

refCounts    HashMap (id=101)
    size    2
    table    HashMap$Entry[16] (id=105)
        [2]    HashMap$Entry (id=154)
            key    "_0.cfs"
            value    IndexFileDeleter$RefCount (id=155)
                count    1
        [8]    HashMap$Entry (id=110)
            key    "segments_1"
            value    IndexFileDeleter$RefCount (id=38)
                count    1

Next, the new segments_2 is created:

refCounts    HashMap (id=77)
    size    3
    table    HashMap$Entry[16] (id=84)
        [2]    HashMap$Entry (id=87)
            key    "_0.cfs"
            value    IndexFileDeleter$RefCount (id=91)
                count    3
        [8]    HashMap$Entry (id=89)
            key    "segments_1"
            value    IndexFileDeleter$RefCount (id=62)
                count    0
        [9]    HashMap$Entry (id=90)
            key    "segments_2"
            next    null
            value    IndexFileDeleter$RefCount (id=93)
                count    1

IndexFileDeleter.decRef() then deletes the segments_1 file:

refCounts    HashMap (id=77)
    size    2
    table    HashMap$Entry[16] (id=84)
        [2]    HashMap$Entry (id=87)
            key    "_0.cfs"
            value    IndexFileDeleter$RefCount (id=91)
                count    2
        [9]    HashMap$Entry (id=90)
            key    "segments_2"
            value    IndexFileDeleter$RefCount (id=93)
                count    1

(3) Adding the second segment

indexDocs(writer, docDir);
writer.commit();

(4) Adding the third segment. Because MergeFactor is 3, a segment merge takes place.

indexDocs(writer, docDir);
writer.commit();

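First, as with the other segments, _2.cfs and segments_4 are generated.

At the same time, a thread is created to merge the segments in the background (ConcurrentMergeScheduler$MergeThread.run()).

The reference counts at this point are as follows: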

refCounts    HashMap (id=84)
    size    5
    table    HashMap$Entry[16] (id=98)
        [2]    HashMap$Entry (id=112)
            key    "_0.cfs"
            value    IndexFileDeleter$RefCount (id=117)
                count    1
        [4]    HashMap$Entry (id=113)
            key    "_3.cfs"
            value    IndexFileDeleter$RefCount (id=118)
                count    1
        [12]    HashMap$Entry (id=114)
            key    "_1.cfs"
            value    IndexFileDeleter$RefCount (id=119)
                count    1
        [13]    HashMap$Entry (id=115)
            key    "_2.cfs"
            value    IndexFileDeleter$RefCount (id=120)
                count    1
        [15]    HashMap$Entry (id=116)
            key    "segments_4"
            value    IndexFileDeleter$RefCount (id=121)
                count    1

(5) Closing the writer

writer.close();

The segments that were merged are deleted through IndexFileDeleter.decRef().
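The reference-counting behaviour traced in steps (1) to (5) can be sketched as follows. This is an illustrative model, not the source of IndexFileDeleter: RefCountingDeleterSketch is a hypothetical class, and "deleting" a file only records its name in a list instead of touching a directory.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical model of a deleter that removes a file once nothing references it. */
final class RefCountingDeleterSketch {
    private final Map<String, Integer> refCounts = new HashMap<>();
    private final List<String> deleted = new ArrayList<>();

    /** Adds one reference, starting to track the file if it is new. */
    void incRef(String file) {
        refCounts.merge(file, 1, Integer::sum);
    }

    /** Drops one reference; the file is deleted when the count reaches zero. */
    void decRef(String file) {
        Integer count = refCounts.get(file);
        if (count == null) {
            throw new IllegalStateException("untracked file: " + file);
        }
        if (count == 1) {
            refCounts.remove(file);
            deleted.add(file); // a real implementation would delete from the directory here
        } else {
            refCounts.put(file, count - 1);
        }
    }

    int count(String file) {
        return refCounts.getOrDefault(file, 0);
    }

    List<String> deleted() {
        return deleted;
    }

    public static void main(String[] args) {
        RefCountingDeleterSketch d = new RefCountingDeleterSketch();
        d.incRef("segments_1");
        d.incRef("_0.cfs");
        d.incRef("segments_2");
        d.decRef("segments_1"); // the old commit point is no longer referenced
        System.out.println("deleted: " + d.deleted());
    }
}
```

A file referenced from two places, for example by a commit point and by a running merge, survives the first decRef and disappears only with the last one.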

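About SimpleFSLock and synchronization between JVMs:

Sometimes a Java program needs to synchronize across different JVMs in order to protect a resource that is unique within the whole system.
If the resource lives inside a single process, thread synchronization is enough.
If it has to be accessed by several processes, an inter-process synchronization mechanism is needed. Both Windows and Linux provide many such mechanisms at the operating-system level.
Inter-process synchronization is not one of Java's strengths, however, and Lucene's SimpleFSLock offers one way to do it.

The abstract class Lock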

public abstract class Lock {

  // Interval, in milliseconds, between attempts to obtain the lock.
  public static long LOCK_POLL_INTERVAL = 1000;

  // Pass this to obtain(long) to keep retrying forever.
  public static final long LOCK_OBTAIN_WAIT_FOREVER = -1;

  // Makes a single attempt to obtain the lock; returns true on success.
  public abstract boolean obtain() throws IOException;

  // Polls obtain() every LOCK_POLL_INTERVAL ms until it succeeds or the timeout expires.
  public boolean obtain(long lockWaitTimeout) throws LockObtainFailedException, IOException {
    boolean locked = obtain();
    if (lockWaitTimeout < 0 && lockWaitTimeout != LOCK_OBTAIN_WAIT_FOREVER)
      throw new IllegalArgumentException("...");

    long maxSleepCount = lockWaitTimeout / LOCK_POLL_INTERVAL;
    long sleepCount = 0;
    while (!locked) {
      if (lockWaitTimeout != LOCK_OBTAIN_WAIT_FOREVER && sleepCount++ >= maxSleepCount) {
        throw new LockObtainFailedException("Lock obtain timed out.");
      }
      try {
        Thread.sleep(LOCK_POLL_INTERVAL);
      } catch (InterruptedException ie) {
        throw new ThreadInterruptedException(ie);
      }
      locked = obtain();
    }
    return locked;
  }

  public abstract void release() throws IOException;

  public abstract boolean isLocked() throws IOException;

}
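The polling loop in obtain(long) can be tried out in isolation with the sketch below. It is a simplified variant under stated assumptions: PollingObtainSketch is a hypothetical class, the attempt is passed in as a BooleanSupplier, and a timeout or an interrupt makes the method return false where Lucene's version throws an exception. The wait-forever case is left out.

```java
import java.util.function.BooleanSupplier;

/** Hypothetical stand-alone version of a poll-until-timeout loop. */
final class PollingObtainSketch {

    /**
     * Calls tryObtain once, then again after each pollMs sleep, and gives up
     * after timeoutMs / pollMs sleeps. Returns true if an attempt succeeded.
     */
    static boolean obtain(BooleanSupplier tryObtain, long timeoutMs, long pollMs) {
        if (timeoutMs < 0 || pollMs <= 0) {
            throw new IllegalArgumentException("timeoutMs must be >= 0 and pollMs > 0");
        }
        boolean locked = tryObtain.getAsBoolean();
        long maxSleepCount = timeoutMs / pollMs;
        long sleepCount = 0;
        while (!locked) {
            if (sleepCount++ >= maxSleepCount) {
                return false; // timed out (Lucene throws LockObtainFailedException here)
            }
            try {
                Thread.sleep(pollMs);
            } catch (InterruptedException ie) {
                Thread.currentThread().interrupt(); // preserve the interrupt status
                return false;
            }
            locked = tryObtain.getAsBoolean();
        }
        return true;
    }
}
```

Because the timeout is converted into a number of sleeps, the effective waiting time is a multiple of the poll interval, and a timeout shorter than one interval means a single attempt with no retry.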

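The abstract class LockFactory

public abstract class LockFactory {

  // Returns a Lock instance identified by lockName.
  public abstract Lock makeLock(String lockName);

  // Forcibly releases any lock with this name.
  abstract public void clearLock(String lockName) throws IOException;
}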

The implementation class SimpleFSLock

class SimpleFSLock extends Lock {

  File lockFile;
  File lockDir;

  public SimpleFSLock(File lockDir, String lockFileName) {
    this.lockDir = lockDir;
    lockFile = new File(lockDir, lockFileName);
  }

  @Override
  public boolean obtain() throws IOException {
    // Make sure the lock directory exists and really is a directory.
    if (!lockDir.exists()) {
      if (!lockDir.mkdirs())
        throw new IOException("Cannot create directory: " + lockDir.getAbsolutePath());
    } else if (!lockDir.isDirectory()) {
      throw new IOException("Found regular file where directory expected: " + lockDir.getAbsolutePath());
    }
    // The lock is obtained only if this call created the file.
    return lockFile.createNewFile();
  }

  @Override
  public void release() throws LockReleaseFailedException {
    if (lockFile.exists() && !lockFile.delete())
      throw new LockReleaseFailedException("failed to delete " + lockFile);
  }

  @Override
  public boolean isLocked() {
    return lockFile.exists();
  }
}

The implementation class SimpleFSLockFactory

public class SimpleFSLockFactory extends FSLockFactory {

  public SimpleFSLockFactory(String lockDirName) throws IOException {
    setLockDir(new File(lockDirName));
  }

  @Override
  public Lock makeLock(String lockName) {
    if (lockPrefix != null) {
      lockName = lockPrefix + "-" + lockName;
    }
    return new SimpleFSLock(lockDir, lockName);
  }

  @Override
  public void clearLock(String lockName) throws IOException {
    if (lockDir.exists()) {
      if (lockPrefix != null) {
        lockName = lockPrefix + "-" + lockName;
      }
      File lockFile = new File(lockDir, lockName);
      if (lockFile.exists() && !lockFile.delete()) {
        throw new IOException("Cannot delete " + lockFile);
      }
    }
  }
}
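The core idea of SimpleFSLock, that whoever creates the lock file holds the lock, can be tried without Lucene using only java.io.File. The sketch below is an illustration under stated assumptions: FileLockSketch is a hypothetical class, the lock file name in main is only an example, and I/O failures are rethrown as unchecked exceptions to keep the example short.

```java
import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;

/** Hypothetical lock: held by whichever caller managed to create the lock file. */
final class FileLockSketch {
    private final File lockFile;

    FileLockSketch(File lockDir, String lockName) {
        this.lockFile = new File(lockDir, lockName);
    }

    /** Returns true only for the caller that actually created the file. */
    boolean obtain() {
        File dir = lockFile.getParentFile();
        if (!dir.isDirectory() && !dir.mkdirs()) {
            throw new UncheckedIOException(new IOException("Cannot create directory: " + dir));
        }
        try {
            return lockFile.createNewFile(); // creates the file only if it does not exist yet
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Releases the lock by deleting the file. */
    void release() {
        if (lockFile.exists() && !lockFile.delete()) {
            throw new IllegalStateException("failed to delete " + lockFile);
        }
    }

    boolean isLocked() {
        return lockFile.exists();
    }

    public static void main(String[] args) {
        File dir = new File(System.getProperty("java.io.tmpdir"), "lock-sketch-" + System.nanoTime());
        FileLockSketch first = new FileLockSketch(dir, "write.lock");
        FileLockSketch second = new FileLockSketch(dir, "write.lock");
        System.out.println("first obtains: " + first.obtain());
        System.out.println("second obtains: " + second.obtain());
        first.release();
        dir.delete();
    }
}
```

Two caveats apply to this approach. If the process holding the lock dies without calling release, the file stays behind and has to be removed by hand. The Javadoc of File.createNewFile also advises against using it for file locking and points to java.nio.channels.FileLock instead, so this pattern is best treated as a simple, portable option and not as a robust one.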

2. Creating the Document object and adding fields (Field)

Code:

Document doc = new Document();

doc.add(new Field("path", f.getPath(), Field.Store.YES, Field.Index.NOT_ANALYZED));

doc.add(new Field("modified",DateTools.timeToString(f.lastModified(), DateTools.Resolution.MINUTE), Field.Store.YES, Field.Index.NOT_ANALYZED));

doc.add(new Field("contents", new FileReader(f)));

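A Document object mainly consists of the following:

The document's boost. The default is 1; a value greater than 1 marks the document as more important than an ordinary one, and a value less than 1 marks it as less important.
An ArrayList holding all the fields of the document.
Each field consists of a field name, a field value, and a number of flags, which correspond to what is described in the fnm, fdx, and fdt files.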

doc    Document (id=42)
    boost    1.0
    fields    ArrayList (id=44)
        elementData    Object[10] (id=46)
            [0]    Field (id=48)
                binaryLength    0
                binaryOffset    0
                boost    1.0
                fieldsData    "exampledocs\\file01.txt"
                isBinary    false
                isIndexed    true
                isStored    true
                isTokenized    false
                lazy    false
                name    "path"
                omitNorms    false
                omitTermFreqAndPositions    false
                storeOffsetWithTermVector    false
                storePositionWithTermVector    false
                storeTermVector    false
                tokenStream    null
            [1]    Field (id=50)
                binaryLength    0
                binaryOffset    0
                boost    1.0
                fieldsData    "200910240957"
                isBinary    false
                isIndexed    true
                isStored    true
                isTokenized    false
                lazy    false
                name    "modified"
                omitNorms    false
                omitTermFreqAndPositions    false
                storeOffsetWithTermVector    false
                storePositionWithTermVector    false
                storeTermVector    false
                tokenStream    null
            [2]    Field (id=52)
                binaryLength    0
                binaryOffset    0
                boost    1.0
                fieldsData    FileReader (id=58)
                isBinary    false
                isIndexed    true
                isStored    false
                isTokenized    true
                lazy    false
                name    "contents"
                omitNorms    false
                omitTermFreqAndPositions    false
                storeOffsetWithTermVector    false
                storePositionWithTermVector    false
                storeTermVector    false
                tokenStream    null
        modCount    3
        size    3
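The relationship between the three doc.add calls and the flags in the dump above can be modelled in a few lines. This is a simplified, hypothetical model and not Lucene's Document or Field class: FieldSketch and DocumentSketch are invented names, and only the isStored, isIndexed, and isTokenized flags and the boost are represented.

```java
import java.io.Reader;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical field: a name, a value, and the three flags seen in the dump. */
final class FieldSketch {
    final String name;
    final Object value; // a String or a Reader
    final boolean isStored;
    final boolean isIndexed;
    final boolean isTokenized;
    float boost = 1.0f;

    /** String-valued field with explicit flags, e.g. stored, indexed, not tokenized. */
    FieldSketch(String name, String value, boolean stored, boolean indexed, boolean tokenized) {
        this.name = name;
        this.value = value;
        this.isStored = stored;
        this.isIndexed = indexed;
        this.isTokenized = tokenized;
    }

    /** Reader-valued field: indexed and tokenized, but not stored, as for "contents". */
    FieldSketch(String name, Reader reader) {
        this.name = name;
        this.value = reader;
        this.isStored = false;
        this.isIndexed = true;
        this.isTokenized = true;
    }
}

/** Hypothetical document: a boost plus an ordered list of fields. */
final class DocumentSketch {
    float boost = 1.0f;
    final List<FieldSketch> fields = new ArrayList<>();

    void add(FieldSketch field) {
        fields.add(field);
    }
}
```

In the dump, "path" and "modified" match the first constructor with stored and indexed set and tokenized unset, while "contents" matches the Reader form.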


2010-02-03 22:47
Comments (1)

Comment 1, shantouyyt, 2012-02-16

Next, the new segments_2 is created:

refCounts    HashMap (id=77)
    size    3
    table    HashMap$Entry[16] (id=84)
        [2]    HashMap$Entry (id=87)
            key    "_0.cfs"
            value    IndexFileDeleter$RefCount (id=91)
               count    3
        [8]    HashMap$Entry (id=89)
            key    "segments_1"
            value    IndexFileDeleter$RefCount (id=62)
                count    0
        [9]    HashMap$Entry (id=90)
            key    "segments_2"
            next    null
            value    IndexFileDeleter$RefCount (id=93)
                count    1
IndexFileDeleter.decRef() then deletes the segments_1 file

Why is the count for _0.cfs 3 here?
