Lucene-2.2.0 Source Code Reading Notes (24)

pavel


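After reading this much code, it is time for a consolidated summary.

Using the example from the article Lucene-2.2.0 Source Code Reading Notes (4), let us trace how an IndexWriter is instantiated and what processing happens while it builds an index, focusing on which classes cooperate to make indexing work.

The main function from Lucene-2.2.0 Source Code Reading Notes (4) is: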

public static void main(String[] args){
   MySearchEngine mySearcher = new MySearchEngine();
   String indexPath = "E:\\Lucene\\myindex";
   File file = new File("E:\\Lucene\\txt");
   mySearcher.setIndexPath(indexPath);
   mySearcher.setFile(file);
   IndexWriter writer;
   try {
    writer = new IndexWriter(mySearcher.getIndexPath(),new CJKAnalyzer(),true);
    mySearcher.createIndex(writer, mySearcher.getFile());
    mySearcher.searchContent("contents","注册");
    writer.close();
   } catch (CorruptIndexException e) {
    e.printStackTrace();
   } catch (LockObtainFailedException e) {
    e.printStackTrace();
   } catch (IOException e) {
    e.printStackTrace();
   }
}

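Start from the two lines highlighted in red in the original post: the one that constructs the IndexWriter and the one that calls createIndex().

When the IndexWriter is constructed (the first highlighted line), no Document has been added to it yet. When execution reaches the second highlighted line, the program starts reading the raw, unprocessed source files from the given local directory and adds a Document to the existing IndexWriter writer for each of them. In the createIndex() method from Lucene-2.2.0 Source Code Reading Notes (4), this is done as follows: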

writer.addDocument(FileDocument.Document(file));

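Concretely, each file read from the local directory is passed to the Document() method of the FileDocument class, which mainly builds Field objects and adds them to a Document, as shown below:

    // Build a Field from the path of f and set its properties:
    // "path" is the name of the Field, and the Field can be looked up by that name.
    // Field.Store.YES stores the value; Field.Index.UN_TOKENIZED indexes it without tokenizing it, so it is still searchable.
    doc.add(new Field("path", f.getPath(), Field.Store.YES, Field.Index.UN_TOKENIZED));

    // Build a Field holding the last-modified time, at minute resolution.
    doc.add(new Field("modified",
        DateTools.timeToString(f.lastModified(), DateTools.Resolution.MINUTE),
        Field.Store.YES, Field.Index.UN_TOKENIZED));

    // Build a Field whose value is read from a Reader; the stream opened on f must remain open until the document has been indexed.
    doc.add(new Field("contents", new FileReader(f)));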

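When this is done, Document() returns a Document object that already carries this information. Each returned Document is then added to the IndexWriter writer.

As writer.addDocument(FileDocument.Document(file)); shows, it is the addDocument() method of IndexWriter that processes each Document. That processing is fairly involved, so we trace it step by step to understand how the index gets built.

To make the trace easier to follow, the steps carry hierarchical numbers: a class is the top level, a method of that class is level one when it is first called, the methods it calls are its children, and so on.

Now step inside the IndexWriter (the IndexWriter writer already exists and is open):

In the IndexWriter class:

1. Inside the addDocument() method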

public void addDocument(Document doc, Analyzer analyzer) throws CorruptIndexException, IOException {
    ensureOpen();
    SegmentInfo newSegmentInfo = buildSingleDocSegment(doc, analyzer);
    synchronized (this) {
      ramSegmentInfos.addElement(newSegmentInfo);
      maybeFlushRamSegments();
    }
}

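The call site passes only a Document; in Lucene 2.2.0 the one-argument addDocument(Document) overload, as far as I can tell, simply forwards to this two-argument version with the writer's own analyzer. By this point both the Document doc and the Analyzer analyzer already exist in memory and can be used at any time.

Overview level (1.x)

—1.1. Calling ensureOpen()

addDocument() first calls ensureOpen(), which checks a closed flag to make sure this IndexWriter is still open. The flag is initialized to false when the IndexWriter is instantiated, meaning the writer is open. Calling close() on the IndexWriter sets the flag to true; the writer is then closed and can no longer be used for indexing.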
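The closed-flag guard described above can be sketched in a few lines. This is a minimal illustration rather than Lucene's code: the class GuardedWriter, its add() and count() methods, and the choice of IllegalStateException are all assumptions made for the example.

```java
// Hypothetical stand-in for a writer guarded by a closed flag; not Lucene's code.
final class GuardedWriter {
    private volatile boolean closed = false; // false means "open" from construction on
    private int added = 0;

    // Reject any operation once close() has run.
    private void ensureOpen() {
        if (closed) {
            throw new IllegalStateException("this writer is closed");
        }
    }

    void add(String doc) {
        ensureOpen(); // every mutating call starts with the guard
        added++;
    }

    int count() {
        return added;
    }

    void close() {
        closed = true;
    }

    public static void main(String[] args) {
        GuardedWriter w = new GuardedWriter();
        w.add("doc-1");
        w.close();
        try {
            w.add("doc-2");
        } catch (IllegalStateException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

Putting the check in one private method keeps every public operation down to a single guard call at its top.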

—1.2. Calling buildSingleDocSegment(doc, analyzer)

addDocument() then calls buildSingleDocSegment(doc, analyzer), which builds a SegmentInfo object from the doc and analyzer passed in (both already in memory). The construction looks like this:

SegmentInfo buildSingleDocSegment(Document doc, Analyzer analyzer)
      throws CorruptIndexException, IOException {
    DocumentWriter dw = new DocumentWriter(ramDirectory, analyzer, this);
    dw.setInfoStream(infoStream);
    String segmentName = newRamSegmentName();
    dw.addDocument(segmentName, doc);
    SegmentInfo si = new SegmentInfo(segmentName, 1, ramDirectory, false, false);
    si.setNumFields(dw.getNumFields());
    return si;
}

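buildSingleDocSegment() uses a DocumentWriter instance to process the Document doc, and the processing uses the Analyzer analyzer that was passed in.

Only after that processing is finished is the SegmentInfo object constructed. Its contents are filled in from information held by the DocumentWriter instance, and the SegmentInfo si is returned.

—1.3. Adding a SegmentInfo to the SegmentInfos

That is: ramSegmentInfos.addElement(newSegmentInfo);

—1.4. Calling maybeFlushRamSegments()

That is: maybeFlushRamSegments();

This method, defined in IndexWriter, watches the in-memory buffer and flushes the buffered data to the index directory when needed. Flushing may also involve merging segments. The implementation is: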

protected final void maybeFlushRamSegments() throws CorruptIndexException, IOException {
    // Flush if enough new Documents, or enough buffered delete terms, have accumulated in memory
    if (ramSegmentInfos.size() >= minMergeDocs || numBufferedDeleteTerms >= maxBufferedDeleteTerms) {
      flushRamSegments();
    }
}

The flushRamSegments() method it calls is defined as:

private final synchronized void flushRamSegments() throws CorruptIndexException, IOException {
    flushRamSegments(true);
}

That in turn calls the overloaded flushRamSegments(boolean), defined as:

protected final synchronized void flushRamSegments(boolean triggerMerge)
      throws CorruptIndexException, IOException {
    if (ramSegmentInfos.size() > 0 || bufferedDeleteTerms.size() > 0) {
      mergeSegments(ramSegmentInfos, 0, ramSegmentInfos.size());
      if (triggerMerge) maybeMergeSegments(minMergeDocs);
    }
}

The two methods called here, mergeSegments() and maybeMergeSegments(), do the core work of merging segments; they are covered later.
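The "buffer single-document segments in memory, flush when a threshold is reached" pattern from step 1.4 can be modeled with a small sketch. It is a simplified illustration under stated assumptions: RamBufferSketch and its methods are hypothetical, segments are plain strings, delete terms are left out, and copying the buffered names into a list stands in for the real mergeSegments() call.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of threshold-based flushing; not Lucene's implementation.
final class RamBufferSketch {
    private final int minMergeDocs;                                // flush threshold, named after the field in the text
    private final List<String> ramSegments = new ArrayList<>();    // buffered single-document segments
    private final List<List<String>> flushed = new ArrayList<>();  // one entry per flush

    RamBufferSketch(int minMergeDocs) {
        this.minMergeDocs = minMergeDocs;
    }

    synchronized void addSegment(String name) {
        ramSegments.add(name);
        maybeFlush();
    }

    // Flush only when enough segments have been buffered.
    private void maybeFlush() {
        if (ramSegments.size() >= minMergeDocs) {
            flush();
        }
    }

    synchronized void flush() {
        if (!ramSegments.isEmpty()) {
            flushed.add(new ArrayList<>(ramSegments)); // stands in for merging the RAM segments into one
            ramSegments.clear();
        }
    }

    synchronized int buffered() {
        return ramSegments.size();
    }

    synchronized int flushCount() {
        return flushed.size();
    }

    public static void main(String[] args) {
        RamBufferSketch buffer = new RamBufferSketch(3);
        for (int i = 0; i < 4; i++) {
            buffer.addSegment("_ram_" + i);
        }
        System.out.println("flushes=" + buffer.flushCount() + ", still buffered=" + buffer.buffered());
    }
}
```

With a threshold of 3, adding four segments triggers one flush and leaves one segment buffered, which is why a final explicit flush is still needed when the writer is closed.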

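Detailed level 1.1.x (none)

Detailed level 1.2.x

— —1.2.1. Constructing the DocumentWriter instance

Constructing a DocumentWriter takes three arguments. Here:

The first is ramDirectory, a member of IndexWriter and an instance of RAMDirectory, the class that implements a directory held in memory.

The second is analyzer, an Analyzer instance already in memory.

The third is this, the current IndexWriter writer instance.

The DocumentWriter constructor used here is defined as: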

DocumentWriter(Directory directory, Analyzer analyzer, IndexWriter writer) {
    this.directory = directory;
    this.analyzer = analyzer;
    this.similarity = writer.getSimilarity();
    this.maxFieldLength = writer.getMaxFieldLength();
    this.termIndexInterval = writer.getTermIndexInterval();
}

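So besides storing the directory and analyzer arguments, the constructor sets three more DocumentWriter members, all obtained from the IndexWriter writer:

similarity is set when the IndexWriter writer is instantiated and concerns the normalization factors used in scoring. The member is declared in IndexWriter as:

private Similarity similarity = Similarity.getDefault();

This uses the default Similarity, an instance of DefaultSimilarity, which is a subclass of Similarity and is closely tied to searching (to be studied later).

maxFieldLength is the maximum number of terms that will be indexed for a single Field; terms beyond the limit are ignored. The Lucene default is 10000 and can be changed as needed. The definition in IndexWriter is:

public final static int DEFAULT_MAX_FIELD_LENGTH = 10000;

termIndexInterval is the interval between indexed terms and governs how terms are handled in memory. The larger the value, the less memory an IndexReader uses, but the slower random access to terms becomes. The parameter determines how much computation is needed per query term. The Lucene default is 128, again defined in IndexWriter: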

public final static int DEFAULT_TERM_INDEX_INTERVAL = 128;

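— —1.2.2. Setting a PrintStream on the newly built DocumentWriter dw

That is: dw.setInfoStream(infoStream);

A PrintStream is a filter stream that extends FilterOutputStream. It wraps another output stream and adds convenient methods for printing values; here it serves as the channel for the writer's diagnostic messages.

— —1.2.3. Creating a temporary in-memory segment name

That is: String segmentName = newRamSegmentName();

This calls newRamSegmentName() in IndexWriter to generate a temporary segment name. The method is simple: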

final synchronized String newRamSegmentName() {
return "_ram_" + Integer.toString(ramSegmentInfos.counter++, Character.MAX_RADIX);
}

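ramSegmentInfos is an instance of SegmentInfos. Its counter member is used to name new segments (these are internal names), and its value is later written to the segments file. Because counter++ is a post-increment and the number is rendered in base 36 (Character.MAX_RADIX), the first call returns _ram_0, assuming the counter starts at 0, and the next call returns _ram_1.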
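To see what names the expression in newRamSegmentName() produces, here is a small standalone sketch. The class RamSegmentNames is hypothetical, and the counter is assumed to start at 0; only the naming expression itself is taken from the method above.

```java
// Hypothetical helper reproducing only the naming expression; the initial counter value is an assumption.
final class RamSegmentNames {
    private int counter = 0;

    synchronized String next() {
        // Post-increment: the current value is used for the name, then the counter advances.
        return "_ram_" + Integer.toString(counter++, Character.MAX_RADIX);
    }

    public static void main(String[] args) {
        RamSegmentNames names = new RamSegmentNames();
        for (int i = 0; i < 37; i++) {
            String name = names.next();
            if (i < 2 || i >= 35) {
                System.out.println(i + " -> " + name); // 0 -> _ram_0, 1 -> _ram_1, 35 -> _ram_z, 36 -> _ram_10
            }
        }
    }
}
```

Base 36 uses the digits 0-9 followed by a-z, so names stay short even after many segments have been created.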

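— —1.2.4. Calling DocumentWriter's addDocument() to process doc

That is: dw.addDocument(segmentName, doc);

For details on this method, see Lucene-2.2.0 Source Code Reading Notes (21).

Here is a summary of what addDocument() does:

First, it constructs a FieldInfos object and adds the doc to it. The frequently used information about each Field is extracted from doc into FieldInfo objects, which FieldInfos then manages.

For FieldInfos and FieldInfo, see Lucene-2.2.0 Source Code Reading Notes (22).

Second, taking advantage of the FieldInfos bookkeeping, it goes through the Fields again and builds Posting instances (see Lucene-2.2.0 Source Code Reading Notes (23) for the Posting class). Before the FieldInfos is written to an index file, doc is inverted, and the Fields are tokenized during inversion.

Third, inverting doc yields a Posting[] array, which is then sorted with quicksort. Only then does FieldInfos write the Field names into the index directory: fieldInfos.write(directory, segment + ".fnm");. If the segment passed in is _ram_1, for example, they are written to the file _ram_1.fnm.

Next, a FieldsWriter is constructed with new FieldsWriter(directory, segment, fieldInfos), and doc is added to it with fieldsWriter.addDocument(doc); the stored fields are handled inside FieldsWriter's addDocument().

Finally, these two calls run: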

writePostings(postings, segment);
writeNorms(segment);

For the implementation of writePostings(), see Lucene-2.2.0 Source Code Reading Notes (23).

writeNorms() is implemented as follows:

private final void writeNorms(String segment) throws IOException {
    for(int n = 0; n < fieldInfos.size(); n++){
      FieldInfo fi = fieldInfos.fieldInfo(n);
      if(fi.isIndexed && !fi.omitNorms){
        float norm = fieldBoosts[n] * similarity.lengthNorm(fi.name, fieldLengths[n]);
        IndexOutput norms = directory.createOutput(segment + ".f" + n);
        try {
          norms.writeByte(Similarity.encodeNorm(norm));
        } finally {
          norms.close();
        }
      }
    }
}

— —1.2.5. Constructing the SegmentInfo from the processed information

That is:

SegmentInfo si = new SegmentInfo(segmentName, 1, ramDirectory, false, false);
si.setNumFields(dw.getNumFields());

The SegmentInfo constructor used here is declared as:

public SegmentInfo(String name, int docCount, Directory dir, boolean isCompoundFile, boolean hasSingleNormFile)

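The first parameter is the segment name, the second is the number of Documents, the third is the Directory that holds the segment, the fourth says whether the segment is a compound file, and the fifth says whether it has a single norm file.

Lucene 2.2.0 segments normally keep the norms of all fields in one file per segment instead of one norm file per field. For this temporary RAM segment, however, hasSingleNormFile is passed as false, which matches writeNorms() above writing one segment + ".f" + n file per field.

The call fixes the Document count at 1 and does not use a compound file.

The second line, si.setNumFields(dw.getNumFields());, records the number of Fields in this SegmentInfo.

Finally the constructed SegmentInfo is returned.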

Detailed level 1.2.4.x

— — —1.2.4.1. The first write() method of FieldInfos

FieldInfos has two write() methods. The second one does the real work; the arguments first go through the first one:

public void write(Directory d, String name) throws IOException {
IndexOutput output = d.createOutput(name);

/*
    if(d instanceof RAMDirectory){
    System.out.println("d is a instance of RAMDirectory! ");
    }
    else if(d instanceof FSDirectory){
    System.out.println("d is a instance of FSDirectory! ");
    }

*/
    try {
      write(output);    // call the core write() method
    } finally {
      output.close();
    }
}

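The commented-out code above was used to check whether the Directory passed in is an FSDirectory or a RAMDirectory. Testing showed that it is an instance of the RAMDirectory implementation.

— — —1.2.4.2. The second, core write() method of FieldInfos

In the first write() method, the line

IndexOutput output = d.createOutput(name);

returns a RAMOutputStream, an in-memory output stream for the file called name.

Then, with output as the argument, the second, core write() method of FieldInfos is called. It is implemented as follows: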

public void write(IndexOutput output) throws IOException {
    output.writeVInt(size());
    for (int i = 0; i < size(); i++) {
      FieldInfo fi = fieldInfo(i);
      byte bits = 0x0;
      if (fi.isIndexed) bits |= IS_INDEXED;
      if (fi.storeTermVector) bits |= STORE_TERMVECTOR;
      if (fi.storePositionWithTermVector) bits |= STORE_POSITIONS_WITH_TERMVECTOR;
      if (fi.storeOffsetWithTermVector) bits |= STORE_OFFSET_WITH_TERMVECTOR;
      if (fi.omitNorms) bits |= OMIT_NORMS;
      if (fi.storePayloads) bits |= STORE_PAYLOADS;
      output.writeString(fi.name);
      output.writeByte(bits);
    }
}

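As described under this level, output.writeVInt(size()); writes the number of FieldInfo objects in the FieldInfos to the current buffer. The second write() method then continues:

A for loop visits each FieldInfo by its position index (the FieldInfo number), encodes some of its properties, and writes them out.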

      byte bits = 0x0;
      if (fi.isIndexed) bits |= IS_INDEXED;
      if (fi.storeTermVector) bits |= STORE_TERMVECTOR;
      if (fi.storePositionWithTermVector) bits |= STORE_POSITIONS_WITH_TERMVECTOR;
      if (fi.storeOffsetWithTermVector) bits |= STORE_OFFSET_WITH_TERMVECTOR;
      if (fi.omitNorms) bits |= OMIT_NORMS;
      if (fi.storePayloads) bits |= STORE_PAYLOADS;
      output.writeString(fi.name);
      output.writeByte(bits);

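Put simply, one byte of bit flags records the properties (for example whether the field is indexed, whether term vectors are stored, whether norms are omitted, and whether payloads are stored), and then the name of the FieldInfo and this flag byte are written to the output stream.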
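The flag byte can be tried out in isolation. The sketch below is an illustration, not the FieldInfos source: the class FieldFlagsSketch and its pack() and has() helpers are hypothetical, only four of the six flags are modeled, and the constant values are an assumption about how FieldInfos defines them.

```java
// Hypothetical illustration of packing boolean properties into one byte.
final class FieldFlagsSketch {
    // Assumed values, one bit per property; treat them as illustrative.
    static final byte IS_INDEXED = 0x1;
    static final byte STORE_TERMVECTOR = 0x2;
    static final byte OMIT_NORMS = 0x10;
    static final byte STORE_PAYLOADS = 0x20;

    // Set one bit for each property that is true.
    static byte pack(boolean isIndexed, boolean storeTermVector, boolean omitNorms, boolean storePayloads) {
        byte bits = 0x0;
        if (isIndexed) bits |= IS_INDEXED;
        if (storeTermVector) bits |= STORE_TERMVECTOR;
        if (omitNorms) bits |= OMIT_NORMS;
        if (storePayloads) bits |= STORE_PAYLOADS;
        return bits;
    }

    // Test a single bit when reading the byte back.
    static boolean has(byte bits, byte flag) {
        return (bits & flag) != 0;
    }

    public static void main(String[] args) {
        byte bits = pack(true, false, true, false);
        System.out.println("bits=0x" + Integer.toHexString(bits)
                + " indexed=" + has(bits, IS_INDEXED)
                + " omitNorms=" + has(bits, OMIT_NORMS));
    }
}
```

Because each property owns its own bit, the reader can recover every flag independently with a single AND, and a field costs one byte plus its name in the .fnm file.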

Detailed level 1.2.4.1.x

— — — —1.2.4.1.1. Creating an output stream from the given name with RAMDirectory's createOutput()

That is: IndexOutput output = d.createOutput(name);

From here on, the work moves into RAMDirectory. Let us see what happens to the name argument once it gets there.

RAMDirectory's createOutput() method is defined as follows:

public IndexOutput createOutput(String name) {
    ensureOpen();
    RAMFile file = new RAMFile(this);
    synchronized (this) {
      RAMFile existing = (RAMFile)fileMap.get(name);
      if (existing!=null) {
        sizeInBytes -= existing.sizeInBytes;
        existing.directory = null;
      }
      fileMap.put(name, file);
    }
    return new RAMOutputStream(file);
}

RAMDirectory has a member variable fileMap (a HashMap). ensureOpen() uses it to check the state: if fileMap is null, an exception is thrown and the operation is aborted, so a non-null fileMap means the RAMDirectory is open.

A new RAMFile file is always constructed with this (the open RAMDirectory) as its argument. Then, synchronizing on the RAMDirectory, the method looks up any RAMFile already registered under name as existing. Whether or not one is found, fileMap.put(name, file); registers the new file under name, replacing the old entry if there was one, and the method returns a RAMOutputStream wrapping the new RAMFile.

The sizeInBytes member of RAMDirectory tracks the number of bytes currently held by the directory, and every RAMFile is associated with a RAMDirectory. When an existing file of the same name is being replaced, its bytes no longer count toward the directory, hence sizeInBytes -= existing.sizeInBytes;. The replaced RAMFile is also detached from the directory, hence existing.directory = null;.
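The replace-and-detach bookkeeping in createOutput() can be modeled in miniature. This is a sketch under assumptions, not RAMDirectory itself: RamDirSketch and RamFileSketch are hypothetical names, the method returns the file directly instead of wrapping it in an output stream, and append() stands in for the real buffer allocation.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical miniature of an in-memory directory that tracks its total size.
final class RamDirSketch {
    static final class RamFileSketch {
        RamDirSketch directory; // owning directory; null once the file has been replaced
        long sizeInBytes;

        RamFileSketch(RamDirSketch directory) {
            this.directory = directory;
        }

        // Growing a file also grows the directory total, but only while the file is attached.
        void append(int nBytes) {
            sizeInBytes += nBytes;
            if (directory != null) {
                directory.sizeInBytes += nBytes;
            }
        }
    }

    private final Map<String, RamFileSketch> fileMap = new HashMap<>();
    long sizeInBytes;

    synchronized RamFileSketch createOutput(String name) {
        RamFileSketch file = new RamFileSketch(this); // a fresh file is always created
        RamFileSketch existing = fileMap.get(name);
        if (existing != null) {
            sizeInBytes -= existing.sizeInBytes;      // the old contents no longer count
            existing.directory = null;                // detach the replaced file
        }
        fileMap.put(name, file);                      // register the new file under the same name
        return file;
    }

    public static void main(String[] args) {
        RamDirSketch dir = new RamDirSketch();
        dir.createOutput("_ram_0.fnm").append(10);
        System.out.println("after first write: " + dir.sizeInBytes);
        dir.createOutput("_ram_0.fnm");
        System.out.println("after replacing the file: " + dir.sizeInBytes);
    }
}
```

Detaching the replaced file matters because something may still hold a reference to it; once detached, later writes to it cannot distort the directory's byte count.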

Detailed level 1.2.4.2.x

— — — —1.2.4.2.1. The size() method of FieldInfos

That is: output.writeVInt(size());

Look at size() first. It is a method of FieldInfos, defined as:

public int size() {
return byNumber.size();
}

byNumber is a member of FieldInfos. It is an ArrayList of FieldInfo objects, and since each FieldInfo has a number, a FieldInfo can be fetched from byNumber by that number: for a position fieldNumber, byNumber.get(fieldNumber) returns the FieldInfo at that position.

So size() returns the number of FieldInfo objects in the FieldInfos.

— — — —1.2.4.2.2. The writeVInt() method of IndexOutput

The method is defined as:

public void writeVInt(int i) throws IOException {
    while ((i & ~0x7F) != 0) {
      writeByte((byte)((i & 0x7f) | 0x80));
      i >>>= 7;
    }
    writeByte((byte)i);
}

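In this method, 0x7F is 127, and its bitwise complement ~0x7F is -128, a mask that clears the low 7 bits.

For 0<=i<=127, (i & ~0x7F) = 0; for 128<=i<=255, (i & ~0x7F) = 128; for 256<=i<=383, (i & ~0x7F) = 256; and so on. In general the result is i rounded down to a multiple of 128.

In other words, for non-negative i the loop condition is true exactly when i>=128, that is, when i does not fit in 7 bits.

Each pass through the loop calls writeByte() with (i & 0x7F) | 0x80: the low 7 bits of i, with the high bit 0x80 (128) set to signal that more bytes follow. As an unsigned value that byte is always between 128 and 255. Then i >>>= 7 drops the 7 bits just written. Once i fits in 7 bits, the final writeByte((byte)i) writes it with the high bit clear, which marks the end of the number.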
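A worked example makes the variable-length encoding concrete. The sketch below reuses the loop from writeVInt() but writes into a ByteArrayOutputStream so it runs without Lucene; the class VIntSketch and the decode() helper are hypothetical additions for illustration.

```java
import java.io.ByteArrayOutputStream;

// Hypothetical standalone version of the 7-bits-per-byte integer encoding.
final class VIntSketch {
    // Same loop as writeVInt(): emit 7 bits at a time, low-order group first.
    static byte[] encode(int i) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        while ((i & ~0x7F) != 0) {
            out.write((i & 0x7F) | 0x80); // low 7 bits plus the "more bytes follow" bit
            i >>>= 7;
        }
        out.write(i);                     // last group, high bit clear
        return out.toByteArray();
    }

    // Reverse the encoding: accumulate 7-bit groups until a byte without the high bit.
    static int decode(byte[] bytes) {
        int value = 0;
        int shift = 0;
        for (byte b : bytes) {
            value |= (b & 0x7F) << shift;
            shift += 7;
            if ((b & 0x80) == 0) {
                break;
            }
        }
        return value;
    }

    public static void main(String[] args) {
        for (int n : new int[] {0, 127, 128, 300, 16384}) {
            StringBuilder hex = new StringBuilder();
            for (byte b : encode(n)) {
                hex.append(String.format("%02x ", b & 0xFF));
            }
            System.out.println(n + " -> " + hex.toString().trim() + " -> " + decode(encode(n)));
        }
    }
}
```

For example, 300 is 0b1_0010_1100: the low 7 bits (44) go out first as 0xAC with the continuation bit set, and the remaining value 2 follows as 0x02, so small counts such as the number of fields in a document cost a single byte.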

Detailed level 1.2.4.2.2.x

— — — — —1.2.4.2.2.1. The writeByte() method of RAMOutputStream

output is an instance of RAMOutputStream, so bytes are written with RAMOutputStream's writeByte() method, defined as:

public void writeByte(byte b) throws IOException {
    if (bufferPosition == bufferLength) {
      currentBufferIndex++;
      switchCurrentBuffer();
    }
    currentBuffer[bufferPosition++] = b;
}

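If bufferPosition == bufferLength, the current buffer is full (the write position has reached its end) and more room is needed. The buffer index is incremented and switchCurrentBuffer() is called, which allocates buffers on demand: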

private final void switchCurrentBuffer() throws IOException {
    if (currentBufferIndex == file.buffers.size()) {
      currentBuffer = file.addBuffer(BUFFER_SIZE);
    } else {
      currentBuffer = (byte[]) file.buffers.get(currentBufferIndex);
    }
    bufferPosition = 0;
    bufferStart = BUFFER_SIZE * currentBufferIndex;
    bufferLength = currentBuffer.length;
}

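Otherwise the buffer is not full and the byte can be written directly.

In this way the FieldInfo count is written to the current buffer as a variable-length integer: each byte carries 7 bits of the value, so a count below 128 takes a single byte and larger counts take more bytes. Nothing is padded and nothing is truncated.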
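The way writeByte() and switchCurrentBuffer() grow storage one fixed-size block at a time can be reproduced with a short sketch. It is an illustration under assumptions: ChunkedBufferSketch is a hypothetical class, the block list lives inside it rather than in a separate RAMFile, and BUFFER_SIZE is set to 4 only to keep the demonstration small.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of an output that grows by fixed-size blocks instead of resizing one array.
final class ChunkedBufferSketch {
    static final int BUFFER_SIZE = 4; // deliberately tiny for the demo; an assumption, not Lucene's value

    private final List<byte[]> buffers = new ArrayList<>();
    private byte[] currentBuffer;
    private int currentBufferIndex = -1; // no buffer selected yet
    private int bufferPosition = 0;
    private int bufferLength = 0;

    void writeByte(byte b) {
        if (bufferPosition == bufferLength) { // current block is full (or none exists yet)
            currentBufferIndex++;
            switchCurrentBuffer();
        }
        currentBuffer[bufferPosition++] = b;
    }

    private void switchCurrentBuffer() {
        if (currentBufferIndex == buffers.size()) {
            currentBuffer = new byte[BUFFER_SIZE]; // past the last block: allocate a new one
            buffers.add(currentBuffer);
        } else {
            currentBuffer = buffers.get(currentBufferIndex); // reuse an existing block
        }
        bufferPosition = 0;
        bufferLength = currentBuffer.length;
    }

    int bufferCount() {
        return buffers.size();
    }

    long length() {
        return currentBufferIndex < 0 ? 0 : (long) currentBufferIndex * BUFFER_SIZE + bufferPosition;
    }

    public static void main(String[] args) {
        ChunkedBufferSketch out = new ChunkedBufferSketch();
        for (int i = 0; i < 5; i++) {
            out.writeByte((byte) i);
        }
        System.out.println("bytes=" + out.length() + ", blocks=" + out.bufferCount());
    }
}
```

Appending a block never copies bytes that were already written, which is the main advantage of this layout over a single array that has to be reallocated as it grows.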


2009-02-06 14:06