【Lucene3.0 初窥】索引创建(4)：DocumentWriter 处理流程三

Heart.X.Raid

浏览: 905271 次
性别:
来自: 武汉

最近访客更多访客>>

rocleft

dy.f

leoeco2000

uule

博主相关

博客

微博

相册

留言

关于我

文章分类

社区版块

存档分类

博客分类：

信息检索与搜索引擎

数据结构 lucene 全文检索

上接《索引创建(3)：DocumentWriter 处理流程二》

1.3.3 第三车间—— TermsHashPerField & FreqProxTermsWriterPerField

TermsHashPerField和FreqProxTermsWriterPerField负责将token信息(字符串内容termTest，所在文档编号docID，所在文档中的位置position，所在文档中的词频frequence)添加到索引的Hash表结构(postingsHash)中。事实上，这些信息并不是直接存放在Hash表中的，而是存放在三个很重要的数据缓存池中(CharBlockPool、IntBlockPool、ByteBlockPool) 。而postingsHash中存放的只是数据在三个数据池中的地址偏移。

★ TermsHashPerField

这里首先简单介绍一下这三个数据池，想要深刻了解这三个数据池，可见《索引数据池及内存数据细节》。

1) CharBlockPool: 存储token的字面信息

2) ByteBlockPool: 存储token所在文档编号，位置和词频信息

3) IntBlockPool: 存储对应的ByteBlockPool中slice的位置

TermsHashPerField类的主要职责就是建立和维护一张postingsHash的倒排索引表。这张表是一张以token字符串作为关键字的Hash表。其结构如下：

final class TermsHashPerField extends InvertedDocConsumerPerField {

    private int postingsHashSize = 4;
    //倒排索引表的Hash数组结构，初始大小为4
    private RawPostingList[] postingsHash = new RawPostingList[postingsHashSize];
}

其中postingHash[]的每个元素都是RawPostingList类型的。该类在内存中表示一个由token唯一标示的posting list(倒排索引结构：token-> posting list)。

//该类在内存中表示一个由token唯一标示的posting list(倒排索引结构: token -> posting list)。
abstract class RawPostingList {
     int textStart;   //存储token信息在CharBlockPool中的初始位置 
     int intStart;     //存储token信息在IntBlockPool中的初始位置
     int byteStart;  //存储token信息在ByteBlockPool中的初始位置
}

实际上postingHash[]的每个元素的实际类型是PostingList。这个类型包含的数据域如下：

static final class PostingList extends RawPostingList {
    int docFreq;  //此词在当前文档中出现的次数
    int lastDocID;  // 上次处理完的包含此词的文档号。
    int lastDocCode;  // 文档号和词频按照或然跟随原则形成的编码
    int lastPosition;  // 上次处理完的此词的位置
}

TermsHashPerField.add()方法是创建待排索引的核心方法，我们分功能段来解析它的源代码：

// 将token加入到倒排索引的Hash链表结构中：token -> posting list
@Override
void add() throws IOException {....}

1、首先，计算token的hashcode值。很显然，下面的代码表明处理token中的字符采用的是UTF-16编码方式(Java对字符串都是Unicode编码的)。想要了解UTF-16编码方式，可参见《解析Unicode编码和Java char 》。另外，token的hashcode计算公式是code=(code*31)+ch，其中ch是token从末尾开始向前遍历的字符。

//得到当前token的内容
final char[] tokenText = termAtt.termBuffer();
//得到当前token的长度
final int tokenTextLen = termAtt.termLength();

int downto = tokenTextLen;
int code = 0;
//计算token的hashcode(其中每个字符按照的UTF-16编码方式进行编码)
while (downto > 0) {
    //从后往前取出token中的每一个字符
    char ch = tokenText[--downto];
    //判断ch是不是Unicode编码中的替代区域(surrogate area),这个区域不表示任何字符
    if (ch >= UnicodeUtil.UNI_SUR_LOW_START && ch <= UnicodeUtil.UNI_SUR_LOW_END) {
           if (0 == downto) {
               ch = tokenText[downto] = UnicodeUtil.UNI_REPLACEMENT_CHAR;
          } else {
              //取出连续两个字符进行hashcode计算(UTF-16编码方式)
               final char ch2 = tokenText[downto-1];
               if (ch2 >= UnicodeUtil.UNI_SUR_HIGH_START && ch2 <= UnicodeUtil.UNI_SUR_HIGH_END) {
                   code = ((code*31) + ch)*31+ch2;
                   downto--;
                   continue;
               } else {
                   ch = tokenText[downto] = UnicodeUtil.UNI_REPLACEMENT_CHAR;
               }            
          }
      } else if (ch >= UnicodeUtil.UNI_SUR_HIGH_START && (ch <= UnicodeUtil.UNI_SUR_HIGH_END ||
                                                          ch == 0xffff)) {
        ch = tokenText[downto] = UnicodeUtil.UNI_REPLACEMENT_CHAR;
      }
      //如果ch是正常字符，则开始计算token的hashcode
      code = (code*31) + ch;
}//end while

2、其次，根据token的hashcode定位到postingHash[]上的位置。我们通过code & postingHashMask来找到当前token在postingHash[]上的位置(其中postingsHashMask = postingsHashSize-1)。当产生Hash冲突的时候，其解决办法就是不断计算新的位置直到不产生冲突为止。

这里顺便提一下：code & postingHashMask 和code % postingHashMask是完全等价的，而且位运算效率更高，Lucene经常使用位运算。

//计算当前token将要插入在Hash表中的位置，code & postingsHashMask等价于code % postingHashMask
//postingsHashMask=postingsHashSize-1
int hashPos = code & postingsHashMask; 
//取出原hash表中hashPos位置上的数据
p = postingsHash[hashPos];
//如果原Hash表中的该位置上有数据并且两个token的字符串内容不等，则产生Hash冲突
if (p != null && !postingEquals(tokenText, tokenTextLen)) {
    // 冲突解决方法：不断计算新的hashcode，直到避开冲突位置
    final int inc = ((code>>8)+code)|1;
    do {
        code += inc;
        hashPos = code & postingsHashMask;
        p = postingsHash[hashPos];
     } while (p != null && !postingEquals(tokenText, tokenTextLen));
}

3、最后，将token的各种信息写入数据池，并在PostingList中存储数据池的地址偏移。整个过程有两种可能：

(1) 如果当前token在前面从未遇到过，也就是已经加入索引结构的所有词都没有这个词语。那么首先需要在postingHash中开辟一个新的PostingList准备存放当前token所对应信息(code line :18，33)。接着在CharBlackPool中创建一个新的区域来存储当前token的字符串信息(line: 27)，并且将地址偏移记录在新创建的PostingList.textStart中(line: 25)。然后就是在ByteBlockPool中开辟两个slice (line:57，58)。并且在IntBlockPool中开辟两个空间存储ByteBlockPool中的新开辟的slice地址偏移(line:57，60)。最后调用FreqProxTermsWriterPerField.newTerm() 将token的docID+freq和position信息存储进去(line: 67)。

(2) 如果当前token在前面已经遇到过了，此时就不需要在三大数据池中分配新的空间存放了。直接调用FreqProxTermsWriterPerField.addTerm()将信息存储进去(line:72)。

//说明当前token以前的文本中从未出现过
if (p == null) {
    final int textLen1 = 1+tokenTextLen;   
    //当CharBlockPool当前的buffer空间不足时，则重新分配一个新的buffer
    if (textLen1 + charPool.charUpto > DocumentsWriter.CHAR_BLOCK_SIZE) {
       if (textLen1 > DocumentsWriter.CHAR_BLOCK_SIZE) {
            if (docState.maxTermPrefix == null)
               docState.maxTermPrefix = new String(tokenText, 0, 30);
       consumer.skippingLongTerm();
       return;
       }  
       charPool.nextBuffer();
    }
      
    if (0 == perThread.freePostingsCount)
        perThread.morePostings();
    //从空闲的posting中分配新的posting
    p = perThread.freePostings[--perThread.freePostingsCount];
    assert p != null;

    //将当前token的内容写入CharBlockPool中,此时text和CharBlockPool中的当前buffer都指向同一块内存区域
    final char[] text = charPool.buffer;
    final int textUpto = charPool.charUpto;
    //PostingList类中的textStart保存的是当前token首字母在CharBlockPool中的位置
    p.textStart = textUpto + charPool.charOffset;
    charPool.charUpto += textLen1;
    System.arraycopy(tokenText, 0, text, textUpto, tokenTextLen);
    //每个词当中存放一个分隔符'#66535'(66535号字符,我们用“#66535”表示)
    text[textUpto+tokenTextLen] = 0xffff;
          
    assert postingsHash[hashPos] == null;
    //将postingHash[hashPos]指向刚开辟的空闲RawPostList链表
    postingsHash[hashPos] = p;
    //记录链表数量
    numPostings++;

    //如果postingsHash的加载因子达到了50%，则扩大2倍的postingsHash容量
    if (numPostings == postingsHashHalfSize)
        rehashPostings(2*postingsHashSize);

    //当IntBlockPool中buffer容量不足时，分配一个新buffer
    if (numPostingInt + intPool.intUpto > DocumentsWriter.INT_BLOCK_SIZE)
        intPool.nextBuffer();
    //当ByteBlockPool中buffer容量不足时，分配一个新buffer
    if (DocumentsWriter.BYTE_BLOCK_SIZE - bytePool.byteUpto < numPostingInt*ByteBlockPool.FIRST_LEVEL_SIZE)
        bytePool.nextBuffer();

    intUptos = intPool.buffer;
    intUptoStart = intPool.intUpto;
    //streamCount=2在ByteBlockPool对一个token同时开辟两个大小一样的slice，一个存放docID+frequence，另一个存放positive.
    //并且在IntBlockPool中也同时开辟两个空间，用于分别存放一个token对应在ByteBlockPool中两个slice的地址偏移
    intPool.intUpto += streamCount;
      
    //PostingList类中的intStart保存的是当前token在IntBlockPool中的两个存储空间的第一个地址
    p.intStart = intUptoStart + intPool.intOffset;
    //每一次记录一个token，都需要在ByteBlckPool中开辟两个块来记录: docID+freq(文档ID+词频) 和 prox(位置)
    for(int i=0;i<streamCount;i++) {
        final int upto = bytePool.newSlice(ByteBlockPool.FIRST_LEVEL_SIZE);
       //IntBlockPool用来存储ByteBlockPool每次开辟的块的初始位置
       intUptos[intUptoStart+i] = upto + bytePool.byteOffset;
    }
    //PostingList类中的byteStart用于存储当前token使用ByteBlckPool开辟的空间的初始位置
    //也就是刚开辟的两个块中第一个块的初始位置
    p.byteStart = intUptos[intUptoStart];
      
    //token原来没有出现过的时候，FreqProxTermsWriterPerField调用newTerm()记录docID,freq和position
    consumer.newTerm(p);

} else { //如果此Token之前曾经出现过，FreqProxTermsWriterPerField调用addTerm()记录docID,freq和position
      intUptos = intPool.buffers[p.intStart >> DocumentsWriter.INT_BLOCK_SHIFT];
      intUptoStart = p.intStart & DocumentsWriter.INT_BLOCK_MASK;
      consumer.addTerm(p);
}

我们在《全文检索基本理论》中讲到的倒排索引结构是 token -> posting list的形式，而文档结合的所有token组成了一个Dictionary。 TermsHashPerField类的作用就是建立这样一个倒排索引结构——postingHash。其中 postingHash[](哈希数组)是以token的字面值作为关键字的，相当于Dictionary。而 postingHash的每一个元素都指向了PostingList对象，这个对象就是用来存储指定token所对应的posting list信息(包括docID，freq，position)。实际上，真正的信息是存储在三大数据池中的，但 PostingList对象只存储三大数据池中的地址偏移。我们通过上面的代码可以发现： TermsHashPerField已经把token的字面值存储在CharBlockPool中了，并且在ByteBlockPool中分配好的存储空间，并将地址偏移记录到了IntBlockPool中了。接下来要做的就是把 token所对应的docID，freq，position的信息通过 FreqProxTermsWriterPerField 的方法写入ByteBlockPool。

★ FreqProxTermsWriterPerField

(1)当出现一个索引结构中没有的token时，我们需要调用newTerm()方法将token所对应的 docID，freq，position信息存储到ByteBlockPool中。

final void newTerm(RawPostingList p0) {
    assert docState.testPoint("FreqProxTermsWriterPerField.newTerm start");
    FreqProxTermsWriter.PostingList p = (FreqProxTermsWriter.PostingList) p0;
    //当一个新的term出现的时候，包含此Term的就只有本篇文档，记录其ID 
    p.lastDocID = docState.docID;
    if (omitTermFreqAndPositions) {
        p.lastDocCode = docState.docID;
    } else {
        //docCode是文档ID左移一位，为什么左移，这就需要Lucene的索引文件结构来解释了。
        p.lastDocCode = docState.docID << 1;
        p.docFreq = 1;
       //写入position信息到bytePool中，此时freq信息还不能写入，因为当前的文档还没有处理完，尚不知道此文档包含此token的总数。 
       writeProx(p, fieldState.position);
    }
}

(2) 当出现的token在索引结构中已经存在，我们需要调用addTerm()方法将token所对应的 docID，freq，position信息存储到 ByteBlockPool中。

final void addTerm(RawPostingList p0) {
    assert docState.testPoint("FreqProxTermsWriterPerField.addTerm start");
    //取出postingHash已经存在的PostingList
    FreqProxTermsWriter.PostingList p = (FreqProxTermsWriter.PostingList) p0;

    assert omitTermFreqAndPositions || p.docFreq > 0;

    if (omitTermFreqAndPositions) {
      //p.lastDocID记录了上一次token写入索引结构的docID
      //docState.docID记录的是token将要写入索引结构的当前docID
      if (docState.docID != p.lastDocID) {
        assert docState.docID > p.lastDocID;
        termsHashPerField.writeVInt(0, p.lastDocCode);
        p.lastDocCode = docState.docID - p.lastDocID;
        p.lastDocID = docState.docID;
      }
    } else {
      if (docState.docID != p.lastDocID) {
        assert docState.docID > p.lastDocID;
        //如果当前的docID与上一次相同的token写入的docID不同
        //则表明上一篇文本中该token已经处理完毕，则将freq信息ByteBlockPool中
        if (1 == p.docFreq)
          termsHashPerField.writeVInt(0, p.lastDocCode|1);
        else {
          termsHashPerField.writeVInt(0, p.lastDocCode);
          termsHashPerField.writeVInt(0, p.docFreq);
        }
        p.docFreq = 1;//对于新的文档，freq还是为1. 
        p.lastDocCode = (docState.docID - p.lastDocID) << 1;//文档号存储差值 
        p.lastDocID = docState.docID;
        writeProx(p, fieldState.position);
      } else {
    	//当文档ID不变的时候，说明此文档中这个词又出现了一次，从而freq加1，写入再次出现的位置信息，用差值存储。这里不写入freq，因为该文档还没有结束
        p.docFreq++;
        //将position信息写入ByteBlockPool中
        writeProx(p, fieldState.position-p.lastPosition);
      }
    }
}

上面newTerm()和addTerm()方法都需要调用writeProx()方法将position信息写入ByteBlockPool中

//将position信息写入ByteBlockPool中
final void writeProx(FreqProxTermsWriter.PostingList p, int proxCode) {
    final Payload payload;
    //payloadAttribute是token的元数据信息
    if (payloadAttribute == null) {
      payload = null;
    } else {
      payload = payloadAttribute.getPayload();
    }
    
    //将position信息写入ByteBlockPool中
    //其中writeVInt()第一个参数1表示将position写入开辟在ByteBlockPool中两个slice的第2个中
    //第一个slice存放docID+freq，第二个slice存放position
   if (payload != null && payload.length > 0) {
      termsHashPerField.writeVInt(1, (proxCode<<1)|1);
      termsHashPerField.writeVInt(1, payload.length);
      termsHashPerField.writeBytes(1, payload.data, payload.offset, payload.length);
      hasPayloads = true;      
    } else
      termsHashPerField.writeVInt(1, proxCode<<1);
    p.lastPosition = fieldState.position;
}

实际上，写入ByteBlockPool中的数据到底是什么样子的呢？还有在CharBlockPool中是如何存储token的字面值的呢？IntBlockPool又是怎么样记录ByteBlockPool中的地址偏移的呢？这些问题我们将在《索引数据池及其存储细节》详细阐明.

★ 总结：

我们以前面所举的例子为例，将《索引数据池及内存数据细节》中的content field的第一个token="lucene"(docID=0，position=1)创建索引结构，其图示如下：

4
顶

0
踩

分享到：

【排序结构4】归并排序 | 【排序结构3】选择排序

2010-04-15 15:36
浏览 3662
评论(0)
分类:互联网
查看更多

发表评论

您还没有登录,请您登录后再发表评论

最近访客更多访客>>

博主相关

文章分类

社区版块

存档分类

最新评论