
Lucene 4.3 Advanced Development: A Thousand Elephants Trumpeting (Part 12)

Posted: 2014-01-18
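DocValues: stored field values are normally retrieved by docid. When faster access is needed, the values can instead be kept as DocValues and loaded into main memory. Stored values serve to present search results at retrieval time, whereas DocValues are especially useful for things such as scoring factors.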
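
To make the distinction concrete, here is a minimal sketch of the idea behind a numeric DocValues field: one value per document, held in memory and indexed directly by docid. This is plain Java, not the Lucene API; the class name, the `popularity` data and the boost formula are all made up for illustration.

```java
// Illustrative only: a column-stride array standing in for a numeric DocValues field.
final class DocValuesSketch {
    // One value per document, indexed directly by docid (hypothetical data).
    private final long[] popularity;

    DocValuesSketch(long[] popularity) {
        this.popularity = popularity;
    }

    // Constant-time lookup by docid; no stored document has to be decoded.
    long get(int docId) {
        return popularity[docId];
    }

    // Hypothetical scoring tweak: scale the text score by a per-document factor.
    double boostedScore(int docId, double textScore) {
        return textScore * (1.0 + Math.log1p(get(docId)));
    }

    public static void main(String[] args) {
        DocValuesSketch dv = new DocValuesSketch(new long[] {0L, 9L, 99L});
        for (int doc = 0; doc < 3; doc++) {
            System.out.println("doc " + doc + " -> " + dv.boostedScore(doc, 1.0));
        }
    }
}
```

The point of the sketch is only the access pattern: a scorer that needs a per-document number for every hit can read it from an array instead of loading the stored document.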

DocValues files use the extensions .dv.cfs and .dv.cfe.
The data is kept in a single compound file, which contains:

<segment>_<fieldNumber>.dat: stores the data values
<segment>_<fieldNumber>.idx: an index into the .dat file


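DocValues come in several types with different encodings. In terms of file names, every type stores its values in a .dat entry of the compound file. For the dereferenced/sorted types, the .dat entry actually holds only the unique values, and an additional .idx entry stores the pointers that reference those values.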

Fixed (fixed length)
Var (variable length)

The DocValues storage formats are as follows:
VAR_INTS .dat --> Header, PackedType, MinValue, DefaultValue, PackedStream
FIXED_INTS_8 .dat --> Header, ValueSize, Byte^maxdoc
FIXED_INTS_16 .dat --> Header, ValueSize, Short^maxdoc
FIXED_INTS_32 .dat --> Header, ValueSize, Int32^maxdoc
FIXED_INTS_64 .dat --> Header, ValueSize, Int64^maxdoc
FLOAT_32 .dat --> Header, ValueSize, Float32^maxdoc
FLOAT_64 .dat --> Header, ValueSize, Float64^maxdoc
BYTES_FIXED_STRAIGHT .dat --> Header, ValueSize, (Byte * ValueSize)^maxdoc
BYTES_VAR_STRAIGHT .idx --> Header, TotalBytes, Addresses
BYTES_VAR_STRAIGHT .dat --> Header, (Byte * variable ValueSize)^maxdoc
BYTES_FIXED_DEREF .idx --> Header, NumValues, Addresses
BYTES_FIXED_DEREF .dat --> Header, ValueSize, (Byte * ValueSize)^NumValues
BYTES_VAR_DEREF .idx --> Header, TotalVarBytes, Addresses
BYTES_VAR_DEREF .dat --> Header, (LengthPrefix + Byte * variable ValueSize)^NumValues
BYTES_FIXED_SORTED .idx --> Header, NumValues, Ordinals
BYTES_FIXED_SORTED .dat --> Header, ValueSize, (Byte * ValueSize)^NumValues
BYTES_VAR_SORTED .idx --> Header, TotalVarBytes, Addresses, Ordinals
BYTES_VAR_SORTED .dat --> Header, (Byte * variable ValueSize)^NumValues

Data types:

Header --> CodecHeader
PackedType --> Byte
MaxAddress, MinValue, DefaultValue --> Int64
PackedStream, Addresses, Ordinals --> PackedInts
ValueSize, NumValues --> Int32
Float32 --> 32-bit float encoded with Float.floatToRawIntBits(float) then written as Int32
Float64 --> 64-bit float encoded with Double.doubleToRawLongBits(double) then written as Int64
TotalBytes --> VLong
TotalVarBytes --> Int64
LengthPrefix --> Length of the data value as VInt (maximum of 2 bytes)
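
As a rough illustration of the Float32 rule above, the sketch below writes each float as its raw bit pattern in a 4-byte integer and reads one back by docid. It covers only the value payload: the Header and ValueSize fields are left out, the class and method names are hypothetical, and the big-endian byte order is simply `ByteBuffer`'s default, assumed here rather than taken from the format description.

```java
import java.nio.ByteBuffer;

// Sketch of the FLOAT_32 value payload only (no Header, no ValueSize field).
final class Float32Codec {
    // Each float becomes its raw IEEE 754 bits, written as one Int32 per document.
    static byte[] encode(float[] values) {
        ByteBuffer buf = ByteBuffer.allocate(values.length * Integer.BYTES);
        for (float v : values) {
            buf.putInt(Float.floatToRawIntBits(v));
        }
        return buf.array();
    }

    // Fixed-width values allow direct addressing: the value of docId starts at docId * 4.
    static float decode(byte[] data, int docId) {
        int rawBits = ByteBuffer.wrap(data).getInt(docId * Integer.BYTES);
        return Float.intBitsToFloat(rawBits);
    }

    public static void main(String[] args) {
        byte[] data = encode(new float[] {1.5f, -0.0f, 3.25f});
        System.out.println(decode(data, 2));
    }
}
```

Going through the raw bits keeps special values such as -0.0 and NaN payloads intact, which a text or numeric conversion would not guarantee.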



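(1) PackedType is 0 when the stream is compressed (packed), and 1 when the stream is written as 64-bit integers.
(2) Addresses store pointers to the actual byte positions, indexed by docid. In the VAR_STRAIGHT case each value can have a different length, so the length is determined by also reading the address for docid+1.
(3) Ordinals store the term IDs in sorted order, indexed by docid. In the FIXED_SORTED case the address into the .dat file can be computed as Header+ValueSize+(ordinal*ValueSize), because every value has the same length. In the VAR_SORTED case the address is resolved indirectly (docid -> ordinal -> address), and the length is determined from the address of ord+1.
(4) BYTES_VAR_STRAIGHT: unlike the other straight formats, BYTES_VAR_STRAIGHT uses an .idx file to improve lookup performance.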
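
The address arithmetic in notes (2) and (3) can be sketched as follows. This is not Lucene code: the names are hypothetical, the first ValueSize term of the formula is read as the 4-byte Int32 field that follows the header, and the address arrays are assumed to carry one extra trailing entry so that the "+1" lookup is valid for the last value.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Sketch of the lookups described in notes (2) and (3); all names are hypothetical.
final class DocValuesAddressing {
    // Assumption: the ValueSize field itself occupies one Int32 (4 bytes) after the header.
    static final int VALUE_SIZE_FIELD_BYTES = 4;

    // FIXED_SORTED: Header + ValueSize + (ordinal * ValueSize).
    static long fixedSortedOffset(long headerLength, int valueSize, int ordinal) {
        return headerLength + VALUE_SIZE_FIELD_BYTES + (long) ordinal * valueSize;
    }

    // VAR_STRAIGHT: the value of docId spans addresses[docId] .. addresses[docId + 1].
    static byte[] varStraightValue(byte[] payload, long[] addresses, int docId) {
        return Arrays.copyOfRange(payload, (int) addresses[docId], (int) addresses[docId + 1]);
    }

    // VAR_SORTED: docid -> ordinal -> address, with the length taken from ord + 1.
    static byte[] varSortedValue(byte[] payload, long[] addresses, int[] ordinals, int docId) {
        int ord = ordinals[docId];
        return Arrays.copyOfRange(payload, (int) addresses[ord], (int) addresses[ord + 1]);
    }

    public static void main(String[] args) {
        byte[] payload = "abbccc".getBytes(StandardCharsets.US_ASCII);
        long[] addresses = {0, 1, 3, 6}; // three values plus the assumed trailing entry
        int[] ordinals = {2, 0, 2};      // docs 0 and 2 share the value with ordinal 2
        System.out.println(new String(varStraightValue(payload, addresses, 1), StandardCharsets.US_ASCII));
        System.out.println(new String(varSortedValue(payload, addresses, ordinals, 0), StandardCharsets.US_ASCII));
        System.out.println(fixedSortedOffset(20, 8, 3));
    }
}
```

In the sorted case several documents can point at the same ordinal, which is why the .dat side only needs to hold each distinct value once.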


Limitations:
Binary doc values are limited to MAX_BINARY_FIELD_LENGTH in length.



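The .del file marks the documents that have been deleted from the index. It is created only when a segment contains deletions. Although it is a per-segment file, it is maintained outside the compound segment file.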

Deletions (.del) --> Format,Header,ByteCount,BitCount, Bits | DGaps (depending on Format)

Format, ByteCount, BitCount --> Uint32
Bits --> <Byte>^ByteCount
DGaps --> <DGap,NonOnesByte>^NonOnesBytesCount
DGap --> VInt
NonOnesByte --> Byte
Header --> CodecHeader
Format is 1: indicates cleared DGaps.

ByteCount indicates the number of bytes in Bits. It is typically (SegSize/8)+1.

BitCount indicates the number of bits that are currently set in Bits.

Bits contains one bit for each document indexed. When the bit corresponding to a document number is cleared, that document is marked as deleted. Bit ordering is from least to most significant. Thus, if Bits contains two bytes, 0x00 and 0x02, then document 9 is marked as alive (not deleted).
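
A small sketch of that bit layout, using the two-byte example from the paragraph above. The class and method names are hypothetical; only the bit arithmetic (least significant bit first within each byte) comes from the description.

```java
// Sketch of the Bits layout: bit (docId % 8) of byte (docId / 8), least significant bit first.
final class LiveDocsBits {
    // A set bit means the document is alive; a cleared bit means it is deleted.
    static boolean isLive(byte[] bits, int docId) {
        return (bits[docId >> 3] & (1 << (docId & 7))) != 0;
    }

    // Number of set bits, which is what BitCount records.
    static int countLive(byte[] bits) {
        int live = 0;
        for (byte b : bits) {
            live += Integer.bitCount(b & 0xFF);
        }
        return live;
    }

    public static void main(String[] args) {
        byte[] bits = {0x00, 0x02};
        System.out.println(isLive(bits, 9)); // true
        System.out.println(isLive(bits, 8)); // false
        System.out.println(countLive(bits)); // 1
    }
}
```

Byte 1 covers documents 8 to 15, and 0x02 sets its second-lowest bit, so document 9 is the only live one in this example.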

DGaps represents sparse bit-vectors more efficiently than Bits. It is made of DGaps on indexes of nonOnes bytes in Bits, and the nonOnes bytes themselves. The number of nonOnes bytes in Bits (NonOnesBytesCount) is not stored.

For example, if there are 8000 bits and only bits 10,12,32 are cleared, DGaps would be used:

(VInt) 1 , (byte) 20 , (VInt) 3 , (Byte) 1
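
The gap encoding can be sketched as below. This is a hypothetical helper, not Lucene's own writer: it emits a (gap, byte) pair for every byte that differs from a filler byte, where the gap is the distance from the previously emitted byte index. With bits 10, 12 and 32 set in an otherwise all-zero vector (filler 0x00) the sketch produces 1, 20, 3, 1, the sequence quoted above. With the same three bits cleared in an all-ones vector (filler 0xFF), the byte positions and gaps are identical but the bytes come out as 235 (0xEB) and 254 (0xFE).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Sketch of d-gap encoding over a byte-aligned bit vector; names are hypothetical.
final class DGapsSketch {
    // Builds a vector of numBits bits, all set or all clear, then flips the listed bits.
    static byte[] vector(int numBits, boolean allOnes, int... flipped) {
        byte[] bits = new byte[(numBits + 7) / 8];
        if (allOnes) {
            Arrays.fill(bits, (byte) 0xFF);
        }
        for (int bit : flipped) {
            bits[bit >> 3] ^= (byte) (1 << (bit & 7));
        }
        return bits;
    }

    // Emits alternating (gap, byte) entries for every byte that differs from filler.
    static List<Integer> encode(byte[] bits, int filler) {
        List<Integer> out = new ArrayList<>();
        int last = 0;
        for (int i = 0; i < bits.length; i++) {
            int b = bits[i] & 0xFF;
            if (b != filler) {
                out.add(i - last); // gap from the previously emitted byte index
                out.add(b);        // the byte itself
                last = i;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(encode(vector(8000, false, 10, 12, 32), 0x00)); // [1, 20, 3, 1]
        System.out.println(encode(vector(8000, true, 10, 12, 32), 0xFF));  // [1, 235, 3, 254]
    }
}
```

Either way, 1000 bytes of mostly uniform data shrink to two pairs, which is the reason the sparse form exists.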





