liuxinglanyue

Lucene 3.0.2 Index File Formats: Official Documentation (Part 2)


Deletable File

No deletable file is written. Instead, a writer dynamically computes which files are deletable.

Compound Files

Starting with Lucene 1.4 the compound file format became default. This is simply a container for all files described in the next section (except for the .del file).

Compound (.cfs) --> FileCount, <DataOffset, FileName>^FileCount, FileData^FileCount

FileCount --> VInt

DataOffset --> Long

FileName --> String

FileData --> raw file data

The raw file data is the data from the individual files named above.

Starting with Lucene 2.3, doc store files (stored field values and term vectors) can be shared in a single set of files for more than one segment. When compound file is enabled, these shared files will be added into a single compound file (same format as above) but with the extension .cfx.
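The compound file layout above can be read back with a short sketch. It assumes the primitive encodings used elsewhere in this format: a VInt stores 7 bits per byte with the low-order group first and the high bit as a continuation flag, a Long is 8 bytes with the high-order byte first, and a String is a VInt byte length followed by UTF-8 bytes (an assumption that holds for plain ASCII file names). The function names are hypothetical.

```python
import struct

def read_vint(buf, pos):
    """Decode one VInt starting at pos; return (value, next_pos)."""
    value, shift = 0, 0
    while True:
        b = buf[pos]
        pos += 1
        value |= (b & 0x7F) << shift   # low 7 bits carry data
        if b & 0x80 == 0:              # high bit clear: last byte
            return value, pos
        shift += 7

def read_cfs_directory(buf):
    """Return [(file_name, data_offset, length)] for a .cfs image held in bytes."""
    count, pos = read_vint(buf, 0)     # FileCount
    entries = []
    for _ in range(count):
        (offset,) = struct.unpack_from(">q", buf, pos)  # DataOffset (Long)
        pos += 8
        name_len, pos = read_vint(buf, pos)             # FileName (String, assumed UTF-8)
        name = buf[pos:pos + name_len].decode("utf-8")
        pos += name_len
        entries.append((name, offset))
    # Assumption: file data is stored in directory order, so each file
    # ends where the next one starts (or at the end of the container).
    ends = [off for _, off in entries[1:]] + [len(buf)]
    return [(name, off, end - off) for (name, off), end in zip(entries, ends)]
```

The directory stores no lengths, so the sketch derives each length from the next entry's offset.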

Per-Segment Files

The remaining files are all per-segment, and are thus defined by suffix.

Fields


Field Info

Field names are stored in the field info file, with suffix .fnm.

FieldInfos (.fnm) --> FNMVersion, FieldsCount, <FieldName, FieldBits>^FieldsCount

FNMVersion, FieldsCount --> VInt

FieldName --> String

FieldBits --> Byte

 

  • The low-order bit is one for indexed fields, and zero for non-indexed fields.
  • The second lowest-order bit is one for fields that have term vectors stored, and zero for fields without term vectors.
  • If the third lowest-order bit is set (0x04), term positions are stored with the term vectors.
  • If the fourth lowest-order bit is set (0x08), term offsets are stored with the term vectors.
  • If the fifth lowest-order bit is set (0x10), norms are omitted for the indexed field.
  • If the sixth lowest-order bit is set (0x20), payloads are stored for the indexed field.

 

FNMVersion (added in 2.9) is always -2.

Fields are numbered by their order in this file. Thus field zero is the first field in the file, field one the next, and so on. Note that, like document numbers, field numbers are segment relative.
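The FieldBits flags listed above can be unpacked with simple masks. This is a minimal sketch; the flag names are hypothetical labels chosen for illustration, and only the mask values come from the list above.

```python
# Mask values follow the FieldBits description; the names are illustrative only.
FIELD_FLAGS = {
    0x01: "indexed",
    0x02: "term_vectors",
    0x04: "term_vector_positions",
    0x08: "term_vector_offsets",
    0x10: "omit_norms",
    0x20: "payloads",
}

def decode_field_bits(bits):
    """Return the set of flag names whose bit is set in a FieldBits byte."""
    return {name for mask, name in FIELD_FLAGS.items() if bits & mask}
```

For instance, a FieldBits value of 0x03 describes an indexed field with term vectors but without stored positions or offsets.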


Stored Fields

Stored fields are represented by two files:

  1. The field index, or .fdx file.

    This contains, for each document, a pointer to its field data, as follows:

    FieldIndex (.fdx) --> <FieldValuesPosition>^SegSize

    FieldValuesPosition --> Uint64

    This is used to find the location within the field data file of the fields of a particular document. Because it contains fixed-length data, this file may be easily randomly accessed. The position of document n's field data is the Uint64 at n*8 in this file.

  2.  The field data, or .fdt file.

    This contains the stored fields of each document, as follows:

    FieldData (.fdt) --> <DocFieldData>^SegSize

    DocFieldData --> FieldCount, <FieldNum, Bits, Value>^FieldCount

    FieldCount --> VInt

    FieldNum --> VInt

    Bits --> Byte

     

    • low order bit is one for tokenized fields
    • second bit is one for fields containing binary data
    • third bit is one for fields with the compression option enabled (the algorithm used is ZLIB); compression is only available for indexes written by Lucene versions up to 2.9.x

     

    Value --> String | BinaryValue (depending on Bits)

    BinaryValue --> ValueSize, <Byte>^ValueSize

    ValueSize --> VInt
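The fixed-width lookup in the field index can be sketched as follows. The sketch follows the layout exactly as described above (one Uint64 per document, starting at byte 0) and assumes the value is stored with the high-order byte first; any header or document-store offset an implementation might add is not modeled. The function name is hypothetical.

```python
import struct

def field_data_position(fdx, n):
    """Return the .fdt offset of document n's stored fields.

    fdx is the raw content of the .fdx file as bytes.
    """
    (position,) = struct.unpack_from(">Q", fdx, n * 8)  # Uint64 at n*8
    return position
```

Because every entry is eight bytes wide, the lookup is a single seek regardless of segment size.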

Term Dictionary

The term dictionary is represented as two files:

  1. The term infos, or .tis file.

    TermInfoFile (.tis)--> TIVersion, TermCount, IndexInterval, SkipInterval, MaxSkipLevels, TermInfos

    TIVersion --> UInt32

    TermCount --> UInt64

    IndexInterval --> UInt32

    SkipInterval --> UInt32

    MaxSkipLevels --> UInt32

    TermInfos --> <TermInfo>^TermCount

    TermInfo --> <Term, DocFreq, FreqDelta, ProxDelta, SkipDelta>

    Term --> <PrefixLength, Suffix, FieldNum>

    Suffix --> String

    PrefixLength, DocFreq, FreqDelta, ProxDelta, SkipDelta --> VInt

    This file is sorted by Term. Terms are ordered first lexicographically (by UTF16 character code) by the term's field name, and within that lexicographically (by UTF16 character code) by the term's text.

    TIVersion names the version of the format of this file and is equal to TermInfosWriter.FORMAT_CURRENT.

    Term text prefixes are shared. The PrefixLength is the number of initial characters from the previous term which must be pre-pended to a term's suffix in order to form the term's text. Thus, if the previous term's text was "bone" and the term is "boy", the PrefixLength is two and the suffix is "y".

    FieldNum determines the term's field, whose name is stored in the .fnm file.

    DocFreq is the count of documents which contain the term.

    FreqDelta determines the position of this term's TermFreqs within the .frq file. In particular, it is the difference between the position of this term's data in that file and the position of the previous term's data (or zero, for the first term in the file).

    ProxDelta determines the position of this term's TermPositions within the .prx file. In particular, it is the difference between the position of this term's data in that file and the position of the previous term's data (or zero, for the first term in the file). For fields with omitTf true, this will be 0 since prox information is not stored.

    SkipDelta determines the position of this term's SkipData within the .frq file. In particular, it is the number of bytes after TermFreqs that the SkipData starts. In other words, it is the length of the TermFreq data. SkipDelta is only stored if DocFreq is not smaller than SkipInterval.

  2.  The term info index, or .tii file.

    This contains every IndexInterval-th entry from the .tis file, along with its location in the .tis file. This is designed to be read entirely into memory and used to provide random access to the .tis file.

    The structure of this file is very similar to the .tis file, with the addition of one item per record, the IndexDelta.

    TermInfoIndex (.tii)--> TIVersion, IndexTermCount, IndexInterval, SkipInterval, MaxSkipLevels, TermIndices

    TIVersion --> UInt32

    IndexTermCount --> UInt64

    IndexInterval --> UInt32

    SkipInterval --> UInt32

    TermIndices --> <TermInfo, IndexDelta>^IndexTermCount

    IndexDelta --> VLong

    IndexDelta determines the position of this term's TermInfo within the .tis file. In particular, it is the difference between the position of this term's entry in that file and the position of the previous term's entry.

    SkipInterval is the fraction of TermDocs stored in skip tables. It is used to accelerate TermDocs.skipTo(int). Larger values result in smaller indexes, greater acceleration, but fewer accelerable cases, while smaller values result in bigger indexes, less acceleration (in case of a small value for MaxSkipLevels) and more accelerable cases.

    MaxSkipLevels is the maximum number of skip levels stored for each term in the .frq file. A low value results in smaller indexes but less acceleration; a larger value results in slightly larger indexes but greater acceleration. See the format of the .frq file for more information about skip levels.
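The prefix sharing used for term text in the .tis file can be sketched as a pair of helper functions. This is an illustration of the idea only: it works on Python strings and lists rather than on the on-disk VInt and String encodings, and the function names are hypothetical.

```python
def prefix_encode(terms):
    """Turn sorted term texts into (PrefixLength, Suffix) pairs."""
    pairs, prev = [], ""
    for term in terms:
        n = 0
        # Count the leading characters shared with the previous term.
        while n < min(len(prev), len(term)) and prev[n] == term[n]:
            n += 1
        pairs.append((n, term[n:]))
        prev = term
    return pairs

def prefix_decode(pairs):
    """Rebuild the term texts from (PrefixLength, Suffix) pairs."""
    terms, prev = [], ""
    for prefix_length, suffix in pairs:
        prev = prev[:prefix_length] + suffix
        terms.append(prev)
    return terms
```

With the terms "bone" and "boy", the second pair is (2, "y"), matching the example in the text. Sorted input makes neighboring terms share long prefixes, which is what makes the scheme effective.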

Frequencies

The .frq file contains the lists of documents which contain each term, along with the frequency of the term in that document (if omitTf is false).

FreqFile (.frq) --> <TermFreqs, SkipData>^TermCount

TermFreqs --> <TermFreq>^DocFreq

TermFreq --> DocDelta[, Freq?]

SkipData --> <<SkipLevelLength, SkipLevel>^(NumSkipLevels-1), SkipLevel> <SkipDatum>

SkipLevel --> <SkipDatum>^(DocFreq/(SkipInterval^(Level + 1)))

SkipDatum --> DocSkip,PayloadLength?,FreqSkip,ProxSkip,SkipChildLevelPointer?

DocDelta,Freq,DocSkip,PayloadLength,FreqSkip,ProxSkip --> VInt

SkipChildLevelPointer --> VLong

TermFreqs are ordered by term (the term is implicit, from the .tis file).

TermFreq entries are ordered by increasing document number.

DocDelta: if omitTf is false, this determines both the document number and the frequency. In particular, DocDelta/2 is the difference between this document number and the previous document number (or zero when this is the first document in a TermFreqs). When DocDelta is odd, the frequency is one. When DocDelta is even, the frequency is read as another VInt. If omitTf is true, DocDelta contains the gap (not multiplied by 2) between document numbers and no frequency information is stored.

For example, the TermFreqs for a term which occurs once in document seven and three times in document eleven, with omitTf false, would be the following sequence of VInts:

15, 8, 3

If omitTf were true it would be this sequence of VInts instead:

7,4
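The DocDelta rule can be sketched as an encoder and decoder over lists of VInt values. This is a minimal illustration that leaves out the byte-level VInt encoding; the function names and the (document number, frequency) tuple shape are assumptions made for the example.

```python
def encode_term_freqs(postings, omit_tf=False):
    """postings: (doc_number, freq) pairs in increasing doc order -> VInt values."""
    out, prev_doc = [], 0
    for doc, freq in postings:
        delta = doc - prev_doc
        prev_doc = doc
        if omit_tf:
            out.append(delta)              # plain gap, no frequency
        elif freq == 1:
            out.append(delta * 2 + 1)      # odd value: frequency is one
        else:
            out.extend([delta * 2, freq])  # even value: frequency follows
    return out

def decode_term_freqs(values, omit_tf=False):
    """Inverse of encode_term_freqs; freq is None when omit_tf is true."""
    postings, doc, i = [], 0, 0
    while i < len(values):
        v = values[i]
        i += 1
        if omit_tf:
            doc += v
            postings.append((doc, None))
            continue
        doc += v >> 1
        if v & 1:
            freq = 1
        else:
            freq = values[i]
            i += 1
        postings.append((doc, freq))
    return postings
```

Encoding [(7, 1), (11, 3)] reproduces the two sequences given in the example: 15, 8, 3 with frequencies and 7, 4 without.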

DocSkip records the document number before every SkipInterval-th document in TermFreqs. If payloads are disabled for the term's field, then DocSkip represents the difference from the previous value in the sequence. If payloads are enabled for the term's field, then DocSkip/2 represents the difference from the previous value in the sequence. If payloads are enabled and DocSkip is odd, then PayloadLength is stored, indicating the length of the last payload before the SkipInterval-th document in TermPositions. FreqSkip and ProxSkip record the position of every SkipInterval-th entry in FreqFile and ProxFile, respectively. File positions are relative to the start of TermFreqs and Positions for the first SkipDatum, and to the previous SkipDatum in the sequence for each later one.

For example, if DocFreq=35 and SkipInterval=16, then there are two SkipData entries, containing the 15th and 31st document numbers in TermFreqs. The first FreqSkip names the number of bytes after the beginning of TermFreqs that the 16th SkipDatum starts, and the second the number of bytes after that that the 32nd starts. The first ProxSkip names the number of bytes after the beginning of Positions that the 16th SkipDatum starts, and the second the number of bytes after that that the 32nd starts.

Each term can have multiple skip levels. The number of skip levels for a term is NumSkipLevels = Min(MaxSkipLevels, floor(log(DocFreq)/log(SkipInterval))). The number of SkipData entries for a skip level is DocFreq/(SkipInterval^(Level + 1)), where the lowest skip level is Level=0.
Example: SkipInterval = 4, MaxSkipLevels = 2, DocFreq = 35. Then skip level 0 has 8 SkipData entries, containing the 3rd, 7th, 11th, 15th, 19th, 23rd, 27th, and 31st document numbers in TermFreqs. Skip level 1 has 2 SkipData entries, containing the 15th and 31st document numbers in TermFreqs.
The SkipData entries on all upper levels > 0 contain a SkipChildLevelPointer referencing the corresponding SkipData entry on the level below (Level-1). In the example, entry 15 on level 1 has a pointer to entry 15 on level 0, and entry 31 on level 1 has a pointer to entry 31 on level 0.
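The skip-level arithmetic can be checked with a small sketch. It computes the number of levels with an integer loop, which is equivalent to taking the floor of the base-SkipInterval logarithm of DocFreq but avoids floating-point rounding, and it labels each entry the way the examples above do (15th, 31st, and so on). The function names are hypothetical.

```python
def num_skip_levels(doc_freq, skip_interval, max_skip_levels):
    """min(MaxSkipLevels, floor(log(DocFreq) / log(SkipInterval))), in integers."""
    levels, span = 0, skip_interval
    while span <= doc_freq and levels < max_skip_levels:
        levels += 1
        span *= skip_interval
    return levels

def skip_entries(doc_freq, skip_interval, max_skip_levels):
    """For each level, list which documents in TermFreqs have a skip entry."""
    result = []
    for level in range(num_skip_levels(doc_freq, skip_interval, max_skip_levels)):
        step = skip_interval ** (level + 1)
        count = doc_freq // step          # DocFreq / SkipInterval^(Level + 1)
        result.append([k * step - 1 for k in range(1, count + 1)])
    return result
```

Calling skip_entries(35, 4, 2) yields eight entries on level 0 and the two entries 15 and 31 on level 1, as in the example.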

Positions

The .prx file contains the lists of positions that each term occurs at within documents. Note that fields with omitTf true do not store anything into this file, and if all fields in the index have omitTf true then the .prx file will not exist.

ProxFile (.prx) --> <TermPositions>^TermCount

TermPositions --> <Positions>^DocFreq

Positions --> <PositionDelta,Payload?>^Freq

Payload --> <PayloadLength?,PayloadData>

PositionDelta --> VInt

PayloadLength --> VInt

PayloadData --> byte^PayloadLength

TermPositions are ordered by term (the term is implicit, from the .tis file).

Positions entries are ordered by increasing document number (the document number is implicit from the .frq file).

PositionDelta is, if payloads are disabled for the term's field, the difference between the position of the current occurrence in the document and the previous occurrence (or zero, if this is the first occurrence in this document). If payloads are enabled for the term's field, then PositionDelta/2 is the difference between the current and the previous position. If payloads are enabled and PositionDelta is odd, then PayloadLength is stored, indicating the length of the payload at the current term position.

For example, the TermPositions for a term which occurs as the fourth term in one document, and as the fifth and ninth term in a subsequent document, would be the following sequence of VInts (payloads disabled):

4, 5, 4

PayloadData is metadata associated with the current term position. If PayloadLength is stored at the current position, then it indicates the length of this Payload. If PayloadLength is not stored, then this Payload has the same length as the Payload at the previous position.
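The position deltas for the payloads-disabled case can be sketched as follows. The input shape (one list of term positions per document, in document order) and the function name are assumptions made for illustration; payloads and the byte-level VInt encoding are not modeled.

```python
def encode_positions(docs):
    """docs: one list of increasing term positions per document.

    Returns the flat sequence of PositionDelta values (payloads disabled).
    """
    out = []
    for positions in docs:
        prev = 0                 # deltas restart at zero for every document
        for position in positions:
            out.append(position - prev)
            prev = position
    return out
```

For a term at position 4 in one document and at positions 5 and 9 in the next, this gives 4, 5, 4, matching the example above.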

Normalization Factors

There's a single .nrm file containing all norms:

AllNorms (.nrm) --> NormsHeader, <Norms>^NumFieldsWithNorms

Norms --> <Byte>^SegSize

NormsHeader --> 'N','R','M',Version

Version --> Byte

NormsHeader has 4 bytes, last of which is the format version for this file, currently -1.

Each byte encodes a floating point value. Bits 0-2 contain the 3-bit mantissa, and bits 3-7 contain the 5-bit exponent.

These are converted to an IEEE single float value as follows:

  1. If the byte is zero, use a zero float.

  2. Otherwise, set the sign bit of the float to zero;

  3. add 48 to the exponent and use this as the float's exponent;

  4. map the mantissa to the high-order 3 bits of the float's mantissa; and

  5. set the low-order 21 bits of the float's mantissa to zero.
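The conversion can be sketched in a few lines. This sketch assumes the bit placement used by Lucene's `SmallFloat.byte315ToFloat`, where the whole byte is shifted left by 21 bits (leaving the low-order 21 bits zero) and 48 is added at bit 24; the function name `decode_norm` is hypothetical.

```python
import struct

def decode_norm(b):
    """Convert one norm byte (0..255) to a Python float."""
    if b == 0:
        return 0.0
    # Assumed layout: mantissa lands at bits 21-23, exponent at bits 24-28,
    # and the exponent is biased by adding 48 at bit 24.
    bits = ((b & 0xFF) << 21) + (48 << 24)
    # Reinterpret the 32-bit pattern as an IEEE single float.
    return struct.unpack(">f", struct.pack(">I", bits))[0]
```

Under this assumption the byte 124 (0x7C) decodes to exactly 1.0 and 120 (0x78) to 0.5, so one byte covers a wide range of magnitudes at roughly one significant decimal digit.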

A separate norm file is created when the norm values of an existing segment are modified. When field N is modified, a separate norm file .sN is created, to maintain the norm values for that field.

Separate norm files are created (when appropriate) for both compound and non-compound segments.

Term Vectors

Term vector support is optional on a field-by-field basis. It consists of 3 files.

  1. The Document Index or .tvx file.

    For each document, this stores the offset into the document data (.tvd) and field data (.tvf) files.

    DocumentIndex (.tvx) --> TVXVersion, <DocumentPosition, FieldPosition>^NumDocs

    TVXVersion --> Int (TermVectorsReader.FORMAT_CURRENT)

    DocumentPosition --> UInt64 (offset in the .tvd file)

    FieldPosition --> UInt64 (offset in the .tvf file)

  2. The Document or .tvd file.

    This contains, for each document, the number of fields, a list of the fields with term vector info and finally a list of pointers to the field information in the .tvf (Term Vector Fields) file.

    Document (.tvd) --> TVDVersion, <NumFields, FieldNums, FieldPositions>^NumDocs

    TVDVersion --> Int (TermVectorsReader.FORMAT_CURRENT)

    NumFields --> VInt

    FieldNums --> <FieldNumDelta>^NumFields

    FieldNumDelta --> VInt

    FieldPositions --> <FieldPositionDelta>^(NumFields-1)

    FieldPositionDelta --> VLong

    The .tvd file is used to map out the fields that have term vectors stored and where the field information is in the .tvf file.

  3. The Field or .tvf file.

    This file contains, for each field that has a term vector stored, a list of the terms, their frequencies and, optionally, position and offset information.

    Field (.tvf) --> TVFVersion, <NumTerms, Position/Offset, TermFreqs>^NumFields

    TVFVersion --> Int (TermVectorsReader.FORMAT_CURRENT)

    NumTerms --> VInt

    Position/Offset --> Byte

    TermFreqs --> <TermText, TermFreq, Positions?, Offsets?>^NumTerms

    TermText --> <PrefixLength, Suffix>

    PrefixLength --> VInt

    Suffix --> String

    TermFreq --> VInt

    Positions --> <VInt>^TermFreq

    Offsets --> <VInt, VInt>^TermFreq


    Notes:

    • Position/Offset byte stores whether this term vector has position or offset information stored.
    • Term text prefixes are shared. The PrefixLength is the number of initial characters from the previous term which must be pre-pended to a term's suffix in order to form the term's text. Thus, if the previous term's text was "bone" and the term is "boy", the PrefixLength is two and the suffix is "y".
    • Positions are stored as delta encoded VInts. This means we only store the difference between the current position and the last position.
    • Offsets are stored as delta encoded VInts. The first VInt is the startOffset, the second is the endOffset.

Deleted Documents

The .del file is optional, and only exists when a segment contains deletions.

Although per-segment, this file is maintained exterior to compound segment files.

Deletions (.del) --> [Format],ByteCount,BitCount, Bits | DGaps (depending on Format)

Format, ByteCount, BitCount --> Uint32

Bits --> <Byte>^ByteCount

DGaps --> <DGap, NonzeroByte>^NonzeroBytesCount

DGap --> VInt

NonzeroByte --> Byte

Format is optional. -1 indicates DGaps. A non-negative value indicates Bits, and that Format is excluded.

ByteCount indicates the number of bytes in Bits. It is typically (SegSize/8)+1.

BitCount indicates the number of bits that are currently set in Bits.

Bits contains one bit for each document indexed. When the bit corresponding to a document number is set, that document is marked as deleted. Bit ordering is from least to most significant. Thus, if Bits contains two bytes, 0x00 and 0x02, then document 9 is marked as deleted.

DGaps represents sparse bit-vectors more efficiently than Bits. It consists of DGaps, the gaps between the indexes of consecutive nonzero bytes in Bits, paired with the nonzero bytes themselves. The number of nonzero bytes in Bits (NonzeroBytesCount) is not stored.

For example, if there are 8000 bits and only bits 10,12,32 are set, DGaps would be used:

(VInt) 1 , (Byte) 20 , (VInt) 3 , (Byte) 1
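The relationship between Bits and DGaps can be sketched as follows. The helper names are hypothetical, the byte count follows the "typically (SegSize/8)+1" rule above, and the result is kept as (DGap, NonzeroByte) pairs rather than serialized bytes.

```python
def make_bits(num_docs, deleted):
    """Build the Bits byte array, least significant bit first within each byte."""
    bits = bytearray(num_docs // 8 + 1)
    for doc in deleted:
        bits[doc >> 3] |= 1 << (doc & 7)
    return bytes(bits)

def to_dgaps(bits):
    """Return (DGap, NonzeroByte) pairs for every nonzero byte in Bits."""
    pairs, last_index = [], 0
    for index, value in enumerate(bits):
        if value:
            pairs.append((index - last_index, value))  # gap from previous nonzero byte
            last_index = index
    return pairs
```

With 8000 bits and documents 10, 12 and 32 deleted, bytes 1 and 4 are nonzero (values 20 and 1), so the pairs are (1, 20) and (3, 1), matching the sequence above.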

Limitations

When referring to term numbers, Lucene's current implementation uses a Java int to hold the term index, which means the maximum number of unique terms in any single index segment is ~2.1 billion times the term index interval (default 128) = ~274 billion. This is technically not a limitation of the index file format, just of Lucene's current implementation.

Similarly, Lucene uses a Java int to refer to document numbers, and the index file format uses an Int32 on-disk to store document numbers. This is a limitation of both the index file format and the current implementation. Eventually these should be replaced with either UInt64 values, or better yet, VInt values, which have no limit.
