Deletable File

A writer dynamically computes the files that are deletable instead, so no file is written.

Compound Files

Starting with Lucene 1.4, the compound file format became the default. This is simply a container for all files described in the next section (except for the .del file).

Compound (.cfs) --> FileCount, <DataOffset, FileName>^FileCount, FileData^FileCount
FileCount --> VInt
DataOffset --> Long
FileName --> String
FileData --> raw file data

The raw file data is the data from the individual files named above.

Starting with Lucene 2.3, doc store files (stored field values and term vectors) can be shared in a single set of files for more than one segment. When the compound file format is enabled, these shared files are added to a single compound file (same format as above) but with the extension .cfx.

Per-Segment Files

The remaining files are all per-segment, and are thus defined by suffix.

Fields

Field Info

Field names are stored in the field info file, with suffix .fnm.

FieldInfos (.fnm) --> FNMVersion, FieldsCount, <FieldName, FieldBits>^FieldsCount
FNMVersion, FieldsCount --> VInt
FieldName --> String
FieldBits --> Byte

FNMVersion (added in 2.9) is always -2.

Fields are numbered by their order in this file. Thus field zero is the first field in the file, field one the next, and so on. Note that, like document numbers, field numbers are segment relative.

Stored Fields

Stored fields are represented by two files:

The field index, or .fdx file.

This contains, for each document, a pointer to its field data, as follows:

FieldIndex (.fdx) --> <FieldValuesPosition>^SegSize
FieldValuesPosition --> Uint64

This is used to find the location within the field data file of the fields of a particular document. Because it contains fixed-length data, this file may be easily randomly accessed. The position of document n's field data is the Uint64 at n*8 in this file.

The field data, or .fdt file.

This contains the stored fields of each document, as follows:

FieldData (.fdt) --> <DocFieldData>^SegSize
DocFieldData --> FieldCount, <FieldNum, Bits, Value>^FieldCount
FieldCount --> VInt
FieldNum --> VInt
Bits --> Byte
Value --> String | BinaryValue (depending on Bits)
BinaryValue --> ValueSize, <Byte>^ValueSize
ValueSize --> VInt

Term Dictionary

The term dictionary is represented as two files:

The term infos, or .tis file.

TermInfoFile (.tis) --> TIVersion, TermCount, IndexInterval, SkipInterval, MaxSkipLevels, TermInfos
TIVersion --> UInt32
TermCount --> UInt64
IndexInterval --> UInt32
SkipInterval --> UInt32
MaxSkipLevels --> UInt32
TermInfos --> <TermInfo>^TermCount
TermInfo --> <Term, DocFreq, FreqDelta, ProxDelta, SkipDelta>
Term --> <PrefixLength, Suffix, FieldNum>
Suffix --> String
PrefixLength, DocFreq, FreqDelta, ProxDelta, SkipDelta --> VInt

This file is sorted by Term. Terms are ordered first lexicographically (by UTF16 character code) by the term's field name, and within that lexicographically (by UTF16 character code) by the term's text.

TIVersion names the version of the format of this file and is equal to TermInfosWriter.FORMAT_CURRENT.

Term text prefixes are shared. The PrefixLength is the number of initial characters from the previous term which must be prepended to a term's suffix in order to form the term's text. Thus, if the previous term's text was "bone" and the term is "boy", the PrefixLength is two and the suffix is "y".

FieldNum determines the term's field, whose name is stored in the .fnm file.

DocFreq is the count of documents which contain the term.

FreqDelta determines the position of this term's TermFreqs within the .frq file. In particular, it is the difference between the position of this term's data in that file and the position of the previous term's data (or zero, for the first term in the file).

ProxDelta determines the position of this term's TermPositions within the .prx file. In particular, it is the difference between the position of this term's data in that file and the position of the previous term's data (or zero, for the first term in the file). For fields with omitTf true, this will be 0 since prox information is not stored.

SkipDelta determines the position of this term's SkipData within the .frq file. In particular, it is the number of bytes after TermFreqs that the SkipData starts. In other words, it is the length of the TermFreq data. SkipDelta is only stored if DocFreq is not smaller than SkipInterval.

The term info index, or .tii file.

This contains every IndexInterval-th entry from the .tis file, along with its location in the .tis file. This is designed to be read entirely into memory and used to provide random access to the .tis file.

The structure of this file is very similar to the .tis file, with the addition of one item per record, the IndexDelta.

TermInfoIndex (.tii) --> TIVersion, IndexTermCount, IndexInterval, SkipInterval, MaxSkipLevels, TermIndices
TIVersion --> UInt32
IndexTermCount --> UInt64
IndexInterval --> UInt32
SkipInterval --> UInt32
MaxSkipLevels --> UInt32
TermIndices --> <TermInfo, IndexDelta>^IndexTermCount
IndexDelta --> VLong

IndexDelta determines the position of this term's TermInfo within the .tis file. In particular, it is the difference between the position of this term's entry in that file and the position of the previous term's entry.

SkipInterval is the fraction of TermDocs stored in skip tables. It is used to accelerate TermDocs.skipTo(int). Larger values result in smaller indexes, greater acceleration, but fewer accelerable cases, while smaller values result in bigger indexes, less acceleration (in case of a small value for MaxSkipLevels) and more accelerable cases.

MaxSkipLevels is the maximum number of skip levels stored for each term in the .frq file. A low value results in smaller indexes but less acceleration; a larger value results in slightly larger indexes but greater acceleration. See the format of the .frq file for more information about skip levels.

Frequencies

The .frq file contains the lists of documents which contain each term, along with the frequency of the term in that document (if omitTf is false).

FreqFile (.frq) --> <TermFreqs, SkipData>^TermCount
TermFreqs --> <TermFreq>^DocFreq
TermFreq --> DocDelta[, Freq?]
SkipData --> <<SkipLevelLength, SkipLevel>^(NumSkipLevels-1), SkipLevel> <SkipDatum>
SkipLevel --> <SkipDatum>^(DocFreq/(SkipInterval^(Level + 1)))
SkipDatum --> DocSkip, PayloadLength?, FreqSkip, ProxSkip, SkipChildLevelPointer?
DocDelta, Freq, DocSkip, PayloadLength, FreqSkip, ProxSkip --> VInt
SkipChildLevelPointer --> VLong

TermFreqs are ordered by term (the term is implicit, from the .tis file).

TermFreq entries are ordered by increasing document number.

DocDelta: if omitTf is false, this determines both the document number and the frequency. In particular, DocDelta/2 is the difference between this document number and the previous document number (or zero when this is the first document in a TermFreqs). When DocDelta is odd, the frequency is one. When DocDelta is even, the frequency is read as another VInt. If omitTf is true, DocDelta contains the gap (not multiplied by 2) between document numbers and no frequency information is stored.

For example, the TermFreqs for a term which occurs once in document seven and three times in document eleven, with omitTf false, would be the following sequence of VInts:

15, 8, 3

If omitTf were true it would be this sequence of VInts instead:

7, 4

DocSkip records the document number before every SkipInterval-th document in TermFreqs. If payloads are disabled for the term's field, then DocSkip represents the difference from the previous value in the sequence. If payloads are enabled for the term's field, then DocSkip/2 represents the difference from the previous value in the sequence. If payloads are enabled and DocSkip is odd, then PayloadLength is stored, indicating the length of the last payload before the SkipInterval-th document in TermPositions.

FreqSkip and ProxSkip record the position of every SkipInterval-th entry in FreqFile and ProxFile, respectively. For the first SkipDatum, file positions are relative to the start of TermFreqs and Positions; for each later one, they are relative to the previous SkipDatum in the sequence.

For example, if DocFreq=35 and SkipInterval=16, then there are two SkipData entries, containing the 15th and 31st document numbers in TermFreqs. The first FreqSkip names the number of bytes after the beginning of TermFreqs that the 16th SkipDatum starts, and the second the number of bytes after that that the 32nd starts. The first ProxSkip names the number of bytes after the beginning of Positions that the 16th SkipDatum starts, and the second the number of bytes after that that the 32nd starts.

Each term can have multiple skip levels. The number of skip levels for a term is NumSkipLevels = Min(MaxSkipLevels, floor(log(DocFreq)/log(SkipInterval))). The number of SkipData entries for a skip level is DocFreq/(SkipInterval^(Level + 1)), where the lowest skip level is Level=0.

Example: SkipInterval = 4, MaxSkipLevels = 2, DocFreq = 35. Then skip level 0 has 8 SkipData entries, containing the 3rd, 7th, 11th, 15th, 19th, 23rd, 27th, and 31st document numbers in TermFreqs. Skip level 1 has 2 SkipData entries, containing the 15th and 31st document numbers in TermFreqs.

The SkipData entries on all upper levels (Level > 0) contain a SkipChildLevelPointer referencing the corresponding SkipData entry on the level below (Level - 1). In the example, entry 15 on level 1 has a pointer to entry 15 on level 0, and entry 31 on level 1 has a pointer to entry 31 on level 0.

Positions

The .prx file contains the lists of positions that each term occurs at within documents. Note that fields with omitTf true do not store anything into this file, and if all fields in the index have omitTf true then the .prx file will not exist.

ProxFile (.prx) --> <TermPositions>^TermCount
TermPositions --> <Positions>^DocFreq
Positions --> <PositionDelta, Payload?>^Freq
Payload --> <PayloadLength?, PayloadData>
PositionDelta --> VInt
PayloadLength --> VInt
PayloadData --> byte^PayloadLength

TermPositions are ordered by term (the term is implicit, from the .tis file).

Positions entries are ordered by increasing document number (the document number is implicit from the .frq file).

PositionDelta is, if payloads are disabled for the term's field, the difference between the position of the current occurrence in the document and the previous occurrence (or zero, if this is the first occurrence in this document). If payloads are enabled for the term's field, then PositionDelta/2 is the difference between the current and the previous position. If payloads are enabled and PositionDelta is odd, then PayloadLength is stored, indicating the length of the payload at the current term position.

For example, the TermPositions for a term which occurs as the fourth term in one document, and as the fifth and ninth term in a subsequent document, would be the following sequence of VInts (payloads disabled):

4, 5, 4

PayloadData is metadata associated with the current term position. If PayloadLength is stored at the current position, then it indicates the length of this Payload. If PayloadLength is not stored, then this Payload has the same length as the Payload at the previous position.

Normalization Factors

There's a single .nrm file containing all norms:

AllNorms (.nrm) --> NormsHeader, <Norms>^NumFieldsWithNorms
Norms --> <Byte>^SegSize
NormsHeader --> 'N','R','M',Version
Version --> Byte

NormsHeader has 4 bytes, the last of which is the format version for this file, currently -1.

Each byte encodes a floating point value. Bits 0-2 contain the 3-bit mantissa, and bits 3-7 contain the 5-bit exponent.

These are converted to an IEEE single float value as follows:

If the byte is zero, use a zero float.
Otherwise, set the sign bit of the float to zero;
add 48 to the exponent and use this as the float's exponent;
map the mantissa to the high-order 3 bits of the float's mantissa; and
set the low-order 21 bits of the float's mantissa to zero.

A separate norm file is created when the norm values of an existing segment are modified. When field N is modified, a separate norm file .sN is created, to maintain the norm values for that field.

Separate norm files are created (when adequate) for both compound and non-compound segments.

Term Vectors

Term vector support is optional on a field-by-field basis. It consists of 3 files.

The Document Index or .tvx file.

For each document, this stores the offset into the document data (.tvd) and field data (.tvf) files.

DocumentIndex (.tvx) --> TVXVersion<DocumentPosition, FieldPosition>^NumDocs
TVXVersion --> Int (TermVectorsReader.CURRENT)
DocumentPosition --> UInt64 (offset in the .tvd file)
FieldPosition --> UInt64 (offset in the .tvf file)

The Document or .tvd file.

This contains, for each document, the number of fields, a list of the fields with term vector info and finally a list of pointers to the field information in the .tvf (Term Vector Fields) file.

Document (.tvd) --> TVDVersion<NumFields, FieldNums, FieldPositions>^NumDocs
TVDVersion --> Int (TermVectorsReader.FORMAT_CURRENT)
NumFields --> VInt
FieldNums --> <FieldNumDelta>^NumFields
FieldNumDelta --> VInt
FieldPositions --> <FieldPositionDelta>^(NumFields-1)
FieldPositionDelta --> VLong

The .tvd file is used to map out the fields that have term vectors stored and where the field information is in the .tvf file.

The Field or .tvf file.

This file contains, for each field that has a term vector stored, a list of the terms, their frequencies and, optionally, position and offset information.

Field (.tvf) --> TVFVersion<NumTerms, Position/Offset, TermFreqs>^NumFields
TVFVersion --> Int (TermVectorsReader.FORMAT_CURRENT)
NumTerms --> VInt
Position/Offset --> Byte
TermFreqs --> <TermText, TermFreq, Positions?, Offsets?>^NumTerms
TermText --> <PrefixLength, Suffix>
PrefixLength --> VInt
Suffix --> String
TermFreq --> VInt
Positions --> <VInt>^TermFreq
Offsets --> <VInt, VInt>^TermFreq

Deleted Documents

The .del file is optional, and only exists when a segment contains deletions.

Although per-segment, this file is maintained exterior to compound segment files.

Deletions (.del) --> [Format], ByteCount, BitCount, Bits | DGaps (depending on Format)
Format, ByteCount, BitCount --> Uint32
Bits --> <Byte>^ByteCount
DGaps --> <DGap, NonzeroByte>^NonzeroBytesCount
DGap --> VInt
NonzeroByte --> Byte

Format is optional. -1 indicates DGaps. A non-negative value indicates Bits, and that Format is excluded.

ByteCount indicates the number of bytes in Bits. It is typically (SegSize/8)+1.

BitCount indicates the number of bits that are currently set in Bits.

Bits contains one bit for each document indexed. When the bit corresponding to a document number is set, that document is marked as deleted. Bit ordering is from least to most significant. Thus, if Bits contains two bytes, 0x00 and 0x02, then document 9 is marked as deleted.

DGaps represents sparse bit-vectors more efficiently than Bits. It is made of DGaps on indexes of nonzero bytes in Bits, and the nonzero bytes themselves. The number of nonzero bytes in Bits (NonzeroBytesCount) is not stored.

For example, if there are 8000 bits and only bits 10, 12, 32 are set, DGaps would be used:

(VInt) 1, (Byte) 20, (VInt) 3, (Byte) 1

Limitations

When referring to term numbers, Lucene's current implementation uses a Java int to hold the term index, which means the maximum number of unique terms in any single index segment is ~2.1 billion times the term index interval (default 128) = ~274 billion. This is technically not a limitation of the index file format, just of Lucene's current implementation.

Similarly, Lucene uses a Java int to refer to document numbers, and the index file format uses an Int32 on-disk to store document numbers. This is a limitation of both the index file format and the current implementation. Eventually these should be replaced with either UInt64 values, or better yet, VInt values which have no limit.
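The prefix sharing used for term text in the .tis file can be illustrated with a small sketch. It is not Lucene's code: the function names `encode_terms` and `decode_terms` are hypothetical, and terms are handled as plain Python strings rather than the on-disk String format.

```python
def encode_terms(sorted_terms):
    """Return (PrefixLength, Suffix) pairs for an already sorted term list."""
    entries = []
    previous = ""
    for term in sorted_terms:
        # Count the leading characters shared with the previous term.
        shared = 0
        while (shared < min(len(previous), len(term))
               and previous[shared] == term[shared]):
            shared += 1
        entries.append((shared, term[shared:]))
        previous = term
    return entries


def decode_terms(entries):
    """Rebuild the full term texts from (PrefixLength, Suffix) pairs."""
    terms = []
    previous = ""
    for prefix_length, suffix in entries:
        previous = previous[:prefix_length] + suffix
        terms.append(previous)
    return terms


entries = encode_terms(["bone", "boy"])
```

With the terms "bone" and "boy" from the text, the second entry comes out as PrefixLength 2 and suffix "y", and `decode_terms(entries)` restores both terms.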
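The DocDelta rule for the .frq file is easier to follow when traced in code. The sketch below assumes the VInts have already been decoded into Python integers, and the name `decode_term_freqs` is made up for illustration.

```python
def decode_term_freqs(vints, omit_tf=False):
    """Turn one term's TermFreqs values into (document number, frequency) pairs."""
    postings = []
    doc = 0
    values = iter(vints)
    for delta in values:
        if omit_tf:
            # The gap is stored as-is and no frequency is recorded.
            doc += delta
            postings.append((doc, None))
            continue
        doc += delta >> 1  # DocDelta/2 is the gap to the previous document
        if delta & 1:
            freq = 1  # odd DocDelta: the frequency is one
        else:
            freq = next(values)  # even DocDelta: the frequency follows as a VInt
        postings.append((doc, freq))
    return postings


with_freqs = decode_term_freqs([15, 8, 3])
without_freqs = decode_term_freqs([7, 4], omit_tf=True)
```

Here `with_freqs` reproduces the example from the text (document 7 once, document 11 three times), and `without_freqs` yields the same document numbers with no frequency.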
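The skip-level arithmetic for the .frq file can be checked with a short sketch. It assumes the intended formula is floor(log(DocFreq)/log(SkipInterval)), which it computes with integer division to avoid floating-point rounding. The function name `skip_layout` and its return shape are hypothetical.

```python
def skip_layout(doc_freq, skip_interval, max_skip_levels):
    """List, per skip level, the 1-based ordinals of the documents recorded."""
    if skip_interval < 2:
        raise ValueError("skip_interval must be at least 2")
    # floor(log(doc_freq) / log(skip_interval)) via repeated integer division
    levels = 0
    remaining = doc_freq
    while remaining >= skip_interval:
        remaining //= skip_interval
        levels += 1
    levels = min(max_skip_levels, levels)

    layout = []
    for level in range(levels):
        step = skip_interval ** (level + 1)
        count = doc_freq // step  # number of SkipData entries on this level
        # Each entry holds the document number before every step-th document.
        layout.append([k * step - 1 for k in range(1, count + 1)])
    return layout


two_levels = skip_layout(doc_freq=35, skip_interval=4, max_skip_levels=2)
one_level = skip_layout(doc_freq=35, skip_interval=16, max_skip_levels=10)
```

`two_levels` matches the SkipInterval = 4 example (8 entries on level 0, 2 on level 1), and `one_level` matches the SkipInterval = 16 example with entries for the 15th and 31st documents.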
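The PositionDelta encoding in the .prx file, with payloads disabled, amounts to per-document delta coding. The following sketch uses hypothetical helper names and takes the per-document frequencies as an argument, since in the real format they come from the .frq file.

```python
def encode_positions(positions_per_doc):
    """Flatten per-document position lists into PositionDelta values."""
    deltas = []
    for positions in positions_per_doc:
        previous = 0  # the delta base restarts at zero for each document
        for position in positions:
            deltas.append(position - previous)
            previous = position
    return deltas


def decode_positions(deltas, freqs):
    """Rebuild per-document positions, given each document's frequency."""
    values = iter(deltas)
    result = []
    for freq in freqs:
        position = 0
        positions = []
        for _ in range(freq):
            position += next(values)
            positions.append(position)
        result.append(positions)
    return result


deltas = encode_positions([[4], [5, 9]])
```

For the example in the text (fourth term in one document, fifth and ninth in the next), `deltas` is the sequence 4, 5, 4, and `decode_positions(deltas, [1, 2])` restores the original positions.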
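Both .del representations can be reproduced in a few lines. This is an illustrative sketch only: `deleted_docs` and `to_dgaps` are hypothetical names, the Format, ByteCount and BitCount header fields are left out, and DGap values are kept as plain integers rather than encoded VInts.

```python
def deleted_docs(bits):
    """Return the document numbers whose bit is set (least significant bit first)."""
    return [byte_index * 8 + bit
            for byte_index, byte in enumerate(bits)
            for bit in range(8)
            if (byte >> bit) & 1]


def to_dgaps(bits):
    """Return (DGap, NonzeroByte) pairs for the nonzero bytes of a bit vector."""
    pairs = []
    last_index = 0
    for index, byte in enumerate(bits):
        if byte:
            pairs.append((index - last_index, byte))
            last_index = index
    return pairs


# 8000 bits with only bits 10, 12 and 32 set, as in the example in the text.
sparse = bytearray(1000)
for doc in (10, 12, 32):
    sparse[doc >> 3] |= 1 << (doc & 7)

dgaps = to_dgaps(sparse)
```

`deleted_docs(bytes([0x00, 0x02]))` gives document 9, and `dgaps` comes out as the pairs (1, 20) and (3, 1), matching the sequence shown in the text.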