Index包分析
原创:windshow TjuAILab
Lucene索引中有几个最基础的概念,索引(index),文档(document),域(field),和项(或者译为语词term)
其中Index为Document的序列
Document为Field的序列
Field为Term的序列
Term就是一个子串.
存在于不同的Field中的同一个子串被认为是不同的Term.因此Term实际上是用一对子串表示的,第一个子串为Field的name,第二个为Field中的子串.既然Term这么重要,我们先来认识一下Term.
认识Term
最好的方法就是看其源码表示.
public final class Term implements Comparable, java.io.Serializable {
  String field;
  String text;

  public Term(String fld, String txt) { this(fld, txt, true); }

  public final String field() { return field; }
  public final String text() { return text; }

  // 重写equals():field与text都相等才视为同一个Term(方法体从略)
  public final boolean equals(Object o) { /* ... */ }

  // 重写hashCode():由field与text的hashCode组合而成
  public final int hashCode() { return field.hashCode() + text.hashCode(); }

  public int compareTo(Object other) { return compareTo((Term)other); }
  // 先按field名、再按text排序(方法体从略)
  public final int compareTo(Term other) { /* ... */ }

  final void set(String fld, String txt) { /* ... */ }

  public final String toString() { return field + ":" + text; }

  private void readObject(java.io.ObjectInputStream in) { /* ... */ }
}
从代码中我们可以大体看出,Term其实就是一个二元组<FieldName, text>。
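下面用一个很小的例子验证"Term其实是<FieldName, text>二元组"这一说法(示例中的字段名与文本只是随手假设的演示数据):

import org.apache.lucene.index.Term;

public class TermDemo {
    public static void main(String[] args) {
        // 同一个文本出现在不同的Field中,被视为不同的Term
        Term t1 = new Term("title", "boy");
        Term t2 = new Term("contents", "boy");
        Term t3 = new Term("title", "boy");
        System.out.println(t1.equals(t2));        // false:field不同
        System.out.println(t1.equals(t3));        // true:field与text都相同
        // compareTo()先按field名、再按text排序,term dictionary正是按此顺序存储
        System.out.println(t1.compareTo(t2) > 0); // true:"title"排在"contents"之后
        System.out.println(t1);                   // 输出 title:boy
    }
}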
倒排索引
为了使得基于项的搜索更有效率,索引中项是静态存储的。Lucene的索引属于索引方式中的倒排索引,因为对于一个项这种索引可以列出包含它的文档。这刚好是文档与项自然联系的倒置。
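为了直观理解"倒排"的含义,下面给出一个与Lucene本身无关的极简内存版倒排索引示意,只演示"从term映射到文档号列表"这一核心思想(分词只按空格切分,纯属简化假设):

import java.util.*;

public class TinyInvertedIndex {
    // term -> 含有该term的文档号列表(即posting list)
    private final Map<String, List<Integer>> postings = new TreeMap<String, List<Integer>>();

    public void addDocument(int docId, String text) {
        for (String token : text.toLowerCase().split("\\s+")) {
            List<Integer> list = postings.get(token);
            if (list == null) {
                list = new ArrayList<Integer>();
                postings.put(token, list);
            }
            // 保证文档号升序且不重复
            if (list.isEmpty() || list.get(list.size() - 1) != docId) {
                list.add(docId);
            }
        }
    }

    public List<Integer> search(String term) {
        List<Integer> list = postings.get(term.toLowerCase());
        return list == null ? Collections.<Integer>emptyList() : list;
    }

    public static void main(String[] args) {
        TinyInvertedIndex index = new TinyInvertedIndex();
        index.addDocument(0, "i love china i love tianjin");
        index.addDocument(1, "i love nankai");
        System.out.println(index.search("love"));   // [0, 1]
        System.out.println(index.search("nankai")); // [1]
    }
}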
Field的类型
在Lucene中,Field的文本可能以逐字的、非倒排的方式存储在索引中,这样的Field称为被存储的;而经过倒排的Field则称为被索引的。Field也可以同时被存储和被索引。Field的文本可能被分解成许多Term而被索引,也可能整体作为一个Term被索引。大多数Field是被分解过的,但有些时候把某些标识符域当作单个Term来索引是很有用的。
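结合Lucene 1.4中Field的几个静态工厂方法,可以直观地看到"被存储/被索引/被分词"的几种组合(字段名与内容均为假设的演示数据):

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

public class FieldTypeDemo {
    public static void main(String[] args) {
        Document doc = new Document();
        // Text:存储 + 索引 + 分词,适合标题、正文等
        doc.add(Field.Text("title", "I love Lucene"));
        // Keyword:存储 + 索引但不分词,整个值作为一个Term,适合路径、日期等标识符域
        doc.add(Field.Keyword("path", "e:\\lucene\\test.txt"));
        // UnIndexed:只存储不索引,适合只需原样取回的内容
        doc.add(Field.UnIndexed("note", "only stored, never searched"));
        // UnStored:只索引不存储,适合大段正文
        doc.add(Field.UnStored("contents", "i love china i love tianjin"));
        System.out.println(doc);
    }
}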
Index包中的每个类解析
CompoundFileReader
提供读取.cfs文件的方法.
CompoundFileWriter
用来构建.cfs文件,从Lucene1.4开始,会将下面要提到的各类文件,譬如.tii,.tis等合并成一个.cfs文件!
其结构如下(下文的格式说明中,X^N表示X重复N次;原文档以上标表示,这里统一用^代替):
Compound (.cfs) --> FileCount, <DataOffset, FileName>^FileCount, FileData^FileCount
FileCount --> VInt
DataOffset --> Long
FileName --> String
FileData --> raw file data
DocumentWriter
构建.frq,.prx,.f文件
1. FreqFile (.frq) --> <TermFreqs, SkipData>^TermCount
TermFreqs --> <TermFreq>^DocFreq
TermFreq --> DocDelta, Freq?
SkipData --> <SkipDatum>^(DocFreq/SkipInterval)
SkipDatum --> DocSkip, FreqSkip, ProxSkip
DocDelta, Freq, DocSkip, FreqSkip, ProxSkip --> VInt
2.The .prx file contains the lists of positions that each term occurs at within documents.
ProxFile (.prx) --> <TermPositions>^TermCount
TermPositions --> <Positions>^DocFreq
Positions --> <PositionDelta>^Freq
PositionDelta --> VInt
3.There's a norm file for each indexed field with a byte for each document. The .f[0-9]* file contains, for each document, a byte that encodes a value that is multiplied into the score for hits on that field:
Norms (.f[0-9]*) --> <Byte>^SegSize
Each byte encodes a floating point value. Bits 0-2 contain the 3-bit mantissa, and bits 3-8 contain the 5-bit exponent.
These are converted to an IEEE single float value as follows:
1. If the byte is zero, use a zero float.
2. Otherwise, set the sign bit of the float to zero;
3. add 48 to the exponent and use this as the float's exponent;
4. map the mantissa to the high-order 3 bits of the float's mantissa; and
5. set the low-order 21 bits of the float's mantissa to zero.
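把上面五个步骤直接翻译成代码,大致相当于Lucene 1.4中Similarity解码norm字节的私有方法。下面是按上述描述重写的示意(并非原样照抄源码,仅供理解):

public class NormDecodeDemo {
    // 按上文描述把norm字节还原成float:低3位为尾数,其后5位为指数
    static float byteToFloat(byte b) {
        if (b == 0) {
            return 0.0f;                    // 第1步:字节为0直接返回0
        }
        int mantissa = b & 0x07;            // bits 0-2
        int exponent = (b >> 3) & 0x1f;     // 其后5位
        // 第2至5步:符号位为0,指数加48,尾数放到高3位,低21位补0
        int bits = ((exponent + 48) << 24) | (mantissa << 21);
        return Float.intBitsToFloat(bits);
    }

    public static void main(String[] args) {
        for (int b = 0; b < 256; b += 51) {
            System.out.println(b + " -> " + byteToFloat((byte) b));
        }
    }
}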
FieldInfo
里边有Field的部分信息,是一个四元组<name,isIndexed,num, storeTermVector>
FieldInfos
此类用来描述Document的各个field是否被索引等信息,每个Segment有一个单独的FieldInfos文件(.fnm)。按其注释的说法,此类对象对多线程是安全的,但某一时刻只允许一个线程添加document,其他reader和writer不允许进入。不过此类内部维护的ArrayList和HashMap两个容器都没有加synchronized,为何还说线程安全,笔者存疑。
观察write函数可知 .fnm文件的构成为
FieldInfos (.fnm) --> FieldsCount, <FieldName, FieldBits>^FieldsCount
FieldsCount --> VInt
FieldName --> String
FieldBits --> Byte
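上面的格式说明中反复出现VInt,它是Lucene的变长整数编码:每个字节的低7位存数据,最高位表示后面是否还有字节,低位字节在前。下面用纯JDK写一个与之等价的编解码示意:

import java.io.*;

public class VIntDemo {
    static void writeVInt(OutputStream out, int value) throws IOException {
        // 每次写低7位,若剩余部分不为0则把最高位置1,表示"还有后续字节"
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80);
            value >>>= 7;
        }
        out.write(value);
    }

    static int readVInt(InputStream in) throws IOException {
        int b = in.read();
        int value = b & 0x7F;
        for (int shift = 7; (b & 0x80) != 0; shift += 7) {
            b = in.read();
            value |= (b & 0x7F) << shift;
        }
        return value;
    }

    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        writeVInt(buf, 300);   // 大于127的值占两个字节
        writeVInt(buf, 5);     // 小于128的值只占一个字节
        ByteArrayInputStream in = new ByteArrayInputStream(buf.toByteArray());
        System.out.println(readVInt(in)); // 300
        System.out.println(readVInt(in)); // 5
    }
}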
FieldsReader
用来读取.fdx和.fdt两个文件。
FieldsWriter
此类负责创建.fdx和.fdt两个文件。
FieldIndex(.fdx)为每一个Document保存一个指向其field数据在.fdt文件中位置的指针(其实就是一个整数偏移量),其结构为
<FieldValuesPosition>^SegSize
FieldValuesPosition --> UInt64
因此第n个document的field指针位于.fdx中偏移n*8处。
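由于.fdx中每条记录都是定长8字节的指针,按文档号随机定位非常简单。下面用JDK的RandomAccessFile做一个示意(文件名"_seg.fdx"只是假设,实际文件名以segment名为前缀):

import java.io.IOException;
import java.io.RandomAccessFile;

public class FdxSeekDemo {
    // 读取第docNum个document的field数据在.fdt文件中的起始位置
    static long fieldDataPointer(String fdxPath, int docNum) throws IOException {
        RandomAccessFile fdx = new RandomAccessFile(fdxPath, "r");
        try {
            fdx.seek((long) docNum * 8); // 第n个指针位于偏移n*8处
            return fdx.readLong();       // UInt64,高位字节在前,与Lucene的写入方式一致
        } finally {
            fdx.close();
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println(fieldDataPointer("testIndexPackage/_seg.fdx", 0));
    }
}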
FieldData(.fdt)里面包含了每一个文档所存储的field信息,内容如下:
<DocFieldData>^SegSize
DocFieldData --> FieldCount, <FieldNum, Bits, Value>^FieldCount
FieldCount --> VInt
FieldNum --> VInt
Lucene <= 1.4:
Bits --> Byte
Value --> String
Only the low-order bit of Bits is used. It is one for tokenized fields, and zero for non-tokenized fields.
FilterIndexReader
扩展自IndexReader,内部包装另一个IndexReader并把调用委托给它,从而为IndexReader的抽象方法提供了具体实现;子类只需覆盖想要改变行为的方法。
IndexReader
为abstract class!用来读取建完索引的Directory,并可以返回各种信息,譬如Term,TermPosition等等.
IndexWriter
IndexWriter用来创建和维护索引。
IndexWriter构造函数中的第三个参数决定是新建一个索引,还是打开一个已存在的索引以追加新的document。
通过addDocument()函数加入新的document,添加完之后调用close()关闭。
如果不再有document需要加入,又希望优化查询性能,则应在close()之前调用optimize()。
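对使用者而言,通常只需按"创建writer、addDocument、optimize、close"的顺序操作即可。下面是基于Lucene 1.4公开API的最小示例(索引目录名"demoIndex"和文本内容均为假设):

import org.apache.lucene.analysis.SimpleAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

public class IndexWriterDemo {
    public static void main(String[] args) throws Exception {
        // 第三个参数为true表示新建索引,为false表示在已有索引上追加
        IndexWriter writer = new IndexWriter("demoIndex", new SimpleAnalyzer(), true);
        Document doc = new Document();
        doc.add(Field.Text("contents", "i love china i love tianjin"));
        writer.addDocument(doc);
        writer.optimize(); // 不再添加document时,可在close之前优化以提升查询性能
        writer.close();
    }
}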
Deletable文件结构:
A file named "deletable" contains the names of files that are no longer used by the index, but which could not be deleted. This is only used on Win32, where a file may not be deleted while it is still open. On other platforms the file contains only null bytes.
Deletable --> DeletableCount, <DeletableName>^DeletableCount
DeletableCount --> UInt32
DeletableName --> String
MultipleTermPositions
专门用于search包中的PhrasePrefixQuery
MultiReader
扩展自IndexReader,用来同时读取多个索引,并把它们的内容拼接在一起。
SegmentInfo
一些关于Segment的信息,是一个三元组<segmentname,docCount,dir>
SegmentInfos
扩展自Vector,即一个向量组,其中每个成员都是SegmentInfo。它用来构建segments文件(每个Index有且只有一个这样的文件),并提供了read和write方法。
其内容如下:
Segments --> Format, Version, NameCounter, SegCount, <SegName, SegSize>^SegCount
Format, NameCounter, SegCount, SegSize --> UInt32
Version --> UInt64
SegName --> String
Format is -1 in Lucene 1.4.
Version counts how often the index has been changed by adding or deleting documents.
NameCounter is used to generate names for new segment files.
SegName is the name of the segment, and is used as the file name prefix for all of the files that compose the segment's index.
SegSize is the number of documents contained in the segment index.
SegmentMergeInfo
用来记录segment合并信息.
SegmentMergeQueue
扩展自PriorityQueue(按升序排列)
SegmentMerger
此类将多个Segment合并为一个Segment,由IndexWriter.addIndexes()创建并使用此类对象。
如果useCompoundFile为true,合并完成后还会创建.cfs文件,并把其余几乎所有文件全部合并进这个.cfs文件!
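在应用层面,合并多个索引一般不直接操作SegmentMerger,而是调用IndexWriter.addIndexes(),由它在内部创建SegmentMerger。下面是一个目录名均为假设的示意:

import org.apache.lucene.analysis.SimpleAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class MergeIndexDemo {
    public static void main(String[] args) throws Exception {
        Directory dir1 = FSDirectory.getDirectory("indexOne", false);
        Directory dir2 = FSDirectory.getDirectory("indexTwo", false);
        // 把两个已有索引的内容合并进新建的目标索引
        IndexWriter writer = new IndexWriter("mergedIndex", new SimpleAnalyzer(), true);
        writer.addIndexes(new Directory[] { dir1, dir2 });
        writer.close();
        dir1.close();
        dir2.close();
    }
}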
SegmentReader
扩展自IndexReader,提供了很多读取Index的方法
SegmentTermDocs
扩展自TermDocs
SegmentTermEnum
扩展自TermEnum
SegmentTermPositions
扩展自TermPositions
SegmentTermVector
扩展自TermFreqVector
Term
Term是一个<fieldName, text>对。而Field虽然分为多种,但至少都含有<fieldName, fieldValue>,这样二者就可以建立关联。Term是搜索的基本单元,其text可以是普通单词,也可以是日期、email地址、URL等。
TermDocs
TermDocs是一个Interface,提供按term枚举<document, frequency>对的接口,以供对某个term的查询使用。
在<document, frequency>对中,document部分标识每一个含有该term的document,document用其document number表示;frequency部分是该term在对应document中出现的次数。<document, frequency>对按document number排序。
TermEnum
此类为抽象类,用来枚举term。Term的枚举顺序由Term.compareTo()决定,因此枚举中的每一个term都大于它之前出现过的所有term。
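TermEnum与TermDocs通常配合使用:先用TermEnum按序遍历词典,再用termDocs()取出每个term的<document, frequency>列表。下面是只用IndexReader公开接口的示意(索引目录"demoIndex"为假设):

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.TermDocs;
import org.apache.lucene.index.TermEnum;

public class TermEnumDemo {
    public static void main(String[] args) throws Exception {
        IndexReader reader = IndexReader.open("demoIndex");
        TermEnum termEnum = reader.terms(); // 按Term.compareTo()的顺序枚举
        while (termEnum.next()) {
            System.out.println(termEnum.term() + " docFreq=" + termEnum.docFreq());
            TermDocs termDocs = reader.termDocs(termEnum.term());
            while (termDocs.next()) {       // 按document number升序
                System.out.println("  doc=" + termDocs.doc() + " freq=" + termDocs.freq());
            }
            termDocs.close();
        }
        termEnum.close();
        reader.close();
    }
}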
TermFreqVector
此Interface用来访问一个document的Field的Term Vector
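读取TermFreqVector同样走IndexReader的公开接口,前提是建索引时该field打开了storeTermVector。下面是一个简单示意(文档号0、字段名"contents"和索引目录均为假设):

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.TermFreqVector;

public class TermFreqVectorDemo {
    public static void main(String[] args) throws Exception {
        IndexReader reader = IndexReader.open("demoIndex");
        TermFreqVector tfv = reader.getTermFreqVector(0, "contents");
        if (tfv != null) { // 该field没有存term vector时返回null
            String[] terms = tfv.getTerms();
            int[] freqs = tfv.getTermFrequencies();
            for (int i = 0; i < terms.length; i++) {
                System.out.println(terms[i] + "/" + freqs[i]);
            }
        }
        reader.close();
    }
}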
TermInfo
此类主要用来存储Term信息.其可以说为一个五元组<Term,docFreq,freqPointer,proxPointer,skipOffset>
TermInfosReader
此部分笔者尚未细读,待读完SegmentTermEnum之后再补充。
TermInfosWriter
此类用来构建.tis和.tii两个文件,它们共同构成term dictionary。
1. The term infos, or tis file.
TermInfoFile (.tis)--> TIVersion, TermCount, IndexInterval, SkipInterval, TermInfos
TIVersion --> UInt32
TermCount --> UInt64
IndexInterval --> UInt32
SkipInterval --> UInt32
TermInfos --> <TermInfo>^TermCount
TermInfo --> <Term, DocFreq, FreqDelta, ProxDelta, SkipDelta>
Term --> <PrefixLength, Suffix, FieldNum>
Suffix --> String
PrefixLength, DocFreq, FreqDelta, ProxDelta, SkipDelta --> VInt
This file is sorted by Term. Terms are ordered first lexicographically by the term's field name, and within that lexicographically by the term's text.
TIVersion names the version of the format of this file and is -2 in Lucene 1.4.
Term text prefixes are shared. The PrefixLength is the number of initial characters from the previous term which must be pre-pended to a term's suffix in order to form the term's text. Thus, if the previous term's text was "bone" and the term is "boy", the PrefixLength is two and the suffix is "y".
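前缀共享的还原逻辑很简单:取上一个term文本的前PrefixLength个字符,再拼上本条记录的Suffix。下面用纯Java把"bone"到"boy"的例子演示一遍:

public class PrefixDecodeDemo {
    // previous为上一个term的文本,prefixLength与suffix来自当前记录
    static String decodeTermText(String previous, int prefixLength, String suffix) {
        return previous.substring(0, prefixLength) + suffix;
    }

    public static void main(String[] args) {
        String current = decodeTermText("bone", 2, "y");
        System.out.println(current); // boy:共享前缀"bo",再接后缀"y"
    }
}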
FieldNumber determines the term's field, whose name is stored in the .fnm file.
DocFreq is the count of documents which contain the term.
FreqDelta determines the position of this term's TermFreqs within the .frq file. In particular, it is the difference between the position of this term's data in that file and the position of the previous term's data (or zero, for the first term in the file).
ProxDelta determines the position of this term's TermPositions within the .prx file. In particular, it is the difference between the position of this term's data in that file and the position of the previous term's data (or zero, for the first term in the file).
SkipDelta determines the position of this term's SkipData within the .frq file. In particular, it is the number of bytes after TermFreqs that the SkipData starts. In other words, it is the length of the TermFreq data.
2. The term info index, or .tii file.
This contains every IndexIntervalth entry from the .tis file, along with its location in the "tis" file. This is designed to be read entirely into memory and used to provide random access to the "tis" file.
The structure of this file is very similar to the .tis file, with the addition of one item per record, the IndexDelta.
TermInfoIndex (.tii)--> TIVersion, IndexTermCount, IndexInterval, SkipInterval, TermIndices
TIVersion --> UInt32
IndexTermCount --> UInt64
IndexInterval --> UInt32
SkipInterval --> UInt32
TermIndices --> <TermInfo, IndexDelta>^IndexTermCount
IndexDelta --> VLong
IndexDelta determines the position of this term's TermInfo within the .tis file. In particular, it is the difference between the position of this term's entry in that file and the position of the previous term's entry.
TODO: document skipInterval information
其中IndexDelta就是.tii文件相比.tis文件多出来的内容。
TermPositions
TermPositions是扩展自TermDocs的Interface,用来枚举<document, frequency, <position>*>三元组,
以供按term的查询使用。三元组中的document和frequency部分与TermDocs中的含义相同;positions部分则列出该term在这个document中每一次出现的顺序位置。这个三元组正是倒排索引中posting(倒排记录)列表的一种表示。
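与TermDocs相比,TermPositions在每个<document, frequency>之后还可以逐个取出出现位置。下面是只用IndexReader公开接口的示意(索引目录与term均为假设):

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermPositions;

public class TermPositionsDemo {
    public static void main(String[] args) throws Exception {
        IndexReader reader = IndexReader.open("demoIndex");
        TermPositions positions = reader.termPositions(new Term("contents", "love"));
        while (positions.next()) {
            System.out.print("doc=" + positions.doc() + " freq=" + positions.freq() + " pos=");
            for (int i = 0; i < positions.freq(); i++) {
                System.out.print(positions.nextPosition() + " "); // 依次读出每次出现的位置
            }
            System.out.println();
        }
        positions.close();
        reader.close();
    }
}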
TermPositionVector
扩展自TermFreqVector.比之TermFreqVector扩展了功能,可以提供term所在的位置
TermVectorsReader
用来读取.tvd、.tvf、.tvx三个文件。
TermVectorsWriter
用于构建.tvd、.tvf、.tvx三个文件,这三个文件共同构成TermVector。
1. The Document Index or .tvx file.
This contains, for each document, a pointer to the document data in the Document (.tvd) file.
DocumentIndex (.tvx) --> TVXVersion, <DocumentPosition>^NumDocs
TVXVersion --> Int
DocumentPosition --> UInt64
This is used to find the position of the Document in the .tvd file.
2. The Document or .tvd file.
This contains, for each document, the number of fields, a list of the fields with term vector info and finally a list of pointers to the field information in the .tvf (Term Vector Fields) file.
Document (.tvd) --> TVDVersion, <NumFields, FieldNums, FieldPositions>^NumDocs
TVDVersion --> Int
NumFields --> VInt
FieldNums --> <FieldNumDelta>^NumFields
FieldNumDelta --> VInt
FieldPositions --> <FieldPosition>^NumFields
FieldPosition --> VLong
The .tvd file is used to map out the fields that have term vectors stored and where the field information is in the .tvf file.
3. The Field or .tvf file.
This file contains, for each field that has a term vector stored, a list of the terms and their frequencies.
Field (.tvf) --> TVFVersion, <NumTerms, NumDistinct, TermFreqs>^NumFields
TVFVersion --> Int
NumTerms --> VInt
NumDistinct --> VInt -- Future Use
TermFreqs --> <TermText, TermFreq>^NumTerms
TermText --> <PrefixLength, Suffix>
PrefixLength --> VInt
Suffix --> String
TermFreq --> VInt
Term text prefixes are shared. The PrefixLength is the number of initial characters from the previous term which must be pre-pended to a term's suffix in order to form the term's text. Thus, if the previous term's text was "bone" and the term is "boy", the PrefixLength is two and the suffix is "y".
好的,整个Index包所有类都讲解了,下边咱们开始来编码重新审视一下!
下边来编制一个程序来结束本章的讨论。
package org.apache.lucene.index;
import org.apache.lucene.analysis.*;
import org.apache.lucene.analysis.standard.*;
import org.apache.lucene.store.*;
import org.apache.lucene.document.*;
import org.apache.lucene.demo.*;
import org.apache.lucene.search.*;
import java.io.*;
/** 此程序会尽量用到Lucene index包中的每一个类,并将它们展示给大家。
 * 用到的index包中的类有:
 * DocumentWriter(提供给用户使用的是IndexWriter)
 * FieldInfo(和FieldInfos)
 * SegmentTermDocs(扩展自TermDocs)
 * SegmentReader(扩展自IndexReader,提供给用户使用的是IndexReader)
 * SegmentMerger
 * SegmentTermEnum(扩展自TermEnum)
 * SegmentTermPositions(扩展自TermPositions)
 * SegmentTermVector(扩展自TermFreqVector)
 */
public class TestIndexPackage
{
  //用于将Document加入索引
  public static void indexDocument(String segment, String fileName) throws Exception
  {
    //第二个参数用来控制:当目录不存在时是否创建
    Directory directory = FSDirectory.getDirectory("testIndexPackage", false);
    Analyzer analyzer = new SimpleAnalyzer();
    //最后一个参数(1000)为每一个Field最多拥有的Token个数
    DocumentWriter writer = new DocumentWriter(directory, analyzer, Similarity.getDefault(), 1000);
    File file = new File(fileName);
    //FileDocument将file包装成Document,会在document中创建三个field(path,modified,contents)
    Document doc = FileDocument.Document(file);
    writer.addDocument(segment, doc);
    directory.close();
  }

  //将多个segment进行合并
  public static void merge(String segment1, String segment2, String segmentMerged) throws Exception
  {
    Directory directory = FSDirectory.getDirectory("testIndexPackage", false);
    SegmentReader segmentReader1 = new SegmentReader(new SegmentInfo(segment1, 1, directory));
    SegmentReader segmentReader2 = new SegmentReader(new SegmentInfo(segment2, 1, directory));
    //第三个参数为是否创建.cfs文件
    SegmentMerger segmentMerger = new SegmentMerger(directory, segmentMerged, false);
    segmentMerger.add(segmentReader1);
    segmentMerger.add(segmentReader2);
    segmentMerger.merge();
    segmentMerger.closeReaders();
    directory.close();
  }

  //将segment(即Index的子索引)的所有内容展示出来
  public static void printSegment(String segment) throws Exception
  {
    Directory directory = FSDirectory.getDirectory("testIndexPackage", false);
    SegmentReader segmentReader = new SegmentReader(new SegmentInfo(segment, 1, directory));
    //display documents
    for (int i = 0; i < segmentReader.numDocs(); i++)
      System.out.println(segmentReader.document(i));
    TermEnum termEnum = segmentReader.terms();//此处实际为SegmentTermEnum
    //display term and term positions, termDocs
    while (termEnum.next())
    {
      //toString2()并非Lucene自带的方法,应为作者调试时在Term类中自行添加的打印方法
      System.out.print(termEnum.term().toString2());
      System.out.println(" DocumentFrequency=" + termEnum.docFreq());
      TermPositions termPositions = segmentReader.termPositions(termEnum.term());
      int i = 0;
      while (termPositions.next())
      {
        System.out.println((i++) + "->" + termPositions);
      }
      TermDocs termDocs = segmentReader.termDocs(termEnum.term());//实际为SegmentTermDocs
      while (termDocs.next())
      {
        System.out.println((i++) + "->" + termDocs);
      }
    }
    //display field info
    FieldInfos fieldInfos = segmentReader.fieldInfos;
    FieldInfo pathFieldInfo = fieldInfos.fieldInfo("path");
    FieldInfo modifiedFieldInfo = fieldInfos.fieldInfo("modified");
    FieldInfo contentsFieldInfo = fieldInfos.fieldInfo("contents");
    System.out.println(pathFieldInfo);
    System.out.println(modifiedFieldInfo);
    System.out.println(contentsFieldInfo);
    //display TermFreqVector
    for (int i = 0; i < segmentReader.numDocs(); i++)
    {
      //contents经过分词后的term存于TermFreqVector
      TermFreqVector termFreqVector = segmentReader.getTermFreqVector(i, "contents");
      System.out.println(termFreqVector);
    }
  }

  public static void main(String[] args)
  {
    try
    {
      Directory directory = FSDirectory.getDirectory("testIndexPackage", true);
      directory.close();
      indexDocument("segmentOne", "e:\\lucene\\test.txt");
      //printSegment("segmentOne");
      indexDocument("segmentTwo", "e:\\lucene\\test2.txt");
      //printSegment("segmentTwo");
      merge("segmentOne", "segmentTwo", "merge");
      printSegment("merge");
    }
    catch (Exception e)
    {
      System.out.println("caught a " + e.getCause() + "\n with message:" + e.getMessage());
      e.printStackTrace();
    }
  }
}
看看其结果如下:
Document<Text<path:e:\lucene\test.txt> Keyword<modified:0eg4e221c>>
Document<Text<path:e:\lucene\test2.txt> Keyword<modified:0eg4ee8b4>>
<Term:FieldName,text>=<contents,china> DocumentFrequency=1
0-><doc,TermFrequency,Pos>:< doc=0, TermFrequency=1 Pos=2>
1-><docNumber,freq>=<0,1>
<Term:FieldName,text>=<contents,i> DocumentFrequency=2
0-><doc,TermFrequency,Pos>:< doc=0, TermFrequency=2 Pos=0,3>
1-><doc,TermFrequency,Pos>:< doc=1, TermFrequency=1 Pos=0>
2-><docNumber,freq>=<0,2>
3-><docNumber,freq>=<1,1>
<Term:FieldName,text>=<contents,love> DocumentFrequency=2
0-><doc,TermFrequency,Pos>:< doc=0, TermFrequency=2 Pos=1,4>
1-><doc,TermFrequency,Pos>:< doc=1, TermFrequency=1 Pos=1>
2-><docNumber,freq>=<0,2>
3-><docNumber,freq>=<1,1>
<Term:FieldName,text>=<contents,nankai> DocumentFrequency=1
0-><doc,TermFrequency,Pos>:< doc=1, TermFrequency=1 Pos=2>
1-><docNumber,freq>=<1,1>
<Term:FieldName,text>=<contents,tianjin> DocumentFrequency=1
0-><doc,TermFrequency,Pos>:< doc=0, TermFrequency=1 Pos=5>
1-><docNumber,freq>=<0,1>
<Term:FieldName,text>=<modified,0eg4e221c> DocumentFrequency=1
0-><doc,TermFrequency,Pos>:< doc=0, TermFrequency=1 Pos=0>
1-><docNumber,freq>=<0,1>
<Term:FieldName,text>=<modified,0eg4ee8b4> DocumentFrequency=1
0-><doc,TermFrequency,Pos>:< doc=1, TermFrequency=1 Pos=0>
1-><docNumber,freq>=<1,1>
<Term:FieldName,text>=<path,e> DocumentFrequency=2
0-><doc,TermFrequency,Pos>:< doc=0, TermFrequency=1 Pos=0>
1-><doc,TermFrequency,Pos>:< doc=1, TermFrequency=1 Pos=0>
2-><docNumber,freq>=<0,1>
3-><docNumber,freq>=<1,1>
<Term:FieldName,text>=<path,lucene> DocumentFrequency=2
0-><doc,TermFrequency,Pos>:< doc=0, TermFrequency=1 Pos=1>
1-><doc,TermFrequency,Pos>:< doc=1, TermFrequency=1 Pos=1>
2-><docNumber,freq>=<0,1>
3-><docNumber,freq>=<1,1>
<Term:FieldName,text>=<path,test> DocumentFrequency=2
0-><doc,TermFrequency,Pos>:< doc=0, TermFrequency=1 Pos=2>
1-><doc,TermFrequency,Pos>:< doc=1, TermFrequency=1 Pos=2>
2-><docNumber,freq>=<0,1>
3-><docNumber,freq>=<1,1>
<Term:FieldName,text>=<path,txt> DocumentFrequency=2
0-><doc,TermFrequency,Pos>:< doc=0, TermFrequency=1 Pos=3>
1-><doc,TermFrequency,Pos>:< doc=1, TermFrequency=1 Pos=3>
2-><docNumber,freq>=<0,1>
3-><docNumber,freq>=<1,1>
<fieldName,isIndexed,fieldNumber,storeTermVector>=path,true,3,false>
<fieldName,isIndexed,fieldNumber,storeTermVector>=modified,true,2,false>
<fieldName,isIndexed,fieldNumber,storeTermVector>=contents,true,1,true>
{contents: china/1, i/2, love/2, tianjin/1}
{contents: i/1, love/1, nankai/1}
认真审视其结果,你就会更加明白Lucene底层的索引结构如何。
参考资料:Lucene File Format