Lucene 3.0.2 Code Analysis (repost)

Continuously updated
Document and Field
IndexWriter
IndexReader
Inverted index implementation in Lucene
IndexSearcher
Analyzer
Sort  Filter
Ranking algorithm in Lucene and its improvements



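1. Document and Field 

Document and Field are indispensable when building an index. They can be thought of as the record and the column of a traditional relational database: just as a record has many columns, a Document can hold many Fields, which makes it easy to support different kinds of queries. A Field might hold the file content, the file name, the creation time, or the modification time. Each Field has three attributes: stored (this.isStored = store.isStored()), indexed (this.isIndexed = index.isIndexed()), and tokenized (this.isTokenized = index.isAnalyzed()). They are chosen according to need; for example, document content may not need to be stored but does need to be indexed. The underlying source code also enforces some restrictions: for instance, a Field cannot be neither indexed nor stored. 

The main methods of Document add, remove, and look up Fields. The main API in 3.0.2 is as follows: 
Java code 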
void    add(Fieldable field)   
         Adds a field to a document.  
String  get(String name)   
         Returns the string value of the field with the given name if any exist in this document, or null.  
Field   getField(String name)   
         Returns a field with the given name if any exist in this document, or null.  
List<Fieldable>   getFields()   
         Returns a List of all the fields in a document.  
Field[] getFields(String name)   
         Returns an array of Fields with the given name.  
void    removeField(String name)   
         Removes field with the specified name from the document.  
void    removeFields(String name)   
         Removes all fields with the given name from the document.  
String  toString()   
         Prints the fields of a document for human consumption.  
...  


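Field's two main constructors are shown below; they help in understanding the Field attributes (the full source file is worth reading as well). 
Java code 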
/** 
 * Create a field by specifying its name, value and how it will 
 * be saved in the index. 
 *  
 * @param name The name of the field 
 * @param internName Whether to .intern() name or not 
 * @param value The string to process 
 * @param store Whether <code>value</code> should be stored in the index 
 * @param index Whether the field should be indexed, and if so, if it should 
 *  be tokenized before indexing  
 * @param termVector Whether term vector should be stored 
 * @throws NullPointerException if name or value is <code>null</code> 
 * @throws IllegalArgumentException in any of the following situations: 
 * <ul>  
 *  <li>the field is neither stored nor indexed</li>  
 *  <li>the field is not indexed but termVector is <code>TermVector.YES</code></li> 
 * </ul>  
 */   
public Field(String name, boolean internName, String value, Store store, Index index, TermVector termVector) {  
  if (name == null)  
    throw new NullPointerException("name cannot be null");  
  if (value == null)  
    throw new NullPointerException("value cannot be null");  
  if (name.length() == 0 && value.length() == 0)  
    throw new IllegalArgumentException("name and value cannot both be empty");  
  if (index == Index.NO && store == Store.NO)  
    throw new IllegalArgumentException("it doesn't make sense to have a field that "  
       + "is neither indexed nor stored");  
  if (index == Index.NO && termVector != TermVector.NO)  
    throw new IllegalArgumentException("cannot store term vector information "  
       + "for a field that is not indexed");  
          
  if (internName) // field names are optionally interned  
    name = StringHelper.intern(name);  
    
  this.name = name;   
    
  this.fieldsData = value;  
  
  this.isStored = store.isStored();  
   
  this.isIndexed = index.isIndexed();  
  this.isTokenized = index.isAnalyzed();  
  this.omitNorms = index.omitNorms();  
  if (index == Index.NO) {  
    this.omitTermFreqAndPositions = false;  
  }      
  
  this.isBinary = false;  
  
  setStoreTermVector(termVector);  
}  


Java code 
/** 
  * Create a tokenized and indexed field that is not stored, optionally with  
  * storing term vectors.  The Reader is read only when the Document is added to the index, 
  * i.e. you may not close the Reader until {@link IndexWriter#addDocument(Document)} 
  * has been called. 
  *  
  * @param name The name of the field 
  * @param reader The reader with the content 
  * @param termVector Whether term vector should be stored 
  * @throws NullPointerException if name or reader is <code>null</code> 
  */   
 public Field(String name, Reader reader, TermVector termVector) {  
   if (name == null)  
     throw new NullPointerException("name cannot be null");  
   if (reader == null)  
     throw new NullPointerException("reader cannot be null");  
     
   this.name = StringHelper.intern(name);        // field names are interned  
   this.fieldsData = reader;  
     
   this.isStored = false;  
   this.isIndexed = true;  
   this.isTokenized = true;  
   this.isBinary = false;  
     
   setStoreTermVector(termVector);  
 }  



The other constructors simply delegate to these two main constructors. Some of the more commonly used ones: 
Java code 
public Field(String name, String value, Store store, Index index) {  
  this(name, value, store, index, TermVector.NO);  
}  

Java code 
public Field(String name, Reader reader) {  
  this(name, reader, TermVector.NO);  
}  



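Reading the three static enums Store, Index, and TermVector in the Field source makes it clearer how each Field attribute gets its value (in earlier versions these were three static inner classes of constants rather than enums). 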
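To make the relationship between the enum values and the flags concrete, here is a minimal sketch in plain Java that does not depend on Lucene. ToyStore, ToyIndex, and ToyField are hypothetical names invented for this illustration, and the model leaves out term vectors, norms, and name interning. Only the "neither indexed nor stored" check and the three flag assignments mirror the constructor quoted above.

```java
// Simplified model of how Field derives its flags; NOT Lucene's actual classes.
enum ToyStore {
    YES(true), NO(false);

    private final boolean stored;
    ToyStore(boolean stored) { this.stored = stored; }
    boolean isStored() { return stored; }
}

enum ToyIndex {
    NO(false, false), ANALYZED(true, true), NOT_ANALYZED(true, false);

    private final boolean indexed;
    private final boolean analyzed;
    ToyIndex(boolean indexed, boolean analyzed) {
        this.indexed = indexed;
        this.analyzed = analyzed;
    }
    boolean isIndexed() { return indexed; }
    boolean isAnalyzed() { return analyzed; }
}

final class ToyField {
    final String name;
    final String value;
    final boolean isStored;
    final boolean isIndexed;
    final boolean isTokenized;

    ToyField(String name, String value, ToyStore store, ToyIndex index) {
        if (name == null) throw new NullPointerException("name cannot be null");
        if (value == null) throw new NullPointerException("value cannot be null");
        // Same rule as the quoted constructor: a field must be indexed or stored.
        if (index == ToyIndex.NO && store == ToyStore.NO)
            throw new IllegalArgumentException("field is neither indexed nor stored");
        this.name = name;
        this.value = value;
        // Each flag is read straight from the enum constant that was passed in.
        this.isStored = store.isStored();
        this.isIndexed = index.isIndexed();
        this.isTokenized = index.isAnalyzed();
    }
}

class ToyFieldDemo {
    public static void main(String[] args) {
        // Content that is searchable but not kept verbatim in the index.
        ToyField content = new ToyField("content", "some long text", ToyStore.NO, ToyIndex.ANALYZED);
        System.out.println(content.isStored + " " + content.isIndexed + " " + content.isTokenized);
        try {
            new ToyField("useless", "x", ToyStore.NO, ToyIndex.NO);
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

Putting the booleans on the enum constants keeps the constructor free of switch statements, which is presumably why the real enums expose isStored(), isIndexed(), and isAnalyzed().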

2. IndexWriter 
See also my earlier blog post: http://hanyuanbo.iteye.com/blog/812135 
The following is excerpted from the first three paragraphs of the IndexWriter JavaDoc: 
Quote:
An IndexWriter creates and maintains an index. 

The create argument to the constructor determines whether a new index is created, or whether an existing index is opened. Note that you can open an index with create=true even while readers are using the index. The old readers will continue to search the "point in time" snapshot they had opened, and won't see the newly created index until they re-open. There are also constructors with no create argument which will create a new index if there is not already an index at the provided path and otherwise open the existing index. 

In either case, documents are added with addDocument and removed with deleteDocuments(Term) or deleteDocuments(Query). A document can be updated with updateDocument (which just deletes and then adds the entire document). When finished adding, deleting and updating documents, close should be called. 

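(One point made here: when the constructor is not told whether to create or append, the index is created if it does not exist and the existing index is opened if it does.) 
Quote: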

Expert: IndexWriter allows an optional IndexDeletionPolicy implementation to be specified. 

Expert: IndexWriter allows you to separately change the MergePolicy and the MergeScheduler. 


Of the five constructors below, three are marked Expert; the other two are enough for normal use. 
IndexWriter(Directory d, Analyzer a, boolean create, IndexDeletionPolicy deletionPolicy, IndexWriter.MaxFieldLength mfl)	          Expert: constructs an IndexWriter with a custom IndexDeletionPolicy, for the index in d.
IndexWriter(Directory d, Analyzer a, IndexDeletionPolicy deletionPolicy, IndexWriter.MaxFieldLength mfl)	          Expert: constructs an IndexWriter with a custom IndexDeletionPolicy, for the index in d, first creating it if it does not already exist.
IndexWriter(Directory d, Analyzer a, IndexDeletionPolicy deletionPolicy, IndexWriter.MaxFieldLength mfl, IndexCommit commit)	          Expert: constructs an IndexWriter on specific commit point, with a custom IndexDeletionPolicy, for the index in d.
IndexWriter(Directory d, Analyzer a, IndexWriter.MaxFieldLength mfl)	          Constructs an IndexWriter for the index in d, first creating it if it does not already exist.
IndexWriter(Directory d, Analyzer a, boolean create, IndexWriter.MaxFieldLength mfl)	          Constructs an IndexWriter for the index in d.


In the source code, all of them actually call a private init method. 
Java code 
private void init(Directory d, Analyzer a, final boolean create,    
                    IndexDeletionPolicy deletionPolicy, int maxFieldLength,  
                    IndexingChain indexingChain, IndexCommit commit)  
    throws CorruptIndexException, LockObtainFailedException, IOException {  
        ...// in earlier versions a private constructor was called instead  
}  


The IndexWriter methods used to add documents while building an index: 
void	addDocument(Document doc)	          Adds a document to this index.
void	addDocument(Document doc, Analyzer analyzer)	          Adds a document to this index, using the provided analyzer instead of the value of getAnalyzer().


3. IndexReader 

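IndexReader is used to work with an existing index file, including modifications such as deleting documents. Instances are obtained from the following static open methods: 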
static IndexReader	open(Directory directory)	          Returns a IndexReader reading the index in the given Directory, with readOnly=true.
static IndexReader	open(Directory directory, boolean readOnly)	          Returns an IndexReader reading the index in the given Directory.
static IndexReader	open(Directory directory, IndexDeletionPolicy deletionPolicy, boolean readOnly)	          Expert: returns an IndexReader reading the index in the given Directory, with a custom IndexDeletionPolicy.
static IndexReader	open(Directory directory, IndexDeletionPolicy deletionPolicy, boolean readOnly, int termInfosIndexDivisor)	          Expert: returns an IndexReader reading the index in the given Directory, with a custom IndexDeletionPolicy.
static IndexReader	open(IndexCommit commit, boolean readOnly)	          Expert: returns an IndexReader reading the index in the given IndexCommit.
static IndexReader	open(IndexCommit commit, IndexDeletionPolicy deletionPolicy, boolean readOnly)	          Expert: returns an IndexReader reading the index in the given Directory, using a specific commit and with a     custom IndexDeletionPolicy.
static IndexReader	open(IndexCommit commit, IndexDeletionPolicy deletionPolicy, boolean readOnly, int  termInfosIndexDivisor)	          Expert: returns an IndexReader reading the index in the given Directory, using a specific commit and with a  custom IndexDeletionPolicy.



Several of its methods involve the Term class, whose constructors are very simple: 

Term(String fld)	          Constructs a Term with the given field and empty text.
Term(String fld, String txt)	          Constructs a Term with the given field and text.



The commonly used and easy-to-understand IndexReader methods are: 


Document	document(int n)	          Returns the stored fields of the nth Document in this index.
abstract  int	numDocs()	          Returns the number of documents in this index.
abstract  TermDocs	termDocs()	          Returns an unpositioned TermDocs enumerator.
TermDocs	termDocs(Term term)	          Returns an enumeration of all the documents which contain term.
abstract  TermPositions	termPositions()	          Returns an unpositioned TermPositions enumerator.
TermPositions	termPositions(Term term)	          Returns an enumeration of all the documents which contain term.
abstract  TermEnum	terms()	          Returns an enumeration of all the terms in the index.
abstract  TermEnum	terms(Term t)	          Returns an enumeration of all terms starting at a given term.
void	deleteDocument(int docNum)	          Deletes the document numbered docNum.
int	deleteDocuments(Term term)	          Deletes all documents that have a given term indexed.



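The following code shows how to use an IndexReader to access Terms and to delete documents (when deleting, be sure to close the reader, since closing is what saves the deletions to the index). 

Java code 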
package com.eric.lucene;  
  
import java.io.File;  
import java.io.IOException;  
  
import org.apache.lucene.analysis.standard.StandardAnalyzer;  
import org.apache.lucene.document.Document;  
import org.apache.lucene.document.Field;  
import org.apache.lucene.index.CorruptIndexException;  
import org.apache.lucene.index.IndexReader;  
import org.apache.lucene.index.IndexWriter;  
import org.apache.lucene.index.Term;  
import org.apache.lucene.index.TermDocs;  
import org.apache.lucene.index.TermPositions;  
import org.apache.lucene.store.FSDirectory;  
import org.apache.lucene.store.LockObtainFailedException;  
import org.apache.lucene.util.Version;  
  
public class IndexReaderTest {  
    private File path ;  
      
      
    public IndexReaderTest(String path) {  
        this.path = new File(path);  
    }  
  
    public void createIndex(){  
        try {  
            IndexWriter writer = new IndexWriter(FSDirectory.open(this.path),new StandardAnalyzer(  
                    Version.LUCENE_30), IndexWriter.MaxFieldLength.LIMITED);  
            Document doc1 = new Document();  
            Document doc2 = new Document();  
            Document doc3 = new Document();  
            doc1.add(new Field("bookname", "thinking in java -- java 4", Field.Store.YES, Field.Index.ANALYZED));  
            doc2.add(new Field("bookname", "java core 2", Field.Store.YES, Field.Index.ANALYZED));  
            doc3.add(new Field("bookname", "thinking in c++", Field.Store.YES, Field.Index.ANALYZED));  
            writer.addDocument(doc1);  
            writer.addDocument(doc2);  
            writer.addDocument(doc3);  
            writer.close();  
        } catch (CorruptIndexException e) {  
            e.printStackTrace();  
        } catch (LockObtainFailedException e) {  
            e.printStackTrace();  
        } catch (IOException e) {  
            e.printStackTrace();  
        }  
    }  
      
    public void test1(){  
        try {  
            IndexReader reader = IndexReader.open(FSDirectory.open(this.path));  
            System.out.println("version:\t" + reader.getVersion());  
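            int num = reader.maxDoc();  // document numbers run up to maxDoc(); numDocs() excludes deleted documents  
            for(int i=0;i<num;i++){  
                if(reader.isDeleted(i)) continue;  // document(i) cannot be called for a deleted document  
                Document doc = reader.document(i);  
                System.out.println(doc);  
            }  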
              
            Term term = new Term("bookname","java");  
            TermDocs docs = reader.termDocs(term);  
            while(docs.next()){  
                System.out.print("doc num:\t" + docs.doc() + "\t\t");  
                System.out.println("frequency:\t" + docs.freq());  
            }  
              
            reader.close();  
              
        } catch (CorruptIndexException e) {  
            e.printStackTrace();  
        } catch (IOException e) {  
            e.printStackTrace();  
        }  
    }  
//  version:    1289906350314  
//  Document<stored,indexed,tokenized<bookname:thinking in java -- java 4>>  
//  Document<stored,indexed,tokenized<bookname:java core 2>>  
//  Document<stored,indexed,tokenized<bookname:thinking in c++>>  
//  doc num:    0       frequency:  2  
//  doc num:    1       frequency:  1  
      
    public void test2(){  
        try {  
            IndexReader reader = IndexReader.open(FSDirectory.open(this.path));  
            System.out.println("version:\t" + reader.getVersion());  
              
            Term term = new Term("bookname","java");  
            TermPositions pos = reader.termPositions(term);  
            while(pos.next()){  
                System.out.print("frequency: " + pos.freq() + "\t");  
                for(int i=0;i<pos.freq();i++){  
                    System.out.print("pos: " + pos.nextPosition() + "\t");  
                }  
                System.out.println();  
            }  
            reader.close();  
              
        } catch (CorruptIndexException e) {  
            e.printStackTrace();  
        } catch (IOException e) {  
            e.printStackTrace();  
        }  
    }  
//  version:    1289906350314  
//  frequency: 2    pos: 2  pos: 3    
//  frequency: 1    pos: 0  
//  createIndex() was not called the second time, so the version number is unchanged  
      
    public void delete1(){  
        try {  
            IndexReader reader = IndexReader.open(FSDirectory.open(this.path), false);// readOnly must be false to allow deletes  
            System.out.println("version:\t" + reader.getVersion());  
            System.out.println("num:\t" + reader.numDocs());  
            reader.deleteDocument(2);// delete the c++ Document (document number 2)  
            reader.close();  
              
              
            reader = IndexReader.open(FSDirectory.open(this.path), false);  
            System.out.println("version:\t" + reader.getVersion());  
            System.out.println("num:\t" + reader.numDocs());  
            reader.close();  
              
        } catch (CorruptIndexException e) {  
            e.printStackTrace();  
        } catch (IOException e) {  
            e.printStackTrace();  
        }  
    }  
//  version:    1289906350314  
//  num:    3  
//  version:    1289906350315  
//  num:    2  
  
    public void delete2(){  
        try {  
            IndexReader reader = IndexReader.open(FSDirectory.open(this.path), false);// readOnly must be false to allow deletes  
            System.out.println("version:\t" + reader.getVersion());  
            System.out.println("num:\t" + reader.numDocs());  
            Term term = new Term("bookname","java");  
            reader.deleteDocuments(term);// delete every Document whose bookname contains the term java  
            reader.close();  
              
              
            reader = IndexReader.open(FSDirectory.open(this.path), false);  
            System.out.println("version:\t" + reader.getVersion());  
            System.out.println("num:\t" + reader.numDocs());  
            reader.close();  
              
        } catch (CorruptIndexException e) {  
            e.printStackTrace();  
        } catch (IOException e) {  
            e.printStackTrace();  
        }  
    }  
//  version:    1289906350315  
//  num:    2  
//  version:    1289906350316  
//  num:    0  
  
      
    public static void main(String[] args) {  
        String path = "E:\\indexReaderTest";  
        IndexReaderTest test = new IndexReaderTest(path);  
//      test.createIndex();  
//      test.test1();  
//      test.test2();  
//      test.delete1();  
        test.delete2();  
    }  
}  

Notes: 
First run: 
Java code 
String path = "E:\\indexReaderTest";  
IndexReaderTest test = new IndexReaderTest(path);  
test.createIndex();  
test.test1();  

Then run: 
Java code 
String path = "E:\\indexReaderTest";  
IndexReaderTest test = new IndexReaderTest(path);  
test.test2();  

Then run: 
Java code 
String path = "E:\\indexReaderTest";  
IndexReaderTest test = new IndexReaderTest(path);  
test.delete1();  

Then run: 
Java code 
String path = "E:\\indexReaderTest";  
IndexReaderTest test = new IndexReaderTest(path);  
test.delete2();  


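4. Inverted index implementation in Lucene 
The following blog post gives a brief explanation of how an inverted index works. 
http://jackyrong.iteye.com/blog/238940 
Reading the source code turns up a static constant, static final IndexingChain DefaultIndexingChain (in the 3.0.x source it is declared in DocumentsWriter, which IndexWriter uses internally), shown below: 
Java code 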
static final IndexingChain DefaultIndexingChain = new IndexingChain() {  
  
  @Override  
  DocConsumer getChain(DocumentsWriter documentsWriter) {  
    /* 
    This is the current indexing chain: 
 
    DocConsumer / DocConsumerPerThread 
      --> code: DocFieldProcessor / DocFieldProcessorPerThread 
        --> DocFieldConsumer / DocFieldConsumerPerThread / DocFieldConsumerPerField 
          --> code: DocFieldConsumers / DocFieldConsumersPerThread / DocFieldConsumersPerField 
            --> code: DocInverter / DocInverterPerThread / DocInverterPerField 
              --> InvertedDocConsumer / InvertedDocConsumerPerThread / InvertedDocConsumerPerField 
                --> code: TermsHash / TermsHashPerThread / TermsHashPerField 
                  --> TermsHashConsumer / TermsHashConsumerPerThread / TermsHashConsumerPerField 
                    --> code: FreqProxTermsWriter / FreqProxTermsWriterPerThread / FreqProxTermsWriterPerField 
                    --> code: TermVectorsTermsWriter / TermVectorsTermsWriterPerThread / TermVectorsTermsWriterPerField 
              --> InvertedDocEndConsumer / InvertedDocConsumerPerThread / InvertedDocConsumerPerField 
                --> code: NormsWriter / NormsWriterPerThread / NormsWriterPerField 
            --> code: StoredFieldsWriter / StoredFieldsWriterPerThread / StoredFieldsWriterPerField 
  */  
  
  // Build up indexing chain:  
  
    final TermsHashConsumer termVectorsWriter = new TermVectorsTermsWriter(documentsWriter);  
    final TermsHashConsumer freqProxWriter = new FreqProxTermsWriter();  
  
    final InvertedDocConsumer  termsHash = new TermsHash(documentsWriter, true, freqProxWriter,  
                                                         new TermsHash(documentsWriter, false, termVectorsWriter, null));  
    final NormsWriter normsWriter = new NormsWriter();  
    final DocInverter docInverter = new DocInverter(termsHash, normsWriter);  
    return new DocFieldProcessor(documentsWriter, docInverter);  
  }  
};  

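The comment lays out clearly how the whole processing chain is put together. The API documentation has no description of these Inverted* classes, so they have to be read in the source files. 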
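To give a feel for the data structure this chain ends up producing, here is a toy in-memory inverted index in plain Java. It is only a sketch: ToyInvertedIndex, its tokenizer, and its three-word stop list are assumptions made for this example and bear no relation to how DocInverter or FreqProxTermsWriter are actually written. It maps each term to the documents that contain it and to the positions inside each document, which is the information TermDocs and TermPositions exposed in the IndexReader test above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Toy inverted index: term -> (docId -> positions). NOT Lucene's implementation.
class ToyInvertedIndex {
    // Hypothetical stop list used only by this example.
    private static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList("in", "the", "a"));

    private final Map<String, TreeMap<Integer, List<Integer>>> postings = new TreeMap<>();
    private int nextDocId = 0;

    /** Tokenizes the text, records each term's positions, and returns the new doc id. */
    int addDocument(String text) {
        int docId = nextDocId++;
        int position = 0;
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (token.isEmpty()) continue;       // split() can yield a leading empty string
            if (!STOP_WORDS.contains(token)) {
                postings.computeIfAbsent(token, t -> new TreeMap<>())
                        .computeIfAbsent(docId, d -> new ArrayList<>())
                        .add(position);
            }
            position++;                          // a removed stop word still consumes a position
        }
        return docId;
    }

    /** Doc ids containing the term, in increasing order (the role TermDocs plays). */
    List<Integer> docs(String term) {
        return new ArrayList<>(postings.getOrDefault(term, new TreeMap<>()).keySet());
    }

    /** Positions of the term inside one document (the role TermPositions plays). */
    List<Integer> positions(String term, int docId) {
        TreeMap<Integer, List<Integer>> byDoc = postings.get(term);
        if (byDoc == null || !byDoc.containsKey(docId)) return new ArrayList<>();
        return new ArrayList<>(byDoc.get(docId));
    }

    public static void main(String[] args) {
        ToyInvertedIndex index = new ToyInvertedIndex();
        index.addDocument("thinking in java -- java 4");
        index.addDocument("java core 2");
        index.addDocument("thinking in c++");
        for (int doc : index.docs("java")) {
            List<Integer> pos = index.positions("java", doc);
            System.out.println("doc " + doc + " frequency " + pos.size() + " positions " + pos);
        }
    }
}
```

With the three book titles from the earlier test, the sketch reports frequency 2 at positions 2 and 3 for document 0 and frequency 1 at position 0 for document 1, the same numbers that test1() and test2() printed. The position gap left by the stop word "in" is modelled on purpose so that the positions line up.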

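5. IndexSearcher 
The interfaces implemented by Searcher and its class hierarchy are as follows (taken from the API documentation; for basic usage see my earlier blog post http://hanyuanbo.iteye.com/blog/812135) 
Quote: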
org.apache.lucene.search 
Class Searcher 
java.lang.Object 
        org.apache.lucene.search.Searcher 
All Implemented Interfaces: 
        Closeable, Searchable 
Direct Known Subclasses: 
        IndexSearcher, MultiSearcher


The search method has many overloads; the following list is taken from the API documentation. 
void	search(Query query, Collector results)	          Lower-level search API.
void	search(Query query, Filter filter, Collector results)	          Lower-level search API.
TopDocs	search(Query query, Filter filter, int n)	          Finds the top n hits for query, applying filter if non-null.
TopFieldDocs	search(Query query, Filter filter, int n, Sort sort)	          Search implementation with arbitrary sorting.
TopDocs	search(Query query, int n)	          Finds the top n hits for query.
abstract  void	search(Weight weight, Filter filter, Collector results)	          Lower-level search API.
abstract  TopDocs	search(Weight weight, Filter filter, int n)	          Expert: Low-level search implementation.
abstract  TopFieldDocs	search(Weight weight, Filter filter, int n, Sort sort)	          Expert: Low-level search implementation with arbitrary sorting.

There is also a very useful method (abstract in Searcher, implemented in the subclasses): 
abstract  Document	doc(int i)	          Returns the stored fields of document i.


In the source code, the search overloads in the abstract class Searcher look like this: 
Java code 
/** Search implementation with arbitrary sorting.  Finds 
   * the top <code>n</code> hits for <code>query</code>, applying 
   * <code>filter</code> if non-null, and sorting the hits by the criteria in 
   * <code>sort</code>. 
   *  
   * <p>NOTE: this does not compute scores by default; use 
   * {@link IndexSearcher#setDefaultFieldSortScoring} to 
   * enable scoring. 
   * 
   * @throws BooleanQuery.TooManyClauses 
   */  
  public TopFieldDocs search(Query query, Filter filter, int n,  
                             Sort sort) throws IOException {  
    return search(createWeight(query), filter, n, sort);  
  }  
  
  /** Lower-level search API. 
  * 
  * <p>{@link Collector#collect(int)} is called for every matching document. 
  * 
  * <p>Applications should only use this if they need <i>all</i> of the 
  * matching documents.  The high-level search API ({@link 
  * Searcher#search(Query, int)}) is usually more efficient, as it skips 
  * non-high-scoring hits. 
  * <p>Note: The <code>score</code> passed to this method is a raw score. 
  * In other words, the score will not necessarily be a float whose value is 
  * between 0 and 1. 
  * @throws BooleanQuery.TooManyClauses 
  */  
 public void search(Query query, Collector results)  
   throws IOException {  
   search(createWeight(query), null, results);  
 }  
  
  /** Lower-level search API. 
   * 
   * <p>{@link Collector#collect(int)} is called for every matching 
   * document. 
   * <br>Collector-based access to remote indexes is discouraged. 
   * 
   * <p>Applications should only use this if they need <i>all</i> of the 
   * matching documents.  The high-level search API ({@link 
   * Searcher#search(Query, Filter, int)}) is usually more efficient, as it skips 
   * non-high-scoring hits. 
   * 
   * @param query to match documents 
   * @param filter if non-null, used to permit documents to be collected. 
   * @param results to receive hits 
   * @throws BooleanQuery.TooManyClauses 
   */  
  public void search(Query query, Filter filter, Collector results)  
  throws IOException {  
    search(createWeight(query), filter, results);  
  }  
  
  /** Finds the top <code>n</code> 
   * hits for <code>query</code>, applying <code>filter</code> if non-null. 
   * 
   * @throws BooleanQuery.TooManyClauses 
   */  
  public TopDocs search(Query query, Filter filter, int n)  
    throws IOException {  
    return search(createWeight(query), filter, n);  
  }  
  
  /** Finds the top <code>n</code> 
   * hits for <code>query</code>. 
   * 
   * @throws BooleanQuery.TooManyClauses 
   */  
  public TopDocs search(Query query, int n)  
    throws IOException {  
    return search(query, null, n);  
  }  
  ...  
  abstract public void search(Weight weight, Filter filter, Collector results) throws IOException;  

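Searcher does not implement the actual search itself; that is left to the subclasses, and every call ultimately ends up in the 
Java code 
search(Weight weight, Filter filter, Collector results)  

version. The other search overloads, which take a Query argument, all implicitly call createWeight(query). 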
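The way the overloads funnel into one abstract method can be sketched in plain Java. ToySearcher, ToyWeight, ToyCollector, and ToyArraySearcher below are hypothetical stand-ins invented for this example. The real Query, Weight, and Collector types do far more (query rewriting, scoring, per-segment readers); only the delegation pattern is meant to carry over.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Toy model of the overload funnel; NOT Lucene's Searcher.
abstract class ToySearcher {
    /** Stand-in for Weight: here just the normalized query string. */
    static final class ToyWeight {
        final String term;
        ToyWeight(String term) { this.term = term; }
    }

    /** Stand-in for Collector: receives every matching document number. */
    interface ToyCollector {
        void collect(int doc);
    }

    // Convenience overload: converts the query, then forwards to the abstract method.
    public void search(String query, ToyCollector results) {
        search(createWeight(query), results);
    }

    // Convenience overload with a return value: supplies its own collector.
    public List<Integer> search(String query) {
        List<Integer> hits = new ArrayList<>();
        search(query, doc -> hits.add(doc));
        return hits;
    }

    protected ToyWeight createWeight(String query) {
        return new ToyWeight(query.trim().toLowerCase(Locale.ROOT));
    }

    // The single method a subclass has to implement.
    public abstract void search(ToyWeight weight, ToyCollector results);
}

// A trivial subclass that "searches" an array of strings by substring match.
class ToyArraySearcher extends ToySearcher {
    private final String[] docs;

    ToyArraySearcher(String... docs) { this.docs = docs; }

    @Override
    public void search(ToyWeight weight, ToyCollector results) {
        for (int i = 0; i < docs.length; i++) {
            if (docs[i].toLowerCase(Locale.ROOT).contains(weight.term)) results.collect(i);
        }
    }

    public static void main(String[] args) {
        ToySearcher searcher = new ToyArraySearcher("thinking in java", "java core", "thinking in c++");
        System.out.println(searcher.search(" Java "));
    }
}
```

The benefit of this layout is that a subclass only has to get one method right, while callers keep the convenient Query-based signatures.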

In the IndexSearcher class there are two main search methods (all the other overloads call one of the two): 
Java code 
  @Override  
  public void search(Weight weight, Filter filter, Collector collector)  
      throws IOException {  
      
    if (filter == null) {  
      for (int i = 0; i < subReaders.length; i++) { // search each subreader  
        collector.setNextReader(subReaders[i], docStarts[i]);  
        Scorer scorer = weight.scorer(subReaders[i], !collector.acceptsDocsOutOfOrder(), true);  
        if (scorer != null) {  
          scorer.score(collector);  
        }  
      }  
    } else {  
      for (int i = 0; i < subReaders.length; i++) { // search each subreader  
        collector.setNextReader(subReaders[i], docStarts[i]);  
        searchWithFilter(subReaders[i], weight, filter, collector);  
      }  
    }  
  }  
  
  ...  
  
private void searchWithFilter(IndexReader reader, Weight weight,  
      final Filter filter, final Collector collector) throws IOException {  
  ...  
}  

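As the code shows, the main difference between them is whether a Filter is used in the search. The search overloads that have a return value also call one of the two methods above and simply end by returning the collected results, for example: 
Java code 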
return (TopFieldDocs) collector.topDocs();  

For simple use, calling the methods declared in the abstract class Searcher (the parent class) is enough. 

A number of other classes are used to support searching: 
QueryParser
Query
TopScoreDocCollector
TopDocs
ScoreDoc
Document


The TopScoreDocCollector class deserves attention: it stores the search results. Its class hierarchy is as follows (taken from the API documentation): 
Quote:

org.apache.lucene.search 
    Class TopScoreDocCollector 
java.lang.Object 
  org.apache.lucene.search.Collector 
      org.apache.lucene.search.TopDocsCollector<ScoreDoc> 
          org.apache.lucene.search.TopScoreDocCollector 

Its more commonly used methods include (taken from the API documentation): 
int	getTotalHits()	          The total number of documents that matched this query.
TopDocs	topDocs()	          Returns the top docs that were collected by this collector.
TopDocs	topDocs(int start)	          Returns the documents in the range [start ..
TopDocs	topDocs(int start, int howMany)	          Returns the documents in the range [start ..

The TopDocs class returned by topDocs() has the following two fields: 
ScoreDoc[]	scoreDocs	          The top hits for the query.
int	totalHits	          The total number of hits for the query.

The ScoreDoc class in turn has two fields: 
int	doc	          Expert: A hit document's number.
float	score	          Expert: The score of this document for the query.

In this way the doc (document number) and the score of each hit can be obtained. 
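The idea behind a top-N collector can be sketched with a bounded priority queue. The code below is a plain-Java illustration, not Lucene's TopScoreDocCollector: ToyTopCollector and Hit are hypothetical names, and the tie-breaking rule (prefer the lower document number when scores are equal) is an assumption of this sketch. It counts every match in totalHits but keeps only the best numHits entries, which is the split between totalHits and scoreDocs described above.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Toy top-N collector; NOT Lucene's TopScoreDocCollector.
class ToyTopCollector {
    /** Mirrors the two public fields of ScoreDoc. */
    static final class Hit {
        final int doc;
        final float score;
        Hit(int doc, float score) { this.doc = doc; this.score = score; }
    }

    // Orders the weakest hit first: lower score, then higher doc number on ties.
    private static final Comparator<Hit> WEAKEST_FIRST = (a, b) -> {
        int byScore = Float.compare(a.score, b.score);
        return byScore != 0 ? byScore : Integer.compare(b.doc, a.doc);
    };

    private final int numHits;
    private final PriorityQueue<Hit> queue;
    private int totalHits = 0;

    ToyTopCollector(int numHits) {
        if (numHits <= 0) throw new IllegalArgumentException("numHits must be > 0");
        this.numHits = numHits;
        this.queue = new PriorityQueue<>(numHits, WEAKEST_FIRST);
    }

    /** Called once per matching document. */
    void collect(int doc, float score) {
        totalHits++;                       // every match counts, even if it is not retained
        Hit hit = new Hit(doc, score);
        if (queue.size() < numHits) {
            queue.add(hit);
        } else if (WEAKEST_FIRST.compare(hit, queue.peek()) > 0) {
            queue.poll();                  // evict the current weakest hit
            queue.add(hit);
        }
    }

    int getTotalHits() { return totalHits; }

    /** Returns the retained hits, best first. */
    List<Hit> topDocs() {
        List<Hit> result = new ArrayList<>(queue);
        result.sort(WEAKEST_FIRST.reversed());
        return result;
    }

    public static void main(String[] args) {
        ToyTopCollector collector = new ToyTopCollector(2);
        collector.collect(0, 0.5f);
        collector.collect(1, 0.9f);
        collector.collect(2, 0.7f);
        System.out.println("totalHits = " + collector.getTotalHits());
        for (Hit hit : collector.topDocs()) {
            System.out.println("doc=" + hit.doc + " score=" + hit.score);
        }
    }
}
```

Keeping the weakest hit at the head of the queue means each new match needs only one comparison to decide whether it belongs in the top N, so memory stays bounded by numHits no matter how many documents match.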

6. Analyzer 
7. Sort  Filter 
8. Ranking algorithm in Lucene and its improvements
 