Lucene 3.0.2 Code Analysis (Repost)

liuxinglanyue
Continuously updated
Document and Field
IndexWriter
IndexReader
Inverted index implementation in Lucene
IndexSearcher
Analyzer
Sort and Filter
Ranking algorithm in Lucene and improvements



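1. Document and Field 

Document and Field are indispensable when building an index. They can be thought of as the record and the column of a traditional relational database: a record has many columns, and likewise a Document can hold many Fields, which makes it easy to support different kinds of queries. A Field might be the file content, the file name, the creation time, the modification time, and so on. Each Field carries these attributes: whether it is stored (this.isStored = store.isStored()), whether it is indexed (this.isIndexed = index.isIndexed()), and whether it is tokenized (this.isTokenized = index.isAnalyzed()). Choose them according to need; document content, for example, often does not need to be stored but does need to be indexed. The underlying source code also imposes some restrictions: for example, a Field that is neither indexed nor stored is not allowed. 
    
The main methods of Document add, remove and look up Fields. The main API in 3.0.2 is as follows: 
Java code 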
void    add(Fieldable field)   
         Adds a field to a document.  
String  get(String name)   
         Returns the string value of the field with the given name if any exist in this document, or null.  
Field   getField(String name)   
         Returns a field with the given name if any exist in this document, or null.  
List<Fieldable>   getFields()   
         Returns a List of all the fields in a document.  
Field[] getFields(String name)   
         Returns an array of Fields with the given name.  
void    removeField(String name)   
         Removes field with the specified name from the document.  
void    removeFields(String name)   
         Removes all fields with the given name from the document.  
String  toString()   
         Prints the fields of a document for human consumption.  
...  
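
To make the lookup and removal semantics above concrete, here is a minimal sketch using only the JDK. ToyDocument and ToyField are hypothetical names, not Lucene classes; the sketch assumes, as the Javadoc describes, that get returns the first value added under a name and that removeField removes only the first matching field while removeFields removes all of them.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Toy model of a Document as an ordered list of fields; not Lucene's source.
class ToyDocument {

    static final class ToyField {
        final String name;
        final String value;

        ToyField(String name, String value) {
            this.name = name;
            this.value = value;
        }
    }

    private final List<ToyField> fields = new ArrayList<>();

    void add(String name, String value) {
        fields.add(new ToyField(name, value));
    }

    // Returns the value of the first field with this name, or null if none exists.
    String get(String name) {
        for (ToyField f : fields) {
            if (f.name.equals(name)) {
                return f.value;
            }
        }
        return null;
    }

    // Returns every field with this name, in insertion order.
    List<ToyField> getFields(String name) {
        List<ToyField> result = new ArrayList<>();
        for (ToyField f : fields) {
            if (f.name.equals(name)) {
                result.add(f);
            }
        }
        return result;
    }

    // Removes only the first field with this name.
    void removeField(String name) {
        Iterator<ToyField> it = fields.iterator();
        while (it.hasNext()) {
            if (it.next().name.equals(name)) {
                it.remove();
                return;
            }
        }
    }

    // Removes all fields with this name.
    void removeFields(String name) {
        fields.removeIf(f -> f.name.equals(name));
    }

    public static void main(String[] args) {
        ToyDocument doc = new ToyDocument();
        doc.add("author", "a");
        doc.add("author", "b");
        doc.removeField("author");
        System.out.println(doc.get("author")); // prints "b"
    }
}
```

The point of the sketch is that a name is not a unique key: the same field name can appear several times in one document, which is why both single and plural removal methods exist.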


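The two main constructors of Field are shown below; they help in understanding the Field attributes (read the source file yourself for the rest). 
Java code 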
/** 
 * Create a field by specifying its name, value and how it will 
 * be saved in the index. 
 *  
 * @param name The name of the field 
 * @param internName Whether to .intern() name or not 
 * @param value The string to process 
 * @param store Whether <code>value</code> should be stored in the index 
 * @param index Whether the field should be indexed, and if so, if it should 
 *  be tokenized before indexing  
 * @param termVector Whether term vector should be stored 
 * @throws NullPointerException if name or value is <code>null</code> 
 * @throws IllegalArgumentException in any of the following situations: 
 * <ul>  
 *  <li>the field is neither stored nor indexed</li>  
 *  <li>the field is not indexed but termVector is <code>TermVector.YES</code></li> 
 * </ul>  
 */   
public Field(String name, boolean internName, String value, Store store, Index index, TermVector termVector) {  
  if (name == null)  
    throw new NullPointerException("name cannot be null");  
  if (value == null)  
    throw new NullPointerException("value cannot be null");  
  if (name.length() == 0 && value.length() == 0)  
    throw new IllegalArgumentException("name and value cannot both be empty");  
  if (index == Index.NO && store == Store.NO)  
    throw new IllegalArgumentException("it doesn't make sense to have a field that "  
       + "is neither indexed nor stored");  
  if (index == Index.NO && termVector != TermVector.NO)  
    throw new IllegalArgumentException("cannot store term vector information "  
       + "for a field that is not indexed");  
          
  if (internName) // field names are optionally interned  
    name = StringHelper.intern(name);  
    
  this.name = name;   
    
  this.fieldsData = value;  
  
  this.isStored = store.isStored();  
   
  this.isIndexed = index.isIndexed();  
  this.isTokenized = index.isAnalyzed();  
  this.omitNorms = index.omitNorms();  
  if (index == Index.NO) {  
    this.omitTermFreqAndPositions = false;  
  }      
  
  this.isBinary = false;  
  
  setStoreTermVector(termVector);  
}  


Java code 
/** 
  * Create a tokenized and indexed field that is not stored, optionally with  
  * storing term vectors.  The Reader is read only when the Document is added to the index, 
  * i.e. you may not close the Reader until {@link IndexWriter#addDocument(Document)} 
  * has been called. 
  *  
  * @param name The name of the field 
  * @param reader The reader with the content 
  * @param termVector Whether term vector should be stored 
  * @throws NullPointerException if name or reader is <code>null</code> 
  */   
 public Field(String name, Reader reader, TermVector termVector) {  
   if (name == null)  
     throw new NullPointerException("name cannot be null");  
   if (reader == null)  
     throw new NullPointerException("reader cannot be null");  
     
   this.name = StringHelper.intern(name);        // field names are interned  
   this.fieldsData = reader;  
     
   this.isStored = false;  
   this.isIndexed = true;  
   this.isTokenized = true;  
   this.isBinary = false;  
     
   setStoreTermVector(termVector);  
 }  



The other constructors simply delegate to these two main constructors, for example these commonly used ones: 
Java code 
public Field(String name, String value, Store store, Index index) {  
  this(name, value, store, index, TermVector.NO);  
}  

Java code 
public Field(String name, Reader reader) {  
  this(name, reader, TermVector.NO);  
}  

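Reading the three static enums Store, Index and TermVector inside Field in the source makes it much clearer how each Field attribute gets set (in earlier versions these were three static inner classes of constants). 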
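
As a rough illustration of how enum constants can carry the boolean flags that the constructor reads, here is a small JDK-only sketch. FieldOptionsSketch and its describe method are hypothetical, and only three Index constants are modelled; the real enums have more constants and more flags, so treat this as a simplified picture rather than the Lucene source.

```java
// Toy model only: the names mirror Field.Store / Field.Index, but this is not Lucene's source.
class FieldOptionsSketch {

    enum Store {
        YES(true), NO(false);

        private final boolean stored;

        Store(boolean stored) {
            this.stored = stored;
        }

        boolean isStored() {
            return stored;
        }
    }

    enum Index {
        NO(false, false),
        ANALYZED(true, true),
        NOT_ANALYZED(true, false);

        private final boolean indexed;
        private final boolean analyzed;

        Index(boolean indexed, boolean analyzed) {
            this.indexed = indexed;
            this.analyzed = analyzed;
        }

        boolean isIndexed() {
            return indexed;
        }

        boolean isAnalyzed() {
            return analyzed;
        }
    }

    // Applies the "neither indexed nor stored" check, then lists the resulting flags.
    static String describe(Store store, Index index) {
        if (index == Index.NO && store == Store.NO) {
            throw new IllegalArgumentException("field is neither indexed nor stored");
        }
        StringBuilder sb = new StringBuilder();
        if (store.isStored()) {
            sb.append("stored");
        }
        if (index.isIndexed()) {
            if (sb.length() > 0) {
                sb.append(',');
            }
            sb.append("indexed");
        }
        if (index.isAnalyzed()) {
            sb.append(",tokenized"); // analyzed implies indexed, so sb is never empty here
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(describe(Store.YES, Index.ANALYZED)); // prints "stored,indexed,tokenized"
    }
}
```

The output format deliberately resembles the flags that Document.toString() prints in the IndexReader example later on (stored,indexed,tokenized).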

2. IndexWriter 
See also an earlier blog post of mine: http://hanyuanbo.iteye.com/blog/812135 
The following is taken from the first three paragraphs of the IndexWriter Javadoc: 
Quote
An IndexWriter creates and maintains an index. 

The create argument to the constructor determines whether a new index is created, or whether an existing index is opened. Note that you can open an index with create=true even while readers are using the index. The old readers will continue to search the "point in time" snapshot they had opened, and won't see the newly created index until they re-open. There are also constructors with no create argument which will create a new index if there is not already an index at the provided path and otherwise open the existing index. 

In either case, documents are added with addDocument and removed with deleteDocuments(Term) or deleteDocuments(Query). A document can be updated with updateDocument (which just deletes and then adds the entire document). When finished adding, deleting and updating documents, close should be called. 

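(One point made here: when you do not say whether to create or append, the policy is to create the index if it does not exist and to open the existing index otherwise.) 
Quote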

Expert: IndexWriter allows an optional IndexDeletionPolicy implementation to be specified. 

Expert: IndexWriter allows you to separately change the MergePolicy and the MergeScheduler. 


Of the five constructors below, three are marked Expert; for normal use the other two are enough. 
IndexWriter(Directory d, Analyzer a, boolean create, IndexDeletionPolicy deletionPolicy, IndexWriter.MaxFieldLength mfl)	          Expert: constructs an IndexWriter with a custom IndexDeletionPolicy, for the index in d.
IndexWriter(Directory d, Analyzer a, IndexDeletionPolicy deletionPolicy, IndexWriter.MaxFieldLength mfl)	          Expert: constructs an IndexWriter with a custom IndexDeletionPolicy, for the index in d, first creating it if it does not already exist.
IndexWriter(Directory d, Analyzer a, IndexDeletionPolicy deletionPolicy, IndexWriter.MaxFieldLength mfl, IndexCommit commit)	          Expert: constructs an IndexWriter on specific commit point, with a custom IndexDeletionPolicy, for the index in d.
IndexWriter(Directory d, Analyzer a, IndexWriter.MaxFieldLength mfl)	          Constructs an IndexWriter for the index in d, first creating it if it does not already exist.
IndexWriter(Directory d, Analyzer a, boolean create, IndexWriter.MaxFieldLength mfl)	          Constructs an IndexWriter for the index in d.


In the source code, all of them actually call a private init method. 
Java code 
private void init(Directory d, Analyzer a, final boolean create,    
                    IndexDeletionPolicy deletionPolicy, int maxFieldLength,  
                    IndexingChain indexingChain, IndexCommit commit)  
    throws CorruptIndexException, LockObtainFailedException, IOException {  
        ... // earlier versions called a private constructor here instead  
}  


In IndexWriter, the methods used to add documents to the index are: 
void	addDocument(Document doc)	          Adds a document to this index.
void	addDocument(Document doc, Analyzer analyzer)	          Adds a document to this index, using the provided analyzer instead of the value of getAnalyzer().


3. IndexReader 

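IndexReader reads an existing index and can also modify it, for example by deleting documents. Instances are obtained through the static open methods rather than through a constructor: 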
static IndexReader	open(Directory directory)	          Returns a IndexReader reading the index in the given Directory, with readOnly=true.
static IndexReader	open(Directory directory, boolean readOnly)	          Returns an IndexReader reading the index in the given Directory.
static IndexReader	open(Directory directory, IndexDeletionPolicy deletionPolicy, boolean readOnly)	          Expert: returns an IndexReader reading the index in the given Directory, with a custom IndexDeletionPolicy.
static IndexReader	open(Directory directory, IndexDeletionPolicy deletionPolicy, boolean readOnly, int termInfosIndexDivisor)	          Expert: returns an IndexReader reading the index in the given Directory, with a custom IndexDeletionPolicy.
static IndexReader	open(IndexCommit commit, boolean readOnly)	          Expert: returns an IndexReader reading the index in the given IndexCommit.
static IndexReader	open(IndexCommit commit, IndexDeletionPolicy deletionPolicy, boolean readOnly)	          Expert: returns an IndexReader reading the index in the given Directory, using a specific commit and with a     custom IndexDeletionPolicy.
static IndexReader	open(IndexCommit commit, IndexDeletionPolicy deletionPolicy, boolean readOnly, int  termInfosIndexDivisor)	          Expert: returns an IndexReader reading the index in the given Directory, using a specific commit and with a  custom IndexDeletionPolicy.



Some of its methods involve the Term class, whose constructors are very simple: 

Term(String fld)	          Constructs a Term with the given field and empty text.
Term(String fld, String txt)	          Constructs a Term with the given field and text.



The commonly used and easily understood methods of IndexReader are: 


Document	document(int n)	          Returns the stored fields of the nth Document in this index.
abstract  int	numDocs()	          Returns the number of documents in this index.
abstract  TermDocs	termDocs()	          Returns an unpositioned TermDocs enumerator.
TermDocs	termDocs(Term term)	          Returns an enumeration of all the documents which contain term.
abstract  TermPositions	termPositions()	          Returns an unpositioned TermPositions enumerator.
TermPositions	termPositions(Term term)	          Returns an enumeration of all the documents which contain term.
abstract  TermEnum	terms()	          Returns an enumeration of all the terms in the index.
abstract  TermEnum	terms(Term t)	          Returns an enumeration of all terms starting at a given term.
void	deleteDocument(int docNum)	          Deletes the document numbered docNum.
int	deleteDocuments(Term term)	          Deletes all documents that have a given term indexed.



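The following code shows how to use an IndexReader to access the Terms in an index and to delete documents (when deleting, be sure to close the reader afterwards, because closing is what saves the deletions to disk). 

Java code 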
package com.eric.lucene;  
  
import java.io.File;  
import java.io.IOException;  
  
import org.apache.lucene.analysis.standard.StandardAnalyzer;  
import org.apache.lucene.document.Document;  
import org.apache.lucene.document.Field;  
import org.apache.lucene.index.CorruptIndexException;  
import org.apache.lucene.index.IndexReader;  
import org.apache.lucene.index.IndexWriter;  
import org.apache.lucene.index.Term;  
import org.apache.lucene.index.TermDocs;  
import org.apache.lucene.index.TermPositions;  
import org.apache.lucene.store.FSDirectory;  
import org.apache.lucene.store.LockObtainFailedException;  
import org.apache.lucene.util.Version;  
  
public class IndexReaderTest {  
    private File path ;  
      
      
    public IndexReaderTest(String path) {  
        this.path = new File(path);  
    }  
  
    public void createIndex(){  
        try {  
            IndexWriter writer = new IndexWriter(FSDirectory.open(this.path),new StandardAnalyzer(  
                    Version.LUCENE_30), IndexWriter.MaxFieldLength.LIMITED);  
            Document doc1 = new Document();  
            Document doc2 = new Document();  
            Document doc3 = new Document();  
            doc1.add(new Field("bookname", "thinking in java -- java 4", Field.Store.YES, Field.Index.ANALYZED));  
            doc2.add(new Field("bookname", "java core 2", Field.Store.YES, Field.Index.ANALYZED));  
            doc3.add(new Field("bookname", "thinking in c++", Field.Store.YES, Field.Index.ANALYZED));  
            writer.addDocument(doc1);  
            writer.addDocument(doc2);  
            writer.addDocument(doc3);  
            writer.close();  
        } catch (CorruptIndexException e) {  
            e.printStackTrace();  
        } catch (LockObtainFailedException e) {  
            e.printStackTrace();  
        } catch (IOException e) {  
            e.printStackTrace();  
        }  
    }  
      
    public void test1(){  
        try {  
            IndexReader reader = IndexReader.open(FSDirectory.open(this.path));  
            System.out.println("version:\t" + reader.getVersion());  
            int num = reader.numDocs();  
            for(int i=0;i<num;i++){  
                Document doc = reader.document(i);  
                System.out.println(doc);  
            }  
              
            Term term = new Term("bookname","java");  
            TermDocs docs = reader.termDocs(term);  
            while(docs.next()){  
                System.out.print("doc num:\t" + docs.doc() + "\t\t");  
                System.out.println("frequency:\t" + docs.freq());  
            }  
              
            reader.close();  
              
        } catch (CorruptIndexException e) {  
            e.printStackTrace();  
        } catch (IOException e) {  
            e.printStackTrace();  
        }  
    }  
//  version:    1289906350314  
//  Document<stored,indexed,tokenized<bookname:thinking in java -- java 4>>  
//  Document<stored,indexed,tokenized<bookname:java core 2>>  
//  Document<stored,indexed,tokenized<bookname:thinking in c++>>  
//  doc num:    0       frequency:  2  
//  doc num:    1       frequency:  1  
      
    public void test2(){  
        try {  
            IndexReader reader = IndexReader.open(FSDirectory.open(this.path));  
            System.out.println("version:\t" + reader.getVersion());  
              
            Term term = new Term("bookname","java");  
            TermPositions pos = reader.termPositions(term);  
            while(pos.next()){  
                System.out.print("frequency: " + pos.freq() + "\t");  
                for(int i=0;i<pos.freq();i++){  
                    System.out.print("pos: " + pos.nextPosition() + "\t");  
                }  
                System.out.println();  
            }  
            reader.close();  
              
        } catch (CorruptIndexException e) {  
            e.printStackTrace();  
        } catch (IOException e) {  
            e.printStackTrace();  
        }  
    }  
//  version:    1289906350314  
//  frequency: 2    pos: 2  pos: 3    
//  frequency: 1    pos: 0  
//  createIndex() was not called again for this second run, so the version number is unchanged  
      
    public void delete1(){  
        try {  
            IndexReader reader = IndexReader.open(FSDirectory.open(this.path), false); // readOnly must be false in order to delete  
            System.out.println("version:\t" + reader.getVersion());  
            System.out.println("num:\t" + reader.numDocs());  
            reader.deleteDocument(2); // deletes the c++ Document (document number 2)  
            reader.close();  
              
              
            reader = IndexReader.open(FSDirectory.open(this.path), false);  
            System.out.println("version:\t" + reader.getVersion());  
            System.out.println("num:\t" + reader.numDocs());  
            reader.close();  
              
        } catch (CorruptIndexException e) {  
            e.printStackTrace();  
        } catch (IOException e) {  
            e.printStackTrace();  
        }  
    }  
//  version:    1289906350314  
//  num:    3  
//  version:    1289906350315  
//  num:    2  
  
    public void delete2(){  
        try {  
            IndexReader reader = IndexReader.open(FSDirectory.open(this.path), false); // readOnly must be false in order to delete  
            System.out.println("version:\t" + reader.getVersion());  
            System.out.println("num:\t" + reader.numDocs());  
            Term term = new Term("bookname","java");  
            reader.deleteDocuments(term); // deletes every Document whose bookname contains the term "java"  
            reader.close();  
              
              
            reader = IndexReader.open(FSDirectory.open(this.path), false);  
            System.out.println("version:\t" + reader.getVersion());  
            System.out.println("num:\t" + reader.numDocs());  
            reader.close();  
              
        } catch (CorruptIndexException e) {  
            e.printStackTrace();  
        } catch (IOException e) {  
            e.printStackTrace();  
        }  
    }  
//  version:    1289906350315  
//  num:    2  
//  version:    1289906350316  
//  num:    0  
  
      
    public static void main(String[] args) {  
        String path = "E:\\indexReaderTest";  
        IndexReaderTest test = new IndexReaderTest(path);  
//      test.createIndex();  
//      test.test1();  
//      test.test2();  
//      test.delete1();  
        test.delete2();  
    }  
}  

Notes: 
First run 
Java code 
String path = "E:\\indexReaderTest";  
IndexReaderTest test = new IndexReaderTest(path);  
test.createIndex();  
test.test1();  

Then run: 
Java code 
String path = "E:\\indexReaderTest";  
IndexReaderTest test = new IndexReaderTest(path);  
test.test2();  

Then run: 
Java code 
String path = "E:\\indexReaderTest";  
IndexReaderTest test = new IndexReaderTest(path);  
test.delete1();  

Then run: 
Java code 
String path = "E:\\indexReaderTest";  
IndexReaderTest test = new IndexReaderTest(path);  
test.delete2();  
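
One detail worth spelling out from the delete runs: deleting a document marks it as deleted, and numDocs() no longer counts it, but the remaining documents keep their numbers. A loop from 0 to numDocs() can therefore miss live documents or hit deleted ones once deletions exist; IndexReader also offers maxDoc() and isDeleted(int) for this. The JDK-only sketch below models that bookkeeping. ToyReader is a hypothetical class, and matching a term by whitespace-separated words is a simplification of what an analyzer does.

```java
import java.util.Arrays;
import java.util.BitSet;

// Toy model of deletion bookkeeping; not Lucene's IndexReader.
class ToyReader {
    private final String[] docs;
    private final BitSet deleted = new BitSet();

    ToyReader(String... docs) {
        this.docs = docs.clone();
    }

    // One greater than the largest document number, deleted or not.
    int maxDoc() {
        return docs.length;
    }

    // Number of live (not deleted) documents.
    int numDocs() {
        return docs.length - deleted.cardinality();
    }

    boolean isDeleted(int n) {
        return deleted.get(n);
    }

    // Marks one document as deleted; the other documents keep their numbers.
    void deleteDocument(int n) {
        if (n < 0 || n >= docs.length) {
            throw new IndexOutOfBoundsException("no such document: " + n);
        }
        deleted.set(n);
    }

    // Marks every live document containing the word as deleted and returns how many were hit.
    int deleteDocuments(String word) {
        int count = 0;
        for (int i = 0; i < docs.length; i++) {
            if (!deleted.get(i) && Arrays.asList(docs[i].split("\\s+")).contains(word)) {
                deleted.set(i);
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        ToyReader reader = new ToyReader(
                "thinking in java -- java 4", "java core 2", "thinking in c++");
        reader.deleteDocument(2);
        // Iterate up to maxDoc() and skip deleted slots, rather than up to numDocs().
        for (int i = 0; i < reader.maxDoc(); i++) {
            if (!reader.isDeleted(i)) {
                System.out.println(i + ": " + reader.docs[i]);
            }
        }
    }
}
```

With the same three book names as the example above, deleting document 2 and then the term "java" takes the live count from 3 to 2 to 0, matching the num values printed by delete1 and delete2.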


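4. Inverted index implementation in Lucene 
The following blog post gives a simple explanation of how an inverted index works. 
http://jackyrong.iteye.com/blog/238940 
Reading the source code, you will find the static constant static final IndexingChain DefaultIndexingChain, which IndexWriter uses by default (as far as I can tell, in the 3.0.2 source it is declared in DocumentsWriter), shown below: 
Java code 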
static final IndexingChain DefaultIndexingChain = new IndexingChain() {  
  
  @Override  
  DocConsumer getChain(DocumentsWriter documentsWriter) {  
    /* 
    This is the current indexing chain: 
 
    DocConsumer / DocConsumerPerThread 
      --> code: DocFieldProcessor / DocFieldProcessorPerThread 
        --> DocFieldConsumer / DocFieldConsumerPerThread / DocFieldConsumerPerField 
          --> code: DocFieldConsumers / DocFieldConsumersPerThread / DocFieldConsumersPerField 
            --> code: DocInverter / DocInverterPerThread / DocInverterPerField 
              --> InvertedDocConsumer / InvertedDocConsumerPerThread / InvertedDocConsumerPerField 
                --> code: TermsHash / TermsHashPerThread / TermsHashPerField 
                  --> TermsHashConsumer / TermsHashConsumerPerThread / TermsHashConsumerPerField 
                    --> code: FreqProxTermsWriter / FreqProxTermsWriterPerThread / FreqProxTermsWriterPerField 
                    --> code: TermVectorsTermsWriter / TermVectorsTermsWriterPerThread / TermVectorsTermsWriterPerField 
              --> InvertedDocEndConsumer / InvertedDocConsumerPerThread / InvertedDocConsumerPerField 
                --> code: NormsWriter / NormsWriterPerThread / NormsWriterPerField 
            --> code: StoredFieldsWriter / StoredFieldsWriterPerThread / StoredFieldsWriterPerField 
  */  
  
  // Build up indexing chain:  
  
    final TermsHashConsumer termVectorsWriter = new TermVectorsTermsWriter(documentsWriter);  
    final TermsHashConsumer freqProxWriter = new FreqProxTermsWriter();  
  
    final InvertedDocConsumer  termsHash = new TermsHash(documentsWriter, true, freqProxWriter,  
                                                         new TermsHash(documentsWriter, false, termVectorsWriter, null));  
    final NormsWriter normsWriter = new NormsWriter();  
    final DocInverter docInverter = new DocInverter(termsHash, normsWriter);  
    return new DocFieldProcessor(documentsWriter, docInverter);  
  }  
};  

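The comment here lays out clearly how the whole processing chain runs. The API documentation has no description of these inverter classes, so you have to read them in the source files. 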
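
For readers who want the idea without the full chain, here is a toy inverted index using only the JDK. It is not how Lucene stores postings; ToyInvertedIndex is a hypothetical class, and the tokenizer (lowercase, split on anything that is not a letter or digit, no stop-word removal) is an assumption chosen to keep the sketch short. It records, per term and per document, the positions at which the term occurs, which is the information that termDocs and termPositions expose.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Toy inverted index: term -> (document number -> positions). Not Lucene's data structure.
class ToyInvertedIndex {
    private final Map<String, TreeMap<Integer, List<Integer>>> postings = new TreeMap<>();
    private int nextDocId = 0;

    // Tokenizes the text, records each token's position, and returns the new document number.
    int addDocument(String text) {
        int docId = nextDocId++;
        int position = 0;
        for (String token : text.toLowerCase().split("[^a-z0-9]+")) {
            if (token.isEmpty()) {
                continue; // produced when the text starts with a separator
            }
            postings.computeIfAbsent(token, t -> new TreeMap<>())
                    .computeIfAbsent(docId, d -> new ArrayList<>())
                    .add(position++);
        }
        return docId;
    }

    // Document number -> positions for one term; empty if the term is unknown.
    Map<Integer, List<Integer>> termPositions(String term) {
        TreeMap<Integer, List<Integer>> byDoc = postings.get(term);
        if (byDoc == null) {
            return Collections.emptyMap();
        }
        return Collections.unmodifiableMap(byDoc);
    }

    // How often the term occurs in the given document.
    int freq(String term, int docId) {
        List<Integer> positions = termPositions(term).get(docId);
        return positions == null ? 0 : positions.size();
    }

    public static void main(String[] args) {
        ToyInvertedIndex index = new ToyInvertedIndex();
        index.addDocument("thinking in java -- java 4");
        index.addDocument("java core 2");
        index.addDocument("thinking in c++");
        System.out.println(index.termPositions("java")); // prints {0=[2, 3], 1=[0]}
    }
}
```

Fed the three book names from the IndexReader example, the toy index gives "java" a frequency of 2 in document 0 at positions 2 and 3, and a frequency of 1 in document 1 at position 0, the same numbers that test1 and test2 printed.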

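5. IndexSearcher 
The interfaces that Searcher implements and its class hierarchy are as follows (excerpted from the API documentation; for simple usage see my earlier blog post http://hanyuanbo.iteye.com/blog/812135) 
Quote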
org.apache.lucene.search 
Class Searcher 
java.lang.Object 
        org.apache.lucene.search.Searcher 
All Implemented Interfaces: 
        Closeable, Searchable 
Direct Known Subclasses: 
        IndexSearcher, MultiSearcher


The search function has many overloaded versions; the following is excerpted from the API documentation. 
void	search(Query query, Collector results)	          Lower-level search API.
void	search(Query query, Filter filter, Collector results)	          Lower-level search API.
TopDocs	search(Query query, Filter filter, int n)	          Finds the top n hits for query, applying filter if non-null.
TopFieldDocs	search(Query query, Filter filter, int n, Sort sort)	          Search implementation with arbitrary sorting.
TopDocs	search(Query query, int n)	          Finds the top n hits for query.
abstract  void	search(Weight weight, Filter filter, Collector results)	          Lower-level search API.
abstract  TopDocs	search(Weight weight, Filter filter, int n)	          Expert: Low-level search implementation.
abstract  TopFieldDocs	search(Weight weight, Filter filter, int n, Sort sort)	          Expert: Low-level search implementation with arbitrary sorting.

There is also a very useful function (abstract in Searcher, implemented in the subclasses): 
abstract  Document	doc(int i)	          Returns the stored fields of document i.


In the source of the abstract class Searcher, the overloaded search functions look like this: 
Java code 
/** Search implementation with arbitrary sorting.  Finds 
   * the top <code>n</code> hits for <code>query</code>, applying 
   * <code>filter</code> if non-null, and sorting the hits by the criteria in 
   * <code>sort</code>. 
   *  
   * <p>NOTE: this does not compute scores by default; use 
   * {@link IndexSearcher#setDefaultFieldSortScoring} to 
   * enable scoring. 
   * 
   * @throws BooleanQuery.TooManyClauses 
   */  
  public TopFieldDocs search(Query query, Filter filter, int n,  
                             Sort sort) throws IOException {  
    return search(createWeight(query), filter, n, sort);  
  }  
  
  /** Lower-level search API. 
  * 
  * <p>{@link Collector#collect(int)} is called for every matching document. 
  * 
  * <p>Applications should only use this if they need <i>all</i> of the 
  * matching documents.  The high-level search API ({@link 
  * Searcher#search(Query, int)}) is usually more efficient, as it skips 
  * non-high-scoring hits. 
  * <p>Note: The <code>score</code> passed to this method is a raw score. 
  * In other words, the score will not necessarily be a float whose value is 
  * between 0 and 1. 
  * @throws BooleanQuery.TooManyClauses 
  */  
 public void search(Query query, Collector results)  
   throws IOException {  
   search(createWeight(query), null, results);  
 }  
  
  /** Lower-level search API. 
   * 
   * <p>{@link Collector#collect(int)} is called for every matching 
   * document. 
   * <br>Collector-based access to remote indexes is discouraged. 
   * 
   * <p>Applications should only use this if they need <i>all</i> of the 
   * matching documents.  The high-level search API ({@link 
   * Searcher#search(Query, Filter, int)}) is usually more efficient, as it skips 
   * non-high-scoring hits. 
   * 
   * @param query to match documents 
   * @param filter if non-null, used to permit documents to be collected. 
   * @param results to receive hits 
   * @throws BooleanQuery.TooManyClauses 
   */  
  public void search(Query query, Filter filter, Collector results)  
  throws IOException {  
    search(createWeight(query), filter, results);  
  }  
  
  /** Finds the top <code>n</code> 
   * hits for <code>query</code>, applying <code>filter</code> if non-null. 
   * 
   * @throws BooleanQuery.TooManyClauses 
   */  
  public TopDocs search(Query query, Filter filter, int n)  
    throws IOException {  
    return search(createWeight(query), filter, n);  
  }  
  
  /** Finds the top <code>n</code> 
   * hits for <code>query</code>. 
   * 
   * @throws BooleanQuery.TooManyClauses 
   */  
  public TopDocs search(Query query, int n)  
    throws IOException {  
    return search(query, null, n);  
  }  
  ...  
  abstract public void search(Weight weight, Filter filter, Collector results) throws IOException;  

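The actual search work is not implemented in the Searcher class but left to the subclasses, and every call ultimately ends up in the 
Java code 
search(Weight weight, Filter filter, Collector results)  

version. The search functions that take a Query parameter all implicitly call the createWeight(query) method. 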

In the IndexSearcher class there are two main search functions (the other overloads all call one of these two): 
Java code 
  @Override  
  public void search(Weight weight, Filter filter, Collector collector)  
      throws IOException {  
      
    if (filter == null) {  
      for (int i = 0; i < subReaders.length; i++) { // search each subreader  
        collector.setNextReader(subReaders[i], docStarts[i]);  
        Scorer scorer = weight.scorer(subReaders[i], !collector.acceptsDocsOutOfOrder(), true);  
        if (scorer != null) {  
          scorer.score(collector);  
        }  
      }  
    } else {  
      for (int i = 0; i < subReaders.length; i++) { // search each subreader  
        collector.setNextReader(subReaders[i], docStarts[i]);  
        searchWithFilter(subReaders[i], weight, filter, collector);  
      }  
    }  
  }  
  
  ...  
  
private void searchWithFilter(IndexReader reader, Weight weight,  
      final Filter filter, final Collector collector) throws IOException {  
  ...  
}  

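As you can see, the main difference between them is whether a Filter is used in the search. The search functions that return a value also call one of the two functions above and simply return the collected results at the end, for example in the sorted variant: 
Java code 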
return (TopFieldDocs) collector.topDocs();  
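
The loop over subReaders and docStarts in the search method above can be pictured with a small JDK-only sketch. Everything in it is hypothetical: ToySegmentSearch models each sub-reader as an array of strings, matches by substring, and applies the filter to global document numbers, whereas the real code hands each sub-reader to the Weight and Filter. What it does show is how a per-segment starting offset turns local document numbers into index-wide ones, and that a null filter means no filtering.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntPredicate;

// Toy model of searching several segments with document-number offsets; not Lucene's code.
class ToySegmentSearch {

    // Returns the global document numbers of all documents that contain the word
    // and are permitted by the filter (a null filter permits everything).
    static List<Integer> search(String[][] segments, String word, IntPredicate filter) {
        // docStarts[i] is the global number of the first document in segment i.
        int[] docStarts = new int[segments.length];
        int maxDoc = 0;
        for (int i = 0; i < segments.length; i++) {
            docStarts[i] = maxDoc;
            maxDoc += segments[i].length;
        }

        List<Integer> hits = new ArrayList<>();
        for (int i = 0; i < segments.length; i++) { // search each segment in turn
            for (int local = 0; local < segments[i].length; local++) {
                int global = docStarts[i] + local;
                boolean matches = segments[i][local].contains(word);
                boolean permitted = filter == null || filter.test(global);
                if (matches && permitted) {
                    hits.add(global);
                }
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        String[][] segments = {
            {"thinking in java", "java core"},
            {"thinking in c++", "effective java"}
        };
        System.out.println(search(segments, "java", null));         // prints [0, 1, 3]
        System.out.println(search(segments, "java", id -> id >= 1)); // prints [1, 3]
    }
}
```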

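For simple use, calling the functions declared in the abstract class Searcher (the parent class) is enough. 

Several other classes are used alongside it to assist the search: 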
QueryParser
Query
TopScoreDocCollector
TopDocs
ScoreDoc
Document


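Pay particular attention to the TopScoreDocCollector class, which holds the search results. Its inheritance hierarchy is as follows (excerpted from the API documentation): 
Quote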

org.apache.lucene.search 
    Class TopScoreDocCollector 
java.lang.Object 
  org.apache.lucene.search.Collector 
      org.apache.lucene.search.TopDocsCollector<ScoreDoc> 
          org.apache.lucene.search.TopScoreDocCollector 

Its commonly used functions include (excerpted from the API documentation): 
int	getTotalHits()	          The total number of documents that matched this query.
TopDocs	topDocs()	          Returns the top docs that were collected by this collector.
TopDocs	topDocs(int start)	          Returns the documents in the range [start ..
TopDocs	topDocs(int start, int howMany)	          Returns the documents in the range [start ..

TopDocs, the return type of topDocs(), has the following two fields: 
ScoreDoc[]	scoreDocs	          The top hits for the query.
int	totalHits	          The total number of hits for the query.

The ScoreDoc class in turn has two fields: 
int	doc	          Expert: A hit document's number.
float	score	          Expert: The score of this document for the query.

This gives you doc (the document number) and score (the score). 
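
To show the idea behind collecting only the top n hits while still counting every match, here is a JDK-only sketch built on a bounded min-heap. ToyTopCollector and Hit are hypothetical names and this is not the TopScoreDocCollector implementation; the tie-breaking rule (on equal scores the lower document number ranks higher) is an assumption of the sketch.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Toy top-n collector: counts every hit but retains only the n best. Not Lucene's code.
class ToyTopCollector {

    static final class Hit {
        final int doc;
        final float score;

        Hit(int doc, float score) {
            this.doc = doc;
            this.score = score;
        }
    }

    // Orders the weakest hit first: lower score, then (on ties) higher document number.
    private static final Comparator<Hit> WEAKEST_FIRST = (a, b) -> {
        int byScore = Float.compare(a.score, b.score);
        return byScore != 0 ? byScore : Integer.compare(b.doc, a.doc);
    };

    private final int n;
    private int totalHits = 0;
    private final PriorityQueue<Hit> queue = new PriorityQueue<>(WEAKEST_FIRST);

    ToyTopCollector(int n) {
        if (n <= 0) {
            throw new IllegalArgumentException("n must be positive");
        }
        this.n = n;
    }

    // Called once per matching document.
    void collect(int doc, float score) {
        totalHits++;
        queue.add(new Hit(doc, score));
        if (queue.size() > n) {
            queue.poll(); // evict the weakest retained hit
        }
    }

    int getTotalHits() {
        return totalHits;
    }

    // Returns the retained hits, best first.
    List<Hit> topDocs() {
        List<Hit> result = new ArrayList<>(queue);
        result.sort(WEAKEST_FIRST.reversed());
        return result;
    }

    public static void main(String[] args) {
        ToyTopCollector collector = new ToyTopCollector(2);
        collector.collect(0, 0.5f);
        collector.collect(1, 0.9f);
        collector.collect(2, 0.5f);
        for (Hit hit : collector.topDocs()) {
            System.out.println(hit.doc + "\t" + hit.score);
        }
    }
}
```

The split between getTotalHits() and topDocs() mirrors the totalHits and scoreDocs fields described above: the first counts everything that matched, the second holds only the best few.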

6. Analyzer 
7. Sort and Filter 
8. Ranking algorithm in Lucene and improvements
Posted: 2010-11-16 22:26