
Lucene 3.0.2 Code Analysis

1 Document 和 Field
2 IndexWriter
3 IndexReader
4 The inverted index implementation in Lucene
5 IndexSearcher
6 Analyzer
7 Directory
8 Query, Sort, and Filter
9 Ranking algorithms in Lucene and their improvement

1. Document 和 Field

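Document and Field are indispensable when building an index. They can be understood as the record and the column of a traditional relational database: just as a record has many columns, a Document can hold many Fields, which makes it possible to serve many different kinds of queries. A Field might hold, for example, the file content, the file name, the creation time, or the modification time. Each Field carries these attributes: whether it is stored (this.isStored = store.isStored()), whether it is indexed (this.isIndexed = index.isIndexed()), and whether it is tokenized (this.isTokenized = index.isAnalyzed()). Choose them according to need; for example, document content often does not need to be stored but does need to be indexed. The underlying source code also imposes some restrictions: for instance, a Field cannot be both unindexed and unstored.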
 void	add(Fieldable field) 
          Adds a field to a document.
 String	get(String name) 
          Returns the string value of the field with the given name if any exist in this document, or null.
 Field	getField(String name) 
          Returns a field with the given name if any exist in this document, or null.
 List<Fieldable>	getFields() 
          Returns a List of all the fields in a document.
 Field[]	getFields(String name) 
          Returns an array of Fields with the given name.
 void	removeField(String name) 
          Removes field with the specified name from the document.
 void	removeFields(String name) 
          Removes all fields with the given name from the document.
 String	toString() 
          Prints the fields of a document for human consumption.


  /**
   * Create a field by specifying its name, value and how it will
   * be saved in the index.
   * @param name The name of the field
   * @param internName Whether to .intern() name or not
   * @param value The string to process
   * @param store Whether <code>value</code> should be stored in the index
   * @param index Whether the field should be indexed, and if so, if it should
   *  be tokenized before indexing 
   * @param termVector Whether term vector should be stored
   * @throws NullPointerException if name or value is <code>null</code>
   * @throws IllegalArgumentException in any of the following situations:
   * <ul> 
   *  <li>the field is neither stored nor indexed</li> 
   *  <li>the field is not indexed but termVector is <code>TermVector.YES</code></li>
   * </ul> 
   */
  public Field(String name, boolean internName, String value, Store store, Index index, TermVector termVector) {
    if (name == null)
      throw new NullPointerException("name cannot be null");
    if (value == null)
      throw new NullPointerException("value cannot be null");
    if (name.length() == 0 && value.length() == 0)
      throw new IllegalArgumentException("name and value cannot both be empty");
    if (index == Index.NO && store == Store.NO)
      throw new IllegalArgumentException("it doesn't make sense to have a field that "
         + "is neither indexed nor stored");
    if (index == Index.NO && termVector != TermVector.NO)
      throw new IllegalArgumentException("cannot store term vector information "
         + "for a field that is not indexed");
    if (internName) // field names are optionally interned
      name = StringHelper.intern(name);
    this.name = name; 
    this.fieldsData = value;

    this.isStored = store.isStored();
    this.isIndexed = index.isIndexed();
    this.isTokenized = index.isAnalyzed();
    this.omitNorms = index.omitNorms();
    if (index == Index.NO) {
      this.omitTermFreqAndPositions = false;
    }

    this.isBinary = false;

    setStoreTermVector(termVector);
  }


  /**
   * Create a tokenized and indexed field that is not stored, optionally with 
   * storing term vectors.  The Reader is read only when the Document is added to the index,
   * i.e. you may not close the Reader until {@link IndexWriter#addDocument(Document)}
   * has been called.
   * @param name The name of the field
   * @param reader The reader with the content
   * @param termVector Whether term vector should be stored
   * @throws NullPointerException if name or reader is <code>null</code>
   */
  public Field(String name, Reader reader, TermVector termVector) {
    if (name == null)
      throw new NullPointerException("name cannot be null");
    if (reader == null)
      throw new NullPointerException("reader cannot be null");
    this.name = StringHelper.intern(name);        // field names are interned
    this.fieldsData = reader;
    this.isStored = false;
    this.isIndexed = true;
    this.isTokenized = true;
    this.isBinary = false;

    setStoreTermVector(termVector);
  }

  public Field(String name, String value, Store store, Index index) {
    this(name, value, store, index, TermVector.NO);
  }

  public Field(String name, Reader reader) {
    this(name, reader, TermVector.NO);
  }
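
To make the record/column analogy and the "stored or indexed" restriction concrete, here is a minimal stdlib-only sketch. ToyField and ToyDocument are hypothetical names invented for this illustration and are not Lucene classes; they only imitate the validation rule from the constructor above and the lookup behaviour described for Document.get(String).

```java
import java.util.ArrayList;
import java.util.List;

/** Toy stand-in for a field: a name, a value and two flags. Not the Lucene class. */
final class ToyField {
    final String name;
    final String value;
    final boolean stored;
    final boolean indexed;

    ToyField(String name, String value, boolean stored, boolean indexed) {
        if (name == null || value == null)
            throw new NullPointerException("name and value cannot be null");
        // Same rule as the Field constructor: a field must be stored, indexed, or both.
        if (!stored && !indexed)
            throw new IllegalArgumentException("field is neither indexed nor stored");
        this.name = name;
        this.value = value;
        this.stored = stored;
        this.indexed = indexed;
    }
}

/** Toy stand-in for a document: an ordered list of fields; several may share a name. */
final class ToyDocument {
    private final List<ToyField> fields = new ArrayList<ToyField>();

    void add(ToyField field) {
        fields.add(field);
    }

    /** Value of the first field with the given name, or null if there is none. */
    String get(String name) {
        for (ToyField f : fields)
            if (f.name.equals(name)) return f.value;
        return null;
    }

    /** Values of all fields with the given name, in insertion order. */
    List<String> getValues(String name) {
        List<String> result = new ArrayList<String>();
        for (ToyField f : fields)
            if (f.name.equals(name)) result.add(f.value);
        return result;
    }
}
```

Adding two "author" fields to one ToyDocument and calling get("author") returns the first value, while constructing a ToyField with both flags false throws IllegalArgumentException.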


2. IndexWriter
An IndexWriter creates and maintains an index.

The create argument to the constructor determines whether a new index is created, or whether an existing index is opened. Note that you can open an index with create=true even while readers are using the index. The old readers will continue to search the "point in time" snapshot they had opened, and won't see the newly created index until they re-open. There are also constructors with no create argument which will create a new index if there is not already an index at the provided path and otherwise open the existing index.

In either case, documents are added with addDocument and removed with deleteDocuments(Term) or deleteDocuments(Query). A document can be updated with updateDocument (which just deletes and then adds the entire document). When finished adding, deleting and updating documents, close should be called.


Expert: IndexWriter allows an optional IndexDeletionPolicy implementation to be specified.

Expert: IndexWriter allows you to separately change the MergePolicy and the MergeScheduler.

IndexWriter(Directory d, Analyzer a, boolean create, IndexDeletionPolicy deletionPolicy, IndexWriter.MaxFieldLength mfl)           Expert: constructs an IndexWriter with a custom IndexDeletionPolicy, for the index in d.
IndexWriter(Directory d, Analyzer a, IndexDeletionPolicy deletionPolicy, IndexWriter.MaxFieldLength mfl)           Expert: constructs an IndexWriter with a custom IndexDeletionPolicy, for the index in d, first creating it if it does not already exist.
IndexWriter(Directory d, Analyzer a, IndexDeletionPolicy deletionPolicy, IndexWriter.MaxFieldLength mfl, IndexCommit commit)           Expert: constructs an IndexWriter on specific commit point, with a custom IndexDeletionPolicy, for the index in d.
IndexWriter(Directory d, Analyzer a, IndexWriter.MaxFieldLength mfl)           Constructs an IndexWriter for the index in d, first creating it if it does not already exist.
IndexWriter(Directory d, Analyzer a, boolean create, IndexWriter.MaxFieldLength mfl)           Constructs an IndexWriter for the index in d.


private void init(Directory d, Analyzer a, final boolean create,  
                    IndexDeletionPolicy deletionPolicy, int maxFieldLength,
                    IndexingChain indexingChain, IndexCommit commit)
    throws CorruptIndexException, LockObtainFailedException, IOException {

void addDocument(Document doc)           Adds a document to this index.
void addDocument(Document doc, Analyzer analyzer)           Adds a document to this index, using the provided analyzer instead of the value of getAnalyzer().

3. IndexReader

static IndexReader open(Directory directory)           Returns a IndexReader reading the index in the given Directory, with readOnly=true.
static IndexReader open(Directory directory, boolean readOnly)           Returns an IndexReader reading the index in the given Directory.
static IndexReader open(Directory directory, IndexDeletionPolicy deletionPolicy, boolean readOnly)           Expert: returns an IndexReader reading the index in the given Directory, with a custom IndexDeletionPolicy.
static IndexReader open(Directory directory, IndexDeletionPolicy deletionPolicy, boolean readOnly, int termInfosIndexDivisor)           Expert: returns an IndexReader reading the index in the given Directory, with a custom IndexDeletionPolicy.
static IndexReader open(IndexCommit commit, boolean readOnly)           Expert: returns an IndexReader reading the index in the given IndexCommit.
static IndexReader open(IndexCommit commit, IndexDeletionPolicy deletionPolicy, boolean readOnly)           Expert: returns an IndexReader reading the index in the given Directory, using a specific commit and with a     custom IndexDeletionPolicy.
static IndexReader open(IndexCommit commit, IndexDeletionPolicy deletionPolicy, boolean readOnly, int  termInfosIndexDivisor)           Expert: returns an IndexReader reading the index in the given Directory, using a specific commit and with a  custom IndexDeletionPolicy.


Term(String fld)           Constructs a Term with the given field and empty text.
Term(String fld, String txt)           Constructs a Term with the given field and text.


Document document(int n)           Returns the stored fields of the nth Document in this index.
abstract  int numDocs()           Returns the number of documents in this index.
abstract  TermDocs termDocs()           Returns an unpositioned TermDocs enumerator.
TermDocs termDocs(Term term)           Returns an enumeration of all the documents which contain term.
abstract  TermPositions termPositions()           Returns an unpositioned TermPositions enumerator.
TermPositions termPositions(Term term)           Returns an enumeration of all the documents which contain term.
abstract  TermEnum terms()           Returns an enumeration of all the terms in the index.
abstract  TermEnum terms(Term t)           Returns an enumeration of all terms starting at a given term.
void deleteDocument(int docNum)           Deletes the document numbered docNum.
int deleteDocuments(Term term)           Deletes all documents that have a given term indexed.


package com.eric.lucene;

import java.io.File;
import java.io.IOException;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.CorruptIndexException;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermDocs;
import org.apache.lucene.index.TermPositions;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.store.LockObtainFailedException;
import org.apache.lucene.util.Version;

public class IndexReaderTest {
	private File path;

	public IndexReaderTest(String path) {
		this.path = new File(path);
	}

	public void createIndex() {
		try {
			IndexWriter writer = new IndexWriter(FSDirectory.open(this.path), new StandardAnalyzer(
					Version.LUCENE_30), IndexWriter.MaxFieldLength.LIMITED);
			Document doc1 = new Document();
			Document doc2 = new Document();
			Document doc3 = new Document();
			doc1.add(new Field("bookname", "thinking in java -- java 4", Field.Store.YES, Field.Index.ANALYZED));
			doc2.add(new Field("bookname", "java core 2", Field.Store.YES, Field.Index.ANALYZED));
			doc3.add(new Field("bookname", "thinking in c++", Field.Store.YES, Field.Index.ANALYZED));
			// Add the three documents, then close the writer so the changes are committed.
			writer.addDocument(doc1);
			writer.addDocument(doc2);
			writer.addDocument(doc3);
			writer.close();
		} catch (CorruptIndexException e) {
			e.printStackTrace();
		} catch (LockObtainFailedException e) {
			e.printStackTrace();
		} catch (IOException e) {
			e.printStackTrace();
		}
	}

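	public void test1() {
		try {
			IndexReader reader = IndexReader.open(FSDirectory.open(this.path));
			System.out.println("version:\t" + reader.getVersion());
			// Nothing has been deleted yet, so document numbers run from 0 to numDocs() - 1.
			int num = reader.numDocs();
			for (int i = 0; i < num; i++) {
				Document doc = reader.document(i);
				System.out.println(doc);
			}
			// Walk the postings of the term bookname:java.
			Term term = new Term("bookname", "java");
			TermDocs docs = reader.termDocs(term);
			while (docs.next()) {
				System.out.print("doc num:\t" + docs.doc() + "\t\t");
				System.out.println("frequency:\t" + docs.freq());
			}
			reader.close();
		} catch (CorruptIndexException e) {
			e.printStackTrace();
		} catch (IOException e) {
			e.printStackTrace();
		}
	}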
//	version:	1289906350314
//	Document<stored,indexed,tokenized<bookname:thinking in java -- java 4>>
//	Document<stored,indexed,tokenized<bookname:java core 2>>
//	Document<stored,indexed,tokenized<bookname:thinking in c++>>
//	doc num:	0		frequency:	2
//	doc num:	1		frequency:	1
	public void test2() {
		try {
			IndexReader reader = IndexReader.open(FSDirectory.open(this.path));
			System.out.println("version:\t" + reader.getVersion());
			Term term = new Term("bookname", "java");
			TermPositions pos = reader.termPositions(term);
			// For each matching document, print every position at which the term occurs.
			while (pos.next()) {
				System.out.print("frequency: " + pos.freq() + "\t");
				for (int i = 0; i < pos.freq(); i++) {
					System.out.print("pos: " + pos.nextPosition() + "\t");
				}
				System.out.println();
			}
			reader.close();
		} catch (CorruptIndexException e) {
			e.printStackTrace();
		} catch (IOException e) {
			e.printStackTrace();
		}
	}
//	version:	1289906350314
//	frequency: 2	pos: 2	pos: 3	
//	frequency: 1	pos: 0
//	createIndex() was not called before the second run, so the version number is unchanged
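	public void delete1() {
		try {
			IndexReader reader = IndexReader.open(FSDirectory.open(this.path), false); // readOnly must be false in order to delete
			System.out.println("version:\t" + reader.getVersion());
			System.out.println("num:\t" + reader.numDocs());
			// Assumption: the document number 2 is chosen here because it is
			// consistent with the output of delete2() below (num drops from 2 to 0).
			reader.deleteDocument(2);
			reader.close();
			reader = IndexReader.open(FSDirectory.open(this.path), false);
			System.out.println("version:\t" + reader.getVersion());
			System.out.println("num:\t" + reader.numDocs());
			reader.close();
		} catch (CorruptIndexException e) {
			e.printStackTrace();
		} catch (IOException e) {
			e.printStackTrace();
		}
	}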
//	version:	1289906350314
//	num:	3
//	version:	1289906350315
//	num:	2

	public void delete2() {
		try {
			IndexReader reader = IndexReader.open(FSDirectory.open(this.path), false); // readOnly must be false in order to delete
			System.out.println("version:\t" + reader.getVersion());
			System.out.println("num:\t" + reader.numDocs());
			// Delete every document that contains the term bookname:java.
			Term term = new Term("bookname", "java");
			reader.deleteDocuments(term);
			reader.close();
			reader = IndexReader.open(FSDirectory.open(this.path), false);
			System.out.println("version:\t" + reader.getVersion());
			System.out.println("num:\t" + reader.numDocs());
			reader.close();
		} catch (CorruptIndexException e) {
			e.printStackTrace();
		} catch (IOException e) {
			e.printStackTrace();
		}
	}
//	version:	1289906350315
//	num:	2
//	version:	1289906350316
//	num:	0

	public static void main(String[] args) {
		String path = "E:\\indexReaderTest";
		IndexReaderTest test = new IndexReaderTest(path);
		// Uncomment one step at a time to reproduce the outputs shown above.
//		test.createIndex();
//		test.test1();
//		test.test2();
//		test.delete1();
//		test.delete2();
	}
}

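4. The inverted index implementation in Lucene
The attached PDF 《Lucene 3.0 原理与代码分析完整版.pdf》 (Lucene 3.0 Principles and Code Analysis, Complete Edition) opens with an introduction to the basic principles of information retrieval. It is only a few pages long and easy to follow. Lucene is simply its own implementation of these principles, so reading it helps a great deal in understanding how Lucene builds its inverted index.
Reading the source code, you can find the static constant static final IndexingChain DefaultIndexingChain, which IndexWriter uses as its default indexing chain (in the 3.0.x source it is declared in DocumentsWriter):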
  static final IndexingChain DefaultIndexingChain = new IndexingChain() {

    @Override
    DocConsumer getChain(DocumentsWriter documentsWriter) {
      /*
      This is the current indexing chain:

      DocConsumer / DocConsumerPerThread
        --> code: DocFieldProcessor / DocFieldProcessorPerThread
          --> DocFieldConsumer / DocFieldConsumerPerThread / DocFieldConsumerPerField
            --> code: DocFieldConsumers / DocFieldConsumersPerThread / DocFieldConsumersPerField
              --> code: DocInverter / DocInverterPerThread / DocInverterPerField
                --> InvertedDocConsumer / InvertedDocConsumerPerThread / InvertedDocConsumerPerField
                  --> code: TermsHash / TermsHashPerThread / TermsHashPerField
                    --> TermsHashConsumer / TermsHashConsumerPerThread / TermsHashConsumerPerField
                      --> code: FreqProxTermsWriter / FreqProxTermsWriterPerThread / FreqProxTermsWriterPerField
                      --> code: TermVectorsTermsWriter / TermVectorsTermsWriterPerThread / TermVectorsTermsWriterPerField
                --> InvertedDocEndConsumer / InvertedDocConsumerPerThread / InvertedDocConsumerPerField
                  --> code: NormsWriter / NormsWriterPerThread / NormsWriterPerField
              --> code: StoredFieldsWriter / StoredFieldsWriterPerThread / StoredFieldsWriterPerField
      */

      // Build up indexing chain:

      final TermsHashConsumer termVectorsWriter = new TermVectorsTermsWriter(documentsWriter);
      final TermsHashConsumer freqProxWriter = new FreqProxTermsWriter();

      final InvertedDocConsumer  termsHash = new TermsHash(documentsWriter, true, freqProxWriter,
                                                           new TermsHash(documentsWriter, false, termVectorsWriter, null));
      final NormsWriter normsWriter = new NormsWriter();
      final DocInverter docInverter = new DocInverter(termsHash, normsWriter);
      return new DocFieldProcessor(documentsWriter, docInverter);
    }
  };
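
The chain above is hard to picture without knowing what it ultimately produces. The following stdlib-only sketch is not Lucene code and does not follow the indexing chain; it is a toy positional inverted index (term to document number to positions). ToyInvertedIndex, its tiny stop list and its tokenizing regular expression are all assumptions made for this illustration. Fed the three book titles from section 3, it yields the same frequencies and positions for the term "java" that test1() and test2() print.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

/** Toy positional inverted index: term -> (document number -> positions). */
final class ToyInvertedIndex {
    // Hypothetical stop list for this sketch; only "in" matters for the sample titles.
    private static final Set<String> STOP_WORDS =
            new HashSet<String>(Arrays.asList("a", "in", "the"));

    private final Map<String, Map<Integer, List<Integer>>> postings = new TreeMap<>();
    private int nextDoc = 0;

    /** Tokenizes the text, records every surviving term with its position, returns the doc number. */
    int addDocument(String text) {
        int doc = nextDoc++;
        int position = 0;
        // Assumed tokenization: lower-case, split on anything but letters, digits and '+'.
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z0-9+]+")) {
            if (token.isEmpty()) continue;
            if (!STOP_WORDS.contains(token)) {
                postings.computeIfAbsent(token, k -> new TreeMap<>())
                        .computeIfAbsent(doc, k -> new ArrayList<>())
                        .add(position);
            }
            position++; // a removed stop word still consumes a position
        }
        return doc;
    }

    /** Postings of one term: document number -> positions; empty if the term is unknown. */
    Map<Integer, List<Integer>> postingsFor(String term) {
        Map<Integer, List<Integer>> docs = postings.get(term);
        return docs == null ? new TreeMap<Integer, List<Integer>>() : docs;
    }
}
```

After adding "thinking in java -- java 4", "java core 2" and "thinking in c++" in that order, postingsFor("java") maps document 0 to positions 2 and 3 and document 1 to position 0. The gap at position 1 in document 0 is the removed stop word "in".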


5. IndexSearcher
Class Searcher
All Implemented Interfaces:
        Closeable, Searchable
Direct Known Subclasses:
        IndexSearcher, MultiSearcher

void search(Query query, Collector results)           Lower-level search API.
void search(Query query, Filter filter, Collector results)           Lower-level search API.
TopDocs search(Query query, Filter filter, int n)           Finds the top n hits for query, applying filter if non-null.
TopFieldDocs search(Query query, Filter filter, int n, Sort sort)           Search implementation with arbitrary sorting.
TopDocs search(Query query, int n)           Finds the top n hits for query.
abstract  void search(Weight weight, Filter filter, Collector results)           Lower-level search API.
abstract  TopDocs search(Weight weight, Filter filter, int n)           Expert: Low-level search implementation.
abstract  TopFieldDocs search(Weight weight, Filter filter, int n, Sort sort)           Expert: Low-level search implementation with arbitrary sorting.

abstract  Document doc(int i)           Returns the stored fields of document i.
Explanation explain(Weight weight, int doc)           Expert: low-level implementation method Returns an Explanation that describes how doc scored against weight.

/** Search implementation with arbitrary sorting.  Finds
   * the top <code>n</code> hits for <code>query</code>, applying
   * <code>filter</code> if non-null, and sorting the hits by the criteria in
   * <code>sort</code>.
   * <p>NOTE: this does not compute scores by default; use
   * {@link IndexSearcher#setDefaultFieldSortScoring} to
   * enable scoring.
   * @throws BooleanQuery.TooManyClauses
   */
  public TopFieldDocs search(Query query, Filter filter, int n,
                             Sort sort) throws IOException {
    return search(createWeight(query), filter, n, sort);
  }

  /** Lower-level search API.
  * <p>{@link Collector#collect(int)} is called for every matching document.
  * <p>Applications should only use this if they need <i>all</i> of the
  * matching documents.  The high-level search API ({@link
  * Searcher#search(Query, int)}) is usually more efficient, as it skips
  * non-high-scoring hits.
  * <p>Note: The <code>score</code> passed to this method is a raw score.
  * In other words, the score will not necessarily be a float whose value is
  * between 0 and 1.
  * @throws BooleanQuery.TooManyClauses
  */
 public void search(Query query, Collector results)
   throws IOException {
   search(createWeight(query), null, results);
 }

  /** Lower-level search API.
   * <p>{@link Collector#collect(int)} is called for every matching
   * document.
   * <br>Collector-based access to remote indexes is discouraged.
   * <p>Applications should only use this if they need <i>all</i> of the
   * matching documents.  The high-level search API ({@link
   * Searcher#search(Query, Filter, int)}) is usually more efficient, as it skips
   * non-high-scoring hits.
   * @param query to match documents
   * @param filter if non-null, used to permit documents to be collected.
   * @param results to receive hits
   * @throws BooleanQuery.TooManyClauses
   */
  public void search(Query query, Filter filter, Collector results)
  throws IOException {
    search(createWeight(query), filter, results);
  }

  /** Finds the top <code>n</code>
   * hits for <code>query</code>, applying <code>filter</code> if non-null.
   * @throws BooleanQuery.TooManyClauses
   */
  public TopDocs search(Query query, Filter filter, int n)
    throws IOException {
    return search(createWeight(query), filter, n);
  }

  /** Finds the top <code>n</code>
   * hits for <code>query</code>.
   * @throws BooleanQuery.TooManyClauses
   */
  public TopDocs search(Query query, int n)
    throws IOException {
    return search(query, null, n);
  }

  abstract public void search(Weight weight, Filter filter, Collector results) throws IOException;

search(Weight weight, Filter filter, Collector results)


  public void search(Weight weight, Filter filter, Collector collector)
      throws IOException {
    if (filter == null) {
      for (int i = 0; i < subReaders.length; i++) { // search each subreader
        collector.setNextReader(subReaders[i], docStarts[i]);
        Scorer scorer = weight.scorer(subReaders[i], !collector.acceptsDocsOutOfOrder(), true);
        if (scorer != null) {
          scorer.score(collector);
        }
      }
    } else {
      for (int i = 0; i < subReaders.length; i++) { // search each subreader
        collector.setNextReader(subReaders[i], docStarts[i]);
        searchWithFilter(subReaders[i], weight, filter, collector);
      }
    }
  }


private void searchWithFilter(IndexReader reader, Weight weight,
      final Filter filter, final Collector collector) throws IOException {

 return (TopFieldDocs) collector.topDocs();




    Class TopScoreDocCollector

int getTotalHits()           The total number of documents that matched this query.
TopDocs topDocs()           Returns the top docs that were collected by this collector.
TopDocs topDocs(int start)           Returns the documents in the range [start ..
TopDocs topDocs(int start, int howMany)           Returns the documents in the range [start ..

ScoreDoc[] scoreDocs           The top hits for the query.
int totalHits           The total number of hits for the query.

int doc           Expert: A hit document's number.
float score           Expert: The score of this document for the query.
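
The Javadoc above says the top-n search API is usually more efficient than collecting every hit. One common way to keep only the n best hits is a bounded min-heap, sketched below with the standard library. ToyHit and ToyTopCollector are hypothetical names for this illustration; they echo ScoreDoc and TopScoreDocCollector but are not the Lucene classes, and the tie-breaking rule (smaller document number wins) is an assumption of the sketch.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

/** Toy hit: a document number plus its score. */
final class ToyHit {
    final int doc;
    final float score;

    ToyHit(int doc, float score) {
        this.doc = doc;
        this.score = score;
    }
}

/** Keeps the n best hits seen so far in a min-heap, so the weakest kept hit is cheap to evict. */
final class ToyTopCollector {
    // Orders hits from weakest to strongest; on equal scores the larger doc number is weaker.
    private static final Comparator<ToyHit> WEAKEST_FIRST = new Comparator<ToyHit>() {
        @Override
        public int compare(ToyHit a, ToyHit b) {
            int byScore = Float.compare(a.score, b.score);
            return byScore != 0 ? byScore : Integer.compare(b.doc, a.doc);
        }
    };

    private final int n;
    private final PriorityQueue<ToyHit> heap;
    private int totalHits = 0;

    ToyTopCollector(int n) {
        this.n = n;
        this.heap = new PriorityQueue<ToyHit>(Math.max(1, n), WEAKEST_FIRST);
    }

    /** Called once per matching document. */
    void collect(int doc, float score) {
        totalHits++;
        heap.add(new ToyHit(doc, score));
        if (heap.size() > n) heap.poll(); // evict the weakest hit
    }

    /** Number of documents that matched, including those that were evicted. */
    int getTotalHits() {
        return totalHits;
    }

    /** The kept hits, strongest first. */
    List<ToyHit> topDocs() {
        List<ToyHit> best = new ArrayList<ToyHit>(heap);
        Collections.sort(best, Collections.reverseOrder(WEAKEST_FIRST));
        return best;
    }
}
```

The heap never grows beyond n + 1 entries, so memory stays bounded no matter how many documents match, while getTotalHits() still reports the full match count.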


6. Analyzer
In Lucene 3.0.2, the inheritance hierarchy of Analyzer is as follows (excerpted from the API documentation):
    Class Analyzer

java.lang.Object    org.apache.lucene.analysis.Analyzer
All Implemented Interfaces:    Closeable
Direct Known Subclasses:ArabicAnalyzer, BrazilianAnalyzer, ChineseAnalyzer, CJKAnalyzer, CollationKeyAnalyzer, CzechAnalyzer, DutchAnalyzer, FrenchAnalyzer, GermanAnalyzer, GreekAnalyzer, ICUCollationKeyAnalyzer, KeywordAnalyzer, PatternAnalyzer, PerFieldAnalyzerWrapper, PersianAnalyzer, QueryAutoStopWordAnalyzer, RussianAnalyzer, ShingleAnalyzerWrapper, SimpleAnalyzer, SmartChineseAnalyzer, SnowballAnalyzer, StandardAnalyzer, StopAnalyzer, ThaiAnalyzer, WhitespaceAnalyzer


  private Set<?> stopSet;
  private final Version matchVersion;

  /**
   * Specifies whether deprecated acronyms should be replaced with HOST type.
   * See {@linkplain https://issues.apache.org/jira/browse/LUCENE-1068}
   */
  private final boolean replaceInvalidAcronym,enableStopPositionIncrements;

  /** An unmodifiable set containing some common English words that are usually not
  useful for searching. */
  public static final Set<?> STOP_WORDS_SET = StopAnalyzer.ENGLISH_STOP_WORDS_SET; 

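When constructing the analyzer I basically always used Version.LUCENE_30, so I did not consider these two properties (replaceInvalidAcronym and enableStopPositionIncrements) or study them further. STOP_WORDS_SET is used to filter out stop words, and the content of StopAnalyzer.ENGLISH_STOP_WORDS_SET is as follows: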
  public static final Set<?> ENGLISH_STOP_WORDS_SET;
  static {
    final List<String> stopWords = Arrays.asList(
      "a", "an", "and", "are", "as", "at", "be", "but", "by",
      "for", "if", "in", "into", "is", "it",
      "no", "not", "of", "on", "or", "such",
      "that", "the", "their", "then", "there", "these",
      "they", "this", "to", "was", "will", "with"
    );
    final CharArraySet stopSet = new CharArraySet(stopWords.size(), false);
    stopSet.addAll(stopWords);
    ENGLISH_STOP_WORDS_SET = CharArraySet.unmodifiableSet(stopSet);
  }


StandardAnalyzer(Version matchVersion)           Builds an analyzer with the default stop words (STOP_WORDS_SET).
StandardAnalyzer(Version matchVersion, File stopwords)           Builds an analyzer with the stop words from the given file.
StandardAnalyzer(Version matchVersion, Reader stopwords)           Builds an analyzer with the stop words from the given reader.
StandardAnalyzer(Version matchVersion, Set<?> stopWords)           Builds an analyzer with the given stop words.

  /** Builds an analyzer with the given stop words.
   * @param matchVersion Lucene version to match See {@link
   * <a href="#version">above</a>}
   * @param stopWords stop words */
  public StandardAnalyzer(Version matchVersion, Set<?> stopWords) {
    stopSet = stopWords;
    enableStopPositionIncrements = StopFilter.getEnablePositionIncrementsVersionDefault(matchVersion);
    replaceInvalidAcronym = matchVersion.onOrAfter(Version.LUCENE_24);
    this.matchVersion = matchVersion;
  }

  /** Builds an analyzer with the stop words from the given file.
   * @see WordlistLoader#getWordSet(File)
   * @param matchVersion Lucene version to match See {@link
   * <a href="#version">above</a>}
   * @param stopwords File to read stop words from */
  public StandardAnalyzer(Version matchVersion, File stopwords) throws IOException {
    this(matchVersion, WordlistLoader.getWordSet(stopwords));
  }

  /**
   * Loads a text file and adds every line as an entry to a HashSet (omitting
   * leading and trailing whitespace). Every line of the file should contain only
   * one word. The words need to be in lowercase if you make use of an
   * Analyzer which uses LowerCaseFilter (like StandardAnalyzer).
   * @param wordfile File containing the wordlist
   * @return A HashSet with the file's words
   */
  public static HashSet<String> getWordSet(File wordfile) throws IOException {
    HashSet<String> result = new HashSet<String>();
    FileReader reader = null;
    try {
      reader = new FileReader(wordfile);
      result = getWordSet(reader);
    }
    finally {
      if (reader != null)
        reader.close();
    }
    return result;
  }

  /**
   * Reads lines from a Reader and adds every line as an entry to a HashSet (omitting
   * leading and trailing whitespace). Every line of the Reader should contain only
   * one word. The words need to be in lowercase if you make use of an
   * Analyzer which uses LowerCaseFilter (like StandardAnalyzer).
   * @param reader Reader containing the wordlist
   * @return A HashSet with the reader's words
   */
  public static HashSet<String> getWordSet(Reader reader) throws IOException {
    HashSet<String> result = new HashSet<String>();
    BufferedReader br = null;
    try {
      if (reader instanceof BufferedReader) {
        br = (BufferedReader) reader;
      } else {
        br = new BufferedReader(reader);
      }
      String word = null;
      while ((word = br.readLine()) != null) {
        result.add(word.trim());
      }
    }
    finally {
      if (br != null)
        br.close();
    }
    return result;
  }
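
The Javadoc warns that the words must be lowercase when the analyzer lower-cases its tokens. The stdlib-only sketch below shows why. WordListSketch and readWords are hypothetical names for this illustration, not the WordlistLoader API; skipping blank lines and wrapping IOException in UncheckedIOException are choices of the sketch.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.HashSet;
import java.util.Set;

final class WordListSketch {
    /** Reads one word per line, trimming surrounding whitespace. */
    static Set<String> readWords(Reader reader) {
        Set<String> words = new HashSet<String>();
        BufferedReader br = new BufferedReader(reader);
        try {
            String line;
            while ((line = br.readLine()) != null) {
                String word = line.trim();
                if (!word.isEmpty()) words.add(word); // blank lines are skipped in this sketch
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } finally {
            try {
                br.close();
            } catch (IOException ignored) {
                // nothing useful can be done if closing fails
            }
        }
        return words;
    }

    /** True if the already lower-cased token would be dropped by this stop list. */
    static boolean isStopWord(Set<String> stopWords, String lowerCasedToken) {
        return stopWords.contains(lowerCasedToken);
    }
}
```

With a word list read from new StringReader("The\n  of \n"), the token "of" is treated as a stop word, but the token "the" is not, because the list contains "The" with a capital letter and the lookup is case-sensitive.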


7. Directory
org.apache.lucene.store     Class Directory
java.lang.Object    org.apache.lucene.store.Directory
All Implemented Interfaces:    Closeable
Direct Known Subclasses:DbDirectory, FileSwitchDirectory, FSDirectory, JEDirectory, RAMDirectory

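The two used most often are RAMDirectory and FSDirectory. RAMDirectory keeps the index in memory (with a large data set this is dangerous and can end in java.lang.OutOfMemoryError: Java heap space), while FSDirectory stores the index files on the local disk. That is the general idea. In practice, make sure that the IndexWriter and the IndexReader operate on the same Directory; otherwise an error occurs. This is the error produced when the reader's RAMDirectory is not the one the writer used: no segments* file found in org.apache.lucene.store.RAMDirectory@765291: files: []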


static FSDirectory open(File path)           Creates an FSDirectory instance, trying to pick the best implementation given the current environment.
static FSDirectory open(File path, LockFactory lockFactory)           Just like open(File), but allows you to also specify a custom LockFactory.

 /** Creates an FSDirectory instance, trying to pick the
   *  best implementation given the current environment.
   *  The directory returned uses the {@link NativeFSLockFactory}.
   *  <p>Currently this returns {@link NIOFSDirectory}
   *  on non-Windows JREs and {@link SimpleFSDirectory}
   *  on Windows.
   * <p><b>NOTE</b>: this method may suddenly change which
   * implementation is returned from release to release, in
   * the event that higher performance defaults become
   * possible; if the precise implementation is important to
   * your application, please instantiate it directly,
   * instead. On 64 bit systems, it may also good to
   * return {@link MMapDirectory}, but this is disabled
   * because of officially missing unmap support in Java.
   * For optimal performance you should consider using
   * this implementation on 64 bit JVMs.
   * <p>See <a href="#subclasses">above</a> */
  public static FSDirectory open(File path) throws IOException {
    return open(path, null);
  }

  /** Just like {@link #open(File)}, but allows you to
   *  also specify a custom {@link LockFactory}. */
  public static FSDirectory open(File path, LockFactory lockFactory) throws IOException {
    /* For testing:
    MMapDirectory dir=new MMapDirectory(path, lockFactory);
    return dir;
    */

    if (Constants.WINDOWS) {
      return new SimpleFSDirectory(path, lockFactory);
    } else {
      return new NIOFSDirectory(path, lockFactory);
    }
  }

8. Query, Sort, and Filter
Class Query

All Implemented Interfaces:
    Serializable, Cloneable
Direct Known Subclasses:
    BooleanQuery, BoostingQuery, ConstantScoreQuery, CustomScoreQuery, DisjunctionMaxQuery, FilteredQuery, FuzzyLikeThisQuery, MatchAllDocsQuery, MoreLikeThisQuery, MultiPhraseQuery, MultiTermQuery, PhraseQuery, SpanQuery, TermQuery, ValueSourceQuery

TermQuery(Term t)           Constructs a query for the term t.

TermQuery query = new TermQuery(new Term("bookname","java"));

A Query that matches documents matching boolean combinations of other queries, e.g. TermQuerys, PhraseQuerys or other BooleanQuerys.

BooleanQuery()           Constructs an empty boolean query.
BooleanQuery(boolean disableCoord)           Constructs an empty boolean query.

private static int maxClauseCount = 1024; // maximum number of clauses; the default is 1024
this.disableCoord = disableCoord; // set in the second constructor; used by the Similarity class during search
protected int minNrShouldMatch = 0; // set by setMinimumNumberShouldMatch(int)
private ArrayList<BooleanClause> clauses = new ArrayList<BooleanClause>(); // container that holds the BooleanClause objects

void add(BooleanClause clause)           Adds a clause to a boolean query.
void add(Query query, BooleanClause.Occur occur)           Adds a clause to a boolean query.

 public static enum Occur {

    /** Use this operator for clauses that <i>must</i> appear in the matching documents. */
    MUST     { @Override public String toString() { return "+"; } },

    /** Use this operator for clauses that <i>should</i> appear in the 
     * matching documents. For a BooleanQuery with no <code>MUST</code> 
     * clauses one or more <code>SHOULD</code> clauses must match a document 
     * for the BooleanQuery to match.
     * @see BooleanQuery#setMinimumNumberShouldMatch
     */
    SHOULD   { @Override public String toString() { return "";  } },

    /** Use this operator for clauses that <i>must not</i> appear in the matching documents.
     * Note that it is not possible to search for queries that only consist
     * of a <code>MUST_NOT</code> clause. */
    MUST_NOT { @Override public String toString() { return "-"; } };

  }


  /** The query whose matching documents are combined by the boolean query.
   */
  private Query query;
  private Occur occur;

  /** Constructs a BooleanClause.
   */
  public BooleanClause(Query query, Occur occur) {
    this.query = query;
    this.occur = occur;
  }
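
The matching rules described for MUST, SHOULD and MUST_NOT can be modelled with plain set operations. The sketch below is a toy under stated assumptions: ToyBooleanQuery is a hypothetical name, each clause is represented only by the set of document numbers it matches, and scoring, coord and setMinimumNumberShouldMatch are ignored. It is not how BooleanQuery is implemented.

```java
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

final class ToyBooleanQuery {
    enum Occur { MUST, SHOULD, MUST_NOT }

    /** Combines per-clause sets of matching document numbers according to each clause's Occur. */
    static Set<Integer> evaluate(List<Set<Integer>> clauseMatches, List<Occur> occurs) {
        Set<Integer> must = null;                    // intersection of all MUST clauses
        Set<Integer> should = new TreeSet<Integer>(); // union of all SHOULD clauses
        Set<Integer> mustNot = new TreeSet<Integer>();
        for (int i = 0; i < occurs.size(); i++) {
            Set<Integer> docs = clauseMatches.get(i);
            switch (occurs.get(i)) {
                case MUST:
                    if (must == null) must = new TreeSet<Integer>(docs);
                    else must.retainAll(docs);
                    break;
                case SHOULD:
                    should.addAll(docs);
                    break;
                case MUST_NOT:
                    mustNot.addAll(docs);
                    break;
            }
        }
        // Without a MUST clause at least one SHOULD clause has to match, so a
        // query made only of MUST_NOT clauses matches nothing.
        Set<Integer> result = (must != null) ? must : should;
        result.removeAll(mustNot);
        return result;
    }
}
```

Using the three titles from section 3 (java matches documents 0 and 1, thinking matches 0 and 2), "+java -thinking" yields document 1 only, while a query consisting solely of "-thinking" yields no documents.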

PhraseQuery()           Constructs an empty phrase query.

  private String field; // every term in this PhraseQuery must belong to the same field
  private ArrayList<Term> terms = new ArrayList<Term>(4); // the terms of the phrase
  private ArrayList<Integer> positions = new ArrayList<Integer>(4); // the relative position of each term
  private int maxPosition = 0; // the largest relative position added so far
  private int slop = 0; // the allowed distance between terms; 0 means an exact phrase

  public void setSlop(int s) { slop = s; }

  /**
   * Adds a term to the end of the query phrase.
   * The relative position of the term is the one immediately after the last term added.
   */
  public void add(Term term) {
    int position = 0;
    if(positions.size() > 0)
        position = positions.get(positions.size()-1).intValue() + 1;

    add(term, position);
  }

  /**
   * Adds a term to the end of the query phrase.
   * The relative position of the term within the phrase is specified explicitly.
   * This allows e.g. phrases with more than one term at the same position
   * or phrases with gaps (e.g. in connection with stopwords).
   * @param term
   * @param position
   */
  public void add(Term term, int position) {
      if (terms.size() == 0)
          field = term.field();
      else if (term.field() != field) // reference comparison works because Term interns field names
          throw new IllegalArgumentException("All phrase terms must be in the same field: " + term); // every term must use the same field

      terms.add(term);
      positions.add(Integer.valueOf(position));
      if (position > maxPosition) maxPosition = position;
  }
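
To make the two add() variants concrete, here is a minimal stand-alone sketch. PhrasePositionsSketch and matchesExactly are hypothetical names, and the matcher only models a slop of 0. It assumes that the analyzer drops the stopword "in" but keeps its position gap, so that "thinking in java" is indexed as thinking@0 and java@2.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical model of PhraseQuery term positions; not Lucene code. */
class PhrasePositionsSketch {
    private final List<String> terms = new ArrayList<String>();
    private final List<Integer> positions = new ArrayList<Integer>();

    /** Same rule as PhraseQuery.add(Term): one position after the last term added. */
    void add(String term) {
        int position = positions.isEmpty() ? 0 : positions.get(positions.size() - 1) + 1;
        add(term, position);
    }

    /** Same idea as PhraseQuery.add(Term, int): the caller chooses the position. */
    void add(String term, int position) {
        terms.add(term);
        positions.add(position);
    }

    /** Exact (slop 0) match against a document given as position -> token. */
    boolean matchesExactly(Map<Integer, String> docTokens) {
        if (terms.isEmpty()) return false;
        for (int start : docTokens.keySet()) {
            boolean all = true;
            for (int i = 0; i < terms.size(); i++) {
                // Shift the query positions so that the first term sits at 'start'.
                int docPos = start - positions.get(0) + positions.get(i);
                if (!terms.get(i).equals(docTokens.get(docPos))) {
                    all = false;
                    break;
                }
            }
            if (all) return true;
        }
        return false;
    }

    public static void main(String[] args) {
        Map<Integer, String> doc = new HashMap<Integer, String>();
        doc.put(0, "thinking");
        doc.put(2, "java"); // position 1 is the gap left by the removed stopword

        PhrasePositionsSketch adjacent = new PhrasePositionsSketch();
        adjacent.add("thinking");
        adjacent.add("java"); // implicit position 1
        System.out.println("adjacent terms: " + adjacent.matchesExactly(doc));

        PhrasePositionsSketch withGap = new PhrasePositionsSketch();
        withGap.add("thinking", 0);
        withGap.add("java", 2); // explicit gap for the stopword
        System.out.println("explicit gap:   " + withGap.matchesExactly(doc));
    }
}
```

Under that assumption the first query fails and the second succeeds, which is the situation the "phrases with gaps" remark in the Javadoc describes.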

WildcardQuery(Term term)

/** Implements the wildcard search query. Supported wildcards are <code>*</code>, which
 * matches any character sequence (including the empty one), and <code>?</code>,
 * which matches any single character. Note this query can be slow, as it
 * needs to iterate over many terms. In order to prevent extremely slow WildcardQueries,
 * a Wildcard term should not start with one of the wildcards <code>*</code> or
 * <code>?</code>.
 * <p>This query uses the {@link
 * rewrite method.
 * @see WildcardTermEnum */
public class WildcardQuery extends MultiTermQuery {
  private boolean termContainsWildcard; // true if the term contains '*' or '?'
  private boolean termIsPrefix; // true if the only wildcard is a single trailing '*'; this case is handled as a prefix match, because a general WildcardQuery is much slower
  protected Term term;

  public WildcardQuery(Term term) {
    this.term = term;
    String text = term.text();
    this.termContainsWildcard = (text.indexOf('*') != -1)
        || (text.indexOf('?') != -1);
    this.termIsPrefix = termContainsWildcard 
        && (text.indexOf('?') == -1) 
        && (text.indexOf('*') == text.length() - 1);
  }
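
The two flags computed in the constructor can be tried out on a few patterns with the stand-alone sketch below. WildcardFlagsSketch is a hypothetical helper; its matches method is a naive recursive matcher written for illustration, not the algorithm used by WildcardTermEnum.

```java
/** Hypothetical illustration of the WildcardQuery flags; not Lucene code. */
class WildcardFlagsSketch {

    /** Mirrors termContainsWildcard: the text has a '*' or a '?'. */
    static boolean containsWildcard(String text) {
        return text.indexOf('*') != -1 || text.indexOf('?') != -1;
    }

    /** Mirrors termIsPrefix: no '?', and the first '*' is the last character. */
    static boolean isPrefix(String text) {
        return containsWildcard(text)
                && text.indexOf('?') == -1
                && text.indexOf('*') == text.length() - 1;
    }

    /** Naive matcher: '*' matches any sequence (also empty), '?' any single character. */
    static boolean matches(String pattern, String s) {
        return matches(pattern, 0, s, 0);
    }

    private static boolean matches(String p, int pi, String s, int si) {
        if (pi == p.length()) return si == s.length();
        char c = p.charAt(pi);
        if (c == '*') {
            // Let '*' consume 0..n characters of the remaining input.
            for (int k = si; k <= s.length(); k++) {
                if (matches(p, pi + 1, s, k)) return true;
            }
            return false;
        }
        if (si == s.length()) return false;
        return (c == '?' || c == s.charAt(si)) && matches(p, pi + 1, s, si + 1);
    }

    public static void main(String[] args) {
        String[] patterns = { "think*", "ja?a", "ja?a*", "java" };
        for (String p : patterns) {
            System.out.println(p + " containsWildcard=" + containsWildcard(p)
                    + " isPrefix=" + isPrefix(p));
        }
        System.out.println(matches("ja?a*", "java"));
    }
}
```

Only "think*" qualifies as a prefix pattern here; "ja?a*" ends in '*' but also contains '?', so it has to take the slower general path.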




PrefixQuery(Term prefix)           Constructs a query for terms starting with prefix.

FuzzyQuery(Term term)           Calls FuzzyQuery(term, 0.5f, 0).
FuzzyQuery(Term term, float minimumSimilarity)           Calls FuzzyQuery(term, minimumSimilarity, 0).
FuzzyQuery(Term term, float minimumSimilarity, int prefixLength)           Create a new FuzzyQuery that will match terms with a similarity of at least minimumSimilarity to term.

  public final static float defaultMinSimilarity = 0.5f;
  public final static int defaultPrefixLength = 0;
  private float minimumSimilarity;
  private int prefixLength;
  private boolean termLongEnough = false;
  protected Term term;

  /**
   * Create a new FuzzyQuery that will match terms with a similarity 
   * of at least <code>minimumSimilarity</code> to <code>term</code>.
   * If a <code>prefixLength</code> &gt; 0 is specified, a common prefix
   * of that length is also required.
   * @param term the term to search for
   * @param minimumSimilarity a value between 0 and 1 to set the required similarity
   *  between the query term and the matching terms. For example, for a
   *  <code>minimumSimilarity</code> of <code>0.5</code> a term of the same length
   *  as the query term is considered similar to the query term if the edit distance
   *  between both terms is less than <code>length(term)*0.5</code>
   * @param prefixLength length of common (non-fuzzy) prefix
   * @throws IllegalArgumentException if minimumSimilarity is &gt;= 1 or &lt; 0
   * or if prefixLength &lt; 0
   */
  public FuzzyQuery(Term term, float minimumSimilarity, int prefixLength) throws IllegalArgumentException {
    this.term = term;
    if (minimumSimilarity >= 1.0f)
      throw new IllegalArgumentException("minimumSimilarity >= 1");
    else if (minimumSimilarity < 0.0f)
      throw new IllegalArgumentException("minimumSimilarity < 0");
    if (prefixLength < 0)
      throw new IllegalArgumentException("prefixLength < 0");
    if (term.text().length() > 1.0f / (1.0f - minimumSimilarity)) {
      this.termLongEnough = true;
    }
    this.minimumSimilarity = minimumSimilarity;
    this.prefixLength = prefixLength;
  }

  /**
   * Calls {@link #FuzzyQuery(Term, float, int) FuzzyQuery(term, minimumSimilarity, 0)}.
   */
  public FuzzyQuery(Term term, float minimumSimilarity) throws IllegalArgumentException {
      this(term, minimumSimilarity, defaultPrefixLength);
  }

  /**
   * Calls {@link #FuzzyQuery(Term, float, int) FuzzyQuery(term, 0.5f, 0)}.
   */
  public FuzzyQuery(Term term) {
    this(term, defaultMinSimilarity, defaultPrefixLength);
  }
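
The termLongEnough test is easy to overlook, so here is a small stand-alone sketch of the constructor checks. FuzzyThresholdSketch is a hypothetical class; the two checks are copied from the constructor above. My reading, which is an assumption rather than something the listing shows, is that a term shorter than the threshold cannot tolerate even one edit at the requested similarity, so only an exact match remains possible for it.

```java
/** Hypothetical illustration of the FuzzyQuery constructor checks; not Lucene code. */
class FuzzyThresholdSketch {

    /** Same argument validation as the three-argument FuzzyQuery constructor. */
    static void validate(float minimumSimilarity, int prefixLength) {
        if (minimumSimilarity >= 1.0f)
            throw new IllegalArgumentException("minimumSimilarity >= 1");
        else if (minimumSimilarity < 0.0f)
            throw new IllegalArgumentException("minimumSimilarity < 0");
        if (prefixLength < 0)
            throw new IllegalArgumentException("prefixLength < 0");
    }

    /** Same expression as the termLongEnough check in the constructor. */
    static boolean termLongEnough(String text, float minimumSimilarity) {
        return text.length() > 1.0f / (1.0f - minimumSimilarity);
    }

    public static void main(String[] args) {
        validate(0.5f, 0);
        // With 0.5 the threshold is 2 characters, with 0.9 it is about 10.
        System.out.println("jama @0.5: " + termLongEnough("jama", 0.5f));
        System.out.println("ja   @0.5: " + termLongEnough("ja", 0.5f));
        System.out.println("jama @0.9: " + termLongEnough("jama", 0.9f));
    }
}
```

For the query term "jama" used in the test class later in this section, the check passes at 0.5 and fails at 0.9, which agrees with the observation that a similarity of 0.9 returns no fuzzy matches.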



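The Levenshtein algorithm. It returns the Levenshtein distance between two strings. The Levenshtein distance, also called the edit distance, is the minimum number of edit operations needed to turn one string into the other. The permitted operations are replacing one character with another, inserting a character, and deleting a character. For example, turning kitten into sitting takes three operations:
kitten → sitten (replace k with s) → sittin (replace e with i) → sitting (insert g)
The levenshtein() function gives every operation (replace, insert, delete) the same weight. You can, however, define the cost of each operation through the optional insert, replace and delete parameters. (This description refers to a general-purpose levenshtein() library function; it is not a Lucene API.)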
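
The definition above can be written as a short dynamic program. EditDistanceSketch is a hypothetical stand-alone class with unit costs for all three operations. Its similarity method is an assumption: it follows the formula 1 - editDistance / length (length of the shorter string) that the FuzzyTermEnum documentation describes, and it ignores the prefixLength handling.

```java
/** Hypothetical stand-alone edit distance sketch; not the Lucene implementation. */
class EditDistanceSketch {

    /** Levenshtein distance with unit cost for replace, insert and delete. */
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // turning "" into b[0..j) takes j insertions
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // turning a[0..i) into "" takes i deletions
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(
                        Math.min(curr[j - 1] + 1, prev[j] + 1), // insert, delete
                        prev[j - 1] + cost);                    // replace or keep
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    /** Assumed similarity: 1 - distance / length of the shorter string. */
    static float similarity(String a, String b) {
        int shortest = Math.min(a.length(), b.length());
        if (shortest == 0) {
            return a.length() == b.length() ? 1.0f : 0.0f;
        }
        return 1.0f - (float) levenshtein(a, b) / shortest;
    }

    public static void main(String[] args) {
        System.out.println(levenshtein("kitten", "sitting")); // 3
        System.out.println(similarity("jama", "java"));       // 0.75
    }
}
```

Under this assumed formula "jama" and "java" have a similarity of 0.75, so the pair passes a minimum similarity of 0.5 and fails one of 0.9.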

TermRangeQuery(String field, String lowerTerm, String upperTerm, boolean includeLower, boolean includeUpper)           Constructs a query selecting all terms greater/equal than lowerTerm but less/equal than upperTerm.
TermRangeQuery(String field, String lowerTerm, String upperTerm, boolean includeLower, boolean includeUpper, Collator collator)           Constructs a query selecting all terms greater/equal than lowerTerm but less/equal than upperTerm.
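
A term range is a comparison on the term text. The sketch below (TermRangeSketch is a hypothetical class) assumes that, when no Collator is given, terms are compared with String.compareTo. It uses the same bounds as the TermRangeQuery test later in this section.

```java
/** Hypothetical illustration of a term range check; not Lucene code. */
class TermRangeSketch {

    /** True if term lies between lower and upper, honouring the two inclusive flags. */
    static boolean inRange(String term, String lower, String upper,
                           boolean includeLower, boolean includeUpper) {
        int lo = term.compareTo(lower);
        int hi = term.compareTo(upper);
        boolean aboveLower = includeLower ? lo >= 0 : lo > 0;
        boolean belowUpper = includeUpper ? hi <= 0 : hi < 0;
        return aboveLower && belowUpper;
    }

    public static void main(String[] args) {
        // 'v' sorts after 'm' and before 'z', but after 'n'.
        System.out.println(inRange("java", "jama", "jaza", true, true));
        System.out.println(inRange("java", "jama", "jana", true, true));
    }
}
```

The indexed term "java" falls inside [jama, jaza] but outside [jama, jana], which matches the two results recorded for getTermRangeQuery.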

(Excerpted from the API documentation. I kept precisionStep at its default, NumericUtils.PRECISION_STEP_DEFAULT (4), and have not studied it in depth.)
A Query that matches numeric values within a specified range. To use this, you must first index the numeric values using NumericField (expert: NumericTokenStream). If your terms are instead textual, you should use TermRangeQuery. NumericRangeFilter is the filter equivalent of this query.

NOTE: This API is experimental and might change in incompatible ways in the next release.

static NumericRangeQuery<Double> newDoubleRange(String field, Double min, Double max, boolean minInclusive, boolean maxInclusive)           Factory that creates a NumericRangeQuery, that queries a double range using the default precisionStep NumericUtils.PRECISION_STEP_DEFAULT (4).
static NumericRangeQuery<Double> newDoubleRange(String field, int precisionStep, Double min, Double max, boolean minInclusive, boolean maxInclusive)           Factory that creates a NumericRangeQuery, that queries a double range using the given precisionStep.
static NumericRangeQuery<Float> newFloatRange(String field, Float min, Float max, boolean minInclusive, boolean maxInclusive)           Factory that creates a NumericRangeQuery, that queries a float range using the default precisionStep NumericUtils.PRECISION_STEP_DEFAULT (4).
static NumericRangeQuery<Float> newFloatRange(String field, int precisionStep, Float min, Float max, boolean minInclusive, boolean maxInclusive)           Factory that creates a NumericRangeQuery, that queries a float range using the given precisionStep.
static NumericRangeQuery<Integer> newIntRange(String field, Integer min, Integer max, boolean minInclusive, boolean maxInclusive)           Factory that creates a NumericRangeQuery, that queries a int range using the default precisionStep NumericUtils.PRECISION_STEP_DEFAULT (4).
static NumericRangeQuery<Integer> newIntRange(String field, int precisionStep, Integer min, Integer max, boolean minInclusive, boolean maxInclusive)           Factory that creates a NumericRangeQuery, that queries a int range using the given precisionStep.
static NumericRangeQuery<Long> newLongRange(String field, int precisionStep, Long min, Long max, boolean minInclusive, boolean maxInclusive)           Factory that creates a NumericRangeQuery, that queries a long range using the given precisionStep.
static NumericRangeQuery<Long> newLongRange(String field, Long min, Long max, boolean minInclusive, boolean maxInclusive)           Factory that creates a NumericRangeQuery, that queries a long range using the default precisionStep NumericUtils.PRECISION_STEP_DEFAULT (4).
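
The factory methods take boxed bounds. The sketch below models only the bound semantics and assumes that a null bound leaves that side of the range open. NumericBoundsSketch is a hypothetical class and says nothing about how precisionStep or the trie encoding work.

```java
/** Hypothetical illustration of numeric range bounds; not Lucene code. */
class NumericBoundsSketch {

    /** True if value lies in the range; a null bound means that side is open. */
    static boolean inRange(float value, Float min, Float max,
                           boolean minInclusive, boolean maxInclusive) {
        if (min != null) {
            int c = Float.compare(value, min);
            if (minInclusive ? c < 0 : c <= 0) return false;
        }
        if (max != null) {
            int c = Float.compare(value, max);
            if (maxInclusive ? c > 0 : c >= 0) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        // Open on both sides: every value is accepted.
        System.out.println(inRange(0.5f, null, null, true, true));
        // Inverted bounds (min above max): nothing can match.
        System.out.println(inRange(0.1f, 0.3f, 0.10f, true, true));
        // Ordinary closed range.
        System.out.println(inRange(0.1f, 0.1f, 0.3f, true, true));
    }
}
```

This explains two cases in the test class later in this section: a range whose minimum is larger than its maximum can never match, and a range with both bounds null accepts every indexed value of the field.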

NumericField(String name)           Creates a field for numeric values using the default precisionStep NumericUtils.PRECISION_STEP_DEFAULT (4).
NumericField(String name, Field.Store store, boolean index)           Creates a field for numeric values using the default precisionStep NumericUtils.PRECISION_STEP_DEFAULT (4).
NumericField(String name, int precisionStep)           Creates a field for numeric values with the specified precisionStep.
NumericField(String name, int precisionStep, Field.Store store, boolean index)           Creates a field for numeric values with the specified precisionStep.

NumericField setDoubleValue(double value)           Initializes the field with the supplied double value.
NumericField setFloatValue(float value)           Initializes the field with the supplied float value.
NumericField setIntValue(int value)           Initializes the field with the supplied int value.
NumericField setLongValue(long value)           Initializes the field with the supplied long value.


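RegexQuery: the API documentation describes this class, but it is not in the Lucene 3.0.2 core jar. I do not know why. My first guess was that its performance was not satisfactory, so it was not released together with 3.0.2. A more likely explanation is that it is distributed separately in the contrib regex module (package org.apache.lucene.search.regex), which I have not checked.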
package com.eric.lucene;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.NumericField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.FuzzyQuery;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.NumericRangeQuery;
import org.apache.lucene.search.PhraseQuery;
import org.apache.lucene.search.PrefixQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TermRangeQuery;
import org.apache.lucene.search.TopScoreDocCollector;
import org.apache.lucene.search.WildcardQuery;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

/**
 * The comments in each method record the results of running that query.
 * @author Yuanbo Han
 */
public class QueryTest {
	public static Query getTermQuery(){
		TermQuery query = new TermQuery(new Term("bookname","java"));
		return query;
//		thinking in java		0.625
//		thinking in java IV(Java Classic)		0.61871845
	}

	public static Query getBooleanQuery(){
		TermQuery termQuery1 = new TermQuery(new Term("bookname", "java"));
		TermQuery termQuery2 = new TermQuery(new Term("bookname", "thinking"));
		BooleanQuery query = new BooleanQuery();
		query.add(termQuery1, BooleanClause.Occur.SHOULD);
		query.add(termQuery2, BooleanClause.Occur.SHOULD);
		return query;
//		thinking in java		0.76735055
//		thinking in java IV(Java Classic)		0.68474615
//		thinking in c++		0.12914689
	}

	public static Query getPhraseQuery(){
		PhraseQuery query = new PhraseQuery();

		// Results with a non-zero slop (the exact value used is not recorded here):
		//thinking in java		0.75674474
		//thinking in java IV(Java Classic)		0.5297213
		query.setSlop(0);// no result: with slop 0 no document contains the exact phrase "thinking java"
		query.add(new Term("bookname", "thinking"));
		query.add(new Term("bookname", "java"));
		return query;
	}

	public static Query getWildcardQuery(){
		//WildcardQuery query = new WildcardQuery(new Term("bookname","think*"));
		//thinking in java		1.0
		//thinking in java IV(Java Classic)		1.0
		//thinking in c++		1.0
		//WildcardQuery query = new WildcardQuery(new Term("bookname","ja?a"));
		//thinking in java		1.0
		//thinking in java IV(Java Classic)		1.0
		WildcardQuery query = new WildcardQuery(new Term("bookname","ja?a*"));
		//thinking in java		1.0
		//thinking in java IV(Java Classic)		1.0
		return query;
	}

	public static Query getPrefixQuery(){
		PrefixQuery query = new PrefixQuery(new Term("bookname","java"));// matches terms that start with "java"
		//thinking in java		1.0
		//thinking in java IV(Java Classic)		1.0

		return query;
	}

	public static Query getFuzzyQuery(){
		//FuzzyQuery query = new FuzzyQuery(new Term("bookname","jama"));// default: similarity = 0.5, prefixLength = 0.
		/*
		 * I am not sure exactly how the edit distance is computed, and the Javadoc example
		 * confused me at first. It says: "for a minimumSimilarity of 0.5 a term of the same
		 * length as the query term is considered similar to the query term if the edit distance
		 * between both terms is less than length(term)*0.5". Read as a general rule, this would
		 * mean that a higher similarity allows more edits, whereas a higher similarity should
		 * require fewer. The example only describes the value 0.5: if the similarity is
		 * 1 - distance/length (an assumption based on the FuzzyTermEnum documentation), the
		 * bound is distance < length(term)*(1 - minimumSimilarity). This agrees with the test
		 * below, where a similarity of 0.9 returns no result.
		 */
		//thinking in java		0.625
		//thinking in java IV(Java Classic)		0.61871845

//		FuzzyQuery query = new FuzzyQuery(new Term("bookname","jama"),0.9f);//no result
//		FuzzyQuery query = new FuzzyQuery(new Term("bookname","jama"),0.5f,3);//no result
		FuzzyQuery query = new FuzzyQuery(new Term("bookname","jama"),0.5f,2);
		//thinking in java		0.625
		//thinking in java IV(Java Classic)		0.61871845

		return query;
	}

	public static Query getTermRangeQuery(){
//		TermRangeQuery query = new TermRangeQuery("bookname", "jama", "jaza", true, true);
		//thinking in java		1.0
		//thinking in java IV(Java Classic)		1.0

		TermRangeQuery query = new TermRangeQuery("bookname", "jama", "jana", true, true);// no result
		return query;
	}

	public static Query getNumericRangeQuery(){
//		Query query = NumericRangeQuery.newFloatRange("bookname", 0.3f, 0.10f, true, true);// no result
		/* To use NumericRangeQuery, the index must be built with NumericField.
		If the following fields are added to the documents:
		doc1.add(new NumericField("value", Field.Store.YES, true).setFloatValue(0.1f));
		doc2.add(new NumericField("value", Field.Store.YES, true).setFloatValue(0.5f));
		doc3.add(new NumericField("value", Field.Store.YES, true).setFloatValue(0.1f));
		and the output statement is changed to System.out.print(doc.get("value") + "\t\t");
		the result is:
		0.1		1.0
		0.5		1.0
		0.1		1.0
		*/

		Query query = NumericRangeQuery.newFloatRange("value", null, null, true, true);// no result (the documents built in main() have no "value" field)
		return query;
	}

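	/**
	 * The API documentation describes RegexQuery and the interfaces related to it,
	 * but the class is not included in the Lucene 3.0.2 core jar.
	 * I first guessed that its performance was not satisfactory; more likely it is
	 * distributed separately in the contrib regex module (org.apache.lucene.search.regex).
	 * @return always null, because RegexQuery is not available on this classpath
	 */
	public static Query getRegexQuery(){
		return null;
	}
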
	public static void main(String[] args) throws Exception {
		Directory dir = new RAMDirectory();
		// The last constructor argument is missing from the original listing;
		// MaxFieldLength.UNLIMITED is an assumption.
		IndexWriter writer = new IndexWriter(
				dir, new StandardAnalyzer(Version.LUCENE_30), true,
				IndexWriter.MaxFieldLength.UNLIMITED);
		Document doc1 = new Document();
		Document doc2 = new Document();
		Document doc3 = new Document();
		doc1.add(new Field("bookname","thinking in java", Field.Store.YES, Field.Index.ANALYZED));
		doc2.add(new Field("bookname","thinking in java IV(Java Classic)", Field.Store.YES, Field.Index.ANALYZED));
		doc3.add(new Field("bookname","thinking in c++", Field.Store.YES, Field.Index.ANALYZED));
		// Assumed steps, missing from the listing: write the documents and close the writer.
		writer.addDocument(doc1);
		writer.addDocument(doc2);
		writer.addDocument(doc3);
		writer.close();

		IndexSearcher searcher = new IndexSearcher(dir);
		Query query = QueryTest.getNumericRangeQuery();
		TopScoreDocCollector collector = TopScoreDocCollector.create(100, false);
		searcher.search(query, collector);
		ScoreDoc[] hits = collector.topDocs().scoreDocs;
		for(int i=0; i<hits.length;i++){
			Document doc = searcher.doc(hits[i].doc);
			System.out.print(doc.get("bookname") + "\t\t");
			System.out.println(hits[i].score); // assumed: the recorded results show a score after each title
		}
	}
}

9. The Ranking Algorithm in Lucene and Its Improvements
