Lucene Primer (Draft)

cleaneyes

Blog category: Lucene

Tags: lucene F#

There are three approaches to indexing: inverted indexes, suffix arrays, and signature files.
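
Of the three approaches, the inverted index is the one Lucene is built on. As a rough illustration only, here is a minimal in-memory sketch. The class ToyInvertedIndex and its methods are hypothetical names for this example, whitespace splitting stands in for real text analysis, and nothing here reflects Lucene's file format.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Toy inverted index: maps each term to the list of document ids containing it.
// Illustrative model only; assumes documents are added in increasing id order.
class ToyInvertedIndex {
    private final Map<String, List<Integer>> postings = new TreeMap<>();

    // Split on whitespace, lower-case the text, and record the document id once per term.
    void add(int docId, String text) {
        for (String token : text.toLowerCase().split("\\s+")) {
            if (token.isEmpty()) continue;
            List<Integer> docs = postings.computeIfAbsent(token, k -> new ArrayList<>());
            if (docs.isEmpty() || docs.get(docs.size() - 1) != docId) {
                docs.add(docId);
            }
        }
    }

    // Return the ids of the documents containing the term (empty list if none).
    List<Integer> lookup(String term) {
        return postings.getOrDefault(term.toLowerCase(), new ArrayList<>());
    }
}
```

A lookup then goes from a term straight to the matching documents, instead of scanning every document for the term.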

A snippet that reads a text file and writes it to another file:

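import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.FileReader;
import java.io.FileWriter;

BufferedWriter writer = new BufferedWriter(new FileWriter(destFile));
BufferedReader reader = new BufferedReader(new FileReader(readFile));
String line = reader.readLine();
while (line != null) {
    writer.write(line);
    writer.newLine(); // write the line separator
    line = reader.readLine(); // advance to the next line, otherwise the loop never ends
}
reader.close();
writer.close();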

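The three constants of the Store class (Field.Store):
Store.NO: the field value is not stored
Store.YES: the field value is stored
Store.COMPRESS: the field value is stored in compressed form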

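The four constants of the Index class (Field.Index):
Index.NO: the field is not indexed
Index.TOKENIZED: the field is tokenized by the Analyzer, then indexed
Index.UN_TOKENIZED: the field is indexed as a single term, without tokenizing
Index.NO_NORMS: the field is indexed without an Analyzer and its norms are not stored, so index-time boosting and field length normalization do not apply when scoring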
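
To make the difference between the Index options concrete, here is a small standalone model. The enum and the indexedTerms helper are hypothetical stand-ins and are not Lucene's Field API; whitespace splitting plus lower-casing is assumed in place of a real Analyzer, and norms are not modeled at all.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical model of the Index options; these are NOT Lucene's classes.
class IndexOptionDemo {
    enum Index { NO, TOKENIZED, UN_TOKENIZED, NO_NORMS }

    // Returns the terms that would become searchable for a field value.
    static List<String> indexedTerms(String value, Index mode) {
        switch (mode) {
            case NO:
                // Nothing is searchable for this field.
                return Collections.emptyList();
            case TOKENIZED:
                // Stand-in for an Analyzer: lower-case and split on whitespace.
                return new ArrayList<>(Arrays.asList(value.toLowerCase().trim().split("\\s+")));
            default:
                // UN_TOKENIZED and NO_NORMS: the whole value is a single term.
                return Collections.singletonList(value);
        }
    }
}
```

In this model a value such as an identifier or a date would suit UN_TOKENIZED, since it should match only as a whole, while free text would suit TOKENIZED.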

IndexWriter constructors:

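// Private constructor that the public constructors below delegate to (body not shown here).
private IndexWriter(Directory d, Analyzer a, final boolean create, boolean closeDir)

// Opens the Directory itself from a file path, so closeDir is true.
public IndexWriter(File path, Analyzer a, boolean create)
    throws IOException {
  this(FSDirectory.getDirectory(path, create), a, create, true);
}

// Uses a Directory supplied by the caller, so closeDir is false.
public IndexWriter(Directory d, Analyzer a, boolean create)
    throws IOException {
  this(d, a, create, false);
}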

Adding a Document to the IndexWriter:

public void addDocument(Document doc) throws IOException {
    addDocument(doc, analyzer);
  }

public void addDocument(Document doc, Analyzer analyzer) throws IOException {
    DocumentWriter dw =
      new DocumentWriter(ramDirectory, analyzer, this);
    dw.setInfoStream(infoStream);
    String segmentName = newSegmentName();
    dw.addDocument(segmentName, doc);
    synchronized (this) {
      segmentInfos.addElement(new SegmentInfo(segmentName, 1, ramDirectory));
      maybeMergeSegments();
    }
  }

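Note: after calling addDocument, always close the index writer with IndexWriter's close method. Otherwise the index is never finalized, and the index directory may still be locked the next time documents are added.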

DocumentWriter constructors:

  DocumentWriter(Directory directory, Analyzer analyzer,
                 Similarity similarity, int maxFieldLength) {
    this.directory = directory;
    this.analyzer = analyzer;
    this.similarity = similarity;
    this.maxFieldLength = maxFieldLength;
  }

  DocumentWriter(Directory directory, Analyzer analyzer, IndexWriter writer) {
    this.directory = directory;
    this.analyzer = analyzer;
    this.similarity = writer.getSimilarity();
    this.maxFieldLength = writer.getMaxFieldLength();
    this.termIndexInterval = writer.getTermIndexInterval();
  }

The addDocument method of DocumentWriter:

final void addDocument(String segment, Document doc)
          throws IOException {
    // write field names
    fieldInfos = new FieldInfos();
    fieldInfos.add(doc);
    fieldInfos.write(directory, segment + ".fnm");

    // write field values: FieldsWriter is responsible for writing the .fdx and .fdt files
    FieldsWriter fieldsWriter =
            new FieldsWriter(directory, segment, fieldInfos);
    try {
      fieldsWriter.addDocument(doc);
    } finally {
      fieldsWriter.close();
    }

    // invert doc into postingTable
    postingTable.clear();                         // clear postingTable: reset the Hashtable that holds all terms
    fieldLengths = new int[fieldInfos.size()];    // init fieldLengths
    fieldPositions = new int[fieldInfos.size()];  // init fieldPositions: final position of each Field once analysis is done
    fieldOffsets = new int[fieldInfos.size()];    // init fieldOffsets

    fieldBoosts = new float[fieldInfos.size()];	  // init fieldBoosts
    Arrays.fill(fieldBoosts, doc.getBoost());

    invertDocument(doc); // invert every Field in the Document

    // sort postingTable into an array: the terms are sorted here
    Posting[] postings = sortPostingTable();

    // write postings: write the term information to the index, mainly term frequencies
    // and positions into the .frq and .prx files
    writePostings(postings, segment);

    // write norms of indexed fields: write the scoring (norm) information to the index,
    // mainly into the .f files
    writeNorms(segment);

  }
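
The invert-then-sort steps in the method above can be sketched with a small standalone model. The class ToyPostingTable and its methods are hypothetical names and this is not Lucene's DocumentWriter: whitespace splitting is assumed in place of an Analyzer, only a single field is handled, and nothing is written to disk.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical model of inverting one field into a posting table.
class ToyPostingTable {
    // Build term -> list of positions for a single field value.
    // The term frequency is simply the size of the position list.
    static Map<String, List<Integer>> invert(String fieldValue) {
        Map<String, List<Integer>> table = new HashMap<>();
        int position = 0;
        for (String token : fieldValue.trim().split("\\s+")) {
            if (token.isEmpty()) continue;
            table.computeIfAbsent(token, k -> new ArrayList<>()).add(position++);
        }
        return table;
    }

    // Order the entries by term, as a stand-in for the sort that precedes writing postings.
    static Map<String, List<Integer>> sorted(Map<String, List<Integer>> table) {
        return new TreeMap<>(table);
    }
}
```

For the input "to be or not to be", the sorted table lists the terms be, not, or, to, with "to" at positions 0 and 4 and "be" at positions 1 and 5.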

Files in the index directory:

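A segment is a logical concept; each segment holds many Documents. All index files that belong to one segment share the same prefix and differ only in their extension. A segment's name is built by incrementing segmentInfos.counter, converting the value to base 36, and prepending an underscore. Note that segmentInfos.counter is a running counter used to name new segments, not the number of documents in the current segment; it only appears to track the document count because addDocument creates a new single-document segment for every Document added.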
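
The naming scheme described above can be sketched as follows. SegmentNamer is a hypothetical class for illustration, not Lucene's source, and it follows the description literally (increment first, then convert); whether the real code reads the counter before or after incrementing is an assumption here.

```java
// Hypothetical model of the segment naming scheme; not Lucene's source.
class SegmentNamer {
    private int counter = 0; // stands in for segmentInfos.counter

    // Increment the counter, render it in base 36, and prefix an underscore.
    synchronized String newSegmentName() {
        counter++;
        return "_" + Integer.toString(counter, Character.MAX_RADIX); // MAX_RADIX is 36
    }
}
```

Base 36 uses the digits 0-9 followed by a-z, which keeps the file name prefixes short even after many segments have been created.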

An index directory, by contrast, contains only one segments file and one deletable file.

2008-04-15 14:42