Indexing: Basic Index Operations

xusulong

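These notes are mainly drawn from reading Lucene in Action.

The basic index operations are:

Adding documents to an index
Deleting documents from an index
Undeleting documents that were marked as deleted
Updating documents in an index

Each of the four is described below, with test classes as examples.

1. Adding documents to an index

This abstract class is the base for the rest of the post; the test classes in the later sections extend it. JUnit calls setUp() before each test method, and the initialization (opening the index directory and adding the two sample documents) is done there.

Note the different Field constructor arguments. I added comments based on my own understanding.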

import java.io.IOException;

import junit.framework.TestCase;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.SimpleAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;


/**
 * @author xusulong
 *
 */
public abstract class BaseIndexingTestCase extends TestCase {
	protected String[] keywords = {"1", "2"};
	protected String[] unindexed = {"Netherlands", "Italy"};
	protected String[] unstored = {"Amsterdam has lots of bridges", "Venice has lots of canals"};
	protected String[] text = {"Amsterdam", "Venice"};
	protected Directory dir;
	
	protected void setUp() throws IOException {
		String indexDir = 
			System.getProperty("java.io.tmpdir", "tmp") + 
			System.getProperty("file.separator") + "index-dir";
		dir = FSDirectory.getDirectory(indexDir);
		addDocument(dir);
	}
	
	protected void addDocument(Directory dir) throws IOException{
		IndexWriter writer = new IndexWriter(dir, getAnalyzer(), true, IndexWriter.MaxFieldLength.LIMITED);
		writer.setUseCompoundFile(isCompound());
		
		for(int i = 0; i < keywords.length; i ++) {
			Document doc = new Document();
			// Indexed as a single term (not analyzed), stored
			doc.add(new Field("id", keywords[i], Field.Store.YES, Field.Index.NOT_ANALYZED));
			// Not indexed, stored
			doc.add(new Field("country", unindexed[i], Field.Store.YES, Field.Index.NO));
			// Analyzed and indexed, not stored
			doc.add(new Field("contents", unstored[i], Field.Store.NO, Field.Index.ANALYZED));
			// Analyzed and indexed, stored
			doc.add(new Field("city", text[i], Field.Store.YES, Field.Index.ANALYZED));
			writer.addDocument(doc);
		}
		
		writer.optimize();
		writer.close();
	}
	/**
	 * Default analyzer; subclasses may override it.
	 * @return the analyzer used when building the index
	 */
	protected Analyzer getAnalyzer() {
		return new SimpleAnalyzer();
	}
	
	protected boolean isCompound() {
		return true;
	}
}
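
The comments in addDocument can be read as a small truth table. The sketch below is plain Java with a hypothetical FieldOptions enum (it is not a Lucene type); it only records, for the four sample fields, whether the value is stored, indexed, and analyzed, assuming the usual meaning of Field.Store and Field.Index.

```java
/** Hypothetical summary of the four fields added in addDocument; not a Lucene type. */
enum FieldOptions {
    ID(true, true, false),        // Field.Store.YES, Field.Index.NOT_ANALYZED
    COUNTRY(true, false, false),  // Field.Store.YES, Field.Index.NO
    CONTENTS(false, true, true),  // Field.Store.NO,  Field.Index.ANALYZED
    CITY(true, true, true);       // Field.Store.YES, Field.Index.ANALYZED

    final boolean stored;    // the original value can be read back from a retrieved document
    final boolean indexed;   // the field can be searched
    final boolean analyzed;  // the value is split into terms by the analyzer before indexing

    FieldOptions(boolean stored, boolean indexed, boolean analyzed) {
        this.stored = stored;
        this.indexed = indexed;
        this.analyzed = analyzed;
    }
}
```

Read this way, "country" can be shown in results but not searched, while "contents" can be searched but its original text cannot be read back from the index.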

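2. Deleting documents from an index

Documents are deleted through the IndexReader class (the name is somewhat unfortunate, since it does not fully convey what the class does). IndexReader does not remove documents from the index immediately; it only marks them as deleted. The marks are committed when close() is called on the IndexReader, and the documents are physically removed only when the index segments are merged, for example by IndexWriter.optimize(), which is what the two tests below check.

This class extends BaseIndexingTestCase from section 1.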

import java.io.IOException;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;


/**
 * @author xusulong
 *
 */
public class DocumentDeleteTest extends BaseIndexingTestCase {

	
	public void testDeleteBeforeIndexMerge() throws IOException {
		IndexReader reader = IndexReader.open(dir);
		assertEquals(2, reader.maxDoc());
		assertEquals(2, reader.numDocs());
		reader.deleteDocument(1);
		
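		// isDeleted(int) checks whether the document with the given number is marked as deleted
		assertTrue(reader.isDeleted(1));
		// hasDeletions() checks whether the index contains any documents marked as deleted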
		assertTrue(reader.hasDeletions());
		assertEquals(2, reader.maxDoc());
		assertEquals(1, reader.numDocs());
		
		reader.close();
		
		reader = IndexReader.open(dir);
		assertEquals(2, reader.maxDoc());
		assertEquals(1, reader.numDeletedDocs());
		
		reader.close();
	}
	
	public void testDeleteAfterIndexMerge() throws IOException {
		IndexReader reader = IndexReader.open(dir);
		assertEquals(2, reader.maxDoc());
		assertEquals(2, reader.numDocs());
		reader.deleteDocument(1);
		reader.close();
		
		IndexWriter writer = new IndexWriter(dir, getAnalyzer(), false, IndexWriter.MaxFieldLength.LIMITED);
		writer.optimize();
		writer.close();
		
		reader = IndexReader.open(dir);
		// optimize() merged the segments, so the deletion marks are gone and both checks return false
		assertFalse(reader.isDeleted(1));
		assertFalse(reader.hasDeletions());
		assertEquals(1, reader.maxDoc());
		assertEquals(1, reader.numDocs());
		
		reader.close();
		
	}
}

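Two IndexReader methods that are easy to confuse are maxDoc() and numDocs(): the former returns the next available internal document number, the latter the number of documents in the index. In the tests above the index initially holds only two documents, so numDocs() returns 2; document numbers start at 0, so maxDoc() also returns 2. After one document is marked as deleted, numDocs() drops to 1 while maxDoc() stays at 2 until the segments are merged.

Note: every Lucene document has a unique internal number. These numbers are not permanent, because Lucene reassigns them when index segments are merged. You therefore cannot assume that a given document will always keep the same internal number.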
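
To make the counting and renumbering concrete, here is a minimal sketch in plain Java with no Lucene dependency. ToyIndex and its methods are hypothetical names chosen to mirror the IndexReader and IndexWriter calls used in the tests; the class only imitates the counters described above and says nothing about how Lucene actually stores segments.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

/** Toy stand-in for an index; NOT Lucene code, it only mimics the counters discussed above. */
class ToyIndex {
    private List<String> docs = new ArrayList<>();  // position in the list = internal doc number
    private BitSet deleted = new BitSet();          // deletion marks, one bit per doc number

    void addDocument(String doc) { docs.add(doc); }

    /** Marks a document as deleted without removing it. */
    void deleteDocument(int docNum) { deleted.set(docNum); }

    boolean isDeleted(int docNum) { return deleted.get(docNum); }

    boolean hasDeletions() { return !deleted.isEmpty(); }

    /** Next document number that would be handed out. */
    int maxDoc() { return docs.size(); }

    /** Number of documents that are not marked as deleted. */
    int numDocs() { return docs.size() - deleted.cardinality(); }

    /** Models a segment merge: drops marked documents and renumbers the survivors from 0. */
    void optimize() {
        List<String> kept = new ArrayList<>();
        for (int i = 0; i < docs.size(); i++) {
            if (!deleted.get(i)) kept.add(docs.get(i));
        }
        docs = kept;
        deleted = new BitSet();
    }

    String document(int docNum) { return docs.get(docNum); }
}
```

In this model, adding "Amsterdam" and "Venice", marking document 0 as deleted, and calling optimize() leaves maxDoc() at 1, and the document that used to be number 1 is now number 0. That is the reason internal numbers should not be kept as long-lived identifiers.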

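3. Undeleting documents

Because marking a document as deleted does not remove it right away, Lucene lets you restore documents that have been marked as deleted but not yet physically removed. Calling undeleteAll() on the IndexReader restores them by removing the .del files from the index directory; closing the IndexReader afterward therefore leaves all documents in the index. According to the book, documents marked as deleted through one IndexReader instance can be restored only by calling undeleteAll() on that same instance (that is, an IndexReader instance can only undo the deletions it marked itself, not those of other instances).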
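
The post gives no test for this section, so here is a minimal sketch of the idea in plain Java. UndeleteSketch is a hypothetical toy class, not Lucene's implementation: its BitSet plays the role of the .del file, and merge() stands in for a segment merge, after which there is nothing left to restore.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

/** Toy model of deletion marks; hypothetical, not Lucene's implementation. */
class UndeleteSketch {
    private final List<String> docs = new ArrayList<>();
    private final BitSet marks = new BitSet();  // plays the role of the .del file

    void add(String doc) { docs.add(doc); }

    /** Marks a document as deleted; the document itself stays in the list. */
    void delete(int docNum) { marks.set(docNum); }

    /** Clears every deletion mark, which brings the marked documents back. */
    void undeleteAll() { marks.clear(); }

    /** Models a merge: marked documents are physically dropped and cannot be restored. */
    void merge() {
        for (int i = docs.size() - 1; i >= 0; i--) {
            if (marks.get(i)) docs.remove(i);
        }
        marks.clear();
    }

    /** Number of documents not marked as deleted. */
    int numDocs() { return docs.size() - marks.cardinality(); }
}
```

With two documents, delete(1) followed by undeleteAll() brings numDocs() back to 2, whereas delete(1), merge(), undeleteAll() leaves it at 1.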

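4. Updating documents in an index

Lucene does not provide an update(Document) method that modifies a document in place. To update a document, you first delete it from the index and then add the modified version back. (IndexWriter.updateDocument(Term, Document), available in the Lucene versions that have IndexWriter.MaxFieldLength, is itself a delete followed by an add.)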

import java.io.IOException;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.WhitespaceAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;


/**
 * @author xusulong
 *
 */
public class DocumentUpdateTest extends BaseIndexingTestCase {
	public void testUpdate() throws IOException {
		assertEquals(1, getHitCount("city", "Amsterdam"));
		
		IndexReader reader = IndexReader.open(dir);
		// Delete the documents whose city field contains the term "Amsterdam"
		reader.deleteDocuments(new Term("city", "Amsterdam"));
		reader.close();
		// Verify that the document has been deleted
		assertEquals(0, getHitCount("city", "Amsterdam"));
		
		IndexWriter writer = new IndexWriter(dir, getAnalyzer(), false, IndexWriter.MaxFieldLength.LIMITED);
		
		Document doc = new Document();
		// Re-add the document with the city field set to "Haag"
		doc.add(new Field("id", "1", Field.Store.YES, Field.Index.NOT_ANALYZED));
		doc.add(new Field("country", "Netherlands", Field.Store.YES, Field.Index.NO));
		doc.add(new Field("contents", "Amsterdam has lots of bridges", Field.Store.NO, Field.Index.ANALYZED));
		doc.add(new Field("city", "Haag", Field.Store.YES, Field.Index.ANALYZED));
		writer.addDocument(doc);
		writer.optimize();
		writer.close();
		// Confirm that the document has been updated
		assertEquals(1, getHitCount("city", "Haag"));
	}
	
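	// WhitespaceAnalyzer keeps the original case, so the TermQuery for "Amsterdam" in getHitCount matches.
	// The default SimpleAnalyzer lowercases tokens, and a TermQuery is not analyzed.
	protected Analyzer getAnalyzer() {
		return new WhitespaceAnalyzer();
	}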
	
	private int getHitCount(String fieldName, String searchString) throws IOException{
		IndexSearcher searcher = new IndexSearcher(dir);
		Term t = new Term(fieldName, searchString);
		Query query = new TermQuery(t);
		
		TopDocs topDocs = searcher.search(query, 100);
		ScoreDoc[] hits = topDocs.scoreDocs;
		
		int hitCount = hits.length;
		searcher.close();
		return hitCount;
	}
}

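Updating by batching deletions

Our example deletes and re-adds only a single document. If you need to delete and add many documents, it is best to batch the operations, as follows:

1. Open an IndexReader.

2. Delete all the documents that need to be deleted.

3. Close the IndexReader.

4. Open an IndexWriter.

5. Add all the documents that need to be added.

6. Close the IndexWriter.

Remember: batching the deletions and then the additions is always faster than alternating between deleting and adding.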
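
One way to see where the saving comes from is to count how often a reader or writer has to be opened and closed. The sketch below is plain Java with hypothetical names (BatchUpdateSketch, updateInterleaved, updateBatched) and no Lucene involved; a Map stands in for the index, and it assumes that every switch between deleting and adding costs one open/close cycle, as in the six steps above.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Toy comparison of interleaved vs. batched updates; hypothetical names, no Lucene involved. */
class BatchUpdateSketch {
    final Map<String, String> index = new LinkedHashMap<>();  // id -> city, stands in for the index
    int sessionsOpened = 0;                                   // reader/writer open-close cycles

    /** Interleaved: every update opens a "reader" to delete and then a "writer" to add. */
    void updateInterleaved(Map<String, String> updates) {
        for (Map.Entry<String, String> e : updates.entrySet()) {
            sessionsOpened++;                     // open reader, delete one document, close
            index.remove(e.getKey());
            sessionsOpened++;                     // open writer, add one document, close
            index.put(e.getKey(), e.getValue());
        }
    }

    /** Batched: one "reader" session for all deletions, then one "writer" session for all additions. */
    void updateBatched(Map<String, String> updates) {
        sessionsOpened++;                         // steps 1-3: open reader, delete everything, close
        for (String id : updates.keySet()) {
            index.remove(id);
        }
        sessionsOpened++;                         // steps 4-6: open writer, add everything, close
        index.putAll(updates);
    }
}
```

For three updated documents the interleaved version goes through six open/close cycles and the batched version through two, while both end with the same index contents.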

Posted: 2009-08-01 20:30