My First Lucene Program

ryxxlong


Parts of this article are drawn from the original article at: http://www.ibm.com/developerworks/cn/java/j-lo-lucene1/

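1. Overview of the Lucene Packages

The main Java packages in the Lucene distribution are introduced below to give the reader a first impression of the library:

Package: org.apache.lucene.document

This package provides the classes needed to wrap the documents that are to be indexed, such as Document and Field. Every document ends up wrapped in a Document object.

Package: org.apache.lucene.analysis

The main job of this package is to tokenize documents. Because a document must be tokenized before it can be indexed, this package can be seen as doing the preparatory work for indexing.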

Package: org.apache.lucene.index

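This package provides classes for creating an index and for updating an index that already exists. Its two fundamental classes are IndexWriter and IndexReader. IndexWriter creates an index and adds documents to it. IndexReader reads an existing index and, in the older Lucene API that the code in this article uses, can also delete documents from it.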

Package: org.apache.lucene.search

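This package provides the classes needed to search an index that has already been built. One example is IndexSearcher, which defines the methods for searching a given index.

2. Building an Index

To index documents, Lucene provides five basic classes: Document, Field, IndexWriter, Analyzer, and Directory. The purpose of each is described below: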

Document

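A Document describes a document, which here can mean an HTML page, an e-mail, or a text file. A Document object is made up of multiple Field objects. You can think of a Document as a record in a database and of each Field as one column of that record.

Field

A Field object describes one attribute of a document. For example, the subject and the body of an e-mail can be described by two separate Field objects.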

Analyzer

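Before a document is indexed, its content must first be tokenized; this is the Analyzer's job. Analyzer is an abstract class with many implementations, and you should pick the one that suits your language and application. The Analyzer hands the tokenized content to the IndexWriter, which builds the index. Note: if the content to be indexed is not plain text, it must be converted to plain text before it can be analyzed. Also, for any given index, the analyzer used at indexing time and the analyzer used at query time should be the same one; otherwise query terms may not match the indexed terms and the results will be wrong.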
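The point about using the same analyzer on both sides can be illustrated without Lucene at all. The following is a minimal plain-Java sketch, not Lucene code: the class AnalyzerMismatchSketch and its two methods are hypothetical stand-ins for two different analyzers, one that only splits on whitespace and one that also lowercases.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class AnalyzerMismatchSketch {

    // Hypothetical "analyzer" 1: split on whitespace only, keep case as-is.
    public static List<String> whitespaceTokens(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.trim().split("\\s+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Hypothetical "analyzer" 2: split on whitespace, then lowercase each token.
    public static List<String> lowercaseTokens(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : whitespaceTokens(text)) {
            tokens.add(t.toLowerCase(Locale.ROOT));
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Index-time analysis with the lowercasing analyzer.
        List<String> indexed = lowercaseTokens("Amsterdam has lots of bridges");

        // Same analyzer at query time: the query term matches an indexed term.
        System.out.println(indexed.contains(lowercaseTokens("Amsterdam").get(0)));  // true

        // Different analyzer at query time: "Amsterdam" != "amsterdam", so no match.
        System.out.println(indexed.contains(whitespaceTokens("Amsterdam").get(0))); // false
    }
}
```

The document is found only when the query text goes through the same normalization as the indexed text, which is why the two analyzers should agree.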

IndexWriter

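IndexWriter is the core class Lucene uses to create an index. Its job is to add Document objects to the index one by one.

Directory

This class represents the location where a Lucene index is stored. It is an abstract class with two commonly used implementations: FSDirectory, which represents an index stored in the file system, and RAMDirectory, which represents an index held in memory.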
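To make the division of labour among these five classes more concrete, here is a toy inverted index in plain Java. It is only a sketch under simplifying assumptions and does not reflect how Lucene stores an index: the class ToyIndexSketch and its methods are hypothetical, a Map of field names to text plays the role of Document and Field, whitespace splitting plays the role of the Analyzer, addDocument plays the role of IndexWriter, and an in-memory map plays the role of the Directory.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

public class ToyIndexSketch {

    // Stands in for the Directory: the "index" simply lives in memory.
    // Key: "field:token", value: ids of the documents containing that token in that field.
    private final Map<String, Set<Integer>> postings = new HashMap<>();
    private int docCount = 0;

    // Stands in for IndexWriter.addDocument. Returns the id assigned to the document.
    public int addDocument(Map<String, String> fields) {
        int docId = docCount++;
        for (Map.Entry<String, String> field : fields.entrySet()) {
            // Whitespace splitting plays the Analyzer role.
            for (String token : field.getValue().trim().split("\\s+")) {
                if (token.isEmpty()) {
                    continue;
                }
                String term = field.getKey() + ":" + token;
                postings.computeIfAbsent(term, k -> new TreeSet<>()).add(docId);
            }
        }
        return docId;
    }

    // Returns the ids of the documents whose given field contains the exact token.
    public Set<Integer> lookup(String field, String token) {
        return postings.getOrDefault(field + ":" + token, Collections.emptySet());
    }

    public int numDocs() {
        return docCount;
    }

    public static void main(String[] args) {
        ToyIndexSketch index = new ToyIndexSketch();
        String[] contents = { "Amsterdam has lots of bridges", "Venice has lots of canals" };
        for (String text : contents) {
            Map<String, String> doc = new HashMap<>();
            doc.put("contents", text);
            index.addDocument(doc);
        }
        System.out.println(index.numDocs());                       // 2
        System.out.println(index.lookup("contents", "lots"));      // [0, 1]
        System.out.println(index.lookup("contents", "canals"));    // [1]
        System.out.println(index.lookup("contents", "amsterdam")); // [] (tokens are case-sensitive here)
    }
}
```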

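Figure 1 shows the roles these five classes play when an index is built:

[Figure 1: the roles of the five classes during indexing]

To understand Lucene's indexing process in more detail, study the following figure:

[Figure: the Lucene indexing process]

A simple example follows. Running it requires the Lucene core JAR and the JUnit 4 JAR: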

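import static org.junit.Assert.assertEquals;

import java.io.IOException;

import org.apache.lucene.analysis.WhitespaceAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;
import org.junit.Before;
import org.junit.Test;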

public class IndexingTest {

	protected String[]	ids	      = { "1", "2" };
	protected String[]	unindexed	= { "Netherlands", "Italy" };
	protected String[]	unstored	= { "Amsterdam has lots of bridges",
	                "Venice has lots of canals" };
	protected String[]	text	  = { "Amsterdam", "Venice" };
	// The Directory that holds the index
	private Directory	directory;

	@Before
	public void setUp() throws Exception {
		directory = new RAMDirectory();
		IndexWriter writer = getWriter();

		for (int i = 0; i < ids.length; i++) {
			Document doc = new Document();
			// Add Field objects to the Document.
			// The Field constructor takes four arguments:
			// 1. name  - the name of the field
			// 2. value - the string to process
			// 3. store - whether the value should be stored in the index
			// 4. index - whether the field should be indexed and, if so,
			//            whether it should be tokenized before indexing

			// Stored, and indexed as a single term without going through the analyzer
			doc.add(new Field("id", ids[i], Field.Store.YES,
			                Field.Index.NOT_ANALYZED));
			// Stored only, not indexed
			doc.add(new Field("country", unindexed[i], Field.Store.YES,
			                Field.Index.NO));
			// Not stored, but tokenized by the analyzer and indexed
			doc.add(new Field("contents", unstored[i], Field.Store.NO,
			                Field.Index.ANALYZED));
			// Stored, and also tokenized by the analyzer and indexed
			doc.add(new Field("city", text[i], Field.Store.YES,
			                Field.Index.ANALYZED));
			// Add the document to the index with IndexWriter.addDocument; this completes indexing of the document
			writer.addDocument(doc);
		}
		writer.close();
	}

	private IndexWriter getWriter() throws IOException {
		return new IndexWriter(directory, new WhitespaceAnalyzer(), 
		                IndexWriter.MaxFieldLength.UNLIMITED);
	}

	@Test
	public void testIndexWriter() throws IOException {
		IndexWriter writer = getWriter();
		assertEquals(ids.length, writer.numDocs());
		writer.close();
	}

	@Test
	public void testIndexReader() throws IOException {
		IndexReader reader = IndexReader.open(directory);
		assertEquals(ids.length, reader.maxDoc());
		assertEquals(ids.length, reader.numDocs());
		reader.close();
	}
}

3. Searching Documents

Searching with Lucene is just as convenient as building an index. Lucene provides several basic classes for the job: IndexSearcher, Term, Query, TermQuery, and TopDocs. Each is described below.

Query

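Query is an abstract class with many implementations, such as TermQuery, BooleanQuery, and PrefixQuery. A Query object represents a search request in a form that Lucene can execute; the query string typed by a user is turned into a Query before the search runs.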

Term

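A Term is the basic unit of search. A Term object consists of two String parts, a field name and the text to look for, which makes it very similar to a Field. A Term can be created with a single statement: Term term = new Term("fieldName", "queryWord"); The first argument names the Field of the document to search in, and the second is the keyword to look for. A Term is usually used together with a TermQuery, as shown here: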

Query q = new TermQuery(new Term("contents", "lucene"));
TopDocs hits = searcher.search(q, 10);

This code asks Lucene for the documents whose contents field contains the term lucene. It returns at most the ten best matches, ordered by descending relevance score.
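The idea of "at most n matches, best score first" can be sketched in plain Java. This is only an illustration under a strong simplifying assumption: the score here is the raw number of times the term occurs in a document, whereas Lucene's actual relevance scoring is more elaborate. The class TopNSketch and its methods are hypothetical names, not part of Lucene.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class TopNSketch {

    // Counts how often term occurs among the whitespace-separated tokens of text.
    public static int termFrequency(String text, String term) {
        int count = 0;
        for (String token : text.trim().split("\\s+")) {
            if (token.equals(term)) {
                count++;
            }
        }
        return count;
    }

    // Returns the ids (list positions) of at most n matching documents, best score first.
    // Ties are broken by the lower document id.
    public static List<Integer> search(List<String> docs, String term, int n) {
        List<int[]> hits = new ArrayList<>(); // each entry: { docId, score }
        for (int docId = 0; docId < docs.size(); docId++) {
            int score = termFrequency(docs.get(docId), term);
            if (score > 0) {
                hits.add(new int[] { docId, score });
            }
        }
        hits.sort((a, b) -> b[1] != a[1]
                ? Integer.compare(b[1], a[1])
                : Integer.compare(a[0], b[0]));

        List<Integer> top = new ArrayList<>();
        for (int i = 0; i < Math.min(n, hits.size()); i++) {
            top.add(hits.get(i)[0]);
        }
        return top;
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList(
                "lucene in action",
                "search with lucene lucene",
                "no match here");
        System.out.println(search(docs, "lucene", 10)); // [1, 0]
        System.out.println(search(docs, "lucene", 1));  // [1]
    }
}
```

Document 1 ranks first because the term occurs in it twice; document 2 is not returned at all because it does not contain the term.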

TermQuery

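TermQuery is a subclass of the abstract class Query and is the most basic query type Lucene supports. A TermQuery is created with the following statement: TermQuery termQuery = new TermQuery(new Term("fieldName", "queryWord")); Its constructor takes a single argument, a Term object.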

IndexSearcher

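IndexSearcher searches an index that has already been built. It opens the index read-only, so several IndexSearcher instances can work on the same index at the same time. A typical usage looks like this: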

Directory dir = FSDirectory.open(new File("/tmp/index"));
IndexSearcher searcher = new IndexSearcher(dir);
Query q = new TermQuery(new Term("contents", "lucene"));
TopDocs hits = searcher.search(q, 10);
searcher.close();

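In the program above, the IndexSearcher constructor takes an object of type Directory. Directory is an abstract class with two commonly used subclasses, FSDirectory and RAMDirectory. Our program passes in an FSDirectory, which represents the location of an index stored on disk. Once the constructor returns, the IndexSearcher has opened the index read-only. The program then constructs a Term that specifies a search for the keyword "lucene" in the contents field of the documents. It wraps this Term in a TermQuery and passes the TermQuery to the search method of IndexSearcher; the result is returned as a TopDocs object.

TopDocs

TopDocs holds the results of a search: the total number of hits (totalHits) and the top-ranked hits themselves (scoreDocs).

A combined indexing and search example follows. It handles only simple .txt files. The HelloWorld program is: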

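import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriter.MaxFieldLength;
import org.apache.lucene.queryParser.MultiFieldQueryParser;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Filter;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.junit.Test;

public class HelloWorld {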

	String filePath = ".\\luceneDatasource\\IndexWriter addDocument's a javadoc .txt";

	String indexPath = ".\\luceneIndex";

	// Use Lucene's standard analyzer
	Analyzer analyzer = new StandardAnalyzer();

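	/**
	 * Creates the index.
	 * 
	 * IndexWriter is what modifies the index (adding, deleting and updating documents).
	 */
	@Test
	public void createIndex() throws Exception {
		// file --> doc
		Document doc = File2DocumentUtils.file2Document(filePath);

		// Build the index.
		// The IndexWriter constructor used here takes four arguments. The first is where the
		// index is stored: a path String here; other overloads accept a File or a Directory
		// (such as an FSDirectory or a RAMDirectory).
		// The second is an Analyzer implementation, i.e. the analyzer used to tokenize document content.
		// The third is a boolean: true creates a new index (overwriting any existing one),
		// false works on the existing index.
		// The fourth is an IndexWriter.MaxFieldLength, the maximum number of terms (tokens)
		// indexed per field. Two values are predefined: UNLIMITED (2147483647, i.e. no limit)
		// and LIMITED (10000). You can also create your own, e.g.
		// new IndexWriter.MaxFieldLength(2000) for a maximum of 2000 terms.
		IndexWriter indexWriter = new IndexWriter(indexPath, analyzer, true,
				MaxFieldLength.LIMITED);
		// Finally, add the document to the index with IndexWriter.addDocument.
		indexWriter.addDocument(doc);
		indexWriter.close();
	}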
	}

	/**
	 * Searches the index.
	 * 
	 * IndexSearcher is what runs queries against the index.
	 */
	@Test
	public void search() throws Exception {
		String queryString = "document";
		//String queryString = "adddocument";

		// 1. Parse the search text into a Query.
		// Search for queryString in the fields named name and content.
		String[] fields = { "name", "content" };
		QueryParser queryParser = new MultiFieldQueryParser(fields, analyzer);
		Query query = queryParser.parse(queryString);

		// 2. Run the query.
		IndexSearcher indexSearcher = new IndexSearcher(indexPath);
		// No Filter is used for now.
		Filter filter = null;
		TopDocs topDocs = indexSearcher.search(query, filter, 10000);
		System.out.println("Found [" + topDocs.totalHits + "] matching result(s)");

		// 3. Print the results.
		for (ScoreDoc scoreDoc : topDocs.scoreDocs) {
			int docSn = scoreDoc.doc; // internal document number
			Document doc = indexSearcher.doc(docSn); // fetch the document by its number
			File2DocumentUtils.printDocumentInfo(doc); // print the document's information
		}
		indexSearcher.close(); // release the index once the results have been read
	}
}

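The File2DocumentUtils class is shown below:

import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.Field.Index;
import org.apache.lucene.document.Field.Store;
import org.apache.lucene.document.NumberTools;

public class File2DocumentUtils {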

	// File fields: name, content, size, path
	public static Document file2Document(String path) throws IOException {
		File file = new File(path);
		// Create one Document object per text file
		Document doc = new Document();
		doc.add(new Field("name", file.getName(), Store.YES, Index.ANALYZED));
		doc.add(new Field("content", readFileContent(file), Store.YES, Index.ANALYZED));
		doc.add(new Field("size", NumberTools.longToString(file.length()), Store.YES, Index.NOT_ANALYZED));
		doc.add(new Field("path", file.getCanonicalPath(), Store.YES, Index.NOT_ANALYZED));
		return doc;
	}

	/**
	 * Reads the content of a file, using the platform default charset.
	 */
	public static String readFileContent(File file) {
		// try-with-resources (Java 7+) closes the reader even if reading fails
		try (BufferedReader reader = new BufferedReader(new InputStreamReader(new FileInputStream(file)))) {
			StringBuilder content = new StringBuilder();
			for (String line = null; (line = reader.readLine()) != null;) {
				content.append(line).append("\n");
			}
			return content.toString();
		} catch (Exception e) {
			throw new RuntimeException(e);
		}
	}

	/**
	 * Prints the information stored in a Document.
	 * <pre>
	 * Two ways to get the value of the name field:
	 * 1. Field f = doc.getField("name");
	 *    f.stringValue();
	 * 2. doc.get("name");
	 * </pre>
	 * @param doc the document to print
	 */
	public static void printDocumentInfo(Document doc) {
		// Field f = doc.getField("name");
		// f.stringValue();
		System.out.println("------------------------------");
		System.out.println("name     = " + doc.get("name"));
		System.out.println("content  = " + doc.get("content"));
		System.out.println("size     = " + NumberTools.stringToLong(doc.get("size")));
		System.out.println("path     = " + doc.get("path"));
	}

}

The project's luceneDataSource folder contains a text file named IndexWriter addDocument's a javadoc .txt with the following content:

Adds room a document to this room index. If the room document contains room  more than setMaxFieldLength(int) terms for a given field, the remainder are discarded.
room

1. When the program searches for document, it produces the output below, because the word document occurs in content (the text of the file).

Found [1] matching result(s)
------------------------------
name     = IndexWriter addDocument's a javadoc .txt
content  = Adds room a document to this room index. If the room document contains room  more than setMaxFieldLength(int) terms for a given field, the remainder are discarded.
room

size     = 169
path     = E:\datas\MyEclipseWorkspace\LuceneDemo\luceneDataSource\IndexWriter addDocument's a javadoc .txt

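2. When the program searches for adddocument, it produces the output below, because the name field (the file name) contains addDocument's. When StandardAnalyzer tokenizes the file name, it first produces the token addDocument's, then strips the trailing possessive 's, and then lowercases what is left, which yields adddocument. The query therefore also matches one record:

Found [1] matching result(s)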
------------------------------
name     = IndexWriter addDocument's a javadoc .txt
content  = Adds room a document to this room index. If the room document contains room  more than setMaxFieldLength(int) terms for a given field, the remainder are discarded.
room

size     = 169
path     = E:\datas\MyEclipseWorkspace\LuceneDemo\luceneDataSource\IndexWriter addDocument's a javadoc .txt

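Having seen this, you may wonder: since addDocument's was indexed as adddocument, will a search for addDocument or addDocument's also return the record? The answer is yes. Because indexing and searching use the same analyzer, the search text addDocument's is likewise converted to adddocument before the index is searched.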
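The effect can be mimicked in a few lines of plain Java. This is a rough sketch, not Lucene code: the hypothetical method normalize approximates what the analyzer does to a single token in this example by stripping a trailing 's and lowercasing the rest, and it ignores everything else a real analyzer does, such as splitting text into tokens and removing stop words.

```java
import java.util.Locale;

public class NormalizeSketch {

    // Approximates the per-token normalization described above:
    // strip a trailing possessive 's, then lowercase what is left.
    public static String normalize(String token) {
        String t = token;
        if (t.endsWith("'s") || t.endsWith("'S")) {
            t = t.substring(0, t.length() - 2);
        }
        return t.toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        // Index time: the token taken from the file name.
        String indexedTerm = normalize("addDocument's");
        System.out.println(indexedTerm); // adddocument

        // Query time: all three spellings normalize to the same term, so all three match.
        System.out.println(normalize("adddocument").equals(indexedTerm));   // true
        System.out.println(normalize("addDocument").equals(indexedTerm));   // true
        System.out.println(normalize("addDocument's").equals(indexedTerm)); // true
    }
}
```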

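The attachment contains the complete project, which is based on MyEclipse; import it into your MyEclipse workspace and it is ready to use. It is a plain Java project, and for convenience the Lucene core JAR is already included, so you only need to add the JUnit 4 library.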