Lucene Hello World

okwangxing

Blog category: Search

Tags: lucene Apache SVN maven search engine

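Lucene is not a complete search engine: it has no crawler, no administration interface, and nothing else of that kind. Some projects built on it do implement full website search engines, and Nutch is one of them, a search engine application based on Lucene.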

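This post records bits of what I have learned along the way by building a simple program.
Hello world: implementing text search
No Chinese word segmentation is used here. For that, see the Paoding (庖丁解牛) project; its code has been uploaded to SVN and includes a version targeting Lucene 3.0. Try it yourself if you are interested.
SVN address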

svn checkout http://paoding.googlecode.com/svn/trunk/ paoding-read-only
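To make the segmentation remark above more concrete, here is a small sketch of what indexing Chinese without a word segmenter tends to look like: every ideograph becomes its own token. As far as I understand, this is roughly how the StandardAnalyzer used later in this post treats Chinese text, but the class below is a simplified, hypothetical illustration in plain Java, not Lucene's or Paoding's actual code. The input in main is "Lucene 中文", written with Unicode escapes.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not part of Lucene or Paoding.
public class UnigramTokenizerSketch {

    // Each Han ideograph becomes its own token; runs of other letters
    // or digits are grouped into one lowercase token.
    public static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<String>();
        StringBuilder word = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            boolean han = Character.UnicodeScript.of(cp) == Character.UnicodeScript.HAN;
            if (!han && Character.isLetterOrDigit(cp)) {
                word.appendCodePoint(Character.toLowerCase(cp));
                continue;
            }
            // Any other character ends the word collected so far.
            if (word.length() > 0) {
                tokens.add(word.toString());
                word.setLength(0);
            }
            if (han) {
                tokens.add(new String(Character.toChars(cp)));
            }
        }
        if (word.length() > 0) {
            tokens.add(word.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Yields three tokens: "lucene" and the two ideographs separately.
        System.out.println(tokenize("Lucene \u4e2d\u6587"));
    }
}
```

With single-character tokens a query for one character matches every document containing it, which is why a dictionary-based segmenter such as Paoding is usually preferred for real Chinese search.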

The project is built with Maven. Ever since I started using Maven I have been madly in love with it. Personally recommended!
Maven pom.xml

<dependency>
	<groupId>log4j</groupId>
	<artifactId>log4j</artifactId>
	<version>1.2.15</version>
</dependency>
<dependency>
	<groupId>commons-logging</groupId>
	<artifactId>commons-logging</artifactId>
	<version>1.1.1</version>
</dependency>
<dependency>
	<groupId>org.apache.lucene</groupId>
	<artifactId>lucene-core</artifactId>
	<version>3.0.0</version>
</dependency>

Next, building the index. The data is something I made up myself; a download is provided below.
Index.java

import java.io.File;
import java.io.FileReader;
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.Field.Store;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriter.MaxFieldLength;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

/**
 * @author ruodao
 * @since 1.0 2010-2-23 09:39:10 PM
 */
public class Index {
	public static void main(String[] args) throws Exception {
		String indexDir = "E:\\Temp\\index";
		String dataDir = "E:\\Temp\\data";

		long start = System.currentTimeMillis();
		Index indexer = new Index(indexDir);
		int numIndex = indexer.index(dataDir);

		indexer.close();

		long end = System.currentTimeMillis();

		System.out.println("Indexing " + numIndex + " files took "
				+ (end - start) + " milliseconds");
	}

	private IndexWriter writer;
	private Analyzer analyzer;
	
	private static final Log logger = LogFactory.getLog(Index.class);

	public Index(String indexDir) throws Exception {
		Directory dir = FSDirectory.open(new File(indexDir));
		analyzer = new StandardAnalyzer(Version.LUCENE_CURRENT);
		writer = new IndexWriter(dir, analyzer, MaxFieldLength.UNLIMITED);
	}

	public void close() throws Exception {
		writer.close();
	}

	public int index(String dataDir) throws Exception {
		File[] files = new File(dataDir).listFiles();
		for (File f : files) {
			if (!f.isDirectory() && !f.isHidden() && f.exists() && f.canRead()
					&& acceptFile(f)) {
				indexFile(f);
			}
		}
		return writer.numDocs();
	}

	protected boolean acceptFile(File f) {
		return f.getName().endsWith(".txt");
	}

	protected Document getDocument(File f) throws Exception {
		Document doc = new Document();
		doc.add(new Field("contents", new FileReader(f)));
		doc.add(new Field("filename", f.getCanonicalPath(), Store.YES,
				org.apache.lucene.document.Field.Index.NOT_ANALYZED));
		return doc;
	}

	private void indexFile(File f) throws Exception {
		System.out.println("Index " + f.getCanonicalPath());
		
		Document doc = getDocument(f);
		if (doc != null) {
			writer.addDocument(doc);
		}
		
		
		// Optional: log the tokens the analyzer produces for this file
		TokenStream ts = analyzer.tokenStream("contents", new FileReader(doc
				.get("filename")));
		TermAttribute ta = ts.addAttribute(TermAttribute.class);

		while (ts.incrementToken()) {
			logger.debug("{" + ta.term() + "}");
		}
		// Closing the stream also closes the underlying FileReader
		ts.close();
	}
}
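One thing to watch in getDocument above: FileReader decodes the file with the platform default charset. If the data files were saved in a different encoding, Chinese text can be garbled before the analyzer ever sees it, and searches for Chinese terms may then find nothing. The sketch below opens a file with an explicit charset instead. The class name, the roundTrip helper, and the choice of UTF-8 are assumptions made for illustration; use whatever encoding your data files really have.

```java
import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;

// Hypothetical helper; not part of the original Index.java.
public class CharsetReaderSketch {

    // Opens a file with an explicit charset instead of the platform default.
    public static Reader open(File f, String charsetName) throws IOException {
        return new BufferedReader(new InputStreamReader(
                new FileInputStream(f), Charset.forName(charsetName)));
    }

    // Reads the whole reader into a string and closes it.
    public static String readAll(Reader reader) throws IOException {
        StringBuilder sb = new StringBuilder();
        char[] buf = new char[4096];
        try {
            int n;
            while ((n = reader.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        } finally {
            reader.close();
        }
        return sb.toString();
    }

    // Demo: writes text to a temp file in the given charset and reads it back.
    public static String roundTrip(String text, String charsetName) {
        try {
            File f = File.createTempFile("lucene-demo", ".txt");
            f.deleteOnExit();
            Writer w = new OutputStreamWriter(new FileOutputStream(f), charsetName);
            try {
                w.write(text);
            } finally {
                w.close();
            }
            return readAll(open(f, charsetName));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String text = "\u4e2d\u6587 text";
        System.out.println(text.equals(roundTrip(text, "UTF-8")));
    }
}
```

In getDocument, the reader returned by open(f, "UTF-8") could be passed to new Field("contents", ...) in place of new FileReader(f), assuming the data files are UTF-8.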

With the data ready, it is time to let others use it: a simple search.
Searcher.java

import java.io.File;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

/**
 * @author ruodao
 * @since 1.0 2010-2-23 10:19:06 PM
 */
public class Searcher {
	public static void main(String[] args) throws Exception {
		String indexDir = "E:\\Temp\\index";
		String q = "中";

		search(indexDir, q);
	}

	private static void search(String indexDir, String q) throws Exception {
		Directory dir = FSDirectory.open(new File(indexDir), null);
		Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_CURRENT);
		IndexSearcher is = new IndexSearcher(dir);
		QueryParser parser = new QueryParser(Version.LUCENE_CURRENT,
				"contents", analyzer);
		Query query = parser.parse(q);
		long start = System.currentTimeMillis();

		TopDocs hits = is.search(query, 10);

		long end = System.currentTimeMillis();

		System.err.println("Found " + hits.totalHits + " document(s) (in "
				+ (end - start) + " milliseconds) that matched query '" + q
				+ "':");
		for (int j = 0; j < hits.scoreDocs.length; j++) {
			ScoreDoc scoreDoc = hits.scoreDocs[j];
			Document doc = is.doc(scoreDoc.doc);
			System.out.println(doc.get("filename"));
		}
		is.close();
	}
}
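To show the idea behind the two programs above, here is a toy inverted index in plain Java: indexing records which documents contain each term, and searching is a lookup by term. This is only a conceptual sketch under heavy simplifications (ASCII word splitting, single-term queries, no scoring, nothing stored on disk). The class and method names are made up for illustration and are unrelated to Lucene's real IndexWriter and IndexSearcher internals.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

// Hypothetical illustration only; not how Lucene is implemented.
public class TinyInvertedIndex {

    // term -> sorted names of the documents that contain the term
    private final Map<String, TreeSet<String>> postings =
            new HashMap<String, TreeSet<String>>();

    // Indexing: split the text into lowercase ASCII words and
    // record the document name under each word.
    public void add(String docName, String text) {
        for (String term : text.toLowerCase().split("\\W+")) {
            if (term.isEmpty()) {
                continue;
            }
            TreeSet<String> docs = postings.get(term);
            if (docs == null) {
                docs = new TreeSet<String>();
                postings.put(term, docs);
            }
            docs.add(docName);
        }
    }

    // Searching: a single-term lookup with no scoring or query syntax.
    public List<String> search(String term) {
        TreeSet<String> docs = postings.get(term.toLowerCase());
        if (docs == null) {
            return Collections.<String>emptyList();
        }
        return new ArrayList<String>(docs);
    }

    public static void main(String[] args) {
        TinyInvertedIndex index = new TinyInvertedIndex();
        index.add("a.txt", "Hello Lucene");
        index.add("b.txt", "hello world");
        System.out.println(index.search("hello")); // prints [a.txt, b.txt]
        System.out.println(index.search("nutch")); // prints []
    }
}
```

Lucene adds analyzers, relevance scoring, a query language, and an on-disk index format on top of this basic idea.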

That completes a simple but whole program. Give it a try.

--EOF--

data.rar (1.1 KB)

2010-02-24 17:22
Category: Enterprise Architecture

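#3 java-xb 2012-10-19

夜神月 wrote:

I followed the post, set up the project, ran the code, built the index, and then searched, but I always get 0 hits. Any answer?

Searching for Chinese content does not seem to work; try English instead.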

#2 夜神月 2011-08-12

I followed the post, set up the project, ran the code, built the index, and then searched, but I always get 0 hits. Any answer?

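#1 ladybird2010 2010-02-25

Urgently looking for a configuration example of Lucene combined with Hibernate.
If you have a Lucene sample project, could you send me one? Ideally one that does word segmentation.
Email: gao.guangpei@zte.com.cn or ggp123@126.com
Thank you very much!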
