Lucene Character Encoding Problem

liuxinglanyue

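Suppose the txt files to be indexed include both ANSI-encoded and Unicode-encoded text files, as shown in the image attachment.

When Lucene is used to build an index over them and then search it, the content of these documents cannot be found.

The text to be searched is provided in the attachment.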

Source code for creating the index:

import java.io.File;
import java.io.FileReader;
import java.io.IOException;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class IndexFiles {
	// Main code: indexes the documents under docDir; the index files are written to indexDir
	@SuppressWarnings("deprecation")
	public static void main(String[] args) {

		File indexDir = new File("e:\\Lucene\\index");
		File docDir = new File("e:\\Lucene\\content");

		try {
			// Index writer
			IndexWriter standardWriter = new IndexWriter(FSDirectory
					.open(indexDir), new StandardAnalyzer(
					Version.LUCENE_CURRENT), true,
					IndexWriter.MaxFieldLength.LIMITED);
			// Do not build a compound index file; by default the index uses the compound file format
			standardWriter.setUseCompoundFile(false);
			String[] files = docDir.list();
			for (String fileStr : files) {
				File file = new File(docDir, fileStr);
				if (!file.isDirectory()) {
					Document doc = new Document();
					// File name: searchable, not tokenized
					String fileName = file.getName().substring(0,
							file.getName().indexOf("."));
					System.out.println("fileName:"+fileName);
					doc.add(new Field("name", fileName, Field.Store.YES,
							Field.Index.NOT_ANALYZED));
					// File path: searchable, not tokenized
					String filePath = file.getPath();
					doc.add(new Field("path", filePath, Field.Store.YES,
							Field.Index.NOT_ANALYZED));
					// File content: the field to be searched
					doc.add(new Field("content", new FileReader(file)));
					standardWriter.addDocument(doc);
				}
			}
			standardWriter.optimize();
			// Close the index writer
			standardWriter.close();
		} catch (IOException e) {
			System.out.println(" caught a " + e.getClass()
					+ "\n with message: " + e.getMessage());
		}
	}
}

Source code for searching:

import java.io.BufferedReader;
import java.io.File;
import java.io.IOException;
import java.io.InputStreamReader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.Searcher;
import org.apache.lucene.search.TopScoreDocCollector;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

/**
 * Searches the index
 */
public class SearchFiles {

	/** Simple command-line based search demo. */
	@SuppressWarnings("deprecation")
	public static void main(String[] args) throws Exception {

		String index = "E:\\Lucene\\index";
		String field = "content";
		String queries = null;
		boolean raw = false;
		// Number of hits to display
		int hitsPerPage = 10;

		// searching, so read-only=true
		IndexReader reader = IndexReader.open(
				FSDirectory.open(new File(index)), true); // only

		Searcher searcher = new IndexSearcher(reader);
		Analyzer standardAnalyzer = new StandardAnalyzer(Version.LUCENE_CURRENT);

		BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
		QueryParser parser = new QueryParser(Version.LUCENE_CURRENT, field,
				standardAnalyzer);
		while (true) {
			if (queries == null) // prompt the user
				System.out.println("Enter query: ");

			String line = in.readLine();

			if (line == null) // end of input
				break;

			line = line.trim();
			if (line.length() == 0)
				break;

			Query query = parser.parse(line);
			System.out.println("Searching for: " + query.toString(field));

			doPagingSearch(in, searcher, query, hitsPerPage, raw,
					queries == null);
		}
		reader.close();
	}

	public static void doPagingSearch(BufferedReader in, Searcher searcher,
			Query query, int hitsPerPage, boolean raw, boolean interactive)
			throws IOException {

		TopScoreDocCollector collector = TopScoreDocCollector.create(
				hitsPerPage, false);
		searcher.search(query, collector);
		ScoreDoc[] hits = collector.topDocs().scoreDocs;

		int end, numTotalHits = collector.getTotalHits();
		System.out.println(numTotalHits + " total matching documents");

		int start = 0;

		end = Math.min(hits.length, start + hitsPerPage);

		for (int i = start; i < end; i++) {
			Document doc = searcher.doc(hits[i].doc);
			String path = doc.get("path");
			if (path != null) {
				System.out.println((i + 1) + ". " + path);
			} else {
				System.out
						.println((i + 1) + ". " + "No path for this document");
			}
		}
	}
}

Attachment: 需要搜索的文本.rar ("text to be searched", 255.1 KB)

2010-12-27 20:20

Reply 1, ralfbawg, 2010-12-29

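Replace this call:

doc.add(new Field("content", new FileReader(file)));

with:

doc.add(new Field("content", new InputStreamReader(new FileInputStream(file.getCanonicalPath()), charset)));

Here charset is the actual encoding of the file being read. FileReader decodes with the system default encoding, so a file written in one encoding may be read and indexed under a different one, and the search then fails to match its content.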
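When ANSI and Unicode files sit in the same folder, a single fixed charset will not fit every file. The following is a minimal sketch of one way to choose it per file, not something taken from Lucene or from the reply above. BomAwareReaders, open and readAll are hypothetical names. It assumes the Unicode files begin with a byte-order mark (as files saved by Windows Notepad normally do) and that every file without one uses the fallback charset passed in by the caller.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.PushbackInputStream;
import java.io.Reader;
import java.nio.charset.Charset;

/** Hypothetical helper (not part of Lucene or the JDK): picks a charset per file. */
final class BomAwareReaders {

    private BomAwareReaders() {
    }

    /**
     * Opens file as a Reader. A leading byte-order mark selects UTF-8, UTF-16LE
     * or UTF-16BE and is skipped; if there is none, fallback is used (assumption:
     * files without a BOM are all in that one encoding).
     */
    static Reader open(File file, Charset fallback) throws IOException {
        PushbackInputStream in = new PushbackInputStream(new FileInputStream(file), 3);
        try {
            byte[] head = new byte[3];
            int n = 0;
            // Read up to three bytes; a single read() call may return fewer.
            while (n < head.length) {
                int r = in.read(head, n, head.length - n);
                if (r < 0) {
                    break;
                }
                n += r;
            }

            Charset charset = fallback;
            int bomLength = 0;
            if (n >= 3 && (head[0] & 0xFF) == 0xEF && (head[1] & 0xFF) == 0xBB
                    && (head[2] & 0xFF) == 0xBF) {
                charset = Charset.forName("UTF-8");
                bomLength = 3;
            } else if (n >= 2 && (head[0] & 0xFF) == 0xFF && (head[1] & 0xFF) == 0xFE) {
                charset = Charset.forName("UTF-16LE");
                bomLength = 2;
            } else if (n >= 2 && (head[0] & 0xFF) == 0xFE && (head[1] & 0xFF) == 0xFF) {
                charset = Charset.forName("UTF-16BE");
                bomLength = 2;
            }

            // Push back the bytes after the BOM so the reader sees the whole text.
            if (n > bomLength) {
                in.unread(head, bomLength, n - bomLength);
            }
            return new InputStreamReader(in, charset);
        } catch (IOException e) {
            in.close();
            throw e;
        }
    }

    /** Reads everything into a String and closes the reader; useful for checking the decoding. */
    static String readAll(Reader reader) throws IOException {
        try {
            StringBuilder sb = new StringBuilder();
            char[] buf = new char[4096];
            int r;
            while ((r = reader.read(buf)) != -1) {
                sb.append(buf, 0, r);
            }
            return sb.toString();
        } finally {
            reader.close();
        }
    }
}
```

In the indexing code above, the content field could then be added as doc.add(new Field("content", BomAwareReaders.open(file, Charset.forName("GBK")))). GBK here is only an assumption about what "ANSI" means on a Simplified Chinese Windows system, so substitute the code page actually used. Files without a BOM that are not in the fallback encoding will still be decoded incorrectly.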
