Trying FuzzyQuery to measure the similarity of Chinese strings

lang

lucene

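    I need to deduplicate some data. The rule is roughly this: if two records have very similar names and addresses, they are treated as the same record and one of them should be dropped. Last night I went through a Lucene book looking for a good way to do this and decided to try FuzzyQuery. I wrote a test first thing this morning, and the result left me baffled!
    The code is as follows: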


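import java.io.IOException;

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.CorruptIndexException;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.Explanation;
import org.apache.lucene.search.FuzzyQuery;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.store.LockObtainFailedException;
import org.apache.lucene.store.RAMDirectory;

// Assumption: MMAnalyzer is the Chinese analyzer from the JE-Analysis package
import jeasy.analysis.MMAnalyzer;

public class FuzzyQueryTest {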

	public static void main(String[] args) {
		RAMDirectory directory = new RAMDirectory();
		try {
			IndexWriter indexWriter = new IndexWriter(directory,
					new MMAnalyzer(), true);
			Document document1 = new Document();
			Document document2 = new Document();
			Document document3 = new Document();
			Document document4 = new Document();
			Field f1 = new Field("content", "北京科技大学", Field.Store.YES,
					Field.Index.TOKENIZED);
			Field f2 = new Field("content", "北京语言大学", Field.Store.YES,
					Field.Index.TOKENIZED);
			Field f3 = new Field("content", "中国科技大学", Field.Store.YES,
					Field.Index.TOKENIZED);
			Field f4 = new Field("content", "北京大学科技馆", Field.Store.YES,
					Field.Index.TOKENIZED);
			document1.add(f1);
			document2.add(f2);
			document3.add(f3);
			document4.add(f4);
			indexWriter.addDocument(document4);
			indexWriter.addDocument(document3);
			indexWriter.addDocument(document2);
			indexWriter.addDocument(document1);
			indexWriter.close();
			// search
			IndexSearcher indexSearcher = new IndexSearcher(directory);
			Term term = new Term("content", "北京语言大学");
			FuzzyQuery fuzzyQuery = new FuzzyQuery(term);
			Hits hits = indexSearcher.search(fuzzyQuery);
			for (int i = 0; i < hits.length(); i++) {
				System.out.println(hits.doc(i));
				System.out.println(hits.score(i));
				// explain() expects the index-wide document id, not the position in hits
				Explanation explanation = indexSearcher.explain(fuzzyQuery, hits.id(i));
				System.out.println(explanation.toString());
				System.out.println("-----------------");
			}
		} catch (CorruptIndexException e) {
			e.printStackTrace();
		} catch (LockObtainFailedException e) {
			e.printStackTrace();
		} catch (IOException e) {
			e.printStackTrace();
		}
	}
}

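The result is empty. If I search only for 大学 or 北京, I do get hits. But however I look at it, this result cannot meet my deduplication requirement!
Does anyone have a good suggestion? Any pointers would be appreciated!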

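Let me correct my own question: what I am really asking is how to compare the similarity of two strings. For this, doris suggested using LCSS to find the longest matching string. I am now thinking about how, once that longest matching string is found, to turn it into a score that measures similarity to the original strings.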
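
One possible way to turn the LCSS idea into a score is sketched below. It assumes that "LCSS" means the longest common subsequence, and it normalizes with 2 * LCS / (len(a) + len(b)), which is only one of several reasonable choices and is my assumption, not something doris specified. The class and method names are made up for illustration.

```java
/** Sketch: score two strings by the length of their longest common subsequence. */
class LcsSimilarity {

    /** Length of the longest common subsequence, computed by dynamic programming. */
    static int lcsLength(String a, String b) {
        int[][] dp = new int[a.length() + 1][b.length() + 1];
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                if (a.charAt(i - 1) == b.charAt(j - 1)) {
                    // Matching characters extend the common subsequence by one
                    dp[i][j] = dp[i - 1][j - 1] + 1;
                } else {
                    // Otherwise keep the better result of skipping one character
                    dp[i][j] = Math.max(dp[i - 1][j], dp[i][j - 1]);
                }
            }
        }
        return dp[a.length()][b.length()];
    }

    /** Assumed normalization: 2 * LCS / (|a| + |b|), which lies in [0, 1]. */
    static double score(String a, String b) {
        if (a.isEmpty() && b.isEmpty()) {
            return 1.0; // two empty strings are treated as identical
        }
        return 2.0 * lcsLength(a, b) / (a.length() + b.length());
    }

    public static void main(String[] args) {
        String query = "北京语言大学";
        String[] candidates = {"北京科技大学", "中国科技大学", "北京大学科技馆"};
        for (String candidate : candidates) {
            System.out.println(candidate + " -> " + score(query, candidate));
        }
    }
}
```

For deduplication, two records could then be treated as duplicates when the score exceeds some threshold. Where to put that threshold would have to be tuned on real data.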


2008-04-29 08:02

#1 hellodesigner 2008-05-11

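There are two places in your code that may be a problem:
1. You need to tokenize the search text, because the content was tokenized when you indexed it. This is the biggest problem.
2. When you construct the FuzzyQuery object, set the similarity threshold according to your needs. The default is probably already good enough for you, but if you need a higher similarity you can adjust it.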
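
To make both points concrete, here is a small stand-alone sketch (plain Java, no Lucene) of the kind of comparison FuzzyQuery performs. As far as I understand Lucene 2.x, a fuzzy query compares the query term against each indexed term by edit distance and scores it as 1 - distance / min(length of the two terms), with a default minimum similarity of 0.5. The formula below is my approximation of that, not Lucene's own code, and the token list 北京 / 语言 / 大学 is a hypothetical guess at what the analyzer produced for 北京语言大学.

```java
/** Sketch: edit-distance similarity between a query term and indexed terms. */
class FuzzyTermSimilarity {

    /** Levenshtein edit distance, using two rolling rows of the DP table. */
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // turning "" into b[0..j) takes j insertions
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // turning a[0..i) into "" takes i deletions
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                        prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    /** Assumed formula: 1 - distance / min(length), with no required prefix. */
    static double similarity(String queryTerm, String indexedTerm) {
        int shorter = Math.min(queryTerm.length(), indexedTerm.length());
        if (shorter == 0) {
            return queryTerm.length() == indexedTerm.length() ? 1.0 : 0.0;
        }
        return 1.0 - (double) editDistance(queryTerm, indexedTerm) / shorter;
    }

    public static void main(String[] args) {
        String wholeQuery = "北京语言大学";
        String[] indexedTerms = {"北京", "语言", "大学"}; // hypothetical analyzer output
        double minimumSimilarity = 0.5; // assumed default threshold
        for (String term : indexedTerms) {
            double s = similarity(wholeQuery, term);
            System.out.println(wholeQuery + " vs " + term + ": " + s
                    + (s > minimumSimilarity ? " (match)" : " (no match)"));
        }
    }
}
```

Under these assumptions the untokenized six-character query is four edits away from every two-character token, so each similarity is far below the threshold, which would explain the empty result. Running the query text through the same analyzer and issuing one fuzzy term per token avoids this. The threshold itself can be passed as the second constructor argument, for example `new FuzzyQuery(term, 0.7f)`.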
