Using Lucene for a Smart Search Feature: Status and Problems Encountered

jklliang


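Based on customer feedback, these are the search features we need to implement.
[Examples]
• [Natural-language query] Entering "百草" finds "百草路"
• [Homophone query] Entering "白超路" finds "百草路"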
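Part 1. Natural-language query.
This is Lucene's mainstream search capability, but the Chinese analyzers bundled with the official Lucene distribution do not, on their own, give us the behavior we need.
Using IKAnalyzer, an open-source project hosted on Google Code, I have now basically implemented this feature.
IKAnalyzer is a good open-source Chinese word segmenter. Here is an example of how it splits text: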

Source text 1:
IKAnalyzer 是一个开源的，基于java 语言开发的轻量级的中文分词工具包。从2006 年12月推出1.0 版开始， IKAnalyzer 已经推出了3 个大版本。
Segmentation result:
ikanalyzer | 是 | 一个 | 一 | 个 | 开源 | 的 | 基于 | java | 语言 | 开发 | 的 | 轻量级 | 量级 | 的 | 中文 | 分词 | 工具包 | 工具 | 从 | 2006 | 年 | 12 | 月 | 推出 | 1.0 | 版 | 开始 | ikanalyzer | 已经 | 推出 | 出了 | 3 | 个大 | 个 | 版本

IKAnalyzer splits text fairly intelligently, roughly along the word boundaries we would use when reading the sentence aloud.

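In the example above, searching for the single character "是" or the two-character word "开源" retrieves the record. Searching for "开" alone does not.
The root cause: "开" was indexed only as part of the token "开源", so it is not a standalone search term.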
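
The behavior above follows from how a term lookup works: a query term hits only if it equals an indexed token exactly. A minimal sketch in plain Java, with no Lucene involved; the class name `ExactTermLookup` and the hand-copied token list are illustrative assumptions, not part of the demo project:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

class ExactTermLookup {
    // A few tokens copied by hand from the segmentation result above.
    static final Set<String> INDEXED_TOKENS = new HashSet<>(Arrays.asList(
            "ikanalyzer", "是", "一个", "一", "个", "开源", "的", "基于", "java"));

    // A term query hits only when the whole term equals an indexed token.
    static boolean matches(String term) {
        return INDEXED_TOKENS.contains(term);
    }

    public static void main(String[] args) {
        System.out.println("是   -> " + matches("是"));
        System.out.println("开源 -> " + matches("开源"));
        System.out.println("开   -> " + matches("开"));
    }
}
```

"开" is a prefix of the token "开源" but not a token itself, so the exact lookup misses it.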

About the Lucene setup used here:
1. Import lucene-core-3.0.1.jar and IKAnalyzer3.2.3Stable.jar
2. A demo I adapted can be run directly: IKAnalyzerDemo.Jar (all the code inside is commented)

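Part 2. Homophone query
     Homophone query has made little progress. There are no write-ups of approaches for this online, and no pinyin tokenizer either.
Overall, the most feasible approaches are the following:
     1. Compose words from pinyin
           That is: when the user enters baicao, we compose candidate words such as |百草|白草|败草|拜槽|
               This requires building a pinyin-to-character table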

Example:
ce=册侧策测厕恻側冊厠墄嫧帻幘廁惻憡拺敇柵栅測畟笧筞筴箣簎粣荝萗萴蓛赦齰刂
cen=参岑涔參叄叅嵾梣汵硶穇笒篸膥
ceng=曾层蹭噌僧增層嶒橧竲繒缯驓
Of course, we can build the table from the content in our own database, so there is no need to include every character.

We then parse the pinyin, compose the corresponding words and phrases, and search the index files with Lucene.

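Implementation difficulty:
Composing words is very hard. The key is picking the most suitable words: generating too many combinations wastes resources at query time, and keeping the search efficient needs a good solution of its own.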
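
To make the combination problem concrete, here is a minimal sketch of the first approach in plain Java. It assumes the input has already been split into syllables, and the tiny table and the `KNOWN_WORDS` set are hypothetical stand-ins for a table built from our own data. It is one possible way to prune candidates, not the method the input-method vendors use:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class PinyinCandidates {
    // Hypothetical, tiny pinyin-to-character table.
    static final Map<String, String> TABLE = new HashMap<>();
    static {
        TABLE.put("bai", "百白败拜");
        TABLE.put("cao", "草槽");
    }

    // Hypothetical set of words that actually occur in the indexed content.
    static final Set<String> KNOWN_WORDS = new HashSet<>(Arrays.asList("百草", "白草"));

    // Cartesian product of the candidate characters of each syllable.
    static List<String> combine(List<String> syllables) {
        List<String> results = new ArrayList<>();
        results.add("");
        for (String syllable : syllables) {
            String chars = TABLE.get(syllable);
            if (chars == null) {
                return new ArrayList<>(); // unknown syllable: no candidates
            }
            List<String> next = new ArrayList<>();
            for (String prefix : results) {
                for (int i = 0; i < chars.length(); i++) {
                    next.add(prefix + chars.charAt(i));
                }
            }
            results = next;
        }
        return results;
    }

    // Keep only the candidates that occur in our own content.
    static List<String> filterKnown(List<String> candidates) {
        List<String> kept = new ArrayList<>();
        for (String candidate : candidates) {
            if (KNOWN_WORDS.contains(candidate)) {
                kept.add(candidate);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> all = combine(Arrays.asList("bai", "cao"));
        System.out.println(all.size() + " raw candidates: " + all);
        System.out.println("kept: " + filterKnown(all));
    }
}
```

The number of raw candidates is the product of the per-syllable character counts, which is why restricting the table and the kept words to our own content matters so much.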
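2. When building the index, also index the full pinyin of each individual character
For example, take the sentence: java 语言开发
When building the index we can store it as: java 语言开发 java yu yan kai fa javayuyankaifa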
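
A minimal sketch of how such an index string could be produced, in plain Java. The four-entry character-to-pinyin map is a hypothetical stand-in for a real conversion table, and `PinyinIndexText` is an illustrative name:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class PinyinIndexText {
    // Hypothetical character-to-pinyin map covering only this example.
    static final Map<Character, String> PINYIN = new HashMap<>();
    static {
        PINYIN.put('语', "yu");
        PINYIN.put('言', "yan");
        PINYIN.put('开', "kai");
        PINYIN.put('发', "fa");
    }

    // Returns: original text + space-separated pinyin + concatenated pinyin.
    static String build(String text) {
        List<String> parts = new ArrayList<>();
        StringBuilder plain = new StringBuilder(); // run of characters with no table entry
        for (char c : text.toCharArray()) {
            String py = PINYIN.get(c);
            if (py != null) {
                flush(plain, parts);
                parts.add(py);
            } else if (Character.isLetterOrDigit(c)) {
                plain.append(c); // kept as-is, e.g. "java"
            } else {
                flush(plain, parts); // whitespace or punctuation ends the run
            }
        }
        flush(plain, parts);
        return text + " " + String.join(" ", parts) + " " + String.join("", parts);
    }

    static void flush(StringBuilder buf, List<String> parts) {
        if (buf.length() > 0) {
            parts.add(buf.toString());
            buf.setLength(0);
        }
    }

    public static void main(String[] args) {
        System.out.println(build("java 语言开发"));
    }
}
```

The resulting string would then be stored in the analyzed field in place of the raw text.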
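This lets us search in several ways:
1) The IKAnalyzer Chinese segmenter splits the sentence into:
java | 语言 | 开发 | java | yu | yan | kai | fa | javayuyankaifa |
Lucene's tokenizing analyzers generally treat a space as a token boundary.

So when the user enters "yu", "yan", or "javayuyankaifa", the corresponding content is found.
Entering "javayuyan", however, finds nothing, because there is no "javayuyan" token among the indexed terms.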
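2) To handle the input "javayuyan", we can use WildcardQuery.
   There are five kinds of search:
       Boolean operators
       Field search
       Wildcard search
       Fuzzy query
       Range search
We use wildcard search: format the input as "*javayuyan*" and it matches "javayuyankaifa".
But if the user mistypes the pinyin, or enters a near-homophone, nothing is found at all.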
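3) For near-homophone Chinese input: convert the characters into space-separated pinyin. For example, the input "白抄录" converts to "bai chao lu". The two tokens "bai" and "lu" can then still find records indexed with "bai cao lu", even though "chao" does not match "cao".
But if all three characters entered are near-homophones, we are stuck again!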
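
The query side of cases 2) and 3) can be sketched in plain Java as follows. This is not Lucene code: the substring test only stands in for a WildcardQuery with the pattern "*input*" (valid when the input itself contains no wildcard characters), and the three-entry pinyin map and the class name `PinyinQuerySketch` are hypothetical:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class PinyinQuerySketch {
    // Hypothetical character-to-pinyin map covering only this example.
    static final Map<Character, String> PINYIN = new HashMap<>();
    static {
        PINYIN.put('白', "bai");
        PINYIN.put('抄', "chao");
        PINYIN.put('录', "lu");
    }

    // Stand-in for the pattern "*input*": true if any token contains the input.
    static boolean wildcardContains(List<String> tokens, String input) {
        for (String token : tokens) {
            if (token.contains(input)) {
                return true;
            }
        }
        return false;
    }

    // Convert Chinese input into a list of pinyin syllables.
    static List<String> toSyllables(String hanzi) {
        List<String> syllables = new ArrayList<>();
        for (char c : hanzi.toCharArray()) {
            String py = PINYIN.get(c);
            if (py != null) {
                syllables.add(py);
            }
        }
        return syllables;
    }

    // Syllables of the query that also occur as indexed tokens.
    static List<String> matched(List<String> tokens, List<String> syllables) {
        List<String> hits = new ArrayList<>(syllables);
        hits.retainAll(tokens);
        return hits;
    }

    public static void main(String[] args) {
        List<String> javaDoc = Arrays.asList("java", "yu", "yan", "kai", "fa", "javayuyankaifa");
        System.out.println(wildcardContains(javaDoc, "javayuyan")); // prefix of the joined pinyin
        System.out.println(wildcardContains(javaDoc, "javayuyen")); // mistyped pinyin

        List<String> roadDoc = Arrays.asList("bai", "cao", "lu", "baicaolu"); // tokens for 百草路
        System.out.println(matched(roadDoc, toSyllables("白抄录")));
    }
}
```

Each syllable that fails to match lowers the overlap, which is why an input made entirely of near-homophones finds nothing.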

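These are the two approaches I have come up with.

Overall, the second approach is not hard to implement, but it has too many gaps, and it needs several search passes, which degrades performance badly.

The first approach is hard to implement, but it is what input-method and search-engine companies use today. They keep records of frequently searched characters and words; in Sogou, for example, the words we type most often are offered first. Implemented well, this is certainly the best approach.

This was written in a hurry and I have not checked it for typos or broken sentences, so please fuzzy-match as you read!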

Lucene resources
Using Lucene search
http://www.chedong.com/tech/lucene.html
Chinese analyzers
http://www.cnblogs.com/wuxh/articles/77870.html

The test class implemented so far:
package test.insta;

import java.io.IOException;
import java.io.StringReader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.CorruptIndexException;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.LockObtainFailedException;
import org.apache.lucene.store.RAMDirectory;
// Classes from IKAnalyzer 3.0
import org.wltea.analyzer.lucene.IKAnalyzer;
import org.wltea.analyzer.lucene.IKQueryParser;
import org.wltea.analyzer.lucene.IKSimilarity;

/**
 * IK Analyzer demo.
 *
 * @author 同串比较
 */
public class IKAnalyzerDemo {
    public static void main(String[] args) {
        // Field name of the Lucene Document
        String fieldName = "text";
        // Content to index and search
        String text = "IK Analyzer bai cao lu wang jia da dao baicaoluwangjiadadao 搜索王贾大道是一个结合词典分词和文法分词的中文分词开源工具包。它使用了全新的正向迭代最细粒度切分算法。";
        // Instantiate the IKAnalyzer segmenter
        Analyzer analyzer = new IKAnalyzer();
        Directory directory = null;
        IndexWriter iwriter = null;
        IndexSearcher isearcher = null;
        try {
            // Create the in-memory index
            directory = new RAMDirectory();
            iwriter = new IndexWriter(directory, analyzer, true,
                    IndexWriter.MaxFieldLength.LIMITED);
            Document doc = new Document();
            doc.add(new Field(fieldName, text, Field.Store.YES,
                    Field.Index.ANALYZED));
            iwriter.addDocument(doc);
            iwriter.close();
            // Instantiate the searcher
            isearcher = new IndexSearcher(directory);
            // Use the IKSimilarity scorer in the searcher
            isearcher.setSimilarity(new IKSimilarity());
            String keyword = "是";
            // Segment the keyword the same way the indexed text was segmented
            keyword = dissect(keyword);
            // Build the Query object with the IKQueryParser
            Query query = IKQueryParser.parse(fieldName, keyword);
            // Fetch the 5 highest-scoring records
            TopDocs topDocs = isearcher.search(query, 5);
            System.out.println("Hits: " + topDocs.totalHits);
            // Print the results. totalHits can exceed the number of returned
            // records, so iterate over scoreDocs.length instead.
            ScoreDoc[] scoreDocs = topDocs.scoreDocs;
            for (int i = 0; i < scoreDocs.length; i++) {
                Document targetDoc = isearcher.doc(scoreDocs[i].doc);
                System.out.println("Content: " + targetDoc.toString());
            }
        } catch (CorruptIndexException e) {
            e.printStackTrace();
        } catch (LockObtainFailedException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        } finally {
            if (isearcher != null) {
                try {
                    isearcher.close();
                } catch (IOException e) {
                    e.printStackTrace();
                }
            }
            if (directory != null) {
                try {
                    directory.close();
                } catch (IOException e) {
                    e.printStackTrace();
                }
            }
        }
    }

    /**
     * Segments the input with IKAnalyzer and returns the tokens joined by
     * whitespace. Falls back to the raw input if segmentation fails.
     */
    protected static String dissect(String input) {
        StringBuilder sb = new StringBuilder();
        TokenStream ts = null;
        try {
            IKAnalyzer analyzer = new IKAnalyzer();
            ts = analyzer.tokenStream("", new StringReader(input));
            while (ts.incrementToken()) {
                sb.append("   ");
                sb.append(ts.getAttribute(TermAttribute.class).term())
                        .append(" ");
            }
            // Drop the trailing space
            if (sb.length() > 0) {
                sb.setLength(sb.length() - 1);
            }
            return sb.toString();
        } catch (Exception e) {
            e.printStackTrace();
            return input;
        } finally {
            if (ts != null) {
                try {
                    ts.close();
                } catch (IOException e) {
                    e.printStackTrace();
                }
            }
        }
    }
}

Wildcard test:
package test;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.WildcardQuery;

// Note: this class is written against the Lucene 2.4-era API. Hits, the
// no-argument StandardAnalyzer constructor, and the String-path IndexWriter
// and IndexSearcher constructors were removed in Lucene 3.0.
public class WildcardQueryTest {
    public static void main(String[] args) throws Exception {
        // Create the Document objects
        Document doc1 = new Document();
        Document doc2 = new Document();
        Document doc3 = new Document();
        Document doc4 = new Document();

        // Add the "content" and "title" fields to each document
        doc1.add(new Field("content", "whatever ni hao", Field.Store.YES,
                Field.Index.ANALYZED));
        doc1.add(new Field("title", "doc1 ", Field.Store.YES,
                Field.Index.ANALYZED));

        doc2.add(new Field("content", "whatever nihao ", Field.Store.YES,
                Field.Index.ANALYZED));
        doc2.add(new Field("title", "doc2", Field.Store.YES,
                Field.Index.ANALYZED));

        doc3.add(new Field("content", "whatever nihaoa ", Field.Store.YES,
                Field.Index.ANALYZED));
        doc3.add(new Field("title", "doc3", Field.Store.YES,
                Field.Index.ANALYZED));

        doc4.add(new Field("content", "whatever wonihaoma", Field.Store.YES,
                Field.Index.ANALYZED));
        doc4.add(new Field("title", "doc4", Field.Store.YES,
                Field.Index.ANALYZED));

        // Create the index writer
        IndexWriter writer = new IndexWriter("c:\\index",
                new StandardAnalyzer(), true);
        // Add the documents to the index
        writer.addDocument(doc1);
        writer.addDocument(doc2);
        writer.addDocument(doc3);
        writer.addDocument(doc4);
        // Close the index writer
        writer.close();

        // Create the index searcher
        IndexSearcher searcher = new IndexSearcher("c:\\index");
        // Build the terms: "*" matches any run of characters, "?" exactly one
        Term word1 = new Term("content", "*nihao");
        Term word2 = new Term("content", "nihao");
        Term word3 = new Term("content", "nihao?");
        Term word4 = new Term("content", "nihao*");
        // The WildcardQuery object, initially null
        WildcardQuery query = null;
        // Holds the search results
        Hits hits = null;

        // First search
        query = new WildcardQuery(word1);
        hits = searcher.search(query);
        printResult(hits, "*nihao");

        // Second search
        query = new WildcardQuery(word2);
        hits = searcher.search(query);
        printResult(hits, "nihao");

        // Third search
        query = new WildcardQuery(word3);
        hits = searcher.search(query);
        printResult(hits, "nihao?");

        // Fourth search
        query = new WildcardQuery(word4);
        hits = searcher.search(query);
        printResult(hits, "nihao*");

        // Release the index files
        searcher.close();
    }

    // Prints the titles of the documents matched for the given pattern
    public static void printResult(Hits hits, String key) throws Exception {
        System.out.println("Searching for \"" + key + "\" :");
        if (hits != null) {
            if (hits.length() == 0) {
                System.out.println("No results found");
                System.out.println();
            } else {
                System.out.print("Found ");
                for (int i = 0; i < hits.length(); i++) {
                    // Get the document object
                    Document d = hits.doc(i);
                    // Get the content of the "title" field
                    String dname = d.get("title");
                    System.out.print(dname + "   ");
                }
                System.out.println();
                System.out.println();
            }
        }
    }
}

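65443941 is our QQ group for technical discussion: flex, java, gis (arcgis, googlemap), database optimization, data model design, system architecture design, UML-based requirements analysis and system design, and so on. You are welcome to join.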
