Lucene Hack: Improving Performance by Shrinking the Search Result Set


Author: caocao (网络隐士, "the Net Hermit"), http://www.caocao.name, http://www.caocao.mobi
Please cite the source when reposting: http://www.iteye.com/topic/78884

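I. Background
Once a Lucene index grows past a gigabyte, search performance degrades badly: even an arbitrary query takes 0.x seconds. With single-threaded search the performance is still acceptable, since results always come back within 0.x seconds. Under web-style multithreaded access, however, Lucene's internals load large amounts of data into memory and discard it right after use, which triggers frequent JVM garbage collection. Performance becomes extremely poor, and requests that hang for 1 to 10 seconds are everywhere. This is the widely criticized bottleneck of Lucene applications. Is there a way around it?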

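II. Idea
Watch how Google and Baidu behave. The overall impression is that keywords with many results take less time while keywords with few results take more, and when there are many results the page says "about ****** results". The Hermit's guess is that their algorithm stops scanning the index after finding the first n results and infers the total from those n. The limit on how far Google and Baidu let you page offers partial support for this guess.
Now look at Lucene: Hits.length() always returns an exact count. If Lucene could also return an approximate count, even a 10 GB index could be handled with ease.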

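III. Exploration
With this question in mind the Hermit searched far and wide but found no prior work, perhaps for not having looked hard enough. If a similar solution exists, please do not hesitate to share it.
Left with no alternative, the Hermit studied the Lucene 2.1.0 source in detail and prepared to reinvent the wheel.
In most search applications the Query ends up as a BooleanQuery, so that is where the Hermit started. Along the way, one method in BooleanScorer2 stood out:

Code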

    public void score(HitCollector hc) throws IOException {
        if (countingSumScorer == null) {
            initCountingSumScorer();
        }
        while (countingSumScorer.next()) {
            hc.collect(countingSumScorer.doc(), score());
        }
    }

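Adding logging inside the while loop confirms that the loop runs once per document in the result set. countingSumScorer.next() finds the next document that satisfies the boolean rules. Each match is passed to the HitCollector, which later reappears in another guise as the familiar Hits.
If a break is placed inside this while loop so that it exits after a certain number of documents, the performance gain should be substantial. The change is quite simple, and it did improve performance dramatically. The side effect is that results are less accurate, which can be compensated for by adjusting the business model and logic. Either way, it is an effective way to improve Lucene performance.
On reflection, this break is exactly why keywords with large result sets exit early and take little time, while keywords with small result sets inevitably walk the entire index and take somewhat longer.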
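
The capped loop described above can be sketched without any Lucene classes. This is a minimal illustration under stated assumptions, not the Hermit's actual code: CappedScan, Result, and scan are hypothetical names, and a plain Iterator of doc IDs stands in for countingSumScorer.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.stream.IntStream;

/** Hypothetical stand-in for the scoring loop; not Lucene code. */
class CappedScan {
    /** The collected doc IDs and whether the iterator was exhausted. */
    static final class Result {
        final List<Integer> docs;
        final boolean reachedTheEnd;

        Result(List<Integer> docs, boolean reachedTheEnd) {
            this.docs = docs;
            this.reachedTheEnd = reachedTheEnd;
        }
    }

    /** Collects at most maxDocs matches, then stops scanning. */
    static Result scan(Iterator<Integer> matches, int maxDocs) {
        List<Integer> docs = new ArrayList<>();
        while (docs.size() < maxDocs && matches.hasNext()) {
            docs.add(matches.next());
        }
        // If matches remain, the collected count is only a lower bound.
        return new Result(docs, !matches.hasNext());
    }

    public static void main(String[] args) {
        // One million matching documents, capped at 5,000.
        Result r = scan(IntStream.range(0, 1_000_000).iterator(), 5_000);
        System.out.println(r.docs.size() + " collected, reachedTheEnd=" + r.reachedTheEnd);
    }
}
```

The work done is bounded by the cap whenever the result set is large. A result set smaller than the cap is still scanned to the end, which matches the observation above.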

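IV. Results
Because the actual process of embedding the code is extremely tedious, the Hermit will explain it in detail in part two. This first part gives the big picture.
After much toil the Hermit finally got the program working. The effect can be seen in the Hermit's video search at http://so.mdbchina.com/video/%E7%BE%8E%E5%A5%B3.
The keyword "美女" matches 180,000 videos and used to return results in 0.5 seconds on average. With the new algorithm it returns in 0.06x seconds, and the results are good enough. The estimate of 85,000 results is far from the actual 180,000, but since it is an estimate, being off by a factor of 2 to 3 should be acceptable.
By the nature of the algorithm, hc.collect inside the while loop always completes in constant time and the number of iterations is bounded by a constant. For queries whose result set reaches the cap, the time complexity therefore depends only on the complexity of the BooleanQuery, and not on the size of the index or on how densely the matching Documents are distributed within it. The complexity of the BooleanQuery determines how many checks and index reads each countingSumScorer.next() call needs, and countingSumScorer.next() is precisely the part of the algorithm whose cost varies.
The index for this video search is now close to 3 GB, and popular keywords return in 0.0x seconds. The Hermit believes that even if the index grows to 10 GB, results will still come back in 0.0x seconds.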

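V. Principles
1. Do not modify lucene-core code
Modifying lucene-core at will is bad practice and causes many problems for later maintenance and upgrades. If there really is such a pressing need, it would be better to join the Lucene development team and contribute there. You may ask: Hermit, why don't you go yourself? Alas, the code is rather ugly and the Hermit is too embarrassed to show it there. More on that later.
2. Do not change the Lucene index file format
Same reasoning as above.
3. Replace as few of the regular search interfaces as possible
This makes it easy to switch back and forth between standard search and this search, and reduces the cost of modifying and maintaining code.
4. Naming convention
All added class names start with Inaccurate; everything else follows Lucene naming conventions.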

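VI. Limitations
1. The Hermit only wrote a replacement for BooleanWeight2. If the Weight is not a BooleanWeight2, the search behaves like a regular search.
2. If the result set is no larger than the maximum allowed result set, the search behaves like a regular search.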

VII. Files

Code

org.apache.lucene.search
    InaccurateBooleanScorer2.java // replacement for BooleanScorer2
    InaccurateBooleanWeight2.java // replacement for BooleanWeight2
    InaccurateHit.java // replacement for Hit
    InaccurateHitIterator.java // replacement for HitIterator
    InaccurateHits.java // replacement for Hits
    InaccurateIndexSearcher.java // replacement for IndexSearcher
org.apache.lucene.util
    InaccurateResultAggregation.java // value object holding search statistics

VIII. In Practice
1. InaccurateIndexSearcher
InaccurateIndexSearcher extends IndexSearcher. Its structure is simple: it adds two member variables, maxNumberOfDocs and inaccurateResultAggregation, plus a few methods.
Here comes the ugly part:

Code

    public void search(Weight weight, Filter filter, final HitCollector results, boolean ascending) throws IOException {
        ...
        if (weight.getClass().getSimpleName().equals("BooleanWeight2")) { // hook BooleanWeight2
            InaccurateBooleanWeight2 inaccurateBooleanWeight2 = new InaccurateBooleanWeight2(
                    this, weight.getQuery());
            float sum = inaccurateBooleanWeight2.sumOfSquaredWeights();
            float norm = this.getSimilarity().queryNorm(sum);
            inaccurateBooleanWeight2.normalize(norm); // bad smell
            InaccurateBooleanScorer2 inaccurateBooleanScorer2 = inaccurateBooleanWeight2
                    .getInaccurateBooleanScorer2(reader, maxNumberOfDocs);
            if (inaccurateBooleanScorer2 != null) {
                inaccurateResultAggregation = inaccurateBooleanScorer2
                        .getInaccurateTopAggregation(collector, ascending);
            }
        } else {
            Scorer scorer = weight.scorer(reader);
            if (scorer != null) {
                scorer.score(collector);
            }
        }
        ...
    }

Code

    ...
    TopDocs topDocs = (sort == null)
            ? searcher.search(weight, filter, n, ascending)
            : searcher.search(weight, filter, n, sort, ascending);
    length = topDocs.totalHits;
    InaccurateResultAggregation inaccurateResultAggregation = searcher
            .getInaccurateResultAggregation();
    if (inaccurateResultAggregation == null) {
        totalLength = length;
    } else {
        accurate = inaccurateResultAggregation.isAccurate();
        if (inaccurateResultAggregation.isAccurate()) {
            totalLength = inaccurateResultAggregation
                    .getNumberOfRecordsFound();
        } else {
            int maxDocID = searcher.maxDoc();
            // guessing how many records there are, rounded up to a multiple of 1000
            totalLength = 1000 * ((int) Math.ceil(0.001 * maxDocID
                    / (inaccurateResultAggregation.getLastDocID() + 1)
                    * inaccurateResultAggregation.getNumberOfRecordsFetched()));
        }
    }
    ...
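
The estimate in the snippet above is a linear extrapolation: if the scan stopped at lastDocID after fetching a given number of matches, the same match density is assumed to hold for the rest of the index, and the result is rounded up to the next multiple of 1000. The sketch below isolates that arithmetic. HitCountEstimate and estimate are hypothetical names, and the numbers are made up for illustration.

```java
/** Hypothetical helper isolating the extrapolation; not part of the original code. */
class HitCountEstimate {
    /**
     * Extrapolates the total hit count from a partial scan.
     * Assumes matches are spread evenly over the doc ID range,
     * which is why the estimate can be off by a factor of 2 to 3.
     */
    static int estimate(int maxDoc, int lastDocID, int fetched) {
        // Same operation order as the snippet above:
        // 0.001 * maxDoc / (lastDocID + 1) * fetched, then ceil, then * 1000.
        return 1000 * (int) Math.ceil(0.001 * maxDoc / (lastDocID + 1) * fetched);
    }

    public static void main(String[] args) {
        // Illustrative numbers: 4,321 matches found in the first 100,000
        // of 1,000,000 documents extrapolate to about 43,210,
        // which rounds up to 44,000.
        System.out.println(estimate(1_000_000, 99_999, 4_321));
    }
}
```

When matching documents cluster near the start of the index the formula overestimates, and when they cluster near the end it underestimates.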

Code

    public InaccurateResultAggregation getInaccurateTopAggregation(
            HitCollector hc, boolean ascending) throws IOException {
        // DeltaTime dt = new DeltaTime();
        if (countingSumScorer == null) {
            initCountingSumScorer();
        }
        int lastDocID = 0;
        boolean reachedTheEnd = true;
        int numberOfRecordsFetched = 0;
        while (countingSumScorer.next()) {
            lastDocID = countingSumScorer.doc();
            float score = score();
            hc.collect(lastDocID, score);
            numberOfRecordsFetched++;
            if (numberOfRecordsFetched >= maxNumberOfDocs) {
                // Peek once more to learn whether any matches remain.
                reachedTheEnd = !countingSumScorer.next();
                break;
            }
        }
        // System.out.println(dt.getTimeElasped());
        /*
         * This method may discard the remaining matches, so the count may be
         * inaccurate.
         */
        return new InaccurateResultAggregation(lastDocID, ascending,
                reachedTheEnd, numberOfRecordsFetched, numberOfRecordsFetched);
    }

    public InaccurateResultAggregation getAccurateBottomAggregation(
            HitCollector hc, boolean ascending) throws IOException {
        // DeltaTime dt = new DeltaTime();
        if (countingSumScorer == null) {
            initCountingSumScorer();
        }
        LinkedList<ResultNode> resultNodes = new LinkedList<ResultNode>();
        boolean isFull = false;
        int lastDocID = 0;
        int index = 0;
        int numberOfRecordsFound = 0;
        while (countingSumScorer.next()) {
            lastDocID = countingSumScorer.doc();
            float score = score();
            resultNodes.add(new ResultNode(lastDocID, score));
            if (isFull) {
                // Keep only the most recent maxNumberOfDocs matches.
                resultNodes.removeFirst();
            }
            index++;
            numberOfRecordsFound++;
            if (index >= maxNumberOfDocs) {
                isFull = true;
                index = 0;
                // break;
            }
        }
        for (ResultNode resultNode : resultNodes) {
            hc.collect(resultNode.getDoc(), resultNode.getScore());
        }
        // System.out.println(dt.getTimeElasped());
        /*
         * Since this method runs a full scan over all matched docs, the count
         * is always accurate.
         */
        return new InaccurateResultAggregation(lastDocID, ascending, true,
                resultNodes.size(), numberOfRecordsFound);
    }
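
The second method above keeps only the last maxNumberOfDocs matches while still counting every match. The same sliding-window idea can be sketched on plain integers, assuming a simple Iterable of doc IDs in place of the scorer. LastNWindow, Result, and lastN are hypothetical names used only for this illustration.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;

/** Hypothetical illustration of a "keep the last n" window; not Lucene code. */
class LastNWindow {
    /** The last n items seen plus the exact number of items scanned. */
    static final class Result {
        final List<Integer> kept;
        final int found;

        Result(List<Integer> kept, int found) {
            this.kept = kept;
            this.found = found;
        }
    }

    static Result lastN(Iterable<Integer> matches, int n) {
        Deque<Integer> window = new ArrayDeque<>();
        int found = 0;
        for (int doc : matches) {
            window.addLast(doc);
            if (window.size() > n) {
                window.removeFirst(); // drop the oldest entry once the window is full
            }
            found++;
        }
        return new Result(new ArrayList<>(window), found);
    }

    public static void main(String[] args) {
        Result r = lastN(Arrays.asList(3, 8, 15, 21, 40, 57, 90), 3);
        System.out.println(r.kept + " kept out of " + r.found);
    }
}
```

Because every match is visited, the count is exact, but the scan saves no time compared with a regular search. Only the number of collected results is bounded.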

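IX. Summary
The code has been packaged and uploaded, with brief comments by the Hermit. Usage is described in readme.txt; only a few lines of code need to be replaced.
In short, all you need to do is:
1. Replace Searcher searcher = new IndexSearcher(reader); with InaccurateIndexSearcher searcher = new InaccurateIndexSearcher(reader, 5000);
2. Replace Hits hits = searcher.search(query); with InaccurateHits hits = searcher.search(query, sort, ascending);
That is all. Everyone is welcome to try it. If you make improvements, please open-source the improved code as well, so that we can learn from and push each other.
Because the code has a few bad smells, the Hermit is too embarrassed to announce it to the Lucene development team.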