Lucene Hack: Improving Performance by Shrinking the Search Result Set (2)

Author: caocao (网络隐士), http://www.caocao.name, http://www.caocao.mobi
Please credit the source when reposting: http://www.iteye.com/topic/80073

This post continues from the previous one (http://www.iteye.com/topic/78884), which covered the general idea. This time we get to the code.

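V. Principles
1. Do not modify lucene-core code
Patching lucene-core at will is bad practice and creates a lot of maintenance and upgrade problems later. If the need is really that pressing, it would be better to join the Lucene development team and contribute there. You may ask why I have not done so myself: the code is rather ugly and I am too embarrassed to show it to them, as explained below.
2. Do not change the Lucene index file format
Same reasoning as above.
3. Replace as few of the regular search interfaces as possible
This makes it easy to switch back and forth between standard search and this one, and keeps the cost of code changes and maintenance low.
4. Naming convention
Every added class name starts with Inaccurate; everything else follows Lucene's naming conventions.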

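VI. Limitations
1. I only wrote a replacement for BooleanWeight2. If the Weight is not a BooleanWeight2, the search behaves exactly like a regular search.
2. If the result set is no larger than the maximum allowed result set, the search behaves exactly like a regular search.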

VII. Files

org.apache.lucene.search
    InaccurateBooleanScorer2.java // replacement for BooleanScorer2
    InaccurateBooleanWeight2.java // replacement for BooleanWeight2
    InaccurateHit.java // replacement for Hit
    InaccurateHitIterator.java // replacement for HitIterator
    InaccurateHits.java // replacement for Hits
    InaccurateIndexSearcher.java // replacement for IndexSearcher
org.apache.lucene.util
    InaccurateResultAggregation.java // value object holding search statistics

VIII. In Practice
1. InaccurateIndexSearcher
InaccurateIndexSearcher extends IndexSearcher. Its structure is simple: it adds two member variables, maxNumberOfDocs and inaccurateResultAggregation, plus a few methods.
Here comes the ugly part:

public void search(Weight weight, Filter filter, final HitCollector results, boolean ascending) throws IOException {
...
  if (weight.getClass().getSimpleName().equals("BooleanWeight2")) { // hook BooleanWeight2
   InaccurateBooleanWeight2 inaccurateBooleanWeight2 = new InaccurateBooleanWeight2(
     this, weight.getQuery());
   float sum = inaccurateBooleanWeight2.sumOfSquaredWeights();
   float norm = this.getSimilarity().queryNorm(sum);
   inaccurateBooleanWeight2.normalize(norm); // bad smell
   InaccurateBooleanScorer2 inaccurateBooleanScorer2 = inaccurateBooleanWeight2
     .getInaccurateBooleanScorer2(reader, maxNumberOfDocs);
   if (inaccurateBooleanScorer2 != null) {
    inaccurateResultAggregation = inaccurateBooleanScorer2
      .getInaccurateTopAggregation(collector, ascending);
   }
  } else {
   Scorer scorer = weight.scorer(reader);
   if (scorer != null) {
    scorer.score(collector);
   }
  }
...
}

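Because lucene-core hides BooleanWeight2, even instanceof cannot be used, so I had to resort to the ugly weight.getClass().getSimpleName().equals("BooleanWeight2").
After swapping BooleanWeight2 for InaccurateBooleanWeight2, the code kept returning no results. After painstaking debugging I found that constructing a BooleanWeight2 is not enough: you have to obtain sum and norm and then call normalize, which smells a bit.
Next, the code gets an InaccurateBooleanScorer2 from the InaccurateBooleanWeight2 and calls getInaccurateTopAggregation to run the search. The ascending parameter has no effect here. The reason is a long story: I introduced ascending to control the direction in which Lucene scans the index, from low docID to high or from high to low. I later changed how the index is built and no longer needed it, so the parameter is only kept for future use and could be enabled if lucene-core ever supports scanning the index in both directions.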
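The class-name check described above can be tried in isolation. The following is a minimal sketch, not the author's code: HiddenWeightFactory and WeightNameCheck are hypothetical stand-ins that only imitate a library hiding its weight class, and the second accepted name, BooleanWeight, is taken from the author's later comment about Lucene 2.2.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

/** Hypothetical stand-in for a library that hides its weight implementations. */
final class HiddenWeightFactory {
    // Private nested classes: outside code cannot name them, so instanceof is not an option.
    private static final class BooleanWeight2 { }
    private static final class TermWeight { }

    static Object newBooleanWeight() { return new BooleanWeight2(); }
    static Object newTermWeight() { return new TermWeight(); }
}

/** Hypothetical helper that decides by simple class name whether to hook a weight. */
final class WeightNameCheck {
    // "BooleanWeight2" is the name used in the article; "BooleanWeight" is the
    // name the author reports for Lucene 2.2 in the comments.
    private static final Set<String> HOOKED_NAMES =
            new HashSet<>(Arrays.asList("BooleanWeight2", "BooleanWeight"));

    static boolean shouldHook(Object weight) {
        return weight != null
                && HOOKED_NAMES.contains(weight.getClass().getSimpleName());
    }
}
```

Matching on a string is fragile: a rename in the library silently disables the hook instead of failing at compile time, which is what happened with the Lucene 2.2 upgrade mentioned in the comments.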
2. InaccurateHits
Calling search on InaccurateIndexSearcher actually calls new InaccurateHits(this, query, null, sort, ascending). getMoreDocs in turn calls back into the newly written search method.
The code:

...
TopDocs topDocs = (sort == null)
  ? searcher.search(weight, filter, n, ascending)
  : searcher.search(weight, filter, n, sort, ascending);
length = topDocs.totalHits;
InaccurateResultAggregation inaccurateResultAggregation = searcher
  .getInaccurateResultAggregation();
if (inaccurateResultAggregation == null) {
  totalLength = length;
} else {
  accurate = inaccurateResultAggregation.isAccurate();
  if (inaccurateResultAggregation.isAccurate()) {
    totalLength = inaccurateResultAggregation.getNumberOfRecordsFound();
  } else {
    // guessing how many records there are: scale the fetched count by
    // maxDocID / (lastDocID + 1), then round up to a multiple of 1000
    int maxDocID = searcher.maxDoc();
    totalLength = 1000 * ((int) Math.ceil(0.001 * maxDocID
      / (inaccurateResultAggregation.getLastDocID() + 1)
      * inaccurateResultAggregation.getNumberOfRecordsFetched()));
  }
}
...

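There is nothing special in this code apart from the algorithm that guesses the total number of records. Lucene scans from low docID to high and, as described last time, breaks out partway through, so the ratio between the last scanned docID (lastDocID) and maxDocID can be used to estimate how many records there are in total, on the implicit assumption that matches are spread fairly evenly across docIDs. The estimate is not very precise, but it gets the order of magnitude right, and since a typical user only ever sees the first 1,000 records, the exact total hardly matters to them.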
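The estimate can be written as a standalone function. This is a minimal sketch of the same arithmetic, with a hypothetical class and method name (TotalHitsEstimate.estimate); it is not part of the uploaded package.

```java
/** Hypothetical helper reproducing the total-count guess from InaccurateHits. */
final class TotalHitsEstimate {
    /**
     * @param maxDocID  upper bound of the docID range (searcher.maxDoc() in the article)
     * @param lastDocID last docID scanned before the scorer broke out
     * @param fetched   number of matches collected up to lastDocID
     * @return estimated total matches, rounded up to a multiple of 1000
     */
    static int estimate(int maxDocID, int lastDocID, int fetched) {
        // Scale the fetched count by the inverse of the scanned fraction of the index.
        double scaled = (double) fetched * maxDocID / (lastDocID + 1);
        // Round up to the next thousand, as the original code does.
        return 1000 * (int) Math.ceil(scaled / 1000.0);
    }

    public static void main(String[] args) {
        // 5,000 matches found in the first 300,000 of 1,000,000 docIDs.
        System.out.println(estimate(1_000_000, 299_999, 5_000)); // prints 17000
    }
}
```

Rounding up to a multiple of 1000 keeps the displayed total from suggesting more precision than the guess actually has.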
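3. InaccurateBooleanWeight2
There is not much to say about InaccurateBooleanWeight2: it is just a stepping stone for obtaining an InaccurateBooleanScorer2.
4. InaccurateBooleanScorer2
All of the code in InaccurateBooleanScorer2 comes from BooleanScorer2. BooleanScorer2 was not designed to be subclassed, so I had to start a separate class, which is another bad smell. I did not modify any of the code copied from BooleanScorer2 and only added getMaxNumberOfDocs, getInaccurateTopAggregation, and getAccurateBottomAggregation. getInaccurateTopAggregation breaks out as soon as it has scanned maxNumberOfDocs matches, so its results are somewhat inaccurate. getAccurateBottomAggregation always keeps the last maxNumberOfDocs results, so its result list is also incomplete, but its statistics are accurate because it always walks through every match in the index. It follows that getAccurateBottomAggregation is somewhat slower: you cannot have both accuracy and performance.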

 public InaccurateResultAggregation getInaccurateTopAggregation(
   HitCollector hc, boolean ascending) throws IOException {
  // DeltaTime dt = new DeltaTime();
  if (countingSumScorer == null) {
   initCountingSumScorer();
  }
  int lastDocID = 0;
  boolean reachedTheEnd = true;
  int numberOfRecordsFetched = 0;
  while (countingSumScorer.next()) {
   lastDocID = countingSumScorer.doc();
   float score = score();
   hc.collect(lastDocID, score);
   numberOfRecordsFetched++;
   if (numberOfRecordsFetched >= maxNumberOfDocs) {
    reachedTheEnd = !countingSumScorer.next();
    break;
   }
  }
  // System.out.println(dt.getTimeElasped());
  /*
   * This method may discard the remaining matches, so the result may be inaccurate.
   */
  return new InaccurateResultAggregation(lastDocID, ascending,
    reachedTheEnd, numberOfRecordsFetched, numberOfRecordsFetched);
 }

 public InaccurateResultAggregation getAccurateBottomAggregation(
   HitCollector hc, boolean ascending) throws IOException {
  // DeltaTime dt = new DeltaTime();
  if (countingSumScorer == null) {
   initCountingSumScorer();
  }
  LinkedList<ResultNode> resultNodes = new LinkedList<ResultNode>();
  boolean isFull = false;
  int lastDocID = 0;
  int index = 0;
  int numberOfRecordsFound = 0;
  while (countingSumScorer.next()) {
   lastDocID = countingSumScorer.doc();
   float score = score();
   resultNodes.add(new ResultNode(lastDocID, score));
   if (isFull) {
    resultNodes.removeFirst();
   }
   index++;
   numberOfRecordsFound++;
   if (index >= maxNumberOfDocs) {
    isFull = true;
    index = 0;
    // break;
   }
  }
  for (ResultNode resultNode : resultNodes) {
   hc.collect(resultNode.getDoc(), resultNode.getScore());
  }
  // System.out.println(dt.getTimeElasped());

  /*
   * Since this method runs a full scan over all matched docs, its counts
   * are always accurate.
   */
  return new InaccurateResultAggregation(lastDocID, ascending, true,
    resultNodes.size(), numberOfRecordsFound);
 }
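To see the trade-off between the two methods without Lucene, here is a minimal sketch over a plain iterator of docIDs. AggregationSketch, Result, topN, and bottomN are hypothetical names, scores are omitted, and max is assumed to be at least 1; the control flow mirrors the two methods above but is not the author's code.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.Iterator;
import java.util.List;

final class AggregationSketch {
    /** Hypothetical result holder, loosely mirroring InaccurateResultAggregation. */
    static final class Result {
        final List<Integer> docs;     // docIDs handed to the collector
        final boolean reachedTheEnd;  // true if every match was visited
        final int found;              // number of matches counted
        Result(List<Integer> docs, boolean reachedTheEnd, int found) {
            this.docs = docs;
            this.reachedTheEnd = reachedTheEnd;
            this.found = found;
        }
    }

    /** Early exit: stop after max matches, so the count may be too low. */
    static Result topN(Iterator<Integer> matches, int max) {
        List<Integer> docs = new ArrayList<>();
        boolean reachedTheEnd = true;
        while (matches.hasNext()) {
            docs.add(matches.next());
            if (docs.size() >= max) {
                reachedTheEnd = !matches.hasNext(); // peek: is anything left?
                break;
            }
        }
        return new Result(docs, reachedTheEnd, docs.size());
    }

    /** Full scan: keep only the last max matches, but count all of them. */
    static Result bottomN(Iterator<Integer> matches, int max) {
        Deque<Integer> window = new ArrayDeque<>();
        int found = 0;
        while (matches.hasNext()) {
            window.addLast(matches.next());
            if (window.size() > max) {
                window.removeFirst(); // drop the oldest entry in the window
            }
            found++;
        }
        return new Result(new ArrayList<>(window), true, found);
    }

    public static void main(String[] args) {
        List<Integer> ids = Arrays.asList(1, 3, 5, 7, 9);
        System.out.println(topN(ids.iterator(), 2).docs);    // prints [1, 3]
        System.out.println(bottomN(ids.iterator(), 2).docs); // prints [7, 9]
    }
}
```

topN touches at most max + 1 matches, while bottomN always touches all of them, which is where the performance difference comes from.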

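IX. Summary
The code has been packaged and uploaded with my brief comments. Usage is described in readme.txt; only a few lines of code need to be replaced.
In short, all you need to do is:
1. Replace Searcher searcher = new IndexSearcher(reader); with InaccurateIndexSearcher searcher = new InaccurateIndexSearcher(reader, 5000);
2. Replace Hits hits = searcher.search(query); with InaccurateHits hits = searcher.search(query, sort, ascending);
You are welcome to try it out. If you make improvements, please open-source the improved code as well so that we can all learn from each other.
Because the code has a few bad smells, I really do not have the nerve to go and announce it to the Lucene developers.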

inaccurate.rar (13 KB)
Description: Lucene Extension
Downloads: 743

2007-05-15 12:32
Comments (3)

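#3 bluepoint 2007-09-25

lucene 2.2
For a single-term search (e.g. 中国), the query seems to default to TermQuery, so weight.getClass().getSimpleName().equals("BooleanWeight")
does not hold and the hack is never used. Is there any way to handle this?

For a multi-term search (e.g. 中国上海), it works.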

#2 regedit 2007-07-06

How do you change the direction in which Lucene scans the index?
I would like to scan starting from the highest docID instead.

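#1 caocao 2007-07-06

I have upgraded to 2.2. With 2.2 this modified package has no effect as shipped, because BooleanWeight2 no longer exists and only BooleanWeight remains. Just change weight.getClass().getSimpleName().equals("BooleanWeight2") to weight.getClass().getSimpleName().equals("BooleanWeight").

By the way, shame on IT168 for reposting my articles without crediting the source. The links: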

http://tech.it168.com/j/2007-06-04/200706040903078.shtml

http://tech.it168.com/j/2007-06-04/200706040939156.shtml
