Lucene Hack: Improving Performance by Shrinking the Search Result Set (2)


caocao (from Shanghai), posted 2007-05-15

Author: caocao (网络隐士), http://www.caocao.name, http://www.caocao.mobi
Please credit the source when reposting: http://www.iteye.com/topic/80073

This continues the previous post (http://www.iteye.com/topic/78884). Last time I sketched the general principle; this time we get to the code.

V. Principles

1. Do not modify lucene-core.
Changing lucene-core at will is bad form, and it causes a lot of trouble for later maintenance and upgrades. If you really need such changes that badly, you are better off joining the Lucene developers and contributing there. You may ask why I don't do that myself. Well, the code is rather ugly and I am too embarrassed to show it to them; more on that below.

2. Do not change the Lucene index file format.
Same reasoning as above.

3. Replace as few of the regular search interfaces as possible.
This makes it easy to switch back and forth between the standard search and this one, and keeps the cost of code changes and maintenance low.

4. Naming convention.
Every added class name starts with Inaccurate; everything else follows the Lucene naming conventions.

VI. Limitations

1. I only wrote a replacement for BooleanWeight2. If the Weight is not a BooleanWeight2, the search behaves exactly like a regular search.
2. If the result set is no larger than the maximum allowed result set, the search behaves exactly like a regular search.

VII. Files

org.apache.lucene.search
    InaccurateBooleanScorer2.java     // replacement for BooleanScorer2
    InaccurateBooleanWeight2.java     // replacement for BooleanWeight2
    InaccurateHit.java                // replacement for Hit
    InaccurateHitIterator.java        // replacement for HitIterator
    InaccurateHits.java               // replacement for Hits
    InaccurateIndexSearcher.java      // replacement for IndexSearcher
org.apache.lucene.util
    InaccurateResultAggregation.java  // value object holding the search statistics

VIII. In practice

1. InaccurateIndexSearcher

InaccurateIndexSearcher extends IndexSearcher. Its structure is simple: it adds two member variables, maxNumberOfDocs and inaccurateResultAggregation, plus a few methods. Here comes the ugly part:

    public void search(Weight weight, Filter filter,
            final HitCollector results, boolean ascending) throws IOException {
        ...
        if (weight.getClass().getSimpleName().equals("BooleanWeight2")) {
            // hook BooleanWeight2
            InaccurateBooleanWeight2 inaccurateBooleanWeight2 =
                    new InaccurateBooleanWeight2(this, weight.getQuery());
            float sum = inaccurateBooleanWeight2.sumOfSquaredWeights();
            float norm = this.getSimilarity().queryNorm(sum);
            inaccurateBooleanWeight2.normalize(norm);
            // bad smell
            InaccurateBooleanScorer2 inaccurateBooleanScorer2 =
                    inaccurateBooleanWeight2.getInaccurateBooleanScorer2(reader, maxNumberOfDocs);
            if (inaccurateBooleanScorer2 != null) {
                inaccurateResultAggregation =
                        inaccurateBooleanScorer2.getInaccurateTopAggregation(collector, ascending);
            }
        } else {
            // any other Weight: fall back to the regular search path
            Scorer scorer = weight.scorer(reader);
            if (scorer != null) {
                scorer.score(collector);
            }
        }
        ...
    }

Because lucene-core hides BooleanWeight2, not even instanceof can be used, so I had to resort to the ugly weight.getClass().getSimpleName().equals("BooleanWeight2"). After I swapped BooleanWeight2 for InaccurateBooleanWeight2, the code kept returning no results at all. Only after a lot of painful debugging did I find that constructing a BooleanWeight2 is not enough: you also have to obtain sum and norm and then call normalize, which smells a bit. Next, the code gets an InaccurateBooleanScorer2 from the InaccurateBooleanWeight2 and calls getInaccurateTopAggregation to run the search. The ascending parameter has no effect here. The reason is fairly involved: I introduced ascending to control the direction in which Lucene scans the index (docID low to high, or high to low), but I later changed the way the index is built and no longer needed it. I left the parameter in the interface for future use, so it can be switched on if lucene-core ever supports scanning the index in both directions.

2. InaccurateHits

Calling search on InaccurateIndexSearcher really calls new InaccurateHits(this, query, null, sort, ascending). getMoreDocs then calls back into the newly written search method. The code:

    ...
    TopDocs topDocs = (sort == null)
            ? searcher.search(weight, filter, n, ascending)
            : searcher.search(weight, filter, n, sort, ascending);
    length = topDocs.totalHits;
    InaccurateResultAggregation inaccurateResultAggregation =
            searcher.getInaccurateResultAggregation();
    if (inaccurateResultAggregation == null) {
        totalLength = length;
    } else {
        accurate = inaccurateResultAggregation.isAccurate();
        if (inaccurateResultAggregation.isAccurate()) {
            totalLength = inaccurateResultAggregation.getNumberOfRecordsFound();
        } else {
            int maxDocID = searcher.maxDoc();
            // guess how many records there are, rounded up to the next thousand
            totalLength = 1000 * ((int) Math.ceil(
                    (0.001 * maxDocID
                            / (inaccurateResultAggregation.getLastDocID() + 1)
                            * inaccurateResultAggregation.getNumberOfRecordsFetched())));
        }
    }
    ...

There is nothing special in this code apart from the algorithm that guesses the total number of records. Lucene scans from low docID to high, and as described last time the scan breaks out partway through, so the ratio between the last docID scanned (lastDocID) and maxDocID can be used to guess how many records there are in total. The guess is not very precise, but it gets the order of magnitude right. Users can generally only see the first 1000 records anyway, so the exact total hardly matters to them.

3. InaccurateBooleanWeight2

There is not much to say about InaccurateBooleanWeight2; it is just a springboard for getting hold of an InaccurateBooleanScorer2.

4. InaccurateBooleanScorer2

All of the code in InaccurateBooleanScorer2 comes from BooleanScorer2. Since BooleanScorer2 was not designed to be subclassed, I had to start a separate class; bad smell again. I did not modify any of the code copied from BooleanScorer2 and only added getMaxNumberOfDocs, getInaccurateTopAggregation and getAccurateBottomAggregation. getInaccurateTopAggregation breaks out as soon as maxNumberOfDocs documents have been scanned, so its results are somewhat inaccurate. getAccurateBottomAggregation always keeps the last maxNumberOfDocs results, so its results are also somewhat inaccurate, but its statistics are accurate because it always walks through every match in the index. It follows from this difference that getAccurateBottomAggregation is somewhat slower; you cannot have both accuracy and performance.

    public InaccurateResultAggregation getInaccurateTopAggregation(
            HitCollector hc, boolean ascending) throws IOException {
        // DeltaTime dt = new DeltaTime();
        if (countingSumScorer == null) {
            initCountingSumScorer();
        }
        int lastDocID = 0;
        boolean reachedTheEnd = true;
        int numberOfRecordsFetched = 0;
        while (countingSumScorer.next()) {
            lastDocID = countingSumScorer.doc();
            float score = score();
            hc.collect(lastDocID, score);
            numberOfRecordsFetched++;
            if (numberOfRecordsFetched >= maxNumberOfDocs) {
                // stop early; remember whether any matches were left behind
                reachedTheEnd = !countingSumScorer.next();
                break;
            }
        }
        // System.out.println(dt.getTimeElasped());
        /*
         * This method might cast the rest away. So it might be inaccurate.
         */
        return new InaccurateResultAggregation(lastDocID, ascending,
                reachedTheEnd, numberOfRecordsFetched, numberOfRecordsFetched);
    }

    public InaccurateResultAggregation getAccurateBottomAggregation(
            HitCollector hc, boolean ascending) throws IOException {
        // DeltaTime dt = new DeltaTime();
        if (countingSumScorer == null) {
            initCountingSumScorer();
        }
        LinkedList<ResultNode> resultNodes = new LinkedList<ResultNode>();
        boolean isFull = false;
        int lastDocID = 0;
        int index = 0;
        int numberOfRecordsFound = 0;
        while (countingSumScorer.next()) {
            lastDocID = countingSumScorer.doc();
            float score = score();
            resultNodes.add(new ResultNode(lastDocID, score));
            if (isFull) {
                // keep only the most recent maxNumberOfDocs results
                resultNodes.removeFirst();
            }
            index++;
            numberOfRecordsFound++;
            if (index >= maxNumberOfDocs) {
                isFull = true;
                index = 0;
                // break;
            }
        }
        for (ResultNode resultNode : resultNodes) {
            hc.collect(resultNode.getDoc(), resultNode.getScore());
        }
        // System.out.println(dt.getTimeElasped());
        /*
         * Since this method runs a full scan against all matched docs,
         * it is accurate.
         */
        return new InaccurateResultAggregation(lastDocID, ascending, true,
                resultNodes.size(), numberOfRecordsFound);
    }

IX. Summary

The code has been packaged and uploaded, with my brief comments in it. How to call it is described in readme.txt; you only need to replace a few lines. In short:

1. Replace
    Searcher searcher = new IndexSearcher(reader);
with
    InaccurateIndexSearcher searcher = new InaccurateIndexSearcher(reader, 5000);

2. Replace
    Hits hits = searcher.search(query);
with
    InaccurateHits hits = searcher.search(query, sort, ascending);

That is all. Everyone is welcome to try it out. If you make improvements, please open-source the improved code as well, so that we can learn from and push each other. Because the code has bad smells in several places, I really don't have the face to go and announce it to the Lucene developers.

Attachment: inaccurate.rar (13 KB), description: Lucene Extension
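Editor's sketch: the early exit in getInaccurateTopAggregation and the total-count guess in InaccurateHits can be tried out without Lucene. The class below is only an illustration under stated assumptions: InaccurateScanSketch, Aggregation, scanTop and estimateTotal are hypothetical names, a sorted int array stands in for the scorer's stream of matching docIDs, and reachedTheEnd is assumed to play the role of isAccurate(). It is not the code from inaccurate.rar.

```java
import java.util.ArrayList;
import java.util.List;

/** Lucene-free sketch: early-exit scan plus the total-count guess. */
public class InaccurateScanSketch {

    /** Hypothetical stand-in for InaccurateResultAggregation. */
    static final class Aggregation {
        final int lastDocID;          // last docID visited before stopping
        final boolean reachedTheEnd;  // true if no matches were left behind
        final int fetched;            // number of matches actually collected

        Aggregation(int lastDocID, boolean reachedTheEnd, int fetched) {
            this.lastDocID = lastDocID;
            this.reachedTheEnd = reachedTheEnd;
            this.fetched = fetched;
        }
    }

    /** Collects matches in ascending docID order and stops after maxNumberOfDocs. */
    static Aggregation scanTop(int[] matchingDocIDs, int maxNumberOfDocs,
                               List<Integer> collector) {
        int lastDocID = 0;
        boolean reachedTheEnd = true;
        int fetched = 0;
        for (int i = 0; i < matchingDocIDs.length; i++) {
            lastDocID = matchingDocIDs[i];
            collector.add(lastDocID);
            fetched++;
            if (fetched >= maxNumberOfDocs) {
                // Stop early, but note whether anything was left unscanned.
                reachedTheEnd = (i + 1 >= matchingDocIDs.length);
                break;
            }
        }
        return new Aggregation(lastDocID, reachedTheEnd, fetched);
    }

    /** Same arithmetic as the post: extrapolate, then round up to a multiple of 1000. */
    static int estimateTotal(Aggregation a, int maxDoc) {
        if (a.reachedTheEnd) {
            return a.fetched; // the scan saw every match, so the count is exact
        }
        return 1000 * (int) Math.ceil(0.001 * maxDoc / (a.lastDocID + 1) * a.fetched);
    }

    public static void main(String[] args) {
        int maxDoc = 3000000;
        int[] matches = new int[maxDoc / 3]; // every third document matches
        for (int i = 0; i < matches.length; i++) {
            matches[i] = 3 * i;
        }
        List<Integer> collector = new ArrayList<Integer>();
        Aggregation a = scanTop(matches, 5000, collector);
        System.out.println("fetched=" + a.fetched + " lastDocID=" + a.lastDocID);
        System.out.println("estimated total=" + estimateTotal(a, maxDoc)); // 1001000
    }
}
```

With one million evenly spread matches and a cap of 5000, the scan stops at docID 14997 and the guess comes out at 1,001,000. The extrapolation only works this well when matches are spread fairly evenly over the docID range; if they cluster at low or high docIDs, the guess drifts accordingly.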

caocao (from Shanghai), posted 2007-07-06

I have upgraded to 2.2. With 2.2 this modified package has no effect at all, because BooleanWeight2 no longer exists and only BooleanWeight is left. Just change weight.getClass().getSimpleName().equals("BooleanWeight2") to weight.getClass().getSimpleName().equals("BooleanWeight") and it works again.

While I'm at it, shame on IT168 for reposting my articles without crediting the source. The links:
http://tech.it168.com/j/2007-06-04/200706040903078.shtml
http://tech.it168.com/j/2007-06-04/200706040939156.shtml
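Editor's sketch: one way to avoid editing the string on every upgrade would be a small helper that accepts both class names. This is only an assumption about how one might structure the check, not part of the uploaded package; WeightNameCheck and isBooleanWeight are hypothetical names, and the three nested classes are empty stand-ins so that the example runs without lucene-core on the classpath.

```java
/** Sketch: a class-name check that accepts both the old and the 2.2 name. */
public class WeightNameCheck {

    /** True if the object's simple class name is BooleanWeight2 or BooleanWeight. */
    static boolean isBooleanWeight(Object weight) {
        String name = weight.getClass().getSimpleName();
        return "BooleanWeight2".equals(name) || "BooleanWeight".equals(name);
    }

    // Hypothetical stand-ins for the real (hidden) Lucene weight classes.
    static class BooleanWeight2 {}
    static class BooleanWeight {}
    static class TermWeight {}

    public static void main(String[] args) {
        System.out.println(isBooleanWeight(new BooleanWeight2())); // true
        System.out.println(isBooleanWeight(new BooleanWeight()));  // true
        System.out.println(isBooleanWeight(new TermWeight()));     // false
    }
}
```

In the real search method the argument would be the Weight passed in. Any weight whose class name does not match still falls through to the regular search path, exactly as in the original code.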

regedit (from Guangzhou), posted 2007-07-06

Boss, how do you change the way Lucene scans the index??? I want it to scan starting from the highest ID...

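bluepoint (from Shanghai), posted 2007-09-25

With Lucene 2.2, a single-term search (e.g. 中国) seems to produce a TermQuery by default. In that case weight.getClass().getSimpleName().equals("BooleanWeight") is false, so the hack is never used. Is there any way to handle this? A multi-term search (e.g. 中国上海) does work.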
