Posted: 2007-05-15
http://www.caocao.name, http://www.caocao.mobi
Author: caocao (网络隐士, "the Net Hermit")
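Please credit the source when reposting: http://www.iteye.com/topic/80073

This picks up from the previous post (http://www.iteye.com/topic/78884). Last time I covered the general idea; this time we get to the code.

V. Principles

1. Do not modify lucene-core
Patching lucene-core at will is bad practice and causes a lot of trouble for later maintenance and upgrades. If the need really is that pressing, it is better to join the Lucene developers and contribute there. "So why doesn't the Hermit go?" you ask. Well, the code is rather ugly and I am too embarrassed to show it to them; details below.

2. Do not change the Lucene index file format
Same reasoning as above.

3. Replace as few of the regular search interfaces as possible
This makes it easy to switch back and forth between standard search and this search, and keeps the cost of changing and maintaining code low.

4. Naming convention
Every added class name starts with Inaccurate; everything else follows Lucene's naming conventions.

VI. Limitations

1. I only wrote a replacement for BooleanWeight2. If the Weight is not a BooleanWeight2, the search behaves exactly like a regular search.
2. If the result set is no larger than the maximum allowed result set, the search behaves exactly like a regular search.

VII. Files

org.apache.lucene.search
    InaccurateBooleanScorer2.java     // replacement for BooleanScorer2
    InaccurateBooleanWeight2.java     // replacement for BooleanWeight2
    InaccurateHit.java                // replacement for Hit
    InaccurateHitIterator.java        // replacement for HitIterator
    InaccurateHits.java               // replacement for Hits
    InaccurateIndexSearcher.java      // replacement for IndexSearcher
org.apache.lucene.util
    InaccurateResultAggregation.java  // value object holding the search statistics

VIII. In practice

1. InaccurateIndexSearcher
InaccurateIndexSearcher extends IndexSearcher. The structure is simple: it adds two member variables, maxNumberOfDocs and inaccurateResultAggregation, plus a few methods. Here comes the ugly part:

```java
public void search(Weight weight, Filter filter, final HitCollector results,
        boolean ascending) throws IOException {
    ... // elided by the author; `collector`, used below, is set up here
    if (weight.getClass().getSimpleName().equals("BooleanWeight2")) { // hook BooleanWeight2
        InaccurateBooleanWeight2 inaccurateBooleanWeight2 =
                new InaccurateBooleanWeight2(this, weight.getQuery());
        // The replacement weight must be normalized by hand before use.
        float sum = inaccurateBooleanWeight2.sumOfSquaredWeights();
        float norm = this.getSimilarity().queryNorm(sum);
        inaccurateBooleanWeight2.normalize(norm); // bad smell
        InaccurateBooleanScorer2 inaccurateBooleanScorer2 =
                inaccurateBooleanWeight2.getInaccurateBooleanScorer2(reader, maxNumberOfDocs);
        if (inaccurateBooleanScorer2 != null) {
            inaccurateResultAggregation =
                    inaccurateBooleanScorer2.getInaccurateTopAggregation(collector, ascending);
        }
    } else {
        // Any other Weight: fall back to the regular search path.
        Scorer scorer = weight.scorer(reader);
        if (scorer != null) {
            scorer.score(collector);
        }
    }
    ...
}
```

lucene-core hides BooleanWeight2, so not even instanceof can be used. I had to resort to the ugly weight.getClass().getSimpleName().equals("BooleanWeight2").

After swapping BooleanWeight2 for InaccurateBooleanWeight2, the search kept returning nothing. Only after a painful debugging session did I find that constructing the weight is not enough: you have to compute sum and norm and then call normalize. A bit of a bad smell.

Next, the InaccurateBooleanScorer2 is obtained from InaccurateBooleanWeight2 and getInaccurateTopAggregation runs the search. ascending has no effect here, for a rather involved reason. I introduced ascending to control the direction in which Lucene scans the index (docID small to large, or large to small). Later I changed how the index is built and no longer needed it, so the parameter stays only as a hook for the future, in case lucene-core ever supports scanning in both directions.

2. InaccurateHits
Calling search on InaccurateIndexSearcher really just does new InaccurateHits(this, query, null, sort, ascending). getMoreDocs then calls back into the newly written search methods. The code:

```java
...
TopDocs topDocs = (sort == null)
        ? searcher.search(weight, filter, n, ascending)
        : searcher.search(weight, filter, n, sort, ascending);
length = topDocs.totalHits;
InaccurateResultAggregation inaccurateResultAggregation =
        searcher.getInaccurateResultAggregation();
if (inaccurateResultAggregation == null) {
    // Regular search path: the hit count is exact.
    totalLength = length;
} else {
    accurate = inaccurateResultAggregation.isAccurate();
    if (inaccurateResultAggregation.isAccurate()) {
        totalLength = inaccurateResultAggregation.getNumberOfRecordsFound();
    } else {
        int maxDocID = searcher.maxDoc();
        // Guess how many records there are, rounded up to a multiple of 1000.
        totalLength = 1000 * ((int) Math.ceil(
                0.001 * maxDocID / (inaccurateResultAggregation.getLastDocID() + 1)
                        * inaccurateResultAggregation.getNumberOfRecordsFetched()));
    }
}
...
```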
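The total-count guess in the InaccurateHits snippet above can be pulled out into a standalone method to see what it does with concrete numbers. This is only a sketch: the class and method names (TotalHitsEstimate, estimateTotal) are made up for illustration and are not part of the uploaded package, and the extrapolation assumes matches are spread roughly evenly over the docID range.

```java
// Hypothetical helper, not part of the author's package.
class TotalHitsEstimate {

    /**
     * Extrapolates the total number of matches from a partial scan.
     *
     * @param maxDocID  upper bound on docIDs in the index (searcher.maxDoc())
     * @param lastDocID docID of the last match seen before the scan stopped
     * @param fetched   number of matches collected before the scan stopped
     * @return the estimate, rounded up to a multiple of 1000
     */
    static int estimateTotal(int maxDocID, int lastDocID, int fetched) {
        // Same arithmetic as the snippet: scale the fetched count by the
        // inverse of the scanned fraction (lastDocID + 1) / maxDocID.
        return 1000 * (int) Math.ceil(0.001 * maxDocID / (lastDocID + 1) * fetched);
    }

    public static void main(String[] args) {
        // 5000 matches found in the first 300000 of 1000000 docIDs:
        // 5000 * (1000000 / 300000) is about 16667, rounded up to 17000.
        System.out.println(estimateTotal(1_000_000, 299_999, 5_000)); // prints 17000
    }
}
```

Rounding up to a multiple of 1000 makes it visible to the user that the number is an approximation rather than an exact count.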
Nothing special in this code apart from the algorithm that guesses the total number of records. Lucene scans from small docIDs to large ones, and as explained last time the scan bails out partway through. The ratio of the last scanned lastDocID to maxDocID therefore gives an estimate of the total number of records, on the assumption that matches are spread roughly evenly over the docID range. It is not very precise, but it gets the order of magnitude right. Ordinary users only ever see the first 1000 records anyway, so the exact total means little to them.

3. InaccurateBooleanWeight2
Nothing much to say: it is just a springboard for getting an InaccurateBooleanScorer2.

4. InaccurateBooleanScorer2
All of its code comes from BooleanScorer2. BooleanScorer2 was not designed to be subclassed, so I had to start over in a separate class. Bad smell again. I did not modify any of the code taken from BooleanScorer2; I only added getMaxNumberOfDocs, getInaccurateTopAggregation and getAccurateBottomAggregation.

getInaccurateTopAggregation bails out as soon as it has collected maxNumberOfDocs matches, so its results are somewhat inaccurate. getAccurateBottomAggregation always keeps the last maxNumberOfDocs results; the result list is also somewhat inaccurate, but the statistics are exact because every run walks through all matching docs. From that difference it follows that getAccurateBottomAggregation is slower. You cannot have both accuracy and performance.

```java
// Both methods live in InaccurateBooleanScorer2 and need
// java.io.IOException and java.util.LinkedList.

public InaccurateResultAggregation getInaccurateTopAggregation(
        HitCollector hc, boolean ascending) throws IOException {
    // DeltaTime dt = new DeltaTime();
    if (countingSumScorer == null) {
        initCountingSumScorer();
    }
    int lastDocID = 0;
    boolean reachedTheEnd = true;
    int numberOfRecordsFetched = 0;
    while (countingSumScorer.next()) {
        lastDocID = countingSumScorer.doc();
        float score = score();
        hc.collect(lastDocID, score);
        numberOfRecordsFetched++;
        if (numberOfRecordsFetched >= maxNumberOfDocs) {
            // Peek one step ahead to learn whether anything was left unscanned.
            reachedTheEnd = !countingSumScorer.next();
            break;
        }
    }
    // System.out.println(dt.getTimeElasped());
    /*
     * This method may discard the remaining matches, so the result can be
     * inaccurate.
     */
    return new InaccurateResultAggregation(lastDocID, ascending,
            reachedTheEnd, numberOfRecordsFetched, numberOfRecordsFetched);
}

public InaccurateResultAggregation getAccurateBottomAggregation(
        HitCollector hc, boolean ascending) throws IOException {
    // DeltaTime dt = new DeltaTime();
    if (countingSumScorer == null) {
        initCountingSumScorer();
    }
    LinkedList<ResultNode> resultNodes = new LinkedList<ResultNode>();
    boolean isFull = false;
    int lastDocID = 0;
    int index = 0;
    int numberOfRecordsFound = 0;
    while (countingSumScorer.next()) {
        lastDocID = countingSumScorer.doc();
        float score = score();
        resultNodes.add(new ResultNode(lastDocID, score));
        if (isFull) {
            // Window is full: drop the oldest entry so only the last
            // maxNumberOfDocs matches are kept.
            resultNodes.removeFirst();
        }
        index++;
        numberOfRecordsFound++;
        if (index >= maxNumberOfDocs) {
            isFull = true;
            index = 0;
            // break;
        }
    }
    for (ResultNode resultNode : resultNodes) {
        hc.collect(resultNode.getDoc(), resultNode.getScore());
    }
    // System.out.println(dt.getTimeElasped());
    /*
     * This method runs a full scan over all matching docs, so the
     * statistics are accurate.
     */
    return new InaccurateResultAggregation(lastDocID, ascending, true,
            resultNodes.size(), numberOfRecordsFound);
}
```

IX. Summary

The code has been packaged and uploaded with my brief comments. Usage is described in readme.txt; only a few lines need replacing. In short:

1. Replace Searcher searcher = new IndexSearcher(reader); with InaccurateIndexSearcher searcher = new InaccurateIndexSearcher(reader, 5000);
2. Replace Hits hits = searcher.search(query); with InaccurateHits hits = searcher.search(query, sort, ascending);

and that is it. Everyone is welcome to try it out. If you make improvements, please open-source the improved code as well, so we can learn from and push each other.

Because the code has several bad smells, I really cannot bring myself to announce it to the Lucene developers.
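To make the difference between the two aggregation methods in section VIII.4 concrete, here is a minimal sketch that runs the same two strategies over a plain Iterator of docIDs instead of a Lucene scorer. Everything in it (AggregationSketch, Result, top, bottom) is a hypothetical stand-in, not the code from the package; scores are left out, and an ArrayDeque replaces the LinkedList plus isFull flag of the original.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.Iterator;
import java.util.List;

// Hypothetical stand-ins for illustration only.
class AggregationSketch {

    /** Simplified stand-in for InaccurateResultAggregation. */
    static final class Result {
        final List<Integer> docs;      // docIDs handed to the collector
        final int lastDocID;           // last docID the scan looked at
        final boolean reachedTheEnd;   // true if no match was left unscanned
        final int found;               // number of matches counted

        Result(List<Integer> docs, int lastDocID, boolean reachedTheEnd, int found) {
            this.docs = docs;
            this.lastDocID = lastDocID;
            this.reachedTheEnd = reachedTheEnd;
            this.found = found;
        }
    }

    /** Early exit: keep the first max matches, then stop scanning. */
    static Result top(Iterator<Integer> matches, int max) {
        List<Integer> docs = new ArrayList<>();
        int last = 0;
        boolean reachedTheEnd = true;
        while (matches.hasNext()) {
            last = matches.next();
            docs.add(last);
            if (docs.size() >= max) {
                reachedTheEnd = !matches.hasNext();
                break;
            }
        }
        return new Result(docs, last, reachedTheEnd, docs.size());
    }

    /** Full scan: keep only the last max matches, but count every match. */
    static Result bottom(Iterator<Integer> matches, int max) {
        Deque<Integer> window = new ArrayDeque<>();
        int last = 0;
        int found = 0;
        while (matches.hasNext()) {
            last = matches.next();
            window.addLast(last);
            if (window.size() > max) {
                window.removeFirst(); // slide the window forward
            }
            found++;
        }
        return new Result(new ArrayList<>(window), last, true, found);
    }

    public static void main(String[] args) {
        List<Integer> matches = Arrays.asList(2, 5, 7, 11, 13);
        Result t = top(matches.iterator(), 3);
        Result b = bottom(matches.iterator(), 3);
        System.out.println(t.docs + " end=" + t.reachedTheEnd + " found=" + t.found);
        System.out.println(b.docs + " end=" + b.reachedTheEnd + " found=" + b.found);
    }
}
```

With five matches and a limit of three, top stops after docID 7 and reports that the scan did not reach the end, while bottom walks all five matches, keeps the last three and reports an exact count of five. That extra walk is the performance cost described above.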
Posted: 2007-07-06
I have upgraded to 2.2. Under 2.2 this modified package has no effect as it stands, because BooleanWeight2 no longer exists and only BooleanWeight is left. Just change weight.getClass().getSimpleName().equals("BooleanWeight2") to weight.getClass().getSimpleName().equals("BooleanWeight").
While I am at it, shame on IT168 for reposting my articles without crediting the source. The links: http://tech.it168.com/j/2007-06-04/200706040903078.shtml http://tech.it168.com/j/2007-06-04/200706040939156.shtml
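The class-name check discussed in the 2.2 upgrade note can be written so that it accepts both spellings, which avoids editing the string again when moving between versions. This is only a sketch under assumptions: WeightNameCheck, isHookable and the dummy nested classes are hypothetical and stand in for Lucene's hidden weight classes, and whether accepting both names is safe on a given Lucene version still has to be checked against that version's BooleanQuery.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper, not part of the author's package.
class WeightNameCheck {

    // Simple class names of the weights the inaccurate search should hook.
    private static final Set<String> HOOKED_NAMES =
            new HashSet<>(Arrays.asList("BooleanWeight2", "BooleanWeight"));

    /** Returns true if the object's simple class name is one of the hooked names. */
    static boolean isHookable(Object weight) {
        return weight != null
                && HOOKED_NAMES.contains(weight.getClass().getSimpleName());
    }

    // Dummy classes used only to exercise the check without Lucene on the classpath.
    static class BooleanWeight {}
    static class BooleanWeight2 {}
    static class TermWeight {}

    public static void main(String[] args) {
        System.out.println(isHookable(new BooleanWeight2())); // true
        System.out.println(isHookable(new BooleanWeight()));  // true
        System.out.println(isHookable(new TermWeight()));     // false
    }
}
```

Matching on the simple name stays as fragile as the original check: any unrelated class that happens to carry one of these names would be hooked too.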
Posted: 2007-07-06
Boss, how do I change the way Lucene scans the index?
I would like to scan starting from the largest ID...
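Posted: 2007-09-25
Lucene 2.2
For a single-term search (e.g. 中国), the query seems to default to a TermQuery, so weight.getClass().getSimpleName().equals("BooleanWeight") is false and the hook is never used. Is there any way to handle this? For a multi-term search (e.g. 中国 上海) it works.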