A Study of Lucene's Scoring Mechanism

Tonyguxu


Category: Search Engine

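The problem

Currently, query results with a score below 1 are filtered out.

This article answers the following questions:

What is Lucene's scoring mechanism? What does a score below or above 1 indicate? Can a score below 1 be taken to mean a partial match of the query and a score above 1 a complete match?

Is it reasonable to filter results on a score below 1? Does it create new problems?

How does appending "*" to a number affect scoring?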

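A brief introduction to Lucene's scoring mechanism

Lucene's scoring combines the Boolean model with the Vector Space Model (VSM).

It uses the Boolean model to first narrow down the documents that need to be scored based on the use of boolean logic in the Query specification (see Appendix 1).

At query time, Lucene first applies the Boolean model: the boolean logic in the query (AND, OR, NOT) narrows down the set of documents to be scored, a step that involves merging posting lists. In the log search application, the occur of every BooleanClause has already been set to MUST, so the posting lists of all terms are intersected.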

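The VSM then scores the documents returned by that step. Lucene uses tf-idf to measure how relevant a document is to the query, with its own refinements to the VSM scoring. The formula Lucene actually uses is:

$$ score(q,d) = coord(q,d) \cdot queryNorm(q) \cdot \sum_{t\ in\ q} \left( tf(t\ in\ d) \cdot idf(t)^2 \cdot t.getBoost() \cdot norm(t,d) \right) $$

For an analysis of each factor in the formula, see Appendix 2, the Lucene 3.4 DefaultSimilarity scoring formula.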

The purpose of scoring is ranking, and the purpose of ranking is better search quality: the results the user most wants appear first.

Tests

Test 1

Time: 20120110

Total amount: 23,366,081

Index file size: 2.86 GB

No.	Query	Time	Total hits	Score	Similarity
1.1	SPTOGW	20120110	4,808,914	0.322 to 0.564	DefaultSimilarity
1.2	"恭喜您获得 2 个福分，欢迎关注更多精彩互动"	20120110	13,190	5.834 to 7.001
1.3	"恭喜您获得 2 个福分，欢迎关注更多精彩互动" AND SPTOGW	20120110	13,190	5.838 to 7.005
1.4	恭喜您获得 2 个福分，欢迎关注更多精彩互动	20120110	13,190	1.596 to 1.916

Test 2

No.	Query	Time	Total hits	Score	Similarity
2.1	1065800715	20120315	1,297,241	0.580 to 0.677
2.2	1065800715 AND 106580071517	20120315	14,268	1.283 to 1.796
2.3	1065800715 AND 106580071517 AND 6341914	20120315	4	2.749
2.4	GWTOSP	20120315	5,116,587	0.323 to 0.430
2.5	GWTOSP AND 1065800715	20120315	1,297,241	0.664 to 0.775
2.6	1065800715 AND 106580071517 AND GWTOSP	20120315	14,268	1.311 to 1.835
2.7	1065800715 AND 106580071517 AND GWTOSP	20120315	14,268	0.270 to 0.378	custom Similarity (idf=1.0)
2.8	"暂无法接收您刚才发送的短信"	20120315	217,555	3.934 to 5.901
2.9	暂无法接收您刚才发送的短信	20120315	249,941	1.133 to 1.699
2.10	"暂无法接收您刚才发送的短信"	20120315	217,555	1.125 to 1.687	custom Similarity (idf=1.0)
2.11	暂无法接收您刚才发送的短信	20120315	249,941	0.25 to 0.375	custom Similarity (idf=1.0)
2.12	"暂无法接收您刚才发送的短信" AND 1065800711	20120315	73,527	2.960 to 5.920
2.13	1065800711	20120315	2,219,772	0.479 to 0.639
2.14	1065800715*	20120315	1,297,241	1.0
2.15	13932066196	20120315	2	2.511 to 3.013
2.16	1065800715 13932066196	20120315	2	2.557 to 3.069

Test 3

No.	Query	Time	Total hits	Score
3.1	SPTOGW	20120110	4,808,914	0.322 to 0.564
3.2	"恭喜您获得 2 个福分，欢迎关注更多精彩互动"	20120110	13,190	5.834 to 7.001
3.3	"恭喜您获得 2 个福分，欢迎关注更多精彩互动" OR SPTOGW	20120110	4,808,914	0.005 to 7.005
3.4	恭喜您获得 2 个福分，欢迎关注更多精彩互动	20120110	153,826	0.024 to 1.916
3.5	1065800715	20120315	1,297,241	0.580 to 0.677
3.6	1065800715*	20120315	0.58049536	1.0
3.7	13932066196	20120315	2	2.511 to 3.013
3.8	1065800715 OR 13932066196	20120315	1,297,241	0.054 to 3.069
3.9	"暂无法接收您刚才发送的短信" OR 1065800711		2,363,800	1.960 to 5.920
3.10	暂无法接收您刚才发送的短信 OR 1065800711		2,623,001	0.818 to 1.766

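Analysis of the test results

Q: What does a score below or above 1 indicate? Can a score below 1 be taken to mean a partial match of the query and a score above 1 a complete match?

A: Lucene's scoring does not treat 1 as a threshold that separates degrees of match. A score can be any number greater than 0. The higher a document's score, the more relevant it is to the query and the more likely it is what the user wants. Scores exist to rank the results of a single query: Lucene places higher-scoring results first and lower-scoring ones last.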

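To illustrate relevance: suppose two articles both contain the word lucene. The first has 10 words in total and lucene appears 5 times; the second has 1000 words and lucene also appears 5 times. Under the default notion of relevance, the first article is more relevant than the second, and it receives a higher relevance score.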
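The two-article example can be worked through numerically. The sketch below assumes the DefaultSimilarity formulas tf = sqrt(frequency) and lengthNorm = 1/sqrt(numTerms), leaves out idf (it is identical for both articles) and ignores the lossy one-byte storage of norms. The class and method names are made up for illustration.

```java
// Illustrative only: compares the tf * lengthNorm part of the score
// for the short and the long article from the example.
class RelevanceExample {
    // tf: square root of how often the term occurs in the document
    static double tf(int frequency) {
        return Math.sqrt(frequency);
    }

    // lengthNorm: shorter fields get a larger value
    static double lengthNorm(int numTerms) {
        return 1.0 / Math.sqrt(numTerms);
    }

    // idf is the same for both documents, so it is omitted here
    static double partialScore(int frequency, int numTerms) {
        return tf(frequency) * lengthNorm(numTerms);
    }

    public static void main(String[] args) {
        System.out.printf("10-word article:   %.4f%n", partialScore(5, 10));
        System.out.printf("1000-word article: %.4f%n", partialScore(5, 1000));
    }
}
```

Under these assumptions the short article gets roughly 0.71 and the long one roughly 0.07, a tenfold difference that comes entirely from lengthNorm.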

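>> Tests 1.1, 2.1 and 2.5 show that with Lucene's default Similarity class, even results that fully match the query can score below 1.

In the custom Similarity, idf is set to 1, so idf drops out of the tf-idf computation; the relevance score is then unaffected by how many documents contain the term.

>> Test 2.7 shows that even with idf ignored, the score can still be below 1.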
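A rough check of test 2.7, under stated assumptions: each of the three terms occurs once in the document (tf = 1), all boosts are 1, idf is forced to 1 by the custom Similarity, and the fieldNorm values 0.15625 and 0.21875 are borrowed from the explain output elsewhere in this post rather than measured for this query. The class name is hypothetical.

```java
// Illustrative only: score of an all-MUST BooleanQuery when idf = 1.
class IdfOneExample {
    static double score(int numTerms, double fieldNorm) {
        double idf = 1.0;   // custom Similarity ignores document frequency
        double tf = 1.0;    // assumption: each term occurs once
        // queryNorm = 1 / sqrt(sum of squared term weights), each weight = idf * boost
        double queryNorm = 1.0 / Math.sqrt(numTerms * idf * idf);
        // per-term contribution = queryWeight * fieldWeight
        double perTerm = (idf * queryNorm) * (tf * idf * fieldNorm);
        double coord = 1.0; // every clause matched
        return coord * numTerms * perTerm;
    }

    public static void main(String[] args) {
        System.out.printf("fieldNorm 0.15625 -> %.4f%n", score(3, 0.15625));
        System.out.printf("fieldNorm 0.21875 -> %.4f%n", score(3, 0.21875));
    }
}
```

With these inputs the score works out to sqrt(3) × fieldNorm, about 0.27 and 0.38, which is consistent with the 0.270 to 0.378 range reported for test 2.7: with idf out of the picture, fieldNorm alone keeps the score below 1.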

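Q: How does appending "*" to a number affect scoring, and why do all results score 1?

A: Lucene skips the relevance computation (mainly tf-idf) for such a query. The explain output reports it as a ConstantScore query whose score is boost × queryNorm. The 1 is the default boost; when this query is the only clause, queryNorm is also 1.0, so every hit scores 1.0.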

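Q: Is it reasonable to filter results on a score below 1? Does it create new problems?

A: In tests 1.1, 2.1 and 2.5 every returned result scores below 1, so filtering on score < 1 would return nothing, which is clearly wrong.

In tests 3.3, 3.4, 3.8 and 3.10 the boolean logic in the query is OR. The top-ranked (higher-scoring) results contain more of the query's terms, while results further down often match only one term. If the desired results are those that match every term, filtering on score < 1 appears to achieve this, but only by coincidence. Boolean AND should be used to guarantee that results match all terms.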

Q: Why do the two results in test 2.15 have different scores?

A: Explanation for result 1:

3.013709 = (MATCH) fieldWeight(content:13932066196 in 18137), product of:

1.0 = tf(termFreq(content:13932066196)=1)

16.073114 = idf(docFreq=2, maxDocs=10550949)

0.1875 = fieldNorm(field=content, doc=18137)

Explanation for result 2:

2.511424 = (MATCH) fieldWeight(content:13932066196 in 24885), product of:

1.0 = tf(termFreq(content:13932066196)=1)

16.073114 = idf(docFreq=2, maxDocs=10550949)

0.15625 = fieldNorm(field=content, doc=24885)

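The fieldNorm values differ. fieldNorm is the field's normalization value, which reflects the number of terms the field contains; it is computed at index time and stored in the index norms. Shorter fields (fewer tokens) receive a larger weight. Inspecting the two results confirms that the content field of result 1 is shorter than that of result 2.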
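The fieldNorm values in these explanations (0.1875, 0.15625, 0.21875) are suspiciously round. A likely reason, assuming Lucene 3.x stores each norm in a single byte with a 3-bit mantissa (as SmallFloat.floatToByte315 does), is that 1/sqrt(numTerms) gets rounded down to the nearest representable value. The sketch below re-implements that encoding with the JDK only; the token counts 20, 25 and 36 are hypothetical, since the post does not give the real field lengths, and boosts are assumed to be 1.

```java
// Illustrative only: a stand-alone approximation of the one-byte norm encoding.
class NormQuantization {
    // Keep 3 mantissa bits and 5 exponent bits (zero exponent point 15).
    static byte encode(float f) {
        int bits = Float.floatToRawIntBits(f);
        int small = bits >> (24 - 3);
        if (small <= ((63 - 15) << 3)) {
            return (bits <= 0) ? (byte) 0 : (byte) 1; // underflow
        }
        if (small >= ((63 - 15) << 3) + 0x100) {
            return -1;                                // overflow
        }
        return (byte) (small - ((63 - 15) << 3));
    }

    static float decode(byte b) {
        if (b == 0) {
            return 0.0f;
        }
        int bits = (b & 0xff) << (24 - 3);
        bits += (63 - 15) << 24;
        return Float.intBitsToFloat(bits);
    }

    // lengthNorm as it would come back from the index after the lossy round trip
    static float storedNorm(int numTerms) {
        float lengthNorm = (float) (1.0 / Math.sqrt(numTerms));
        return decode(encode(lengthNorm));
    }

    public static void main(String[] args) {
        for (int n : new int[] {20, 25, 36}) {
            System.out.println(n + " terms -> " + storedNorm(n));
        }
    }
}
```

Under these assumptions a 25-term field stores 0.1875 instead of the exact 0.2, and a 36-term field stores 0.15625, so many different field lengths collapse onto the same fieldNorm.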

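Requirements for broadcast log queries

1. Results must neither omit matches nor include extras ("extras" meaning results that do not match the query)

The boolean logic AND, OR, NOT determines whether a result matches all of the terms in the query or only some of them.

The occur of every BooleanClause in a BooleanQuery is currently set to MUST: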

// Make every top-level clause required (AND semantics).
if (luceneQuery instanceof BooleanQuery) {
    BooleanQuery bQuery = (BooleanQuery) luceneQuery;
    List<BooleanClause> clauses = bQuery.clauses();
    for (BooleanClause clause : clauses) {
        clause.setOccur(Occur.MUST);
    }
}

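Because Lucene has many Query types besides BooleanQuery, such as TermRangeQuery and PhraseQuery, it is advisable to write AND, OR, NOT explicitly between the clauses of every query.

2. No need to sort by relevance

Broadcast log search is a search over logs, and its needs for ranking and scoring differ from those of general-purpose search engines (Google, Baidu). Lucene's default scoring and ranking strategy resembles Google's and is a simplification of what commercial search engines do.

If sorting is needed, a custom sort strategy can be used: sort by a field, which in our application can be the log time.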

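Why query results can score below 1

In the scoring formula,

$$ tf(t\ in\ d) = frequency^{1/2} $$

where frequency is the number of times the term occurs in the document being scored, so tf(t in d) is at least 1 for any matching document;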

$$ idf(t) = 1 + \log\left( \frac{numDocs}{docFreq + 1} \right) $$

idf represents the inverse of docFreq (the number of documents in which term t appears). By the formula above, idf is also at least 1 whenever docFreq < numDocs.

So which factor can push a result's score below 1?

Mainly lengthNorm, which is part of norm(t,d).

lengthNorm is "computed when the document is added to the index in accordance with the number of tokens of this field in the document, so that shorter fields contribute more to the score."

In the default similarity class, DefaultSimilarity, lengthNorm is computed as follows:

lengthNorm = (float) (1.0 / Math.sqrt(numTerms))

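By this formula lengthNorm is at most 1, and below 1 for any field with more than one term. It is usually lengthNorm that brings the final score below 1.

How can the lengthNorm formula be changed, or lengthNorm be ignored altogether (lengthNorm=1)?
In a custom CustomSimilarity implementation, override the following method defined in the abstract class Similarity:
public float computeNorm(String field, FieldInvertState state)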

To ignore lengthNorm, it can be written like this:

@Override
public float computeNorm(String field, FieldInvertState state) {
    // Treat every field as if it held a single term, so lengthNorm = 1.
    final int numTerms = 1;
    return state.getBoost() * numTerms;
}

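IndexWriter also has to be told to use the custom CustomSimilarity:
indexWriter.setSimilarity(Similarity similarity)
That method is deprecated in 3.4; use IndexWriterConfig.setSimilarity(Similarity similarity) instead. Because norms are computed at index time, the change only applies to documents indexed after it is made.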

The explanation for the results of query=1065800715 also confirms the conclusion above:

query=1065800715

0.6772446 = (MATCH) fieldWeight(content:1065800715 in 2200), product of:
1.0 = tf (termFreq(content:1065800715)=1)
3.0959754 = idf (docFreq=1297241, maxDocs=10550949)
0.21875 = fieldNorm (field=content, doc=2200)

0.6772446 = 1.0 * 3.0959754 * 0.21875

to
0.58049536 = (MATCH) fieldWeight(content:1065800715 in 1224258), product of:
1.0 = tf (termFreq(content:1065800715)=1)
3.0959754 = idf (docFreq=1297241, maxDocs=10550949)
0.1875 = fieldNorm (field=content, doc=1224258)

0.58049536 = 1.0 * 3.0959754 * 0.1875
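The numbers in these explanations can be reproduced from docFreq, maxDocs and fieldNorm alone. The sketch below is a stand-alone re-computation with the DefaultSimilarity formulas quoted in this post (tf = sqrt(freq), idf = ln(numDocs / (docFreq + 1)) + 1); it does not call Lucene, and the class name is hypothetical.

```java
// Illustrative only: recomputes fieldWeight = tf * idf * fieldNorm.
class FieldWeightCheck {
    static float tf(float freq) {
        return (float) Math.sqrt(freq);
    }

    static float idf(int docFreq, int numDocs) {
        return (float) (Math.log(numDocs / (double) (docFreq + 1)) + 1.0);
    }

    static float fieldWeight(float freq, int docFreq, int numDocs, float fieldNorm) {
        return tf(freq) * idf(docFreq, numDocs) * fieldNorm;
    }

    public static void main(String[] args) {
        int maxDocs = 10550949;
        // query=1065800715, docFreq=1297241
        System.out.println(fieldWeight(1, 1297241, maxDocs, 0.21875f));
        System.out.println(fieldWeight(1, 1297241, maxDocs, 0.1875f));
        // query=13932066196, docFreq=2
        System.out.println(fieldWeight(1, 2, maxDocs, 0.1875f));
        System.out.println(fieldWeight(1, 2, maxDocs, 0.15625f));
    }
}
```

The results agree with the explain output to about four decimal places (roughly 0.677, 0.580, 3.014 and 2.511). For the common term the idf of about 3.1 is too small to offset a fieldNorm near 0.2, while the rare term's idf of about 16 keeps the score well above 1.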

Appendix

1. Introduction to scoring in Lucene 3.4

http://lucene.apache.org/core/old_versioned_docs//versions/3_4_0/scoring.html

2. Lucene 3.4 DefaultSimilarity scoring formula

http://lucene.apache.org/core/old_versioned_docs//versions/3_4_0/api/core/index.html


2012-04-22 17:46
Comments (6)

#6 Tonyguxu 2012-04-25

Analyzing why results score below 1
1.query=1065800715* OR 106580071517
2.query=1065800715
Both queries return documents with scores below 1; the question is which factors cause this.

For query 1, it is due to queryNorm and coord.

For query 2, it is due to fieldNorm.

For how appending * to a query term (e.g. 1065800715*) affects scoring,
see http://nemogu.iteye.com/blog/1498262

#5 Tonyguxu 2012-04-25

query=13932066196

3.013709 = (MATCH) fieldWeight(content:13932066196 in 18137), product of:
1.0 = tf(termFreq(content:13932066196)=1)
16.073114 = idf(docFreq=2, maxDocs=10550949)
0.1875 = fieldNorm(field=content, doc=18137)

to

2.511424 = (MATCH) fieldWeight(content:13932066196 in 24885), product of:
1.0 = tf(termFreq(content:13932066196)=1)
16.073114 = idf(docFreq=2, maxDocs=10550949)
0.15625 = fieldNorm(field=content, doc=24885)

#4 Tonyguxu 2012-04-25

query=1065800715

0.6772446 = (MATCH) fieldWeight(content:1065800715 in 2200), product of:
1.0 = tf(termFreq(content:1065800715)=1)
3.0959754 = idf(docFreq=1297241, maxDocs=10550949)
0.21875 = fieldNorm(field=content, doc=2200)

to

0.58049536 = (MATCH) fieldWeight(content:1065800715 in 1224258), product of:
1.0 = tf(termFreq(content:1065800715)=1)
3.0959754 = idf(docFreq=1297241, maxDocs=10550949)
0.1875 = fieldNorm(field=content, doc=1224258)

#3 Tonyguxu 2012-04-25

query=1065800715*

1.0 = (MATCH) ConstantScore(content:1065800715*), product of:
1.0 = boost
1.0 = queryNorm

#2 Tonyguxu 2012-04-25

query=1065800715* OR 106580071517

1.7799454 = (MATCH) sum of:
   0.13035534 = (MATCH) ConstantScore(content:1065800715*), product of:
      1.0 = boost
      0.13035534 = queryNorm
   1.64959 = (MATCH) weight(content:106580071517 in 900495), product of:
      0.9914673 = queryWeight(content:106580071517), product of:
         7.6058817 = idf(docFreq=14268, maxDocs=10550949)
         0.13035534 = queryNorm
      1.6637866 = (MATCH) fieldWeight(content:106580071517 in 900495), product of:
         1.0 = tf(termFreq(content:106580071517)=1)
         7.6058817 = idf(docFreq=14268, maxDocs=10550949)
         0.21875 = fieldNorm(field=content, doc=900495)

to

0.06517767 = (MATCH) product of:
   0.13035534 = (MATCH) sum of:
       0.13035534 = (MATCH) ConstantScore(content:1065800715*), product of:
          1.0 = boost
          0.13035534 = queryNorm
   0.5 = coord(1/2)
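Both explanations for query=1065800715* OR 106580071517 can be re-derived by hand. The sketch below assumes queryNorm = 1/sqrt(sum of squared clause weights), where the constant-score clause contributes its boost (1.0) and the term clause contributes idf × boost, and coord = matched clauses / total clauses. The idf and fieldNorm values are copied from the explain output; the class and method names are hypothetical.

```java
// Illustrative only: recomputes the best and worst scores of the OR query.
class QueryNormCheck {
    static final double IDF = 7.6058817;     // idf of 106580071517 from the explain output
    static final double FIELD_NORM = 0.21875;

    // queryNorm = 1 / sqrt(w1^2 + w2^2 + ...)
    static double queryNorm(double... weights) {
        double sumOfSquares = 0.0;
        for (double w : weights) {
            sumOfSquares += w * w;
        }
        return 1.0 / Math.sqrt(sumOfSquares);
    }

    static double coord(int matched, int total) {
        return matched / (double) total;
    }

    // Document matching both the prefix clause and the term clause
    static double bestScore() {
        double qn = queryNorm(1.0, IDF);
        double constantPart = 1.0 * qn;                         // boost * queryNorm
        double termPart = (IDF * qn) * (1.0 * IDF * FIELD_NORM); // queryWeight * fieldWeight
        return coord(2, 2) * (constantPart + termPart);
    }

    // Document matching only the prefix clause
    static double worstScore() {
        double qn = queryNorm(1.0, IDF);
        return coord(1, 2) * (1.0 * qn);
    }

    public static void main(String[] args) {
        System.out.println("queryNorm = " + queryNorm(1.0, IDF));
        System.out.println("best      = " + bestScore());
        System.out.println("worst     = " + worstScore());
    }
}
```

The values come out at about 0.1304 for queryNorm, 1.78 for the document matching both clauses and 0.065 for the one matching only the prefix clause. So the prefix clause no longer scores 1.0 once it sits inside a BooleanQuery: queryNorm scales it down and coord halves it again.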

#1 Tonyguxu 2012-04-25

$$ score(q,d) = coord(q,d) \cdot queryNorm(q) \cdot \sum_{t\ in\ q} \left( tf(t\ in\ d) \cdot idf(t)^2 \cdot t.getBoost() \cdot norm(t,d) \right) $$

Consider a query that contains a single token (term):

tf(t in d) ≈ 1
idf(t) >= 1
norm(t,d), which "encapsulates a few (indexing time) boost and length factors", is below 1 in the vast majority of cases (mainly because the length factor is below 1)
