Tonyguxu

【Lucene】Scoring

 

http://lucene.apache.org/core/old_versioned_docs/versions/3_4_0/scoring.html#Algorithm


Introduction

Lucene scoring is the heart of why we all love Lucene. It is blazingly fast and it hides almost all of the complexity from the user. In a nutshell, it works. At least, that is, until it doesn't work, or doesn't work as one would expect it to work. Then we are left digging into Lucene internals or asking for help on java-user@lucene.apache.org to figure out why a document with five of our query terms scores lower than a different document with only one of the query terms.

While this document won't answer your specific scoring issues, it will, hopefully, point you to the places that can help you figure out the what and why of Lucene scoring.

Lucene scoring uses a combination of the Vector Space Model (VSM) of Information Retrieval and the Boolean model to determine how relevant a given Document is to a User's query. In general, the idea behind the VSM is that the more times a query term appears in a document relative to the number of times the term appears in all the documents in the collection, the more relevant that document is to the query. Lucene uses the Boolean model to first narrow down the documents that need to be scored, based on the boolean logic in the Query specification. Lucene also adds some capabilities and refinements onto this model to support boolean and fuzzy searching, but it essentially remains a VSM-based system at heart. For some valuable references on VSM and IR in general, refer to the Lucene Wiki IR references.
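The VSM intuition above can be sketched in a few lines of plain Java. The sketch assumes the tf and idf definitions of the Lucene 3.x DefaultSimilarity (square root of the term frequency, and 1 + ln(numDocs / (docFreq + 1))); the class and method names are made up for illustration and are not Lucene API, and boosts and norms are left out.

```java
// Illustrative only: TfIdfSketch is not a Lucene class.
class TfIdfSketch {
    // tf: square root of how often the term occurs in the document
    static double tf(int freq) {
        return Math.sqrt(freq);
    }

    // idf: terms found in fewer documents of the collection weigh more
    static double idf(int docFreq, int numDocs) {
        return 1.0 + Math.log((double) numDocs / (docFreq + 1));
    }

    // Weight of one term in one document; idf enters squared,
    // once from the query side and once from the document side.
    static double weight(int freq, int docFreq, int numDocs) {
        double idf = idf(docFreq, numDocs);
        return tf(freq) * idf * idf;
    }

    public static void main(String[] args) {
        int numDocs = 1000;
        // A rare term seen once versus a very common term seen five times
        System.out.println("rare term, freq 1:   " + weight(1, 9, numDocs));
        System.out.println("common term, freq 5: " + weight(5, 899, numDocs));
    }
}
```

With these made-up collection statistics, the single occurrence of the rare term outweighs five occurrences of the common one, which is the kind of result that can look surprising at first.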

The rest of this document will cover Scoring basics and how to change your Similarity. Next it will cover ways you can customize the Lucene internals in Changing your Scoring -- Expert Level, which gives details on implementing your own Query class and related functionality. Finally, we will finish up with some reference material in the Appendix.

Scoring

Scoring is very much dependent on the way documents are indexed, so it is important to understand indexing (see the Apache Lucene - Getting Started Guide and the Lucene file formats) before continuing on with this section. It is also assumed that readers know how to use the Searcher.explain(Query query, int doc) functionality, which can go a long way in explaining why a score is returned.

Fields and Documents

In Lucene, the objects we are scoring are Documents. A Document is a collection of Fields. Each Field has semantics about how it is created and stored (i.e. tokenized, untokenized, raw data, compressed, etc.). It is important to note that Lucene scoring works on Fields and then combines the results to return Documents. This is important because two Documents with exactly the same content, one having the content in two Fields and the other in one Field, will return different scores for the same query due to length normalization (assuming the DefaultSimilarity on the Fields).
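To see why the field layout matters, here is a small sketch of length normalization. It assumes the DefaultSimilarity definition of the length norm, 1 / sqrt(number of terms in the field), and ignores boosts and the single-byte encoding of norms; the class name is hypothetical.

```java
// Illustrative only: LengthNormSketch is not a Lucene class.
class LengthNormSketch {
    // Length norm of a field that holds numTerms terms
    static double lengthNorm(int numTerms) {
        return 1.0 / Math.sqrt(numTerms);
    }

    public static void main(String[] args) {
        // The same 8 terms, stored either in one field or split over two fields
        System.out.println("one field of 8 terms:  " + lengthNorm(8));
        System.out.println("each field of 4 terms: " + lengthNorm(4));
    }
}
```

A match in one of the two 4-term fields is multiplied by a larger norm than the same match in the single 8-term field, so the two Documents score differently.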

Score Boosting

Lucene allows influencing search results by "boosting" at more than one level:

  • Document level boosting - while indexing - by calling document.setBoost() before a document is added to the index.
  • Document's Field level boosting - while indexing - by calling field.setBoost() before adding a field to the document (and before adding the document to the index).
  • Query level boosting - during search, by setting a boost on a query clause, calling Query.setBoost().

Index-time boosts are preprocessed for storage efficiency and written to the directory (when writing the document) in a single byte (!) as follows: for each field of a document, all boosts of that field (i.e. all boosts under the same field name in that doc) are multiplied. The result is multiplied by the boost of the document, and also multiplied by a "field length norm" value that represents the length of that field in that doc (so shorter fields are automatically boosted up). The result is encoded as a single byte (with some precision loss, of course) and stored in the directory. The Similarity object in effect at indexing time computes the length norm of the field.
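The multiplication described above can be sketched as follows. This is a simplified model, not Lucene code: the class and method names are hypothetical, the length norm is assumed to be 1 / sqrt(numTerms) as in DefaultSimilarity, and the final single-byte encoding step is deliberately left out.

```java
// Illustrative only: IndexTimeNormSketch is not a Lucene class.
class IndexTimeNormSketch {
    // Float norm of one field of one document, before it is squeezed into a byte
    static float norm(float docBoost, float[] fieldBoosts, int numTerms) {
        float boost = docBoost;
        // All boosts set under the same field name in this document are multiplied
        for (float fieldBoost : fieldBoosts) {
            boost *= fieldBoost;
        }
        // Shorter fields get a larger length norm
        float lengthNorm = (float) (1.0 / Math.sqrt(numTerms));
        return boost * lengthNorm;
    }

    public static void main(String[] args) {
        // Document boost 2.0, two field instances boosted 1.5 and 2.0, 4 terms in total
        System.out.println(norm(2.0f, new float[] {1.5f, 2.0f}, 4));
    }
}
```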

This composition of the 1-byte representation of norms (that is, the index-time multiplication of field boosts, doc boost and field length norm) is nicely described in Fieldable.setBoost().

Encoding and decoding of the resulting float norm in a single byte are done by the static methods of the class Similarity: encodeNorm() and decodeNorm(). Due to loss of precision, it is not guaranteed that decode(encode(x)) = x, e.g. decode(encode(0.89)) = 0.75. At scoring (search) time, this norm is brought into the score of the document as norm(t, d), as shown by the formula in Similarity.

 

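Influencing the ranking of search results: boosting

Setting boost factors on a document, field, or query changes the scores of the matching documents and therefore the order in which results are returned.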

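1. Document-level boosting at index time

By default no document carries a boost; put differently, every document has the same boost factor of 1.0. Changing a document's boost factor tells Lucene to treat that document as more or less important than the other documents in the index when it computes relevance.

Example requirement: given a table of a company's employee records, list the employees who meet some condition first.

Or: when indexing and searching company email, rank messages from company employees above other messages.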

 

 

document.setBoost(boost);  // boost is a float; call this before adding the document to the index
 

 

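2. Field-level boosting at index time

With option 1, every field in a document is boosted by the same factor. With option 2, we can set a boost factor on an individual field, so that fields are weighted according to their importance.

Example requirement: when searching email, we want the subject field to count for more than the sender field. If document1 contains the query term in its subject field and document2 contains it in its sender field, boosting the subject field makes document1 more important than document2, so it ranks ahead of document2.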

 

 

 

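The Fieldable interface defines:
--setBoost(float boost)

Field field = new Field("subject", subject, xx, xx);  // xx, xx: the Field.Store and Field.Index arguments
field.setBoost(boost);  // boost is a float; call this before adding the field to the document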
  
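Note: to change the boost factor of a document or field, you must delete the document completely and add it again, or use the updateDocument method.

Lucene's scoring algorithm implicitly boosts shorter fields. This comes from the norm. When implementing your own Similarity, see this method of the abstract class Similarity:
public abstract float computeNorm(String field,
                                  FieldInvertState state)
See also:

http://nemogu.iteye.com/blog/1490874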


3. Query-level boosting at search time: when scoring the documents that match a clause, the boost set on that clause is applied in addition to the normal weightings.
Query
--public void setBoost(float b)
 

Understanding the Scoring Formula

This scoring formula is described in the Similarity class. Please take the time to study this formula, as it contains much of the information about how the basics of Lucene scoring work, especially the TermQuery.
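For reference, the practical scoring function, as best recalled from the Similarity javadoc for Lucene 3.x, has the following shape. Treat it as a reading aid and check it against the javadoc for the version you are using.

```latex
\mathrm{score}(q,d) \;=\; \mathrm{coord}(q,d)\cdot\mathrm{queryNorm}(q)\cdot
\sum_{t \in q}\Big(\mathrm{tf}(t \in d)\cdot\mathrm{idf}(t)^{2}\cdot t.\mathrm{getBoost}()\cdot\mathrm{norm}(t,d)\Big)
```

Here norm(t, d) is the index-time value that combines the document boost, the field boosts and the field length norm, while t.getBoost() is the search-time boost of term t in the query.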

Term: A Term represents a word from text. This is the unit of search. It is composed of two elements: the text of the word, as a string, and the name of the field that the text occurred in, an interned string. Note that terms may represent more than words from text fields, but also things like dates, email addresses, urls, etc.

TermQuery: A Query that matches documents containing a term. This may be combined with other terms with a BooleanQuery.

The Big Picture

OK, so the tf-idf formula and the Similarity are great for understanding the basics of Lucene scoring, but what really drives Lucene scoring is the use of, and the interactions between, the Query classes, as created by each application in response to a user's information need.

In this regard, Lucene offers a wide variety of Query implementations, most of which are in the org.apache.lucene.search package. These implementations can be combined in a wide variety of ways to provide complex querying capabilities along with information about where matches took place in the document collection. The Query section below highlights some of the more important Query classes. For information on the other ones, see the package summary. For details on implementing your own Query class, see Changing your Scoring -- Expert Level below.

Once a Query has been created and submitted to the IndexSearcher, the scoring process begins. (See the Appendix Algorithm section for more notes on the process.) After some infrastructure setup, control finally passes to the Weight implementation and its Scorer instance. In the case of any type of BooleanQuery, scoring is handled by BooleanWeight2 or BooleanWeight, both of which are inner classes in the BooleanQuery source code.

Assuming the use of the BooleanWeight2, a BooleanScorer2 is created by bringing together all of the Scorers from the sub-clauses of the BooleanQuery. When the BooleanScorer2 is asked to score, it delegates its work to an internal Scorer based on the type of clauses in the Query. This internal Scorer essentially loops over the sub-scorers and sums the scores provided by each scorer while factoring in the coord() score.
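The "sum the sub-scores and factor in coord()" step can be sketched like this. The sketch assumes the DefaultSimilarity definition of coord(), namely the number of matching clauses divided by the total number of clauses; the class and method names are hypothetical and the real BooleanScorer2 is considerably more involved.

```java
// Illustrative only: CoordSumSketch is not a Lucene class.
class CoordSumSketch {
    // matchedScores: scores of the sub-clauses that matched this document
    // maxOverlap:    total number of scoring sub-clauses in the query
    static double combine(double[] matchedScores, int maxOverlap) {
        double sum = 0.0;
        for (double score : matchedScores) {
            sum += score;
        }
        // coord rewards documents that match more of the query's clauses
        double coord = (double) matchedScores.length / maxOverlap;
        return sum * coord;
    }

    public static void main(String[] args) {
        // Two of four clauses matched, with sub-scores 1.0 and 2.0
        System.out.println(combine(new double[] {1.0, 2.0}, 4));
    }
}
```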

Query Classes

For information on the Query Classes, refer to the search package javadocs.

Changing Similarity

One of the ways of changing the scoring characteristics of Lucene is to change the similarity factors. For information on how to do this, see the search package javadocs.


Changing your Scoring -- Expert Level

At a much deeper level, one can affect scoring by implementing one's own Query classes (and related scoring classes). To learn more about how to do this, refer to the search package javadocs.

Appendix

Algorithm

This section is mostly notes on stepping through the Scoring process and serves as fertilizer for the earlier sections.

In the typical search application, a Query is passed to the Searcher, beginning the scoring process.

Once inside the Searcher, a Collector is used for the scoring and sorting of the search results. These important objects are involved in a search:

  1. The Weight object of the Query. The Weight object is an internal representation of the Query that allows the Query to be reused by the Searcher.
  2. The Searcher that initiated the call.
  3. A Filter for limiting the result set. Note, the Filter may be null.
  4. A Sort object for specifying how to sort the results if the standard score based sort method is not desired.

Assuming we are not sorting (since sorting doesn't affect the raw Lucene score), we call one of the search methods of the Searcher, passing in the Weight object created by Searcher.createWeight(Query), the Filter, and the number of results we want. This method returns a TopDocs object, which is an internal collection of search results. The Searcher creates a TopScoreDocCollector and passes it, along with the Weight and Filter, to another expert search method (for more on the Collector mechanism, see Searcher). The TopScoreDocCollector uses a PriorityQueue to collect the top results for the search.
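The idea of collecting the top results with a priority queue can be sketched with java.util.PriorityQueue. This is a stand-in for illustration, not Lucene's own collector or queue class; the Hit type and the tie-break rule (the lower doc id wins on equal scores) are assumptions of the sketch.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Illustrative only: TopDocsSketch and Hit are not Lucene classes.
class TopDocsSketch {
    static final class Hit {
        final int doc;
        final float score;
        Hit(int doc, float score) { this.doc = doc; this.score = score; }
    }

    // Orders the weakest hit first: lowest score, then highest doc id
    static final Comparator<Hit> WEAKEST_FIRST = (a, b) -> {
        int byScore = Float.compare(a.score, b.score);
        return byScore != 0 ? byScore : Integer.compare(b.doc, a.doc);
    };

    static List<Hit> topN(int[] docs, float[] scores, int n) {
        PriorityQueue<Hit> queue = new PriorityQueue<>(WEAKEST_FIRST);
        for (int i = 0; i < docs.length; i++) {
            queue.add(new Hit(docs[i], scores[i]));
            if (queue.size() > n) {
                queue.poll(); // evict the current weakest hit
            }
        }
        List<Hit> best = new ArrayList<>(queue);
        best.sort(WEAKEST_FIRST.reversed()); // strongest hit first
        return best;
    }

    public static void main(String[] args) {
        int[] docs = {0, 1, 2, 3};
        float[] scores = {0.2f, 0.9f, 0.5f, 0.7f};
        for (Hit hit : topN(docs, scores, 2)) {
            System.out.println("doc=" + hit.doc + " score=" + hit.score);
        }
    }
}
```

The queue never holds more than n hits, so memory stays bounded by the number of requested results rather than by the number of matching documents.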

If a Filter is being used, some initial setup is done to determine which docs to include. Otherwise, we ask the Weight for a Scorer for the IndexReader of the current searcher and we proceed by calling the score method on the Scorer.

At last, we are actually going to score some documents. The score method takes in the Collector (most likely the TopScoreDocCollector or TopFieldCollector) and does its business. Of course, here is where things get involved. The Scorer that is returned by the Weight object depends on what type of Query was submitted. In most real-world applications with multiple query terms, the Scorer is going to be a BooleanScorer2 (see the section on customizing your scoring for info on changing this).

Assuming a BooleanScorer2 scorer, we first initialize the Coordinator, which is used to apply the coord() factor. We then get an internal Scorer based on the required, optional and prohibited parts of the query. Using this internal Scorer, the BooleanScorer2 then proceeds into a while loop based on the Scorer#next() method. The next() method advances to the next document matching the query. This is an abstract method in the Scorer class and is thus overridden by all derived implementations. If you have a simple OR query, your internal Scorer is most likely a DisjunctionSumScorer, which essentially combines the scores from the sub-scorers of the OR'd terms.
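The next()-driven loop over an OR query can be sketched as below. This is a simplified model of the idea, not the DisjunctionSumScorer implementation: SubScorer is a hypothetical stand-in for a per-term scorer backed by a sorted list of doc ids, and coord() is left out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: DisjunctionSketch and SubScorer are not Lucene classes.
class DisjunctionSketch {
    static final class SubScorer {
        private final int[] docs;      // matching doc ids, ascending
        private final double[] scores; // score of this term in each doc
        private int pos = -1;
        SubScorer(int[] docs, double[] scores) { this.docs = docs; this.scores = scores; }
        boolean next() { return ++pos < docs.length; } // advance to the next match
        int doc() { return docs[pos]; }
        double score() { return scores[pos]; }
    }

    // Visits every doc matched by at least one sub-scorer, in doc id order,
    // and sums the scores of the sub-scorers positioned on that doc.
    static Map<Integer, Double> scoreAll(List<SubScorer> subScorers) {
        List<SubScorer> live = new ArrayList<>();
        for (SubScorer s : subScorers) {
            if (s.next()) live.add(s);
        }
        Map<Integer, Double> result = new LinkedHashMap<>();
        while (!live.isEmpty()) {
            int current = Integer.MAX_VALUE;
            for (SubScorer s : live) current = Math.min(current, s.doc());
            double sum = 0.0;
            for (Iterator<SubScorer> it = live.iterator(); it.hasNext(); ) {
                SubScorer s = it.next();
                if (s.doc() == current) {
                    sum += s.score();
                    if (!s.next()) it.remove(); // this sub-scorer is exhausted
                }
            }
            result.put(current, sum);
        }
        return result;
    }

    public static void main(String[] args) {
        SubScorer a = new SubScorer(new int[] {1, 3, 5}, new double[] {1.0, 1.0, 1.0});
        SubScorer b = new SubScorer(new int[] {3, 4}, new double[] {2.0, 2.0});
        System.out.println(scoreAll(Arrays.asList(a, b)));
    }
}
```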
