Document Boost and Field Boost are set at index time and are stored in the norms (.nrm) file.
If they are not set, Document Boost and Field Boost both default to 1.
Document Boost and Field Boost are set as follows:
Document doc = new Document();
Field f = new Field("contents", "hello world", Field.Store.NO, Field.Index.ANALYZED);
f.setBoost(100);   // Field Boost
doc.add(f); doc.setBoost(100);   // Document Boost
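Document Boost and Field Boost both feed into norm(t, d), which is computed as:
norm(t, d) = doc.getBoost() × lengthNorm(field) × ∏ f.getBoost(), where the product runs over every field f in d that has the same name as the field of t.
- Document boost: the larger this value, the more important the document.
- Field boost: the larger this value, the more important the field.
- lengthNorm(field) = (1.0 / Math.sqrt(numTerms)): the more terms a field contains (that is, the longer the document), the smaller this value; the shorter the document, the larger this value.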
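The three factors above can be combined in a few lines. This is a minimal sketch rather than Lucene's own code: the class name `NormSketch` and its method signatures are made up for illustration, and it computes the raw product only. Lucene additionally compresses the stored norm into a single byte, so the value that takes part in scoring is coarser than what this prints.

```java
final class NormSketch {
    // Default length normalization quoted in the text: 1 / sqrt(numTerms).
    static float lengthNorm(int numTerms) {
        return (float) (1.0 / Math.sqrt(numTerms));
    }

    // Raw norm: document boost * field boost * lengthNorm.
    // Assumption: a single field instance, so the product of field boosts is one factor.
    static float norm(float docBoost, float fieldBoost, int numTerms) {
        return docBoost * fieldBoost * lengthNorm(numTerms);
    }

    public static void main(String[] args) {
        // Three-term field, no boosts.
        System.out.println(norm(1f, 1f, 3));
        // Same field with a document boost of 100.
        System.out.println(norm(100f, 1f, 3));
        // Six-term field, no boosts: longer field, smaller norm.
        System.out.println(norm(1f, 1f, 6));
    }
}
```

Because the boosts and the length factor are multiplied together, a boost of 100 on a three-term field outweighs any difference that field length alone could produce.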
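According to the Lucene comments: "No norms means that index-time field and document boosting and field length normalization are disabled. The benefit is less memory usage as norms take up one byte of RAM per indexed field for every document in the index, during searching. Note that once you index a given field with norms enabled, disabling norms will have no effect."

In other words, without norms the index-time document boost, field boost and length normalization are all disabled. The benefit is lower memory usage, because the searcher no longer needs one byte per field per document to hold the norms.

Disabling norms, however, only works if it is done everywhere: as soon as one field keeps its norms, the fields that disabled them are stored with a default norm value as well. To make norm lookups fast, Lucene computes the offset as the document number multiplied by the size of the norms data per document, so if one document were missing, the offsets could no longer be computed. Norms are therefore either stored for all documents or for none.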
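The offset argument can be illustrated with a fixed-width table. This is only a sketch of the idea, not Lucene's file format: `NormsLayoutSketch`, `DEFAULT_NORM` and the byte values are hypothetical placeholders.

```java
import java.util.Arrays;

final class NormsLayoutSketch {
    static final int BYTES_PER_DOC = 1;   // one norm byte per document for a given field
    static final byte DEFAULT_NORM = 1;   // hypothetical placeholder for "no boost recorded"

    // Offset of a document's norm in a fixed-width table: docId * size per document.
    static int offset(int docId) {
        return docId * BYTES_PER_DOC;
    }

    public static void main(String[] args) {
        int maxDoc = 5;
        byte[] norms = new byte[maxDoc * BYTES_PER_DOC];
        // Documents that disabled norms still occupy a slot, filled with the default.
        Arrays.fill(norms, DEFAULT_NORM);
        // Only document 0 carries a real (hypothetical) norm byte.
        norms[offset(0)] = 42;
        for (int doc = 0; doc < maxDoc; doc++) {
            System.out.println("doc " + doc + " -> norm byte " + norms[offset(doc)]);
        }
    }
}
```

Skipping the slot for a document would shift every later offset, which is why a table like this has to hold an entry for every document once any document needs one.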
Experiment 1: the effect of Document Boost
public void testNormsDocBoost() throws Exception {
    File indexDir = new File("testNormsDocBoost");
    IndexWriter writer = new IndexWriter(FSDirectory.open(indexDir),
            new StandardAnalyzer(Version.LUCENE_CURRENT), true, IndexWriter.MaxFieldLength.LIMITED);
    writer.setUseCompoundFile(false);

    // Document 1: norms enabled, document boost of 100.
    Document doc1 = new Document();
    Field f1 = new Field("contents", "common hello hello", Field.Store.NO, Field.Index.ANALYZED);
    doc1.add(f1);
    doc1.setBoost(100);
    writer.addDocument(doc1);

    // Documents 2 and 3: norms disabled.
    Document doc2 = new Document();
    Field f2 = new Field("contents", "common common hello", Field.Store.NO, Field.Index.ANALYZED_NO_NORMS);
    doc2.add(f2);
    writer.addDocument(doc2);

    Document doc3 = new Document();
    Field f3 = new Field("contents", "common common common", Field.Store.NO, Field.Index.ANALYZED_NO_NORMS);
    doc3.add(f3);
    writer.addDocument(doc3);
    writer.close();

    IndexReader reader = IndexReader.open(FSDirectory.open(indexDir));
    IndexSearcher searcher = new IndexSearcher(reader);
    TopDocs docs = searcher.search(new TermQuery(new Term("contents", "common")), 10);
    for (ScoreDoc doc : docs.scoreDocs) {
        System.out.println("docid : " + doc.doc + " score : " + doc.score);
    }
}
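With f1 also indexed as Field.Index.ANALYZED_NO_NORMS, the document boost has no effect and the ranking follows term frequency only:
docid : 2 score : 1.2337708
docid : 1 score : 1.0073696
docid : 0 score : 0.71231794
With f1 indexed as Field.Index.ANALYZED, as in the code above, the boosted first document ranks first:
docid : 0 score : 39.889805
docid : 2 score : 0.6168854
docid : 1 score : 0.5036848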
Experiment 2: the effect of Field Boost
public void testNormsFieldBoost() throws Exception {
    File indexDir = new File("testNormsFieldBoost");
    IndexWriter writer = new IndexWriter(FSDirectory.open(indexDir),
            new StandardAnalyzer(Version.LUCENE_CURRENT), true, IndexWriter.MaxFieldLength.LIMITED);
    writer.setUseCompoundFile(false);

    // Document 1: "title" field with norms enabled and a field boost of 100.
    Document doc1 = new Document();
    Field f1 = new Field("title", "common hello hello", Field.Store.NO, Field.Index.ANALYZED);
    f1.setBoost(100);
    doc1.add(f1);
    writer.addDocument(doc1);

    // Document 2: "contents" field with norms disabled.
    Document doc2 = new Document();
    Field f2 = new Field("contents", "common common hello", Field.Store.NO, Field.Index.ANALYZED_NO_NORMS);
    doc2.add(f2);
    writer.addDocument(doc2);
    writer.close();

    IndexReader reader = IndexReader.open(FSDirectory.open(indexDir));
    IndexSearcher searcher = new IndexSearcher(reader);
    QueryParser parser = new QueryParser(Version.LUCENE_CURRENT, "contents",
            new StandardAnalyzer(Version.LUCENE_CURRENT));
    Query query = parser.parse("title:common contents:common");
    TopDocs docs = searcher.search(query, 10);
    for (ScoreDoc doc : docs.scoreDocs) {
        System.out.println("docid : " + doc.doc + " score : " + doc.score);
    }
}
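With f1 indexed as Field.Index.ANALYZED_NO_NORMS, the field boost has no effect:
docid : 1 score : 0.49999997
docid : 0 score : 0.35355338
With f1 indexed as Field.Index.ANALYZED, as in the code above, the boosted title field puts the first document on top:
docid : 0 score : 19.79899
docid : 1 score : 0.49999997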
public void testNormsLength() throws Exception {
    File indexDir = new File("testNormsLength");
    IndexWriter writer = new IndexWriter(FSDirectory.open(indexDir),
            new StandardAnalyzer(Version.LUCENE_CURRENT), true, IndexWriter.MaxFieldLength.LIMITED);
    writer.setUseCompoundFile(false);

    // Document 1: short field, "common" appears once.
    Document doc1 = new Document();
    Field f1 = new Field("contents", "common hello hello", Field.Store.NO, Field.Index.ANALYZED_NO_NORMS);
    doc1.add(f1);
    writer.addDocument(doc1);

    // Document 2: longer field, "common" appears twice.
    Document doc2 = new Document();
    Field f2 = new Field("contents", "common common hello hello hello hello", Field.Store.NO, Field.Index.ANALYZED_NO_NORMS);
    doc2.add(f2);
    writer.addDocument(doc2);
    writer.close();

    IndexReader reader = IndexReader.open(FSDirectory.open(indexDir));
    IndexSearcher searcher = new IndexSearcher(reader);
    QueryParser parser = new QueryParser(Version.LUCENE_CURRENT, "contents",
            new StandardAnalyzer(Version.LUCENE_CURRENT));
    Query query = parser.parse("title:common contents:common");
    TopDocs docs = searcher.search(query, 10);
    for (ScoreDoc doc : docs.scoreDocs) {
        System.out.println("docid : " + doc.doc + " score : " + doc.score);
    }
}
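With norms disabled, the second document, which contains "common" twice, scores higher:
docid : 1 score : 0.13928263
docid : 0 score : 0.09848769
With norms in effect, the second document scores lower because its field is longer:
docid : 0 score : 0.09848769
docid : 1 score : 0.052230984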
public void testOmitNorms() throws Exception {
    File indexDir = new File("testOmitNorms");
    IndexWriter writer = new IndexWriter(FSDirectory.open(indexDir),
            new StandardAnalyzer(Version.LUCENE_CURRENT), true, IndexWriter.MaxFieldLength.LIMITED);
    writer.setUseCompoundFile(false);

    // One document with norms enabled.
    Document doc1 = new Document();
    Field f1 = new Field("title", "common hello hello", Field.Store.NO, Field.Index.ANALYZED);
    doc1.add(f1);
    writer.addDocument(doc1);

    // 10000 documents with norms disabled.
    for (int i = 0; i < 10000; i++) {
        Document doc2 = new Document();
        Field f2 = new Field("contents", "common common hello hello hello hello", Field.Store.NO, Field.Index.ANALYZED_NO_NORMS);
        doc2.add(f2);
        writer.addDocument(doc2);
    }
    writer.close();
}


At search time, a Query Boost can be set in the query string, for example:
title:common^4 content:common
public void testQueryBoost() throws Exception {
    File indexDir = new File("TestQueryBoost");
    IndexWriter writer = new IndexWriter(FSDirectory.open(indexDir),
            new StandardAnalyzer(Version.LUCENE_CURRENT), true, IndexWriter.MaxFieldLength.LIMITED);

    // Document 1 contains common1 once.
    Document doc1 = new Document();
    Field f1 = new Field("contents", "common1 hello hello", Field.Store.NO, Field.Index.ANALYZED);
    doc1.add(f1);
    writer.addDocument(doc1);

    // Document 2 contains common2 twice.
    Document doc2 = new Document();
    Field f2 = new Field("contents", "common2 common2 hello", Field.Store.NO, Field.Index.ANALYZED);
    doc2.add(f2);
    writer.addDocument(doc2);
    writer.close();

    IndexReader reader = IndexReader.open(FSDirectory.open(indexDir));
    IndexSearcher searcher = new IndexSearcher(reader);
    QueryParser parser = new QueryParser(Version.LUCENE_CURRENT, "contents",
            new StandardAnalyzer(Version.LUCENE_CURRENT));
    Query query = parser.parse("common1 common2");
    TopDocs docs = searcher.search(query, 10);
    for (ScoreDoc doc : docs.scoreDocs) {
        System.out.println("docid : " + doc.doc + " score : " + doc.score);
    }
}
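With the query "common1 common2", tf/idf alone decides, and the second document, which contains common2 twice, scores higher:
docid : 1 score : 0.24999999
docid : 0 score : 0.17677669
If the query is "common1^100 common2" instead, the first document scores higher:
docid : 0 score : 0.2499875
docid : 1 score : 0.0035353568
So how does Query Boost affect the document score?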
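One way to see the effect is to redo the arithmetic by hand. The sketch below is not Lucene code; it assumes the classic formula in which each term's query weight is idf × boost × queryNorm, with queryNorm = 1 / sqrt(sumOfSquaredWeights), and the field weight is tf × idf × norm. The values idf = 1 (two documents, docFreq 1) and norm = 0.5 (a three-term field after one-byte encoding) are inferred from the printed scores, and the class and method names are hypothetical.

```java
final class QueryBoostSketch {
    // Score of a document that matches exactly one of two query terms.
    // matchedBoost is the boost of the matched term, otherBoost that of the unmatched one.
    static double score(double matchedBoost, double otherBoost, double tf, double idf, double norm) {
        double sumOfSquaredWeights =
                Math.pow(idf * matchedBoost, 2) + Math.pow(idf * otherBoost, 2);
        double queryNorm = 1.0 / Math.sqrt(sumOfSquaredWeights);
        double queryWeight = idf * matchedBoost * queryNorm; // where the query boost enters
        double fieldWeight = tf * idf * norm;
        double coord = 1.0 / 2.0;                            // one of two query terms matched
        return queryWeight * fieldWeight * coord;
    }

    public static void main(String[] args) {
        double idf = 1.0;   // assumption: ln(2 / (1 + 1)) + 1
        double norm = 0.5;  // assumption: inferred from the printed scores
        // Query "common1 common2": no boosts.
        System.out.println("doc0 " + score(1, 1, Math.sqrt(1), idf, norm));
        System.out.println("doc1 " + score(1, 1, Math.sqrt(2), idf, norm));
        // Query "common1^100 common2": doc0 matches the boosted term.
        System.out.println("doc0 " + score(100, 1, Math.sqrt(1), idf, norm));
        System.out.println("doc1 " + score(1, 100, Math.sqrt(2), idf, norm));
    }
}
```

Under these assumptions the boost raises the weight of the boosted term and, through queryNorm, shrinks the weight of every other term in the same query, which is why the unboosted document's score drops so sharply.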
(1) float computeNorm(String field, FieldInvertState state)
(2) float lengthNorm(String fieldName, int numTokens)
(3) float queryNorm(float sumOfSquaredWeights)
(4) float tf(float freq)
(5) float idf(int docFreq, int numDocs)
(6) float coord(int overlap, int maxOverlap)
(7) float scorePayload(int docId, String fieldName, int start, int end, byte [] payload, int offset, int length)
(1) float computeNorm(String field, FieldInvertState state)
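This affects how the normalization factor is computed. As described above, it consists of three parts: document boost, field boost, and document length normalization. The function normally follows the norm(t, d) formula given earlier.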
(2) float lengthNorm(String fieldName, int numTokens)
This computes the document length normalization; the default is 1.0 / Math.sqrt(numTerms).
(3) float queryNorm(float sumOfSquaredWeights)
(4) float tf(float freq)
(5) float idf(int docFreq, int numDocs)
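idf is a score computed from the number of documents containing a term and the total number of documents; the default is (Math.log(numDocs/(double)(docFreq+1)) + 1.0).
The underlying reason is that MultiSearcher's docFreq(Term term) counts the documents containing the term across both indexes, whereas each IndexSearcher only counts the documents containing the term in its own index. When the two indexes hold very different numbers of documents, the scores cannot be compared.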
public void testMultiIndex() throws Exception {
    MultiIndexSimilarity sim = new MultiIndexSimilarity();
    File indexDir01 = new File("TestMultiIndex/TestMultiIndex01");
    File indexDir02 = new File("TestMultiIndex/TestMultiIndex02");
    IndexReader reader01 = IndexReader.open(FSDirectory.open(indexDir01));
    IndexReader reader02 = IndexReader.open(FSDirectory.open(indexDir02));
    IndexSearcher searcher01 = new IndexSearcher(reader01);
    searcher01.setSimilarity(sim);
    IndexSearcher searcher02 = new IndexSearcher(reader02);
    searcher02.setSimilarity(sim);
    MultiSearcher multiseacher = new MultiSearcher(searcher01, searcher02);
    multiseacher.setSimilarity(sim);

    QueryParser parser = new QueryParser(Version.LUCENE_CURRENT, "contents",
            new StandardAnalyzer(Version.LUCENE_CURRENT));
    Query query = parser.parse("common");

    // Search the first index on its own.
    TopDocs docs = searcher01.search(query, 10);
    System.out.println("----------------------------------------------");
    for (ScoreDoc doc : docs.scoreDocs) {
        System.out.println("docid : " + doc.doc + " score : " + doc.score);
    }
    // Search the second index on its own.
    System.out.println("----------------------------------------------");
    docs = searcher02.search(query, 10);
    for (ScoreDoc doc : docs.scoreDocs) {
        System.out.println("docid : " + doc.doc + " score : " + doc.score);
    }
    // Search both indexes through MultiSearcher.
    System.out.println("----------------------------------------------");
    docs = multiseacher.search(query, 20);
    for (ScoreDoc doc : docs.scoreDocs) {
        System.out.println("docid : " + doc.doc + " score : " + doc.score);
    }
}
With the default idf, the three searches produce three different scores:
-------------------------------
docid : 0 score : 0.49317428
docid : 1 score : 0.49317428
docid : 2 score : 0.49317428
docid : 3 score : 0.49317428
docid : 4 score : 0.49317428
docid : 5 score : 0.49317428
docid : 6 score : 0.49317428
docid : 7 score : 0.49317428
-------------------------------
docid : 0 score : 0.45709616
docid : 1 score : 0.45709616
docid : 2 score : 0.45709616
docid : 3 score : 0.45709616
docid : 4 score : 0.45709616
-------------------------------
docid : 0 score : 0.5175894
docid : 1 score : 0.5175894
docid : 2 score : 0.5175894
docid : 3 score : 0.5175894
docid : 4 score : 0.5175894
docid : 5 score : 0.5175894
docid : 6 score : 0.5175894
docid : 7 score : 0.5175894
docid : 8 score : 0.5175894
docid : 9 score : 0.5175894
docid : 10 score : 0.5175894
docid : 11 score : 0.5175894
docid : 12 score : 0.5175894
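The three different scores can be traced back to idf alone. The sketch below applies the default idf formula quoted above to the document counts visible in the output (8, 5 and 13 hits); it assumes every document in each index contains the term, which the output suggests but does not prove. `IdfSketch` is a hypothetical name, not a Lucene class.

```java
final class IdfSketch {
    // Default idf quoted in the text: log(numDocs / (docFreq + 1)) + 1.
    static double idf(int docFreq, int numDocs) {
        return Math.log(numDocs / (double) (docFreq + 1)) + 1.0;
    }

    public static void main(String[] args) {
        // Assumption: every document in each index contains "common".
        double first = idf(8, 8);     // first index searched alone
        double second = idf(5, 5);    // second index searched alone
        double merged = idf(13, 13);  // both indexes seen as one by MultiSearcher
        System.out.println("idf, first index  : " + first);
        System.out.println("idf, second index : " + second);
        System.out.println("idf, merged       : " + merged);
        // Ratio between the idf values of the two separate indexes.
        System.out.println("first / second    : " + first / second);
    }
}
```

The ratio between the first two idf values is close to the ratio between the two printed scores (0.49317428 / 0.45709616), which is consistent with idf being the only factor that differs between the searches.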
// Extends DefaultSimilarity so that only idf needs to be overridden.
class MultiIndexSimilarity extends DefaultSimilarity {
    @Override
    public float idf(int docFreq, int numDocs) {
        return 1.0f;
    }
}
With idf fixed at 1.0f, all three searches return the same score and the results become comparable:
-----------------------------
docid : 0 score : 0.559017
docid : 1 score : 0.559017
docid : 2 score : 0.559017
docid : 3 score : 0.559017
docid : 4 score : 0.559017
docid : 5 score : 0.559017
docid : 6 score : 0.559017
docid : 7 score : 0.559017
-----------------------------
docid : 0 score : 0.559017
docid : 1 score : 0.559017
docid : 2 score : 0.559017
docid : 3 score : 0.559017
docid : 4 score : 0.559017
-----------------------------
docid : 0 score : 0.559017
docid : 1 score : 0.559017
docid : 2 score : 0.559017
docid : 3 score : 0.559017
docid : 4 score : 0.559017
docid : 5 score : 0.559017
docid : 6 score : 0.559017
docid : 7 score : 0.559017
docid : 8 score : 0.559017
docid : 9 score : 0.559017
docid : 10 score : 0.559017
docid : 11 score : 0.559017
docid : 12 score : 0.559017
(6) float coord(int overlap, int maxOverlap)
public void TestCoord() throws Exception {
    MySimilarity sim = new MySimilarity();
    File indexDir = new File("TestCoord");
    IndexWriter writer = new IndexWriter(FSDirectory.open(indexDir),
            new StandardAnalyzer(Version.LUCENE_CURRENT), true, IndexWriter.MaxFieldLength.LIMITED);

    // Document 1 contains both query terms.
    Document doc1 = new Document();
    Field f1 = new Field("contents", "common hello world", Field.Store.NO, Field.Index.ANALYZED);
    doc1.add(f1);
    writer.addDocument(doc1);

    // Document 2 contains only "common", three times.
    Document doc2 = new Document();
    Field f2 = new Field("contents", "common common common", Field.Store.NO, Field.Index.ANALYZED);
    doc2.add(f2);
    writer.addDocument(doc2);

    // Ten more documents containing only "world".
    for (int i = 0; i < 10; i++) {
        Document doc3 = new Document();
        Field f3 = new Field("contents", "world", Field.Store.NO, Field.Index.ANALYZED);
        doc3.add(f3);
        writer.addDocument(doc3);
    }
    writer.close();

    IndexReader reader = IndexReader.open(FSDirectory.open(indexDir));
    IndexSearcher searcher = new IndexSearcher(reader);
    searcher.setSimilarity(sim);
    QueryParser parser = new QueryParser(Version.LUCENE_CURRENT, "contents",
            new StandardAnalyzer(Version.LUCENE_CURRENT));
    Query query = parser.parse("common world");
    TopDocs docs = searcher.search(query, 2);
    for (ScoreDoc doc : docs.scoreDocs) {
        System.out.println("docid : " + doc.doc + " score : " + doc.score);
    }
}
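// Extends DefaultSimilarity so that only coord needs to be overridden.
class MySimilarity extends DefaultSimilarity {
    @Override
    public float coord(int overlap, int maxOverlap) {
        return 1;
    }
}
With coord always returning 1, the second document, which contains "common" three times, ranks first:
docid : 1 score : 1.9059997
docid : 0 score : 1.2936771
class MySimilarity extends DefaultSimilarity {
    @Override
    public float coord(int overlap, int maxOverlap) {
        return overlap / (float) maxOverlap;
    }
}
With coord returning overlap / maxOverlap, the first document, which matches both query terms, ranks first:
docid : 0 score : 1.2936771
docid : 1 score : 0.95299983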
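The change in ranking follows directly from the coord factor. A minimal sketch, using the two scores printed when coord returned 1 as the base scores; `CoordSketch` is a hypothetical name and the rest of the scoring formula is deliberately left out.

```java
final class CoordSketch {
    // Fraction of query terms that the document matches.
    static float coord(int overlap, int maxOverlap) {
        return overlap / (float) maxOverlap;
    }

    public static void main(String[] args) {
        // Scores printed when coord always returned 1.
        float doc0Base = 1.2936771f; // matches both "common" and "world"
        float doc1Base = 1.9059997f; // matches only "common"
        System.out.println("docid : 0 score : " + doc0Base * coord(2, 2));
        System.out.println("docid : 1 score : " + doc1Base * coord(1, 2));
    }
}
```

A document that matches only one of two query terms is halved, which is enough here to move it below the document that matches both.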
(7) float scorePayload(int docId, String fieldName, int start, int end, byte [] payload, int offset, int length)
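From the definition of payloads we know that the index is stored as inverted lists: for every term, a list of the documents containing that term is kept, and to speed up lookups this list is usually stored as a skip list. Payload information is stored in the inverted list, alongside the document number, and is mostly used to hold information related to each document. This information could also be kept in a stored field, and functionally the two are largely equivalent. When there is a lot of information to store, however, keeping it in the inverted list and taking advantage of the skip list can speed up search considerably.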

// Analyzer that splits on whitespace and marks <b>...</b> tokens through a payload.
class BoldAnalyzer extends Analyzer {
    @Override
    public TokenStream tokenStream(String fieldName, Reader reader) {
        TokenStream result = new WhitespaceTokenizer(reader);
        result = new BoldFilter(result);
        return result;
    }
}

class BoldFilter extends TokenFilter {
    public static int IS_NOT_BOLD = 0;
    public static int IS_BOLD = 1;

    private TermAttribute termAtt;
    private PayloadAttribute payloadAtt;

    protected BoldFilter(TokenStream input) {
        super(input);
        termAtt = addAttribute(TermAttribute.class);
        payloadAtt = addAttribute(PayloadAttribute.class);
    }

    @Override
    public boolean incrementToken() throws IOException {
        if (input.incrementToken()) {
            final char[] buffer = termAtt.termBuffer();
            final int length = termAtt.termLength();
            String tokenstring = new String(buffer, 0, length);
            if (tokenstring.startsWith("<b>") && tokenstring.endsWith("</b>")) {
                // Strip the tags from the term and record that it was bold.
                tokenstring = tokenstring.replace("<b>", "");
                tokenstring = tokenstring.replace("</b>", "");
                termAtt.setTermBuffer(tokenstring);
                payloadAtt.setPayload(new Payload(int2bytes(IS_BOLD)));
            } else {
                payloadAtt.setPayload(new Payload(int2bytes(IS_NOT_BOLD)));
            }
            return true;
        } else {
            return false;
        }
    }

    // Decode a big-endian 4-byte array into an int.
    public static int bytes2int(byte[] b) {
        int mask = 0xff;
        int temp = 0;
        int res = 0;
        for (int i = 0; i < 4; i++) {
            res <<= 8;
            temp = b[i] & mask;
            res |= temp;
        }
        return res;
    }

    // Encode an int as a big-endian 4-byte array.
    public static byte[] int2bytes(int num) {
        byte[] b = new byte[4];
        for (int i = 0; i < 4; i++) {
            b[i] = (byte) (num >>> (24 - i * 8));
        }
        return b;
    }
}
class PayloadSimilarity extends DefaultSimilarity {
    @Override
    public float scorePayload(int docId, String fieldName, int start, int end,
            byte[] payload, int offset, int length) {
        int isbold = BoldFilter.bytes2int(payload);
        if (isbold == BoldFilter.IS_BOLD) {
            System.out.println("It is a bold char.");
        } else {
            System.out.println("It is not a bold char.");
        }
        return 1;
    }
}
Finally, at query time a PayloadXXXQuery must be used (PayloadTermQuery here; in Lucene 2.4.1, use BoostingTermQuery), otherwise scorePayload has no effect.
public void testPayloadScore() throws Exception {
    PayloadSimilarity sim = new PayloadSimilarity();
    File indexDir = new File("TestPayloadScore");
    IndexWriter writer = new IndexWriter(FSDirectory.open(indexDir),
            new BoldAnalyzer(), true, IndexWriter.MaxFieldLength.LIMITED);

    // Document 1: "hello" is plain text.
    Document doc1 = new Document();
    Field f1 = new Field("contents", "common hello world", Field.Store.NO, Field.Index.ANALYZED);
    doc1.add(f1);
    writer.addDocument(doc1);

    // Document 2: "hello" is bold.
    Document doc2 = new Document();
    Field f2 = new Field("contents", "common <b>hello</b> world", Field.Store.NO, Field.Index.ANALYZED);
    doc2.add(f2);
    writer.addDocument(doc2);
    writer.close();

    IndexReader reader = IndexReader.open(FSDirectory.open(indexDir));
    IndexSearcher searcher = new IndexSearcher(reader);
    searcher.setSimilarity(sim);
    PayloadTermQuery query = new PayloadTermQuery(new Term("contents", "hello"), new MaxPayloadFunction());
    TopDocs docs = searcher.search(query, 10);
    for (ScoreDoc doc : docs.scoreDocs) {
        System.out.println("docid : " + doc.doc + " score : " + doc.score);
    }
}
Since scorePayload returns 1 in both cases, the two documents receive the same score:
It is not a bold char.
It is a bold char.
docid : 0 score : 0.2101998
docid : 1 score : 0.2101998
class PayloadSimilarity extends DefaultSimilarity {
    @Override
    public float scorePayload(int docId, String fieldName, int start, int end,
            byte[] payload, int offset, int length) {
        int isbold = BoldFilter.bytes2int(payload);
        if (isbold == BoldFilter.IS_BOLD) {
            System.out.println("It is a bold char.");
            return 10;
        } else {
            System.out.println("It is not a bold char.");
            return 1;
        }
    }
}
When scorePayload returns 10 for bold tokens, the second document, in which "hello" is bold, ranks first:
It is not a bold char.
It is a bold char.
docid : 1 score : 2.101998
docid : 0 score : 0.2101998
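The payload round trip and its effect on the score can be checked without an index. The sketch below reuses the two byte-conversion helpers from BoldFilter and assumes, as the printed scores suggest, that the payload score simply multiplies the base score; `PayloadSketch` and `payloadFactor` are hypothetical names.

```java
final class PayloadSketch {
    static final int IS_NOT_BOLD = 0;
    static final int IS_BOLD = 1;

    // Encode an int as a big-endian 4-byte array (same logic as BoldFilter.int2bytes).
    static byte[] int2bytes(int num) {
        byte[] b = new byte[4];
        for (int i = 0; i < 4; i++) {
            b[i] = (byte) (num >>> (24 - i * 8));
        }
        return b;
    }

    // Decode a big-endian 4-byte array into an int (same logic as BoldFilter.bytes2int).
    static int bytes2int(byte[] b) {
        int res = 0;
        for (int i = 0; i < 4; i++) {
            res <<= 8;
            res |= b[i] & 0xff;
        }
        return res;
    }

    // Mirrors the second PayloadSimilarity: 10 for bold tokens, 1 otherwise.
    static float payloadFactor(byte[] payload) {
        return bytes2int(payload) == IS_BOLD ? 10f : 1f;
    }

    public static void main(String[] args) {
        float baseScore = 0.2101998f; // score printed when scorePayload returned 1
        // Assumption: the final score is the base score times the payload factor.
        System.out.println("plain : " + baseScore * payloadFactor(int2bytes(IS_NOT_BOLD)));
        System.out.println("bold  : " + baseScore * payloadFactor(int2bytes(IS_BOLD)));
    }
}
```

Storing the flag as four bytes is more than a one-bit marker needs; a single byte would do, but the sketch keeps the encoding used in the example above.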
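The methods above cover every variable in Lucene's score formula. If that still does not meet your needs, you can also write your own collector.
In Lucene 2.4, HitCollector has the method public abstract void collect(int doc, float score), which collects the search results.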
public void collect(int doc, float score) {
    if (score > 0.0f) {
        totalHits++;
        if (reusableSD == null) {
            reusableSD = new ScoreDoc(doc, score);
        } else if (score >= reusableSD.score) {
            reusableSD.doc = doc;
            reusableSD.score = score;
        } else {
            return;
        }
        reusableSD = (ScoreDoc) hq.insertWithOverflow(reusableSD);
    }
}
public static long millisecondsOneDay = 24L * 3600L * 1000L;
public static long millisecondsOneWeek = 7L * 24L * 3600L * 1000L;
public static long millisecondsOneMonth = 30L * 24L * 3600L * 1000L;

public void collect(int doc, float score) {
    if (score > 0.0f) {
        // getTimeByDocId is the author's own helper and is not shown here.
        long time = getTimeByDocId(doc);
        // Scale the score down as the document gets older.
        if (time < millisecondsOneDay) {
            score = score * 1.0f;
        } else if (time < millisecondsOneWeek) {
            score = score * 0.8f;
        } else if (time < millisecondsOneMonth) {
            score = score * 0.3f;
        } else {
            score = score * 0.1f;
        }
        totalHits++;
        if (reusableSD == null) {
            reusableSD = new ScoreDoc(doc, score);
        } else if (score >= reusableSD.score) {
            reusableSD.doc = doc;
            reusableSD.score = score;
        } else {
            return;
        }
        reusableSD = (ScoreDoc) hq.insertWithOverflow(reusableSD);
    }
}
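The time-based adjustment can be isolated into a small function and checked on its own. This is a sketch under the assumption that the time value is the document's age in milliseconds; `TimeDecaySketch` and `adjust` are hypothetical names, and the thresholds and factors are the ones used in the collector above.

```java
final class TimeDecaySketch {
    static final long MILLIS_ONE_DAY = 24L * 3600L * 1000L;
    static final long MILLIS_ONE_WEEK = 7L * MILLIS_ONE_DAY;
    static final long MILLIS_ONE_MONTH = 30L * MILLIS_ONE_DAY;

    // Scale a score by the age of the document (assumed to be in milliseconds).
    static float adjust(float score, long ageMillis) {
        if (ageMillis < MILLIS_ONE_DAY) {
            return score;          // factor 1.0: less than a day old
        } else if (ageMillis < MILLIS_ONE_WEEK) {
            return score * 0.8f;   // less than a week old
        } else if (ageMillis < MILLIS_ONE_MONTH) {
            return score * 0.3f;   // less than a month old
        } else {
            return score * 0.1f;   // older than a month
        }
    }

    public static void main(String[] args) {
        float score = 1.0f;
        System.out.println("1 hour  : " + adjust(score, 3600L * 1000L));
        System.out.println("2 days  : " + adjust(score, 2 * MILLIS_ONE_DAY));
        System.out.println("2 weeks : " + adjust(score, 14 * MILLIS_ONE_DAY));
        System.out.println("1 year  : " + adjust(score, 365 * MILLIS_ONE_DAY));
    }
}
```

Keeping the decay in a pure function makes it easy to test the thresholds separately from the collector's priority-queue handling.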
In Lucene 3.0, the Collector interface is void collect(int doc), and TopScoreDocCollector implements it as follows:
public void collect(int doc) throws IOException {
    float score = scorer.score();
    totalHits++;
    if (score <= pqTop.score) {
        return;
    }
    pqTop.doc = doc + docBase;
    pqTop.score = score;
    pqTop = pq.updateTop();
}
