Lucene Study Notes, Part 9: Lucene Query Objects (1) (repost)

xangqun

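Besides its query syntax, Lucene lets you construct query objects yourself and search with them.

As the previous part on Lucene's query syntax showed, the query objects that have a corresponding query-syntax form are: BooleanQuery, FuzzyQuery, MatchAllDocsQuery, MultiTermQuery, MultiPhraseQuery, PhraseQuery, PrefixQuery, TermRangeQuery, TermQuery, and WildcardQuery.

Lucene also provides query objects that have no corresponding query syntax but implement more advanced features. This part focuses on those advanced query objects.

The main ones form the hierarchy below; we will go through them one by one.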

Query

BoostingQuery
CustomScoreQuery
MoreLikeThisQuery
MultiTermQuery
- NumericRangeQuery<T>
- TermRangeQuery
SpanQuery
- FieldMaskingSpanQuery
- SpanFirstQuery
- SpanNearQuery
  - PayloadNearQuery
- SpanNotQuery
- SpanOrQuery
- SpanRegexQuery
- SpanTermQuery
  - PayloadTermQuery
FilteredQuery

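1. BoostingQuery

BoostingQuery has three member variables:

Query match: the query that every document in the result set must satisfy
Query context: this query does not change which documents are in the result set; when a document also matches the context query, its score is multiplied by boost
float boost: the factor applied to the score of documents that match context

The BoostingQuery constructor: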

public BoostingQuery(Query match, Query context, float boost) {
    this.match = match;
    this.context = (Query) context.clone();
    this.boost = boost;
    this.context.setBoost(0.0f);
}

The rewrite function of BoostingQuery:

public Query rewrite(IndexReader reader) throws IOException {
    BooleanQuery result = new BooleanQuery() {
        @Override
        public Similarity getSimilarity(Searcher searcher) {
            return new DefaultSimilarity() {
                @Override
                public float coord(int overlap, int max) {
                    switch (overlap) {
                        case 1: // only the match clause matched
                            return 1.0f;
                        case 2: // both the match and the context clause matched
                            return boost;
                        default:
                            return 0.0f;
                    }
                }
            };
        }
    };
    result.add(match, BooleanClause.Occur.MUST);
    result.add(context, BooleanClause.Occur.SHOULD);
    return result;
}

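As the implementation shows, BoostingQuery is ultimately rewritten into a BooleanQuery. Its first clause is the match query with MUST (required); its second clause is the context query with SHOULD (optional).

However, as the analysis of the search process showed, even an optional clause contributes to the overall score.

That is why the BoostingQuery constructor sets the boost of the context query to zero: whether or not a document matches the context query, that clause adds nothing to the final score.

The rewrite function overrides DefaultSimilarity's coord function. When a document matches only the match query, coord returns 1; when it matches both the match query and the context query, coord returns boost, so the final score is multiplied by boost.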
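
The interplay of the zeroed context boost and the overridden coord can be sketched in plain Java without any Lucene types. This is only an illustration under stated assumptions: the class and method names are hypothetical, and the context clause is modelled as always contributing 0 to the sum because its boost is forced to 0.

```java
public class BoostingCoordSketch {
    /** Mirrors the overridden coord(): one matching clause -> 1.0, two -> boost. */
    static float coord(int overlap, float boost) {
        switch (overlap) {
            case 1:
                return 1.0f;
            case 2:
                return boost;
            default:
                return 0.0f;
        }
    }

    /**
     * matchScore is the score of the required match clause;
     * contextMatches says whether the document also matches the context clause.
     */
    static float score(float matchScore, boolean contextMatches, float boost) {
        float contextScore = 0.0f; // assumption: boost 0 makes this clause contribute nothing
        int overlap = contextMatches ? 2 : 1;
        return (matchScore + contextScore) * coord(overlap, boost);
    }

    public static void main(String[] args) {
        // A document matching both clauses is multiplied by boost.
        System.out.println(score(0.33987468f, true, 10f));
        // A document matching only the match clause keeps its score.
        System.out.println(score(0.67974937f, false, 10f));
    }
}
```

The sum decides nothing about the context clause; only the number of matching clauses, passed to coord, does.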

Let us run an experiment.

Index the following files:

file01: apple other other other boy

file02: apple apple other other other

file03: apple apple apple other other

file04: apple apple apple apple other

Consider query (1):

TermQuery must = new TermQuery(new Term("contents","apple"));
TermQuery context = new TermQuery(new Term("contents","boy"));
BoostingQuery query = new BoostingQuery(must, context, 1f);

and query (2):

TermQuery query = new TermQuery(new Term("contents","apple"));

Both return the same result:

docid : 3 score : 0.67974937
docid : 2 score : 0.58868027
docid : 1 score : 0.4806554
docid : 0 score : 0.33987468

Naturally, the more occurrences of apple a document contains, the higher its score.
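
These four scores can be reproduced with the DefaultSimilarity formulas tf = sqrt(freq) and idf = ln(numDocs / (docFreq + 1)) + 1. The sketch below is a plain-Java approximation, not Lucene code: the class name is hypothetical, the fieldNorm of 0.4375 is taken from the explain output for document 0, and it assumes that all four five-token files share that norm and that the query-side factors equal 1 for a single-term query.

```java
public class TermScoreSketch {
    /** Term-frequency factor: square root of the raw frequency. */
    static float tf(int freq) {
        return (float) Math.sqrt(freq);
    }

    /** Inverse document frequency: ln(numDocs / (docFreq + 1)) + 1. */
    static float idf(int docFreq, int numDocs) {
        return (float) (Math.log(numDocs / (double) (docFreq + 1)) + 1.0);
    }

    /** fieldWeight = tf * idf * fieldNorm. */
    static float score(int freq, int docFreq, int numDocs, float fieldNorm) {
        return tf(freq) * idf(docFreq, numDocs) * fieldNorm;
    }

    public static void main(String[] args) {
        // "apple" occurs in all 4 documents, 1 to 4 times per document.
        for (int freq = 1; freq <= 4; freq++) {
            System.out.println("freq " + freq + " -> " + score(freq, 4, 4, 0.4375f));
        }
    }
}
```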

However, the two scores are computed differently. The scoring details that explain reports for query (2), the plain TermQuery, are:

docid : 0 score : 0.33987468
0.33987468 = (MATCH) fieldWeight(contents:apple in 0), product of:
  1.0 = tf(termFreq(contents:apple)=1)
  0.7768564 = idf(docFreq=4, maxDocs=4)
  0.4375 = fieldNorm(field=contents, doc=0)

The scoring details that explain reports for query (1), the BoostingQuery, are:

docid : 0 score : 0.33987468
0.33987468 = (MATCH) sum of:
  0.33987468 = (MATCH) fieldWeight(contents:apple in 0), product of:
    1.0 = tf(termFreq(contents:apple)=1)
    0.7768564 = idf(docFreq=4, maxDocs=4)
    0.4375 = fieldNorm(field=contents, doc=0)
  0.0 = (MATCH) weight(contents:boy^0.0 in 0), product of:
    0.0 = queryWeight(contents:boy^0.0), product of:
      0.0 = boost
      1.6931472 = idf(docFreq=1, maxDocs=4)
      1.2872392 = queryNorm
    0.74075186 = (MATCH) fieldWeight(contents:boy in 0), product of:
      1.0 = tf(termFreq(contents:boy)=1)
      1.6931472 = idf(docFreq=1, maxDocs=4)
      0.4375 = fieldNorm(field=contents, doc=0)

So in query (1), the BoostingQuery, the boy clause is evaluated, but it contributes nothing because its boost is 0.

Now change boost so that the score of documents containing boy is multiplied by 10:

TermQuery must = new TermQuery(new Term("contents","apple"));
TermQuery context = new TermQuery(new Term("contents","boy"));
BoostingQuery query = new BoostingQuery(must, context, 10f);

The result:

docid : 0 score : 3.398747
docid : 3 score : 0.67974937
docid : 2 score : 0.58868027
docid : 1 score : 0.4806554

The scoring details from explain:

docid : 0 score : 3.398747
3.398747 = (MATCH) product of:
  0.33987468 = (MATCH) sum of:
    0.33987468 = (MATCH) fieldWeight(contents:apple in 0), product of:
      1.0 = tf(termFreq(contents:apple)=1)
      0.7768564 = idf(docFreq=4, maxDocs=4)
      0.4375 = fieldNorm(field=contents, doc=0)
    0.0 = (MATCH) weight(contents:boy^0.0 in 0), product of:
      0.0 = queryWeight(contents:boy^0.0), product of:
        0.0 = boost
        1.6931472 = idf(docFreq=1, maxDocs=4)
        1.2872392 = queryNorm
      0.74075186 = (MATCH) fieldWeight(contents:boy in 0), product of:
        1.0 = tf(termFreq(contents:boy)=1)
        1.6931472 = idf(docFreq=1, maxDocs=4)
        0.4375 = fieldNorm(field=contents, doc=0)
  10.0 = coord(2/2)

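2. CustomScoreQuery

CustomScoreQuery has two main member variables:

Query subQuery: the sub-query
ValueSourceQuery[] valSrcQueries: additional sources of values

A ValueSourceQuery mainly holds a ValueSource valSrc member, which represents one source of values.

During search, a ValueSourceQuery creates a ValueSourceWeight, which in turn creates a ValueSourceScorer. The score function of ValueSourceScorer is:

public float score() throws IOException {
    return qWeight * vals.floatVal(termDocs.doc());
}

Here vals = valSrc.getValues(reader) is of type DocValues, which maps a document number to a value.

In other words, CustomScoreQuery combines the sub-query score with the values from the other sources to determine the final score, and the formula can be overridden. The default implementation is: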

public float customScore(int doc, float subQueryScore, float valSrcScores[]) {
    if (valSrcScores.length == 1) {
        return customScore(doc, subQueryScore, valSrcScores[0]);
    }
    if (valSrcScores.length == 0) {
        return customScore(doc, subQueryScore, 1);
    }
    float score = subQueryScore;
    for (int i = 0; i < valSrcScores.length; i++) {
        score *= valSrcScores[i];
    }
    return score;
}

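What kind of value source would typically influence a document's score?

Take the author of an article, which may be stored in a Field. If we decide that articles by well-known authors should score higher, we can let the value of that Field influence the document's score.

Reading a field value through the reader for every document number would be slow, so Lucene introduced FieldCache to cache the values. FieldCache does not read from stored fields; it reads from indexed fields, so no Document object has to be built. The indexed field must, however, be untokenized and contain exactly one token per document.

Hence FieldCacheSource extends ValueSource, and most value sources extend FieldCacheSource. Its most important function is: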

public final DocValues getValues(IndexReader reader) throws IOException {
    return getCachedFieldValues(FieldCache.DEFAULT, field, reader);
}

Taking ByteFieldSource as an example, its getCachedFieldValues function is:

public DocValues getCachedFieldValues(FieldCache cache, String field, IndexReader reader) throws IOException {
    final byte[] arr = cache.getBytes(reader, field, parser);
    return new DocValues() {
        @Override
        public float floatVal(int doc) {
            return (float) arr[doc];
        }
        @Override
        public int intVal(int doc) {
            return arr[doc];
        }
        @Override
        public String toString(int doc) {
            return description() + '=' + intVal(doc);
        }
        @Override
        Object getInnerArray() {
            return arr;
        }
    };
}

In the end, DocValues yields a float value for a given document number, and that value influences the score.
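
The idea of parsing each document's single untokenized term once and then answering lookups from an array can be sketched without Lucene. Everything here is hypothetical: the Values interface merely stands in for DocValues, and the String array stands in for the one term per document that the cache would read from the index.

```java
public class FieldCacheSketch {
    /** Hypothetical stand-in for DocValues: maps a document number to a value. */
    interface Values {
        float floatVal(int doc);
    }

    /** Parses one term per document into a byte array indexed by document number. */
    static Values byteValues(String[] termPerDoc) {
        final byte[] arr = new byte[termPerDoc.length];
        for (int doc = 0; doc < termPerDoc.length; doc++) {
            arr[doc] = Byte.parseByte(termPerDoc[doc]); // parsed once, reused for every lookup
        }
        return new Values() {
            @Override
            public float floatVal(int doc) {
                return arr[doc];
            }
        };
    }

    public static void main(String[] args) {
        Values vals = byteValues(new String[] {"10", "1", "1", "1"});
        for (int doc = 0; doc < 4; doc++) {
            System.out.println("doc " + doc + " -> " + vals.floatVal(doc));
        }
    }
}
```

Because the array is indexed by document number, a lookup during scoring costs one array access and never builds a Document.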

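Returning to the author example: suppose we give each author a float rating and store it in an indexed field. CustomScoreQuery can then fold that rating into the score.

FieldScoreQuery is one implementation of ValueSourceQuery.

For example:

Index the following files: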

file01: apple other other other boy

file02: apple apple other other other

file03: apple apple apple other other

file04: apple apple apple apple other

During indexing, the "scorefield" field of file01 is indexed with the value "10", and the "scorefield" field of every other file is indexed with "1":

Document doc = new Document();
doc.add(new Field("contents", new FileReader(file)));
if (file.getName().contains("01")) {
    doc.add(new Field("scorefield", "10", Field.Store.NO, Field.Index.NOT_ANALYZED));
} else {
    doc.add(new Field("scorefield", "1", Field.Store.NO, Field.Index.NOT_ANALYZED));
}
writer.addDocument(doc);

Running the following query against this index:

TermQuery query = new TermQuery(new Term("contents", "apple"));

gives:

docid : 3 score : 0.67974937
docid : 2 score : 0.58868027
docid : 1 score : 0.4806554
docid : 0 score : 0.33987468

Naturally, documents containing more occurrences of "apple" score higher.

With CustomScoreQuery, however:

TermQuery subquery = new TermQuery(new Term("contents","apple"));
FieldScoreQuery scorefield = new FieldScoreQuery("scorefield", FieldScoreQuery.Type.BYTE);
CustomScoreQuery query = new CustomScoreQuery(subquery, scorefield);

the result is:

docid : 0 score : 1.6466033
docid : 3 score : 0.32932067
docid : 2 score : 0.28520006
docid : 1 score : 0.23286487

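Document 0 clearly jumps to first place because its value-source score was set to 10.

With explain, the scoring details of document 0 for the plain TermQuery are the same fieldWeight breakdown shown in section 1 (score 0.33987468).

For CustomScoreQuery, the scoring details of document 0 are: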

docid : 0 score : 1.6466033
1.6466033 = (MATCH) custom(contents:apple, byte(scorefield)), product of:
  1.6466033 = custom score: product of:
    0.20850874 = (MATCH) weight(contents:apple in 0), product of:
      0.6134871 = queryWeight(contents:apple), product of:
        0.7768564 = idf(docFreq=4, maxDocs=4)
        0.7897047 = queryNorm
      0.33987468 = (MATCH) fieldWeight(contents:apple in 0), product of:
        1.0 = tf(termFreq(contents:apple)=1)
        0.7768564 = idf(docFreq=4, maxDocs=4)
        0.4375 = fieldNorm(field=contents, doc=0)
    7.897047 = (MATCH) byte(scorefield), product of:
      10.0 = byte(scorefield)=10
      1.0 = boost
      0.7897047 = queryNorm
  1.0 = queryBoost
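
The arithmetic behind this explain output can be checked with a small plain-Java sketch. It is not Lucene code and the names are hypothetical. In particular, the queryNorm formula used here, 1 / sqrt(idf^2 + boost^2), is an assumption about how the normalisation is derived; it is included because it reproduces the 0.7897047 shown above.

```java
public class CustomScoreArithmetic {
    /** Assumed normalisation: 1 / sqrt(sum of squared weights of both parts). */
    static float queryNorm(float idf, float valSrcBoost) {
        return (float) (1.0 / Math.sqrt(idf * idf + valSrcBoost * valSrcBoost));
    }

    /** Default custom score: sub-query score multiplied by the value-source score. */
    static float customScore(float idf, float fieldWeight, float valSrcValue, float valSrcBoost) {
        float norm = queryNorm(idf, valSrcBoost);
        float subQueryScore = idf * norm * fieldWeight;       // queryWeight * fieldWeight
        float valSrcScore = valSrcValue * valSrcBoost * norm; // value * boost * queryNorm
        return subQueryScore * valSrcScore;
    }

    public static void main(String[] args) {
        // Document 0: fieldWeight 0.33987468, scorefield value 10.
        System.out.println(customScore(0.7768564f, 0.33987468f, 10f, 1f));
        // Document 3: fieldWeight 0.67974937, scorefield value 1.
        System.out.println(customScore(0.7768564f, 0.67974937f, 1f, 1f));
    }
}
```

Note that the sub-query score inside CustomScoreQuery (0.20850874) is lower than the standalone TermQuery score (0.33987468) because queryNorm now also accounts for the value-source part.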

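3. MoreLikeThisQuery

Before analysing MoreLikeThisQuery, we first introduce MoreLikeThis.

Search applications often need features such as "more similar articles" or "more related questions": given the text of the current document, find similar documents in the index.

MoreLikeThis implements this: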

IndexReader reader = IndexReader.open(……);
IndexSearcher searcher = new IndexSearcher(reader);
MoreLikeThis mlt = new MoreLikeThis(reader);
Reader target = ... // an io Reader over the text of the current document
Query query = mlt.like(target); // build a query object from the current text
TopDocs docs = searcher.search(query, 50); // search for similar documents

The Query like(Reader r) function of MoreLikeThis:

public Query like(Reader r) throws IOException {
    // First extract terms from the text of the current document, then build a query object from those terms.
    return createQuery(retrieveTerms(r));
}

public PriorityQueue<Object[]> retrieveTerms(Reader r) throws IOException {
    Map<String,Int> words = new HashMap<String,Int>();
    // Extract terms per field; the fields to use can be set with void setFieldNames(String[] fieldNames).
    for (int i = 0; i < fieldNames.length; i++) {
        String fieldName = fieldNames[i];
        addTermFrequencies(r, words, fieldName);
    }
    // Put the extracted terms into a priority queue.
    return createQueue(words);
}

private void addTermFrequencies(Reader r, Map<String,Int> termFreqMap, String fieldName) throws IOException {
    // First tokenize the text; the analyzer can be set with void setAnalyzer(Analyzer analyzer).
    TokenStream ts = analyzer.tokenStream(fieldName, r);
    int tokenCount = 0;
    TermAttribute termAtt = ts.addAttribute(TermAttribute.class);
    // Iterate over every token.
    while (ts.incrementToken()) {
        String word = termAtt.term();
        tokenCount++;
        // Stop once the number of tokens exceeds the configured limit, set with void setMaxNumTokensParsed(int i).
        if (tokenCount > maxNumTokensParsed) {
            break;
        }
        // A word shorter than the minimum length, longer than the maximum length, or in the stop word list is a noise word.
        // Minimum length is set with void setMinWordLen(int minWordLen).
        // Maximum length is set with void setMaxWordLen(int maxWordLen).
        // The stop word list is set with void setStopWords(Set<?> stopWords).
        if (isNoiseWord(word)) {
            continue;
        }
        // Count the term frequency tf.
        Int cnt = termFreqMap.get(word);
        if (cnt == null) {
            termFreqMap.put(word, new Int());
        } else {
            cnt.x++;
        }
    }
}

private PriorityQueue createQueue(Map<String,Int> words) throws IOException {
    // Build the priority queue from the collected terms and their frequencies.
    int numDocs = ir.numDocs();
    FreqQ res = new FreqQ(words.size()); // priority queue ordered by tf*idf
    Iterator<String> it = words.keySet().iterator();
    // Iterate over every word.
    while (it.hasNext()) {
        String word = it.next();
        int tf = words.get(word).x;
        // Skip words below the minimum term frequency, set with void setMinTermFreq(int minTermFreq).
        if (minTermFreq > 0 && tf < minTermFreq) {
            continue;
        }
        // Across all fields, find the field that contains this word with the highest doc frequency.
        String topField = fieldNames[0];
        int docFreq = 0;
        for (int i = 0; i < fieldNames.length; i++) {
            int freq = ir.docFreq(new Term(fieldNames[i], word));
            topField = (freq > docFreq) ? fieldNames[i] : topField;
            docFreq = (freq > docFreq) ? freq : docFreq;
        }
        // Skip words below the minimum document frequency, set with void setMinDocFreq(int minDocFreq).
        if (minDocFreq > 0 && docFreq < minDocFreq) {
            continue;
        }
        // Skip words above the maximum document frequency, set with void setMaxDocFreq(int maxFreq).
        if (docFreq > maxDocFreq) {
            continue;
        }
        if (docFreq == 0) {
            continue;
        }
        // Compute the score tf*idf.
        float idf = similarity.idf(docFreq, numDocs);
        float score = tf * idf;
        // Put the Object array into the priority queue; only the first three entries are used, ordered by the third (score).
        res.insertWithOverflow(new Object[]{word,  // the word
                topField,                          // the field
                Float.valueOf(score),              // the score
                Float.valueOf(idf),                // idf
                Integer.valueOf(docFreq),          // document frequency
                Integer.valueOf(tf)                // term frequency
        });
    }
    return res;
}

private Query createQuery(PriorityQueue q) {
    // The end result is a boolean query.
    BooleanQuery query = new BooleanQuery();
    Object cur;
    int qterms = 0;
    float bestScore = 0;
    // Keep taking the highest-scoring word from the queue.
    while (((cur = q.pop()) != null)) {
        Object[] ar = (Object[]) cur;
        TermQuery tq = new TermQuery(new Term((String) ar[1], (String) ar[0]));
        if (boost) {
            if (qterms == 0) {
                // The first word has the highest score; use it as bestScore.
                bestScore = ((Float) ar[2]).floatValue();
            }
            float myScore = ((Float) ar[2]).floatValue();
            // Each word's score divided by the best score, times boostFactor, becomes the boost of its TermQuery,
            // so words that score higher in the current text also carry more weight in the query.
            tq.setBoost(boostFactor * myScore / bestScore);
        }
        try {
            query.add(tq, BooleanClause.Occur.SHOULD);
        } catch (BooleanQuery.TooManyClauses ignore) {
            break;
        }
        qterms++;
        // Stop at the maximum number of query terms, set with void setMaxQueryTerms(int maxQueryTerms).
        if (maxQueryTerms > 0 && qterms >= maxQueryTerms) {
            break;
        }
    }
    return query;
}
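
The whole pipeline (count tf, score by tf*idf, keep the best terms, derive boosts relative to the best score) can be condensed into a Lucene-free sketch. All names below are hypothetical, the document frequencies are passed in as a plain map instead of being read from an index, and the idf formula ln(numDocs / (docFreq + 1)) + 1 is assumed to match the default similarity.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class MoreLikeThisSketch {
    /** Returns up to maxQueryTerms terms, best tf*idf first, each mapped to its boost. */
    static LinkedHashMap<String, Float> selectTerms(List<String> tokens, Map<String, Integer> docFreqs,
            int numDocs, int minTermFreq, int maxQueryTerms, float boostFactor) {
        // Step 1: count the term frequency of every token.
        Map<String, Integer> tf = new HashMap<>();
        for (String token : tokens) {
            tf.merge(token, 1, Integer::sum);
        }
        // Step 2: drop rare or unknown terms and score the rest by tf * idf.
        List<Map.Entry<String, Float>> scored = new ArrayList<>();
        for (Map.Entry<String, Integer> e : tf.entrySet()) {
            int df = docFreqs.getOrDefault(e.getKey(), 0);
            if (e.getValue() < minTermFreq || df == 0) {
                continue;
            }
            float idf = (float) (Math.log(numDocs / (double) (df + 1)) + 1.0);
            scored.add(new AbstractMap.SimpleEntry<String, Float>(e.getKey(), e.getValue() * idf));
        }
        // Step 3: highest score first, like popping from the priority queue.
        scored.sort((a, b) -> Float.compare(b.getValue(), a.getValue()));
        // Step 4: keep the top terms; boost = boostFactor * score / bestScore.
        LinkedHashMap<String, Float> result = new LinkedHashMap<>();
        float best = scored.isEmpty() ? 0f : scored.get(0).getValue();
        for (Map.Entry<String, Float> e : scored) {
            if (result.size() >= maxQueryTerms) {
                break;
            }
            result.put(e.getKey(), boostFactor * e.getValue() / best);
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> tokens = List.of("lucene", "lucene", "lucene", "query", "query", "the");
        Map<String, Integer> docFreqs = Map.of("lucene", 2, "query", 5, "the", 10);
        System.out.println(selectTerms(tokens, docFreqs, 10, 2, 2, 1.0f));
    }
}
```

In the usage above, "the" is dropped by the minimum term frequency, and "lucene" ranks first because it is both frequent in the text and rare in the collection.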

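MoreLikeThisQuery is merely a wrapper around MoreLikeThis. It holds the parameters MoreLikeThis needs and, at rewrite time, has MoreLikeThis.like build the query object.

String likeText; the text of the current document
String[] moreLikeFields; the fields from which query terms are extracted
Analyzer analyzer; the analyzer
float percentTermsToMatch=0.3f; the clauses of the generated BooleanQuery are all SHOULD; this is the fraction of them that must match
int minTermFrequency=1; minimum term frequency
int maxQueryTerms=5; maximum number of query terms
Set<?> stopWords=null; stop word list
int minDocFreq=-1; minimum document frequency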

public Query rewrite(IndexReader reader) throws IOException {
    MoreLikeThis mlt = new MoreLikeThis(reader);
    mlt.setFieldNames(moreLikeFields);
    mlt.setAnalyzer(analyzer);
    mlt.setMinTermFreq(minTermFrequency);
    if (minDocFreq >= 0) {
        mlt.setMinDocFreq(minDocFreq);
    }
    mlt.setMaxQueryTerms(maxQueryTerms);
    mlt.setStopWords(stopWords);
    BooleanQuery bq = (BooleanQuery) mlt.like(new ByteArrayInputStream(likeText.getBytes()));
    BooleanClause[] clauses = bq.getClauses();
    bq.setMinimumNumberShouldMatch((int) (clauses.length * percentTermsToMatch));
    return bq;
}
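
The cast to int in the last step truncates, which matters for small clause counts. A tiny sketch of that one expression (the class and method names are hypothetical):

```java
public class MinShouldMatchSketch {
    /** Same expression as in rewrite(): truncates toward zero. */
    static int minimumShouldMatch(int clauseCount, float percentTermsToMatch) {
        return (int) (clauseCount * percentTermsToMatch);
    }

    public static void main(String[] args) {
        // With the defaults (5 query terms, 0.3f) one clause must match.
        System.out.println(minimumShouldMatch(5, 0.3f));
        // With a single query term the minimum truncates to 0.
        System.out.println(minimumShouldMatch(1, 0.3f));
    }
}
```

So when maxQueryTerms is set to 1, the percentage no longer imposes a minimum; matching is then governed only by the single SHOULD clause.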

As an example, take the post at http://topic.csdn.net/u/20100501/09/64e41f24-e69a-40e3-9058-17487e4f311b.html?1469

We index the posts listed under its related questions, plus some others, 20 documents in total.

File indexDir = new File("TestMoreLikeThisQuery/index");
IndexReader reader = IndexReader.open(FSDirectory.open(indexDir));
IndexSearcher searcher = new IndexSearcher(reader);

// Read 《IT外企那点儿事》 from a file and use it as likeText.
StringBuffer contentBuffer = new StringBuffer();
BufferedReader input = new BufferedReader(new InputStreamReader(new FileInputStream("TestMoreLikeThisQuery/IT外企那点儿事.txt"), "utf-8"));
String line = null;
while ((line = input.readLine()) != null) {
    contentBuffer.append(line);
}
String content = contentBuffer.toString();

// Tokenize with the Chinese Academy of Sciences word segmenter.
MoreLikeThisQuery query = new MoreLikeThisQuery(content, new String[]{"contents"}, new MyAnalyzer(new ChineseAnalyzer()));
// Treat terms that occur in at least 70% of the documents as stop words; a real application may use a different stop word strategy.
query.setStopWords(getStopWords(reader));
// Only terms that occur at least 5 times are considered important.
query.setMinTermFrequency(5);
// Use only one query term.
query.setMaxQueryTerms(1);
TopDocs docs = searcher.search(query, 50);
for (ScoreDoc doc : docs.scoreDocs) {
    Document ldoc = reader.document(doc.doc);
    String title = ldoc.get("title");
    System.out.println(title);
}

static Set<String> getStopWords(IndexReader reader) throws IOException {
    HashSet<String> stop = new HashSet<String>();
    int numOfDocs = reader.numDocs();
    int stopThreshhold = (int) (numOfDocs * 0.7f);
    TermEnum te = reader.terms();
    while (te.next()) {
        String text = te.term().text();
        if (te.docFreq() >= stopThreshhold) {
            stop.add(text);
        }
    }
    return stop;
}

The result:

揭开外企的底儿（连载六）——外企招聘也有潜规则.txt

去央企还是外企，帮忙分析下.txt

哪种英语教材比较适合英语基础差的人.txt

有在达内外企软件工程师就业班培训过的吗.txt

两个月的“骑驴找马”，面试无数家公司的深圳体验.txt

一个看了可能改变你一生的小说《做单》,外企销售经理做单技巧大揭密.txt

HR的至高机密：20个公司绝对不会告诉你的潜规则.txt

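4. MultiTermQuery

Queries of this kind cover one or more Terms. They mainly include FuzzyQuery, PrefixQuery, WildcardQuery, NumericRangeQuery<T>, and TermRangeQuery.

This section discusses the last two.

4.1 TermRangeQuery

In earlier versions of Lucene, the query object for range queries was RangeQuery, which only supported ranges over strings. When Lucene 2.9 added the numeric range query NumericRangeQuery, the original RangeQuery became TermRangeQuery.

Its member variables are:

String lowerTerm; the lower-bound string
String upperTerm; the upper-bound string
boolean includeLower; whether the lower bound is included
boolean includeUpper; whether the upper bound is included
String field; the field
Collator collator; lets the user supply int compare(String source, String target) to decide what counts as greater than or less than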
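
Why a Collator can change which terms fall inside a range is easiest to see by comparing it with plain String comparison. The sketch below uses only java.text.Collator from the JDK; the class name and the choice of Locale.ENGLISH are illustrative assumptions.

```java
import java.text.Collator;
import java.util.Locale;

public class CollatorOrderSketch {
    /** Sign of String.compareTo, which compares UTF-16 code units. */
    static int stringOrder(String a, String b) {
        return Integer.signum(a.compareTo(b));
    }

    /** Sign of a locale-aware comparison. */
    static int collatorOrder(String a, String b) {
        Collator collator = Collator.getInstance(Locale.ENGLISH);
        return Integer.signum(collator.compare(a, b));
    }

    public static void main(String[] args) {
        // 'a' (97) sorts after 'B' (66) by code unit, but before it alphabetically.
        System.out.println("compareTo: " + stringOrder("a", "B"));
        System.out.println("collator:  " + collatorOrder("a", "B"));
    }
}
```

With plain comparison a range from "A" to "Z" excludes "a"; with an English collator it would include it.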

It provides the function FilteredTermEnum getEnum(IndexReader reader) to obtain all Terms that fall within the range:

protected FilteredTermEnum getEnum(IndexReader reader) throws IOException {
    return new TermRangeTermEnum(reader, field, lowerTerm, upperTerm, includeLower, includeUpper, collator);
}

The next function of FilteredTermEnum, which advances to the next Term, is:

public boolean next() throws IOException {
    if (actualEnum == null) return false;
    currentTerm = null;
    while (currentTerm == null) {
        if (endEnum()) return false;
        if (actualEnum.next()) {
            Term term = actualEnum.term();
            if (termCompare(term)) {
                currentTerm = term;
                return true;
            }
        }
        else return false;
    }
    currentTerm = null;
    return false;
}

It calls termCompare to decide whether a Term lies within the range. The termCompare of TermRangeTermEnum is:

protected boolean termCompare(Term term) {
    if (collator == null) {
        // No collator was set, so use plain string comparison.
        boolean checkLower = false;
        if (!includeLower)
            checkLower = true;
        if (term != null && term.field() == field) {
            if (!checkLower || null==lowerTermText || term.text().compareTo(lowerTermText) > 0) {
                checkLower = false;
                if (upperTermText != null) {
                    int compare = upperTermText.compareTo(term.text());
                    // Past the upper bound, or equal to it when it is exclusive: stop enumerating.
                    if ((compare < 0) ||
                        (!includeUpper && compare==0)) {
                        endEnum = true;
                        return false;
                    }
                }
                return true;
            }
        } else {
            endEnum = true;
            return false;
        }
        return false;
    } else {
        // A collator was set, so use it to compare the strings.
        if (term != null && term.field() == field) {
            if ((lowerTermText == null
                    || (includeLower
                        ? collator.compare(term.text(), lowerTermText) >= 0
                        : collator.compare(term.text(), lowerTermText) > 0))
                && (upperTermText == null
                    || (includeUpper
                        ? collator.compare(term.text(), upperTermText) <= 0
                        : collator.compare(term.text(), upperTermText) < 0))) {
                return true;
            }
            return false;
        }
        endEnum = true;
        return false;
    }
}
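
The non-collator branch amounts to walking the sorted term list from the lower bound and stopping at the first term past the upper bound. A Lucene-free sketch of that walk follows; the names are hypothetical, a TreeSet stands in for the sorted term dictionary, and starting the walk at the lower bound is an assumption about how the enumeration is positioned.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.SortedSet;
import java.util.TreeSet;

public class TermRangeSketch {
    /** Collects the terms between lower and upper using plain String order. */
    static List<String> termsInRange(SortedSet<String> terms, String lower, String upper,
                                     boolean includeLower, boolean includeUpper) {
        List<String> hits = new ArrayList<>();
        // Assumption: the enumeration starts at the lower bound, not at the first term.
        SortedSet<String> tail = (lower == null) ? terms : terms.tailSet(lower);
        for (String term : tail) {
            if (!includeLower && term.equals(lower)) {
                continue; // exclusive lower bound
            }
            if (upper != null) {
                int compare = upper.compareTo(term);
                if (compare < 0 || (!includeUpper && compare == 0)) {
                    break; // past the upper bound: the counterpart of endEnum = true
                }
            }
            hits.add(term);
        }
        return hits;
    }

    public static void main(String[] args) {
        SortedSet<String> terms = new TreeSet<>(Arrays.asList("1", "2", "5", "9", "10"));
        // String order puts "10" between "1" and "2".
        System.out.println(termsInRange(terms, "1", "5", true, true));
    }
}
```

The usage shows why string ranges are a poor fit for numbers: "10" lands inside the range "1" to "5". That limitation is what a dedicated numeric range query avoids.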

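As the earlier analysis of MultiTermQuery's rewrite showed, TermRangeQuery may be rewritten into a BooleanQuery. When the range is too wide or contains too many Terms, this can throw a TooManyClauses exception.

An alternative is TermRangeFilter, which is not turned into a query object but filters the query results instead; it is covered in detail in the part on Filters.

4.2 NumericRangeQuery

Lucene has supported numeric ranges since version 2.9. To use this query, however, the field must be added with NumericField: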

document.add(new NumericField(name).setIntValue(value));

or added with NumericTokenStream:

Field field = new Field(name, new NumericTokenStream(precisionStep).setIntValue(value));
field.setOmitNorms(true);
field.setOmitTermFreqAndPositions(true);
document.add(field);

Depending on the value type, a NumericRangeQuery is created with one of the following methods:

newDoubleRange(String, Double, Double, boolean, boolean)
newFloatRange(String, Float, Float, boolean, boolean)
newIntRange(String, Integer, Integer, boolean, boolean)
newLongRange(String, Long, Long, boolean, boolean)

public static NumericRangeQuery<Integer> newIntRange(final String field, Integer min, Integer max, final boolean minInclusive, final boolean maxInclusive) {
    return new NumericRangeQuery<Integer>(field, NumericUtils.PRECISION_STEP_DEFAULT, 32, min, max, minInclusive, maxInclusive);
}

It provides the function FilteredTermEnum getEnum(IndexReader reader) to obtain all Terms that fall within the range:

protected FilteredTermEnum getEnum(final IndexReader reader) throws IOException {
    return new NumericRangeTermEnum(reader);
}

The termCompare of NumericRangeTermEnum:

protected boolean termCompare(Term term) {
    return (term.field() == field && term.text().compareTo(currentUpperBound) <= 0);
}

An alternative is NumericRangeFilter, which is discussed in detail later.

As an example, we add ten documents with ids 0 through 9 to the index:

Document doc = new Document();
doc.add(new Field("contents", new FileReader(file)));
String name = file.getName();
Integer id = Integer.parseInt(name);
doc.add(new NumericField("id").setIntValue(id));
writer.addDocument(doc);

At search time, create the NumericRangeQuery:

File indexDir = new File("TestNumericRangeQuery/index");
IndexReader reader = IndexReader.open(FSDirectory.open(indexDir));
IndexSearcher searcher = new IndexSearcher(reader);
NumericRangeQuery<Integer> query = NumericRangeQuery.newIntRange("id", 3, 6, true, false);
TopDocs docs = searcher.search(query, 50);
for (ScoreDoc doc : docs.scoreDocs) {
    System.out.println("docid : " + doc.doc + " score : " + doc.score);
}

The result:

docid : 3 score : 1.0
docid : 4 score : 1.0
docid : 5 score : 1.0
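
The inclusive and exclusive flags explain why id 6 is missing from the result. A plain-Java sketch of the range test follows; it is not how Lucene evaluates the query internally, the names are hypothetical, and treating a null bound as open-ended is an assumption made for the illustration.

```java
import java.util.ArrayList;
import java.util.List;

public class NumericRangeSketch {
    /** Returns the ids that satisfy the range; a null bound is treated as open. */
    static List<Integer> inRange(int[] ids, Integer min, Integer max,
                                 boolean minInclusive, boolean maxInclusive) {
        List<Integer> hits = new ArrayList<>();
        for (int id : ids) {
            boolean aboveMin = min == null || (minInclusive ? id >= min : id > min);
            boolean belowMax = max == null || (maxInclusive ? id <= max : id < max);
            if (aboveMin && belowMax) {
                hits.add(id);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        int[] ids = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9};
        // Same bounds as newIntRange("id", 3, 6, true, false).
        System.out.println(inRange(ids, 3, 6, true, false));
    }
}
```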

Reposted from: http://forfuture1978.iteye.com/blog/669444

2010-06-08 11:27