Lucene-2.2.0 Source Code Reading Notes (35)

pavel

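About MultiPhraseQuery (multi-phrase query).

MultiPhraseQuery builds complex queries by combining several phrases.

For example, suppose the index is built with StandardAnalyzer, which indexes each Chinese character as a separate term. With this analyzer the index contains no two-character term such as "今天", so a search for "今天" as a single term cannot return any results.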
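To make the single-character behaviour concrete, here is a minimal sketch in plain Java. It does not call Lucene: the class SingleCharTerms and its tokenize method are hypothetical names, and the rule "one term per letter or digit, punctuation dropped" is only an assumption that approximates how the index in this article looks for Chinese text.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Lucene: approximates an index in which
// every Chinese character is its own term.
class SingleCharTerms {

    // Emit one term per letter or digit; punctuation and whitespace are skipped.
    static List<String> tokenize(String text) {
        List<String> terms = new ArrayList<String>();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isLetterOrDigit(c)) {
                terms.add(String.valueOf(c));
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        List<String> terms = tokenize("今天是我们地球的生日。");
        System.out.println(terms);                  // [今, 天, 是, 我, 们, 地, 球, 的, 生, 日]
        System.out.println(terms.contains("今天")); // false: no two-character term exists
        System.out.println(terms.contains("今"));   // true
    }
}
```

Because "今天" never appears as one term, a query has to describe it as "今" immediately followed by "天", which is what a phrase-style query does.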

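There are, of course, several ways around this, and MultiPhraseQuery is one of them:

You can specify one prefix, such as "今", and a Term[] array of suffixes, such as {new Term("contents","年"),new Term("contents","天")}. The query then returns every Document that contains the phrase "今年" or the phrase "今天".

You can also specify one suffix with multiple prefixes, and you can set a slop, which is the maximum number of terms allowed between the prefix and the suffix.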
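The prefix and suffix idea can be sketched without Lucene as a position-by-position check: each position of the phrase carries a set of acceptable terms, and a document matches if some run of consecutive tokens satisfies every position. The class AdjacentAlternatives and its matches method are hypothetical names used only for illustration; this is a plain-Java simulation of the slop 0 case, not Lucene's actual implementation.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical simulation of an exact (slop 0) phrase with alternatives
// at each position. It does not use Lucene.
class AdjacentAlternatives {

    // True if, for some start index, tokens[start + k] is one of
    // alternatives[k] for every position k of the phrase.
    static boolean matches(List<String> tokens, String[][] alternatives) {
        for (int start = 0; start + alternatives.length <= tokens.size(); start++) {
            boolean ok = true;
            for (int k = 0; k < alternatives.length && ok; k++) {
                ok = Arrays.asList(alternatives[k]).contains(tokens.get(start + k));
            }
            if (ok) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Position 0 must be "今"; position 1 may be "年" or "天".
        String[][] query = { {"今"}, {"年", "天"} };
        System.out.println(matches(Arrays.asList("今", "天", "是"), query)); // true
        System.out.println(matches(Arrays.asList("今", "晚"), query));       // false
        System.out.println(matches(Arrays.asList("天", "今"), query));       // false
    }
}
```

Swapping which position holds the array gives the "multiple prefixes, one suffix" form; putting arrays in both positions gives the combined form.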

The tests below show how MultiPhraseQuery behaves in each case.

I distinguish four cases:

One prefix, multiple suffixes

The main program is as follows:

package org.apache.lucene.shirdrn.main;

import java.io.IOException;
import java.util.Date;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.CorruptIndexException;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.MultiPhraseQuery;
import org.apache.lucene.store.LockObtainFailedException;

public class MultiPhraseQuerySearcher {

    private String path = "E:\\Lucene\\index";
    private MultiPhraseQuery multiPhraseQuery;

    public MultiPhraseQuerySearcher(){
        multiPhraseQuery = new MultiPhraseQuery();
    }

    public void createIndex(){    // build the index
        String indexPath = "E:\\Lucene\\index";
        IndexWriter writer;
        try {
            //writer = new IndexWriter(indexPath,new ThesaurusAnalyzer(),true);
            writer = new IndexWriter(indexPath,new StandardAnalyzer(),true);
            Field fieldA = new Field("contents","今天是我们地球的生日。",Field.Store.YES,Field.Index.TOKENIZED);
            Document docA = new Document();
            docA.add(fieldA);

            Field fieldB1 = new Field("contents","今晚的辩题很道地：谁知道宇宙空间的奥秘，在我们这些人当中？",Field.Store.YES,Field.Index.TOKENIZED);
            Field fieldB2 = new Field("contents","我认为电影《今朝》是一部不错的影片，尤其是在今天，天涯海角到哪里找啊。",Field.Store.YES,Field.Index.TOKENIZED);
            Field fieldB3 = new Field("contents","长今到底是啥意思呢？",Field.Store.YES,Field.Index.TOKENIZED);
            Document docB = new Document();
            docB.add(fieldB1);
            docB.add(fieldB2);
            docB.add(fieldB3);

            Field fieldC1 = new Field("contents","宇宙学家对地球的重要性，今非昔比。",Field.Store.YES,Field.Index.TOKENIZED);
            // Note: fieldC2 is created but never added to docC, so it is not indexed.
            Field fieldC2 = new Field("contents","衣带渐宽终不悔，为伊消得人憔悴。",Field.Store.YES,Field.Index.TOKENIZED);
            Document docC = new Document();
            docC.add(fieldC1);

            writer.addDocument(docA);
            writer.addDocument(docB);
            writer.addDocument(docC);
            writer.close();
        } catch (CorruptIndexException e) {
            e.printStackTrace();
        } catch (LockObtainFailedException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    public void useMultiPrefixExample(){    // multiple prefixes, one suffix
        Term termA = new Term("contents","道");
        Term termB = new Term("contents","对");
        multiPhraseQuery.add(new Term[]{termA,termB});
        Term termC = new Term("contents","地");
        multiPhraseQuery.add(termC);
    }

    public void useMultiSuffixExample(){    // one prefix, multiple suffixes
        Term termA = new Term("contents","今");
        multiPhraseQuery.add(termA);
        Term termB = new Term("contents","天");
        Term termC = new Term("contents","晚");
        Term termD = new Term("contents","非");
        multiPhraseQuery.add(new Term[]{termB,termC,termD});
    }

    public void useMultiPrefixAndMultiSuffixExample(){    // multiple prefixes, multiple suffixes
        Term termA = new Term("contents","生");
        Term termB = new Term("contents","今");
        multiPhraseQuery.add(new Term[]{termA,termB});
        Term termC = new Term("contents","非");
        Term termD = new Term("contents","日");
        Term termE = new Term("contents","朝");
        multiPhraseQuery.add(new Term[]{termC,termD,termE});
    }

    public void useSetSlopExample(){    // phrase with a slop
        Term termA = new Term("contents","我");
        multiPhraseQuery.add(termA);
        Term termB = new Term("contents","影");
        Term termC = new Term("contents","球");
        multiPhraseQuery.add(new Term[]{termB,termC});
        multiPhraseQuery.setSlop(5);
    }

    public static void main(String[] args) {
        MultiPhraseQuerySearcher mpqs = new MultiPhraseQuerySearcher();
        mpqs.createIndex();
        mpqs.useMultiSuffixExample();    // run the "one prefix, multiple suffixes" case
        try {
            Date startTime = new Date();
            IndexSearcher searcher = new IndexSearcher(mpqs.path);
            Hits hits = searcher.search(mpqs.multiPhraseQuery);
            System.out.println("********************************************************************");
            for(int i=0;i<hits.length();i++){
                System.out.println("Document的内部编号为： "+hits.id(i));    // internal Document id
                System.out.println("Document内容为： "+hits.doc(i));         // Document content
                System.out.println("Document的得分为： "+hits.score(i));     // Document score
            }
            System.out.println("********************************************************************");
            System.out.println("共检索出符合条件的Document "+hits.length()+" 个。");    // number of matching Documents
            Date finishTime = new Date();
            long timeOfSearch = finishTime.getTime() - startTime.getTime();
            System.out.println("本次搜索所用的时间为 "+timeOfSearch+" ms");    // search time in ms
            searcher.close();    // release the index once the hits have been printed
        } catch (CorruptIndexException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}

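The search results are as follows (for each hit the output lists the internal Document id, the Document content, and the Document score; the last two lines give the number of matching Documents and the search time):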

********************************************************************
Document的内部编号为： 0
Document内容为： Document<stored/uncompressed,indexed,tokenized<contents:今天是我们地球的生日。>>
Document的得分为： 1.0
Document的内部编号为： 2
Document内容为： Document<stored/uncompressed,indexed,tokenized<contents:宇宙学家对地球的重要性，今非昔比。>>
Document的得分为： 0.79999995
Document的内部编号为： 1
Document内容为： Document<stored/uncompressed,indexed,tokenized<contents:今晚的辩题很道地：谁知道宇宙空间的奥秘，在我们这些人当中？> stored/uncompressed,indexed,tokenized<contents:我认为电影《今朝》是一部不错的影片，尤其是在今天，天涯海角到哪里找啊。> stored/uncompressed,indexed,tokenized<contents:长今到底是啥意思呢？>>
Document的得分为： 0.5656854
********************************************************************
共检索出符合条件的Document 3 个。
本次搜索所用的时间为 141 ms

This follows from the method used above:

public void useMultiSuffixExample(){   // one prefix, multiple suffixes
   Term termA = new Term("contents","今");
   multiPhraseQuery.add(termA);
   Term termB = new Term("contents","天");
   Term termC = new Term("contents","晚");
   Term termD = new Term("contents","非");
   multiPhraseQuery.add(new Term[]{termB,termC,termD});
}

The query takes "今" as the only prefix and accepts "天", "晚", or "非" as the suffix. As the results show, every Document containing "今天", "今晚", or "今非" was retrieved.

Multiple prefixes, one suffix

This case is tested by calling the useMultiPrefixExample() method.

public void useMultiPrefixExample(){
   Term termA = new Term("contents","道");
   Term termB = new Term("contents","对");
   multiPhraseQuery.add(new Term[]{termA,termB});
   Term termC = new Term("contents","地");
   multiPhraseQuery.add(termC);
}

In the main function, change mpqs.useMultiSuffixExample(); to mpqs.useMultiPrefixExample();. The test results are as follows:

********************************************************************
Document的内部编号为： 2
Document内容为： Document<stored/uncompressed,indexed,tokenized<contents:宇宙学家对地球的重要性，今非昔比。>>
Document的得分为： 0.88081205
Document的内部编号为： 1
Document内容为： Document<stored/uncompressed,indexed,tokenized<contents:今晚的辩题很道地：谁知道宇宙空间的奥秘，在我们这些人当中？> stored/uncompressed,indexed,tokenized<contents:我认为电影《今朝》是一部不错的影片，尤其是在今天，天涯海角到哪里找啊。> stored/uncompressed,indexed,tokenized<contents:长今到底是啥意思呢？>>
Document的得分为： 0.44040602
********************************************************************
共检索出符合条件的Document 2 个。
本次搜索所用的时间为 94 ms

The goal of this test was to retrieve the Documents containing "道地" or "对地", and the results match that expectation.

Multiple prefixes, multiple suffixes

Here the prefix Term[] array is matched against the suffix Term[] array: every Term in the prefix array is combined with every Term in the suffix array, and the query looks for each resulting phrase.

public void useMultiPrefixAndMultiSuffixExample(){
   Term termA = new Term("contents","生");
   Term termB = new Term("contents","今");
   multiPhraseQuery.add(new Term[]{termA,termB});
   Term termC = new Term("contents","非");
   Term termD = new Term("contents","日");
   Term termE = new Term("contents","朝");
   multiPhraseQuery.add(new Term[]{termC,termD,termE});
}

This gives 6 combinations in total: "生非", "生日", "生朝", "今非", "今日", "今朝". A Document is returned if the index shows that it contains at least one of these phrases.
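The six combinations are simply the Cartesian product of the two arrays. The following plain-Java sketch enumerates them; PhraseCombinations and combine are hypothetical names, and the code only illustrates the counting argument, not how Lucene evaluates the query internally.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: lists every prefix + suffix pairing.
class PhraseCombinations {

    // Pair each prefix with each suffix, keeping the order of both arrays.
    static List<String> combine(String[] prefixes, String[] suffixes) {
        List<String> phrases = new ArrayList<String>();
        for (String prefix : prefixes) {
            for (String suffix : suffixes) {
                phrases.add(prefix + suffix);
            }
        }
        return phrases;
    }

    public static void main(String[] args) {
        List<String> phrases = combine(new String[]{"生", "今"},
                                       new String[]{"非", "日", "朝"});
        System.out.println(phrases.size()); // 6
        System.out.println(phrases);        // [生非, 生日, 生朝, 今非, 今日, 今朝]
    }
}
```

In general, m prefixes and n suffixes describe m × n candidate phrases with a single query object.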

In the main function, change mpqs.useMultiSuffixExample(); to mpqs.useMultiPrefixAndMultiSuffixExample();. The test results are as follows:

********************************************************************
Document的内部编号为： 0
Document内容为： Document<stored/uncompressed,indexed,tokenized<contents:今天是我们地球的生日。>>
Document的得分为： 1.0
Document的内部编号为： 2
Document内容为： Document<stored/uncompressed,indexed,tokenized<contents:宇宙学家对地球的重要性，今非昔比。>>
Document的得分为： 0.8
Document的内部编号为： 1
Document内容为： Document<stored/uncompressed,indexed,tokenized<contents:今晚的辩题很道地：谁知道宇宙空间的奥秘，在我们这些人当中？> stored/uncompressed,indexed,tokenized<contents:我认为电影《今朝》是一部不错的影片，尤其是在今天，天涯海角到哪里找啊。> stored/uncompressed,indexed,tokenized<contents:长今到底是啥意思呢？>>
Document的得分为： 0.4
********************************************************************
共检索出符合条件的Document 3 个。
本次搜索所用的时间为 110 ms

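Setting the slop range

The default slop is 0, which means the terms must be directly adjacent to form the phrase.

Once a slop is set, any match whose gap is less than or equal to (<=) the slop value is considered a hit.

Call the following method:

public void useSetSlopExample(){
   Term termA = new Term("contents","我");
   multiPhraseQuery.add(termA);
   Term termB = new Term("contents","影");
   Term termC = new Term("contents","球");
   multiPhraseQuery.add(new Term[]{termB,termC});
   multiPhraseQuery.setSlop(5);
}

That is, any of the following patterns counts as a match:

我球, 我■球, 我■■球, 我■■■球, 我■■■■球, 我■■■■■球

我影, 我■影, 我■■影, 我■■■影, 我■■■■影, 我■■■■■影

Here each "■" stands for one term that is unrelated to the query, that is, one position of gap.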
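The patterns above can be checked with a small gap counter. This is a plain-Java sketch under stated assumptions: SlopGap and withinSlop are hypothetical names, the token list is the single-character form of the first test sentence, and only the in-order patterns listed above are covered. Lucene's own slop handling is more general (it is documented as an edit distance over term positions), so this is not a reimplementation of it.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical simulation of the in-order slop patterns; it does not use Lucene.
class SlopGap {

    // True if some occurrence of `first` is followed by one of `seconds`
    // with at most `slop` unrelated terms in between.
    static boolean withinSlop(List<String> tokens, String first,
                              List<String> seconds, int slop) {
        for (int i = 0; i < tokens.size(); i++) {
            if (!tokens.get(i).equals(first)) {
                continue;
            }
            // j - i - 1 is the number of terms between the two positions.
            for (int j = i + 1; j < tokens.size() && j - i - 1 <= slop; j++) {
                if (seconds.contains(tokens.get(j))) {
                    return true;
                }
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // "今天是我们地球的生日。" as single-character terms, punctuation dropped.
        List<String> tokens = Arrays.asList("今", "天", "是", "我", "们", "地", "球", "的", "生", "日");
        List<String> suffixes = Arrays.asList("影", "球");
        System.out.println(withinSlop(tokens, "我", suffixes, 5)); // true: 我■■球 has a gap of 2
        System.out.println(withinSlop(tokens, "我", suffixes, 1)); // false: the gap of 2 exceeds 1
    }
}
```

With a gap of 2 between "我" and "球", this sentence would already match at slop 2, so the value 5 used in the article leaves room to spare.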

Running the test gives the following results:

********************************************************************
Document的内部编号为： 0
Document内容为： Document<stored/uncompressed,indexed,tokenized<contents:今天是我们地球的生日。>>
Document的得分为： 0.6144207
Document的内部编号为： 1
Document内容为： Document<stored/uncompressed,indexed,tokenized<contents:今晚的辩题很道地：谁知道宇宙空间的奥秘，在我们这些人当中？> stored/uncompressed,indexed,tokenized<contents:我认为电影《今朝》是一部不错的影片，尤其是在今天，天涯海角到哪里找啊。> stored/uncompressed,indexed,tokenized<contents:长今到底是啥意思呢？>>
Document的得分为： 0.21284157
********************************************************************
共检索出符合条件的Document 2 个。
本次搜索所用的时间为 109 ms

Summary

As the cases above show, MultiPhraseQuery is flexible and convenient to use; which form to choose depends on the specific application.


2009-02-06 15:03