
Extended search in Lucene 3.6.0

 

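Custom sorting

IndexSearcher.java: custom sorting covers cases such as dynamically computing which of the stored restaurants are nearest to, or farthest from, a given location.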
  /** Expert: Low-level search implementation with arbitrary sorting.  Finds
   * the top <code>n</code> hits for <code>query</code>, applying
   * <code>filter</code> if non-null, and sorting the hits by the criteria in
   * <code>sort</code>.
   *
   * <p>Applications should usually call {@link
   * Searcher#search(Query,Filter,int,Sort)} instead.
   * 
   * @throws BooleanQuery.TooManyClauses
   */
  @Override
  public TopFieldDocs search(Weight weight, Filter filter,
      final int nDocs, Sort sort) throws IOException {
    return search(weight, filter, nDocs, sort, true);
  }



SortField.java
  /** Creates a sort with a custom comparison function.
   * @param field Name of field to sort by; cannot be <code>null</code>.
   * @param comparator Returns a comparator for sorting hits.
   */
  public SortField(String field, FieldComparatorSource comparator) {
    initFieldType(field, CUSTOM);
    this.comparatorSource = comparator;
  }

FieldComparatorSource.java
/**
 * Provides a {@link FieldComparator} for custom field sorting.
 *
 * @lucene.experimental
 *
 */
public abstract class FieldComparatorSource implements Serializable {

  /**
   * Creates a comparator for the field in the given index.
   * 
   * @param fieldname
   *          Name of the field to create comparator for.
   * @return FieldComparator.
   * @throws IOException
   *           If an error occurs reading the index.
   */
  public abstract FieldComparator<?> newComparator(String fieldname, int numHits, int sortPos, boolean reversed)
      throws IOException;
}
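
The ordering logic that such a custom comparator would encapsulate for the restaurant case can be sketched in plain Java, with no Lucene dependency. This is only an illustration under stated assumptions: the `Restaurant` class, the integer x/y coordinates, and `sortByDistance` are hypothetical names that do not come from Lucene, and a real FieldComparator would read the coordinates from the index for each segment rather than from an in-memory list.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

final class DistanceSortSketch {

    /** Hypothetical stand-in for an indexed restaurant with stored coordinates. */
    static final class Restaurant {
        final String name;
        final int x;
        final int y;

        Restaurant(String name, int x, int y) {
            this.name = name;
            this.x = x;
            this.y = y;
        }
    }

    /** Straight-line distance from the point (x, y) to the restaurant. */
    static double distance(Restaurant r, int x, int y) {
        double dx = r.x - x;
        double dy = r.y - y;
        return Math.sqrt(dx * dx + dy * dy);
    }

    /** Returns the names of the hits ordered by distance from (x, y). */
    static List<String> sortByDistance(List<Restaurant> hits, int x, int y,
                                       boolean farthestFirst) {
        // The distance is computed at query time, not stored in the index.
        Comparator<Restaurant> byDistance =
                Comparator.comparingDouble(r -> distance(r, x, y));
        if (farthestFirst) {
            byDistance = byDistance.reversed();
        }
        List<Restaurant> sorted = new ArrayList<>(hits);
        sorted.sort(byDistance);
        List<String> names = new ArrayList<>();
        for (Restaurant r : sorted) {
            names.add(r.name);
        }
        return names;
    }
}
```

The point of the sketch is that the sort key depends on the query location, so it cannot be precomputed as an ordinary indexed field; that is the gap a FieldComparatorSource is meant to fill.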


Further computation on, or processing of, query results
Collector.java
* <p><b>NOTE:</b> The doc that is passed to the collect
 * method is relative to the current reader. If your
 * collector needs to resolve this to the docID space of the
 * Multi*Reader, you must re-base it by recording the
 * docBase from the most recent setNextReader call.  Here's
 * a simple example showing how to collect docIDs into a
 * BitSet:</p>
 * 
 * <pre>
 * Searcher searcher = new IndexSearcher(indexReader);
 * final BitSet bits = new BitSet(indexReader.maxDoc());
 * searcher.search(query, new Collector() {
 *   private int docBase;
 * 
 *   <em>// ignore scorer</em>
 *   public void setScorer(Scorer scorer) {
 *   }
 *
 *   <em>// accept docs out of order (for a BitSet it doesn't matter)</em>
 *   public boolean acceptsDocsOutOfOrder() {
 *     return true;
 *   }
 * 
 *   public void collect(int doc) {
 *     bits.set(doc + docBase);
 *   }
 * 
 *   public void setNextReader(IndexReader reader, int docBase) {
 *     this.docBase = docBase;
 *   }
 * });
 * </pre>
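
The re-basing arithmetic described in that note can also be tried without an index. The following is a rough plain-Java sketch, not Lucene code: `segmentSizes` stands in for the maxDoc() of each segment reader, `segmentHits` for the segment-relative doc IDs passed to collect, and both names are assumptions made for illustration.

```java
import java.util.BitSet;

final class DocBaseSketch {

    /**
     * Maps segment-relative doc IDs into one global BitSet.
     * segmentSizes[i] plays the role of maxDoc() for segment i;
     * segmentHits[i] lists the segment-relative doc IDs that matched there.
     */
    static BitSet collect(int[] segmentSizes, int[][] segmentHits) {
        int total = 0;
        for (int size : segmentSizes) {
            total += size;
        }
        BitSet bits = new BitSet(total);
        int docBase = 0; // sum of the sizes of all earlier segments
        for (int seg = 0; seg < segmentSizes.length; seg++) {
            for (int doc : segmentHits[seg]) {
                bits.set(doc + docBase); // same re-basing as in collect(int doc)
            }
            docBase += segmentSizes[seg];
        }
        return bits;
    }
}
```

With two segments of 3 and 5 documents, doc 1 of the second segment lands at global position 4, because the first segment occupies positions 0 to 2.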

Extending QueryParser
1. Disabling fuzzy and wildcard queries
    /**
   * Builds a new FuzzyQuery instance
   * @param term Term
   * @param minimumSimilarity minimum similarity
   * @param prefixLength prefix length
   * @return new FuzzyQuery Instance
   */
  protected Query newFuzzyQuery(Term term, float minimumSimilarity, int prefixLength) {
    // FuzzyQuery doesn't yet allow constant score rewrite
    return new FuzzyQuery(term,minimumSimilarity,prefixLength);  // to disable fuzzy queries, remove this and throw an exception instead
  }

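Custom filters. Some properties of a search result change often, so a document may need to be filtered out during one period and kept during another. If such a property is added to the index as a field, the results can become inaccurate. A better approach is to define a filter that applies specific rules to the search. Take best-selling books as an example: a book may be a best seller for a while and then stop being one, so storing an "is best seller" flag as an indexed field is a poor fit. Instead, a custom Filter can compute whether each doc should be filtered out.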
  

/** 
 *  Abstract base class for restricting which documents may
 *  be returned during searching.
 */
public abstract class Filter implements java.io.Serializable {
  
  /**
   * Creates a {@link DocIdSet} enumerating the documents that should be
   * permitted in search results. <b>NOTE:</b> null can be
   * returned if no documents are accepted by this Filter.
   * <p>
   * Note: This method will be called once per segment in
   * the index during searching.  The returned {@link DocIdSet}
   * must refer to document IDs for that segment, not for
   * the top-level reader.
   * 
   * @param reader a {@link IndexReader} instance opened on the index currently
   *         searched on. Note, it is likely that the provided reader does not
   *         represent the whole underlying index i.e. if the index has more than
   *         one segment the given reader only represents a single segment.
   *          
   * @return a DocIdSet that provides the documents which should be permitted or
   *         prohibited in search results. <b>NOTE:</b> null can be returned if
   *         no documents will be accepted by this Filter.
   * 
   * @see DocIdBitSet
   */
  public abstract DocIdSet getDocIdSet(IndexReader reader) throws IOException;
}

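A DocIdSet can be thought of as a sequence of bits in which each bit position corresponds to a doc ID. If a bit is set to 1, that document may appear in the search results; otherwise it will not.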
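
A minimal plain-Java sketch of this bit-per-document idea, applied to the best-seller case. It does not use Lucene: `isbnByDocId` (a lookup from doc ID to ISBN) and `currentBestSellers` (a list maintained outside the index) are hypothetical, and a real Filter would build such a set once per segment inside getDocIdSet.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;
import java.util.Set;

final class BestSellerFilterSketch {

    /** Builds one bit per doc ID; a set bit means the doc is permitted. */
    static BitSet permittedDocs(String[] isbnByDocId, Set<String> currentBestSellers) {
        BitSet bits = new BitSet(isbnByDocId.length);
        for (int docId = 0; docId < isbnByDocId.length; docId++) {
            // The rule is evaluated at search time against external data,
            // so the index itself never stores the best-seller flag.
            if (currentBestSellers.contains(isbnByDocId[docId])) {
                bits.set(docId);
            }
        }
        return bits;
    }

    /** Keeps only the hits whose bit is set, preserving their order. */
    static List<Integer> applyFilter(int[] hits, BitSet permitted) {
        List<Integer> kept = new ArrayList<>();
        for (int docId : hits) {
            if (permitted.get(docId)) {
                kept.add(docId);
            }
        }
        return kept;
    }
}
```

Because the best-seller list lives outside the index, changing it only changes the bits computed by the next search; no documents need to be re-indexed.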

FilteredQuery.java: wraps a query together with a filter; the two are combined into the final query expression that is actually executed.

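Performance issues:
1. Lucene internally rewrites a RangeQuery into a BooleanQuery, that is, an OR expression over the terms that fall inside the range.

If the range expands to more than 1024 terms (the default maxClauseCount), a TooManyClauses exception is thrown.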

  /** Thrown when an attempt is made to add more than {@link
   * #getMaxClauseCount()} clauses. This typically happens if
   * a PrefixQuery, FuzzyQuery, WildcardQuery, or TermRangeQuery 
   * is expanded to many terms during search. 
   */
  public static class TooManyClauses extends RuntimeException {
    public TooManyClauses() {
      super("maxClauseCount is set to " + maxClauseCount);
    }
  }
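
To see why a wide range can hit this limit, the expansion can be modelled in plain Java as one OR clause per distinct indexed term inside the range. This is a sketch under stated assumptions and not Lucene's implementation: `expandedClauseCount` and the `indexedTerms` set are hypothetical, IllegalStateException stands in for TooManyClauses, and 1024 is the default value of maxClauseCount.

```java
import java.util.NavigableSet;

final class RangeExpansionSketch {

    /** Default clause limit, mirroring BooleanQuery's default maxClauseCount. */
    static final int MAX_CLAUSE_COUNT = 1024;

    /**
     * Counts the OR clauses that an inclusive range [lower, upper] would
     * expand to: one clause per distinct indexed term inside the range.
     * Throws once the count exceeds the limit.
     */
    static int expandedClauseCount(NavigableSet<String> indexedTerms,
                                   String lower, String upper) {
        int clauses = indexedTerms.subSet(lower, true, upper, true).size();
        if (clauses > MAX_CLAUSE_COUNT) {
            throw new IllegalStateException(
                    "maxClauseCount is set to " + MAX_CLAUSE_COUNT);
        }
        return clauses;
    }
}
```

The cost depends on how many distinct terms the index holds inside the range, not on how far apart the two endpoints are, so the same range can be harmless on a small index and fail on a large one.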
 