`

lucene3.6.0的高级搜索相关技术

 
阅读更多

高级搜索技术:

排序 默认排序按照相关性,
public class Sort
implements Serializable {

  /**
   * Represents sorting by computed relevance. Using this sort criteria returns
   * the same results as calling
   * {@link Searcher#search(Query,int) Searcher#search()}without a sort criteria,
   * only with slightly more overhead.
   */
  public static final Sort RELEVANCE = new Sort();  相关性

  /** Represents sorting by index order. */
  public static final Sort INDEXORDER = new Sort(SortField.FIELD_DOC);  按照索引顺序,跟相关性顺序不一样

  /** Sorts by the criteria in the given SortField. */
  public Sort(SortField field) {
    setSort(field);
  }

指定排序字段之后,如果排序字段相同,则按照索引顺序再进行排序。

排序字段SortField.java


 /** Represents sorting by document score (relevance). */
  public static final SortField FIELD_SCORE = new SortField(null, SCORE);

  /** Represents sorting by document number (index order). */
  public static final SortField FIELD_DOC = new SortField(null, DOC);

  private String field;
  private int type;  // defaults to determining type dynamically
  private Locale locale;    // defaults to "natural order" (no Locale)
  boolean reverse = false;  // defaults to natural order
  private FieldCache.Parser parser;

  // Used for CUSTOM sort
  private FieldComparatorSource comparatorSource;

  private Object missingValue;
  /** Creates a sort, possibly in reverse, by terms in the given field with the
   * type of term values explicitly given.
   * @param field  Name of field to sort by.  Can be <code>null</code> if
   *               <code>type</code> is SCORE or DOC.
   * @param type   Type of values in the terms.
   * @param reverse True if natural order should be reversed.
   */
  public SortField(String field, int type, boolean reverse) {
    initFieldType(field, type);
    this.reverse = reverse;
  }

PhraseQuery.java 允许多个项对应同一个位置进行查询,同义词查询
MultiPhraseQuery.java是对PhraseQuery的进一步扩展
  /**
   * Adds a term to the end of the query phrase.
   * The relative position of the term within the phrase is specified explicitly.
   * This allows e.g. phrases with more than one term at the same position
   * or phrases with gaps (e.g. in connection with stopwords).
   * 
   * @param term
   * @param position
   */
  public void add(Term term, int position) {
      if (terms.size() == 0)
          field = term.field();
      else if (term.field() != field)
          throw new IllegalArgumentException("All phrase terms must be in the same field: " + term);

      terms.add(term);
      positions.add(Integer.valueOf(position));
      if (position > maxPosition) maxPosition = position;
  }

也支持slop
  /** Sets the number of other words permitted between words in query phrase.
    If zero, then this is an exact phrase search.  For larger values this works
    like a <code>WITHIN</code> or <code>NEAR</code> operator.

    <p>The slop is in fact an edit-distance, where the units correspond to
    moves of terms in the query phrase out of position.  For example, to switch
    the order of two words requires two moves (the first move places the words
    atop one another), so to permit re-orderings of phrases, the slop must be
    at least two.

    <p>More exact matches are scored higher than sloppier matches, thus search
    results are sorted by exactness.

    <p>The slop is zero by default, requiring exact matches.*/
  public void setSlop(int s) { slop = s; }

实现多个域上的查询MultiFieldQueryParser.java

跨度查询SpanQuery.java,还需要返回相同项的不同位置信息
  在一个域的起点查找跨度SpanFirstQuery.java,在文档的开始end个token查询某个值
   /** Matches spans near the beginning of a field.
 * <p/> 
 * This class is a simple extension of {@link SpanPositionRangeQuery} in that it assumes the
 * start to be zero and only checks the end boundary.
 *
 *
 *  */
public class SpanFirstQuery extends SpanPositionRangeQuery {

  /** Construct a SpanFirstQuery matching spans in <code>match</code> whose end
   * position is less than or equal to <code>end</code>. */
  public SpanFirstQuery(SpanQuery match, int end) {
    super(match, 0, end);
  }

 彼此相邻的跨度SpanNearQuery.java

/** Matches spans which are near one another.  One can specify <i>slop</i>, the
 * maximum number of intervening unmatched positions, as well as whether
 * matches are required to be in-order. */
public class SpanNearQuery extends SpanQuery implements Cloneable {
  protected List<SpanQuery> clauses;
  protected int slop;
  protected boolean inOrder;

  protected String field;
  private boolean collectPayloads;

  /** Construct a SpanNearQuery.  Matches spans matching a span from each
   * clause, with up to <code>slop</code> total unmatched positions between
   * them.  * When <code>inOrder</code> is true, the spans from each clause
   * must be * ordered as in <code>clauses</code>.
   * @param clauses the clauses to find near each other
   * @param slop The slop value
   * @param inOrder true if order is important
   * */
  public SpanNearQuery(SpanQuery[] clauses, int slop, boolean inOrder) {
    this(clauses, slop, inOrder, true);     
  }

排序跨度交替SpanNotQuery.java
/** Removes matches which overlap with another SpanQuery. */
public class SpanNotQuery extends SpanQuery implements Cloneable {
  private SpanQuery include;
  private SpanQuery exclude;

全局跨度查询SpanOrQuery.java
/** Matches the union of its clauses.*/
public class SpanOrQuery extends SpanQuery implements Cloneable {
  private List<SpanQuery> clauses;
  private String field;

  /** Construct a SpanOrQuery merging the provided clauses. */
  public SpanOrQuery(SpanQuery... clauses) {

    // copy clauses array into an ArrayList
    this.clauses = new ArrayList<SpanQuery>(clauses.length);
    for (int i = 0; i < clauses.length; i++) {
      addClause(clauses[i]);
    }
  }

filter过滤器
CachingWrapperFilter.java能够将第一次查询结果缓存起来,后面可重用
QueryWrapperFilter.java可以把查询结果作为接下来的搜索的可用文档集
TermRangeFilter.java对搜索结果进一步进行过滤
/**
 * A Filter that restricts search results to a range of term
 * values in a given field.
 *
 * <p>This filter matches the documents looking for terms that fall into the
 * supplied range according to {@link
 * String#compareTo(String)}, unless a <code>Collator</code> is provided. It is not intended
 * for numerical ranges; use {@link NumericRangeFilter} instead.
 *
 * <p>If you construct a large number of range filters with different ranges but on the 
 * same field, {@link FieldCacheRangeFilter} may have significantly better performance. 
 * @since 2.9
 */
public class TermRangeFilter extends MultiTermQueryWrapperFilter<TermRangeQuery> {

自定义安全过滤器,查询的文档集要在某个用户的数据空间内
ChainedFilter.java过滤器链
FilteredQuery.java

对多个索引的搜索  lucene3.6.0建议使用MultiReader.java
MultiSearcher.java 

多线程搜索  对多个索引进行远程搜索
/** An IndexReader which reads multiple, parallel indexes.  Each index added
 * must have the same number of documents, but typically each contains
 * different fields.  Each document contains the union of the fields of all
 * documents with the same document number.  When searching, matches for a
 * query term are from the first index added that has the field.
 *
 * <p>This is useful, e.g., with collections that have large fields which
 * change rarely and small fields that change more frequently.  The smaller
 * fields may be re-indexed in a new index and both indexes may be searched
 * together.
 *
 * <p><strong>Warning:</strong> It is up to you to make sure all indexes
 * are created and modified the same way. For example, if you add
 * documents to one index, you need to add the same documents in the
 * same order to the other indexes. <em>Failure to do so will result in
 * undefined behavior</em>.
 */
public class ParallelReader extends IndexReader {


项向量 term vector

IndexReader.java
  /**
   * Return an array of term frequency vectors for the specified document.
   * The array contains a vector for each vectorized field in the document.
   * Each vector contains terms and frequencies for all terms in a given vectorized field.
   * If no such fields existed, the method returns null. The term vectors that are
   * returned may either be of type {@link TermFreqVector}
   * or of type {@link TermPositionVector} if
   * positions or offsets have been stored.
   * 
   * @param docNumber document for which term frequency vectors are returned
   * @return array of term frequency vectors. May be null if no term vectors have been
   *  stored for the specified document.
   * @throws IOException if index cannot be accessed
   * @see org.apache.lucene.document.Field.TermVector
   */
  abstract public TermFreqVector[] getTermFreqVectors(int docNumber)
          throws IOException;


通过文档获得域对应的项向量可以计算文档之间的相似度,从而可以进行相似查询或者推荐.
 
分享到:
评论

相关推荐

    lucene-3.6.0

    《深入剖析Lucene 3.6.0:开源搜索引擎的核心技术》 Apache Lucene是一个高性能、全文本搜索库,它提供了完整的搜索引擎功能,包括索引、查询解析、排名等。在本文中,我们将深入探讨Lucene 3.6.0版本的核心特性,...

    lucene-3.6.0.zip

    《Apache Lucene 3.6.0:搜索引擎技术的核心解析》 Apache Lucene是一个高性能、全文本搜索库,被广泛应用于各种搜索引擎的开发中。3.6.0版本是Lucene的一个重要里程碑,它提供了丰富的功能和改进,使得开发者能够...

    lucene 3.6.0 源代码

    lucene-core-3.6.0-sources 绝对可用

    lucene-3.6.0 api 手册

    lucene-3.6.0 api 手册, 最新的 , lucene 是个好东东, 一直在用, 之前还在使用3.1的,发现已经到3.6了, 落后啊

    利用Lucene 实现高级搜索

    ### 利用Lucene实现高级搜索的关键知识点 #### Lucene简介 Lucene是Apache软件基金会下的一个开源全文检索库,提供了高性能的文本搜索能力。它不仅适用于网站的搜索功能,还可以用于任何需要文本搜索的应用场景,如...

    lucene 高级搜索项目

    通过以上分析,我们可以看出这个“Lucene 高级搜索项目”全面覆盖了Lucene的核心技术,从基础的索引创建到复杂的附件搜索和全文搜索,再到插件开发和Solr的使用,为学习和实践Lucene提供了丰富的素材。

    lucene-core-3.6.0.jar

    lucene-core-3.6.0.jar,很好,很实用的一个包

    lucene高级应用

    在信息技术领域,搜索引擎技术是数据检索的重要手段,而Apache Lucene作为开源全文搜索引擎库,以其高效、灵活的特点被广泛应用于各类项目中。本篇文章将深入探讨Lucene的高级应用,结合提供的两个文档《lucene的...

    IK和Lucene

    在压缩包文件名称列表中,"IK和lucene"可能包含了IK Analyzer的相关文件以及不同版本的Lucene库。通常,这些文件会包括 IK Analyzer 的源码、编译后的JAR包、配置文件,以及Lucene的JAR包等。开发者可以利用这些资源...

    lucene高级搜索进阶项目_04

    《Lucene高级搜索进阶项目_04》 在深入探讨Lucene的高级搜索进阶项目时,我们首先需要理解Lucene的核心概念及其在信息检索中的应用。Lucene是一个高性能、全文本搜索库,它提供了丰富的搜索功能,包括布尔运算、...

    lucene-core-3.6.0.jar.zip

    Lucene是apache软件基金会4 jakarta项目组的一个子项目,是一个开放源代码的全文检索引擎工具包,但它不是一个完整的全文检索引擎,而是一个全文检索引擎的架构,提供了完整的查询引擎和索引引擎,部分文本分析引擎...

    lucene.jar包,四包(全)

    总结来说,Lucene 3.6.0版本提供了完整的搜索功能和强大的测试工具,对于Java开发者来说,这是一个高效且可靠的文本搜索解决方案。通过深入理解Lucene Core的组件和Test Framework,开发者能够更好地利用Lucene实现...

    lucene-highlighter-3.6.0-sources

    lucene-highlighter-3.6.0-sources

    lucene高级搜索进阶项目_03

    在本项目"Lucene高级搜索进阶项目_03"中,我们将深入探讨Apache Lucene这一强大的全文搜索引擎库。Lucene是Java开发的开源库,它提供了文本分析、索引和搜索功能,使得开发者能够轻松地在应用程序中实现复杂的搜索...

    lucene jar包

    - lucene-core-3.6.0.jar:这是Lucene 3.6.0的核心库,包含了实现文本搜索所需的基本组件,如索引构建、查询解析和执行等。 - lucene-1.4-final.jar:这是Lucene 1.4版本的库文件,同样包含了搜索功能,但可能没有...

    解密搜索引擎技术实战 LUCENE & JAVA(第3版)PDF

    《解密搜索引擎技术实战 LUCENE & JAVA(第3版)》是一本深入探讨搜索引擎技术的专业书籍,由罗刚撰写。这本书主要聚焦于LUCENE和JAVA这两种技术在搜索引擎开发中的应用,为读者揭示了搜索引擎背后的复杂机制和实现...

    Lucene搜索技术

    【Lucene搜索技术】是一种基于Java的全文索引引擎工具包,它并非一个完整的全文搜索引擎,而是提供了一套用于构建全文检索应用的API。Lucene的主要目标是方便开发者将其嵌入到各种应用程序中,实现对特定数据源的...

    Lucene全文搜索_LuceneJava全文搜索_

    此外,Lucene还提供了近似度评分(Similarity Scoring),根据查询词在文档中的出现频率和位置给出相关性分数,帮助用户找到最相关的搜索结果。 智能查询则涉及到更复杂的查询构造,如前缀查询(Prefix Query)、...

    解密搜索引擎技术实战Lucene&Java精华版(2)

    解密搜索引擎技术实战Lucene&Java精华版(第3版)源码 书名:解密搜索引擎技术实战Lucene&Java精华版(第3版) 作者:罗刚 等编著 出版社:电子工业出版社 关键词:Lucene solr 搜索引擎 Lucene实战 随书源码 本书随...

    Lucene时间区间搜索

    本篇将深入探讨如何在C#中实现Lucene的时间区间查询匹配,以及涉及的相关技术点。 首先,我们需要了解Lucene的基本操作流程,包括索引构建、查询解析和结果检索。在C#中,我们可以使用Apache.Lucene.Net库来操作...

Global site tag (gtag.js) - Google Analytics