
Lucene: Understanding the indexing process

 



INDEX SEGMENTS

Every Lucene index consists of one or more segments. Each segment is a standalone index, holding a subset of all indexed documents. A new segment is created whenever the writer flushes buffered added documents and pending deletions into the directory. At search time, each segment is visited separately and the results are combined.


Each segment, in turn, consists of multiple files of the form _X.<ext>, where X is the segment's name and <ext> is the extension that identifies which part of the index that file corresponds to. There are separate files to hold the different parts of the index (term vectors, stored fields, inverted index, and so on).

 

There's one special file, referred to as the segments file and named segments_<N>, that references all live segments. This file is important! Lucene first opens this file, and then opens each segment referenced by it. The value <N>, called "the generation," is an integer that increases by one every time a change is committed to the index.

Naturally, over time the index will accumulate many segments, especially if you open and close your writer frequently. This is fine. Periodically, IndexWriter will select segments and coalesce them by merging them into a single new segment and then removing the old segments. The selection of segments to be merged is governed by a separate MergePolicy. Once merges are selected, their execution is done by the MergeScheduler.
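
To make the role of the generation concrete, here is a minimal sketch in plain Java (no Lucene dependency) that picks the newest segments_<N> file out of a directory listing. The class and method names are hypothetical and exist only for illustration. It assumes the generation suffix is written in base 36, which is how Lucene encodes it in file names; Lucene's own logic for locating the current commit is more involved than this.

```java
import java.util.List;

/** Toy helper (not part of Lucene): finds the newest segments_<N> file in a listing. */
final class SegmentsFileSketch {

    static final String PREFIX = "segments_";

    /** Returns the generation encoded in a name such as "segments_a", or -1 if the name is not a segments file. */
    static long generationOf(String fileName) {
        if (!fileName.startsWith(PREFIX) || fileName.length() == PREFIX.length()) {
            return -1;
        }
        try {
            // Assumption: the suffix is the generation written in base 36.
            return Long.parseLong(fileName.substring(PREFIX.length()), Character.MAX_RADIX);
        } catch (NumberFormatException e) {
            return -1;
        }
    }

    /** Returns the segments file with the highest generation, or null if there is none. */
    static String latestSegmentsFile(List<String> fileNames) {
        String latest = null;
        long latestGeneration = -1;
        for (String name : fileNames) {
            long generation = generationOf(name);
            if (generation > latestGeneration) {
                latestGeneration = generation;
                latest = name;
            }
        }
        return latest;
    }
}
```

Given a listing that contains segments_9 and segments_a next to ordinary segment files such as _0.cfs, latestSegmentsFile returns segments_a, because "a" in base 36 is generation 10.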

 

Adding documents to an index

 

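import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

// getWriter() and the parallel arrays ids, unindexed, unstored, and text are defined elsewhere.
IndexWriter writer = getWriter();
for (int i = 0; i < ids.length; i++) {
  Document doc = new Document();

  // Stored and indexed as a single token, so it can be matched exactly.
  doc.add(new Field("id", ids[i], Field.Store.YES, Field.Index.NOT_ANALYZED));
  // Stored only; not searchable.
  doc.add(new Field("country", unindexed[i], Field.Store.YES, Field.Index.NO));
  // Analyzed and searchable, but the original text is not stored.
  doc.add(new Field("contents", unstored[i], Field.Store.NO, Field.Index.ANALYZED));
  // Analyzed, searchable, and stored.
  doc.add(new Field("city", text[i], Field.Store.YES, Field.Index.ANALYZED));

  writer.addDocument(doc);
}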

 

 

Deleting documents from an index

 

writer.deleteDocuments(new Term("id", "1"));

 

 

Updating documents in the index

  • updateDocument(Term, Document)     first deletes all documents containing the provided term and then adds the new document using the writer's default analyzer.
  • updateDocument(Term, Document, Analyzer)     does the same but uses the provided analyzer instead of the writer's default analyzer.
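
The delete-then-add behavior can be illustrated with a toy in-memory model. This is a sketch in plain Java, not Lucene code: the class UpdateSketch and its methods are hypothetical, and documents are reduced to maps from field name to value. What it shows is that an update removes every document matching the term, not just one, before adding the replacement.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Toy model (not Lucene) of an index whose update is "delete by term, then add". */
final class UpdateSketch {

    private final List<Map<String, String>> docs = new ArrayList<>();

    /** Builds a document from alternating field names and values. */
    static Map<String, String> doc(String... fieldsAndValues) {
        Map<String, String> doc = new LinkedHashMap<>();
        for (int i = 0; i + 1 < fieldsAndValues.length; i += 2) {
            doc.put(fieldsAndValues[i], fieldsAndValues[i + 1]);
        }
        return doc;
    }

    void addDocument(Map<String, String> doc) {
        docs.add(new LinkedHashMap<>(doc));
    }

    /** Deletes every document whose field holds exactly the given value; returns how many were removed. */
    int deleteDocuments(String field, String value) {
        int before = docs.size();
        docs.removeIf(d -> value.equals(d.get(field)));
        return before - docs.size();
    }

    /** Update = delete all documents matching the term, then add the new document. */
    void updateDocument(String field, String value, Map<String, String> newDoc) {
        deleteDocuments(field, value);
        addDocument(newDoc);
    }

    int numDocs() {
        return docs.size();
    }

    /** Returns the values of one field across all documents, in insertion order. */
    List<String> values(String field) {
        List<String> result = new ArrayList<>();
        for (Map<String, String> d : docs) {
            if (d.containsKey(field)) {
                result.add(d.get(field));
            }
        }
        return result;
    }
}
```

If two documents share the value "1" in the id field, updating on that term leaves a single replacement document behind. This is why the term passed to an update is usually a unique key, such as the id field in the examples above.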

------------------------------------------------------------------------------------------------------------------------------------

Field options

Field is perhaps the most important class when indexing documents: it’s the actual class that holds each value to be indexed. When you create a field, you can specify numerous options to control what Lucene should do with that field once you add the document to the index.

 

Field options for indexing

The options for indexing (Field.Index.*) control how the text in the field will be made searchable via the inverted index.


  • Index.ANALYZED: Use the analyzer to break the field's value into a stream of separate tokens and make each token searchable. This option is useful for normal text fields (body, title, abstract, etc.).
  • Index.NOT_ANALYZED: Do index the field, but don't analyze the String value. Instead, treat the Field's entire value as a single token and make that token searchable. This option is useful for fields that you'd like to search on but that shouldn't be broken up, such as URLs, file system paths, dates, personal names, Social Security numbers, and telephone numbers. This option is especially useful for enabling "exact match" searching.
  • Index.ANALYZED_NO_NORMS: A variant of Index.ANALYZED that doesn't store norms information in the index. Norms record index-time boost information in the index but can be memory consuming when you're searching.
  • Index.NOT_ANALYZED_NO_NORMS: Just like Index.NOT_ANALYZED, but also doesn't store norms. This option is frequently used to save index space and memory usage during searching, because single-token fields don't need the norms information unless they're boosted.
  • Index.NO: Don't make this field's value available for searching.
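
The difference between ANALYZED and NOT_ANALYZED comes down to which tokens end up in the inverted index. The sketch below imitates that in plain Java. The class IndexOptionSketch is hypothetical, and its "analyzer" (lowercase, split on anything that is not a letter or digit) is a deliberate simplification standing in for a real Lucene analyzer.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

/** Toy illustration (not Lucene) of the tokens produced for analyzed and non-analyzed fields. */
final class IndexOptionSketch {

    /** Simplified stand-in for an analyzer: lowercases and splits on non-alphanumeric characters. */
    static List<String> analyzed(String value) {
        List<String> tokens = new ArrayList<>();
        for (String token : value.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    /** A non-analyzed field contributes its whole value as one token, unchanged. */
    static List<String> notAnalyzed(String value) {
        return Collections.singletonList(value);
    }
}
```

With this toy analyzer, a path such as /usr/local/index becomes the three tokens usr, local, and index when analyzed, so an exact-match search for the full path would find nothing. Left unanalyzed, the path stays a single searchable token.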

Field options for storing fields

The options for stored fields (Field.Store.*) determine whether the field’s exact value should be stored away so that you can later retrieve it during searching:

 

  • Store.YES: Stores the value. When the value is stored, the original String in its entirety is recorded in the index and may be retrieved by an IndexReader. This option is useful for fields that you'd like to use when displaying the search results (such as a URL, title, or database primary key). If index size is a concern, try not to store very large fields, as stored fields consume space in the index.
  • Store.NO: Doesn't store the value. This option is often used along with Index.ANALYZED to index a large text field that doesn't need to be retrieved in its original form, such as bodies of web pages, or any other type of text document.

Field options for term vectors

Sometimes when you index a document you’d like to retrieve all its unique terms at search time.

  • One common use is to speed up highlighting the matched tokens in stored fields.
  • Another use is to enable a link, “Find similar documents,” that when clicked runs a new search using the salient terms in an original document.
  • Yet another example is automatic categorization of documents.

The options for term vectors (Field.TermVector.*) control how much of this information is recorded:

  • TermVector.YES: Records the unique terms that occurred, and their counts, in each document, but doesn't store any positions or offsets information.
  • TermVector.WITH_POSITIONS: Records the unique terms and their counts, and also the positions of each occurrence of every term, but no offsets.
  • TermVector.WITH_OFFSETS: Records the unique terms and their counts, with the offsets (start and end character position) of each occurrence of every term, but no positions.
  • TermVector.WITH_POSITIONS_OFFSETS: Stores unique terms and their counts, along with positions and offsets.
  • TermVector.NO: Doesn't store any term vector information.
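
To show what counts, positions, and offsets mean for a single document, here is a minimal sketch in plain Java. It is not Lucene's term vector implementation: the class TermVectorSketch and its TermInfo type are hypothetical, and tokenization is simplified to lowercasing whitespace-separated words.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Toy term vector (not Lucene): unique terms of one text with counts, positions, and offsets. */
final class TermVectorSketch {

    /** Everything recorded about one unique term. */
    static final class TermInfo {
        int count;                                          // number of occurrences
        final List<Integer> positions = new ArrayList<>();  // token index of each occurrence
        final List<int[]> offsets = new ArrayList<>();      // {start, end} character offsets, end exclusive
    }

    /** Builds the term vector of a text, using whitespace-separated, lowercased tokens. */
    static Map<String, TermInfo> build(String text) {
        Map<String, TermInfo> vector = new TreeMap<>();
        Matcher matcher = Pattern.compile("\\S+").matcher(text);
        int position = 0;
        while (matcher.find()) {
            String term = matcher.group().toLowerCase(Locale.ROOT);
            TermInfo info = vector.computeIfAbsent(term, k -> new TermInfo());
            info.count++;
            info.positions.add(position++);
            info.offsets.add(new int[] {matcher.start(), matcher.end()});
        }
        return vector;
    }
}
```

In terms of the options above, TermVector.YES corresponds to keeping only the terms and count, WITH_POSITIONS adds the positions list, WITH_OFFSETS adds the offsets list instead, and WITH_POSITIONS_OFFSETS keeps both. For the text "to be or not to be", the term "to" has count 2, positions 0 and 4, and character offsets 0 to 2 and 13 to 15.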

 



Field options for sorting

 

When returning documents that match a search, Lucene orders them by their score by default. Sometimes, you need to order results using other criteria. For instance, if you’re searching email messages, you may want to order results by sent or received date, or perhaps by message size or sender.

Fields used for sorting must be indexed and must contain one token per document. Typically this means using Field.Index.NOT_ANALYZED or Field.Index.NOT_ANALYZED_NO_NORMS (if you're not boosting documents or fields). If your analyzer always produces exactly one token, as KeywordAnalyzer does, Field.Index.ANALYZED or Field.Index.ANALYZED_NO_NORMS will work as well.

 

 

 

 

 
