`

Lucene: Indexing numbers, dates, and times And Field truncation

 
阅读更多

Although most content is textual in nature, in many cases handling numeric or date/time values is crucial. In a commerce setting, the product’s price, and perhaps other numeric attributes like weight and height, are clearly important. A video search engine may index the duration of each video. Press releases and articles have a time-stamp.

 

Indexing numbers

There are two common scenarios in which indexing numbers is important.

  • In one scenario, numbers are embedded in the text to be indexed, and you want to make sure those numbers are preserved and indexed as their own tokens so that you can use them later as ordinary tokens in searches. To enable this, simply pick an analyzer that doesn’t discard numbers.
  • In the other scenario, you have a field that contains a single number and you want to index it as a numeric value and then use it for precise (equals) matching, rangesearching, and/or sorting.
doc.add(new NumericField("price").setDoubleValue(19.99));

 

Indexing dates and times

 

Such values are easily handled by first converting them to an equivalent int or long value, and then indexing that value as a number. The simplest approach is to use Date.getTime to get the equivalent value, in millisecond precision, for a Java Date object:

doc.add(new NumericField("timestamp")
➥ .setLongValue(new Date().getTime()));

 

doc.add(new NumericField("day")
➥ .setIntValue((int) (new Date().getTime()/24/3600)));

 

Calendar cal = Calendar.getInstance();
cal.setTime(date);
doc.add(new NumericField("dayOfMonth")
➥ .setIntValue(cal.get(Calendar.DAY_OF_MONTH)));

 

------------------------------------------------------------------------------------------------------------------------------------

Field truncation

Some applications index documents whose sizes aren’t known in advance. As a safety mechanism to control the amount of RAM and hard disk space used, you may want to limit the amount of input they are allowed index per field. It’s also possible that a large binary document is accidentally misclassified as a text document, or contains binary content embedded in it that your document filter failed to process, which quickly adds many absurd binary terms to your index, much to your horror. Other applications deal with documents of known size but you’d like to index only a portion of each. For example, you may want to index only the first 200 words of each document.

 

To support these diverse cases, IndexWriter allows you to truncate per-Field indexing so that only the first N terms are indexed for an analyzed field. When you instantiate IndexWriter, you must pass in a MaxFieldLength instance expressing this limit. MaxFieldLength provides two convenient default instances: MaxField-Length.UNLIMITED, which means no truncation will take place, and MaxField-Length.LIMITED, which means fields are truncated at 10,000 terms. You can also instantiate MaxFieldLength with your own limit.

 

分享到:
评论

相关推荐

    Lucene:基于Java的全文检索引擎简介

    Lucene是一个基于Java的全文索引工具包。 1. 基于Java的全文索引引擎Lucene简介:关于作者和Lucene的...5. Hacking Lucene:简化的查询分析器,删除的实现,定制的排序,应用接口的 扩展 6. 从Lucene我们还可以学到什么

    指南-Lucene:ES篇.md

    指南-Lucene:ES篇.md

    lucene-core-7.7.0-API文档-中文版.zip

    Maven坐标:org.apache.lucene:lucene-core:7.7.0; 标签:apache、lucene、core、中文文档、jar包、java; 使用方法:解压翻译后的API文档,用浏览器打开“index.html”文件,即可纵览文档内容。 人性化翻译,文档...

    精品资料(2021-2022收藏)Lucene:基于Java的全文检索引擎简介.doc

    【Lucene:基于Java的全文检索引擎简介】 Lucene是一个由Java编写的开源全文检索引擎工具包,由Doug Cutting创建并贡献给Apache基金会,成为Jakarta项目的一部分。它不是一个独立的全文检索应用,而是提供了一个可...

    IKAnalyzer中文分词支持lucene6.5.0版本

    由于林良益先生在2012之后未对IKAnalyzer进行更新,后续lucene分词接口发生变化,导致不可使用,所以此jar包支持lucene6.0以上版本

    Lucene:基于Java的全文检索引擎简介.rar

    **Lucene:基于Java的全文检索引擎简介** Lucene是一个高度可扩展的、高性能的全文检索库,由Apache软件基金会开发并维护。它是Java开发者在构建搜索引擎应用时的首选工具,因为它提供了完整的索引和搜索功能,同时...

    精品资料(2021-2022收藏)Lucene:基于Java的全文检索引擎简介.docx

    **Lucene:基于Java的全文检索引擎** Lucene是一个由Apache软件基金会的Jakarta项目维护的开源全文检索引擎。它不是一个完整的全文检索应用,而是一个用Java编写的库,允许开发人员轻松地在他们的应用程序中集成...

    lucene:基于Java的全文检索引擎简介

    - **数据模型**:Lucene使用类似于数据库表的数据模型,即文档(Document)由多个字段(Field)组成。文档是索引的基本单位。 - **索引过程**:Lucene通过索引器(Indexer)将文档转换为索引文件。索引文件包含了文档的...

    基于 SSM 框架的二手书交易系统.zip

    快速上手 1. 运行环境 IDE:IntelliJ IDEA 项目构建工具:Maven 数据库:MySQL Tomcat:Tomcat 8.0.47 2. 初始化项目 创建一个名为bookshop的数据库,将bookshop.sql导入 打开IntelliJ IDEA,将项目导入 ...

    精品资料(2021-2022收藏)Lucene:基于Java的全文检索引擎简介22173.doc

    【Lucene:基于Java的全文检索引擎简介】 Lucene是一个由Java编写的全文索引工具包,它不是一个完整的全文检索应用,而是作为一个可嵌入的引擎,为各种应用程序提供全文检索功能。Lucene的设计目标是简化全文检索的...

    lucene:Apache Lucene开源搜索软件

    Lucene: : 用Gradle构建 基本步骤: 安装OpenJDK 11(或更高版本) 从Apache下载Lucene并解压缩 连接到安装的顶层(lucene顶层目录的父目录) 运行gradle 步骤0)设置您的开发环境(OpenJDK 11或更高版本) ...

    lucene 所有jar包 包含IKAnalyzer分词器

    《Lucene分词技术与IKAnalyzer详解》 在信息技术领域,搜索引擎是不可或缺的一部分,而Lucene作为Apache软件基金会的一个开放源代码项目,是Java语言开发的全文检索引擎库,为构建高效、可扩展的信息检索应用提供了...

Global site tag (gtag.js) - Google Analytics