Lucene: Indexing numbers, dates, and times, and field truncation

ylzhj02

Although most content is textual in nature, in many cases handling numeric or date/time values is crucial. In a commerce setting, the product's price, and perhaps other numeric attributes like weight and height, are clearly important. A video search engine may index the duration of each video. Press releases and articles have a timestamp.

Indexing numbers

There are two common scenarios in which indexing numbers is important.

In one scenario, numbers are embedded in the text to be indexed, and you want to make sure those numbers are preserved and indexed as their own tokens so that you can use them later as ordinary tokens in searches. To enable this, simply pick an analyzer that doesn’t discard numbers.
In the other scenario, you have a field that contains a single number, and you want to index it as a numeric value and then use it for precise (equals) matching, range searching, and/or sorting. NumericField handles this case:

doc.add(new NumericField("price").setDoubleValue(19.99));
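For the first scenario above, the choice of analyzer decides whether digits survive tokenization. The following is a minimal sketch in plain Java that uses no Lucene classes. It imitates a letter-only tokenizer (roughly how Lucene's SimpleAnalyzer behaves, which drops digits) and a whitespace tokenizer (roughly how WhitespaceAnalyzer behaves, which keeps them). The class and method names are hypothetical and exist only for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical demo class: it imitates two tokenization strategies and is not a Lucene API.
final class NumberTokenDemo {

    // Letter-only tokenization: every non-letter character ends a token,
    // so numbers embedded in the text are discarded. Tokens are lowercased.
    static List<String> letterTokens(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Whitespace tokenization: tokens are split on whitespace only,
    // so numbers are preserved as their own tokens.
    static List<String> whitespaceTokens(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.trim().split("\\s+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        String text = "Model 42 weighs 7 kg";
        System.out.println(letterTokens(text));     // numbers are gone
        System.out.println(whitespaceTokens(text)); // numbers are kept
    }
}
```

If searches for embedded numbers return nothing, checking the analyzer's token output this way is usually the first thing to try.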

Indexing dates and times

Date and time values are easily handled by first converting them to an equivalent int or long value, and then indexing that value as a number. The simplest approach is to use Date.getTime() to get the equivalent value, in millisecond precision, for a Java Date object:

doc.add(new NumericField("timestamp")
    .setLongValue(new Date().getTime()));

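If you don't need millisecond resolution, quantize the value before indexing it. Because Date.getTime() returns milliseconds, divide by the number of milliseconds in a day to index whole days:

doc.add(new NumericField("day")
    .setIntValue((int) (new Date().getTime() / (24L * 3600 * 1000))));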

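To index a calendar field such as the day of the month, read it from a Calendar instance (date here is an existing java.util.Date):

Calendar cal = Calendar.getInstance();
cal.setTime(date);
doc.add(new NumericField("dayOfMonth")
    .setIntValue(cal.get(Calendar.DAY_OF_MONTH)));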
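The arithmetic behind quantizing a timestamp is easy to get wrong, so here is a small sketch in plain Java with no Lucene dependency that computes the values one might pass to setIntValue. The class name DateQuantizer and its methods are hypothetical, and fixing the time zone to UTC is an assumption made so the results do not depend on the machine's default zone.

```java
import java.util.Calendar;
import java.util.TimeZone;

// Hypothetical helper: computes quantized date values suitable for a numeric field.
final class DateQuantizer {

    // Date.getTime() is in milliseconds, so one day is 24 * 60 * 60 * 1000 ms.
    static final long MILLIS_PER_DAY = 24L * 60 * 60 * 1000;

    // Whole days since 1970-01-01 UTC. floorDiv keeps pre-1970 timestamps
    // on the correct (earlier) day instead of rounding toward zero.
    static int epochDay(long epochMillis) {
        return (int) Math.floorDiv(epochMillis, MILLIS_PER_DAY);
    }

    // Day of the month (1..31) of the timestamp, evaluated in UTC (an assumption).
    static int dayOfMonthUtc(long epochMillis) {
        Calendar cal = Calendar.getInstance(TimeZone.getTimeZone("UTC"));
        cal.setTimeInMillis(epochMillis);
        return cal.get(Calendar.DAY_OF_MONTH);
    }

    public static void main(String[] args) {
        long now = System.currentTimeMillis();
        System.out.println("epoch day:    " + epochDay(now));
        System.out.println("day of month: " + dayOfMonthUtc(now));
    }
}
```

Whichever time zone is chosen for quantizing, the same one has to be used when building range queries, or documents near midnight will land in the wrong bucket.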

------------------------------------------------------------------------------------------------------------------------------------

Field truncation

Some applications index documents whose sizes aren't known in advance. As a safety mechanism to control the amount of RAM and hard disk space used, you may want to limit the amount of input they are allowed to index per field. It's also possible that a large binary document is accidentally misclassified as a text document, or contains embedded binary content that your document filter failed to process, which quickly adds many absurd binary terms to your index. Other applications deal with documents of known size, but you'd like to index only a portion of each. For example, you may want to index only the first 200 words of each document.

To support these diverse cases, IndexWriter allows you to truncate per-Field indexing so that only the first N terms are indexed for an analyzed field. When you instantiate IndexWriter, you must pass in a MaxFieldLength instance expressing this limit. MaxFieldLength provides two convenient default instances: MaxFieldLength.UNLIMITED, which means no truncation will take place, and MaxFieldLength.LIMITED, which means fields are truncated at 10,000 terms. You can also instantiate MaxFieldLength with your own limit.
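To make the effect of the limit concrete, the sketch below imitates truncation in plain Java, without Lucene: it splits a field's text on whitespace and keeps only the first N terms, discarding the rest. The class and method names are hypothetical, and whitespace splitting stands in for whatever analyzer the field really uses. In the Lucene 3.0 API the limit itself would be supplied to the IndexWriter constructor, for example as new IndexWriter.MaxFieldLength(200).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class: it imitates per-field truncation and is not a Lucene API.
final class FieldTruncationDemo {

    // Returns at most maxFieldLength terms from the start of the text.
    // Terms beyond the limit are silently dropped, as with a truncated field.
    static List<String> firstTerms(String text, int maxFieldLength) {
        List<String> kept = new ArrayList<>();
        for (String term : text.trim().split("\\s+")) {
            if (kept.size() >= maxFieldLength) {
                break; // limit reached: ignore the remainder of the field
            }
            if (!term.isEmpty()) {
                kept.add(term);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        String body = "only the first few words of this field are indexed";
        System.out.println(firstTerms(body, 4));
    }
}
```

One consequence worth keeping in mind is that a term appearing only after the cutoff is never indexed, so searches for it will not match the document.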

2014-07-04 10:28