Understanding Lucene

xiao_yi

Tags: Lucene, full-text search, data structures, SQL, Apache

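Lucene is an easy-to-learn full-text indexing and search toolkit written in pure Java.

Lucene's author is a veteran full-text indexing and retrieval expert. The project was first published on his personal homepage and was contributed to Apache in October 2001, where it became a subproject of the Apache Jakarta project.
Today Lucene is widely used in projects that need full-text indexing and search.
Lucene has also been ported to C#, and that port has grown into Lucene.Net (although there have recently been reports that the project may be abandoned).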

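How Lucene works

Lucene's retrieval algorithm is index-based, which means it trades space for time. It builds a full-text index over the files or character streams to be searched. At query time it searches the index rather than the raw data, which quickly yields the match locations; each location records the file path or the keyword where the search term occurs.
In projects that already use a database, the main reason not to search through the database itself is this: for inexact queries the database relies on the query syntax LIKE "%keyword%", which walks every record and pattern-matches the field against "%keyword%". When the table is huge, or a single field stores a large amount of data, this full scan is fatal for performance because every record has to be matched. Lucene is therefore best suited to full-text search over document collections and to fuzzy search over very large databases, especially over XML or large character-type columns.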
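The difference between an index lookup and a LIKE-style scan can be sketched in a few lines of plain Java. This is only a toy inverted index, not Lucene's implementation: the class name `InvertedIndexSketch`, its methods, and the naive tokenizer (split on anything that is not a letter or digit) are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Toy inverted index (hypothetical class, not part of Lucene).
final class InvertedIndexSketch {
    // term -> sorted ids of the documents that contain the term
    private final Map<String, TreeSet<Integer>> postings = new HashMap<>();
    private final List<String> docs = new ArrayList<>();

    /** Adds a document, indexes its terms, and returns the document id. */
    int add(String text) {
        int id = docs.size();
        docs.add(text);
        // Naive tokenizer: lower-case, split on non-letter/non-digit runs.
        for (String term : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!term.isEmpty()) {
                postings.computeIfAbsent(term, k -> new TreeSet<>()).add(id);
            }
        }
        return id;
    }

    /** Index lookup: a single hash probe, no matter how many documents exist. */
    Set<Integer> search(String term) {
        return postings.getOrDefault(term.toLowerCase(Locale.ROOT), new TreeSet<>());
    }

    /** Rough equivalent of LIKE "%keyword%": every document is scanned. */
    Set<Integer> scan(String keyword) {
        String needle = keyword.toLowerCase(Locale.ROOT);
        Set<Integer> hits = new TreeSet<>();
        for (int id = 0; id < docs.size(); id++) {
            if (docs.get(id).toLowerCase(Locale.ROOT).contains(needle)) {
                hits.add(id);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        InvertedIndexSketch idx = new InvertedIndexSketch();
        idx.add("Lucene is a full-text search library");
        idx.add("A database scans every record");
        idx.add("Search the index, not the records");
        System.out.println("index lookup 'search': " + idx.search("search"));
        System.out.println("full scan 'record':    " + idx.scan("record"));
    }
}
```

The extra memory held by `postings` is the "space" being traded for lookup time. Note also that the two methods do not answer the same question: `search` matches whole terms, while `scan` matches arbitrary substrings, so the results can differ for the same input.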

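How full-text search is implemented

Lucene's API is designed to be fairly generic. Its input and output structures closely resemble a database's table ==> record ==> field hierarchy, so the files, databases, and other data of many traditional applications map easily onto Lucene's storage structures and interfaces. Broadly speaking, you can start by treating Lucene as a database system that supports full-text indexing.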

Here is how Lucene compares with a database:

	Lucene	Database
Index source	doc(field1, field2, ...), doc(field1, field2, ...)	record(field1, field2, ...), record(field1, ...)
Write path	indexer ==> Lucene Index	SQL: insert ==> DB Index
Read path	Lucene Index ==> searcher	DB Index ==> SQL: select
Output	Hits(doc(field1, field2), doc(field1, ...))	results(record(field1, field2, ...), record(field1, ...))
Unit	Document: one "unit" to be indexed, made up of multiple fields	Record: a row containing multiple fields
Field	Field	Field
Result set	Hits: the query result set, made up of the matching Documents	RecordSet: the query result set, made up of multiple Records

Full-text search ≠ LIKE "%keyword%"

Optimizing the search process

The following is the original text from http://www.onjava.com/lpt/a/3273:

For instance, if we set mergeFactor to 10, a new segment will be created on the disk for every 10 documents added to the index. When the 10th segment of size 10 is added, all 10 will be merged into a single segment of size 100. When 10 such segments of size 100 have been added, they will be merged into a single segment containing 1000 documents, and so on. Therefore, at any time, there will be no more than 9 segments in each power of 10 index size.

The exception noted earlier has to do with another IndexWriter instance variable: maxMergeDocs. While merging segments, Lucene will ensure that no segment with more than maxMergeDocs is created. For instance, if we set maxMergeDocs to 1000, when we add the 10,000th document, instead of merging multiple segments into a single segment of size 10,000, Lucene will create a 10th segment of size 1000, and keep adding segments of size 1000 for every 1000 documents added.

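In other words, with mergeFactor set to 10, a new segment is created each time 10 Document objects have been added to the index. Once there are 10 such segments (100 documents in total), those 10 segments are merged into a single new segment, and the same thing happens again at each larger size. We can also set maxMergeDocs to cap how large a segment may grow. With maxMergeDocs set to 1000, no merge is allowed to produce a segment of more than 1000 documents, so segments that have reached 1000 documents are left as they are instead of being merged into a bigger one.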
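The merge behavior described in the quoted passage can be checked with a small simulation. This is a toy model of that description only, not Lucene's merge code: the class `SegmentMergeSketch`, the method `simulate`, and the rule "merge the last mergeFactor segments when they all have the same size" are assumptions made for illustration. Documents that are still buffered and not yet flushed are not counted.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of the merge rule in the quoted article (hypothetical class, not Lucene code).
final class SegmentMergeSketch {

    /** Returns the sizes of the on-disk segments after adding docCount documents one by one. */
    static List<Long> simulate(int docCount, int mergeFactor, long maxMergeDocs) {
        if (mergeFactor < 2) {
            throw new IllegalArgumentException("mergeFactor must be at least 2");
        }
        List<Long> segments = new ArrayList<>();
        int buffered = 0;
        for (int i = 0; i < docCount; i++) {
            buffered++;
            if (buffered == mergeFactor) {
                // Every mergeFactor documents, a new segment is written to disk.
                segments.add((long) mergeFactor);
                buffered = 0;
                mergeTail(segments, mergeFactor, maxMergeDocs);
            }
        }
        return segments;
    }

    /** Merges the last mergeFactor segments while they are equal in size and the result fits the cap. */
    private static void mergeTail(List<Long> segments, int mergeFactor, long maxMergeDocs) {
        while (segments.size() >= mergeFactor) {
            int from = segments.size() - mergeFactor;
            long size = segments.get(from).longValue();
            for (int j = from + 1; j < segments.size(); j++) {
                if (segments.get(j).longValue() != size) {
                    return; // the tail is not a full run of same-sized segments
                }
            }
            long merged = size * mergeFactor;
            if (merged > maxMergeDocs) {
                return; // the merged segment would exceed maxMergeDocs
            }
            segments.subList(from, segments.size()).clear();
            segments.add(merged);
        }
    }

    public static void main(String[] args) {
        System.out.println("no cap, 1000 docs:     " + simulate(1000, 10, Long.MAX_VALUE));
        System.out.println("cap 1000, 10000 docs:  " + simulate(10000, 10, 1000));
    }
}
```

With mergeFactor 10 and no cap, 1000 documents collapse into one segment. With maxMergeDocs set to 1000, 10,000 documents stay as ten segments of 1000 each, which matches the example in the article.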