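Lucene is an open-source full-text search engine toolkit. Thanks to its powerful search capabilities and its simple, easy-to-use implementation, it has become very popular in China, to the point where any talk of search seemed to begin with Lucene. Last month the Lucene development team released Java Lucene 2.3.1, and I expect many readers are already using it. Introductions to Lucene in China fall into three categories:
The first is the basic getting-started tutorial, typified by 车东's "Lucene:基于Java的全文检索引擎简介" (Lucene: An Introduction to the Java-Based Full-Text Search Engine);
The second covers the principles of Lucene's inverted index, together with Lucene's packages and implementation classes;
The third centers on Chinese word segmentation.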
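Every piece of software, the great ones included, has shortcomings of one kind or another and domains it is best suited to, and Lucene is no exception. Yet I have hardly seen any criticism of the Lucene package in China. Perhaps everyone is busy with projects, and given Lucene's good reputation nobody says a word against it, however serious its flaws might be.
Today, while reading the blog of Cedric Champeau, CTO of Lingway (a company that builds vertical semantic search engines), I came across a post titled "Why lucene isn't that good". Champeau gets straight to the point and lists six shortcomings of Lucene. Since Lingway has been using Lucene for several years, I believe his comments are worth reading.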
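Six reasons not to choose Lucene:
6. Lucene has no built-in support for clustering.
Lucene comes as an embeddable toolkit, and its core code provides no support for clustering. There are three ways to cluster Lucene: (1) write your own implementation of a Directory; (2) use Solr; (3) use Nutch+Hadoop. With Solr you are forced to use its index server, and with Nutch you are forced to take on its integrated crawling module.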
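5. Span queries are very slow.
Span queries were not part of Lucene's original design; support for them was added later. When a term occurs many times in a single document, searches become very slow. The author therefore says that Lucene is a high-performance full-text search engine only as long as you stick to basic boolean queries.
4. The scoring algorithm is not pluggable. Lucene is built throughout around its own tf/idf scoring implementation. Although terms can be boosted and Lucene's Query classes can be extended, customizing complex scoring algorithms remains very limited.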
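3. Lucene is not well designed.
Lucene's OO design is very poor. It has packages and classes, but almost no trace of design patterns. Is this a common failing of C or C++ programmers writing Java?
A. Lucene makes almost no use of interfaces. The Query classes (BooleanQuery, SpanQuery, TermQuery...), for example, all inherit from a superclass.
B. Lucene's iterator implementations are unnatural: there is no hasNext() method, and next() returns a boolean and then refreshes the object's context.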
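2. A closed API design makes extending Lucene very difficult.
See point 3.
1. Lucene's search algorithms are not suited to grid computing.
The English original follows: Moving Lucene a step forward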
6. No built-in support for clustering. If you want to create clusters, either write your own implementation of a Directory, or use Solr, or Nutch+Hadoop. Both Solr and Nutch leverage Lucene, but are not straight replacements. Lucene is embeddable, whereas Solr and Nutch are layers you have to build on. Well, I think it is not very surprising that the Hadoop idea emerged from the Lucene team: Lucene doesn't scale out. Its internals make it (very) fast for most common situations, but for large document sets you have to scale out, and as Lucene does not implement clustering at the core level, you must switch from Lucene to another search engine layer, which is not straightforward. The problem with switching to Solr or Nutch is that you carry things you probably won't need: integrated crawling for Nutch and an indexing server for Solr.
5. Span queries are slow. This may be interpreted as a problem specific to Lingway, where we make intensive use of span queries (NEAR operator: "red NEAR car"), but the Lucene index structure had to be updated to add this feature, which was not planned at first. The underlying implementation leads to complex algorithms that are very slow, especially when some term is repeated many times in a single large document. That's why I tend to say that Lucene is a high-performance text search engine only if you use basic boolean queries.
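To make the NEAR operator concrete, here is a minimal sketch of a position-level proximity check. It assumes each term's positions within one document are available as sorted int arrays. The class NearMatcher, the method near, and the slop parameter are hypothetical names chosen for illustration; this is not Lucene's span query implementation, only the idea behind a query such as "red NEAR car".

```java
// Hypothetical illustration of a NEAR check; not Lucene code.
final class NearMatcher {

    /**
     * Returns true if some position in a and some position in b
     * differ by at most slop. Both arrays must be sorted ascending.
     */
    static boolean near(int[] a, int[] b, int slop) {
        int i = 0;
        int j = 0;
        while (i < a.length && j < b.length) {
            int diff = a[i] - b[j];
            if (Math.abs(diff) <= slop) {
                return true; // found a pair of positions close enough
            }
            // Advance whichever cursor points at the smaller position:
            // that position cannot match any later entry of the other list.
            if (diff < 0) {
                i++;
            } else {
                j++;
            }
        }
        return false;
    }
}
```

With "red" at positions {3, 40} and "car" at positions {10, 42}, a call such as NearMatcher.near(red, car, 2) finds the pair (40, 42). A term that is repeated many times in a large document produces long position lists, so even this simple loop does more work per document, which gives a rough feel for the cost the author describes.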
4. Scoring is not really pluggable. Lucene has its own implementation of a scoring algorithm, where terms may be boosted, and makes use of a Similarity class, but it soon shows limitations when you want to perform complex scoring, for example based on actual matches and metadata about the query. If you want to do so, you'll have to extend the Lucene query classes. The fact is that Lucene has been designed around tf/idf-like scoring algorithms, while in our situation, for linguistics-based scoring, the structure of Lucene's scoring facilities doesn't fit. We've been obliged to override every Lucene query class in order to add support for our custom scoring. And that was a problem:
3. Lucene is not well designed. As a software architect, I would tend to make this reason number 1. Lucene has a very poor OO design. Basically, you have packages and classes, but almost no design pattern usage. It always makes me think of an application written by C(++) developers who discover Java and carry their bad practices with them. This is a serious point: whenever you have to customize Lucene to suit your needs (and you will have to do so), you'll have to face the problem. Here are some examples:
Almost no use of interfaces. Query classes (for example BooleanQuery, SpanQuery, TermQuery...) are all subclasses of an abstract class. If you want to add a feature to those classes, you'll basically want to write an interface which describes the contract for your extension, but as the abstract Query class does not implement an interface, you'll have to constantly cast your custom query objects to a Query in order to be able to use your objects in native Lucene calls. There are tons of examples like this (HitCollector, ...). This is also a problem when you want to use AOP and auto-proxying.
Unnatural iterator implementations. No hasNext() method; next() returns a boolean and refreshes the object context. This is a pain when you want to keep track of iterated elements. I assume this has been done intentionally to reduce the memory footprint, but once again it makes algorithms both unclear and complex.
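The iterator style described here can be bridged to java.util.Iterator with a small look-ahead adapter. The sketch below assumes a hypothetical Cursor interface that mimics the shape the author describes (a next() that returns a boolean and moves the cursor, plus an accessor for the current element). Cursor, CursorIterator, and ArrayCursor are illustrative names, not Lucene classes.

```java
import java.util.Iterator;
import java.util.NoSuchElementException;

// Hypothetical stand-in for a cursor-style enumeration: next() advances and
// reports success, current() reads the element the cursor now points at.
interface Cursor<T> {
    boolean next();
    T current();
}

// Adapts a Cursor to java.util.Iterator by advancing one step ahead in hasNext().
final class CursorIterator<T> implements Iterator<T> {
    private final Cursor<T> cursor;
    private boolean advanced;  // true once the cursor has been moved for the pending element
    private boolean available; // result of that move

    CursorIterator(Cursor<T> cursor) {
        this.cursor = cursor;
    }

    @Override
    public boolean hasNext() {
        if (!advanced) {
            available = cursor.next();
            advanced = true;
        }
        return available;
    }

    @Override
    public T next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        advanced = false; // the pending element is consumed by this call
        return cursor.current();
    }
}

// Simple array-backed Cursor used only to exercise the adapter.
final class ArrayCursor<T> implements Cursor<T> {
    private final T[] items;
    private int index = -1;

    ArrayCursor(T[] items) {
        this.items = items;
    }

    @Override
    public boolean next() {
        return ++index < items.length;
    }

    @Override
    public T current() {
        return items[index];
    }
}
```

Wrapping a cursor this way lets calling code use an ordinary for-each style loop and hold on to the returned elements. The trade-off is that every element must be handed out as its own object, which is presumably the memory cost the cursor style was meant to avoid.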
2. A closed API which makes extending Lucene a pain. In the Lucene world, it is called a feature. The policy is to open classes when some user needs to gain access to some feature. This leads to an API where most classes are package protected, which means you won't ever be able to extend it (unless you create your class in the same package, which is quite dirty for custom code) or you'll have to copy and rewrite the code. Moreover, and it is quite related to the previous point, there's a serious lack of OO design here too: some classes which should have been inner are not, and anonymous classes are used for complex operations where you would typically need to override their behaviour. The reason given for closing the API is that the code has to be cleaned up and made stable before releasing it publicly. While the idea is honourable, once again it is a pain, because if you have some code that does not fit the mainstream Lucene idea, you'll have to constantly backport Lucene improvements to your version until your patch is accepted. However, as the developers want to limit API changes as long as possible, there's little chance that your patch will ever be committed. Add some final modifiers on either classes or methods and you'll face the problem. I don't think the Spring framework would have become so popular if the code had been so locked...
1. Lucene search algorithms are not adapted to grid computing. Lucene was written when hardware did not have that much memory and multicore processors didn't exist. Therefore, the index structure was designed and implemented to perform fast linear searches with a very small memory footprint. I've personally spent lots of hours trying to rewrite span query algorithms so that I could make use of a multithreaded context (for dual/quad core CPUs), but the iterator-based directory reading algorithms make it nearly impossible to do so. In some rare cases you'll be able to optimize things and iterate the index in parallel, but in most situations it will be impossible. In our case, when we have a very complex query with 50+ embedded span queries, the CPU is almost unused while the system is constantly performing I/O, even when using a RAMDirectory.
This article comes from a CSDN blog; please cite the source when reposting: http://blog.csdn.net/eaglet/archive/2009/03/12/3984940.aspx