Lucene in a cluster
Lucene is a highly optimized inverted-index search engine. It stores a number of inverted indexes in a custom file format that is designed so that searchers can load the indexes quickly and search them efficiently. These structures are created so that they are almost completely pre-computed.
To store the index, Lucene uses an implementation of a 'Directory' interface, not to be confused with anything in java.io. The standard implementation is FSDirectory, which stores the search index on a file system. There are a number of other implementations that can be used, including ones that split the index on the file system into smaller chunks and ones that distribute the index throughout a cluster using Map Reduce (see Google). There is additionally a database implementation that stores the index as blocks in a database.
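The idea of a pluggable index store can be sketched in a few lines of Python. This is only an illustration of the pattern, not Lucene's Directory API: the names IndexStore, FileStore, MemoryStore and copy_index are hypothetical, and the file names are made up for the example.

```python
import tempfile
from abc import ABC, abstractmethod
from pathlib import Path


class IndexStore(ABC):
    """Hypothetical storage abstraction; NOT Lucene's Directory API."""

    @abstractmethod
    def write(self, name, data): ...

    @abstractmethod
    def read(self, name): ...

    @abstractmethod
    def names(self): ...


class FileStore(IndexStore):
    """Keeps each index file as a file under one directory."""

    def __init__(self, root):
        self.root = Path(root)
        self.root.mkdir(parents=True, exist_ok=True)

    def write(self, name, data):
        (self.root / name).write_bytes(data)

    def read(self, name):
        return (self.root / name).read_bytes()

    def names(self):
        return sorted(p.name for p in self.root.iterdir() if p.is_file())


class MemoryStore(IndexStore):
    """Keeps each index file as bytes in a dictionary."""

    def __init__(self):
        self.files = {}

    def write(self, name, data):
        self.files[name] = bytes(data)

    def read(self, name):
        return self.files[name]

    def names(self):
        return sorted(self.files)


def copy_index(src, dst):
    """Code written against the abstraction works with any store."""
    for name in src.names():
        dst.write(name, src.read(name))


# Made-up file names and contents, for demonstration only.
disk = FileStore(tempfile.mkdtemp())
disk.write("_0.cfs", b"postings")
disk.write("segments", b"gen1")
memory = MemoryStore()
copy_index(disk, memory)
```

Because the search code only sees the abstraction, swapping a file-based store for a database-backed one changes where the bytes live but not how the index is used.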
Lucene derives its speed from this index structure, and to work really well it needs to seek efficiently into the blocks of the segments that make up the index. This is trivial where the underlying storage mechanism supports seek, but less trivial where it does not. FSDirectory is based on files and is efficient in this area. If the files are on a local file system, pure seeks can be used. If the index is on a shared file system, there will always be some latency and potentially increased IO traffic. The database implementation is highly dependent on the blob implementation in the target database and will nearly always be slower than FSDirectory. Some databases support seekable blobs (Oracle), some emulate this behavior (MySQL with emulateLocators=true), and others do not support it at all and so are really slow (and I mean really slow).
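The cost of missing seek support can be made concrete with a minimal Python sketch. It is not Lucene code: the two functions, the in-memory "segment", and the byte counting are assumptions made purely to illustrate the difference between seeking to a block and streaming up to it.

```python
import io


def read_with_seek(store, offset, length):
    """Read `length` bytes at `offset` from a seekable store.

    Returns (data, bytes_transferred).
    """
    store.seek(offset)
    data = store.read(length)
    return data, len(data)


def read_without_seek(store, offset, length):
    """Emulate a store that cannot seek: stream everything up to the end
    of the requested range, then discard the unwanted prefix.

    Returns (data, bytes_transferred).
    """
    store.seek(0)  # a non-seekable source always starts at the beginning
    streamed = store.read(offset + length)
    return streamed[offset:offset + length], len(streamed)


# Hypothetical 1 MiB "segment file" held in memory for the demonstration.
segment = io.BytesIO(bytes(range(256)) * 4096)

wanted_offset, wanted_length = 900_000, 512
fast, fast_cost = read_with_seek(segment, wanted_offset, wanted_length)
slow, slow_cost = read_without_seek(segment, wanted_offset, wanted_length)
```

Both calls return the same 512 bytes, but the second one has to move 900,512 bytes to get them. Over a network link to a database that gap is what makes non-seekable blobs so slow.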
All of this impacts how Lucene works in a cluster. Each node performing the search needs access to the index, so to make search work in a clustered environment we must provide that access. There are five ways of doing this.
1) Use a shared file system between all nodes, and use FSDirectory.
2) Use indexes on each node's local file system and a synchronization strategy.
3) Use a database through JDBCDirectory, so that the index stored in the database is shared between the nodes.
4) Use a distributed file system (eg Google File System, Nutch Distributed File System).
5) Use a local cache with a backup in the database.
Shared filesystem
There are a number of issues with a shared file system. Performance is lower than with a local file system unless a SAN is used. A SAN shared file system, however, must be a true SAN file system (eg Redhat Global File System, Apple XSan), because modifications to the file system blocks must be mirrored instantly in the block cache of all connected nodes; otherwise the nodes will see a corrupted file system. Remember that a SAN is just a networked block device that, without additional help, cannot be shared by multiple compute nodes at the same time. Provided the performance of the shared file system is sufficient, Lucene works well in this setup, unmodified, using the FSDirectory implementation. The lock management implemented in the Sakai Search component eliminates the locking problems reported by the Lucene community.
This mechanism is available now in Sakai Search.
Synchronized Local indexes.
Where the cluster has a shared-nothing architecture, the Lucene indexes can be written to local disk and synchronized at the end of each index cycle. This is an optimal deployment of Lucene in a cluster, as it ensures that all the IO comes from the local disk and is hence fast. To ensure that there is always a backup copy of the index, the synchronization would also target a backup location.
The difficulty with this approach is that, without support in the implementation of the search engine, it requires some deployment support. This may include making hard-link mirrors to speed up the synchronization process. Lucene indexes are suitable for synchronizing with rsync, which is a block-based synchronization mechanism.
The main drawback of this approach is that the full index is present on every local machine. In large search environments this duplication will be wasteful; in search engine terms, however, a single deployment of Sakai will probably never get into the large space (large > 100M documents, 2TB index).
This mechanism is available, but requires local configuration.
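The synchronization step at the end of an index cycle can be sketched as follows. The text names rsync for this job; the sketch swaps in a much simpler whole-file comparison in Python so that it stays self-contained. The function name sync_index, the directory names, and the file names are all hypothetical.

```python
import filecmp
import shutil
import tempfile
from pathlib import Path


def sync_index(source, targets):
    """Copy new or changed files from `source` to every target directory
    and delete target files that no longer exist in `source`.

    Returns the number of files copied.
    """
    source = Path(source)
    copied = 0
    for target in map(Path, targets):
        target.mkdir(parents=True, exist_ok=True)
        source_names = {p.name for p in source.iterdir() if p.is_file()}
        for name in sorted(source_names):
            src, dst = source / name, target / name
            # shallow=False compares contents, not just size and mtime
            if not dst.exists() or not filecmp.cmp(src, dst, shallow=False):
                shutil.copy2(src, dst)
                copied += 1
        for stale in list(target.iterdir()):
            if stale.is_file() and stale.name not in source_names:
                stale.unlink()
    return copied


# Demonstration with made-up segment files in a temporary directory.
root = Path(tempfile.mkdtemp())
indexer = root / "node-a"
indexer.mkdir()
(indexer / "_0.cfs").write_bytes(b"segment zero")
(indexer / "segments").write_bytes(b"gen1")

peer, backup = root / "node-b", root / "backup"
first_run = sync_index(indexer, [peer, backup])   # everything is new
second_run = sync_index(indexer, [peer, backup])  # nothing changed

(indexer / "_1.cfs").write_bytes(b"segment one")
(indexer / "segments").write_bytes(b"generation2")
third_run = sync_index(indexer, [peer, backup])   # only the delta moves
```

Passing the backup location as one more target covers the requirement that a backup copy always exists. Only files that are new or different are copied on later runs, which is why this kind of synchronization stays cheap when most segment files have not changed.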
Database hosted search index.
Where a simple cluster setup is required, a database hosted search index is a straightforward option. There are, however, significant drawbacks with this approach, the most notable being the drop in performance. The index is stored as blocks in blobs inside the database. These blobs are stored in a block structure to eliminate most of the unnecessary loading; however, each blob bypasses any disk block cache on the local machine and has to be streamed over the network. If the database supports seekable blobs within the database itself, it is possible to minimize unnecessary network traffic. Oracle has this support. Where the database only emulates this behavior (MySQL), performance is poor because the complete blob needs to be streamed over the network. In addition, access is slower because a SQL statement has to be executed for each data access.
The net result is slower performance.
This mechanism is available, but performance is probably unacceptable.
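The block structure described above can be illustrated with a small sketch that uses Python's built-in sqlite3 module as a stand-in database. It does not reproduce the JDBCDirectory schema: the table name, the column names, and the 4096-byte block size are assumptions chosen for the example.

```python
import sqlite3

BLOCK_SIZE = 4096  # assumed block size; the text does not state one


def store_file(conn, name, data):
    """Split `data` into fixed-size blocks and store one row per block."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS index_blocks ("
        "name TEXT, block_no INTEGER, data BLOB, "
        "PRIMARY KEY (name, block_no))"
    )
    for block_no, start in enumerate(range(0, len(data), BLOCK_SIZE)):
        conn.execute(
            "INSERT OR REPLACE INTO index_blocks VALUES (?, ?, ?)",
            (name, block_no, data[start:start + BLOCK_SIZE]),
        )
    conn.commit()


def read_range(conn, name, offset, length):
    """Read `length` bytes at `offset`, fetching only the blocks that
    overlap the requested range.

    Returns (data, blocks_fetched).
    """
    if length <= 0:
        return b"", 0
    first = offset // BLOCK_SIZE
    last = (offset + length - 1) // BLOCK_SIZE
    rows = conn.execute(
        "SELECT data FROM index_blocks "
        "WHERE name = ? AND block_no BETWEEN ? AND ? ORDER BY block_no",
        (name, first, last),
    ).fetchall()
    joined = b"".join(row[0] for row in rows)
    skip = offset - first * BLOCK_SIZE
    return joined[skip:skip + length], len(rows)


# Made-up 100,000-byte index file stored as 25 blocks.
conn = sqlite3.connect(":memory:")
payload = bytes(i % 251 for i in range(100_000))
store_file(conn, "_0.cfs", payload)
chunk, fetched = read_range(conn, "_0.cfs", 50_000, 6_000)
```

Here a 6,000-byte read touches two of the 25 blocks instead of the whole file, which is the saving the block structure buys. Every read still costs a SQL statement and a network round trip against a real database, so the approach remains slower than a local file.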
Distributed File System
Real search engines use a distributed file system that provides a self-healing file system, where the data itself is distributed across multiple nodes in such a way that the file system can recover from the loss of one or more nodes. The original file system of this form is the Google File System, and the Nutch Distributed File System is modeled on it. Both implementations use a gather-scatter algorithm detailed by Google in Map-Reduce (see Google labs).
This approach results in every node containing a part of the file system. Where the index has grown to such an extent that storing the complete index on every node in the cluster becomes impractical, this approach becomes more attractive.
At the moment there are no plans to provide an implementation of a distributed file system within Sakai.
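The self-healing property can be sketched with a toy placement table. This is not how the Google File System or the Nutch Distributed File System is implemented; the round-robin placement, the replica count of two, and the function names are assumptions that only show why losing one node need not lose data.

```python
def place_blocks(block_ids, nodes, replicas=2):
    """Assign each block to `replicas` distinct nodes, round-robin."""
    placement = {}
    for i, block in enumerate(block_ids):
        placement[block] = [nodes[(i + r) % len(nodes)] for r in range(replicas)]
    return placement


def heal(placement, nodes, failed, replicas=2):
    """Remove the failed node and re-replicate every block that dropped
    below the replica count, using the surviving nodes."""
    survivors = [n for n in nodes if n != failed]
    healed = {}
    for block, holders in placement.items():
        kept = [n for n in holders if n != failed]
        if not kept:
            raise RuntimeError(f"block {block} lost: no surviving replica")
        for candidate in survivors:
            if len(kept) >= replicas:
                break
            if candidate not in kept:
                kept.append(candidate)
        healed[block] = kept
    return healed


# Hypothetical three-node cluster holding four blocks of the index.
nodes = ["node1", "node2", "node3"]
placement = place_blocks(["b0", "b1", "b2", "b3"], nodes)
healed = heal(placement, nodes, failed="node2")
```

Each node holds only part of the data, yet every block survives the loss of node2 because a second copy exists elsewhere and is copied again to restore the replica count.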
Database Clustered Local Search
In this approach, indexes are used from local disk but backed up to the database as Lucene segments. When a cluster app node is installed, it synchronizes its local copy of the search index with the database. When new content is added by one of the cluster app nodes, that node updates the backup copy in the database. On receipt of an index reload event, all cluster app nodes resynchronize with the database, downloading changed and new search segments.
This mechanism is in the process of being tested. It exhibits the same performance as a local search for a 200MB index with 80,000 documents.
Once this mechanism is completely tested it will become the default out-of-the-box (OOTB) mechanism, as it works whether there is a single cluster node or more than one. The added advantage of this mechanism is that the index is stored in the database.
It will also be possible to implement this mechanism with a shared filestore acting as the backup location.
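The publish and resynchronize cycle can be sketched with Python's built-in sqlite3 module standing in for the shared database. This is not the Sakai Search implementation: the table layout, the use of a SHA-256 digest to detect changed segments, and the function names are assumptions, and a dictionary stands in for the node's local disk.

```python
import hashlib
import sqlite3


def open_backup():
    """Create the hypothetical backup table in an in-memory database."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE segments (name TEXT PRIMARY KEY, digest TEXT, data BLOB)"
    )
    return conn


def publish(conn, name, data):
    """Indexing node: save a new or changed segment to the backup copy."""
    digest = hashlib.sha256(data).hexdigest()
    conn.execute(
        "INSERT OR REPLACE INTO segments VALUES (?, ?, ?)", (name, digest, data)
    )
    conn.commit()


def resync(conn, local):
    """Search node, on an index reload event: download only segments that
    are new or changed, and drop local ones the database no longer has.

    `local` maps segment name -> bytes and stands in for the local disk.
    Returns the names of the segments that were downloaded.
    """
    remote = dict(conn.execute("SELECT name, digest FROM segments"))
    downloaded = []
    for name, digest in sorted(remote.items()):
        have = local.get(name)
        if have is None or hashlib.sha256(have).hexdigest() != digest:
            row = conn.execute(
                "SELECT data FROM segments WHERE name = ?", (name,)
            ).fetchone()
            local[name] = row[0]
            downloaded.append(name)
    for name in set(local) - set(remote):
        del local[name]
    return downloaded


# Made-up segment names and contents.
db = open_backup()
publish(db, "seg-1", b"first batch")
node_b = {}
first = resync(db, node_b)    # a freshly installed node pulls everything
publish(db, "seg-2", b"second batch")
second = resync(db, node_b)   # a reload event pulls only the new segment
```

Searches always run against the local copy, so the database is touched only when segments change. That is consistent with the observation above that performance matches a purely local index.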