Original article: http://wiki.apache.org/lucene-java/ImproveIndexingSpeed
How to make indexing faster
Here are some things to try to speed up the indexing speed of your Lucene application. Please see ImproveSearchingSpeed for how to speed up searching.
Be sure you really need to speed things up.
Many of the ideas here are simple to try, but others will necessarily add some complexity to your application. So be sure your indexing speed is indeed too slow and the slowness is indeed within Lucene.
Make sure you are using the latest version of Lucene.
Use a local filesystem.
Remote filesystems are typically quite a bit slower for indexing. If your index needs to live on a remote filesystem, consider building it on the local filesystem first and then copying it up to the remote filesystem.
Get faster hardware, especially a faster IO system.
Open a single writer and re-use it for the duration of your indexing session.
Flush by RAM usage instead of document count.
Call writer.ramSizeInBytes() after every added doc, then call flush() when it's using too much RAM. This is especially good if you have small docs or highly variable doc sizes. You need to first set maxBufferedDocs large enough to prevent the writer from flushing based on document count. However, don't set it too large, otherwise you may hit LUCENE-845. Somewhere around 2-3X your "typical" flush count should be OK.
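As a sketch, the loop looks roughly like this (against the pre-2.3 IndexWriter API, where flush() is public; the 32 MB threshold and 30000-doc cap are illustrative values, not recommendations):

```java
// Sketch against the Lucene 2.2-era IndexWriter API.  The RAM threshold
// and the doc-count cap are illustrative; tune both for your own content.
writer.setMaxBufferedDocs(30000);            // ~2-3X the typical flush count
final long ramThreshold = 32 * 1024 * 1024;  // flush at ~32 MB of buffered docs
for (Document doc : docs) {                  // docs: your own document source
    writer.addDocument(doc);
    if (writer.ramSizeInBytes() > ramThreshold) {
        writer.flush();                      // write buffered docs as a new segment
    }
}
```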
Use as much RAM as you can afford.
More RAM before flushing means Lucene writes larger segments to begin with which means less merging later. Testing in LUCENE-843 found that around 48 MB is the sweet spot for that content set, but, your application could have a different sweet spot.
Turn off compound file format.
Call setUseCompoundFile(false). Building the compound file format takes time during indexing (7-33% in testing for LUCENE-888). However, note that doing this will greatly increase the number of file descriptors used by indexing and by searching, so you could run out of file descriptors if mergeFactor is also large.
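With a writer already open, this is a single call (sketch against the Lucene 2.x API):

```java
// Skip building .cfs compound files: indexing gets faster, but many more
// file descriptors stay in use, so check your ulimit before doing this.
writer.setUseCompoundFile(false);
```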
Re-use Document and Field instances
As of Lucene 2.3 (not yet released) there are new setValue(...) methods that allow you to change the value of a Field. This allows you to re-use a single Field instance across many added documents, which can save substantial GC cost.
It's best to create a single Document instance, then add multiple Field instances to it, but hold onto these Field instances and re-use them by changing their values for each added document. For example you might have an idField, bodyField, nameField, storedField1, etc. After the document is added, you then directly change the Field values (idField.setValue(...), etc), and then re-add your Document instance.
Note that you cannot re-use a single Field instance within a Document, and, you should not change a Field's value until the Document containing that Field has been added to the index. See Field for details.
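Put together, the reuse pattern from the last three paragraphs looks roughly like this (a sketch against the Lucene 2.3 API; the field names and the Record type are examples standing in for your own data):

```java
// Create the Document and Field instances once, up front...
Document doc = new Document();
Field idField   = new Field("id",   "", Field.Store.YES, Field.Index.UN_TOKENIZED);
Field bodyField = new Field("body", "", Field.Store.NO,  Field.Index.TOKENIZED);
doc.add(idField);
doc.add(bodyField);

// ...then, for each source record, swap in new values and re-add the
// same Document.  Record is a hypothetical stand-in for your source type.
for (Record r : records) {
    idField.setValue(r.id);
    bodyField.setValue(r.body);
    writer.addDocument(doc);   // change the values only after this returns
}
```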
Re-use a single Token instance in your analyzer
Analyzers often create a new Token for each term in sequence that needs to be indexed from a Field. You can save substantial GC cost by re-using a single Token instance instead.
Use the char[] API in Token instead of the String API to represent token Text
As of Lucene 2.3 (not yet released), a Token can represent its text as a slice into a char array, which saves the GC cost of new'ing and then reclaiming String instances. By re-using a single Token instance and using the char[] API you can avoid new'ing any objects for each term. See Token for details.
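The class below is a simplified stand-in, not Lucene's actual Token, but it shows the allocation pattern this tip is about: the term buffer is grown rarely and then reused across every token, instead of allocating a new String per term.

```java
// Minimal illustration of the char[] reuse idea (NOT the real Lucene Token).
public class ReusableToken {
    private char[] buf = new char[16];
    private int len;

    // Copy the term text into the internal buffer, growing it only when needed.
    public void setTermBuffer(char[] text, int offset, int length) {
        if (buf.length < length) {
            buf = new char[Math.max(length, buf.length * 2)];  // rare growth
        }
        System.arraycopy(text, offset, buf, 0, length);
        len = length;
    }

    public char[] termBuffer() { return buf; }
    public int termLength() { return len; }
}
```

After warm-up, setTermBuffer allocates nothing at all, which is the GC saving the paragraph above describes.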
Use autoCommit=false when you open your IndexWriter
In Lucene 2.3 (not yet released), there are substantial optimizations for Documents that use stored fields and term vectors, to save merging of these very large index files. You should see the best gains by using autoCommit=false for a single long-running session of IndexWriter. Note however that searchers will not see any of the changes flushed by this IndexWriter until it is closed; if that is important you should stick with autoCommit=true instead or periodically close and re-open the writer.
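A sketch of the intended usage (the exact 2.3 constructor signature is an assumption here, so check the javadocs; dir and analyzer are assumed to be set up elsewhere):

```java
// Sketch: one long-running session with autoCommit=false (Lucene 2.3 API);
// the constructor shown is illustrative.
IndexWriter writer = new IndexWriter(dir, false, analyzer);  // autoCommit=false
// ... add many documents ...
writer.close();  // searchers see none of the changes until this point
```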
Instead of indexing many small text fields, aggregate the text into a single "contents" field and index only that (you can still store the other fields).
Increase mergeFactor, but not too much.
Larger mergeFactors defer merging of segments until later, which speeds up indexing because merging is a large part of indexing. However, this will slow down searching, and you will run out of file descriptors if you make it too large. Values that are too large may even slow down indexing, since merging more segments at once means much more seeking for the hard drives.
Turn off any features you are not in fact using.
If you are storing fields but not using them at query time, don't store them. Likewise for term vectors. If you are indexing many fields, turning off norms for those fields may help performance.
Use a faster analyzer.
Sometimes analysis of a document takes a lot of time. For example, StandardAnalyzer is quite time consuming. If you can get by with a simpler analyzer, try it.
Speed up document construction.
Often the process of retrieving a document from somewhere external (database, filesystem, crawled from a Web site, etc.) is very time consuming.
Don't optimize unless you really need to (for faster searching).
Use multiple threads with one IndexWriter.
Modern hardware is highly concurrent (multi-core CPUs, multi-channel memory architectures, native command queuing in hard drives, etc.) so using more than one thread to add documents can give good gains overall. Even on older machines there is often still concurrency to be gained between IO and CPU. Test the number of threads to find the best performance point.
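IndexWriter is safe to share across threads, so the coordination pattern is simply many producer threads calling addDocument on the one writer. The sketch below uses a hypothetical StubWriter in place of a real IndexWriter so the pattern itself is runnable in isolation:

```java
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class ParallelIndexing {
    // Hypothetical stand-in for IndexWriter: the real class is thread-safe,
    // so real code shares ONE writer across threads exactly like this stub.
    static class StubWriter {
        final AtomicInteger added = new AtomicInteger();
        void addDocument(String doc) { added.incrementAndGet(); }
    }

    public static int indexAll(List<String> docs, int threads) {
        final StubWriter writer = new StubWriter();   // one shared writer
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (final String d : docs) {
            pool.submit(() -> writer.addDocument(d)); // many adder threads
        }
        pool.shutdown();
        try {
            pool.awaitTermination(1, TimeUnit.MINUTES);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return writer.added.get();
    }
}
```

As the text says, benchmark different thread counts: the best value depends on how much of your cost is document construction (CPU) versus writing (IO).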
Index into separate indices then merge.
If you have a very large amount of content to index, you can break it into N "silos", index each silo on a separate machine, then use writer.addIndexesNoOptimize to merge them all into one final index.
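The final merge step looks roughly like this (a sketch against the Lucene 2.3 API; the directory paths are hypothetical examples):

```java
// Sketch: merge per-silo indexes into one final index (Lucene 2.3 API).
IndexWriter writer = new IndexWriter(finalDir, analyzer, true);
Directory[] silos = new Directory[] {
    FSDirectory.getDirectory("/index/silo0"),   // example paths only
    FSDirectory.getDirectory("/index/silo1"),
};
writer.addIndexesNoOptimize(silos);  // merges without forcing an optimize
writer.close();
```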
Run a Java profiler.
If all else fails, profile your application to figure out where the time is going. I've had success with a very simple profiler called JMP. There are many others. Often you will be pleasantly surprised to find some silly, unexpected method is taking far too much time.