turingfellow

How to make indexing faster

Here are some things to try to improve the indexing speed of your Lucene application. See ImproveSearchingSpeed for how to speed up searching.

Be sure you really need to speed things up. Many of the ideas here are simple to try, but others will necessarily add some complexity to your application. So be sure your indexing speed is indeed too slow and the slowness is indeed within Lucene.

Make sure you are using the latest version of Lucene.

Use a local filesystem. Remote filesystems are typically quite a bit slower for indexing. If your index needs to be on a remote filesystem, consider building it first on the local filesystem and then copying it up to the remote filesystem.

Get faster hardware, especially a faster IO system. If possible, use a solid-state disk (SSD). These devices have come down substantially in price, and their much lower seek cost can give a very sizable speedup in cases where the index cannot fit entirely in the OS's IO cache.

Open a single writer and re-use it for the duration of your indexing session.

Flush by RAM usage instead of document count.

For Lucene <= 2.2: call writer.ramSizeInBytes() after every added doc, then call flush() when it's using too much RAM. This is especially good if you have small docs or highly variable doc sizes. You need to first set maxBufferedDocs large enough to prevent the writer from flushing based on document count. However, don't set it too large, or you may hit LUCENE-845. Somewhere around 2-3X your "typical" flush count should be OK.

For Lucene >= 2.3: IndexWriter can flush according to RAM usage itself. Call writer.setRAMBufferSizeMB() to set the buffer size. Be sure you don't also have any leftover calls to setMaxBufferedDocs since the writer will flush "either or" (whichever comes first).
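The difference between flushing by document count and flushing by RAM usage can be illustrated with a small model that does not use Lucene at all. `RamFlushBuffer`, `docOfSize`, and the 1000-byte threshold are hypothetical names and values chosen for this sketch; the byte count is a rough stand-in for what writer.ramSizeInBytes() would report.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for a writer's in-memory buffer; this is not a Lucene class.
class RamFlushBuffer {
    private final long maxBytes;                               // flush threshold in bytes
    private final List<String> pending = new ArrayList<>();    // buffered documents
    private final List<Integer> flushedDocCounts = new ArrayList<>();
    private long usedBytes = 0;                                // rough size of the buffer

    RamFlushBuffer(long maxBytes) {
        this.maxBytes = maxBytes;
    }

    // Buffer one document, then flush if the buffer has reached the threshold.
    void add(String doc) {
        pending.add(doc);
        usedBytes += doc.getBytes(StandardCharsets.UTF_8).length;
        if (usedBytes >= maxBytes) {
            flush();
        }
    }

    // Record the flushed "segment" (only its document count here) and reset the buffer.
    void flush() {
        if (pending.isEmpty()) {
            return;
        }
        flushedDocCounts.add(pending.size());
        pending.clear();
        usedBytes = 0;
    }

    List<Integer> flushedDocCounts() {
        return flushedDocCounts;
    }

    // Build a dummy ASCII document of exactly n bytes.
    static String docOfSize(int n) {
        char[] chars = new char[n];
        Arrays.fill(chars, 'x');
        return new String(chars);
    }

    public static void main(String[] args) {
        RamFlushBuffer buffer = new RamFlushBuffer(1000);
        for (int i = 0; i < 20; i++) {
            buffer.add(docOfSize(100));    // 20 small documents
        }
        for (int i = 0; i < 4; i++) {
            buffer.add(docOfSize(500));    // 4 large documents
        }
        buffer.flush();                    // flush whatever is left at the end
        System.out.println(buffer.flushedDocCounts());
    }
}
```

In this model every flush holds about the same number of bytes, whether that is ten small documents or two large ones. A fixed document count would instead produce flushes of very different sizes when document sizes vary.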

Use as much RAM as you can afford.

More RAM before flushing means Lucene writes larger segments to begin with, which means less merging later. Testing in LUCENE-843 found that around 48 MB is the sweet spot for that content set, but your application could have a different sweet spot.

Turn off compound file format.

Call setUseCompoundFile(false). Building the compound file format takes time during indexing (7-33% in testing for LUCENE-888). However, note that doing this will greatly increase the number of file descriptors used by indexing and by searching, so you could run out of file descriptors if mergeFactor is also large.

Re-use Document and Field instances

As of Lucene 2.3 there are new setValue(...) methods that allow you to change the value of a Field. This allows you to re-use a single Field instance across many added documents, which can save substantial GC cost. It's best to create a single Document instance, then add multiple Field instances to it, but hold onto these Field instances and re-use them by changing their values for each added document. For example you might have an idField, bodyField, nameField, storedField1, etc. After the document is added, you then directly change the Field values (idField.setValue(...), etc), and then re-add your Document instance.

Note that you cannot re-use a single Field instance within a Document, and, you should not change a Field's value until the Document containing that Field has been added to the index. See Field for details.
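The shape of this re-use loop can be sketched without Lucene. `ReusableField`, `ReusableDocument`, and `RecordingWriter` below are hypothetical stand-ins written for this example, not Lucene's Field, Document, or IndexWriter; only the pattern (allocate once, set values, add, repeat) is taken from the text above.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical mutable field whose value can be replaced between documents.
class ReusableField {
    final String name;
    private String value;

    ReusableField(String name) { this.name = name; }
    void setValue(String value) { this.value = value; }
    String value() { return value; }
}

// Hypothetical document: an ordered list of field instances.
class ReusableDocument {
    final List<ReusableField> fields = new ArrayList<>();
    void add(ReusableField field) { fields.add(field); }
}

// Hypothetical writer that reads the field values at the moment a document is added.
class RecordingWriter {
    final List<Map<String, String>> indexed = new ArrayList<>();

    void addDocument(ReusableDocument doc) {
        Map<String, String> snapshot = new LinkedHashMap<>();
        for (ReusableField field : doc.fields) {
            snapshot.put(field.name, field.value());
        }
        indexed.add(snapshot);
    }
}

class FieldReuseDemo {
    // Index every {id, body} row using one document and two fields allocated up front.
    static List<Map<String, String>> indexAll(String[][] rows) {
        ReusableField idField = new ReusableField("id");
        ReusableField bodyField = new ReusableField("body");
        ReusableDocument doc = new ReusableDocument();
        doc.add(idField);      // fields are always added in the same order
        doc.add(bodyField);

        RecordingWriter writer = new RecordingWriter();
        for (String[] row : rows) {
            // Values change only after the previous addDocument call has returned.
            idField.setValue(row[0]);
            bodyField.setValue(row[1]);
            writer.addDocument(doc);
        }
        return writer.indexed;
    }

    public static void main(String[] args) {
        String[][] rows = { {"1", "first body"}, {"2", "second body"} };
        System.out.println(indexAll(rows));
    }
}
```

No field or document objects are allocated inside the loop, which is where the GC saving comes from.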

Always add fields in the same order to your Document, when using stored fields or term vectors

Lucene's merging has an optimization whereby stored fields and term vectors can be bulk-byte-copied, but the optimization only applies if the field name -> number mapping is the same across segments. Future Lucene versions may attempt to assign the same mapping automatically (see LUCENE-1737), but until then the only way to get the same mapping is to always add the same fields in the same order to each document you index.

Re-use a single Token instance in your analyzer

Analyzers often create a new Token for each term in sequence that needs to be indexed from a Field. You can save substantial GC cost by re-using a single Token instance instead.

Use the char[] API in Token instead of the String API to represent token text

As of Lucene 2.3, a Token can represent its text as a slice into a char array, which saves the GC cost of new'ing and then reclaiming String instances. By re-using a single Token instance and using the char[] API you can avoid new'ing any objects for each term. See Token for details.
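As a rough illustration of the idea, the sketch below splits text on whitespace while re-using one token object backed by a char array. `ReusableToken`, `TokenConsumer`, and `WhitespaceSplitter` are hypothetical names written for this example and do not reproduce Lucene's Token API; see Token for the real methods.

```java
// Hypothetical reusable token: its text is the slice [0, length) of a growable buffer.
class ReusableToken {
    char[] buffer = new char[16];
    int length;

    // Copy a slice of the source text into the token, growing the buffer only when needed.
    void setText(char[] source, int offset, int len) {
        if (len > buffer.length) {
            buffer = new char[Math.max(len, buffer.length * 2)];
        }
        System.arraycopy(source, offset, buffer, 0, len);
        length = len;
    }
}

// Hypothetical callback that receives each term; it must not keep a reference to the token.
interface TokenConsumer {
    void accept(ReusableToken token);
}

class WhitespaceSplitter {
    // Split text on whitespace, handing the same token instance to the consumer each time.
    static void tokenize(char[] text, ReusableToken token, TokenConsumer consumer) {
        int start = -1;                                   // start of the current term, or -1
        for (int i = 0; i <= text.length; i++) {
            boolean atBoundary = i == text.length || Character.isWhitespace(text[i]);
            if (!atBoundary && start < 0) {
                start = i;                                // a new term begins here
            } else if (atBoundary && start >= 0) {
                token.setText(text, start, i - start);    // no String is created
                consumer.accept(token);
                start = -1;
            }
        }
    }

    public static void main(String[] args) {
        ReusableToken token = new ReusableToken();
        final int[] totalChars = {0};
        tokenize("re-use a single token".toCharArray(), token,
                 t -> totalChars[0] += t.length);
        System.out.println(totalChars[0]);
    }
}
```

Once the buffer has grown to fit the longest term, the loop allocates nothing per term. The trade-off is that consumers must copy the characters out if they need them after the next term is produced.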

Use autoCommit=false when you open your IndexWriter

In Lucene 2.3 there are substantial optimizations for Documents that use stored fields and term vectors, to save merging of these very large index files. You should see the best gains by using autoCommit=false for a single long-running session of IndexWriter. Note however that searchers will not see any of the changes flushed by this IndexWriter until it is closed; if that is important you should stick with autoCommit=true instead or periodically close and re-open the writer.

Instead of indexing many small text fields, aggregate the text into a single "contents" field and index only that (you can still store the other fields).

Increase mergeFactor, but not too much.

A larger mergeFactor defers merging of segments until later, thus speeding up indexing because merging is a large part of indexing. However, this will slow down searching, and you will run out of file descriptors if you make it too large. Values that are too large may even slow down indexing, since merging more segments at once means much more seeking for the hard drives.
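The trade-off can be made concrete with a deliberately simplified model of level-based merging. This is an assumption-laden sketch, not Lucene's actual merge policy: every flush is treated as one equally sized segment, and whenever mergeFactor segments exist at a level they are merged into a single segment at the next level. `MergeModel` and its members are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified, hypothetical model of level-based merging; not Lucene's real merge policy.
class MergeModel {
    private final int mergeFactor;
    private final List<Integer> segmentsPerLevel = new ArrayList<>();  // segment count per level
    long flushesCopiedByMerges = 0;   // flush-sized units of data rewritten by merges
    int peakSegmentCount = 0;         // most segments left on disk after any flush

    MergeModel(int mergeFactor) {
        if (mergeFactor < 2) {
            throw new IllegalArgumentException("mergeFactor must be >= 2");
        }
        this.mergeFactor = mergeFactor;
    }

    // One flush adds a level-0 segment and may trigger a cascade of merges.
    void flush() {
        addSegment(0);
        peakSegmentCount = Math.max(peakSegmentCount, segmentCount());
    }

    private void addSegment(int level) {
        while (segmentsPerLevel.size() <= level) {
            segmentsPerLevel.add(0);
        }
        int count = segmentsPerLevel.get(level) + 1;
        if (count < mergeFactor) {
            segmentsPerLevel.set(level, count);
            return;
        }
        // mergeFactor segments at this level: rewrite them as one segment one level up.
        segmentsPerLevel.set(level, 0);
        long segmentSize = 1;                       // a level-L segment holds mergeFactor^L flushes
        for (int i = 0; i < level; i++) {
            segmentSize *= mergeFactor;
        }
        flushesCopiedByMerges += segmentSize * mergeFactor;
        addSegment(level + 1);
    }

    int segmentCount() {
        int total = 0;
        for (int count : segmentsPerLevel) {
            total += count;
        }
        return total;
    }

    static MergeModel simulate(int mergeFactor, int flushes) {
        MergeModel model = new MergeModel(mergeFactor);
        for (int i = 0; i < flushes; i++) {
            model.flush();
        }
        return model;
    }

    public static void main(String[] args) {
        for (int mergeFactor : new int[] {10, 100}) {
            MergeModel model = simulate(mergeFactor, 1000);
            System.out.println("mergeFactor=" + mergeFactor
                    + " copied=" + model.flushesCopiedByMerges
                    + " peakSegments=" + model.peakSegmentCount);
        }
    }
}
```

In this model, 1000 flushes with a factor of 10 rewrite 3000 flush-sized units of data and never leave more than 27 segments behind. A factor of 100 rewrites only 1000 units but leaves up to 108 segments, which is the extra search cost and file-descriptor pressure described above.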

Turn off any features you are not in fact using. If you are storing fields but not using them at query time, don't store them. Likewise for term vectors. If you are indexing many fields, turning off norms for those fields may help performance.

Use a faster analyzer.

Sometimes analysis of a document takes a lot of time. For example, StandardAnalyzer is quite time consuming, especially in Lucene versions <= 2.2. If you can get by with a simpler analyzer, then try it.

Speed up document construction. Often the process of retrieving a document from somewhere external (database, filesystem, crawled from a Web site, etc.) is very time consuming.

Don't optimize unless you really need to (for faster searching).

Use multiple threads with one IndexWriter. Modern hardware is highly concurrent (multi-core CPUs, multi-channel memory architectures, native command queuing in hard drives, etc.) so using more than one thread to add documents can give good gains overall. Even on older machines there is often still concurrency to be gained between IO and CPU. Test the number of threads to find the best performance point.
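A minimal sketch of this arrangement, assuming a fixed thread pool feeding one shared writer, might look like the following. `SharedWriter` is a hypothetical interface standing in for the single IndexWriter, and `ParallelIndexer` and the one-hour timeout are choices made for this example only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for the one shared writer; it must be safe to call from many threads.
interface SharedWriter {
    void addDocument(String doc);
}

class ParallelIndexer {
    // Feed all documents to a single shared writer from a fixed pool of worker threads.
    static void indexAll(List<String> docs, SharedWriter writer, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (String doc : docs) {
            pool.execute(() -> writer.addDocument(doc));
        }
        pool.shutdown();                                   // accept no new tasks
        try {
            if (!pool.awaitTermination(1, TimeUnit.HOURS)) {
                throw new IllegalStateException("indexing did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();            // preserve the interrupt flag
            throw new IllegalStateException("interrupted while indexing", e);
        }
    }

    public static void main(String[] args) {
        AtomicInteger added = new AtomicInteger();
        List<String> docs = new ArrayList<>();
        for (int i = 0; i < 1000; i++) {
            docs.add("doc-" + i);
        }
        // Try several values for the thread count and measure the throughput of each.
        indexAll(docs, doc -> added.incrementAndGet(), 4);
        System.out.println(added.get());
    }
}
```

The thread count is a parameter precisely because the best value depends on the hardware and on how expensive document construction and analysis are.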

Index into separate indices then merge. If you have a very large amount of content to index then you can break your content into N "silos", index each silo on a separate machine, then use writer.addIndexesNoOptimize to merge them all into one final index.

Run a Java profiler.

If all else fails, profile your application to figure out where the time is going. I've had success with a very simple profiler called JMP. There are many others. Often you will be pleasantly surprised to find some silly, unexpected method is taking far too much time.
