kylinsoong

Lucene/Solr Dev 2 : Indexing and Searching Simultaneously


Abstract

   First, my conclusion: Lucene can index and search simultaneously. We can even index a single document and search for it immediately. That sounds appealing, but is it necessary or useful to work this way? The answer is no. To explain why, we first need to look at Lucene's buffering and flushing mechanism.

 

Lucene Buffering and Flushing

 



 

As shown in the figure above, when new Documents are added to a Lucene index, or deletions are pending, they are initially buffered in memory instead of being written to disk immediately. This is done for performance, to minimize disk I/O. Periodically, these changes are flushed to the index Directory as a new segment. IndexWriter triggers a flush according to the following three criteria:

1. The buffer has consumed more than a preset amount of RAM. The buffer size can be set as follows:

IndexWriter writer = getWriter();
writer.setRAMBufferSizeMB(10);

2. If setMaxBufferedDocs() has been called on the IndexWriter, a flush is triggered once the specified number of Documents has been added, as in the code below:

IndexWriter writer = getWriter();
writer.setMaxBufferedDocs(100);

3. As with criterion 2, if setMaxBufferedDeleteTerms() has been called on the IndexWriter, a flush is triggered once the specified number of delete terms has been buffered:

IndexWriter writer = getWriter();
writer.setMaxBufferedDeleteTerms(100);

In addition, by default IndexWriter flushes only when RAM usage reaches 16 MB.
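The flush rule described above can be sketched as a small toy model. This is not Lucene's IndexWriter: the class BufferingWriterSketch, the DISABLED marker and the one-byte-per-character size estimate are all assumptions made for illustration, and the delete-terms trigger is left out for brevity.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model only: NOT Lucene's IndexWriter, just the flush rule in isolation.
class BufferingWriterSketch {
    static final int DISABLED = -1;          // hypothetical marker for "trigger not set"

    private final long ramLimitBytes;        // plays the role of setRAMBufferSizeMB
    private final int maxBufferedDocs;       // plays the role of setMaxBufferedDocs
    private final List<String> buffer = new ArrayList<>();
    private long bufferedBytes = 0;
    private int flushCount = 0;

    BufferingWriterSketch(long ramLimitBytes, int maxBufferedDocs) {
        this.ramLimitBytes = ramLimitBytes;
        this.maxBufferedDocs = maxBufferedDocs;
    }

    void addDocument(String doc) {
        buffer.add(doc);
        bufferedBytes += doc.length();       // crude estimate: one byte per character
        boolean ramFull = bufferedBytes >= ramLimitBytes;
        boolean docsFull = maxBufferedDocs != DISABLED && buffer.size() >= maxBufferedDocs;
        if (ramFull || docsFull) {
            flush();
        }
    }

    private void flush() {
        // A real writer would write a new segment here; the sketch only counts flushes.
        buffer.clear();
        bufferedBytes = 0;
        flushCount++;
    }

    int flushCount() { return flushCount; }

    int bufferedDocs() { return buffer.size(); }
}
```

With a large RAM limit and a document limit of 100, adding 250 small documents to this model causes two flushes and leaves 50 documents in the buffer. Whichever trigger fires first wins, which is the behaviour the three criteria describe.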

By now you may have a question: once a flush has been triggered, can an IndexReader search the flushed Documents? In fact, when a flush occurs, the writer creates new segment and deletion files in the Directory. However, these files are neither visible nor usable to a newly opened IndexReader until the writer commits the changes and the reader is reopened. It is important to understand this difference. Flushing is done to free up memory consumed by buffered changes to the index, whereas committing is done to make all changes (buffered or already flushed) persistent and visible in the index. This means an IndexReader always sees the starting state of the index (as of when the IndexWriter was opened) until the writer commits.
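The difference between flushing and committing can be illustrated with a toy model. It is a minimal sketch under stated assumptions: Index, Writer and Reader here are hypothetical in-memory stand-ins rather than Lucene classes, and "reopening" a reader is modelled simply as constructing a new one.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model only: mimics the visibility rule described above, not Lucene's classes.
class VisibilitySketch {
    static class Index {
        final List<String> committed = new ArrayList<>();        // visible to new readers
    }

    static class Writer {
        private final Index index;
        private final List<String> buffered = new ArrayList<>(); // held in RAM
        private final List<String> flushed = new ArrayList<>();  // written, not yet committed

        Writer(Index index) { this.index = index; }

        void addDocument(String doc) { buffered.add(doc); }

        // Frees the in-memory buffer but publishes nothing to readers.
        void flush() {
            flushed.addAll(buffered);
            buffered.clear();
        }

        // Makes buffered and already flushed changes visible to readers opened afterwards.
        void commit() {
            flush();
            index.committed.addAll(flushed);
            flushed.clear();
        }
    }

    static class Reader {
        private final List<String> snapshot;

        // A reader is a point-in-time snapshot of the committed state.
        Reader(Index index) { this.snapshot = new ArrayList<>(index.committed); }

        boolean contains(String doc) { return snapshot.contains(doc); }
    }
}
```

In this model a reader opened after flush() but before commit() does not see the new document, and a reader opened before commit() keeps its old snapshot even after the commit. Only a reader created after the commit sees the document.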

Returning to the topic of this article: if you want to index and search simultaneously, or even index one document and search for it immediately, your IndexWriter must call commit() after adding each single Document. To explain further how Lucene indexes and searches simultaneously, we need to look at Lucene index commits.

 

Lucene index commits

As stated at the beginning of this article, there is no need to commit frequently while indexing: commit is a costly operation, and doing it often slows down indexing throughput.

I ran an experiment: indexing a 100 MB resource file, which means adding 68768 Documents to the index directory. The results were as follows:

1. If IndexWriter's commit() method is never called during the whole indexing process, indexing takes only 25 seconds.

2. If commit() is called after every single Document is added, which means 68768 commits while indexing the same 100 MB resource file, indexing takes more than 2 hours. Committing this frequently is clearly not a good choice.

3. If commit() is called after every 5000 Documents, for the same 100 MB resource file and 68768 Documents, commit() is called 13 times and the total indexing time is 31 seconds.

4. If commit() is called after every 500 Documents, the total indexing time is 46 seconds.

5. If commit() is called after every 100 Documents, the total indexing time is 115 seconds.

Summarizing these results in a table makes them easier to compare:

 

Indexing a 100 MB resource file, adding 68768 documents to the index directory

Commit frequency          Number of commits    Indexing time
never                     0                    25 seconds
every 5000 documents      13                   31 seconds
every 500 documents       137                  46 seconds
every 100 documents       687                  115 seconds
every 50 documents        1375                 210 seconds
every document            68768                more than 2 hours

The table shows that indexing time grows sharply as commits become more frequent.
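The figures in the table can be checked with a little arithmetic. The sketch below recomputes the number of commits for each interval and estimates the extra time per commit relative to the 25-second run without commits. It assumes the whole time difference is due to committing, which the experiment does not prove. The class and method names are made up for this illustration.

```java
// Worked arithmetic on the measurements reported in the table.
class CommitCostSketch {
    static final int TOTAL_DOCS = 68768;
    static final int BASELINE_SECONDS = 25;   // run with no commit during indexing

    // Number of full batches, i.e. commits issued during indexing.
    static int commitCount(int docsPerCommit) {
        return TOTAL_DOCS / docsPerCommit;
    }

    // Extra time relative to the no-commit run, divided by the number of commits.
    // Assumption: all of the extra time is caused by committing.
    static double extraSecondsPerCommit(int docsPerCommit, int totalSeconds) {
        return (double) (totalSeconds - BASELINE_SECONDS) / commitCount(docsPerCommit);
    }

    public static void main(String[] args) {
        // {documents per commit, measured indexing time in seconds}
        int[][] rows = { {5000, 31}, {500, 46}, {100, 115}, {50, 210} };
        for (int[] row : rows) {
            System.out.printf("every %d docs: %d commits, %.3f extra s per commit%n",
                    row[0], commitCount(row[0]), extraSecondsPerCommit(row[0], row[1]));
        }
    }
}
```

Under that assumption each commit adds between about 0.13 and 0.46 seconds in these runs, so the total overhead is driven mainly by how many commits are issued.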

 

Conclusion

Indexing and searching simultaneously with Lucene is possible, but only once the IndexWriter's commit() method has been called and the IndexReader has been reopened. To preserve Lucene's performance, we should not commit too frequently. Based on the results above, a good choice is to call commit() only after at least 500 Documents have been added to the index directory.
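The batching advice above could be applied roughly as follows. This is a minimal sketch, not the code used in the experiment: DocWriter is a hypothetical interface standing in for whatever wraps the real IndexWriter, and the final commit for a partial batch is an added assumption so that no document is left uncommitted.

```java
import java.util.List;

class BatchedCommitSketch {
    // Hypothetical minimal writer interface; a real IndexWriter would sit behind it.
    interface DocWriter {
        void addDocument(String doc);
        void commit();
    }

    // Adds all docs, committing after every batchSize additions and once more at the
    // end if documents remain uncommitted. Returns the number of commits issued.
    static int indexInBatches(List<String> docs, DocWriter writer, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        int commits = 0;
        int pending = 0;
        for (String doc : docs) {
            writer.addDocument(doc);
            pending++;
            if (pending == batchSize) {
                writer.commit();
                commits++;
                pending = 0;
            }
        }
        if (pending > 0) {
            writer.commit();   // flush out the last, partial batch
            commits++;
        }
        return commits;
    }
}
```

Readers that need to see new documents would still have to be reopened after each commit. The batch size is then the trade-off between indexing throughput and how soon documents become searchable.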

 

PS: My English is not very proficient, so some of my points may not be stated clearly. Comments that help make them clearer are welcome.

 

 

 

 

 

 

 

 

 

 

 
