
Solr Index Size Analysis

Read more: http://juanggrande.wordpress.com/2010/12/20/solr-index-size-analysis/

In this post I’m going to talk about a set of benchmarks that I’ve done with Solr. The goal behind it is to see how each parameter defined in the schema affects the size of the index and the performance of the system.

The first step was to fetch the set of documents that I was going to use in the tests. I wanted the documents to be composed of real text, so I started to look for sources on the Internet. The first one that I really liked was Twitter. They provide a REST API that allows you to read a continuous stream of tweets, composed of approximately 1% of all public tweets. Each tweet is expressed as a JSON object and carries metadata about the message and the author. While this source allowed me to get a good number of documents in a short time (about 1.7 million tweets in 2 days), they were really small, so I started to look for a source of bigger documents, finally choosing Wikipedia. I downloaded the documents over HTTP using the “Random article” feature on their site, obtaining about 160,000 articles in a couple of days. At the time of writing, the site download.wikipedia.org, which provides an easy way of downloading a bunch of articles, was out of service.
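For illustration, here is a minimal sketch of how random articles can be pulled over HTTP. It assumes the `requests` library and Wikipedia's `Special:Random` redirect URL; it is not the author's actual crawler.

```python
# Minimal sketch: download random Wikipedia articles over HTTP.
# Assumes the `requests` library and the Special:Random redirect;
# not the original script used for the benchmarks.
import time
import requests

RANDOM_URL = "https://en.wikipedia.org/wiki/Special:Random"  # assumed endpoint

def fetch_random_articles(n, delay=1.0):
    """Fetch n random article pages and return (final_url, html) pairs."""
    pages = []
    with requests.Session() as session:
        for _ in range(n):
            # Special:Random redirects to a randomly chosen article.
            resp = session.get(RANDOM_URL, timeout=30)
            resp.raise_for_status()
            pages.append((resp.url, resp.text))
            time.sleep(delay)  # be polite to the server
    return pages

if __name__ == "__main__":
    for url, html in fetch_random_articles(3):
        print(url, len(html), "bytes")
```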

The next step was to design the schema. Because one of the objectives is to see how each change in the schema affects the size of the index, I used many different combinations of parameters, so as to measure the influence of each one of them. In each case, the stop-word list was populated with the top 100 terms of each set of documents, obtained from the administration panel of Solr. For both data-sets, the “omitNorms”, “termVectors” and “stopWords” parameters refer to the “text” field. In all cases, the value of the parameters “termOffsets” and “termPositions” is the same as that of “termVectors”. A sketch of such a field declaration follows this paragraph.
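To make the parameters concrete, below is a minimal sketch of declaring such a field programmatically. It uses Solr's Schema API, which exists only in later Solr releases with a managed schema; at the time of the original post these attributes were set by editing schema.xml directly, and stop words were configured through a StopFilterFactory in the field type's analyzer chain. The Solr URL, collection name, and field type are placeholders.

```python
# Minimal sketch: declare a "text" field with the properties discussed above
# (stored, termVectors, termPositions, termOffsets, omitNorms) through Solr's
# Schema API. Assumes a local Solr instance with a managed schema; the
# collection name ("benchmark") and field type are placeholders, not taken
# from the original post.
import requests

SOLR_SCHEMA_URL = "http://localhost:8983/solr/benchmark/schema"  # assumed

field_definition = {
    "add-field": {
        "name": "text",
        "type": "text_general",   # placeholder field type
        "indexed": True,
        "stored": True,           # toggled per benchmark run
        "termVectors": True,      # termOffsets/termPositions follow termVectors
        "termPositions": True,
        "termOffsets": True,
        "omitNorms": False,       # toggled per benchmark run
    }
}

resp = requests.post(SOLR_SCHEMA_URL, json=field_definition, timeout=30)
resp.raise_for_status()
print(resp.json())
```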

In the first figure you can see the size of the index for each schema for the Twitter data-set, and what proportion of the index corresponds to each parameter. Remember that this data-set has lots of documents (about 1.7 million) but each one is small (240 bytes on average). There are many remarkable things here. The first is that the space occupied by the term vectors (~280 MiB when not using stop words) is almost equal to the space occupied by the inverted index itself (~240 MiB). Second, the space saved by omitting norms is almost negligible (~2 MiB). Third, the space saved by using stop words roughly doubles when storing term vectors, going from about 4% of the index to about 10%. Finally, the space occupied by the stored fields (~340 MiB) is considerably bigger than the space occupied by the inverted index itself.

[Figure 1: index size per schema, Twitter data-set]
In the second figure you can see the same information for the Wikipedia data-set. The size occupied by the norms is still negligible (< 1 MiB); however, the size occupied by the stop words has increased to about 22% of the index size when not storing term vectors, and to about 25% when storing them. This time, the space occupied by the term vectors (~1067 MiB) is almost three times the space occupied by the inverted index itself (~380 MiB). Finally, the size of the stored documents (~6330 MiB) is more than four times the size of the index with term vectors stored.

[Figure 2: index size per schema, Wikipedia data-set]
At this point, we can state some conclusions concerning the size of the index:

1. When the number of fields is small, the size of the norms is negligible, independently of the size and number of documents.
2. When the documents are large, stop words help reduce the size of the index significantly. Two things are worth noting here. First, the documents fetched from Wikipedia are written in traditional language, and all in English, while the documents fetched from Twitter are written in modern language, and in many different languages. Second, I didn’t measure the precision and recall of the system when using stop words, so it is possible that findability in a real scenario won’t be as good.
3. If you’re storing the documents, and they are big enough, it doesn’t matter much whether you store the term vectors or not, so if you’re using a feature such as highlighting and you are looking for good performance, you should store them (see the query sketch below). If you’re not storing documents, or your documents are small, you should think twice before storing term vectors, because they’re going to increase your index’s size significantly.
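As a concrete example of a feature that benefits from term vectors, here is a minimal sketch of a highlighting query against Solr's standard select handler, using the standard `hl` parameters. The core name, field name, and query term are placeholders.

```python
# Minimal sketch: run a query with highlighting enabled on the "text" field.
# When termVectors/termPositions/termOffsets are stored, highlighters can
# reuse them instead of re-analyzing the stored text at query time.
# The core name ("benchmark") and query are placeholders.
import requests

SOLR_SELECT_URL = "http://localhost:8983/solr/benchmark/select"  # assumed

params = {
    "q": "text:lucene",   # placeholder query
    "wt": "json",
    "hl": "true",         # enable highlighting
    "hl.fl": "text",      # field(s) to highlight
    "hl.snippets": 3,     # snippets per document
}

resp = requests.get(SOLR_SELECT_URL, params=params, timeout=30)
resp.raise_for_status()
print(resp.json().get("highlighting", {}))
```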
I hope you find this post useful. Currently I’m working on a set of benchmarks to measure the influence of each one of these parameters on the performance of the system, so if you liked this post, stay tuned!