Scaling up Large Scale Search from 500,000 volumes to 5 Million volumes and beyond
To scale up from 500,000 volumes of full-text to 5 million, we decided to use Solr's distributed search feature, which allows us to split an index into a number of separate indexes (called "shards"). The shards are searched in parallel and the results aggregated, so performance is better than with a single very large index.
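A distributed query is an ordinary Solr select request carrying a "shards" parameter that lists every shard's host and core; Solr fans the query out and merges the results. A minimal sketch of building such a request URL (the host names, port, core paths, and field name here are hypothetical, for illustration only):

```python
from urllib.parse import urlencode

# Hypothetical shard locations: Solr's "shards" parameter takes a
# comma-separated list of host:port/core entries (no http:// prefix).
shards = ",".join(
    f"solr-serve-{n}.example.org:8080/solr/shard{n}" for n in range(1, 11)
)

params = urlencode({
    "q": "ocr:congress",  # query against the full-text OCR field
    "shards": shards,     # fan the query out to all 10 shards
    "rows": 20,
})
url = f"http://solr-serve-1.example.org:8080/solr/select?{params}"
```

Any one of the shards can act as the aggregator for a given request; the merge cost is small compared with the per-shard search cost.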
Sizing the shards
Our previous tests indicated that disk I/O was the bottleneck in Solr search performance and that once a single index grows beyond about 600,000 volumes, I/O demands rise sharply. (See Update on Testing: Memory and Load tests for details.) For that reason we decided that shards of between 500,000 and 600,000 volumes would be appropriate. For 5 million volumes, we decided on 10 shards, each indexing 500,000 volumes. The corresponding index size for each shard would be around 200 GB, for a total of about 2 terabytes. We plan to add shards to scale beyond 5 million volumes. For details on the server hardware see New Hardware for searching 5 million+ volumes of full-text.
Several postings on the Solr list advised against using Network Attached Storage (NAS), so we initially configured the hardware to use Direct Attached Storage (DAS).[1]
Earlier tests indicated we needed to optimize our indexes to get acceptable performance. Solr requires about 2x the size of the index as temporary space during optimization, so we needed about 400 GB of disk to accommodate a 200 GB index. Our tests also indicated that while Solr allows searching to continue while an index is being optimized, search performance degrades severely during optimization. To avoid this, we decided to build and optimize the indexes on separate Solr instances. However, to avoid frequently copying our large indexes over the network, we set up the index-building and searching Solr instances on the same servers for the shards they serve.
We want to update the indexes daily to reflect newly ingested content. Because of the size of our index shards (almost 200GB) and our desire for a daily index update, we didn't want to spend time making copies of the newly-built shards every day. We decided to use LVM snapshots and simply take snapshots of the quiescent index shards after their daily build, avoiding copying the entire index every day.
We configured our servers with the largest and fastest hard disks available. Each server had 6 450GB 15K RPM SAS drives in a RAID 6 configuration. [2]
After the overhead for the OS, filesystem, and RAID, we had about 1,600 GB usable per server, which we divided into four 400 GB logical volumes for the four Solr shards: two for indexing and two for serving. [3] With two 500,000-volume serving shards per server, each server handles 1 million volumes. To accommodate 5 million volumes, we set up 5 servers.
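The capacity arithmetic above can be checked in a few lines (all figures taken from the text):

```python
# Back-of-the-envelope capacity check for the DAS layout described above.

usable_per_server_gb = 1600                   # after OS, filesystem, RAID 6 overhead
volume_size_gb = usable_per_server_gb // 4    # four logical volumes per server

serving_shards_per_server = 2                 # the other two volumes hold build shards
docs_per_shard = 500_000
docs_per_server = serving_shards_per_server * docs_per_shard

target_docs = 5_000_000
servers_needed = target_docs // docs_per_server

print(volume_size_gb, docs_per_server, servers_needed)  # 400 1000000 5
```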
Index building process
Nightly, after indexing for all shards was completed, our driver script ran an optimize on each of the indexing shards. After the optimize finished the driver script triggered read-only snapshots of the optimized indexes and mounted them for use as serving shards. Once the snapshots were taken, indexing could resume on the indexing shards.
Implementation issues
As we were scaling up, when we got close to an index size of 190GB per shard, we started having problems with the optimization process running out of disk space. Several posts to the Solr list revealed that under various conditions, optimization can take up to 3 times the final optimized index size rather than the 2 times that we were expecting. [4]
We implemented a workaround: a program that uses the "maxSegments" parameter to the Solr optimize command to optimize in stages. For example, the program optimizes down to 16 segments, then 8, then 4… and finally down to 1. The cleanup at the end of each optimize gets rid of any unneeded files and/or file handles. [5]
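The staged optimize amounts to a loop over a halving sequence of maxSegments targets, posting one optimize message per stage. A sketch (the `post` callable standing in for an HTTP POST to a shard's update handler is hypothetical; the `<optimize maxSegments="N"/>` message is Solr's actual update syntax):

```python
def staged_targets(start: int = 16) -> list:
    """Halving sequence of maxSegments targets: 16, 8, 4, 2, 1."""
    targets = []
    n = start
    while n >= 1:
        targets.append(n)
        n //= 2
    return targets


def optimize_in_stages(post, start: int = 16) -> None:
    """post(body) sends one update message to a shard's update handler.

    Each stage merges down to the given segment count; the cleanup at
    the end of each stage releases the intermediate segment files, which
    keeps peak temporary disk usage closer to 2x than 3x the final
    index size.
    """
    for n in staged_targets(start):
        post(f'<optimize maxSegments="{n}"/>')
```

In testing, substituting a list's `append` for `post` shows the five messages the driver would send, ending with the final merge to a single segment.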
Our early performance tests with this architecture indicated good response time and confirmed that disk I/O continued to be the bottleneck in search performance. Other resources appeared to be underutilized. During continuous search testing, CPU utilization tended to be between 10 and 20%.
Our intent was to have nightly index updates and for the search system to be available 24/7. At some point we ran our tests while an optimization was occurring and discovered that performance slowed to a crawl during optimization. This is because optimization is very disk intensive (the entire index needs to be read and then the optimized index written), and the disk I/O for optimization competes with the disk I/O needed for searching.
Once we noticed the poor search performance during optimization, we realized that the disk I/O impact of optimization was exacerbated by the additional I/O load of block-level LVM snapshots. [6] We examined our options for reducing I/O contention in a small 6-drive configuration and considered eliminating our use of snapshots and putting the indexing shard and the serving shard on separate RAID sets. This would reduce contention for disk I/O, since indexing/optimizing and serving would be making read/write requests to entirely different disks, although some contention on the shared RAID controller would remain. [7] However, splitting the configuration into two RAID sets would mean doubling the RAID overhead, which would effectively reduce capacity by almost a terabyte, or require a change from RAID 6 to RAID 5. A 1 terabyte reduction in capacity would mean reducing the size of each index by almost half, thus requiring twice as many servers. We abandoned the option of separate RAID sets because we were neither willing to lose almost 1 TB of capacity nor forgo RAID 6.
Given the options with the existing hardware, we also considered confining optimizing to once a week during off-peak hours on the weekend. However, we soon ran into another problem: we were rapidly running out of disk space to accommodate daily index growth. This was due to a number of factors, including:
- Ingest of volumes accelerated considerably beyond our predictions
- We underestimated the disk space needed. When we implemented CommonGrams to meet our performance goals, the index size went from about 35% of the size of the documents to about 47%. Using CommonGrams, the 200 GB index size will accommodate about 425,000 documents per shard instead of the 500,000-600,000 we first estimated.
- Our order for more hardware to keep up with the demand was delayed significantly.
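The CommonGrams effect on capacity works out roughly as follows, assuming an average of about 1 MB of OCR text per volume (an illustrative figure, not stated in this post):

```python
# How the index-size-to-text-size ratio limits volumes per 200 GB shard.
# Assumes ~1 MB of OCR text per volume (illustrative assumption).

shard_index_gb = 200
mb_per_volume = 1.0


def volumes_per_shard(index_to_text_ratio: float) -> int:
    """Text capacity implied by the index budget, converted to volumes."""
    text_gb = shard_index_gb / index_to_text_ratio
    return int(text_gb * 1000 / mb_per_volume)


before = volumes_per_shard(0.35)  # plain index: ~571,000 volumes per shard
after = volumes_per_shard(0.47)   # with CommonGrams: ~425,000 volumes per shard
```

Under this assumption the 35% ratio lands within the original 500,000-600,000 volume estimate, and the 47% ratio reproduces the ~425,000 figure above.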
These considerations along with a concern for a more manageable storage solution resulted in a decision to consider our options for a high-performance NAS solution, and to use our existing Isilon cluster (which houses the repository) provisionally.
Moving Large Scale Search to a high performance Network Attached Storage system
NAS or SAN storage is generally preferable to DAS because it is easier to manage, among other reasons [8]. We initially pursued DAS based on various posts on the Solr list that led us to believe that DAS would have better performance. Given the size of our indexes and our requirements, we learned that NAS was a much better fit in our circumstances. The benefits of using our large-scale NAS over our simple DAS configuration for this project are significant, including:
- Significantly more spindles contributing to I/O. Even though we had 30 15K drives for storage in our DAS configuration, any given I/O transaction had access to only 6 of those, which barely distributes I/O load. Our Isilon cluster, on the other hand, has 480 drives, and is highly virtualized: every block of every file is striped over no less than 14 different drives. I/O is therefore distributed throughout the cluster, minimizing the I/O demand on any given drive.
- Shared storage. We could now build index shards on completely different servers without copying terabytes of index data over the network.
- Storage efficiency resulting from a shared overhead for redundancy and free space. With the DAS configuration, we allocated 1/3 of our storage to RAID 6 parity and could use at most 50% of our storage because of the need for excess capacity on each server during optimization. With the NAS configuration, we have a smaller, fixed redundancy overhead of under 20% because our stripe width is more than twice as large, and our free space overhead is shared among all uses of the system distributed over time, so our overall storage utilization can be higher.
- Independent scaling of compute and storage resources. With DAS, we were going to be adding servers to get more storage when we didn't really need more compute resources. With NAS, we can add more of the resource we need. We can add more shards per server or index more fields by increasing our storage allocation, or increase Solr query throughput capacity by adding more servers.
- The Isilon system we are currently using has some characteristics that make it well suited to this task; in particular, its cluster architecture lets it scale performance linearly with capacity.
- NAS systems typically offer file-based rather than the block-based snapshots used by LVM, which are more compatible with the I/O characteristics encountered during optimization.
Current architecture
With the flexibility of the Isilon NAS, we can completely separate index building from serving. The Isilon snapshot technology is file-based rather than block-based, and because data can be striped across so many disks, we can use snapshots without the adverse effects we saw with direct-attached drives. Our current configuration is:
Search Servers
- 4 Servers with one Tomcat and 3 shards on each server (only 10 of 12 shards currently in use)
- 72 GB memory, 16 GB allocated to the JVM
Index Server
- 1 Server with 12 Tomcats and 12 shards (only 10 of 12 Tomcats/shards currently in use)
- 72 GB memory, 6 GB allocated to each of 10 JVMs
Current Production Index building process
Every morning the index driver gets a list of newly added HathiTrust volumes and queues them for indexing. Indexing currently takes from 1 to 3 hours depending on the number of documents being indexed. After indexing is completed, the index driver runs a program that optimizes the index on each shard down to two segments. [9] This currently takes about 3 hours but will increase as the number of documents in the second segment increases. At 3:00 am the next morning, a snapshot is taken of each optimized shard and synchronization of the index to our second site in Indiana begins. At 6:00 am, a script performs sanity checks to ensure indexing and optimization finished correctly, changes the symbolic link pointing to the day-old shard to point to the newly updated shard, and restarts Tomcat. At 6:05 am a separate program runs 1,600 cache-warming queries to warm the Solr and OS caches, and then runs our test suite to monitor response time.
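The symlink flip can be done atomically so that searchers never observe a half-switched state. A sketch of that step (paths are hypothetical; on POSIX filesystems `os.replace` renames atomically):

```python
import os


def point_serving_link(link_path: str, new_target: str) -> None:
    """Atomically repoint link_path at new_target.

    A symlink cannot be retargeted in place, so we create a temporary
    link next to it and rename it over the old one; rename() is atomic,
    so readers see either the old target or the new one, never neither.
    """
    tmp = link_path + ".tmp"
    if os.path.lexists(tmp):
        os.remove(tmp)
    os.symlink(new_target, tmp)
    os.replace(tmp, link_path)  # atomic swap of the serving link
```

After the swap, Tomcat is restarted against the link so the Solr instance opens the newly snapshotted index.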
Scaling beyond 5 million volumes
With the use of NAS, we can easily add more disk capacity to Large Scale search, so decisions about scaling up from 5 million to 10 million volumes will primarily involve resources other than disk storage.
Scaling Search
We currently have about 5.3 million documents in Large-Scale search, which averages out to about 530,000 volumes per shard. We are not yet using shards 11 and 12. At the current shard size we could accommodate about an additional 1.1 million documents using those two shards. However, the amount of growth that can be accommodated by adding more documents to existing shards and utilizing the two currently empty shards depends on how large we allow the shards to grow. We are working on experiments to determine the optimum shard size, based primarily on search response time considerations. [10] We also plan to experiment to determine the optimum number of shards per server. As mentioned previously, CPU utilization on the search servers is low. Once we determine the optimum size of each shard and the optimum number of shards per server, we will be able to develop better strategies for adding more memory or more servers.
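The headroom figures above follow directly from the shard count:

```python
# Headroom in the 12-shard layout at today's average shard size.
docs_total = 5_300_000
shards_in_use = 10
docs_per_shard = docs_total // shards_in_use    # 530,000 volumes per shard

empty_shards = 2
extra_capacity = empty_shards * docs_per_shard  # ~1.06 million more volumes
```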
Scaling Indexing
As the size of the index grows, we are reaching the limits of the 72 GB of memory currently installed on our build machine with just 10 Solrs building shards. We believe we will not be able to build indexes on the other 2 shards without either adding memory or implementing some complex code to start and stop Solr instances. We plan to obtain two new build servers, each with twice the memory (144 GB). This should accommodate future growth as well as provide us with failover options and the flexibility we will need to rebuild the entire index when necessary without disrupting our normal production indexing workflow. [11]
------------------------------------------------------------------------------------------------------------------------------------------
NOTES
[1] See http://www.lucidimagination.com/search/document/2105e4dba8711d71/two_solr_instances_index_and_commit#2105e4dba8711d71 and http://www.lucidimagination.com/search/document/f67e23ea39be9361/solr_and_nfs_in_distributed_deployment_real_time_indexing_and_real_time_searching#f67e23ea39be9361
[2] After hearing a presentation on using Solid State Drives for the Summa project, we considered Solid State Drives, but the cost of 2 terabytes of SSDs (over $150,000) was prohibitive. (PDF of presentation on SSDs)
[3] The index building shards needed 400GB or 2x the size of the 200GB index in order to run the index optimization process. Because of the nature of the LVM block based snapshot technology, we also needed 400GB for the serving shards. This is explained in more detail in note 6.
[4] These posts to the Solr list revealed that optimizing can take 3 times the final optimized index size rather than the 2 times we were expecting: http://old.nabble.com/How-much-disk-space-does-optimize-really-take-to25790344.html#a25792748
http://old.nabble.com/solr-optimize---no-space-left-on-device-tt25767386.html#a25786569
[5] See http://old.nabble.com/Optimization-of-large-shard-succeeded-to25809281.html#a25809281. One of the Solr committers has opened an issue to implement our workaround as a default in Solr: https://issues.apache.org/jira/browse/SOLR-1560
[6] LVM is block based and its snapshots are block based. When first taken a snapshot consumes no space other than metadata. As changes are made to the source file system, the snapshot grows because changed blocks must be preserved, regardless of whether the blocks contained information about current files before they were changed. So in our setup where file systems were sized at 2x the index to accommodate optimization, the act of optimization changed every block in the file system. This in turn required LVM to copy every block in the file system in order to maintain the snapshot, effectively doubling the I/O load on the RAID set. In addition, our system administrators observed lengthy pauses in disk I/O during optimization (on the order of a minute), where the only explanation appeared to be that LVM required time for "house cleaning", for lack of a better term. During such pauses, the processing of a search query would likely have to wait until I/O operations are resumed.
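The doubling effect described in note [6] can be illustrated with a toy copy-on-write model (a deliberate simplification; real LVM tracks extents rather than individual blocks, and the old-block read may be absorbed by caching):

```python
def writes_with_snapshot(nblocks: int) -> int:
    """Toy block-level copy-on-write model.

    The first write to any snapshotted block forces an extra write to
    preserve the old block in the snapshot area. Optimization rewrites
    every block of the file system, so every logical write pays the
    copy-on-write penalty and the write load on the RAID set doubles.
    """
    preserved = set()
    writes = 0
    for blk in range(nblocks):  # optimize touches each block once
        if blk not in preserved:
            writes += 1         # copy the old block into the snapshot
            preserved.add(blk)
        writes += 1             # the new data itself
    return writes
```

A second pass over the same blocks would not pay the penalty again, but optimization effectively rewrites everything exactly once, which is the worst case for a block-level snapshot.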
[7] With the build and serve shards on separate RAID sets, we would still need 400GB for each shard. Once the build shard optimizes the index (which requires 400GB for a 200GB index) there are two alternatives to mounting the newly optimized index on the serving shard:
1) Copy the newly optimized index from the build shard to the serving shard. Since this takes considerable time, the serving shard would have to continue to serve from the old index until the copy is completed. This would require 200GB for the old index and 200GB for the new index. After the copy the old index could be deleted.
2) Change symlinks so that the server is pointing to the newly optimized index and the building Solr instance is pointing to the previous serve index.
[8] Conventionally, the benefits of networked storage (NAS or SAN) are easier management, consolidation, and flexible deployment, all coming with the additional cost of some sort of sophisticated storage controller system. If architected correctly, performance of I/O-intensive workloads, even though directed over a network, will also benefit.
Most of these benefits come through virtualization, the technology that allows all the disks in a networked storage system to be pooled together and logical volumes to be created within the pool, typically spread across many devices, controllers, and caches. Not only is it quicker and easier to create logical volumes with a single unified system, but logical volumes tend to perform better because the I/O activity in a virtualized system is balanced throughout the system and thus minimized on any single drive. In particular, throughput is increased because of the aggregation of multiple drives, each capable of reading or writing through its own disk interface, and rotational delay is minimized because data retrieval can be parallelized rather than serialized across multiple drives. This effect is maximized in storage architectures that are large, highly virtualized, and designed for performance scaling as well as capacity scaling (so that the controllers can keep up with the additional I/O capacity of the aggregated system).
In evaluating whether to use a NAS or SAN system, the primary consideration is whether multiple servers need to be able to access the same data. NAS systems, by their nature, use protocols such as NFS that allow multiple servers read-write access to file systems. SAN systems, on the other hand, use protocols such as SCSI over Fibre Channel that permit only a single server to have access without special lock management software. Either provides the benefits described above, but generally one will be a much better fit for a given application. For our project, NAS makes the most sense because we need the same indexes visible on multiple servers for building versus serving, and down the road, for server failover.
[9] Prior to going live in production we optimized all the shards. At that point the index size per shard was about 220 GB. Every night we indexed additional documents, which created unoptimized index segments. If we optimized every night down to 1 segment, the entire index needs to be read, processed, and written to disk. This would be about 220 GB of reading and 220 GB of writing. We decided instead to optimize down to 2 segments, which means that the large 220 GB segment is not affected and the incremental added indexing gets optimized down to a second segment. So far adding about 600,000 total volumes (60,000 per shard), has resulted in a second segment in each shard of about 50GB. It currently takes between 3 and 4 hours to optimize the daily added indexing because of the time it takes to read and write 50GB of data for each shard. At some point when this time becomes too large, we will optimize down to 1 segment and start over.
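The I/O saving from the two-segment strategy in note [9] is easy to quantify with the figures given:

```python
# Nightly optimize I/O under the two strategies described in note [9].
big_segment_gb = 220   # fully optimized base segment, untouched nightly
daily_growth_gb = 50   # incremental segments accumulated since go-live

# Optimizing to 1 segment rewrites everything: read + write the whole index.
full_io_gb = 2 * (big_segment_gb + daily_growth_gb)  # 540 GB per shard

# Optimizing to 2 segments leaves the base segment alone.
partial_io_gb = 2 * daily_growth_gb                  # 100 GB per shard
```

As the second segment grows, `partial_io_gb` grows with it, which is why a periodic full merge back to one segment eventually becomes worthwhile.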
[10] Our previous experiments showing a sharp rise in I/O demands with indexes over 600,000 volumes were conducted in a significantly different environment. In particular, they were conducted prior to the implementation of CommonGrams, which significantly reduces I/O for queries with common words. The new environment also has significantly more processing power and memory. We plan to repeat our experiments in the new environment on a test shard to determine the relationship of index/shard size to I/O demands.
[11] We plan to have each of the two new build servers host only 6 of the 12 shards during the normal indexing/build process. However, they will be configured so that either one can take over building all 12 shards in case the other fails. When we need to re-index the entire 5 million volumes, we plan to dedicate one server to the normal incremental indexing process on all 12 shards and the other to building the new index on 12 new shards. We need to be able to reindex the entire corpus whenever we change the Solr schema or the list of common words. Adding fields such as OCLC numbers or MARC metadata fields, changing the way the indexer processes punctuation, or other changes to improve response time or the user experience will require reindexing the entire corpus.