Lucene/Solr Dev 4 Difference between Lucene and Solr

kylinsoong

Blog category:

Search Engine

Solr lucene performance Tomcat Apache

Difference between Lucene and Solr

Text search has been around for longer than most of us can remember. Whether it is client-installed software or a server-run web site, nearly every application offers search. If you want the application you are building to support search, Apache Lucene and Solr may be a good choice. Solr is built on Lucene, and both offer strong search features such as near-real-time search, fast indexing, and high performance. There are still differences between them: Solr provides distributed search and index replication, which Lucene by itself does not. This article describes those differences and covers the following topics.

A first meeting with Lucene and Solr

Indexing and searching time for both

Solr's distribution and replication features

Using them in your application

Let's begin with a first meeting with both of them.

FIRST MEETING

Lucene is a powerful Java search library that lets you easily add search to any application. It is written in Java and has no dependencies. It is a high-performance, scalable Information Retrieval (IR) library: mature, sophisticated, and the most popular free IR library in recent years. It is licensed under the liberal Apache Software License.

Solr is a standalone full-text enterprise search server, which can run on any Java Servlet Container. Solr is highly scalable, providing distributed search and index replication, and it powers the search and navigation features of many of the world's largest internet sites.

Lucene and Solr are clearly very different things. Lucene is only a Java library, that is, a jar package. To give your application search support, you add this jar to your project and write code against the Lucene API to meet your search requirements. Solr, by contrast, is a complete application for full-text search: to use it, you deploy it to a Java Servlet Container, where it runs as a server.

INDEXING AND SEARCHING TIME

Before either Lucene or Solr can serve searches, it must index the resource files, which contain text-based information, into its index folder. In this section we first look at some indexing details, mainly indexing time, and then explore real-time searching.

Before we start, a note on method: throughout this part I back my conclusions with statistics from an experiment, so here is an explicit description of it. I indexed 1 GB of resource files (XML-based) into the Lucene index folder or the Solr data folder. I did not index the 1 GB in one go; I divided it into 10 parts of 100 MB each and indexed them in 10 passes, recording the indexing time, the index size, and so on for each pass. Based on this experiment, I draw the conclusions below.

1. The size of the Lucene index is a little over a third of the resource file size (about 37%), as Table 1-1 shows.

File size (MB)
Resource data	Index
100	37
200	74
300	112
400	149
500	187
600	224
700	261
800	299
900	336
1000	374

Table 1-1

The 10 rows correspond to the 10 indexing passes. In each row, the left column is the cumulative resource file size and the right column is the resulting index size. It is easy to see that the right column is consistently about 37% of the left.
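The ratio can be checked directly from the numbers in Table 1-1. The following is a minimal Java sketch; the class and method names are made up for illustration and are not part of Lucene. Every row comes out at a little over 37%.

```java
// Illustrative only: IndexRatio and its members are hypothetical names.
final class IndexRatio {
    // Cumulative resource size and index size in MB, copied from Table 1-1.
    static final int[] RESOURCE_MB = {100, 200, 300, 400, 500, 600, 700, 800, 900, 1000};
    static final int[] INDEX_MB    = {37, 74, 112, 149, 187, 224, 261, 299, 336, 374};

    // Index size expressed as a percentage of the resource size.
    static double ratioPercent(int resourceMb, int indexMb) {
        return 100.0 * indexMb / resourceMb;
    }

    public static void main(String[] args) {
        for (int i = 0; i < RESOURCE_MB.length; i++) {
            System.out.printf("%4d MB -> %3d MB (%.1f%%)%n",
                    RESOURCE_MB[i], INDEX_MB[i],
                    ratioPercent(RESOURCE_MB[i], INDEX_MB[i]));
        }
    }
}
```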

2. Lucene indexing is the most important part of both Lucene and Solr: because Solr is built on Lucene, Solr's indexing is actually Lucene's indexing. I therefore spent more time on it. I indexed the 1 GB resource file three times, each time with a different optimize parameter or merge policy, to learn more detail. I divided the total Lucene indexing time into two parts: index-to-disk time and optimize time.

Figures 1-1 to 1-3 give some detail of Lucene indexing.

Figure 1-1 Indexing without any extra settings

Figure 1-2 Indexing with the optimize parameter maxNumSegments = 5

Figure 1-3 Indexing with mergePolicy set to LogByteSizeMergePolicy

Each of these three figures contains three X-Y plots; the left one shows index-to-disk time and the right one shows optimize time. In each plot, the X axis is the indexed file size in MB and the Y axis is the time spent indexing or optimizing, in seconds. From the figures we can see:

1. Indexing a 100 MB resource file into the index directory takes nearly 20 seconds, and indexing the full 1 GB takes nearly 200 seconds.

2. Setting IndexWriter's merge policy to LogByteSizeMergePolicy takes almost the same time as indexing without any extra settings.

3. Setting the optimize parameter is a really good way to reduce indexing time, especially when the resource file is very large.

Next we look at Solr's indexing time.

Resource File(MB)	Post to Solr(seconds)	Others(seconds)	Total time(seconds)
100	80	4	84
200	84	4	88
300	80	4	84
400	88	4	92
500	95	4	99
600	88	3	91
700	90	4	94
800	83	4	87
900	90	3	93
1000	82	3	85

Table 1-2

Table 1-2 shows Solr's indexing time. As before, I indexed the XML-based resource files in passes of 100 MB; the first column gives the cumulative size and the other columns give the time for that pass. From the table we can see that indexing 100 MB of resource data into Solr's data folder takes between 84 and 99 seconds, and indexing the full 1 GB takes nearly 900 seconds. Solr is slower than Lucene here: indexing the same 100 MB takes Solr more than 4 times as long.
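To make the comparison concrete, the totals in Table 1-2 can be summed and set against the Lucene figure of about 20 seconds per 100 MB quoted earlier. This is only a sketch of the arithmetic; the class name and constants are mine, and the Lucene number is the rounded value from the text, not a fresh measurement.

```java
// Illustrative only: IndexingTimes and its members are hypothetical names.
final class IndexingTimes {
    // Total time in seconds for each 100 MB pass, copied from Table 1-2.
    static final int[] SOLR_TOTAL_SECONDS = {84, 88, 84, 92, 99, 91, 94, 87, 93, 85};
    // Approximate Lucene time per 100 MB pass, as stated in the text.
    static final int LUCENE_SECONDS_PER_PASS = 20;

    static int sum(int[] values) {
        int total = 0;
        for (int v : values) {
            total += v;
        }
        return total;
    }

    // How many times longer an average Solr pass takes than a Lucene pass.
    static double slowdown() {
        double solrAverage = (double) sum(SOLR_TOTAL_SECONDS) / SOLR_TOTAL_SECONDS.length;
        return solrAverage / LUCENE_SECONDS_PER_PASS;
    }

    public static void main(String[] args) {
        System.out.println("Solr total for 1 GB (s): " + sum(SOLR_TOTAL_SECONDS));
        System.out.printf("Solr / Lucene per 100 MB: %.2f%n", slowdown());
    }
}
```

With these inputs the Solr total is 897 seconds and the ratio is a little under 4.5.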

Now that we know the indexing time of both Lucene and Solr, we turn to search time. The following statistics also come from my experiment. I searched a 400 MB index and found 756448 documents in only 200 milliseconds. I then started 50 threads searching the same index; each thread also found 756448 documents and took nearly 500 milliseconds. Lucene search is clearly very fast; it really is real-time search.

SOLR’S REPLICATION AND SHARDING INDEXES

In real deployments we usually need to run Solr on multiple servers. There are two ways to scale out to multiple servers: replication and index sharding. We discuss replication first and then sharding.

Figure 1-4 Solr’s replication

Figure 1-4 shows the basic idea of Solr's replication. There are three servers: one is the master Solr instance and the other two are slave instances. Once replication has been configured on both the master and the slaves, data added to the master is replicated to the slaves automatically. This is useful for increasing Solr's search capacity.

We also ran an experiment to demonstrate Solr's replication. We ran three Tomcat instances as three Solr servers on ports 8080, 8888, and 8983, with 8080 as the master and 8888 and 8983 as the slaves. After posting some data to 8080, we could search from 8888 or 8983 and get the data that had been posted to 8080.

Figure 1-5 Solr’s sharding indexes

Figure 1-5 shows the details of Solr's index sharding. Solr can take a single query, break it up to run over multiple Solr shards, and then aggregate the results into a single result set. You should use sharding if your queries take too long to execute on a single server that isn't otherwise heavily taxed; sharding combines the power of multiple servers to perform a single query. You typically only need sharding when you have millions of records to search. There is a key point to keep in mind when using shards to support distributed search: each document must have a unique ID. This is how Solr figures out how to merge the documents back together. If multiple shards return documents with the same ID, the first document is selected and the rest are discarded. This can happen if your documents are not cleanly distributed over your shards.
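The "first document wins" rule can be illustrated with a small stand-alone Java sketch. This is not Solr's actual merge code, which also has to deal with scores and sort order; the class, the Doc type, and the shard labels are hypothetical and only show how duplicates on the unique ID are discarded.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: a simplified model of merging shard results by unique ID.
final class ShardMerge {
    // A document reduced to its unique ID and the shard that returned it.
    static final class Doc {
        final String id;
        final String shard;

        Doc(String id, String shard) {
            this.id = id;
            this.shard = shard;
        }
    }

    // Merge per-shard result lists; the first document seen for an ID is kept
    // and later documents with the same ID are dropped.
    static List<Doc> merge(List<List<Doc>> shardResults) {
        Map<String, Doc> byId = new LinkedHashMap<>();
        for (List<Doc> results : shardResults) {
            for (Doc doc : results) {
                byId.putIfAbsent(doc.id, doc);
            }
        }
        return new ArrayList<>(byId.values());
    }
}
```

If the shards on 8888 and 8983 both return a document with ID "b", the merged list contains "b" only once, taken from whichever shard was processed first.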

We can also run an experiment to illustrate Solr's index sharding. We again run three Tomcat instances as three Solr servers on ports 8080, 8888, and 8983, post data with unique IDs to 8888 and 8983, and then search from 8080. The result contains any data that we posted to 8888 and 8983.
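In Solr, a distributed query like this is issued by adding a shards parameter that lists the shard addresses without the http:// prefix. The sketch below only builds such a URL for the three-server setup; the /solr context path, the /select handler, and the match-all query are assumptions about a default installation, and the class name is made up.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Illustrative only: builds the URL for a distributed (sharded) Solr query.
final class ShardedQueryUrl {
    // coordinator and shards are "host:port/context" strings without a scheme.
    static String build(String coordinator, String[] shards, String query) {
        try {
            return "http://" + coordinator + "/select"
                    + "?shards=" + String.join(",", shards)
                    + "&q=" + URLEncoder.encode(query, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is required to be available on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String[] shards = {"localhost:8888/solr", "localhost:8983/solr"};
        System.out.println(build("localhost:8080/solr", shards, "*:*"));
    }
}
```

The server on 8080 receives the request, queries both shards, and returns the merged result.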

ADDING LUCENE AND SOLR TO OUR APPLICATION

Adding Lucene to our application is the easier of the two: we just add the Lucene jar file to the application and write code against the Lucene API. Here is some advice I arrived at through experiment.

Firstly, use NumericField when indexing. NumericField provides a Field that enables indexing of numeric values for efficient range filtering and sorting. Like many other Lucene users, you will probably need to index dates and times. Such values are easily handled by first converting them to an equivalent int or long value and then indexing that value as a number.
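The conversion step can be sketched in plain Java. The example below uses the java.time API, which is newer than the Lucene versions discussed here, and the class and method names are my own. The resulting long is what you would hand to NumericField (for example through its setLongValue method in Lucene 3.x); that call is left out so the sketch runs without Lucene on the classpath.

```java
import java.time.Instant;
import java.time.LocalDate;
import java.time.ZoneOffset;

// Illustrative only: converts dates to long values suitable for numeric indexing.
final class DateAsNumber {
    // Milliseconds since the epoch for midnight UTC on the given date.
    static long toEpochMillis(LocalDate date) {
        return date.atStartOfDay(ZoneOffset.UTC).toInstant().toEpochMilli();
    }

    // Inverse conversion, for turning a stored value back into a date.
    static LocalDate fromEpochMillis(long millis) {
        return Instant.ofEpochMilli(millis).atZone(ZoneOffset.UTC).toLocalDate();
    }

    public static void main(String[] args) {
        long millis = toEpochMillis(LocalDate.of(2010, 5, 17));
        System.out.println(millis + " <-> " + fromEpochMillis(millis));
    }
}
```

Because later dates map to larger numbers, a date range becomes a simple numeric range.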

Secondly, use NumericRangeQuery when querying. NumericRangeQuery reduces query time: it performs much better than other range queries because the number of terms that must be searched is usually far smaller.

Lastly, be careful with RAMDirectory. If your index is small, less than 5 MB, RAMDirectory is a good choice for the index directory, because querying RAM is faster than querying disk. If your index is very large, be careful, since the whole index is held in memory.

For adding Solr to our application, I also have some advice.

Firstly, SolrJ is a really cool Java client for Solr. It offers a Java interface to add, update, and query the Solr index, and with it we can use Solr from our application more flexibly.

Secondly, to improve Solr's performance, pay attention to Solr's caches. Solr uses multiple Least Recently Used (LRU) in-memory caches. The caches are associated with individual index searchers, each of which represents a snapshot view of the data. When a cache is full, the least recently used entries are evicted, and because each cache belongs to one searcher, cached results stay consistent with that searcher's view of the index.
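To show what least-recently-used eviction means, here is a minimal LRU cache built on java.util.LinkedHashMap. It is a generic illustration of the policy, not Solr's cache implementation, and the class name is made up.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative only: a tiny LRU cache, not Solr's own cache class.
final class LruCache<K, V> extends LinkedHashMap<K, V> {
    private final int maxEntries;

    LruCache(int maxEntries) {
        // The third argument (true) keeps entries in access order,
        // so the eldest entry is the least recently used one.
        super(16, 0.75f, true);
        this.maxEntries = maxEntries;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        // Evict the least recently used entry once the cache grows past its limit.
        return size() > maxEntries;
    }
}
```

With a capacity of 2, putting "a" and "b", reading "a", and then putting "c" evicts "b", because "a" was used more recently.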

Lastly, when running Solr on multiple servers, pay attention to load balancing. For distributed search we can use HAProxy, a simple and powerful HTTP proxy that balances the load across the slave servers.