kylinsoong

Lucene/Solr Dev 4 Difference between Lucene and Solr


Difference between Lucene and Solr

Text search has been around for longer than most of us can remember. Whether it is client-installed software or a server-hosted web site, nearly every application has search. If you want the application you’re building to support searching, Apache Lucene and Solr may be a good choice. Solr is built on Lucene, and both offer strong search features such as near-real-time search, fast indexing, and high performance. There are still some differences between them: Solr provides distributed search and replication, which Lucene does not. This article describes these differences and covers the following topics.

First meeting with Lucene and Solr

Indexing and searching time of both

Solr’s distribution and replication functions

Using them in your application

Let’s begin with a first look at both of them.

FIRST MEETING

Lucene is a powerful Java search library that lets you easily add search to any application. It is written in Java and has no dependencies. It is a high-performance, scalable Information Retrieval (IR) library, and in recent years it has been the most popular free IR library: mature and sophisticated. It is licensed under the liberal Apache Software License.

       Solr is a standalone full-text enterprise search server, which can run on any Java Servlet Container. Solr is highly scalable, providing distributed search and index replication, and it powers the search and navigation features of many of the world's largest internet sites.

       It is obvious that Lucene and Solr are quite different. Lucene is only a Java library, that is, a JAR package. If you want your application to support search, you just add this JAR to your project and write code against the Lucene API to meet your search requirements. Apache Solr, by contrast, is recognized as a sophisticated application in the full-text search area. To use it, you deploy it to a Java Servlet Container, and it then runs as a server.

INDEXING AND SEARCHING TIME

       Before either Lucene or Solr can offer search, it must index the resource files that contain text-based information into its index folder. So in this section we first delve into some indexing details, mainly indexing time, and then explore real-time searching.

Before we start, a note on method: throughout this part I support my viewpoint with experimental statistics, so here is an explicit description of the experiment. I indexed a 1 GB resource file (XML-based) into the Lucene index folder or the Solr data folder. I did not index the 1 GB file directly; I divided it into 10 parts of 100 MB each and indexed them in 10 runs, recording the indexing time, index size, and so on each time. Based on this experiment, I draw the conclusions below.

1.      The size of the Lucene index is roughly 35% to 40% of the resource file size (about 37% in this experiment), as Table 1-1 shows.

File size (MB)

Resource data    Indexed
100              37
200              74
300              112
400              149
500              187
600              224
700              261
800              299
900              336
1000             374

Table 1-1

The numbers in Table 1-1 are file sizes. From top to bottom there are 10 rows, one for each of the 10 indexing runs. In each row the left value is the cumulative resource file size and the right value is the index size. It is not difficult to see that the right value is a little over one third of the left, about 37%.
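The ratio can be checked directly from the numbers in Table 1-1. The following is only a small sketch: the class and method names are made up for illustration, and the arrays simply copy the table.

```java
import java.util.Locale;

class IndexSizeRatio {
    // Cumulative resource size and resulting index size in MB, copied from Table 1-1.
    static final int[] RESOURCE_MB = {100, 200, 300, 400, 500, 600, 700, 800, 900, 1000};
    static final int[] INDEXED_MB  = {37, 74, 112, 149, 187, 224, 261, 299, 336, 374};

    /** Index size as a fraction of the resource size for one row. */
    static double ratio(int resourceMb, int indexedMb) {
        return (double) indexedMb / resourceMb;
    }

    /** Mean of the per-row ratios over the whole table. */
    static double meanRatio() {
        double sum = 0;
        for (int i = 0; i < RESOURCE_MB.length; i++) {
            sum += ratio(RESOURCE_MB[i], INDEXED_MB[i]);
        }
        return sum / RESOURCE_MB.length;
    }

    public static void main(String[] args) {
        for (int i = 0; i < RESOURCE_MB.length; i++) {
            System.out.printf(Locale.ROOT, "%4d MB -> %3d MB (%.1f%%)%n",
                    RESOURCE_MB[i], INDEXED_MB[i],
                    100 * ratio(RESOURCE_MB[i], INDEXED_MB[i]));
        }
        System.out.printf(Locale.ROOT, "mean: %.1f%%%n", 100 * meanRatio());
    }
}
```

Every row lands between 37% and 37.5%, so the ratio stays stable as the index grows.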

2.      Lucene indexing is the most important part compared with the other parts of Lucene and Solr: because Solr is built on Lucene, Solr’s indexing is actually Lucene’s indexing. So I spent more time on it. I indexed the 1 GB resource file 3 times, each time with a different optimize parameter and merge policy, to learn more detail. I divided the total Lucene index time into two parts: index-to-disk time and optimize time.

Figures 1-1 to 1-3 give some detail of Lucene indexing, as follows.

Figure 1-1 Indexing without any additional operation

Figure 1-2 Indexing with the optimize parameter maxNumSegments = 5

Figure 1-3 Indexing with mergePolicy set to LogByteSizeMergePolicy

Each of these three figures contains X-Y plots: the left plot shows the index-to-disk time and the right plot shows the optimize time. In each plot, the X axis is the indexed file size in MB and the Y axis is the time spent indexing or optimizing in seconds. From the figures we can see:

1.      Indexing a 100 MB resource file into the index directory takes nearly 20 seconds, and indexing the whole 1 GB resource file takes nearly 200 seconds.

2.      Setting IndexWriter’s merge policy to LogByteSizeMergePolicy takes almost the same time as indexing without any additional operation.

3.      Setting the optimize parameter is a really good way to reduce indexing time, especially when the resource file is very large.

Next we explore Solr’s indexing time.

 

Resource file (MB)    Post to Solr (seconds)    Others (seconds)    Total time (seconds)
100                   80                        4                   84
200                   84                        4                   88
300                   80                        4                   84
400                   88                        4                   92
500                   95                        4                   99
600                   88                        3                   91
700                   90                        4                   94
800                   83                        4                   87
900                   90                        3                   93
1000                  82                        3                   85

Table 1-2

Table 1-2 shows Solr’s indexing time. In the same way as before, I indexed the XML-based resource file in 100 MB parts, as the first column shows. From the table we can see that indexing 100 MB of resource data into Solr’s data folder takes between 84 and 99 seconds, about 90 seconds on average, so indexing the whole 1 GB takes nearly 900 seconds (the ten runs sum to 897 seconds). Compared with Lucene, Solr spends more time: indexing the same 100 MB file takes Solr roughly 4 times as long as Lucene.
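The comparison can be reproduced from the totals in Table 1-2 and the roughly 20 seconds per 100 MB measured for Lucene. This is just a sketch of the arithmetic; the class and method names are illustrative.

```java
class IndexingTimeComparison {
    // Total seconds for each 100 MB part posted to Solr, copied from Table 1-2.
    static final int[] SOLR_TOTAL_SECONDS = {84, 88, 84, 92, 99, 91, 94, 87, 93, 85};

    // Approximate seconds Lucene needed per 100 MB part, as reported in the text.
    static final double LUCENE_SECONDS_PER_PART = 20.0;

    /** Sum of all per-part times. */
    static int totalSeconds(int[] parts) {
        int sum = 0;
        for (int seconds : parts) {
            sum += seconds;
        }
        return sum;
    }

    /** Average time per 100 MB part. */
    static double meanSeconds(int[] parts) {
        return (double) totalSeconds(parts) / parts.length;
    }

    /** How many times slower Solr is than Lucene per 100 MB part. */
    static double slowdown() {
        return meanSeconds(SOLR_TOTAL_SECONDS) / LUCENE_SECONDS_PER_PART;
    }

    public static void main(String[] args) {
        System.out.println("Solr total seconds: " + totalSeconds(SOLR_TOTAL_SECONDS));
        System.out.println("Solr mean seconds per 100 MB: " + meanSeconds(SOLR_TOTAL_SECONDS));
        System.out.println("Slowdown vs Lucene: " + slowdown());
    }
}
```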

By now we know the indexing time of both Lucene and Solr, so next we examine their search time. The following statistics also come from my experiment. I searched a 400 MB index and found 756448 documents in only 200 milliseconds. In another run, I started 50 threads searching the same index; each also found 756448 documents, and each thread took nearly 500 milliseconds. Lucene searching is clearly very fast; it is effectively real-time search.
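A multi-threaded measurement like this one can be set up with a small harness. The sketch below is not the code used in the experiment: the `search` argument is a placeholder for whatever runs the real query (for example, a call into a Lucene searcher) and returns the hit count, and all names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Supplier;

class ConcurrentSearchTimer {
    /** Hit count and elapsed time of one search call. */
    static final class Timing {
        final int hits;
        final long millis;

        Timing(int hits, long millis) {
            this.hits = hits;
            this.millis = millis;
        }
    }

    /**
     * Submits the same search once per thread and records how long each call took.
     * The search supplier is a stand-in for the real query and returns a hit count.
     */
    static List<Timing> run(int threads, Supplier<Integer> search) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            Callable<Timing> task = () -> {
                long start = System.nanoTime();
                int hits = search.get();
                return new Timing(hits, (System.nanoTime() - start) / 1_000_000);
            };
            List<Future<Timing>> futures = new ArrayList<>();
            for (int i = 0; i < threads; i++) {
                futures.add(pool.submit(task));
            }
            List<Timing> timings = new ArrayList<>();
            for (Future<Timing> future : futures) {
                timings.add(future.get()); // wait for each search to finish
            }
            return timings;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException("search task failed", e);
        } finally {
            pool.shutdown();
        }
    }
}
```

Calling `ConcurrentSearchTimer.run(50, mySearch)` returns one `Timing` per thread, which can then be averaged or inspected individually.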

SOLR’S REPLICATION AND SHARDING INDEXES

In real applications, we usually need to run Solr on multiple servers. There are two ways to move to multiple servers: replication and sharding indexes. We will first discuss replication and then sharding.

 



 

Figure 1-4 Solr’s replication

Figure 1-4 shows the basic idea of Solr’s replication. There are three servers: one master Solr instance and two slave instances. Once replication is configured on both the master and the slaves, data added to the master is replicated to the slaves automatically. This is useful for strengthening Solr’s search capability.

We also ran an experiment to illustrate Solr’s replication. We ran three Tomcat instances as three Solr servers on ports 8080, 8888, and 8983. The server on 8080 is the master, and 8888 and 8983 are slaves. We posted some data to 8080 and then searched from 8888 or 8983, and the search returned the data we had posted to 8080.



 Figure 1-5 Solr’s sharding indexes

Figure 1-5 shows the details of Solr’s sharding. Solr can take a single query, break it up to run over multiple Solr shards, and then aggregate the results into a single result set. You should use sharding if your queries take too long to execute on a single server that isn't otherwise heavily taxed; sharding combines the power of multiple servers to perform a single query. You typically only need sharding when you have millions of records to search. There are a few key points to keep in mind when using shards to support distributed search. Each document must have a unique ID, because this is how Solr figures out how to merge the documents back together. If multiple shards return documents with the same ID, the first document is selected and the rest are discarded. This can happen if you have issues in cleanly distributing your documents over your shards.
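The "first document wins" rule for duplicate IDs can be sketched as follows. This is not Solr’s merge code, only a simplified illustration: it ignores scoring and sorting, and the `Doc` type is a hypothetical stand-in for a search hit.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class ShardMerge {
    /** Minimal stand-in for a search hit: its unique ID and the shard that returned it. */
    static final class Doc {
        final String id;
        final String shard;

        Doc(String id, String shard) {
            this.id = id;
            this.shard = shard;
        }
    }

    /**
     * Merges per-shard result lists into one list.
     * When several shards return the same ID, only the first occurrence is kept.
     */
    static List<Doc> merge(List<List<Doc>> shardResults) {
        Map<String, Doc> byId = new LinkedHashMap<>(); // preserves arrival order
        for (List<Doc> shard : shardResults) {
            for (Doc doc : shard) {
                byId.putIfAbsent(doc.id, doc); // later duplicates are discarded
            }
        }
        return new ArrayList<>(byId.values());
    }
}
```

If shard A returns IDs 1 and 2 and shard B returns IDs 2 and 3, the merged list holds three documents, and document 2 is the copy from shard A.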

       We can also run an experiment to illustrate Solr’s sharding. Again we run three Tomcat instances as three Solr servers on ports 8080, 8888, and 8983. We post data with unique IDs to 8888 and 8983 and then search from 8080. The result is that we can get any of the data we posted to 8888 and 8983.

ADDING LUCENE AND SOLR TO OUR APPLICATION
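A distributed query of this kind is sent to one server with the `shards` request parameter listing the shards to search. The sketch below only builds such a URL. The `/solr` context path and the `/select` handler are assumptions about the deployment, and the class name is made up.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.Arrays;
import java.util.List;

class ShardedQueryUrl {
    /**
     * Builds a query URL for the server that coordinates the search.
     * Each shard is given as host:port/path, without the http:// prefix.
     */
    static String build(String coordinator, String query, List<String> shards) {
        String encodedQuery;
        try {
            encodedQuery = URLEncoder.encode(query, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always available", e);
        }
        return "http://" + coordinator + "/select?q=" + encodedQuery
                + "&shards=" + String.join(",", shards);
    }

    public static void main(String[] args) {
        // Ports match the experiment: search from 8080 across the shards on 8888 and 8983.
        System.out.println(build("localhost:8080/solr", "lucene",
                Arrays.asList("localhost:8888/solr", "localhost:8983/solr")));
    }
}
```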

  

 

 

Adding Lucene to our application is the easier of the two: we just add the Lucene JAR file to our application and write the corresponding code against the Lucene API. Here is some advice I gathered through experiment.

       Firstly, use NumericField when indexing. NumericField provides a Field that enables indexing of numeric values for efficient range filtering and sorting. Like many other Lucene users, you’ll need to index dates and times. Such values are easily handled by first converting them to an equivalent int or long value and then indexing that value as a number.
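The date-to-number conversion itself needs nothing from Lucene. Here is a minimal sketch using `java.time`; the helper names are hypothetical, and the choice of UTC is an assumption. In Lucene 3.x the resulting value would then be handed to a NumericField, for example through `setLongValue`.

```java
import java.time.Instant;
import java.time.LocalDate;
import java.time.ZoneOffset;

class DateAsNumber {
    /** Converts an ISO date such as 2011-06-15 to milliseconds since the epoch (UTC midnight). */
    static long toEpochMillis(String isoDate) {
        return LocalDate.parse(isoDate)
                .atStartOfDay(ZoneOffset.UTC)
                .toInstant()
                .toEpochMilli();
    }

    /** Converts an ISO date to a day count since 1970-01-01, small enough for an int field. */
    static int toEpochDay(String isoDate) {
        return Math.toIntExact(LocalDate.parse(isoDate).toEpochDay());
    }

    /** Converts a stored long back to an ISO date string for display. */
    static String fromEpochMillis(long millis) {
        return Instant.ofEpochMilli(millis).atZone(ZoneOffset.UTC).toLocalDate().toString();
    }
}
```

Day resolution as an int is enough when the time of day is not searched, and it produces fewer distinct values than millisecond timestamps.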

       Secondly, use NumericRangeQuery when querying. NumericRangeQuery reduces query time: its performance is much better than that of other range queries, because the number of terms that must be searched is usually far smaller.
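Why far fewer terms are needed can be seen with a toy model. Numeric values are indexed at several precisions, so a range can be covered by a few coarse blocks in the middle plus fine-grained terms only at the edges. The sketch below is a simplified illustration of that idea, not Lucene’s actual algorithm: it counts the aligned blocks needed to cover a range of non-negative values, where each block size is a power of 2^precisionStep.

```java
class TrieRangeSketch {
    /**
     * Counts how many aligned blocks cover the inclusive range [lo, hi].
     * Block sizes are 1, 2^step, 2^(2*step), and so on; a block of a given size
     * must start at a multiple of that size. Intended for small non-negative values.
     */
    static int countBlocks(long lo, long hi, int precisionStep) {
        int blocks = 0;
        while (lo <= hi) {
            long size = 1;
            // Grow the block while it stays aligned at lo and does not run past hi.
            while (true) {
                long next = size << precisionStep;
                if (lo % next != 0 || lo + next - 1 > hi) {
                    break;
                }
                size = next;
            }
            lo += size;
            blocks++;
        }
        return blocks;
    }

    public static void main(String[] args) {
        // The range 0..999 holds 1000 distinct values; count the blocks for step 4.
        System.out.println(countBlocks(0, 999, 4));
    }
}
```

With a step of 4, the range 0 to 999 is covered by 25 blocks (three of size 256, fourteen of size 16, eight of size 1), compared with up to 1000 individual values that a plain term-by-term range scan might visit.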

       Lastly, be careful with RAMDirectory. If your index is small, less than 5 MB, you should choose RAMDirectory as your index directory, because querying RAM is faster than querying disk. If your index is very large, be careful, because the whole index has to fit in memory.

       For adding Solr to our application, I also have some advice.

       Firstly, SolrJ is a really good Java client for accessing Solr. It offers a Java interface to add, update, and query the Solr index. Based on this Java interface we can use Solr in our application more flexibly.

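       Secondly, to improve Solr’s performance, we should pay attention to Solr’s caches. Solr uses multiple Least Recently Used (LRU) in-memory caches. The caches are associated with individual index searchers, each of which represents a snapshot view of the data. When a cache is full, the least recently used entries are evicted first, and because a cache belongs to one searcher, it is replaced when a new searcher is opened, which keeps the cached data fresh.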
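To make the Least Recently Used idea concrete, here is a minimal LRU cache built on `LinkedHashMap`. It is only an illustration of the eviction policy, not Solr’s cache implementation, and the class name is made up.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class LruCache<K, V> extends LinkedHashMap<K, V> {
    private static final long serialVersionUID = 1L;

    private final int maxEntries;

    LruCache(int maxEntries) {
        // The third argument switches the map to access order, so reads refresh an entry.
        super(16, 0.75f, true);
        this.maxEntries = maxEntries;
    }

    /** Called after each insertion; returning true evicts the least recently used entry. */
    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > maxEntries;
    }
}
```

With a capacity of 2, putting "a" and "b", reading "a", and then putting "c" evicts "b", because "a" was used more recently.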

       Lastly, when moving Solr to multiple servers, we should pay attention to load balancing. For distributed search, we can use HAProxy, a simple and powerful HTTP proxy that keeps the load balanced between the slave servers.
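In essence, a load balancer in front of the slaves rotates incoming requests among them. The sketch below shows plain round-robin selection on the client side. It is only meant to illustrate the idea, not to replace a real proxy, which also handles health checks and failover; the class name and URLs are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

class RoundRobinSlaves {
    private final List<String> slaves;
    private final AtomicInteger next = new AtomicInteger();

    RoundRobinSlaves(List<String> slaves) {
        if (slaves.isEmpty()) {
            throw new IllegalArgumentException("at least one slave is required");
        }
        this.slaves = new ArrayList<>(slaves); // defensive copy
    }

    /** Returns the next slave URL in rotation; safe to call from several threads. */
    String pick() {
        // floorMod keeps the index valid even after the counter overflows.
        int index = Math.floorMod(next.getAndIncrement(), slaves.size());
        return slaves.get(index);
    }
}
```

With the two slaves from the replication experiment, successive calls to `pick()` alternate between the servers on 8888 and 8983.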

 
