Lucene/Solr Dev 4 Difference between Lucene and Solr
Text search has been around for longer than most of us can remember: almost every client application and every web site offers some form of search. So if you want the application you are building to support search, Apache Lucene and Solr are a good choice. Solr is built on Lucene, and both offer strong search features such as near-real-time search, fast indexing and high performance. There are still differences between them, however, notably Solr's distributed search and index replication, which Lucene itself does not provide. This article looks at those differences and covers the following topics.
First meeting with Lucene and Solr
Indexing and searching time for both
Solr's distributed search and replication
Adding them to your application
Let's begin with a first look at both of them.
FIRST MEETING
Lucene is a powerful Java search library that lets you easily add search to any application. It is written in Java and has no external dependencies. It is a high-performance, scalable information retrieval (IR) library that adds searching capability to your applications, and it is a mature, sophisticated and very popular free IR library, licensed under the liberal Apache Software License.
Solr is a standalone full-text enterprise search server that can run in any Java servlet container. Solr is highly scalable, provides distributed search and index replication, and powers the search and navigation features of many of the world's largest internet sites.
It is obvious that Lucene and Solr are quite different. Lucene is only a Java library, that is, a jar file: if you want your application to support search, you add the jar to your project and write code against the Lucene API to implement your search requirements. Apache Solr, on the other hand, is a complete, mature application for full-text search; to use it you deploy it to a Java servlet container and it is ready to go.
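To make the "library versus server" distinction concrete, here is a minimal sketch of embedding Lucene directly in a Java application. It is written against the Lucene 3.x API that was current when this article was written (package and class names such as QueryParser changed in later releases); the index path, field name and query text are illustrative only.

```java
import java.io.File;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class LuceneQuickStart {
    public static void main(String[] args) throws Exception {
        Directory dir = FSDirectory.open(new File("index"));
        StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_36);

        // Index one document.
        IndexWriter writer = new IndexWriter(dir,
                new IndexWriterConfig(Version.LUCENE_36, analyzer));
        Document doc = new Document();
        doc.add(new Field("title", "Difference between Lucene and Solr",
                Field.Store.YES, Field.Index.ANALYZED));
        writer.addDocument(doc);
        writer.close();

        // Search it back.
        IndexSearcher searcher = new IndexSearcher(IndexReader.open(dir));
        Query query = new QueryParser(Version.LUCENE_36, "title", analyzer).parse("lucene");
        for (ScoreDoc hit : searcher.search(query, 10).scoreDocs) {
            System.out.println(searcher.doc(hit.doc).get("title"));
        }
        searcher.close();
    }
}
```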
INDEXING AND SEARCHING TIME
Before either Lucene or Solr can offer search, it must index the resource files containing the text-based information into its index folder. In this section we first delve into some indexing detail, mainly indexing time, and then explore their real-time searching.
Before we start, a note on methodology: throughout this part I support my conclusions with statistics from an experiment, so here is an explicit description of it. I indexed 1 GB of XML-based resource files into a Lucene index folder (or a Solr data folder), but not all at once: I split the 1 GB into 10 parts of 100 MB each and indexed them in 10 runs, recording the indexing time, indexed file size and so on for each run. Based on this experiment I draw the conclusions below.
1. The Lucene index file size is roughly 35% of the resource file size; Table 1-1 illustrates this.
Resource data (MB) | Indexed file (MB)
100 | 37
200 | 74
300 | 112
400 | 149
500 | 187
600 | 224
700 | 261
800 | 299
900 | 336
1000 | 374
Table 1-1
Each row of Table 1-1 represents one of the ten indexing runs: the left column is the resource file size and the right column is the resulting index file size. It is easy to see that the right column is roughly 35% of the left.
2. Lucene indexing is the most important part of both Lucene and Solr: because Solr is built on Lucene, Solr's indexing is really Lucene's indexing, so I spent more time on it. I indexed the 1 GB of resource files three times, each time with a different optimize parameter or merge policy, to learn more detail. I split the total Lucene indexing time into two parts: the time to write the index to disk and the time to optimize it.
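As a rough sketch of how the second and third runs were configured (Lucene 3.x API; optimize(int) was later renamed forceMerge), the two settings look roughly like this; the index path and analyzer are placeholders:

```java
import java.io.File;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.LogByteSizeMergePolicy;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class IndexTuning {
    public static void main(String[] args) throws Exception {
        IndexWriterConfig cfg = new IndexWriterConfig(Version.LUCENE_36,
                new StandardAnalyzer(Version.LUCENE_36));
        // Third run (Figure 1-3): use LogByteSizeMergePolicy while indexing.
        cfg.setMergePolicy(new LogByteSizeMergePolicy());

        IndexWriter writer = new IndexWriter(FSDirectory.open(new File("index")), cfg);
        // ... writer.addDocument(...) calls for the resource files go here ...

        // Second run (Figure 1-2): optimize down to at most 5 segments.
        writer.optimize(5);
        writer.close();
    }
}
```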
Figures 1-1 to 1-3 show some of the detail of Lucene indexing, as follows.
Figure 1-1: indexing without any extra settings
Figure 1-2: indexing with the optimize parameter maxNumSegments = 5
Figure 1-3: indexing with the merge policy set to LogByteSizeMergePolicy
Each of these three figures contains X-Y plots: the left plot shows the time to write the index to disk and the right plot shows the optimize time. In every plot the X axis is the size of the indexed resource files in MB and the Y axis is the elapsed time in seconds. From the figures we can learn the following:
1. Indexing 100 MB of resource files into the index directory takes nearly 20 seconds, so indexing the full 1 GB takes nearly 200 seconds.
2. Setting the IndexWriter's merge policy to LogByteSizeMergePolicy takes almost the same time as indexing without any extra settings.
3. Setting the optimize parameter is a really good way to reduce indexing time, especially when the resource files are very large.
Next we explore Solr's indexing time.

Resource file (MB) | Post to Solr (seconds) | Others (seconds) | Total time (seconds)
100 | 80 | 4 | 84
200 | 84 | 4 | 88
300 | 80 | 4 | 84
400 | 88 | 4 | 92
500 | 95 | 4 | 99
600 | 88 | 3 | 91
700 | 90 | 4 | 94
800 | 83 | 4 | 87
900 | 90 | 3 | 93
1000 | 82 | 3 | 85
Table 1-2
Table 1-2 shows Solr's indexing time, measured the same way: each row corresponds to indexing one 100 MB XML-based resource file. From the table we can see that indexing 100 MB of resource data into Solr's data folder takes roughly 85 seconds, so indexing the full 1 GB takes roughly 850 seconds. Compared with Lucene, Solr spends more time: indexing the same 100 MB takes about four times as long.
Now that we know the indexing time of both Lucene and Solr, let's look at their search time; the following statistics also come from my experiment. Searching a 400 MB index and finding 756,448 documents took only about 200 milliseconds. Going further, I started 50 threads searching the same index concurrently; each thread also found the 756,448 documents, in about 500 milliseconds. Lucene searching is clearly very fast, effectively real-time.
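The 50-thread search test can be reproduced with a sketch along these lines: one IndexSearcher is shared by all threads because it is thread-safe. The index path, field name and query string are placeholders (Lucene 3.x API).

```java
import java.io.File;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class ConcurrentSearchTest {
    public static void main(String[] args) throws Exception {
        // Share one searcher across all threads; IndexSearcher is thread-safe.
        final IndexSearcher searcher =
                new IndexSearcher(IndexReader.open(FSDirectory.open(new File("index"))));
        final Query query = new QueryParser(Version.LUCENE_36, "content",
                new StandardAnalyzer(Version.LUCENE_36)).parse("lucene");

        ExecutorService pool = Executors.newFixedThreadPool(50);
        for (int i = 0; i < 50; i++) {
            pool.submit(new Runnable() {
                public void run() {
                    try {
                        long start = System.currentTimeMillis();
                        TopDocs hits = searcher.search(query, 10);
                        System.out.println(hits.totalHits + " hits in "
                                + (System.currentTimeMillis() - start) + " ms");
                    } catch (Exception e) {
                        e.printStackTrace();
                    }
                }
            });
        }
        pool.shutdown();
    }
}
```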
SOLR’S REPLICATION AND SHARDING INDEXES
In real applications we often need to run Solr on multiple servers. There are two ways to do this: replication and sharding the indexes. We discuss replication first and then sharding.
Figure 1-4 Solr’s replication
Figure 1-4 shows the basic idea of Solr's replication. There are three servers: one master Solr instance and two slaves. Once replication has been configured on both the master and the slaves, data added to the master is replicated to the slaves automatically. This is useful for scaling out Solr's search capacity.
We also ran an experiment to demonstrate this. We ran three Tomcat instances representing three Solr servers on ports 8080, 8888 and 8983, with 8080 as the master and 8888 and 8983 as the slaves. We posted some data to 8080 and then searched on 8888 or 8983, and both returned the data that had been posted to 8080; the configuration is sketched below.
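For reference, master-slave replication in this generation of Solr is configured through the ReplicationHandler in each instance's solrconfig.xml. The snippet below is a sketch matching the ports used in the experiment; options such as confFiles and pollInterval are illustrative and may vary between Solr versions.

```xml
<!-- Master (port 8080), solrconfig.xml -->
<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="master">
    <str name="replicateAfter">commit</str>
    <str name="confFiles">schema.xml,stopwords.txt</str>
  </lst>
</requestHandler>

<!-- Slaves (ports 8888 and 8983), solrconfig.xml -->
<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="slave">
    <str name="masterUrl">http://localhost:8080/solr/replication</str>
    <str name="pollInterval">00:00:60</str>
  </lst>
</requestHandler>
```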
Figure 1-5 Solr’s sharding indexes
Figure 1-5 shows the detail of Solr's sharded indexes. Solr can take a single query, break it up to run over multiple Solr shards, and then aggregate the results into a single result set. You should use sharding when your queries take too long to execute on a single server that is not otherwise heavily loaded, because it combines the power of multiple servers to perform a single query. You typically only need sharding when you have millions of records of data to be searched. There are a few key points to keep in mind when using shards for distributed search. Each document must have a unique ID; this is how Solr figures out how to merge the documents back together. If multiple shards return documents with the same ID, the first document is selected and the rest are discarded, which can happen if your documents are not cleanly distributed over the shards. We also ran an experiment to illustrate sharding: again, three Tomcat instances represented three Solr servers on ports 8080, 8888 and 8983. We posted data with unique IDs to 8888 and 8983, then searched from 8080 (see the sketch below), and the result set contained all the data that had been posted to 8888 and 8983.
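The sharded query from the experiment can be written with SolrJ roughly as follows: 8080 acts as the aggregator and the shards request parameter lists the two shard URLs. CommonsHttpSolrServer is the SolrJ client class of that era (later renamed HttpSolrServer); the query field is a placeholder.

```java
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class ShardedQuery {
    public static void main(String[] args) throws Exception {
        // Send the query to 8080; Solr fans it out to the two shards and merges the results.
        SolrServer solr = new CommonsHttpSolrServer("http://localhost:8080/solr");
        SolrQuery query = new SolrQuery("title:lucene");
        query.set("shards", "localhost:8888/solr,localhost:8983/solr");
        QueryResponse rsp = solr.query(query);
        System.out.println("Found " + rsp.getResults().getNumFound() + " documents");
    }
}
```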
ADDING LUCENE AND SOLR TO OUR APPLICATION
Adding Lucene to our application is easy: we just need to add the Lucene jar file to the application and write the corresponding code against the Lucene API. Here is some advice I gathered from the experiments.
Firstly, use NumericField when indexing numbers. NumericField provides a field that indexes numeric values for efficient range filtering and sorting. Like many other Lucene users, you will need to index dates and times; such values are easily handled by first converting them to an equivalent int or long value and then indexing that value as a number.
Secondly, use NumericRangeQuery when querying numeric fields. NumericRangeQuery reduces query time; its performance is much better than that of other range queries because the number of terms that must be examined is usually far fewer.
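A small sketch covering both pieces of advice, written against the Lucene 3.x NumericField and NumericRangeQuery APIs; the field name and time window are illustrative:

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.NumericField;
import org.apache.lucene.search.NumericRangeQuery;

public class NumericFieldExample {
    // Index a date as a long (milliseconds since epoch) for efficient range filtering.
    static Document newDoc(long publishedMillis) {
        Document doc = new Document();
        doc.add(new NumericField("published", Field.Store.YES, true)
                .setLongValue(publishedMillis));
        return doc;
    }

    // Match everything published within the last 24 hours.
    static NumericRangeQuery<Long> lastDay() {
        long now = System.currentTimeMillis();
        return NumericRangeQuery.newLongRange("published",
                now - 24L * 3600 * 1000, now, true, true);
    }
}
```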
Lastly, be careful when using RAMDirectory. If your index is quite small, less than 5 MB, RAMDirectory is a good choice as your index directory, because querying RAM is faster than querying disk; but if your index is very large, you should be careful about memory usage.
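For a small index, loading it into memory might look like the sketch below (Lucene 3.x API). RAMDirectory copies the whole on-disk index into the heap, which is why it is only advisable for small indexes.

```java
import java.io.File;
import java.io.IOException;

import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.store.RAMDirectory;

public class RamIndex {
    // Copy a small on-disk index entirely into memory before opening a searcher on it.
    static Directory loadIntoRam(File indexDir) throws IOException {
        return new RAMDirectory(FSDirectory.open(indexDir));
    }
}
```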
To add Solr to our application, I also offer the following advice.
Firstly, SolrJ is a really handy Java client for Solr, offering a Java interface to add, update and query the Solr index. With this interface we can use Solr from our application much more flexibly.
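A minimal SolrJ sketch for adding and querying documents, assuming a Solr instance at http://localhost:8080/solr and the CommonsHttpSolrServer client of that SolrJ generation; the field names are placeholders and must exist in your schema:

```java
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrInputDocument;

public class SolrjExample {
    public static void main(String[] args) throws Exception {
        SolrServer solr = new CommonsHttpSolrServer("http://localhost:8080/solr");

        // Add (or update) a document and make it visible to searches.
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "doc-1");
        doc.addField("title", "Difference between Lucene and Solr");
        solr.add(doc);
        solr.commit();

        // Query it back.
        QueryResponse rsp = solr.query(new SolrQuery("title:solr"));
        System.out.println(rsp.getResults().getNumFound() + " documents found");
    }
}
```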
Secondly, to enhance Solr's performance, we should pay attention to Solr's caches. Solr uses multiple least-recently-used (LRU) in-memory caches. The caches are associated with individual index searchers, which represent a snapshot view of the data; when a new searcher is opened the caches are rebuilt (and can be autowarmed), which keeps them fresh.
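These caches are configured in the <query> section of solrconfig.xml; the sizes below are illustrative rather than tuned recommendations:

```xml
<!-- inside <query> in solrconfig.xml -->
<filterCache      class="solr.LRUCache" size="512" initialSize="512" autowarmCount="128"/>
<queryResultCache class="solr.LRUCache" size="512" initialSize="512" autowarmCount="128"/>
<documentCache    class="solr.LRUCache" size="512" initialSize="512" autowarmCount="0"/>
```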
Lastly, when moving Solr to multiple servers, we should take care of load balancing. For distributed search we can use HAProxy, a simple and powerful HTTP proxy that can balance the load across the slave servers.