Reposted from: http://onjava.com/lpt/a/3273
Lucene is a free text-indexing and -searching API written in Java. To appreciate the indexing techniques described later in this article, you need a basic understanding of Lucene's index structure. As I mentioned in the previous article in this series, a typical Lucene index is stored in a single directory in the filesystem on a hard disk.
The core elements of such an index are segments, documents, fields, and terms. Every index consists of one or more segments. Each segment contains one or more documents. Each document has one or more fields, and each field contains one or more terms. Each term is a pair of Strings representing a field name and a value. A segment consists of a series of files. The exact number of files that constitute each segment varies from index to index, and depends on the number of fields that the index contains. All files belonging to the same segment share a common prefix and differ in the suffix. You can think of a segment as a sub-index, although each segment is not a fully independent index.
-rw-rw-r-- 1 otis otis 4 Nov 22 22:43 deletable
-rw-rw-r-- 1 otis otis 1000000 Nov 22 22:43 _lfyc.f1
-rw-rw-r-- 1 otis otis 1000000 Nov 22 22:43 _lfyc.f2
-rw-rw-r-- 1 otis otis 31030502 Nov 22 22:28 _lfyc.fdt
-rw-rw-r-- 1 otis otis 8000000 Nov 22 22:28 _lfyc.fdx
-rw-rw-r-- 1 otis otis 16 Nov 22 22:28 _lfyc.fnm
-rw-rw-r-- 1 otis otis 1253701335 Nov 22 22:43 _lfyc.frq
-rw-rw-r-- 1 otis otis 1871279328 Nov 22 22:43 _lfyc.prx
-rw-rw-r-- 1 otis otis 14122 Nov 22 22:43 _lfyc.tii
-rw-rw-r-- 1 otis otis 1082950 Nov 22 22:43 _lfyc.tis
-rw-rw-r-- 1 otis otis 18 Nov 22 22:43 segments
Example 1: An index consisting of a single segment.
Note that all files that belong to this segment start with a common prefix: _lfyc. Because this index contains two fields, you will notice two files with the fN suffix, where N is a number. If this index had three fields, a file named _lfyc.f3 would also be present in the index directory.
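As a small, illustrative sketch (not part of the original article) of how documents and fields map onto this structure, the class below uses the same Lucene 1.x API as the later examples to add one document with two fields; an index built this way would contain two .fN files, as in Example 1. The class name and the field names "title" and "body" are made up for illustration.

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.StopAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

public class TwoFieldIndexSketch {
    public static void main(String[] args) throws Exception {
        // create an index called 'index' in a temporary directory
        // (the directory must already exist, as noted in the later examples)
        String indexDir = System.getProperty("java.io.tmpdir", "tmp")
            + System.getProperty("file.separator") + "index";

        Analyzer analyzer = new StopAnalyzer();
        // 'true' means create a new index, overwriting any existing one
        IndexWriter writer = new IndexWriter(indexDir, analyzer, true);

        Document doc = new Document();
        // two fields per document => two .fN files per segment
        doc.add(Field.Text("title", "Advanced Text Indexing with Lucene"));
        doc.add(Field.Text("body", "Bibamus, moriendum est"));
        writer.addDocument(doc);
        writer.close();
    }
}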
The number of segments in an index is fixed once the index is fully built, but it varies while indexing is in progress. Lucene adds segments as new documents are added to the index, and merges segments every so often. In the next section we will learn how to control creation and merging of segments in order to improve indexing speed.
For more information about the files that make up a Lucene index, please see the File Formats document on Lucene's web site. You can find the URL in the Reference section at the end of this article.
Indexing Speed Factors
The previous article demonstrated how to index text using the LuceneIndexExample class. Because the example was so basic, there was no need to think about speed. If you are using Lucene in a non-trivial application, you will want to ensure optimal indexing performance. The bottleneck of a typical text-indexing application is the process of writing index files onto a disk. Therefore, we need to instruct Lucene to be smart about adding and merging segments while indexing documents.
When new documents are added to a Lucene index, they are initially stored in memory instead of being immediately written to the disk. This is done for performance reasons. The simplest way to improve Lucene's indexing performance is to adjust the value of IndexWriter's mergeFactor instance variable. This value tells Lucene how many documents to store in memory before writing them to the disk, as well as how often to merge multiple segments together. With the default value of 10, Lucene will store 10 documents in memory before writing them to a single segment on the disk. The mergeFactor value of 10 also means that once 10 segments of equal size have accumulated on the disk, Lucene will merge them into a single segment. (There is a small exception to this rule, which I shall explain shortly.)
For instance, if we set mergeFactor to 10, a new segment will be created on the disk for every 10 documents added to the index. When the 10th segment of size 10 is added, all 10 will be merged into a single segment of size 100. When 10 such segments of size 100 have been added, they will be merged into a single segment containing 1,000 documents, and so on. Therefore, at any time, there will be no more than 9 segments at each power-of-10 index size.
The exception noted earlier has to do with another IndexWriter instance variable: maxMergeDocs. While merging segments, Lucene will ensure that no segment with more than maxMergeDocs documents is created. For instance, if we set maxMergeDocs to 1,000, then when we add the 10,000th document, instead of merging multiple segments into a single segment of size 10,000, Lucene will create a 10th segment of size 1,000, and will keep adding segments of size 1,000 for every 1,000 documents added. The default value of maxMergeDocs is Integer.MAX_VALUE. In my experience, one rarely needs to change this value.
Now that I have explained how mergeFactor and maxMergeDocs work, you can see that using a higher value for mergeFactor will cause Lucene to use more RAM, but will let Lucene write data to disk less frequently, which will speed up the indexing process. A smaller mergeFactor will use less memory and will cause the index to be updated more frequently, which will make it more up-to-date, but will also slow down the indexing process. Similarly, a larger maxMergeDocs is better suited for batch indexing, and a smaller maxMergeDocs is better for more interactive indexing.
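For illustration only, here is a minimal sketch (not from the article) of what a batch-oriented configuration could look like with the same Lucene 1.x API; the values 100 and Integer.MAX_VALUE are examples, not recommendations:

import org.apache.lucene.analysis.StopAnalyzer;
import org.apache.lucene.index.IndexWriter;

public class BatchIndexingConfigSketch {
    public static void main(String[] args) throws Exception {
        String indexDir = System.getProperty("java.io.tmpdir", "tmp")
            + System.getProperty("file.separator") + "index";
        IndexWriter writer = new IndexWriter(indexDir, new StopAnalyzer(), true);

        // Batch indexing: buffer more documents in RAM and merge less often.
        // Trade-off: higher memory use and more open files.
        writer.mergeFactor = 100;
        writer.maxMergeDocs = Integer.MAX_VALUE;

        // ... add documents here ...

        writer.close();
    }
}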
To get a better feel for how different values of mergeFactor and maxMergeDocs affect indexing speed, take a look at the IndexTuningDemo class below. This class takes three arguments on the command line: the total number of documents to add to the index, the value to use for mergeFactor, and the value to use for maxMergeDocs. All three arguments must be specified, must be integers, and must be given in this order. In order to keep the code short and clean, there are no checks for improper usage.
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.StopAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

/**
 * Creates an index called 'index' in a temporary directory.
 * The number of documents to add to this index, the mergeFactor and
 * the maxMergeDocs must be specified on the command line
 * in that order - this class expects to be called correctly.
 *
 * Note: before running this for the first time, manually create the
 * directory called 'index' in your temporary directory.
 */
public class IndexTuningDemo {
    public static void main(String[] args) throws Exception {
        int docsInIndex = Integer.parseInt(args[0]);

        // create an index called 'index' in a temporary directory
        String indexDir = System.getProperty("java.io.tmpdir", "tmp")
            + System.getProperty("file.separator") + "index";

        Analyzer analyzer = new StopAnalyzer();
        IndexWriter writer = new IndexWriter(indexDir, analyzer, true);

        // set variables that affect speed of indexing
        writer.mergeFactor = Integer.parseInt(args[1]);
        writer.maxMergeDocs = Integer.parseInt(args[2]);

        long startTime = System.currentTimeMillis();

        for (int i = 0; i < docsInIndex; i++) {
            Document doc = new Document();
            doc.add(Field.Text("fieldname", "Bibamus, moriendum est"));
            writer.addDocument(doc);
        }

        writer.close();

        long stopTime = System.currentTimeMillis();
        System.out.println("Total time: " + (stopTime - startTime) + " ms");
    }
}
Here are some results:
prompt> time java IndexTuningDemo 100000 10 1000000
Total time: 410092 ms
real 6m51.801s
user 5m30.000s
sys 0m45.280s
prompt> time java IndexTuningDemo 100000 1000 100000
Total time: 249791 ms
real 4m11.470s
user 3m46.330s
sys 0m3.660s
As you can see, both invocations created an index with 100,000 documents, but the first one took much longer to complete. That is because it used the default mergeFactor of 10, which caused Lucene to write documents to the disk more often than the mergeFactor of 1000 used in the second invocation.
Note that while these two variables can help improve indexing performance, they also affect the number of file descriptors that Lucene uses, and can therefore cause the "Too many open files" exception. If you get this error, you should first see if you can optimize the index, as will be described shortly. Optimization may help indexes that contain more than one segment. If optimizing the index does not solve the problem, you could try increasing the maximum number of open files allowed on your computer. This is usually done at the operating-system level and varies from OS to OS. If you are using Lucene on a computer that uses a flavor of the UNIX OS, you can see the maximum number of open files allowed from the command line.
Under bash, you can see the current settings with the built-in ulimit command:
prompt> ulimit -n
Under tcsh, the equivalent is:
prompt> limit descriptors
To change the value under bash, use this:
prompt> ulimit -n <max number of open files here>
Under tcsh, use the following:
prompt> limit descriptors <max number of open files here>
To estimate a setting for the maximum number of open files allowed while indexing, keep in mind that the maximum number of files Lucene will open is (1 + mergeFactor) * FilesPerSegment.
For instance, with the default mergeFactor of 10 and an index of 1 million documents, Lucene will require 110 open files on an unoptimized index. When IndexWriter's optimize() method is called, all segments are merged into a single segment, which minimizes the number of open files that Lucene needs.
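As a rough, illustrative application of this formula (not from the article), the snippet below assumes about 10 files per segment, roughly as in Example 1, and prints the resulting open-file estimates for a few mergeFactor values; the real per-segment file count depends on the number of fields in your index.

public class OpenFileEstimateSketch {
    public static void main(String[] args) {
        // assumption: about 10 files per segment; the real number
        // depends on how many fields the index contains
        int filesPerSegment = 10;
        int[] mergeFactors = { 10, 100, 1000 };
        for (int i = 0; i < mergeFactors.length; i++) {
            int maxOpenFiles = (1 + mergeFactors[i]) * filesPerSegment;
            System.out.println("mergeFactor=" + mergeFactors[i]
                + " -> up to " + maxOpenFiles + " open files");
        }
    }
}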
In-Memory Indexing
In the previous section, I mentioned that new documents added to an index are stored in memory before being written to the disk. You also saw how to control the rate at which this is done via IndexWriter's instance variables. The Lucene distribution contains the RAMDirectory class, which gives even more control over this process. This class implements the Directory interface, just like FSDirectory does, but stores indexed documents in memory, while FSDirectory stores them on disk.
Because RAMDirectory does not write anything to the disk, it is faster than FSDirectory. However, since computers usually come with less RAM than hard disk space, RAMDirectory is not suitable for very large indices.
The MemoryVsDisk class demonstrates how to use RAMDirectory as an in-memory buffer in order to improve indexing speed.
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.StopAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;
import java.io.IOException;

/**
 * Creates an index called 'index' in a temporary directory.
 * The number of documents to add to this index, the mergeFactor and
 * the maxMergeDocs must be specified on the command line
 * in that order - this class expects to be called correctly.
 * Additionally, if the fourth command line argument is '-r' this
 * class will first index all documents in RAMDirectory before
 * flushing them to the disk in the end. To make this class use the
 * regular FSDirectory use '-f' as the fourth command line argument.
 *
 * Note: before running this for the first time, manually create the
 * directory called 'index' in your temporary directory.
 */
public class MemoryVsDisk {
    public static void main(String[] args) throws Exception {
        int docsInIndex = Integer.parseInt(args[0]);

        // create an index called 'index' in a temporary directory
        String indexDir = System.getProperty("java.io.tmpdir", "tmp")
            + System.getProperty("file.separator") + "index";

        Analyzer analyzer = new StopAnalyzer();
        long startTime = System.currentTimeMillis();

        if ("-r".equalsIgnoreCase(args[3])) {
            // if -r argument was specified, use RAMDirectory
            RAMDirectory ramDir = new RAMDirectory();
            IndexWriter ramWriter = new IndexWriter(ramDir, analyzer, true);
            addDocs(ramWriter, docsInIndex);
            // close the in-memory writer first so that all buffered
            // documents are flushed into ramDir before it is merged
            ramWriter.close();
            IndexWriter fsWriter = new IndexWriter(indexDir, analyzer, true);
            fsWriter.addIndexes(new Directory[] { ramDir });
            fsWriter.close();
        } else {
            // create an index using FSDirectory
            IndexWriter fsWriter = new IndexWriter(indexDir, analyzer, true);
            fsWriter.mergeFactor = Integer.parseInt(args[1]);
            fsWriter.maxMergeDocs = Integer.parseInt(args[2]);
            addDocs(fsWriter, docsInIndex);
            fsWriter.close();
        }

        long stopTime = System.currentTimeMillis();
        System.out.println("Total time: " + (stopTime - startTime) + " ms");
    }

    private static void addDocs(IndexWriter writer, int docsInIndex)
            throws IOException {
        for (int i = 0; i < docsInIndex; i++) {
            Document doc = new Document();
            doc.add(Field.Text("fieldname", "Bibamus, moriendum est"));
            writer.addDocument(doc);
        }
    }
}
To create an index with 10,000 documents using only FSDirectory, use this:
prompt> time java MemoryVsDisk 10000 10 100000 -f
Total time: 41380 ms
real 0m42.739s
user 0m36.750s
sys 0m4.180s
To create an index of the same size, but faster, with RAMDirectory, call MemoryVsDisk as follows:
prompt> time java MemoryVsDisk 10000 10 100000 -r
Total time: 27325 ms
real 0m28.695s
user 0m27.920s
sys 0m0.610s
However, note that you can achieve the same, or even better, performance by choosing a more suitable value for mergeFactor:
prompt> time java MemoryVsDisk 10000 1000 100000 -f
Total time: 24724 ms
real 0m26.108s
user 0m25.280s
sys 0m0.620s
Be careful, however, when tuning mergeFactor. A value that requires more memory than your JVM has available may cause a java.lang.OutOfMemoryError.
Finally, do not forget that you can greatly influence the performance of any Java application by giving the JVM more memory to work with:
prompt> time java -Xmx300m -Xms200m MemoryVsDisk 10000 10 100000 -r
Total time: 15166 ms
real 0m17.311s
user 0m15.400s
sys 0m1.590s
Merging Indices
If you want to improve indexing performance with Lucene, and manipulating IndexWriter's mergeFactor and maxMergeDocs proves insufficient, you can use RAMDirectory to create in-memory indices. You could create a multi-threaded indexing application that uses multiple RAMDirectory-based indices in parallel, one in each thread, and merges them into a single index on the disk using IndexWriter's addIndexes(Directory[]) method, as sketched below. Taking this idea further, a sophisticated indexing application could even create in-memory indices on multiple computers in parallel. To make full use of this approach, one needs to ensure that the thread that performs the actual indexing on the disk is never idle, as that translates to wasted time.
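Here is a minimal sketch of that idea (not code from the original article): each worker thread builds its own RAMDirectory-based index with the Lucene 1.x API used elsewhere in this article, and the main thread then merges all of them into a single on-disk index with addIndexes(Directory[]). The ParallelIndexSketch name, thread count, and document contents are made up for illustration.

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.StopAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.RAMDirectory;

public class ParallelIndexSketch {
    public static void main(String[] args) throws Exception {
        final int threadCount = 4;
        final int docsPerThread = 1000;
        final Analyzer analyzer = new StopAnalyzer();

        final RAMDirectory[] ramDirs = new RAMDirectory[threadCount];
        Thread[] workers = new Thread[threadCount];

        for (int t = 0; t < threadCount; t++) {
            final int id = t;
            ramDirs[t] = new RAMDirectory();
            workers[t] = new Thread() {
                public void run() {
                    try {
                        // each thread writes to its own in-memory index
                        IndexWriter ramWriter =
                            new IndexWriter(ramDirs[id], analyzer, true);
                        for (int i = 0; i < docsPerThread; i++) {
                            Document doc = new Document();
                            doc.add(Field.Text("fieldname",
                                "Bibamus, moriendum est"));
                            ramWriter.addDocument(doc);
                        }
                        // close flushes buffered documents into the RAMDirectory
                        ramWriter.close();
                    } catch (Exception e) {
                        e.printStackTrace();
                    }
                }
            };
            workers[t].start();
        }
        for (int t = 0; t < threadCount; t++) {
            workers[t].join();
        }

        // merge all in-memory indices into a single index on disk
        // (the 'index' directory must already exist, as noted earlier)
        String indexDir = System.getProperty("java.io.tmpdir", "tmp")
            + System.getProperty("file.separator") + "index";
        IndexWriter fsWriter = new IndexWriter(indexDir, analyzer, true);
        fsWriter.addIndexes(ramDirs);
        fsWriter.close();
    }
}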
Indexing in Multi-Threaded Environments
While multiple threads or processes can search (i.e., read) a single Lucene index simultaneously, only a single thread or process is allowed to modify (write) an index at a time. If your indexing application uses multiple indexing threads that are adding documents to the same index, you must serialize their calls to the IndexWriter.addDocument(Document) method. Leaving these calls unserialized may cause threads to get in each other's way and modify the index in unwanted ways, causing Lucene to throw exceptions. In addition, to prevent misuse, Lucene uses file-based locks in order to stop multiple threads or processes from creating IndexWriters with the same index directory at the same time.
For instance, this code:
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.StopAnalyzer;

/**
 * Demonstrates how Lucene uses locks to prevent multiple processes from
 * writing to the same index at the same time.
 *
 * Note: before running this for the first time, manually create the
 * directory called 'index' in your temporary directory.
 */
public class DoubleTrouble {
    public static void main(String[] args) throws Exception {
        // create an index called 'index' in a temporary directory
        String indexDir = System.getProperty("java.io.tmpdir", "tmp")
            + System.getProperty("file.separator") + "index";

        Analyzer analyzer = new StopAnalyzer();
        IndexWriter firstWriter = new IndexWriter(indexDir, analyzer, true);

        // the following line will cause an exception
        IndexWriter secondWriter = new IndexWriter(indexDir, analyzer, false);

        // the following two lines will never even be reached
        firstWriter.close();
        secondWriter.close();
    }
}
will cause the following exception:
Exception in thread "main" java.io.IOException: \
Index locked for write: Lock@/tmp/index/write.lock
at org.apache.lucene.index.IndexWriter.<init>(IndexWriter.java:145)
at org.apache.lucene.index.IndexWriter.<init>(IndexWriter.java:122)
at DoubleTrouble.main(DoubleTrouble.java:23)
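One simple way to serialize those addDocument calls is to route them through a single synchronized wrapper shared by all indexing threads. The SynchronizedIndexer class below is a hypothetical sketch, not part of the original article:

import java.io.IOException;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexWriter;

/**
 * A thin wrapper shared by all indexing threads; the synchronized
 * methods serialize calls to IndexWriter.addDocument(Document).
 */
public class SynchronizedIndexer {
    private final IndexWriter writer;

    public SynchronizedIndexer(IndexWriter writer) {
        this.writer = writer;
    }

    public synchronized void addDocument(Document doc) throws IOException {
        writer.addDocument(doc);
    }

    public synchronized void close() throws IOException {
        writer.close();
    }
}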
Optimizing Indices
I have mentioned index optimization a few times in this article, but I have not yet explained it. To optimize an index, one has to call optimize() on an IndexWriter instance. When this happens, all in-memory documents are flushed to the disk and all index segments are merged into a single segment, reducing the number of files that make up the index. However, optimizing an index does not help improve indexing performance. As a matter of fact, optimizing an index during the indexing process will only slow things down. Despite this, optimizing may sometimes be necessary in order to keep the number of open files under control. For instance, optimizing an index during the indexing process may be needed in situations where searching and indexing happen concurrently, since both processes keep their own set of open files. A good rule of thumb is that if more documents will be added to the index soon, you should avoid calling optimize(). If, on the other hand, you know that the index will not be modified for a while, and that it will only be searched, you should optimize it. That will reduce the number of segments (files on the disk), and consequently improve search performance--the fewer files Lucene has to open while searching, the faster the search.
To illustrate the effect of optimizing an index, we can use the IndexOptimizeDemo class:
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.StopAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

/**
 * Creates an index called 'index' in a temporary directory.
 * If you want to optimize the index at the end, use the '-o'
 * command line argument. If you do not want to optimize the index
 * at the end, use any other value for the command line argument.
 * This class expects to be called correctly.
 *
 * Note: before running this for the first time, manually create the
 * directory called 'index' in your temporary directory.
 */
public class IndexOptimizeDemo {
    public static void main(String[] args) throws Exception {
        // create an index called 'index' in a temporary directory
        String indexDir = System.getProperty("java.io.tmpdir", "tmp")
            + System.getProperty("file.separator") + "index";

        Analyzer analyzer = new StopAnalyzer();
        IndexWriter writer = new IndexWriter(indexDir, analyzer, true);

        for (int i = 0; i < 15; i++) {
            Document doc = new Document();
            doc.add(Field.Text("fieldname", "Bibamus, moriendum est"));
            writer.addDocument(doc);
        }

        if ("-o".equalsIgnoreCase(args[0])) {
            System.out.println("Optimizing the index...");
            writer.optimize();
        }
        writer.close();
    }
}
As you can see from the class Javadoc and code, the created index will be optimized only if the -o command-line argument is used. To create an unoptimized index with this class, use this:
prompt> java IndexOptimizeDemo -n
-rw-rw-r-- 1 otis otis 10 Feb 18 23:50 _a.f1
-rw-rw-r-- 1 otis otis 260 Feb 18 23:50 _a.fdt
-rw-rw-r-- 1 otis otis 80 Feb 18 23:50 _a.fdx
-rw-rw-r-- 1 otis otis 14 Feb 18 23:50 _a.fnm
-rw-rw-r-- 1 otis otis 30 Feb 18 23:50 _a.frq
-rw-rw-r-- 1 otis otis 30 Feb 18 23:50 _a.prx
-rw-rw-r-- 1 otis otis 11 Feb 18 23:50 _a.tii
-rw-rw-r-- 1 otis otis 41 Feb 18 23:50 _a.tis
-rw-rw-r-- 1 otis otis 4 Feb 18 23:50 deletable
-rw-rw-r-- 1 otis otis 5 Feb 18 23:50 _g.f1
-rw-rw-r-- 1 otis otis 130 Feb 18 23:50 _g.fdt
-rw-rw-r-- 1 otis otis 40 Feb 18 23:50 _g.fdx
-rw-rw-r-- 1 otis otis 14 Feb 18 23:50 _g.fnm
-rw-rw-r-- 1 otis otis 15 Feb 18 23:50 _g.frq
-rw-rw-r-- 1 otis otis 15 Feb 18 23:50 _g.prx
-rw-rw-r-- 1 otis otis 11 Feb 18 23:50 _g.tii
-rw-rw-r-- 1 otis otis 41 Feb 18 23:50 _g.tis
-rw-rw-r-- 1 otis otis 22 Feb 18 23:50 segments
Example 2: An unoptimized index usually contains more than one segment.
This index contains two segments. To create a fully-optimized index, call this class with the -o command-line argument:
prompt> java IndexOptimizeDemo -o
-rw-rw-r-- 1 otis otis 4 Feb 18 23:50 deletable
-rw-rw-r-- 1 otis otis 15 Feb 18 23:50 _h.f1
-rw-rw-r-- 1 otis otis 390 Feb 18 23:50 _h.fdt
-rw-rw-r-- 1 otis otis 120 Feb 18 23:50 _h.fdx
-rw-rw-r-- 1 otis otis 14 Feb 18 23:50 _h.fnm
-rw-rw-r-- 1 otis otis 45 Feb 18 23:50 _h.frq
-rw-rw-r-- 1 otis otis 45 Feb 18 23:50 _h.prx
-rw-rw-r-- 1 otis otis 11 Feb 18 23:50 _h.tii
-rw-rw-r-- 1 otis otis 41 Feb 18 23:50 _h.tis
-rw-rw-r-- 1 otis otis 15 Feb 18 23:50 segments
Example 3: A fully-optimized index contains only a single segment.
Conclusion
This article has discussed the basic structure of a Lucene index and has demonstrated a few techniques for improving indexing performance. You also learned about potential problems with indexing in multi-threaded environments, about what it means to optimize an index, and how this affects indexing. This knowledge should allow you to gain more control over Lucene's indexing process to improve its performance. The next article will examine Lucene's text-searching capabilities.