
Advanced Text Indexing with Lucene

by Otis Gospodnetic
03/05/2003

 

Lucene Index Structure

Lucene is a free text-indexing and -searching API written in Java. To appreciate the indexing techniques described later in this article, you need a basic understanding of Lucene's index structure. As I mentioned in the previous article in this series, a typical Lucene index is stored in a single directory in the filesystem on a hard disk.

The core elements of such an index are segments, documents, fields, and terms. Every index consists of one or more segments. Each segment contains one or more documents. Each document has one or more fields, and each field contains one or more terms. Each term is a pair of Strings representing a field name and a value. A segment consists of a series of files. The exact number of files that constitute each segment varies from index to index, and depends on the number of fields that the index contains. All files belonging to the same segment share a common prefix and differ in the suffix. You can think of a segment as a sub-index, although each segment is not a fully independent index.

-rw-rw-r--    1 otis     otis            4   Nov 22 22:43 deletable
-rw-rw-r--    1 otis     otis      1000000   Nov 22 22:43 _lfyc.f1
-rw-rw-r--    1 otis     otis      1000000   Nov 22 22:43 _lfyc.f2
-rw-rw-r--    1 otis     otis     31030502   Nov 22 22:28 _lfyc.fdt
-rw-rw-r--    1 otis     otis      8000000   Nov 22 22:28 _lfyc.fdx
-rw-rw-r--    1 otis     otis           16   Nov 22 22:28 _lfyc.fnm
-rw-rw-r--    1 otis     otis   1253701335   Nov 22 22:43 _lfyc.frq
-rw-rw-r--    1 otis     otis   1871279328   Nov 22 22:43 _lfyc.prx
-rw-rw-r--    1 otis     otis        14122   Nov 22 22:43 _lfyc.tii
-rw-rw-r--    1 otis     otis      1082950   Nov 22 22:43 _lfyc.tis
-rw-rw-r--    1 otis     otis           18   Nov 22 22:43 segments

Example 1: An index consisting of a single segment.

Note that all files that belong to this segment start with a common prefix: _lfyc. Because this index contains two fields, you will notice two files with the fN suffix, where N is a number. If this index had three fields, a file named _lfyc.f3 would also be present in the index directory.
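The prefix convention makes it easy to tell which files belong to which segment. The following is a minimal sketch in plain Java, not part of Lucene: the class and method names are made up for illustration, and it assumes that files without a "_prefix.suffix" shape (such as segments and deletable) belong to the index as a whole rather than to one segment.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class SegmentFilesSketch {

    /**
     * Groups index file names by segment prefix, e.g. "_lfyc.fdt" -> "_lfyc".
     * Names that do not look like "_prefix.suffix" are skipped (assumption:
     * they are index-wide files such as "segments" and "deletable").
     */
    static Map<String, List<String>> groupBySegment(List<String> fileNames) {
        Map<String, List<String>> bySegment = new TreeMap<>();
        for (String name : fileNames) {
            int dot = name.indexOf('.');
            if (!name.startsWith("_") || dot < 0) {
                continue; // not a per-segment file
            }
            String prefix = name.substring(0, dot);
            bySegment.computeIfAbsent(prefix, k -> new ArrayList<>()).add(name);
        }
        return bySegment;
    }

    public static void main(String[] args) {
        List<String> files = Arrays.asList(
            "deletable", "_lfyc.f1", "_lfyc.f2", "_lfyc.fdt", "_lfyc.fdx",
            "_lfyc.fnm", "_lfyc.frq", "_lfyc.prx", "_lfyc.tii", "_lfyc.tis",
            "segments");
        // Prints each segment prefix with the files that share it
        System.out.println(groupBySegment(files));
    }
}
```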

The number of segments in an index is fixed once the index is fully built, but it varies while indexing is in progress. Lucene adds segments as new documents are added to the index, and merges segments every so often. In the next section we will learn how to control creation and merging of segments in order to improve indexing speed.

For more information about the files that make up a Lucene index, please see the File Formats document on Lucene's web site. You can find the URL in the Reference section at the end of this article.

 

 

 

Indexing Speed Factors

The previous article demonstrated how to index text using the LuceneIndexExample class. Because the example was so basic, there was no need to think about speed. If you are using Lucene in a non-trivial application, you will want to ensure optimal indexing performance. The bottleneck of a typical text-indexing application is the process of writing index files onto a disk. Therefore, we need to instruct Lucene to be smart about adding and merging segments while indexing documents.

When new documents are added to a Lucene index, they are initially stored in memory instead of being immediately written to the disk. This is done for performance reasons. The simplest way to improve Lucene's indexing performance is to adjust the value of IndexWriter's mergeFactor instance variable. This value tells Lucene how many documents to store in memory before writing them to the disk, as well as how often to merge multiple segments together. With the default value of 10, Lucene will store 10 documents in memory before writing them to a single segment on the disk. The mergeFactor value of 10 also means that each time the number of segments of a given size on the disk reaches 10, Lucene will merge these segments into a single, larger segment. (There is a small exception to this rule, which I shall explain shortly.)

For instance, if we set mergeFactor to 10, a new segment will be created on the disk for every 10 documents added to the index. When the 10th segment of size 10 is added, all 10 will be merged into a single segment of size 100. When 10 such segments of size 100 have been added, they will be merged into a single segment containing 1000 documents, and so on. Therefore, at any time, there will be no more than 9 segments in each power of 10 index size.

The exception noted earlier has to do with another IndexWriter instance variable: maxMergeDocs. While merging segments, Lucene will ensure that no segment with more than maxMergeDocs documents is created. For instance, if we set maxMergeDocs to 1000, when we add the 10,000th document, instead of merging multiple segments into a single segment of size 10,000, Lucene will create a 10th segment of size 1000, and keep adding segments of size 1000 for every 1000 documents added.
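To make the merge rule concrete, here is a small simulation in plain Java. It is only a sketch of the behavior described in the text, not Lucene's actual merging code: the class and method names are invented, and it assumes a merge happens exactly when the last mergeFactor segments have the same size and the merged segment would not exceed maxMergeDocs.

```java
import java.util.ArrayList;
import java.util.List;

class SegmentMergeSketch {

    /**
     * Returns the sizes (in documents) of the on-disk segments after
     * docCount documents have been added. Documents still buffered in
     * memory at the end (docCount % mergeFactor of them) are not included.
     */
    static List<Integer> simulate(int docCount, int mergeFactor, int maxMergeDocs) {
        if (mergeFactor < 2) {
            throw new IllegalArgumentException("mergeFactor must be at least 2");
        }
        List<Integer> segments = new ArrayList<>();
        int buffered = 0;
        for (int i = 0; i < docCount; i++) {
            buffered++;
            if (buffered == mergeFactor) {
                segments.add(buffered); // flush the buffer as a new segment
                buffered = 0;
                mergeCascade(segments, mergeFactor, maxMergeDocs);
            }
        }
        return segments;
    }

    /** Merges the newest mergeFactor segments while they are equal in size. */
    static void mergeCascade(List<Integer> segments, int mergeFactor, int maxMergeDocs) {
        while (segments.size() >= mergeFactor) {
            int from = segments.size() - mergeFactor;
            int size = segments.get(segments.size() - 1);
            for (int i = from; i < segments.size(); i++) {
                if (segments.get(i) != size) {
                    return; // newest segments differ in size: nothing to merge
                }
            }
            long merged = (long) size * mergeFactor;
            if (merged > maxMergeDocs) {
                return; // merged segment would exceed the cap
            }
            segments.subList(from, segments.size()).clear();
            segments.add((int) merged);
        }
    }

    public static void main(String[] args) {
        System.out.println(simulate(1234, 10, Integer.MAX_VALUE));
        System.out.println(simulate(10000, 10, 1000));
    }
}
```

In this model, adding 10,000 documents with mergeFactor 10 and maxMergeDocs 1000 leaves ten segments of 1000 documents each, which matches the example above.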

The default value of maxMergeDocs is Integer.MAX_VALUE. In my experience, one rarely needs to change this value.

Now that I have explained how mergeFactor and maxMergeDocs work, you can see that using a higher value for mergeFactor will cause Lucene to use more RAM, but will let Lucene write data to disk less frequently, which will speed up the indexing process. A smaller mergeFactor will use less memory and will cause the index to be updated more frequently, which will make it more up-to-date, but will also slow down the indexing process. Similarly, a larger maxMergeDocs is better suited for batch indexing, and a smaller maxMergeDocs is better for more interactive indexing.

To get a better feel for how different values of mergeFactor and maxMergeDocs affect indexing speed, take a look at the IndexTuningDemo class below. This class takes three arguments on the command line: the total number of documents to add to the index, the value to use for mergeFactor, and the value to use for maxMergeDocs. All three arguments must be specified, must be integers, and must be in this order. In order to keep the code short and clean, there are no checks for improper usage.

import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.StopAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

/**
 * Creates an index called 'index' in a temporary directory.
 * The number of documents to add to this index, the mergeFactor and
 * the maxMergeDocs must be specified on the command line
 * in that order - this class expects to be called correctly.
 * 
 * Note: before running this for the first time, manually create the
 * directory called 'index' in your temporary directory.
 */
public class IndexTuningDemo
{
    public static void main(String[] args) throws Exception
    {
        int docsInIndex  = Integer.parseInt(args[0]);

        // create an index called 'index' in a temporary directory
        String indexDir =
            System.getProperty("java.io.tmpdir", "tmp") +
            System.getProperty("file.separator") + "index";

        Analyzer    analyzer = new StopAnalyzer();
        IndexWriter writer   = new IndexWriter(indexDir, analyzer, true);

        // set variables that affect speed of indexing
        writer.mergeFactor   = Integer.parseInt(args[1]);
        writer.maxMergeDocs  = Integer.parseInt(args[2]);

        long startTime = System.currentTimeMillis();
        for (int i = 0; i < docsInIndex; i++)
        {
            Document doc = new Document();
            doc.add(Field.Text("fieldname", "Bibamus, moriendum est"));
            writer.addDocument(doc);
        }
        writer.close();
        long stopTime = System.currentTimeMillis();
        System.out.println("Total time: " + (stopTime - startTime) + " ms");
    }
}


Here are some results:

prompt> time java IndexTuningDemo 100000 10 1000000

Total time: 410092 ms

real    6m51.801s
user    5m30.000s
sys     0m45.280s

prompt> time java IndexTuningDemo 100000 1000 100000

Total time: 249791 ms

real    4m11.470s
user    3m46.330s
sys     0m3.660s


As you can see, both invocations created an index with 100,000 documents, but the first one took much longer to complete. That is because it used the default mergeFactor of 10, which caused Lucene to write documents to the disk more often than the mergeFactor of 1000 used in the second invocation.

Note that while these two variables can help improve indexing performance, they also affect the number of file descriptors that Lucene uses, and can therefore cause the "Too many open files" exception. If you get this error, you should first see if you can optimize the index, as will be described shortly. Optimization may help indexes that contain more than one segment. If optimizing the index does not solve the problem, you could try increasing the maximum number of open files allowed on your computer. This is usually done at the operating-system level and varies from OS to OS. If you are using Lucene on a computer that uses a flavor of the UNIX OS, you can see the maximum number of open files allowed from the command line.

Under bash, you can see the current settings with the built-in ulimit command:

prompt> ulimit -n


Under tcsh, the equivalent is:

prompt> limit descriptors


To change the value under bash, use this:

prompt> ulimit -n <max number of open files here>


Under tcsh, use the following:

prompt> limit descriptors <max number of open files here>


To estimate a setting for the maximum number of open files allowed while indexing, keep in mind that the maximum number of files Lucene will open is (1 + mergeFactor) * FilesPerSegment.

For instance, with the default mergeFactor of 10 and an index of 1 million documents, Lucene will require up to 110 open files on an unoptimized index (a figure that corresponds to 10 files per segment). When IndexWriter's optimize() method is called, all segments are merged into a single segment, which minimizes the number of open files that Lucene needs.
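The estimate is simple enough to compute by hand, but a short sketch shows how the pieces fit together. The class and method names are hypothetical. The filesPerSegment helper is an assumption inferred from the listing in Example 1, where a segment has seven fixed files plus one .fN file per field; it is not a figure taken from Lucene's documentation.

```java
class OpenFilesEstimate {

    /**
     * Assumption inferred from Example 1: seven fixed files per segment
     * (.fdt, .fdx, .fnm, .frq, .prx, .tii, .tis) plus one .fN file per field.
     */
    static int filesPerSegment(int fieldCount) {
        return 7 + fieldCount;
    }

    /** The formula from the text: (1 + mergeFactor) * FilesPerSegment. */
    static int maxOpenFiles(int mergeFactor, int filesPerSegment) {
        return (1 + mergeFactor) * filesPerSegment;
    }

    public static void main(String[] args) {
        // Default mergeFactor with 10 files per segment
        System.out.println(maxOpenFiles(10, 10));
        // A much larger mergeFactor with the two-field index from Example 1
        System.out.println(maxOpenFiles(1000, filesPerSegment(2)));
    }
}
```

Comparing the result with the output of ulimit -n (or limit descriptors) gives a rough idea of whether a given mergeFactor is likely to run into the "Too many open files" error.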

 

In-Memory Indexing

In the previous section, I mentioned that new documents added to an index are stored in memory before being written to the disk. You also saw how to control the rate at which this is done via IndexWriter's instance variables. The Lucene distribution contains the RAMDirectory class, which gives even more control over this process. This class implements the Directory interface, just like FSDirectory does, but stores indexed documents in memory, while FSDirectory stores them on disk.

Because RAMDirectory does not write anything to the disk, it is faster than FSDirectory. However, since computers usually come with less RAM than hard disk space, RAMDirectory is not suitable for very large indices.

The MemoryVsDisk class demonstrates how to use RAMDirectory as an in-memory buffer in order to improve the indexing speed.

import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.StopAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;

import java.io.IOException;

/**
 * Creates an index called 'index' in a temporary directory.
 * The number of documents to add to this index, the mergeFactor and
 * the maxMergeDocs must be specified on the command line
 * in that order - this class expects to be called correctly.
 * Additionally, if the fourth command line argument is '-r' this
 * class will first index all documents in RAMDirectory before
 * flushing them to the disk in the end.  To make this class use the
 * regular FSDirectory use '-f' as the fourth command line argument.
 * 
 * Note: before running this for the first time, manually create the
 * directory called 'index' in your temporary directory.
 */
public class MemoryVsDisk
{
    public static void main(String[] args) throws Exception
    {
        int docsInIndex  = Integer.parseInt(args[0]);

        // create an index called 'index' in a temporary directory
        String indexDir =
            System.getProperty("java.io.tmpdir", "tmp") +
            System.getProperty("file.separator") + "index";

        Analyzer analyzer  = new StopAnalyzer();
        long     startTime = System.currentTimeMillis();

        if ("-r".equalsIgnoreCase(args[3]))
        {
            // if -r argument was specified, use RAMDirectory
            RAMDirectory ramDir    = new RAMDirectory();
            IndexWriter  ramWriter = new IndexWriter(ramDir, analyzer, true);
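            addDocs(ramWriter, docsInIndex);
            // close the in-memory writer first, so that all buffered
            // documents are flushed to ramDir before it is merged
            ramWriter.close();
            IndexWriter fsWriter   = new IndexWriter(indexDir, analyzer, true);
            fsWriter.addIndexes(new Directory[] { ramDir });
            fsWriter.close();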
        }
        else
        {
            // create an index using FSDirectory
            IndexWriter fsWriter  = new IndexWriter(indexDir, analyzer, true);
            fsWriter.mergeFactor  = Integer.parseInt(args[1]);
            fsWriter.maxMergeDocs = Integer.parseInt(args[2]);
            addDocs(fsWriter, docsInIndex);
            fsWriter.close();
        }

        long stopTime = System.currentTimeMillis();
        System.out.println("Total time: " + (stopTime - startTime) + " ms");
    }

    private static void addDocs(IndexWriter writer, int docsInIndex)
        throws IOException
    {
        for (int i = 0; i < docsInIndex; i++)
        {
            Document doc = new Document();
            doc.add(Field.Text("fieldname", "Bibamus, moriendum est"));
            writer.addDocument(doc);
        }
    }
}


To create an index with 10,000 documents and only use FSDirectory, use this:

prompt> time java MemoryVsDisk 10000 10 100000 -f

Total time: 41380 ms

real    0m42.739s
user    0m36.750s
sys     0m4.180s


To create an index of the same size but do it faster, with RAMDirectory, call MemoryVsDisk as follows:

prompt>  time java MemoryVsDisk 10000 10 100000 -r
Total time: 27325 ms

real    0m28.695s
user    0m27.920s
sys     0m0.610s


However, note that you can achieve the same, or even better, performance by choosing a more suitable value for mergeFactor:

prompt> time java MemoryVsDisk 10000 1000 100000 -f

Total time: 24724 ms

real    0m26.108s
user    0m25.280s
sys     0m0.620s


Be careful, however, when tuning mergeFactor. A value that requires more memory than your JVM can access may cause a java.lang.OutOfMemoryError.

Finally, do not forget that you can greatly influence the performance of any Java application by giving the JVM more memory to work with:

prompt> time java -Xmx300m -Xms200m MemoryVsDisk 10000 10 100000 -r

Total time: 15166 ms

real    0m17.311s
user    0m15.400s
sys     0m1.590s


Merging Indices

If you want to improve indexing performance with Lucene, and manipulating IndexWriter's mergeFactor and maxMergeDocs proves insufficient, you can use RAMDirectory to create in-memory indices. You could create a multi-threaded indexing application that uses multiple RAMDirectory-based indices in parallel, one in each thread, and merges them into a single index on the disk using IndexWriter's addIndexes(Directory[]) method. Taking this idea further, a sophisticated indexing application could even create in-memory indices on multiple computers in parallel. To make full use of this approach, one needs to ensure that the thread that performs the actual indexing on the disk is never idle, as that translates to wasted time.
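The overall shape of such an application can be sketched without Lucene at all. In the plain-Java outline below, each worker builds its own in-memory batch (standing in for a RAMDirectory-based index) and a single thread then combines the batches in order (standing in for the addIndexes(Directory[]) call). All class and method names are hypothetical, and the lists of strings are only placeholders for real indices.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class ParallelBatchSketch {

    /** Builds one in-memory batch; stands in for indexing into a RAMDirectory. */
    static List<String> buildBatch(int worker, int docs) {
        List<String> batch = new ArrayList<>();
        for (int i = 0; i < docs; i++) {
            batch.add("w" + worker + "-d" + i);
        }
        return batch;
    }

    /** Builds batches in parallel, then merges them on the calling thread. */
    static List<String> indexInParallel(int workers, int docsPerWorker) {
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        try {
            List<Future<List<String>>> pending = new ArrayList<>();
            for (int w = 0; w < workers; w++) {
                final int id = w;
                Callable<List<String>> task = () -> buildBatch(id, docsPerWorker);
                pending.add(pool.submit(task));
            }
            // Only this thread touches the merged result, mirroring the
            // single writer that owns the on-disk index.
            List<String> merged = new ArrayList<>();
            for (Future<List<String>> batch : pending) {
                merged.addAll(batch.get());
            }
            return merged;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while merging", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("a worker failed", e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(indexInParallel(3, 4).size() + " documents merged");
    }
}
```

A real implementation would also want to start merging finished batches while other workers are still busy, so that the merging thread is not left idle.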

Indexing in Multi-Threaded Environments

While multiple threads or processes can search (i.e. read) a single Lucene index simultaneously, only a single thread or process is allowed to modify (write) an index at a time. If your indexing application uses multiple indexing threads that are adding documents to the same index, you must serialize their calls to the IndexWriter.addDocument(Document) method. Leaving these calls unserialized may cause threads to get in each other's way and modify the index in unwanted ways, causing Lucene to throw exceptions. In addition, to prevent misuse, Lucene uses file-based locks in order to stop multiple threads or processes from creating IndexWriters with the same index directory at the same time.
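One common way to serialize such calls is to route them through a single lock. The sketch below uses plain Java only: DocumentSink is a hypothetical stand-in for IndexWriter, not a Lucene type, and the unsynchronized ArrayList plays the role of an index that must not be modified by two threads at once.

```java
import java.util.ArrayList;
import java.util.List;

class SerializedIndexingSketch {

    /** Hypothetical stand-in for IndexWriter; NOT part of Lucene. */
    interface DocumentSink {
        void addDocument(String doc);
    }

    /** Wraps a sink so that only one thread at a time can add a document. */
    static final class SerializedSink implements DocumentSink {
        private final DocumentSink delegate;
        private final Object lock = new Object();

        SerializedSink(DocumentSink delegate) {
            this.delegate = delegate;
        }

        @Override
        public void addDocument(String doc) {
            synchronized (lock) {
                delegate.addDocument(doc);
            }
        }
    }

    /** Starts threadCount threads that share one sink; returns documents stored. */
    static int indexInParallel(int threadCount, int docsPerThread) {
        List<String> store = new ArrayList<>(); // deliberately not thread-safe
        DocumentSink sink = new SerializedSink(store::add);
        Thread[] threads = new Thread[threadCount];
        for (int t = 0; t < threadCount; t++) {
            final int id = t;
            threads[t] = new Thread(() -> {
                for (int i = 0; i < docsPerThread; i++) {
                    sink.addDocument("doc-" + id + "-" + i);
                }
            });
            threads[t].start();
        }
        try {
            for (Thread thread : threads) {
                thread.join();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while indexing", e);
        }
        return store.size();
    }

    public static void main(String[] args) {
        System.out.println(indexInParallel(4, 1000) + " documents added");
    }
}
```

With a real IndexWriter, the same idea would mean synchronizing on one shared object around every addDocument(Document) call.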

For instance, this code:

import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.StopAnalyzer;

/**
 * Demonstrates how Lucene uses locks to prevent multiple processes from
 * writing to the same index at the same time.
 * Note: before running this for the first time, manually create the
 * directory called 'index' in your temporary directory.
 */
public class DoubleTrouble
{
    public static void main(String[] args) throws Exception
    {
        // create an index called 'index' in a temporary directory
        String indexDir =
            System.getProperty("java.io.tmpdir", "tmp") +
            System.getProperty("file.separator") + "index";

        Analyzer    analyzer     = new StopAnalyzer();
        IndexWriter firstWriter  = new IndexWriter(indexDir, analyzer, true);

        // the following line will cause an exception
        IndexWriter secondWriter = new IndexWriter(indexDir, analyzer, false);

        // the following two lines will never even be reached
        firstWriter.close();
        secondWriter.close();
    }
}


will cause the following exception:

Exception in thread "main" java.io.IOException: \
        Index locked for write: Lock@/tmp/index/write.lock
        at org.apache.lucene.index.IndexWriter.<init>(IndexWriter.java:145)
        at org.apache.lucene.index.IndexWriter.<init>(IndexWriter.java:122)
        at DoubleTrouble.main(DoubleTrouble.java:23)


Optimizing Indices

I have mentioned index optimization a few times in this article, but I have not yet explained it. To optimize an index, one has to call optimize() on an IndexWriter instance. When this happens, all in-memory documents are flushed to the disk and all index segments are merged into a single segment, reducing the number of files that make up the index.

However, optimizing an index does not help improve indexing performance. As a matter of fact, optimizing an index during the indexing process will only slow things down. Despite this, optimizing may sometimes be necessary in order to keep the number of open files under control. For instance, optimizing an index during the indexing process may be needed in situations where searching and indexing happen concurrently, since both processes keep their own set of open files.

A good rule of thumb is that if more documents will be added to the index soon, you should avoid calling optimize(). If, on the other hand, you know that the index will not be modified for a while, and the index will only be searched, you should optimize it. That will reduce the number of segments (files on the disk), and consequently improve search performance: the fewer files Lucene has to open while searching, the faster the search.

To illustrate the effect of optimizing an index, we can use the IndexOptimizeDemo class:

import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.StopAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

/**
 * Creates an index called 'index' in a temporary directory.
 * If you want the index to optimize the index at the end use '-o'
 * command line argument.  If you do not want to optimize the index
 * at the end use any other value for the command line argument.
 * This class expects to be called correctly.
 * 
 * Note: before running this for the first time, manually create the
 * directory called 'index' in your temporary directory.
 */
public class IndexOptimizeDemo
{
    public static void main(String[] args) throws Exception
    {
        // create an index called 'index' in a temporary directory
        String indexDir =
            System.getProperty("java.io.tmpdir", "tmp") +
            System.getProperty("file.separator") + "index";

        Analyzer    analyzer = new StopAnalyzer();
        IndexWriter writer   = new IndexWriter(indexDir, analyzer, true);

        for (int i = 0; i < 15; i++)
        {
            Document doc = new Document();
            doc.add(Field.Text("fieldname", "Bibamus, moriendum est"));
            writer.addDocument(doc);
        }
        if ("-o".equalsIgnoreCase(args[0]))
        {
            System.out.println("Optimizing the index...");
            writer.optimize();
        }
        writer.close();
    }
}


As you can see from the class Javadoc and code, the created index will be optimized only if the -o command-line argument is used. To create an unoptimized index with this class, use this:

prompt> java IndexOptimizeDemo -n


-rw-rw-r--    1 otis     otis           10 Feb 18 23:50 _a.f1
-rw-rw-r--    1 otis     otis          260 Feb 18 23:50 _a.fdt
-rw-rw-r--    1 otis     otis           80 Feb 18 23:50 _a.fdx
-rw-rw-r--    1 otis     otis           14 Feb 18 23:50 _a.fnm
-rw-rw-r--    1 otis     otis           30 Feb 18 23:50 _a.frq
-rw-rw-r--    1 otis     otis           30 Feb 18 23:50 _a.prx
-rw-rw-r--    1 otis     otis           11 Feb 18 23:50 _a.tii
-rw-rw-r--    1 otis     otis           41 Feb 18 23:50 _a.tis
-rw-rw-r--    1 otis     otis            4 Feb 18 23:50 deletable
-rw-rw-r--    1 otis     otis            5 Feb 18 23:50 _g.f1
-rw-rw-r--    1 otis     otis          130 Feb 18 23:50 _g.fdt
-rw-rw-r--    1 otis     otis           40 Feb 18 23:50 _g.fdx
-rw-rw-r--    1 otis     otis           14 Feb 18 23:50 _g.fnm
-rw-rw-r--    1 otis     otis           15 Feb 18 23:50 _g.frq
-rw-rw-r--    1 otis     otis           15 Feb 18 23:50 _g.prx
-rw-rw-r--    1 otis     otis           11 Feb 18 23:50 _g.tii
-rw-rw-r--    1 otis     otis           41 Feb 18 23:50 _g.tis
-rw-rw-r--    1 otis     otis           22 Feb 18 23:50 segments


Example 2: An unoptimized index usually contains more than one segment.

This index contains two segments. To create a fully optimized index, call this class with the -o command-line argument:

prompt> java IndexOptimizeDemo -o


-rw-rw-r--    1 otis     otis            4 Feb 18 23:50 deletable
-rw-rw-r--    1 otis     otis           15 Feb 18 23:50 _h.f1
-rw-rw-r--    1 otis     otis          390 Feb 18 23:50 _h.fdt
-rw-rw-r--    1 otis     otis          120 Feb 18 23:50 _h.fdx
-rw-rw-r--    1 otis     otis           14 Feb 18 23:50 _h.fnm
-rw-rw-r--    1 otis     otis           45 Feb 18 23:50 _h.frq
-rw-rw-r--    1 otis     otis           45 Feb 18 23:50 _h.prx
-rw-rw-r--    1 otis     otis           11 Feb 18 23:50 _h.tii
-rw-rw-r--    1 otis     otis           41 Feb 18 23:50 _h.tis
-rw-rw-r--    1 otis     otis           15 Feb 18 23:50 segments


Example 3: A fully-optimized index contains only a single segment.

Conclusion

This article has discussed the basic structure of a Lucene index and has demonstrated a few techniques for improving indexing performance. You also learned about potential problems with indexing in multi-threaded environments, about what it means to optimize an index, and how this affects indexing. This knowledge should allow you to gain more control over Lucene's indexing process to improve its performance. The next article will examine Lucene's text-searching capabilities.

References

Otis Gospodnetic is an active Apache Jakarta member, a member of the Apache Jakarta Project Management Committee, a developer of Lucene, and the maintainer of jGuru's Lucene FAQ.


 

[Reposted from: http://onjava.com/lpt/a/3273]

 
