
Lucene 4.3 Official Demo: Building an Index and Searching

 

-----------------------------------------------------------

IndexFiles

As we discussed in the previous walk-through, the IndexFiles class creates a Lucene Index. Let's take a look at how it does this.

The main() method parses the command-line parameters, then, in preparation for instantiating IndexWriter, opens a Directory and instantiates StandardAnalyzer and IndexWriterConfig.

The value of the -index command-line parameter is the name of the filesystem directory where all index information should be stored. If IndexFiles is invoked with a relative path given in the -index command-line parameter, or if the -index command-line parameter is not given, causing the default relative index path "index" to be used, the index path will be created as a subdirectory of the current working directory (if it does not already exist). On some platforms, the index path may be created in a different directory (such as the user's home directory).

The -docs command-line parameter value is the location of the directory containing files to be indexed.

The -update command-line parameter tells IndexFiles not to delete the index if it already exists. When -update is not given, IndexFiles will first wipe the slate clean before indexing any documents.

Lucene Directory implementations are used by the IndexWriter to store information in the index. In addition to the FSDirectory implementation we are using, there are several other Directory subclasses that can write to RAM, to databases, etc.
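For instance, an index can be held entirely in memory by swapping FSDirectory for RAMDirectory. Below is a minimal sketch (not part of the demo) assuming the Lucene 4.3 core and analyzers-common jars are on the classpath:

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

public class RamIndexExample {
 public static void main(String[] args) throws Exception {
  Directory dir = new RAMDirectory(); // the index lives entirely in memory
  IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_43,
    new StandardAnalyzer(Version.LUCENE_43));
  IndexWriter writer = new IndexWriter(dir, iwc);
  Document doc = new Document();
  doc.add(new TextField("contents", "hello lucene", Field.Store.YES));
  writer.addDocument(doc);
  writer.close(); // commits the segment; the data is gone once dir is discarded
 }
}

Everything else (analyzer, IndexWriterConfig, Document) works unchanged; only the storage layer differs, which is the point of the Directory abstraction.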

Lucene Analyzers are processing pipelines that break up text into indexed tokens, a.k.a. terms, and optionally perform other operations on these tokens, e.g. downcasing, synonym insertion, filtering out unwanted tokens, etc. The Analyzer we are using is StandardAnalyzer, which creates tokens using the Word Break rules from the Unicode Text Segmentation algorithm specified in Unicode Standard Annex #29; converts tokens to lowercase; and then filters out stopwords. Stopwords are common language words such as articles (a, an, the, etc.) and other tokens that may have less value for searching. It should be noted that there are different rules for every language, and you should use the proper analyzer for each. Lucene currently provides Analyzers for a number of different languages (see the javadocs under lucene/analysis/common/src/java/org/apache/lucene/analysis).
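To see what this pipeline produces for a concrete string, the sketch below (again assuming Lucene 4.3) drives the TokenStream by hand; "The" is dropped as a stopword and the remaining tokens are lowercased:

import java.io.StringReader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.Version;

public class AnalyzerExample {
 public static void main(String[] args) throws Exception {
  Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_43);
  // tokenStream() assembles the tokenizer and filters for the named field
  TokenStream stream = analyzer.tokenStream("contents",
    new StringReader("The Quick Brown FOX"));
  CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
  stream.reset(); // required before the first incrementToken() call
  while (stream.incrementToken()) {
   System.out.println(term.toString()); // prints: quick, brown, fox
  }
  stream.end();
  stream.close();
 }
}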

The IndexWriterConfig instance holds all configuration for IndexWriter. For example, we set the OpenMode to use here based on the value of the -update command-line parameter.

Looking further down in the file, after IndexWriter is instantiated, you should see the indexDocs() code. This recursive function crawls the directories and creates Document objects. The Document is simply a data object to represent the text content from the file as well as its creation time and location. These instances are added to the IndexWriter. If the -update command-line parameter is given, the IndexWriterConfig OpenMode will be set to OpenMode.CREATE_OR_APPEND, and rather than adding documents to the index, the IndexWriter will update them in the index by attempting to find an already-indexed document with the same identifier (in our case, the file path serves as the identifier); deleting it from the index if it exists; and then adding the new document to the index.

Searching Files

The SearchFiles class is quite simple. It primarily collaborates with an IndexSearcher, a StandardAnalyzer (which is used in the IndexFiles class as well), and a QueryParser. The query parser is constructed with an analyzer used to interpret your query text in the same way the documents are interpreted: finding word boundaries, downcasing, and removing useless words like 'a', 'an' and 'the'. The Query object holds the result of the QueryParser and is passed to the searcher. Note that it's also possible to programmatically construct a rich Query object without using the query parser. The query parser just enables decoding the Lucene query syntax into the corresponding Query object.
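As a rough illustration of building a Query programmatically (a sketch, not code from the demo), the snippet below combines two TermQuerys with a BooleanQuery. Note that TermQuery bypasses the analyzer, so the terms must already match the indexed form (lowercased, in the case of StandardAnalyzer):

import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

public class ProgrammaticQueryExample {
 public static Query buildQuery() {
  // roughly equivalent to parsing "+lucene index" against the "contents" field
  BooleanQuery query = new BooleanQuery();
  query.add(new TermQuery(new Term("contents", "lucene")), BooleanClause.Occur.MUST);
  query.add(new TermQuery(new Term("contents", "index")), BooleanClause.Occur.SHOULD);
  return query;
 }
}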

SearchFiles uses the IndexSearcher.search(query, n) method that returns TopDocs with max n hits. The results are printed in pages, sorted by score (i.e. relevance).

------------------------------------------------------------

 

Building the index:

 

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.LongField;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig.OpenMode;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.Date;

public class IndexFiles {

 private IndexFiles() {
 }

 /**
  * @param args
  */
 public static void main(String[] args) {

  // -index command-line parameter is the name of the filesystem directory
  // where all index information should be stored
  // -docs command-line parameter value is the location of the directory
  // containing files to be indexed
  // -update command-line parameter tells IndexFiles not to delete the
  // index if it already exists. When -update is not given, IndexFiles
  // will first wipe the slate clean before indexing any documents
  String usage = "java org.apache.lucene.demo.IndexFiles"
    + " [-index INDEX_PATH] [-docs DOCS_PATH] [-update]\n\n"
    + "This indexes the documents in DOCS_PATH, creating a Lucene index"
    + "in INDEX_PATH that can be searched with SearchFiles";
  String indexPath = "index";
  String docsPath = null;
  boolean create = true;
  for (int i = 0; i < args.length; i++) {
   if ("-index".equals(args[i])) {
    indexPath = args[i + 1];
    i++;
   } else if ("-docs".equals(args[i])) {
    docsPath = args[i + 1];
    i++;
   } else if ("-update".equals(args[i])) {
    create = false;
   }
  }
  if (docsPath == null) {
   System.err.println("Usage: " + usage);
   System.exit(1);
  }

  final File docDir = new File(docsPath);
  if (!docDir.exists() || !docDir.canRead()) {
   System.out
     .println("Document directory '"
       + docDir.getAbsolutePath()
       + "' does not exist or is not readable, please check the path");
   System.exit(1);
  }
  Date start = new Date();
  
  try {
   System.out.println("Indexing to directory '" + indexPath + "'...");

   Directory dir = FSDirectory.open(new File(indexPath));
   Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_43);
   IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_43,
     analyzer);

   if (create) {
    // Create a new index in the directory, removing any previously indexed documents:
    iwc.setOpenMode(OpenMode.CREATE);
   } else {
    // Add new documents to an existing index:
    iwc.setOpenMode(OpenMode.CREATE_OR_APPEND);
   }

   // Optional: for better indexing performance, if you
   // are indexing many documents, increase the RAM
   // buffer. But if you do this, increase the max heap
   // size to the JVM (eg add -Xmx512m or -Xmx1g):
   //
   // iwc.setRAMBufferSizeMB(256.0);

   IndexWriter writer = new IndexWriter(dir, iwc);
   indexDocs(writer, docDir);

   // NOTE: if you want to maximize search performance,
   // you can optionally call forceMerge here. This can be
   // a terribly costly operation, so generally it's only
   // worth it when your index is relatively static (ie
   // you're done adding documents to it):
   //
   // writer.forceMerge(1);

   writer.close();

   Date end = new Date();
   System.out.println(end.getTime() - start.getTime()
     + " total milliseconds");

  } catch (IOException e) {
   System.out.println(" caught a " + e.getClass()
     + "\n with message: " + e.getMessage());
  }

 }
 
 static void indexDocs(IndexWriter writer, File file)
     throws IOException {
     // do not try to index files that cannot be read
     if (file.canRead()) {
       if (file.isDirectory()) {
         String[] files = file.list();
         // an IO error could occur
         if (files != null) {
           for (int i = 0; i < files.length; i++) {
             indexDocs(writer, new File(file, files[i]));
           }
         }
       } else {
 
         FileInputStream fis;
         try {
           fis = new FileInputStream(file);
         } catch (FileNotFoundException fnfe) {
           // at least on windows, some temporary files raise this exception with an "access denied" message
           // checking if the file can be read doesn't help
           return;
         }
 
         try {
 
           // make a new, empty document
           Document doc = new Document();
 
           // Add the path of the file as a field named "path".  Use a
           // field that is indexed (i.e. searchable), but don't tokenize
           // the field into separate words and don't index term frequency
           // or positional information:
           Field pathField = new StringField("path", file.getPath(), Field.Store.YES);
           doc.add(pathField);
 
           // Add the last modified date of the file as a field named "modified".
           // Use a LongField that is indexed (i.e. efficiently filterable with
           // NumericRangeFilter).  This indexes to milli-second resolution, which
           // is often too fine.  You could instead create a number based on
           // year/month/day/hour/minutes/seconds, down to the resolution you require.
           // For example the long value 2011021714 would mean
           // February 17, 2011, 2-3 PM.
           doc.add(new LongField("modified", file.lastModified(), Field.Store.NO));
 
           // Add the contents of the file to a field named "contents".  Specify a Reader,
           // so that the text of the file is tokenized and indexed, but not stored.
           // Note that the file is expected to be in UTF-8 encoding;
           // if that's not the case, searching for special characters will fail.
           doc.add(new TextField("contents", new BufferedReader(new InputStreamReader(fis, "UTF-8"))));
 
           if (writer.getConfig().getOpenMode() == OpenMode.CREATE) {
             // New index, so we just add the document (no old document can be there):
             System.out.println("adding " + file);
             writer.addDocument(doc);
           } else {
             // Existing index (an old copy of this document may have been indexed) so
             // we use updateDocument instead to replace the old one matching the exact
             // path, if present:
             System.out.println("updating " + file);
             writer.updateDocument(new Term("path", file.getPath()), doc);
           }
          
         } finally {
           fis.close();
         }
       }
     }


 }
}

 

Searching:

import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.Date;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class SearchFiles {

 private SearchFiles() {
 }

 /**
  * @param args
  */
 public static void main(String[] args) throws Exception  {

  String usage = "Usage:\tjava org.apache.lucene.demo.SearchFiles [-index dir] [-field f] [-repeat n] [-queries file] [-query string] [-raw] [-paging hitsPerPage]\n\nSee http://lucene.apache.org/core/4_1_0/demo/ for details.";
  if (args.length > 0
    && ("-h".equals(args[0]) || "-help".equals(args[0]))) {
   System.out.println(usage);
   System.exit(0);
  }

  String index = "index";
  String field = "contents";
  String queries = null;
  int repeat = 0;
  boolean raw = false;
  String queryString = null;
  int hitsPerPage = 10;

  for (int i = 0; i < args.length; i++) {
   if ("-index".equals(args[i])) {
    index = args[i + 1];
    i++;
   } else if ("-field".equals(args[i])) {
    field = args[i + 1];
    i++;
   } else if ("-queries".equals(args[i])) {
    queries = args[i + 1];
    i++;
   } else if ("-query".equals(args[i])) {
    queryString = args[i + 1];
    i++;
   } else if ("-repeat".equals(args[i])) {
    repeat = Integer.parseInt(args[i + 1]);
    i++;
   } else if ("-raw".equals(args[i])) {
    raw = true;
   } else if ("-paging".equals(args[i])) {
    hitsPerPage = Integer.parseInt(args[i + 1]);
    if (hitsPerPage <= 0) {
     System.err
       .println("There must be at least 1 hit per page.");
     System.exit(1);
    }
    i++;
   }
  }

  IndexReader reader = DirectoryReader.open(FSDirectory.open(new File(index)));
  IndexSearcher searcher = new IndexSearcher(reader);
   Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_43);
  BufferedReader in = null;
  if (queries != null) {
   in = new BufferedReader(new InputStreamReader(new FileInputStream(queries), "UTF-8"));
  } else {
   in = new BufferedReader(new InputStreamReader(System.in, "UTF-8"));
  }
   QueryParser parser = new QueryParser(Version.LUCENE_43, field, analyzer);
  while (true) {
   if (queries == null && queryString == null) { // prompt the user
    System.out.println("Enter query: ");
   }

   String line = queryString != null ? queryString : in.readLine();

    if (line == null) { // end of input; length() can never be -1
     break;
    }

   line = line.trim();
   if (line.length() == 0) {
    break;
   }

   Query query = parser.parse(line);
   System.out.println("Searching for: " + query.toString(field));
    // If repeat > 0, run the same search repeatedly (top 100 hits) as a crude benchmark;
    // this serves no other purpose and is simply how the demo is written
   if (repeat > 0) { // repeat & time as benchmark
    Date start = new Date();
    for (int i = 0; i < repeat; i++) {
     searcher.search(query, null, 100);
    }
    Date end = new Date();
    System.out.println("Time: " + (end.getTime() - start.getTime())+ "ms");
   }

   doPagingSearch(in, searcher, query, hitsPerPage, raw,queries == null && queryString == null);
   if (queryString != null) {
    break;
   }
  }
  reader.close();

 }

 public static void doPagingSearch(BufferedReader in,IndexSearcher searcher, Query query, int hitsPerPage, boolean raw,
   boolean interactive) throws IOException {

  // Collect enough docs to show 5 pages
  TopDocs results = searcher.search(query, 5 * hitsPerPage);
  ScoreDoc[] hits = results.scoreDocs;

  int numTotalHits = results.totalHits;
  System.out.println(numTotalHits + " total matching documents");

  int start = 0;
  int end = Math.min(numTotalHits, hitsPerPage);

  while (true) {
   if (end > hits.length) {
    System.out.println("Only results 1 - " + hits.length + " of "+ numTotalHits+ " total matching documents collected.");
    System.out.println("Collect more (y/n) ?");
    String line = in.readLine();
     if (line == null || line.length() == 0 || line.charAt(0) == 'n') {
     break;
    }

    hits = searcher.search(query, numTotalHits).scoreDocs;
   }

   end = Math.min(hits.length, start + hitsPerPage);

   for (int i = start; i < end; i++) {
    if (raw) { // output raw format
     System.out.println("doc=" + hits[i].doc + " score="
       + hits[i].score);
     continue;
    }

    Document doc = searcher.doc(hits[i].doc);
    String path = doc.get("path");
    if (path != null) {
     System.out.println((i + 1) + ". " + path);
     String title = doc.get("title");
     if (title != null) {
      System.out.println("   Title: " + doc.get("title"));
     }
    } else {
     System.out.println((i + 1) + ". "
       + "No path for this document");
    }

   }

   if (!interactive || end == 0) {
    break;
   }

   if (numTotalHits >= end) {
    boolean quit = false;
    while (true) {
     System.out.print("Press ");
     if (start - hitsPerPage >= 0) {
      System.out.print("(p)revious page, ");
     }
     if (start + hitsPerPage < numTotalHits) {
      System.out.print("(n)ext page, ");
     }
     System.out
       .println("(q)uit or enter number to jump to a page.");

     String line = in.readLine();
      if (line == null || line.length() == 0 || line.charAt(0) == 'q') {
      quit = true;
      break;
     }
     if (line.charAt(0) == 'p') {
      start = Math.max(0, start - hitsPerPage);
      break;
     } else if (line.charAt(0) == 'n') {
      if (start + hitsPerPage < numTotalHits) {
       start += hitsPerPage;
      }
      break;
     } else {
      int page = Integer.parseInt(line);
      if ((page - 1) * hitsPerPage < numTotalHits) {
       start = (page - 1) * hitsPerPage;
       break;
      } else {
       System.out.println("No such page");
      }
     }
    }
    if (quit)
     break;
    end = Math.min(numTotalHits, start + hitsPerPage);
   }
  }
 }

}
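To try the two classes together, compile them with the Lucene 4.3 core, analyzers-common, and queryparser jars on the classpath, then index a directory of text files and query it. The paths below are placeholders; substitute your own:

java IndexFiles -docs /path/to/text/files -index index
java SearchFiles -index index -query lucene

With -query, SearchFiles runs the single query and exits; without it, it prompts for queries interactively, and -paging controls how many hits are shown per page.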
