Search Engines: An Introduction to Lucene (搜索引擎--Lucene简介)
Apache Lucene is a high-performance, full-featured text search engine library.
Contents:
1. A simple example of how to use Lucene for indexing and searching
2. How the Lucene API is divided into packages
3. What an application needs to do to use Lucene
4. Some simple example programs, with full source: FileDocument.java, IndexFiles.java, DeleteFiles.java, SearchFiles.java
1. Here's a simple example of how to use Lucene for indexing and searching. The original Lucene documentation uses JUnit assertions to check that the results are what we expect; the version below leaves the assertion commented out and prints the results instead:
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;

/**
 * @since V2.0
 * @author David.Wei
 * @date 2008-4-16
 * @param args
 * @return void
 */
public class Test {

    public static void main(String[] args) throws Exception {
        Analyzer analyzer = new StandardAnalyzer();

        // Store the index in memory:
        Directory directory = new RAMDirectory();
        // To store an index on disk, use this instead:
        // Directory directory = FSDirectory.getDirectory("/tmp/testindex");
        IndexWriter iwriter = new IndexWriter(directory, analyzer, true);
        iwriter.setMaxFieldLength(25000);

        Document doc = new Document();
        String text = "This is the text to be indexed.";
        doc.add(new Field("fieldname", text, Field.Store.YES, Field.Index.TOKENIZED));
        iwriter.addDocument(doc);
        iwriter.optimize();
        iwriter.close();

        // Now search the index:
        IndexSearcher isearcher = new IndexSearcher(directory);
        // Parse a simple query that searches for "text":
        QueryParser parser = new QueryParser("fieldname", analyzer);
        Query query = parser.parse("text");
        Hits hits = isearcher.search(query);
        // assertEquals(1, hits.length());

        // Iterate through the results:
        for (int i = 0; i < hits.length(); i++) {
            Document hitDoc = hits.doc(i);
            System.out.println("This is the text to be indexed." + hitDoc.get("fieldname"));
        }
        isearcher.close();
        directory.close();
    }
}
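The quick-start example keeps the whole index in memory with RAMDirectory, which is convenient for a self-contained test; as the commented-out line shows, switching to FSDirectory persists the same index on disk instead.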
2. The Lucene API is divided into several packages:
- org.apache.lucene.analysis defines an abstract Analyzer API for converting text from a java.io.Reader into a TokenStream, an enumeration of Tokens. A TokenStream is composed by applying TokenFilters to the output of a Tokenizer. A few simple implementations are provided, including StopAnalyzer and the grammar-based StandardAnalyzer. (A short tokenization sketch follows this list.)
- org.apache.lucene.document provides a simple Document class. A document is simply a set of named Fields, whose values may be strings or instances of java.io.Reader.
- org.apache.lucene.index provides two primary classes: IndexWriter, which creates and adds documents to indices; and IndexReader, which accesses the data in the index.
- org.apache.lucene.search provides data structures to represent queries (TermQuery for individual words, PhraseQuery for phrases, and BooleanQuery for boolean combinations of queries) and the abstract Searcher which turns queries into Hits. IndexSearcher implements search over a single IndexReader.
- org.apache.lucene.queryParser uses JavaCC to implement a QueryParser.
- org.apache.lucene.store defines an abstract class for storing persistent data, the Directory, a collection of named files written by an IndexOutput and read by an IndexInput. Two implementations are provided, FSDirectory, which uses a file system directory to store files, and RAMDirectory which implements files as memory-resident data structures.
- org.apache.lucene.util contains a few handy data structures, e.g., BitVector and PriorityQueue.
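To make the analysis package more concrete, here is a minimal sketch of running StandardAnalyzer by hand. It assumes the same Lucene 2.x API used throughout this post (TokenStream.next() and Token.termText() were replaced in later versions); the field name and sample text are arbitrary illustrations.

import java.io.StringReader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;

public class AnalyzerSketch {

    public static void main(String[] args) throws Exception {
        // StandardAnalyzer tokenizes with a grammar, lower-cases, and drops English stop words.
        Analyzer analyzer = new StandardAnalyzer();
        TokenStream stream = analyzer.tokenStream("fieldname",
                new StringReader("The Quick Brown Fox"));

        Token token;
        while ((token = stream.next()) != null) {
            // Expected output: quick, brown, fox ("The" is removed as a stop word)
            System.out.println(token.termText());
        }
        stream.close();
    }
}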
3. To use Lucene, an application should:
- Create Documents by adding Fields;
- Create an IndexWriter and add documents to it with addDocument();
- Call QueryParser.parse() to build a query from a string (queries can also be constructed programmatically; see the sketch after this list); and
- Create an IndexSearcher and pass the query to its search() method.
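The QueryParser step above builds a query from a string, but as the org.apache.lucene.search description notes, queries can also be constructed programmatically from TermQuery, PhraseQuery, and BooleanQuery. Below is a rough sketch under the same Lucene 2.x API; the index path "index" and the example terms are assumptions chosen to resemble the demo programs that follow, not code from the original post.

import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.PhraseQuery;
import org.apache.lucene.search.TermQuery;

public class ProgrammaticQuerySketch {

    public static void main(String[] args) throws Exception {
        // Assumes an existing on-disk index in the "index" directory
        // with a tokenized "contents" field, as created by IndexFiles below.
        IndexSearcher searcher = new IndexSearcher("index");

        // A single-word query: documents whose "contents" field contains "lucene".
        TermQuery term = new TermQuery(new Term("contents", "lucene"));

        // A phrase query: the two terms must appear next to each other.
        PhraseQuery phrase = new PhraseQuery();
        phrase.add(new Term("contents", "search"));
        phrase.add(new Term("contents", "engine"));

        // A boolean combination: must contain "lucene", should also contain the phrase.
        BooleanQuery query = new BooleanQuery();
        query.add(term, BooleanClause.Occur.MUST);
        query.add(phrase, BooleanClause.Occur.SHOULD);

        Hits hits = searcher.search(query);
        System.out.println(hits.length() + " matching documents");
        searcher.close();
    }
}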
4. Some simple examples of code which does this are:
- FileDocument.java contains code to create a Document for a file.
- IndexFiles.java creates an index for all the files contained in a directory.
- DeleteFiles.java deletes some of these files from the index.
- SearchFiles.java prompts for queries and searches an index.
Code detail:
(1) FileDocument.java
/*
 * Licensed to the Apache Software Foundation (ASF) under the Apache License,
 * Version 2.0; see http://www.apache.org/licenses/LICENSE-2.0 for the terms.
 */
import java.io.File;
import java.io.FileReader;

import org.apache.lucene.document.DateTools;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

/** A utility for making Lucene Documents from a File. */
public class FileDocument {

    /**
     * Makes a document for a File.
     * <p>
     * The document has three fields:
     * <ul>
     * <li><code>path</code>--containing the pathname of the file, as a stored,
     * untokenized field;
     * <li><code>modified</code>--containing the last modified date of the file,
     * as a field created by <a href="lucene.document.DateTools.html">DateTools</a>; and
     * <li><code>contents</code>--containing the full contents of the file, as a
     * Reader field.
     * </ul>
     */
    public static Document Document(File f) throws java.io.FileNotFoundException {
        // make a new, empty document
        Document doc = new Document();

        // Add the path of the file as a field named "path". Use a field that is
        // indexed (i.e. searchable), but don't tokenize the field into words.
        doc.add(new Field("path", f.getPath(), Field.Store.YES,
                Field.Index.UN_TOKENIZED));

        // Add the last modified date of the file as a field named "modified".
        // Use a field that is indexed (i.e. searchable), but don't tokenize
        // the field into words.
        doc.add(new Field("modified",
                DateTools.timeToString(f.lastModified(), DateTools.Resolution.MINUTE),
                Field.Store.YES, Field.Index.UN_TOKENIZED));

        // Add the contents of the file to a field named "contents". Specify a
        // Reader, so that the text of the file is tokenized and indexed, but
        // not stored. Note that FileReader expects the file to be in the
        // system's default encoding. If that's not the case, searching for
        // special characters will fail.
        doc.add(new Field("contents", new FileReader(f)));

        // return the document
        return doc;
    }

    private FileDocument() {
    }
}
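Note the design choice for the contents field: because it is added as a Reader rather than a String, the file's text is tokenized and indexed but not stored, so the original text cannot be read back out of the index. As the comment warns, FileReader also uses the platform's default encoding, so searches for special characters can fail if the files are encoded differently.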
(2) IndexFiles.java
/*
 * Licensed to the Apache Software Foundation (ASF) under the Apache License,
 * Version 2.0; see http://www.apache.org/licenses/LICENSE-2.0 for the terms.
 */
import java.io.File;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.util.Date;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;

/** Index all text files under a directory. */
public class IndexFiles {

    private IndexFiles() {
    }

    static final File INDEX_DIR = new File("index");

    /** Index all text files under a directory. */
    public static void main(String[] args) {
        String usage = "java org.apache.lucene.demo.IndexFiles <root_directory>";
        if (args.length == 0) {
            System.err.println("Usage: " + usage);
            System.exit(1);
        }

        if (INDEX_DIR.exists()) {
            System.out.println("Cannot save index to '" + INDEX_DIR
                    + "' directory, please delete it first");
            System.exit(1);
        }

        final File docDir = new File(args[0]);
        if (!docDir.exists() || !docDir.canRead()) {
            System.out.println("Document directory '" + docDir.getAbsolutePath()
                    + "' does not exist or is not readable, please check the path");
            System.exit(1);
        }

        Date start = new Date();
        try {
            IndexWriter writer = new IndexWriter(INDEX_DIR, new StandardAnalyzer(), true);
            System.out.println("Indexing to directory '" + INDEX_DIR + "'...");
            indexDocs(writer, docDir);
            System.out.println("Optimizing...");
            writer.optimize();
            writer.close();

            Date end = new Date();
            System.out.println(end.getTime() - start.getTime() + " total milliseconds");
        } catch (IOException e) {
            System.out.println(" caught a " + e.getClass()
                    + "\n with message: " + e.getMessage());
        }
    }

    static void indexDocs(IndexWriter writer, File file) throws IOException {
        // do not try to index files that cannot be read
        if (file.canRead()) {
            if (file.isDirectory()) {
                String[] files = file.list();
                // an IO error could occur
                if (files != null) {
                    for (int i = 0; i < files.length; i++) {
                        indexDocs(writer, new File(file, files[i]));
                    }
                }
            } else {
                System.out.println("adding " + file);
                try {
                    writer.addDocument(FileDocument.Document(file));
                }
                // at least on windows, some temporary files raise this
                // exception with an "access denied" message;
                // checking if the file can be read doesn't help
                catch (FileNotFoundException fnfe) {
                    ;
                }
            }
        }
    }
}
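As the usage string shows, the indexer is run as java org.apache.lucene.demo.IndexFiles <root_directory>. It writes the index into an "index" directory under the current working directory and exits if that directory already exists, so delete it before re-indexing.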
(3) DeleteFiles.java
/*
 * Licensed to the Apache Software Foundation (ASF) under the Apache License,
 * Version 2.0; see http://www.apache.org/licenses/LICENSE-2.0 for the terms.
 */
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

/** Deletes documents that contain a given term from an index. */
public class DeleteFiles {

    private DeleteFiles() {
    } // singleton

    /** Deletes documents that contain a given term from an index. */
    public static void main(String[] args) {
        String usage = "java org.apache.lucene.demo.DeleteFiles <unique_term>";
        if (args.length == 0) {
            System.err.println("Usage: " + usage);
            System.exit(1);
        }
        try {
            Directory directory = FSDirectory.getDirectory("index");
            IndexReader reader = IndexReader.open(directory);

            Term term = new Term("path", args[0]);
            int deleted = reader.deleteDocuments(term);

            System.out.println("deleted " + deleted + " documents containing " + term);

            // one can also delete documents by their internal id:
            // for (int i = 0; i < reader.maxDoc(); i++) {
            //     System.out.println("Deleting document with id " + i);
            //     reader.delete(i);
            // }

            reader.close();
            directory.close();
        } catch (Exception e) {
            System.out.println(" caught a " + e.getClass()
                    + "\n with message: " + e.getMessage());
        }
    }
}
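Because FileDocument indexes the path field untokenized, the <unique_term> argument must match a document's path exactly as it was indexed. IndexReader.deleteDocuments(term) then deletes every document containing that term and returns how many were deleted, which the program prints.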
(4) SearchFiles.java
/*
 * Licensed to the Apache Software Foundation (ASF) under the Apache License,
 * Version 2.0; see http://www.apache.org/licenses/LICENSE-2.0 for the terms.
 */
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.Date;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.FilterIndexReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.Searcher;

/** Simple command-line based search demo. */
public class SearchFiles {

    /**
     * Use the norms from one field for all fields. Norms are read into memory,
     * using a byte of memory per document per searched field. This can cause
     * search of large collections with a large number of fields to run out of
     * memory. If all of the fields contain only a single token, then the norms
     * are all identical, and a single norm vector may be shared.
     */
    private static class OneNormsReader extends FilterIndexReader {
        private String field;

        public OneNormsReader(IndexReader in, String field) {
            super(in);
            this.field = field;
        }

        public byte[] norms(String field) throws IOException {
            return in.norms(this.field);
        }
    }

    private SearchFiles() {
    }

    /** Simple command-line based search demo. */
    public static void main(String[] args) throws Exception {
        String usage = "Usage: java org.apache.lucene.demo.SearchFiles"
                + " [-index dir] [-field f] [-repeat n] [-queries file] [-raw] [-norms field]";
        if (args.length > 0 && ("-h".equals(args[0]) || "-help".equals(args[0]))) {
            System.out.println(usage);
            System.exit(0);
        }

        String index = "index";
        String field = "contents";
        String queries = null;
        int repeat = 0;
        boolean raw = false;
        String normsField = null;

        for (int i = 0; i < args.length; i++) {
            if ("-index".equals(args[i])) {
                index = args[i + 1];
                i++;
            } else if ("-field".equals(args[i])) {
                field = args[i + 1];
                i++;
            } else if ("-queries".equals(args[i])) {
                queries = args[i + 1];
                i++;
            } else if ("-repeat".equals(args[i])) {
                repeat = Integer.parseInt(args[i + 1]);
                i++;
            } else if ("-raw".equals(args[i])) {
                raw = true;
            } else if ("-norms".equals(args[i])) {
                normsField = args[i + 1];
                i++;
            }
        }

        IndexReader reader = IndexReader.open(index);
        if (normsField != null)
            reader = new OneNormsReader(reader, normsField);

        Searcher searcher = new IndexSearcher(reader);
        Analyzer analyzer = new StandardAnalyzer();

        BufferedReader in = null;
        if (queries != null) {
            in = new BufferedReader(new FileReader(queries));
        } else {
            in = new BufferedReader(new InputStreamReader(System.in, "UTF-8"));
        }
        QueryParser parser = new QueryParser(field, analyzer);

        while (true) {
            if (queries == null) // prompt the user
                System.out.println("Enter query: ");

            String line = in.readLine();
            if (line == null || line.length() == -1)
                break;

            line = line.trim();
            if (line.length() == 0)
                break;

            Query query = parser.parse(line);
            System.out.println("Searching for: " + query.toString(field));

            Hits hits = searcher.search(query);

            if (repeat > 0) { // repeat & time as benchmark
                Date start = new Date();
                for (int i = 0; i < repeat; i++) {
                    hits = searcher.search(query);
                }
                Date end = new Date();
                System.out.println("Time: " + (end.getTime() - start.getTime()) + "ms");
            }

            System.out.println(hits.length() + " total matching documents");

            final int HITS_PER_PAGE = 10;
            for (int start = 0; start < hits.length(); start += HITS_PER_PAGE) {
                int end = Math.min(hits.length(), start + HITS_PER_PAGE);
                for (int i = start; i < end; i++) {
                    if (raw) { // output raw format
                        System.out.println("doc=" + hits.id(i) + " score=" + hits.score(i));
                        continue;
                    }

                    Document doc = hits.doc(i);
                    String path = doc.get("path");
                    if (path != null) {
                        System.out.println((i + 1) + ". " + path);
                        String title = doc.get("title");
                        if (title != null) {
                            System.out.println(" Title: " + doc.get("title"));
                        }
                    } else {
                        System.out.println((i + 1) + ". " + "No path for this document");
                    }
                }

                if (queries != null) // non-interactive
                    break;

                if (hits.length() > end) {
                    System.out.println("more (y/n) ? ");
                    line = in.readLine();
                    if (line.length() == 0 || line.charAt(0) == 'n')
                        break;
                }
            }
        }
        reader.close();
    }
}
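SearchFiles reads queries either interactively from standard input or from a file supplied with -queries. The other flags shown in the usage string let you point at a different index directory (-index), search a field other than contents (-field), repeat each search as a crude benchmark (-repeat), print raw document ids and scores (-raw), or share norms from a single field (-norms). In interactive mode, results are paged ten at a time.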