Translating Documentation Bit by Bit (Part 1): lucene_tutorial

nj_link


Category: Documentation translation

Tags: lucene, basics, api

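Introduction to Lucene
Lucene is an open-source search library written in Java. It is popular because it offers document-based, efficient, and simple search. This tutorial covers the Lucene knowledge needed in complex enterprise applications.
How does a Lucene search work?
Every Lucene search goes through one or more of the following steps.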

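Step	Title	Description
2	Build the document	Build documents in a form that the application can recognize.
3	Analyze the document	Before the index is built, decide which parts of the content need to be processed by the analyzer.
4	Index the document	Once documents are analyzed, they are indexed so that they can be retrieved through the index. An index works like the index of a book: it lets you jump straight to the right page instead of searching page by page.
5	User interface for search	Once the index is built, the application must provide an interface so that users can search by entering text.
6	Build the Query	A Query object must be created from the user's request so that it can be passed as the parameter for looking up relevant information.
7	Run the Query	Using the Query object, find the matching documents and their associated details.
8	Render the results	Once the result set is returned, decide how to present it so that users see the useful information at first glance.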
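
To make the indexing and search steps more concrete, here is a minimal sketch in plain Java with no Lucene dependency. The class InvertedIndexSketch and its methods are hypothetical names invented for this illustration, and the lowercase-and-split tokenization is only a stand-in for a real analyzer. It does not show how Lucene is implemented internally, only the general idea of mapping each term to the documents that contain it.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

// Hypothetical illustration only: a tiny in-memory inverted index.
public class InvertedIndexSketch {

   // term -> sorted ids of the documents that contain the term
   private final Map<String, TreeSet<Integer>> postings = new HashMap<>();
   private final List<String> documents = new ArrayList<>();

   // Steps 2-4: build the document, analyze it into terms, index it.
   public int addDocument(String text) {
      int docId = documents.size();
      documents.add(text);
      for (String term : text.toLowerCase().split("\\W+")) {
         if (!term.isEmpty()) {
            postings.computeIfAbsent(term, t -> new TreeSet<>()).add(docId);
         }
      }
      return docId;
   }

   // Steps 6-7: treat the query as a single term and look it up.
   public List<Integer> search(String queryTerm) {
      TreeSet<Integer> hits = postings.get(queryTerm.toLowerCase());
      return hits == null ? new ArrayList<>() : new ArrayList<>(hits);
   }

   public static void main(String[] args) {
      InvertedIndexSketch index = new InvertedIndexSketch();
      index.addDocument("Mohan studies physics");
      index.addDocument("Anita studies chemistry");
      index.addDocument("Ravi and Mohan play cricket");

      // Step 8: render the results (here, just the matching document ids).
      System.out.println("mohan -> " + index.search("Mohan"));
      System.out.println("studies -> " + index.search("studies"));
   }
}
```

The lookup touches only the entry for the query term, which is the "jump to the right page" behavior described in step 4.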

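Beyond these steps, an application can also provide an administration page that lets administrators control search behavior based on user profiles. The quality of the search results is an important measure of how good an application is.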

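Lucene's role in an application
In an application, Lucene typically handles steps 2 through 7 above. It encapsulates the core indexing and searching operations, while handling the user's query input and packaging the result set for display are left to the application. The next section walks through a simple search application built with the Lucene library.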

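Lucene: First Application
Let's try out the Lucene framework with a practical example. We will write a simple application that prints the number of search hits and also shows the index being created along the way.
1. Create a Java application
Open the Eclipse IDE and choose File -> New -> Project, then select Java Project. Enter LuceneFirstApplication as the project name and click through to Finish. This creates the Java project.
2. Add the dependency
Right-click the project, choose Build Path -> Configure Build Path, and add the lucene-core-3.6.2 library in the dialog that opens.
3. Create the source files
Right-click the project's src folder and choose New -> Package to add the package com.tutorialspoint.lucene. In that package, create LuceneTester.java and the other files listed below.
LuceneConstants.java: the constants class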

package com.tutorialspoint.lucene; 

public class LuceneConstants {
   public static final String CONTENTS="contents";
   public static final String FILE_NAME="filename";
   public static final String FILE_PATH="filepath";
   public static final int MAX_SEARCH = 10;
}

TextFileFilter.java: accepts only .txt files
package com.tutorialspoint.lucene;

import java.io.File;
import java.io.FileFilter; 

public class TextFileFilter implements FileFilter {

   @Override
   public boolean accept(File pathname) {
      return pathname.getName().toLowerCase().endsWith(".txt");
   }
}
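
As a quick check of the filtering rule, the sketch below applies the same name test to a few file names. To keep it compilable on its own, it restates the rule as a lambda instead of referencing TextFileFilter; the class TextFileFilterDemo and the sample file names are assumptions made for this illustration.

```java
import java.io.File;
import java.io.FileFilter;

// Hypothetical demo class, not part of the tutorial project.
public class TextFileFilterDemo {

   // Same rule as TextFileFilter: case-insensitive match on the .txt extension.
   public static final FileFilter TXT_ONLY =
      pathname -> pathname.getName().toLowerCase().endsWith(".txt");

   public static void main(String[] args) {
      // accept() only inspects the name here, so the files need not exist.
      String[] names = {"record1.txt", "NOTES.TXT", "photo.png", "archive.txt.bak"};
      for (String name : names) {
         System.out.println(name + " -> " + TXT_ONLY.accept(new File(name)));
      }
   }
}
```

Because the name is lowercased before the comparison, NOTES.TXT is accepted, while archive.txt.bak is rejected since it does not end in .txt.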

Indexer.java: indexes the raw data so that it can be searched with the Lucene library
package com.tutorialspoint.lucene;

import java.io.File;
import java.io.FileFilter;
import java.io.FileReader;
import java.io.IOException;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.CorruptIndexException;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class Indexer {

   private IndexWriter writer; 
   public Indexer(String indexDirectoryPath) throws IOException{
      //this directory will contain the indexes
      Directory indexDirectory = 
         FSDirectory.open(new File(indexDirectoryPath));

      //create the indexer
      writer = new IndexWriter(indexDirectory, 
         new StandardAnalyzer(Version.LUCENE_36),true,
         IndexWriter.MaxFieldLength.UNLIMITED);
   }

   public void close() throws CorruptIndexException, IOException{
      writer.close();
   }

   private Document getDocument(File file) throws IOException{
      Document document = new Document();

      //index file contents
      Field contentField = new Field(LuceneConstants.CONTENTS,  new FileReader(file));
      //index file name
      Field fileNameField = new Field(LuceneConstants.FILE_NAME,  file.getName(),  Field.Store.YES,Field.Index.NOT_ANALYZED);
      //index file path
      Field filePathField = new Field(LuceneConstants.FILE_PATH,  file.getCanonicalPath(),  Field.Store.YES,Field.Index.NOT_ANALYZED);

      document.add(contentField);
      document.add(fileNameField);
      document.add(filePathField);

      return document;
   }   

   private void indexFile(File file) throws IOException{
      System.out.println("Indexing "+file.getCanonicalPath());
      Document document = getDocument(file);
      writer.addDocument(document);
   }

   public int createIndex(String dataDirPath, FileFilter filter)  throws IOException{
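      //get all files in the data directory
      File[] files = new File(dataDirPath).listFiles();
      //listFiles() returns null if the path is not a readable directory
      if (files == null) {
         throw new IOException("Cannot list data directory: " + dataDirPath);
      }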

      for (File file : files) {
         if(!file.isDirectory()
            && !file.isHidden()
            && file.exists()
            && file.canRead()
            && filter.accept(file)
         ){
            indexFile(file);
         }
      }
      return writer.numDocs();
   }
}

Searcher.java: searches the index created by Indexer

package com.tutorialspoint.lucene;

import java.io.File;
import java.io.IOException;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.CorruptIndexException;
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class Searcher {


   IndexSearcher indexSearcher;
   QueryParser queryParser;
   Query query;
   
   public Searcher(String indexDirectoryPath) 
      throws IOException{
      Directory indexDirectory = 
         FSDirectory.open(new File(indexDirectoryPath));
      indexSearcher = new IndexSearcher(indexDirectory);
      queryParser = new QueryParser(Version.LUCENE_36,
         LuceneConstants.CONTENTS,
         new StandardAnalyzer(Version.LUCENE_36));
   }
   
   public TopDocs search( String searchQuery) 
      throws IOException, ParseException{
      query = queryParser.parse(searchQuery);
      return indexSearcher.search(query, LuceneConstants.MAX_SEARCH);
   }

   public Document getDocument(ScoreDoc scoreDoc) 
      throws CorruptIndexException, IOException{
      return indexSearcher.doc(scoreDoc.doc);

   }

   public void close() throws IOException{
      indexSearcher.close();
   }
}

LuceneTester.java: tests the indexing and search capabilities of the Lucene library

package com.tutorialspoint.lucene;

import java.io.IOException;

import org.apache.lucene.document.Document;
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;

public class LuceneTester {


   String indexDir = "E:\\Lucene\\Index";
   String dataDir = "E:\\Lucene\\Data";
   Indexer indexer;
   Searcher searcher;

   public static void main(String[] args) {
      LuceneTester tester;
      try {
         tester = new LuceneTester();
         tester.createIndex();
         tester.search("Mohan");
      } catch (IOException e) {
         e.printStackTrace();
      } catch (ParseException e) {
         e.printStackTrace();
      }
   }

   private void createIndex() throws IOException{
      indexer = new Indexer(indexDir);
      int numIndexed;
      long startTime = System.currentTimeMillis();

      numIndexed = indexer.createIndex(dataDir, new TextFileFilter());
      long endTime = System.currentTimeMillis();
      indexer.close();
      System.out.println(numIndexed+" File indexed, time taken: "
         +(endTime-startTime)+" ms"); 

   }

   private void search(String searchQuery) throws IOException, ParseException{
      searcher = new Searcher(indexDir);
      long startTime = System.currentTimeMillis();
      TopDocs hits = searcher.search(searchQuery);
      long endTime = System.currentTimeMillis();
   
      System.out.println(hits.totalHits +
         " documents found. Time :" + (endTime - startTime));
      for(ScoreDoc scoreDoc : hits.scoreDocs) {
         Document doc = searcher.getDocument(scoreDoc);
         System.out.println("File: "
            + doc.get(LuceneConstants.FILE_PATH));
      }
      searcher.close();
   }
}

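4. Create the data and index directories
Create two folders, E:\Lucene\Data and E:\Lucene\Index. The first holds the data to be searched and the second stores the index. In E:\Lucene\Data, create 10 files named record1.txt through record10.txt. Each text file contains a student's name and other basic details.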
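
If you would rather not create the ten files by hand, a small helper along the following lines can generate them. This is a sketch under stated assumptions: the class SampleDataGenerator is hypothetical, and the record contents are placeholders, since the tutorial does not show the real files. The name Mohan is placed in record4.txt only so that the sample search for "Mohan" has something to find.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper, not part of the tutorial project.
public class SampleDataGenerator {

   // Writes record1.txt .. record<count>.txt into dataDir and returns the number of files written.
   public static int generate(Path dataDir, int count) throws IOException {
      Files.createDirectories(dataDir);
      for (int i = 1; i <= count; i++) {
         // Placeholder content (assumption): the real student records are not shown.
         String name = (i == 4) ? "Mohan" : "Student" + i;
         String record = "Name: " + name + "\nRoll No: " + i + "\n";
         Files.write(dataDir.resolve("record" + i + ".txt"),
            record.getBytes(StandardCharsets.UTF_8));
      }
      return count;
   }

   public static void main(String[] args) throws IOException {
      // Pass the data folder as the first argument, for example E:\Lucene\Data.
      Path dataDir = Paths.get(args.length > 0 ? args[0] : "Data");
      System.out.println(generate(dataDir, 10) + " files written to "
         + dataDir.toAbsolutePath());
   }
}
```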

5. Run the program
Run LuceneTester with Run or Ctrl + F11. The console prints the following:

Indexing E:\Lucene\Data\record1.txt
Indexing E:\Lucene\Data\record10.txt
Indexing E:\Lucene\Data\record2.txt
Indexing E:\Lucene\Data\record3.txt
Indexing E:\Lucene\Data\record4.txt
Indexing E:\Lucene\Data\record5.txt
Indexing E:\Lucene\Data\record6.txt
Indexing E:\Lucene\Data\record7.txt
Indexing E:\Lucene\Data\record8.txt
Indexing E:\Lucene\Data\record9.txt
10 File indexed, time taken: 109 ms
1 documents found. Time :0
File: E:\Lucene\Data\record4.txt

Once the program has run, the Index folder contains the generated index files.

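Lucene: Indexing Classes
Indexing is one of the core functions that Lucene provides. The diagram in the original tutorial illustrates the indexing process and the classes it uses. IndexWriter is the most important, central component of the indexing process.

Documents containing Fields are added to the IndexWriter, which analyzes them with the configured Analyzer and then creates, opens, or edits indexes as required, storing or updating them in a Directory. IndexWriter is used to create or update indexes; it is not used to read them.
The following classes are commonly used during indexing.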

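No.	Class	Description
1	IndexWriter	The core class that creates and updates indexes during the indexing process.
2	Directory	Represents the storage location of the indexes.
3	Analyzer	Responsible for analyzing a document and extracting the tokens (words) from its text. IndexWriter cannot build an index until this analysis has been done.
4	Document	A virtual document made up of Fields, where a Field can hold the physical document's contents, its metadata, and so on. An Analyzer can only work on Document objects.
5	Field	The smallest unit that can be indexed, or the starting point of indexing. It holds a key-value pair. Lucene can index only text or numeric content.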
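
The Indexer class earlier passes the file contents through the analyzer but creates the file name and file path fields with Field.Index.NOT_ANALYZED. As a rough illustration of the difference, the plain-Java sketch below shows the terms each choice would produce. FieldTermsSketch and its terms method are hypothetical names, and the lowercase-and-split rule only approximates what a real analyzer such as StandardAnalyzer does.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical illustration only; it does not use the Lucene API.
public class FieldTermsSketch {

   // Returns the terms that would be indexed for a field value.
   public static List<String> terms(String value, boolean analyzed) {
      if (!analyzed) {
         // Not analyzed: the whole value is indexed as one exact term.
         return Collections.singletonList(value);
      }
      // Analyzed: simplified stand-in for an analyzer (lowercase, split on non-word characters).
      List<String> result = new ArrayList<>();
      for (String token : value.toLowerCase().split("\\W+")) {
         if (!token.isEmpty()) {
            result.add(token);
         }
      }
      return result;
   }

   public static void main(String[] args) {
      System.out.println(terms("Mohan Kumar, Class 10", true));
      System.out.println(terms("E:\\Lucene\\Data\\record4.txt", false));
   }
}
```

A field that is not analyzed therefore matches only a query for its exact value, which suits identifiers such as file paths.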

Note: the rest will continue tomorrow.