Lucene

jessen163

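Part 1: Building an Index with Lucene
Building an index with Lucene takes two main steps:
Step 1: create the index writer
Step 2: add documents to the index
To prepare, create a folder named lucene on drive F:, and inside it create two folders, test and index.
In the test folder, create the following four txt files:
a.txt contents: 中华人民共和国 (People's Republic of China)
b.txt contents: 人民共和国 (People's Republic)
c.txt contents: 人民 (people)
d.txt contents: 共和国 (republic)

These four files are the ones we will index.
The index folder receives the index output.

With the preparation done, we can start building the index.
Step 1: create the index writer, as follows. The third argument, true, tells Lucene to create a new index in that directory, replacing any existing one.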

IndexWriter writer = new IndexWriter("f:\\lucene\\index",
      new StandardAnalyzer(), true);

Step 2: add documents to the index
writer.addDocument(..);
The complete code is as follows:

package com.peng.mylucene;

import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.Date;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

public class LuceneIndex {

    // The index writer
    private IndexWriter writer = null;

    public static void main(String[] args) {
        try {
            LuceneIndex index = new LuceneIndex();
            Date start = new Date();
            index.writeToIndex();
            Date end = new Date();
            System.out.println("Indexing took " + (end.getTime() - start.getTime()) + " ms");
            index.close();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    public LuceneIndex() throws IOException {
        // Create the index writer: index directory, analyzer (StandardAnalyzer),
        // and create=true to build a new index. Failures are propagated so that
        // the writer is never left null.
        writer = new IndexWriter("f:\\lucene\\index",
                new StandardAnalyzer(), true);
    }

    private Document getDocument(File f) throws IOException {
        // Wrap the file to be indexed in a Document object
        Document doc = new Document();
        // Note: InputStreamReader uses the platform default encoding here,
        // so the txt files must be saved in that encoding
        BufferedReader bufReader = new BufferedReader(new InputStreamReader(
                new FileInputStream(f)));
        // "contents" is tokenized and indexed but not stored;
        // "path" is stored and indexed as a single term
        doc.add(Field.Text("contents", bufReader));
        doc.add(Field.Keyword("path", f.getAbsolutePath()));
        return doc;
    }

    private void writeToIndex() {
        // Turn each file under f:\lucene\test into a Document with
        // getDocument(File), then add it to the index writer
        File folder = new File("f:\\lucene\\test");
        File[] list = folder.isDirectory() ? folder.listFiles() : null;
        if (list == null) {
            return;
        }
        for (File f : list) {
            try {
                Document doc = getDocument(f);
                System.out.println("Indexing: " + f);
                writer.addDocument(doc);
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }

    private void close() {
        try { // Close the index writer
            writer.close();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}

Finally, run the program. The output is:
Indexing: f:\lucene\test\a.txt
Indexing: f:\lucene\test\b.txt
Indexing: f:\lucene\test\c.txt
Indexing: f:\lucene\test\d.txt
Indexing took 63 ms
The index files then appear in f:\lucene\index:
_4.cfs deletable segments

Part 2: An Introductory Example of Searching the Index
Searching an index takes a few steps and uses two main objects: IndexSearcher and Query.
Search steps:
Step 1: create the searcher
searcher = new IndexSearcher(IndexReader.open("f:\\lucene\\index"));
Step 2: wrap the search keyword in a Query object
query = QueryParser.parse(key, "contents", new StandardAnalyzer());
Step 3: run the Query through the searcher to get the results as a Hits object
Hits hit = searcher.search(query);
Finally, print the results held in the Hits object:
for (int i = 0; i < hit.length(); ++i) {
   Document doc = hit.doc(i);
   System.out.println("Result " + i + ", file: "
        + doc.get("path"));
}
The complete program is as follows:

package com.peng.mylucene;

import java.io.IOException;
import java.util.Date;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;

public class LuceneSearch {

    // The IndexSearcher used for all queries
    private IndexSearcher searcher = null;

    public static void main(String[] args) throws IOException {
        LuceneSearch test = new LuceneSearch();
        test.displayResult(test.search("中华"));
        test.displayResult(test.search("人民"));
        test.displayResult(test.search("共和国"));
    }

    public LuceneSearch() throws IOException {
        // IndexReader.open() names the folder that holds the index.
        // Failures are propagated so that the searcher is never left null.
        searcher = new IndexSearcher(IndexReader.open("f:\\lucene\\index"));
    }

    public Hits search(String key) {
        System.out.println("Searching for keyword: " + key);
        try {
            // Wrap the keyword in a Query object
            Query query = QueryParser.parse(key, "contents", new StandardAnalyzer());
            Date start = new Date();
            Hits hit = searcher.search(query);
            Date end = new Date();
            System.out.println("Search finished in " + (end.getTime() - start.getTime())
                    + " ms");
            return hit;
        } catch (Exception e) {
            e.printStackTrace();
        }
        return null;
    }

    public void displayResult(Hits h) {
        // h is null when the search itself failed
        if (h == null || h.length() < 1) {
            System.out.println("no result !");
            return;
        }
        for (int i = 0; i < h.length(); ++i) {
            try {
                Document doc = h.doc(i);
                System.out.println("Result " + i + ", file: "
                        + doc.get("path"));
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
        System.out.println("----------------------");
    }
}

After running the Part 1 program to build the index, run the search program LuceneSearch. The console output is as follows
(comparing it with the four files in f:\lucene\test shows that the results are correct):
Searching for keyword: 中华
Search finished in 47 ms
Result 0, file: f:\lucene\test\a.txt
----------------------
Searching for keyword: 人民
Search finished in 0 ms
Result 0, file: f:\lucene\test\c.txt
Result 1, file: f:\lucene\test\b.txt
Result 2, file: f:\lucene\test\a.txt
----------------------
Searching for keyword: 共和国
Search finished in 0 ms
Result 0, file: f:\lucene\test\d.txt
Result 1, file: f:\lucene\test\b.txt
Result 2, file: f:\lucene\test\a.txt
----------------------
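
The order of the hits (shortest file first) is consistent with the length normalization in Lucene's default similarity, which is 1/sqrt(number of terms in the field). The following is only a rough sketch of that idea in plain Java, not Lucene code. It assumes that the analyzer produced one token per Chinese character (which is what the results above suggest) and that every other scoring factor is equal across the matching files. The class and method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class LengthNormSketch {

    // Same formula as the default length normalization: 1 / sqrt(numTerms)
    static double lengthNorm(int numTerms) {
        return 1.0 / Math.sqrt(numTerms);
    }

    // Returns the names of the files that contain the key, highest length norm first
    static List<String> rank(Map<String, String> files, String key) {
        List<String> hits = new ArrayList<>();
        for (Map.Entry<String, String> e : files.entrySet()) {
            if (e.getValue().contains(key)) {
                hits.add(e.getKey());
            }
        }
        // Assumption: one token per character, so numTerms == text length
        hits.sort(Comparator.comparingDouble(
                (String name) -> lengthNorm(files.get(name).length())).reversed());
        return hits;
    }

    public static void main(String[] args) {
        Map<String, String> files = new LinkedHashMap<>();
        files.put("a.txt", "中华人民共和国");
        files.put("b.txt", "人民共和国");
        files.put("c.txt", "人民");
        files.put("d.txt", "共和国");
        System.out.println(rank(files, "中华"));
        System.out.println(rank(files, "人民"));
        System.out.println(rank(files, "共和国"));
    }
}
```

Under these assumptions the sketch ranks c.txt, b.txt, a.txt for 人民 and d.txt, b.txt, a.txt for 共和国, the same order as the console output above.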

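Summary
From the two parts above, we can see that building an index with Lucene involves four main steps:
1. Extract the text
2. Build a Document
3. Analyze
4. Write the index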
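
To make the four steps concrete, here is a minimal sketch of a toy inverted index in plain Java. It is not how Lucene is implemented. It assumes an analyzer that emits one token per character and treats a multi-character keyword as a phrase (tokens at consecutive positions). TinyInvertedIndex and its methods are hypothetical names.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

public class TinyInvertedIndex {

    // term -> (document name -> positions of the term in that document)
    private final Map<String, Map<String, List<Integer>>> postings = new LinkedHashMap<>();

    // Step 3 (analyze): assumed analyzer, one token per character
    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        for (int i = 0; i < text.length(); i++) {
            tokens.add(String.valueOf(text.charAt(i)));
        }
        return tokens;
    }

    // Steps 2 and 4: take (name, text) as the document and record its tokens
    void addDocument(String name, String text) {
        List<String> tokens = analyze(text);
        for (int pos = 0; pos < tokens.size(); pos++) {
            postings.computeIfAbsent(tokens.get(pos), t -> new LinkedHashMap<>())
                    .computeIfAbsent(name, d -> new ArrayList<>())
                    .add(pos);
        }
    }

    // Phrase search: documents in which the key's tokens occur at consecutive positions
    TreeSet<String> search(String key) {
        TreeSet<String> result = new TreeSet<>();
        List<String> tokens = analyze(key);
        if (tokens.isEmpty()) {
            return result;
        }
        Map<String, List<Integer>> first =
                postings.getOrDefault(tokens.get(0), new LinkedHashMap<>());
        for (Map.Entry<String, List<Integer>> e : first.entrySet()) {
            for (int start : e.getValue()) {
                if (matchesAt(e.getKey(), tokens, start)) {
                    result.add(e.getKey());
                    break;
                }
            }
        }
        return result;
    }

    // True if tokens 1..n-1 follow the first token directly in the given document
    private boolean matchesAt(String doc, List<String> tokens, int start) {
        for (int i = 1; i < tokens.size(); i++) {
            Map<String, List<Integer>> docs = postings.get(tokens.get(i));
            if (docs == null || !docs.containsKey(doc)
                    || !docs.get(doc).contains(start + i)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        TinyInvertedIndex index = new TinyInvertedIndex();
        // Step 1 (extract text) is skipped here: the text is given inline
        index.addDocument("a.txt", "中华人民共和国");
        index.addDocument("b.txt", "人民共和国");
        index.addDocument("c.txt", "人民");
        index.addDocument("d.txt", "共和国");
        System.out.println(index.search("中华"));
        System.out.println(index.search("人民"));
        System.out.println(index.search("共和国"));
    }
}
```

The sketch finds the same sets of files as the LuceneSearch program, but returns them in alphabetical order because it does no scoring.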

==================================================
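Lucene 1.4 provides four main kinds of Field:
Keyword, UnStored, UnIndexed, Text

In Lucene 2.0, the kind of a Field is determined by combining three inner classes: Field.Index, Field.Store, and Field.TermVector (term vectors). The options are:
Field.Store.COMPRESS: store the value compressed; intended for long text or binary data
Field.Store.YES: store the value
Field.Store.NO: do not store the value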

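Field.Index.NO: do not index
Field.Index.TOKENIZED: tokenize, then index
Field.Index.UN_TOKENIZED: index without tokenizing
Field.Index.NO_NORMS: index without tokenizing, and do not store norms for the field. Norms normally take one byte per indexed field per document, so this saves space, but index-time boosting and field length normalization are disabled for the field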

Field.TermVector.NO: do not store term vectors
Field.TermVector.YES: store term vectors
Field.TermVector.WITH_POSITIONS: store term vectors, plus token position information
Field.TermVector.WITH_OFFSETS: store term vectors, plus token offset information
Field.TermVector.WITH_POSITIONS_OFFSETS: store term vectors, plus both token position and token offset information
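
To illustrate what these options record, here is a small plain-Java sketch of the information a term vector holds for one field of one document: each term with its frequency, and optionally the token positions and character offsets. It is not Lucene's storage format. The whitespace tokenizer, the sample sentence, and all names in it are assumptions chosen for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TermVectorSketch {

    // What is recorded for one term of one document
    static class TermEntry {
        int frequency;                                      // number of occurrences
        final List<Integer> positions = new ArrayList<>();  // token index of each occurrence
        final List<int[]> offsets = new ArrayList<>();      // {start, end} character offsets
    }

    // Builds the term vector of one field value, splitting on whitespace
    static Map<String, TermEntry> build(String text) {
        Map<String, TermEntry> vector = new LinkedHashMap<>();
        Matcher m = Pattern.compile("\\S+").matcher(text);
        int position = 0;
        while (m.find()) {
            TermEntry entry = vector.computeIfAbsent(m.group(), t -> new TermEntry());
            entry.frequency++;
            entry.positions.add(position++);
            entry.offsets.add(new int[] {m.start(), m.end()});
        }
        return vector;
    }

    public static void main(String[] args) {
        for (Map.Entry<String, TermEntry> e : build("to be or not to be").entrySet()) {
            TermEntry t = e.getValue();
            StringBuilder offsets = new StringBuilder();
            for (int[] o : t.offsets) {
                offsets.append("[").append(o[0]).append(",").append(o[1]).append(")");
            }
            System.out.println(e.getKey() + " freq=" + t.frequency
                    + " positions=" + t.positions + " offsets=" + offsets);
        }
    }
}
```

Positions count tokens while offsets count characters, which is why the two are separate options.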

The Field constructors also use these three inner classes:
Field(String, byte[],Field.Store)
Field(String, Reader)
Field(String, Reader, Field.TermVector)
Field(String, String, Field.Store, Field.Index)
Field(String, String, Field.Store, Field.Index, Field.TermVector)

Of these, Field(String, Reader) and Field(String, Reader, Field.TermVector) default to Field.Index.TOKENIZED and Field.Store.NO. The 1.4 field types map onto the 2.0 options quite simply (the mapping is not strictly necessary, but it helps in understanding them):
Keyword <==> Store.YES,Index.UN_TOKENIZED;
UnIndexed <==> Store.YES,Index.NO;
UnStored <==> Store.NO,Index.TOKENIZED;
Text(String, Reader) <==> Store.NO,Index.TOKENIZED;
Text(String,String) <==> Store.YES,Index.TOKENIZED.
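
The mapping above can also be written down as a small lookup table. This is only an illustrative sketch: the enums below are stand-ins that mirror the names of the Lucene 2.0 options and are not the real Lucene classes, and LegacyType with its two helper methods is a hypothetical name.

```java
public class FieldTypeMapping {

    // Stand-ins for the Field.Store and Field.Index options used in the mapping
    enum Store { YES, NO }
    enum Index { NO, TOKENIZED, UN_TOKENIZED }

    // The Lucene 1.4 field types and their Lucene 2.0 equivalents
    enum LegacyType {
        KEYWORD(Store.YES, Index.UN_TOKENIZED),
        UN_INDEXED(Store.YES, Index.NO),
        UN_STORED(Store.NO, Index.TOKENIZED),
        TEXT_READER(Store.NO, Index.TOKENIZED),
        TEXT_STRING(Store.YES, Index.TOKENIZED);

        final Store store;
        final Index index;

        LegacyType(Store store, Index index) {
            this.store = store;
            this.index = index;
        }

        // Can a search match on this field?
        boolean isSearchable() {
            return index != Index.NO;
        }

        // Can the original value be read back from a search hit?
        boolean isRetrievable() {
            return store == Store.YES;
        }
    }

    public static void main(String[] args) {
        for (LegacyType t : LegacyType.values()) {
            System.out.println(t + " -> Store." + t.store + ", Index." + t.index
                    + ", searchable=" + t.isSearchable()
                    + ", retrievable=" + t.isRetrievable());
        }
    }
}
```

In the indexing program from Part 1, "path" is a Keyword field (searchable as a whole and retrievable), while "contents" is a Text field built from a Reader (searchable but not retrievable), which is why the search program prints only the path.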

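Field.Store means "whether to store": whether the content of the Field is kept verbatim in the index, so that it can be read back from a search hit.

Field.Index means "whether to index": whether the data in the Field should be findable by a later search. A Field that is not indexed usually only stores auxiliary information. Field.Index also decides whether the value is tokenized (TOKENIZED) or indexed as a single term (UN_TOKENIZED).

Field.TermVector means "whether to store term vectors": whether the terms of the Field, with their frequencies and optionally their positions and offsets, are recorded for each document. It does not control tokenization.

A Reader argument usually means that the data comes from a text stream, and the amount of data is generally large.

Values such as URLs, file system paths, dates and times, personal names, ID card numbers, and telephone numbers are usually indexed and stored in full in the index, but generally do not need to be tokenized. For these, use the fourth constructor above, with Field.Store.YES and Field.Index.UN_TOKENIZED as the third and fourth arguments.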

2011-09-09 14:08