A Brief Introduction to Lucene

jefferson

Blog category:

Study notes and excerpts

lucene, search engine, Apache, .net, work

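To use Lucene as a search engine, an application has to do two things:

(1) Build the index files. The SearchManager interface below defines the methods that are typically needed.

The SearchManager code is as follows:

Java code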

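import java.util.Date;

// ForumMessage is assumed to be the application's own class for a forum post.
public interface SearchManager {
    public boolean isSearchEnabled();
    public void setSearchEnabled(boolean searchEnabled);
    // Returns true if the SearchManager is currently working
    public boolean isBusy();
    // Returns the percentage of indexing that has completed
    public int getPercentComplete();
    // Whether the index is built automatically.
    // Periodic auto-indexing is implemented through TaskEngine.scheduleTask.
    public boolean isAutoIndexEnabled();
    public void setAutoIndexEnabled(boolean value);
    // Number of minutes between automatic indexing runs
    public int getAutoIndexInterval();
    public void setAutoIndexInterval(int minutes);
    // Returns the time the index was last built
    public Date getLastIndexedDate();
    // For real-time indexing: adds the given post to the index
    public void addToIndex(ForumMessage message);
    public void removeFromIndex(ForumMessage message);
    // Manually indexes content added since the last indexing run
    public void updateIndex();
    // Manually rebuilds the entire index
    public void rebuildIndex();
    // Optimizes the index
    public void optimize();
}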
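The difference between updateIndex (only content added since the last run) and rebuildIndex (everything from scratch) can be sketched in plain Java without Lucene. This is only an illustration under assumptions: the IncrementalIndexSketch and Post classes and their fields are hypothetical stand-ins, not the author's SearchManager implementation.

```java
import java.util.ArrayList;
import java.util.Date;
import java.util.List;

// Hypothetical in-memory stand-in for the indexing side of a SearchManager.
class IncrementalIndexSketch {
    // Hypothetical minimal post: an id plus its creation time.
    static class Post {
        final long id;
        final Date created;
        Post(long id, Date created) { this.id = id; this.created = created; }
    }

    private final List<Long> indexedIds = new ArrayList<>();
    private Date lastIndexedDate = new Date(0L);

    Date getLastIndexedDate() { return lastIndexedDate; }
    List<Long> getIndexedIds() { return indexedIds; }

    // Indexes only posts created after the last run, then records the run time.
    int updateIndex(List<Post> allPosts, Date now) {
        int added = 0;
        for (Post p : allPosts) {
            if (p.created.after(lastIndexedDate)) {
                indexedIds.add(p.id);
                added++;
            }
        }
        lastIndexedDate = now;
        return added;
    }

    // Discards the index and indexes every post again.
    int rebuildIndex(List<Post> allPosts, Date now) {
        indexedIds.clear();
        lastIndexedDate = new Date(0L);
        return updateIndex(allPosts, now);
    }

    public static void main(String[] args) {
        IncrementalIndexSketch index = new IncrementalIndexSketch();
        List<Post> posts = new ArrayList<>();
        posts.add(new Post(1L, new Date(1000L)));
        System.out.println("indexed: " + index.updateIndex(posts, new Date(1500L)));
    }
}
```

A real implementation would also have to handle posts that are edited or deleted between runs, which this sketch ignores.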

· IndexWriter is used to create a new index; it can also add documents to an index that already exists.

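Before text is indexed, it must pass through an Analyzer. The Analyzer is responsible for extracting the index keywords (tokens) from the text. Lucene has several kinds of analyzers:

· SimpleAnalyzer converts English text to lowercase and splits it into words at whitespace and punctuation.

For example, for the sentence "I am Java", SimpleAnalyzer produces the following tokens (lowercased):

token1=i

token2=am

token3=java

· StandardAnalyzer processes English text in a more sophisticated way. Besides creating a token for each word, it can also create index units for special names, e-mail addresses, abbreviations and so on, and it filters out words such as "and" and "the".

· ChineseAnalyzer is dedicated to analyzing Chinese text for indexing. There have been many attempts at Chinese analyzers, such as Che Dong's http://sourceforge.net/projects/weblucene/; this topic is discussed further in later sections.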
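The kind of work an analyzer does can be approximated in a few lines of plain Java. The sketch below is not Lucene code: AnalyzerSketch and its method names are hypothetical, the stop-word set contains only the two words mentioned above, and the stop-word step stands in for just one small part of what StandardAnalyzer does.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical illustration of analyzer behaviour; not part of Lucene.
class AnalyzerSketch {
    // Only the stop words named in the text; a real stop list is longer.
    static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList("and", "the"));

    // SimpleAnalyzer-like step: lowercase, then split at every non-letter character.
    static List<String> simpleTokens(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Stop-word filtering step: drops tokens that appear in STOP_WORDS.
    static List<String> withoutStopWords(List<String> tokens) {
        List<String> kept = new ArrayList<>();
        for (String t : tokens) {
            if (!STOP_WORDS.contains(t)) {
                kept.add(t);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        System.out.println(simpleTokens("I am Java"));
        System.out.println(withoutStopWords(simpleTokens("The index and the query")));
    }
}
```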

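An index is made up of a series of Documents. Each Document consists of one or more Fields, and each Field has a name and a value. You can think of a Document as a row in a relational database and of a Field as one column of that row. An index is typically built as follows:

Java code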

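// Note: this uses the Lucene 1.x API (Field.Text and friends).
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

// Directory in which the index will be created
String indexDir = "/home/";
// Text to be indexed
String text = "welcome here, I am Java,";
Analyzer analyzer = new StandardAnalyzer(); // use StandardAnalyzer
// Create an IndexWriter; true means create a new index
IndexWriter writer = new IndexWriter(indexDir, analyzer, true);
// Create a Document
Document document = new Document();
// Tokenize and index the text
document.add(Field.Text("fieldname", text));
// Add the document to the index
writer.addDocument(document);
writer.close();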
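To make the Document and Field model more concrete, here is a plain Java sketch of a tiny inverted index that stores documents as name/value maps and records which documents contain which token. It is an assumption-laden illustration only: InvertedIndexSketch and its methods are hypothetical, and Lucene's on-disk index format is far more elaborate.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeSet;

// Hypothetical toy index; it only mimics the Document/Field idea.
class InvertedIndexSketch {
    // Each document maps field names to field values, like a database row.
    private final List<Map<String, String>> documents = new ArrayList<>();
    // "field:token" -> ids of the documents containing that token in that field.
    private final Map<String, TreeSet<Integer>> postings = new HashMap<>();

    // Stores the document, indexes every field, and returns the new document id.
    int addDocument(Map<String, String> fields) {
        int docId = documents.size();
        documents.add(fields);
        for (Map.Entry<String, String> f : fields.entrySet()) {
            String[] tokens = f.getValue().toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+");
            for (String token : tokens) {
                if (token.isEmpty()) {
                    continue;
                }
                postings.computeIfAbsent(f.getKey() + ":" + token, k -> new TreeSet<>()).add(docId);
            }
        }
        return docId;
    }

    // Returns the ids of documents whose given field contains the token.
    List<Integer> search(String field, String token) {
        TreeSet<Integer> ids = postings.get(field + ":" + token.toLowerCase(Locale.ROOT));
        if (ids == null) {
            return new ArrayList<Integer>();
        }
        return new ArrayList<Integer>(ids);
    }

    // Returns the stored value of a field for a document.
    String get(int docId, String field) {
        return documents.get(docId).get(field);
    }

    public static void main(String[] args) {
        InvertedIndexSketch index = new InvertedIndexSketch();
        Map<String, String> doc = new HashMap<>();
        doc.put("fieldname", "welcome here, I am Java,");
        index.addDocument(doc);
        System.out.println(index.search("fieldname", "java"));
    }
}
```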

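Field is used in different ways depending on the requirement. Lucene provides four types of Field: Keyword, UnIndexed, UnStored and Text.

· Keyword is not tokenized, but it is indexed and stored in the index verbatim. This type suits values that must be kept exactly as they are, such as URLs, dates, personal names, social security numbers and telephone numbers.

· UnIndexed is neither tokenized nor indexed, but its value is stored in the index verbatim. It is not suitable for very large values; it suits values that are displayed with the results but never searched directly.

· UnStored is the opposite of UnIndexed: it is tokenized and indexed, but its value is not stored in the index. This suits large bodies of text such as post content or page content.

· Text is tokenized, indexed and stored in the index.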
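The four types differ only in three yes/no properties: whether the value is stored, whether it is indexed, and whether it is tokenized. The sketch below encodes the list above as a small lookup table. The FieldTypesSketch class and its Kind enum are hypothetical and exist only for illustration; they are not how Lucene represents fields.

```java
// Hypothetical summary of the four field types described above; not Lucene code.
class FieldTypesSketch {
    enum Kind {
        KEYWORD(true, true, false),    // stored verbatim, indexed, not tokenized
        UNINDEXED(true, false, false), // stored verbatim, not searchable
        UNSTORED(false, true, true),   // searchable, original value not kept
        TEXT(true, true, true);        // tokenized, indexed and stored

        final boolean stored;
        final boolean indexed;
        final boolean tokenized;

        Kind(boolean stored, boolean indexed, boolean tokenized) {
            this.stored = stored;
            this.indexed = indexed;
            this.tokenized = tokenized;
        }
    }

    // Formats one row of the summary table for a field type.
    static String describe(Kind kind) {
        return kind + ": stored=" + kind.stored
                + ", indexed=" + kind.indexed
                + ", tokenized=" + kind.tokenized;
    }

    public static void main(String[] args) {
        for (Kind kind : Kind.values()) {
            System.out.println(describe(kind));
        }
    }
}
```

Read this way, a forum post might use Keyword for its ID, UnStored for its body and UnIndexed for a value that is only displayed.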

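The above covers creating an index and adding to it. How do you delete from it?

Once the index is built, each indexed term corresponds to an org.apache.lucene.index.Term object, so you can create a Term from the Keyword field that identifies the document:

Term messageIDTerm = new Term("mID", Long.toString(indexId));

Then open an IndexReader:

IndexReader reader = IndexReader.open(indexDir);

and delete the matching documents with the IndexReader's delete method:

reader.delete(messageIDTerm);

This assumes that the Keyword field was named mID when the index was built.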
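Conceptually, deleting by term means finding every document whose keyword field holds exactly the given value and flagging it as deleted. The plain Java sketch below illustrates that idea only. DeleteByTermSketch and its methods are hypothetical, and the sketch makes no claim about how Lucene implements deletion internally.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

// Hypothetical toy model of delete-by-keyword; not Lucene code.
class DeleteByTermSketch {
    // Value of the "mID" keyword for each document, by document number.
    private final List<String> ids = new ArrayList<>();
    // One bit per document; a set bit means the document is deleted.
    private final BitSet deleted = new BitSet();

    // Adds a document with the given mID and returns its document number.
    int add(String mId) {
        ids.add(mId);
        return ids.size() - 1;
    }

    // Flags every live document with this exact mID; returns how many were flagged.
    int delete(String mId) {
        int count = 0;
        for (int doc = 0; doc < ids.size(); doc++) {
            if (!deleted.get(doc) && ids.get(doc).equals(mId)) {
                deleted.set(doc);
                count++;
            }
        }
        return count;
    }

    boolean isDeleted(int doc) {
        return deleted.get(doc);
    }

    // Number of documents that are not flagged as deleted.
    int numDocs() {
        return ids.size() - deleted.cardinality();
    }

    public static void main(String[] args) {
        DeleteByTermSketch index = new DeleteByTermSketch();
        index.add(Long.toString(42L));
        System.out.println("deleted: " + index.delete(Long.toString(42L)));
    }
}
```

Because the match is on the exact value, this only works for fields that were indexed without tokenization, which is why the ID is kept in a Keyword field.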

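(2) Once the index has been built, you can search it for specific terms. Typical search code looks like this:

Searcher searcher = new IndexSearcher(indexDir); // create a searcher

A searcher can also be created like this:

Directory searchDirectory = FSDirectory.getDirectory(indexPath, false);
IndexReader reader = IndexReader.open(searchDirectory);
Searcher searcher = new IndexSearcher(reader);

// Use the same analyzer that was used for indexing

Query query = QueryParser.parse(queryString, "body", new StandardAnalyzer());

// The search results are held in a Hits object

Hits hits = searcher.search(query);

// Use hits to read each result's field data (hits.score(i) gives its match score)

for (int i = 0; i < hits.length(); i++) {

System.out.println(hits.doc(i).get("fieldname"));

}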
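The idea of a result list ordered by how well each document matches can be illustrated with a deliberately crude score. The plain Java sketch below ranks documents by how often the query term occurs in them. RankedHitsSketch and Hit are hypothetical names, and raw term frequency is an assumption made for brevity; it is not Lucene's scoring formula.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

// Hypothetical toy ranking by term frequency; not Lucene's scoring.
class RankedHitsSketch {
    // One search result: the document number and its score.
    static class Hit {
        final int docId;
        final int score;
        Hit(int docId, int score) { this.docId = docId; this.score = score; }
    }

    // Counts whole-token, case-insensitive occurrences of the term in the text.
    static int termFrequency(String text, String term) {
        int count = 0;
        String wanted = term.toLowerCase(Locale.ROOT);
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (token.equals(wanted)) {
                count++;
            }
        }
        return count;
    }

    // Returns the matching documents, highest score first.
    static List<Hit> search(List<String> docs, String term) {
        List<Hit> hits = new ArrayList<>();
        for (int i = 0; i < docs.size(); i++) {
            int tf = termFrequency(docs.get(i), term);
            if (tf > 0) {
                hits.add(new Hit(i, tf));
            }
        }
        // Stable sort: documents with equal scores keep their original order.
        Collections.sort(hits, (a, b) -> Integer.compare(b.score, a.score));
        return hits;
    }

    public static void main(String[] args) {
        List<String> docs = new ArrayList<>();
        docs.add("welcome here, I am Java");
        for (Hit hit : search(docs, "java")) {
            System.out.println(hit.docId + " score=" + hit.score);
        }
    }
}
```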

2006-11-16 15:49

Comment 1, jefferson, 2006-11-16

Continuing: using filters
1. Field filter:
Filter.add(new FieldFilter("fieldName", fieldValue));
2. Date filters:
a.
Filter.add(new DateFilter("creationDate", beforeDate, afterDate));

b.
Filter.add(DateFilter.After("creationDate", afterDate));
c.
Filter.add(DateFilter.Before("creationDate", beforeDate));
Finally, pass the filter as an argument to the search method:
searcher.search(query, Filter);
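What the date filters do to a result set can be sketched in plain Java with predicates. This is an illustration under assumptions: DateFilterSketch and its methods are hypothetical, the range bounds are treated as inclusive, and results are reduced to their creation dates to keep the example short.

```java
import java.util.ArrayList;
import java.util.Date;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical toy date filters; not the Lucene Filter classes.
class DateFilterSketch {
    // Accepts dates in the range [from, to], bounds included (an assumption).
    static Predicate<Date> between(Date from, Date to) {
        return d -> !d.before(from) && !d.after(to);
    }

    // Accepts dates on or after the given date.
    static Predicate<Date> after(Date from) {
        return d -> !d.before(from);
    }

    // Accepts dates on or before the given date.
    static Predicate<Date> before(Date to) {
        return d -> !d.after(to);
    }

    // Keeps only the creation dates that pass every filter in the list.
    static List<Date> apply(List<Date> creationDates, List<Predicate<Date>> filters) {
        List<Date> kept = new ArrayList<>();
        for (Date d : creationDates) {
            boolean accepted = true;
            for (Predicate<Date> f : filters) {
                if (!f.test(d)) {
                    accepted = false;
                    break;
                }
            }
            if (accepted) {
                kept.add(d);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<Date> dates = new ArrayList<>();
        dates.add(new Date(1000L));
        dates.add(new Date(2000L));
        List<Predicate<Date>> filters = new ArrayList<>();
        filters.add(after(new Date(1500L)));
        System.out.println(apply(dates, filters).size());
    }
}
```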
