Lucene highlighting

he3109006290

1. Configuring the dictionary location for Paoding (庖丁解牛), a Chinese analyzer for Lucene (reposted)

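The documentation shipped with Paoding says the dictionary location can be configured through an environment variable. The original text follows.
The Paoding Chinese analyzer needs a set of dictionaries, stored together under a single directory called the dictionary installation directory. It can be any directory on the file system and does not depend on the application's working directory. Copying the dictionaries into this directory is called installing the dictionaries; adding, removing, or modifying dictionaries under it is called customizing the dictionaries.

On Linux, consider installing the dictionaries in a directory on a partition dedicated to data. For example, the author uses /data as a separate partition and keeps the dictionaries in /data/paoding/dic.
On Windows, consider a directory on a partition other than the system drive. For example, the author might keep the dictionaries in E:/data/paoding/dic.
After installing the dictionaries, set the system environment variable PAODING_DIC_HOME to point to the dictionary installation directory.
On Linux, append the following two lines to the end of /etc/profile, then save the file and exit.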
PAODING_DIC_HOME=/data/paoding/dic
export PAODING_DIC_HOME
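On Windows, open the "Advanced" tab of the "My Computer" properties, go to the "Environment Variables" editor, and create a new variable whose name is PAODING_DIC_HOME and whose value is E:/data/paoding/dic.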
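Before starting the application it can help to confirm that the JVM actually sees the variable, since a shell opened before /etc/profile was edited will not have it. The following is a minimal sketch, not part of Paoding; the class name `DicHomeEnvCheck` and its method are hypothetical helpers introduced only for illustration.

```java
import java.io.File;
import java.util.Map;

public class DicHomeEnvCheck {
    // Name of the environment variable described in the text above.
    static final String ENV_NAME = "PAODING_DIC_HOME";

    /** Returns the configured directory, or null when the variable is unset or blank. */
    static File dicHomeFrom(Map<String, String> env) {
        String value = env.get(ENV_NAME);
        if (value == null || value.trim().isEmpty()) {
            return null;
        }
        return new File(value.trim());
    }

    public static void main(String[] args) {
        // System.getenv() returns the environment of the running JVM process.
        File home = dicHomeFrom(System.getenv());
        if (home == null) {
            System.out.println(ENV_NAME + " is not set for this JVM");
        } else {
            System.out.println(ENV_NAME + " = " + home
                    + " (directory exists: " + home.isDirectory() + ")");
        }
    }
}
```

Passing the environment in as a map keeps the lookup easy to exercise without changing real system settings.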

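However, an error message pointed me to another way to configure this: set paoding.dic.home in paoding-dic-home.properties.
A copy of this file lives in
paoding-analysis-2.0.4-beta\classes
and that is the one we can edit. Its original content is: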

#values are "system-env" or "this";
#if value is "this" , using the paoding.dic.home as dicHome if configed!
#paoding.dic.home.config-fisrt=system-env

#dictionary home (directory)
#"classpath:xxx" means dictionary home is in classpath.
#e.g "classpath:dic" means dictionaries are in "classes/dic" directory or any other classpath directory
#paoding.dic.home=dic

#seconds for dic modification detection
#paoding.dic.detector.interval=60

We change it to the following:
#values are "system-env" or "this";
#if value is "this" , using the paoding.dic.home as dicHome if configed!
# Set to "this" so that this file's setting is used instead of the environment variable
paoding.dic.home.config-fisrt=this

#dictionary home (directory)
#"classpath:xxx" means dictionary home is in classpath.
#e.g "classpath:dic" means dictionaries are in "classes/dic" directory or any other classpath directory
# Set to the directory that holds our dictionaries
paoding.dic.home=E:/lib/paoding-analysis-2.0.4-beta/dic/

#seconds for dic modification detection
#paoding.dic.detector.interval=60

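As the last step, open paoding-analysis.jar with WinRAR, WinZip, or a similar tool and replace the paoding-dic-home.properties inside it with the edited file.

OK, this jar is now our own customized copy.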
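The comments in the properties file describe a precedence rule between the file and the environment variable. The sketch below illustrates that rule with plain `java.util.Properties`; it is not Paoding's actual implementation. The class `DicHomeResolverSketch` is hypothetical, and two details are assumptions: that a missing `config-fisrt` key behaves like `system-env`, and that the other source is used as a fallback when the preferred one is absent. The key really is spelled `config-fisrt` in the file shown above.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;

public class DicHomeResolverSketch {

    /** Picks the dictionary home from the properties or the environment (illustrative rule only). */
    static String resolve(Properties props, Map<String, String> env) {
        // Assumption: a missing key behaves like "system-env".
        String first = props.getProperty("paoding.dic.home.config-fisrt", "system-env");
        String fromProps = props.getProperty("paoding.dic.home");
        String fromEnv = env.get("PAODING_DIC_HOME");

        // "this" means: prefer paoding.dic.home from the file, if it is configured.
        if ("this".equals(first) && fromProps != null) {
            return fromProps;
        }
        // Otherwise prefer the environment variable.
        if (fromEnv != null) {
            return fromEnv;
        }
        // Assumption: fall back to the file's value (may be null) when the variable is unset.
        return fromProps;
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        props.setProperty("paoding.dic.home.config-fisrt", "this");
        props.setProperty("paoding.dic.home", "E:/lib/paoding-analysis-2.0.4-beta/dic/");

        Map<String, String> env = new HashMap<String, String>();
        env.put("PAODING_DIC_HOME", "E:/data/paoding/dic");

        System.out.println("resolved dictionary home: " + resolve(props, env));
    }
}
```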

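2. Example

Task 1 is now complete: the news display work is done. Next comes the search engine part.

The search utility classes live in the com.zly.indexManager package.

Note: this program uses the Paoding Chinese analyzer, which needs a Chinese dictionary. My dictionary is in c:\dic. Paoding also needs the environment variable PAODING_DIC_HOME, set here to c:\dic (that is, the directory containing your dictionary files).

The code is as follows.

The index creation class, IndexCreateUtil:

Java code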

package com.zly.indexManager;

import java.io.File;
import java.text.DateFormat;
import java.text.SimpleDateFormat;
import java.util.List;
import net.paoding.analysis.analyzer.PaodingAnalyzer;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.hibernate.SessionFactory;
import org.hibernate.cfg.AnnotationConfiguration;
import org.hibernate.cfg.Configuration;
import org.hibernate.Session;
import org.htmlparser.Parser;
import com.zly.test.entity.NewsItem;

public class IndexCreateUtil {

    @SuppressWarnings("unchecked")
    public void createIndexForNews() throws Exception {
        // Directory that holds the index
        File indexFile = new File("c:/index/news");
        // Use the Paoding analyzer
        Analyzer analyzer = new PaodingAnalyzer();
        // Create the IndexWriter from the index directory and the analyzer;
        // true means a new index is created, replacing any existing one
        IndexWriter indexWriter = new IndexWriter(indexFile, analyzer, true);
        // Read all news records from the database so they can be indexed
        Configuration cfg = new AnnotationConfiguration().configure();
        SessionFactory factory = cfg.buildSessionFactory();
        Session session = factory.openSession();
        List<NewsItem> list = session.createQuery(" from NewsItem").list();
        DateFormat format = new SimpleDateFormat("yyyy年MM月dd日 HH时mm分ss秒");
        // Index every news entity
        for (NewsItem newsItem : list) {
            // Create a Lucene document
            Document doc = new Document();
            // News title
            String newsTitle = newsItem.getNewsTitle();
            // News content (HTML)
            String newsContent = newsItem.getNewsContent();
            // Publish time, formatted as a string
            String publishDate = format.format(newsItem.getPublishTime());
            // Primary key id
            String id = newsItem.getId() + "";
            // Add the title. It is searched and highlighted, so it is TOKENIZED
            // and its term vector is WITH_POSITIONS_OFFSETS
            doc.add(new Field("title", newsTitle, Field.Store.YES, Field.Index.TOKENIZED, Field.TermVector.WITH_POSITIONS_OFFSETS));
            // Use htmlparser to get plain text from the content HTML
            // (only the first top-level node is used)
            Parser parser = new Parser();
            parser.setInputHTML(newsContent);
            String strings = parser.parse(null).elementAt(0).toPlainTextString().trim();
            // Add the content, configured like the title but stored compressed
            doc.add(new Field("content", strings, Field.Store.COMPRESS, Field.Index.TOKENIZED, Field.TermVector.WITH_POSITIONS_OFFSETS));
            // Add the publish time. Results are sorted on this field in descending order,
            // so it is indexed as a single untokenized term (a tokenized field cannot be
            // used for sorting); it is not highlighted, so no term vector is needed
            doc.add(new Field("date", publishDate, Field.Store.YES, Field.Index.UN_TOKENIZED, Field.TermVector.NO));
            // Add the primary key: stored only, not indexed, not highlighted
            doc.add(new Field("id", id, Field.Store.YES, Field.Index.NO, Field.TermVector.NO));
            indexWriter.addDocument(doc);
        }
        // Optimize the index and close the writer, which flushes it to disk
        indexWriter.optimize();
        indexWriter.close();
        // Close the session
        session.close();
    }

    public static void main(String[] args) throws Exception {
        IndexCreateUtil util = new IndexCreateUtil();
        util.createIndexForNews();
    }
}
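The search code sorts on the `date` field as a string, so the approach depends on the formatted timestamps sorting lexicographically in the same order as the underlying times. The fixed-width, zero-padded pattern used above has that property. The sketch below checks it with the standard library only; `DateSortSketch` is a hypothetical class for illustration, and the pattern is written with Unicode escapes purely so the file compiles under any source encoding.

```java
import java.text.SimpleDateFormat;
import java.util.Date;

public class DateSortSketch {
    // Same pattern as the indexing code: yyyy年MM月dd日 HH时mm分ss秒
    static final String PATTERN =
            "yyyy\u5e74MM\u6708dd\u65e5 HH\u65f6mm\u5206ss\u79d2";

    /** Formats a date the way the "date" field is written to the index. */
    static String format(Date d) {
        return new SimpleDateFormat(PATTERN).format(d);
    }

    /** True when string order and chronological order agree for the two dates. */
    static boolean sameOrder(Date a, Date b) {
        int chronological = Integer.signum(a.compareTo(b));
        int lexicographic = Integer.signum(format(a).compareTo(format(b)));
        return chronological == lexicographic;
    }

    public static void main(String[] args) {
        Date earlier = new Date(1000000000000L); // a moment in 2001
        Date later = new Date(1300000000000L);   // a moment in 2011
        System.out.println(format(earlier));
        System.out.println(format(later));
        System.out.println("string order matches time order: " + sameOrder(earlier, later));
    }
}
```

A pattern without zero padding (for example `M` instead of `MM`) would break this, because "10" sorts before "9" as a string.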

The code that searches the index is as follows:

Java code

package com.zly.indexManager;

import java.io.File;
import java.util.ArrayList;
import java.util.List;
import net.paoding.analysis.analyzer.PaodingAnalyzer;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.Sort;
import org.apache.lucene.search.highlight.Highlighter;
import org.apache.lucene.search.highlight.QueryScorer;
import org.apache.lucene.search.highlight.SimpleFragmenter;
import org.apache.lucene.search.highlight.SimpleHTMLFormatter;
import com.zly.test.entity.SearchResultBean;

public class IndexSearchUtil {

    public List<SearchResultBean> getSearchResult(String searchWhich, String searchParam, int firstResult, int maxResult) throws Exception {
        // Directory that holds the index
        File indexFile = new File("c:/index/news");
        // IndexReader that reads the index
        IndexReader reader = IndexReader.open(indexFile);
        // Create the IndexSearcher
        IndexSearcher searcher = new IndexSearcher(reader);
        try {
            // Paoding analyzer
            Analyzer analyzer = new PaodingAnalyzer();
            // searchWhich selects whether content or title is queried
            QueryParser parser = new QueryParser(searchWhich, analyzer);
            // Parse the user's input into a query
            Query query = parser.parse(searchParam);
            // Run the query, sorted by the date field in descending order
            Hits hits = searcher.search(query, new Sort("date", true));
            // List that holds the results for display in the JSP page
            List<SearchResultBean> list = new ArrayList<SearchResultBean>();
            // Formatter that wraps highlighted terms in HTML tags
            SimpleHTMLFormatter sHtmlF = new SimpleHTMLFormatter("<b><font color='red'>", "</font></b>");
            // The highlighter itself
            Highlighter highlighter = new Highlighter(sHtmlF, new QueryScorer(query));
            // Size, in characters, of the text fragment around a highlighted term
            highlighter.setTextFragmenter(new SimpleFragmenter(100));
            // Emulate Hibernate's setFirstResult and setMaxResults to return one page of results;
            // firstResult is 1-based, and the window is clamped to the number of hits
            int end = Math.min(firstResult + maxResult - 1, hits.length());
            for (int i = Math.max(firstResult - 1, 0); i < end; i++) {
                // Get the indexed document for this hit
                Document doc = hits.doc(i);
                SearchResultBean srb = new SearchResultBean();
                // Title
                String title = doc.get("title");
                // Content
                String content = doc.get("content");
                // Primary key id
                String id = doc.get("id");
                // Publish time
                String date = doc.get("date");
                if (searchWhich.equals("title")) {
                    // The title was searched: highlight it
                    // (getBestFragment returns null when nothing matches, so fall back to the plain title)
                    String bestFragment = highlighter.getBestFragment(analyzer, searchWhich, title);
                    srb.setTitle(bestFragment != null ? bestFragment : title);
                    if (content.length() < 150) {
                        // Content shorter than 150 characters: use all of it
                        srb.setContent(content);
                    } else {
                        // Otherwise use only the first 150 characters
                        srb.setContent(content.substring(0, 150));
                    }
                } else {
                    // The content field was searched: highlight it
                    String bestFragment = highlighter.getBestFragment(analyzer, searchWhich, content);
                    srb.setContent(bestFragment != null ? bestFragment : content);
                    // Use the whole title as is
                    srb.setTitle(title);
                }
                // Date
                srb.setDate(date);
                // Primary key
                srb.setId(id);
                // Add to the list for display in the JSP page
                list.add(srb);
            }
            return list;
        } finally {
            // Hits loads documents lazily, so close only after the loop has finished
            searcher.close();
            reader.close();
        }
    }

    // Returns the total number of records that match the query, for paging; similar to the method above
    public int getResultCount(String searchWhich, String searchParam) throws Exception {
        File indexFile = new File("c:/index/news");
        IndexReader reader = IndexReader.open(indexFile);
        IndexSearcher searcher = new IndexSearcher(reader);
        try {
            Analyzer analyzer = new PaodingAnalyzer();
            QueryParser parser = new QueryParser(searchWhich, analyzer);
            Query query = parser.parse(searchParam);
            Hits hits = searcher.search(query);
            return hits.length();
        } finally {
            searcher.close();
            reader.close();
        }
    }
}
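To make the roles of the formatter tags and the fragment size concrete, here is a deliberately naive, Lucene-free sketch of what highlighting produces: a window of text around a match, with the matched term wrapped in opening and closing tags. It is not how Lucene's `Highlighter` works internally (that operates on analyzed tokens and scores fragments); `NaiveHighlightSketch` and its parameters are hypothetical and only mimic the visible result, including returning null when nothing matches.

```java
public class NaiveHighlightSketch {

    /**
     * Returns a fragment of text around the first occurrence of term, with every
     * occurrence inside the fragment wrapped in pre/post, or null if term is absent.
     */
    static String highlight(String text, String term, String pre, String post, int fragmentSize) {
        if (term.isEmpty()) {
            return null;
        }
        int hit = text.indexOf(term);
        if (hit < 0) {
            return null;
        }
        // Start the window a little before the match so some leading context is kept.
        int start = Math.max(0, hit - fragmentSize / 2);
        // Make sure the window is long enough to contain the whole matched term.
        int end = Math.min(text.length(), Math.max(start + fragmentSize, hit + term.length()));
        String fragment = text.substring(start, end);
        return fragment.replace(term, pre + term + post);
    }

    public static void main(String[] args) {
        String text = "Lucene makes full-text search easy to add to an application";
        System.out.println(highlight(text, "search", "<b><font color='red'>", "</font></b>", 100));
        System.out.println(highlight(text, "missing", "<b>", "</b>", 100));
    }
}
```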

The paging action is as follows:

Java code

package com.zly.test.action;

import java.util.List;
import com.zly.indexManager.IndexSearchUtil;
import com.zly.test.entity.PageControl;
import com.zly.test.entity.SearchResultBean;

public class SearchAction extends BaseAction {

    private static final long serialVersionUID = -2387037924517370511L;
    // Utility that queries the index
    private IndexSearchUtil indexSearcher;
    // Which field to search: title or content
    private String searchWhich;
    // The search text entered by the user
    private String searchParam;
    // The page to jump to
    private String jumpPage;

    public String getJumpPage() {
        return jumpPage;
    }

    public void setJumpPage(String jumpPage) {
        this.jumpPage = jumpPage;
    }

    public String getSearchWhich() {
        return searchWhich;
    }

    public void setSearchWhich(String searchWhich) {
        this.searchWhich = searchWhich;
    }

    public String getSearchParam() {
        return searchParam;
    }

    public void setSearchParam(String searchParam) {
        this.searchParam = searchParam;
    }

    public String search() throws Exception {
        // Null means this is the first visit, so start on page 1
        if (jumpPage == null) {
            jumpPage = "1";
        }
        // Get the pageControl object from request scope
        PageControl pageControl = (PageControl) this.getRequest().getAttribute("pageControl");
        // If there is none, create it and set the total record count so the page count can be computed
        if (pageControl == null) {
            pageControl = new PageControl();
            pageControl.setMaxRowCount((long) indexSearcher.getResultCount(searchWhich, searchParam));
            pageControl.countMaxPage();
        }
        // Set the current page
        pageControl.setCurPage(Integer.parseInt(jumpPage));
        // Compute firstResult (1-based position of the first record on this page)
        int firstResult = (pageControl.getCurPage() - 1) * pageControl.getRowsPerPage() + 1;
        // Number of records left from firstResult onward, counting the record at firstResult itself
        long left = pageControl.getMaxRowCount() - firstResult + 1;
        int maxResult = -1;
        if (left < pageControl.getRowsPerPage()) {
            // Fewer records left than fit on one page: take what is left (never negative)
            maxResult = (int) Math.max(0, left);
        } else {
            // Otherwise take a full page
            maxResult = pageControl.getRowsPerPage();
        }
        // Fetch the result list
        List<SearchResultBean> userList = indexSearcher.getSearchResult(searchWhich, searchParam, firstResult, maxResult);
        // Store it in pageControl
        pageControl.setData(userList);
        // Put pageControl in request scope so the JSP can display the results
        this.getRequest().setAttribute("pageControl", pageControl);
        // Put searchWhich and searchParam in request scope so the paging JSP can add them as
        // hidden form fields and submit them again on the next page request
        this.getRequest().setAttribute("searchWhich", searchWhich);
        this.getRequest().setAttribute("searchParam", searchParam);
        // Forward to the paging view
        return SUCCESS;
    }

    public IndexSearchUtil getIndexSearcher() {
        return indexSearcher;
    }

    public void setIndexSearcher(IndexSearchUtil indexSearcher) {
        this.indexSearcher = indexSearcher;
    }
}
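The paging arithmetic in the action is easy to get wrong by one, so it helps to check it in isolation. The sketch below computes the 1-based `firstResult` and the `maxResult` for a page, counting the record at `firstResult` itself as still remaining. `PageWindowSketch` is a hypothetical stand-alone class; it does not use the project's `PageControl`, whose source is not shown here.

```java
public class PageWindowSketch {

    /** 1-based position of the first record on the given 1-based page. */
    static int firstResult(int page, int rowsPerPage) {
        return (page - 1) * rowsPerPage + 1;
    }

    /** Number of records to fetch for the page: a full page, the remainder, or zero. */
    static int maxResult(long total, int page, int rowsPerPage) {
        // Records from firstResult to the end, inclusive of the one at firstResult.
        long remaining = total - firstResult(page, rowsPerPage) + 1;
        return (int) Math.max(0, Math.min(remaining, rowsPerPage));
    }

    /** Total number of pages, rounding up. */
    static int maxPage(long total, int rowsPerPage) {
        return (int) ((total + rowsPerPage - 1) / rowsPerPage);
    }

    public static void main(String[] args) {
        long total = 25;
        int rowsPerPage = 10;
        for (int page = 1; page <= maxPage(total, rowsPerPage); page++) {
            System.out.println("page " + page
                    + ": firstResult=" + firstResult(page, rowsPerPage)
                    + ", maxResult=" + maxResult(total, page, rowsPerPage));
        }
    }
}
```

With 25 records and 10 rows per page, the last page starts at record 21 and must return 5 records (21 through 25), which is why the remainder is computed inclusively.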

The search action is configured in struts.xml as follows:

XML code

<action name="searchAction" class="searchAction" method="search">
<result>/searchResult.jsp</result>
</action>

The code of searchResult.jsp is as follows:

HTML code

<%@ page language="java" contentType="text/html;charset=utf-8"
pageEncoding="utf-8"%>
<%@taglib prefix="c" uri="http://java.sun.com/jsp/jstl/core" %>
<%@taglib prefix="fmt" uri="http://java.sun.com/jsp/jstl/fmt" %>
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<title>Search results for <c:out value="${searchParam}" /> -- 最快新闻网</title>
</head>
<body>
<jsp:include page="index.jsp"></jsp:include>
<div id="content">
<div id="searchResults" >
<c:forEach items="${pageControl.data}" var="result">
<div style="margin-top: 20px;">
<span>
<a href="detailAction.action?id=${result.id }">${result.title}</a><br />
${result.content }
<font color="green">http://localhost:8080/NewsWithSearch/detailAction.action?id=${result.id } ${result.date }</font>
</span>
<br />
</div>
</c:forEach>
</div>
<br /><br /><br /><br />
<%@ include file="searchPage.jsp" %>
</div>
</body>
</html>