lucene: specifying an analyzer for each field, and a more complex search page (searching QQ domestic news)

deepfuture


1. Code

Java code (indexing):

package bindex;

import java.io.IOException;
import java.net.URL;

import jeasy.analysis.MMAnalyzer;
import org.apache.lucene.analysis.PerFieldAnalyzerWrapper;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.CorruptIndexException;
import org.apache.lucene.index.IndexWriter;
import org.htmlparser.Node;
import org.htmlparser.NodeFilter;
import org.htmlparser.Parser;
import org.htmlparser.beans.LinkBean;
import org.htmlparser.filters.HasAttributeFilter;
import org.htmlparser.filters.OrFilter;
import org.htmlparser.filters.RegexFilter;
import org.htmlparser.filters.TagNameFilter;
import org.htmlparser.util.NodeList;
import org.htmlparser.util.ParserException;

public class perfieldindextest {

    /**
     * Collects the links on the Tencent (QQ) domestic news page and indexes each article.
     *
     * @param args not used
     */
    public static void main(String[] args) {
        String indexpath = "./indexes";

        IndexWriter writer;
        PerFieldAnalyzerWrapper wr;
        Document doc;
        try {
            // StandardAnalyzer is the default; the fields registered below get their own analyzer.
            // Note: "author" is stored UN_TOKENIZED and "time" is not indexed (see below),
            // so the analyzers registered for those two fields have no effect here.
            wr = new PerFieldAnalyzerWrapper(new StandardAnalyzer());
            wr.addAnalyzer("title", new MMAnalyzer());
            wr.addAnalyzer("content", new MMAnalyzer());
            wr.addAnalyzer("author", new MMAnalyzer());
            wr.addAnalyzer("time", new StandardAnalyzer());
            // Hand the wrapper to the writer; otherwise the per-field settings are never used.
            writer = new IndexWriter(indexpath, wr);
            // Extract the links on the Tencent domestic news page.
            LinkBean lb = new LinkBean();
            lb.setURL("http://news.qq.com/china_index.shtml");
            URL[] urls = lb.getLinks();
            for (int i = 0; i < urls.length; i++) {
                doc = new Document();
                String title = "";
                String content = "";
                String time = "";
                String author = "";
                String url = urls[i].toString();
                System.out.println("Fetching link " + i + " (" + (int) (100 * (i + 1) / urls.length) + "%) [" + url + "].....");
                if (!url.startsWith("http://news.qq.com/a/")) {
                    System.out.println("Not a news link, skipping......");
                    continue;
                }
                System.out.println("News link, processing");
                Parser parser = new Parser(url);
                parser.setEncoding("GBK");

                // Title: text of the <title> tag.
                NodeFilter filter_title = new TagNameFilter("title");
                NodeList nodelist = parser.parse(filter_title);
                if (nodelist.size() == 0) {
                    System.out.println("No title found, skipping......");
                    continue;
                }
                title = nodelist.elementAt(0).toPlainTextString();
                System.out.println("Title: " + title);
                parser.reset();

                // Author: nodes with class "auth" or "where"; fall back to "腾讯网" (Tencent).
                NodeFilter filter_auth = new OrFilter(new HasAttributeFilter("class", "auth"), new HasAttributeFilter("class", "where"));
                nodelist = parser.parse(filter_auth);
                if (nodelist.size() > 0) author = nodelist.elementAt(0).toPlainTextString();
                else author = "腾讯网";
                if (nodelist.size() > 1) author += nodelist.elementAt(1).toPlainTextString();
                System.out.println("Author: " + author);
                parser.reset();

                // Time: a node with class "info", or text that looks like "2009年11月27日 10:00".
                NodeFilter filter_time = new OrFilter(new HasAttributeFilter("class", "info"), new RegexFilter("[0-9]{4}年[0-9]{1,2}月[0-9]{1,2}日[' ']*[0-9]{1,2}:[0-9]{1,2}"));
                nodelist = parser.parse(filter_time);
                if (nodelist.size() > 0) {
                    Node node_time = nodelist.elementAt(0);
                    if (node_time.getChildren() != null && node_time.getFirstChild() != null) node_time = node_time.getFirstChild();
                    // Strip whitespace (the original pattern also stripped '|' by accident).
                    time = node_time.toPlainTextString().replaceAll("[ \t\n\f\r\u3000]", "");
                    // Keep the leading "yyyy年MM月dd日HH:mm" part (16 characters) when it is longer.
                    if (time.length() > 16) time = time.substring(0, 16);
                }
                System.out.println("Time: " + time);
                parser.reset();

                // Content: the article body, located by style or id.
                NodeFilter filter_content = new OrFilter(new OrFilter(new HasAttributeFilter("style", "TEXT-INDENT: 2em"), new HasAttributeFilter("id", "Cnt-Main-Article-QQ")), new HasAttributeFilter("id", "ArticleCnt"));
                nodelist = parser.parse(filter_content);
                if (nodelist.size() == 0) {
                    System.out.println("No article body found, skipping......");
                    continue;
                }
                // Crude cleanup of script and style residue, then removal of whitespace.
                content = nodelist.elementAt(0).toPlainTextString().replaceAll("(#.*)|([a-z].*;)|}", "").replaceAll(" |\t|\r|\n|\u3000", "");
                System.out.println("Content: " + content);

                System.out.println("Indexing.....");
                // title and content are tokenized by the analyzers registered above.
                doc.add(new Field("title", title, Field.Store.YES, Field.Index.TOKENIZED));
                doc.add(new Field("content", content, Field.Store.YES, Field.Index.TOKENIZED));
                // author is indexed as a single term; time and url are stored only.
                doc.add(new Field("author", author, Field.Store.YES, Field.Index.UN_TOKENIZED));
                doc.add(new Field("time", time, Field.Store.YES, Field.Index.NO));
                doc.add(new Field("url", url, Field.Store.YES, Field.Index.NO));
                writer.addDocument(doc);
                System.out.println("<" + title + " indexed successfully>");
            }
            writer.close();
            wr.close();
        } catch (ParserException e) {
            e.printStackTrace();
        } catch (CorruptIndexException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

}
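
The idea behind PerFieldAnalyzerWrapper used above is a lookup by field name with a fallback to a default analyzer. The following is only a minimal sketch of that dispatch idea in plain Java, so that it runs without Lucene or je-analysis on the classpath. The class PerFieldDispatchSketch and the two toy tokenizers are hypothetical names introduced for illustration; they are not part of the Lucene API, and real analyzers produce token streams rather than lists of strings.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Hypothetical stand-in for the per-field dispatch idea; NOT the Lucene API.
class PerFieldDispatchSketch {
    private final Function<String, List<String>> defaultAnalyzer;
    private final Map<String, Function<String, List<String>>> perField = new HashMap<>();

    PerFieldDispatchSketch(Function<String, List<String>> defaultAnalyzer) {
        this.defaultAnalyzer = defaultAnalyzer;
    }

    // Register a tokenizer for one field name.
    void addAnalyzer(String field, Function<String, List<String>> analyzer) {
        perField.put(field, analyzer);
    }

    // Use the field's own tokenizer if one was registered, else the default.
    List<String> analyze(String field, String text) {
        return perField.getOrDefault(field, defaultAnalyzer).apply(text);
    }

    // Toy tokenizer: split on runs of whitespace.
    static List<String> whitespace(String text) {
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return new ArrayList<>();
        }
        return Arrays.asList(trimmed.split("\\s+"));
    }

    // Toy tokenizer: one token per non-whitespace character.
    static List<String> perCharacter(String text) {
        List<String> tokens = new ArrayList<>();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (!Character.isWhitespace(c)) {
                tokens.add(String.valueOf(c));
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        PerFieldDispatchSketch wr = new PerFieldDispatchSketch(PerFieldDispatchSketch::whitespace);
        wr.addAnalyzer("title", PerFieldDispatchSketch::perCharacter);
        System.out.println(wr.analyze("title", "ab c"));      // registered field
        System.out.println(wr.analyze("time", "2009 11 27")); // falls back to the default
    }
}
```

The point of the sketch is that a field without a registered analyzer silently falls back to the default, which is why the wrapper has to be the analyzer the writer actually uses.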

Author's blog: http://blog.163.com/sukerl@126/

Servlet code (search):

package bservlet;

import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.*;
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.*;

import java.io.*;

import jeasy.analysis.MMAnalyzer;

public class SluceneSearcher extends HttpServlet {
    // Index directory; it must point at the index built by the indexer.
    private String indexpath = "D:/workspace/testsearch2/indexes";

    public void doPost(HttpServletRequest request, HttpServletResponse response) {
        StringBuffer sb = new StringBuffer("");
        IndexSearcher searcher = null;
        try {
            request.setCharacterEncoding("GBK");
            String phrase = request.getParameter("phrase");
            if (phrase == null || phrase.trim().length() == 0) {
                // A missing or empty query would make parser.parse() fail.
                sb.append("<h1>Please enter a search keyword</h1>");
            } else {
                // Query the "content" field with the same analyzer that indexed it.
                Analyzer analyzer = new MMAnalyzer();
                searcher = new IndexSearcher(indexpath);
                QueryParser parser = new QueryParser("content", analyzer);
                Query q = parser.parse(phrase);
                Hits hs = searcher.search(q);
                int num = hs.length();
                sb.append("<h1>Number of records found: " + num + "</h1>");
                for (int i = 0; i < num; i++) {
                    Document doc = hs.doc(i);
                    if (doc == null) {
                        continue;
                    }
                    // Build one result entry from the stored fields.
                    Field field_title = doc.getField("title");
                    String title = "<br><a href=\"" + doc.getField("url").stringValue() + "\" target='_blank'>" + field_title.stringValue() + "</a><br>";
                    Field field_author = doc.getField("author");
                    String author = "<br>author:<br>" + field_author.stringValue();
                    Field field_time = doc.getField("time");
                    String time = "<br>time:<br>" + field_time.stringValue();
                    sb.append(title);
                    sb.append(author);
                    sb.append(time);
                }
            }
        } catch (CorruptIndexException e1) {
            e1.printStackTrace();
        } catch (IOException e1) {
            e1.printStackTrace();
        } catch (ParseException e) {
            e.printStackTrace();
        } finally {
            // Release the index even when the search fails.
            if (searcher != null) {
                try {
                    searcher.close();
                } catch (IOException e) {
                    e.printStackTrace();
                }
            }
        }
        PrintWriter out;
        try {
            response.setContentType("text/html;charset=GBK");
            out = response.getWriter();
            out.print(sb.toString());
            out.close();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    public void doGet(HttpServletRequest request, HttpServletResponse response) {
        doPost(request, response);
    }

}
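
The servlet concatenates stored field values (title, author, time, url) straight into the HTML it returns. Since those values come from scraped pages, a title containing "<" or "&" could break the markup. One possible hardening, sketched here under the assumption that a small hand-written helper is acceptable, is to escape each value before appending it. HtmlEscapeSketch, escapeHtml and resultLink are hypothetical names for illustration and do not appear in the original code.

```java
// Hypothetical helper: escapes text before it is placed into HTML output.
class HtmlEscapeSketch {

    // Replace the five characters that are significant in HTML text and attributes.
    static String escapeHtml(String s) {
        if (s == null) {
            return "";
        }
        StringBuilder sb = new StringBuilder(s.length() + 16);
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '&': sb.append("&amp;"); break;
                case '<': sb.append("&lt;"); break;
                case '>': sb.append("&gt;"); break;
                case '"': sb.append("&quot;"); break;
                case '\'': sb.append("&#39;"); break;
                default: sb.append(c);
            }
        }
        return sb.toString();
    }

    // Build one result link in the same shape the servlet emits, with escaped values.
    static String resultLink(String url, String title) {
        return "<br><a href=\"" + escapeHtml(url) + "\" target=\"_blank\">"
                + escapeHtml(title) + "</a><br>";
    }

    public static void main(String[] args) {
        System.out.println(resultLink("http://news.qq.com/a/20091127/000644.htm",
                "A <b>title</b> & more"));
    }
}
```

In the servlet, the stringValue() results would be passed through such a helper before being appended to the StringBuffer.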

web.xml:
<?xml version="1.0" encoding="ISO-8859-1"?>
<!--
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements. See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to You under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->

<web-app xmlns="http://java.sun.com/xml/ns/javaee"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://java.sun.com/xml/ns/javaee http://java.sun.com/xml/ns/javaee/web-app_2_5.xsd"
version="2.5">

<display-name>news-search</display-name>
<description>
news-search
</description>
<servlet>
<servlet-name>newssearch</servlet-name>
<servlet-class>bservlet.SluceneSearcher</servlet-class>
</servlet>
<servlet-mapping>
<servlet-name>newssearch</servlet-name>
<url-pattern>/deepfuturesou</url-pattern>
</servlet-mapping>

</web-app>

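Note that deepfuturesou is a virtual path: do not create a real directory with that name. It must match the form action used in the search page, and its servlet-name must match the one in the corresponding servlet entry.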

Search page:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=gb2312" />
<title>Tencent Domestic News Search</title>
</head>

<body>
<form id="form1" name="form1" method="post" action="deepfuturesou">
Search keyword
<input name="phrase" type="text" id="phrase" />
<input type="submit" name="Submit" value="Search" />
</form>
</body>
</html>

2. Results (searching QQ domestic news)

Fetching link 0 (0%) [http://news.qq.com/china_index.shtml#].....
Not a news link, skipping......
Fetching link 1 (1%) [http://3g.qq.com].....
Not a news link, skipping......
Fetching link 2 (1%) [http://www.qq.com].....
Not a news link, skipping......
Fetching link 3 (2%) [http://news.qq.com/].....
Not a news link, skipping......
Fetching link 4 (3%) [http://news.qq.com/photo.shtml].....
Not a news link, skipping......
Fetching link 5 (3%) [http://news.qq.com/scroll/now.htm].....
Not a news link, skipping......
Fetching link 6 (4%) [http://news.qq.com/paihang.htm].....
Not a news link, skipping......
Fetching link 7 (5%) [http://news.qq.com/china_index.shtml].....
Not a news link, skipping......
Fetching link 8 (5%) [http://news.qq.com/world_index.shtml].....
Not a news link, skipping......
Fetching link 9 (6%) [http://news.qq.com/society_index.shtml].....
Not a news link, skipping......
Fetching link 10 (6%) [http://report.qq.com/].....
Not a news link, skipping......
Fetching link 11 (7%) [http://news.qq.com/military.shtml].....
Not a news link, skipping......
Fetching link 12 (8%) [http://view.news.qq.com/index/zhuanti/zt_more.htm].....
Not a news link, skipping......
Fetching link 13 (8%) [http://view.news.qq.com/].....
Not a news link, skipping......
Fetching link 14 (9%) [http://news.qq.com/topic/feature.htm].....
Not a news link, skipping......
Fetching link 15 (10%) [http://blog.qq.com/news/].....
Not a news link, skipping......
Fetching link 16 (10%) [http://news.qq.com/photon/videonews/morevideo.htm].....
Not a news link, skipping......
Fetching link 17 (11%) [http://bj.qq.com/].....
Not a news link, skipping......
Fetching link 18 (11%) [http://sh.qq.com/].....
Not a news link, skipping......
Fetching link 19 (12%) [http://gd.qq.com/].....
Not a news link, skipping......
Fetching link 20 (13%) [http://cq.qq.com/].....
Not a news link, skipping......
Fetching link 21 (13%) [http://xian.qq.com/].....
Not a news link, skipping......
Fetching link 22 (14%) [http://cd.qq.com/].....
Not a news link, skipping......
Fetching link 23 (15%) [http://js.qq.com/].....
Not a news link, skipping......
Fetching link 24 (15%) [http://zj.qq.com/].....
Not a news link, skipping......
Fetching link 25 (16%) [http://sd.qq.com/].....
Not a news link, skipping......
Fetching link 26 (16%) [http://news.qq.com/{clickurl}].....
Not a news link, skipping......
Fetching link 27 (17%) [http://news.qq.com/{clickurl}].....
Not a news link, skipping......
Fetching link 28 (18%) [http://news.qq.com/{clickurl}].....
Not a news link, skipping......
Fetching link 29 (18%) [http://news.qq.com/china_index.shtml#].....
Not a news link, skipping......
Fetching link 30 (19%) [http://news.qq.com/a/20091127/000644.htm].....
News link, processing
Title: 组图：武汉东湖上千万摇蚊引发多起车祸_新闻国内_新闻_腾讯网
Author: 中国新闻网
Time: 2009年11月27日10:00
Content: 中&白色大理石护栏被摇蚊“刷黑”。中新社发楚天行摄functionSplitPages(name,pageID,listID){SplitPages.prototype.checkPages=function(){SplitPages.prototype.createHtml=function(mode){if(this.pageCount>this.page+2){else{i++){if(i>0){if(i==this.page){else{if(i!=1&&i!=this.pageCount){SplitPages.prototype.Output=function(mode){SplitPages.prototype.setPage=function(mode){$(window.onload=function(){varimgsimgs=$("imgsimgs")changeImg(imgsimgs)近日由于气温上升，武汉东湖沙滩浴场附近的环湖路上落下大量摇蚊，过往汽车碾压后，成“油垢”致路面异常光滑，引发多起车祸。2009年11月24日7时许，一辆黑色轿车，在东湖沙滩浴场旁的弯道处突然失控，撞到路旁的石头上，车头面目全非。这是当天早晨在这里发生的第4起，一辆汽车还将一棵脸盆粗的大树撞到湖里。另外还有5、6辆摩托车也在这里滑倒。东湖环卫管理处派出职工，用高压水枪来清洗路面的“油垢”，以防汽车打滑。[责任编辑：morganli]
Indexing.....
<组图：武汉东湖上千万摇蚊引发多起车祸_新闻国内_新闻_腾讯网 indexed successfully>
Fetching link 31 (20%) [http://news.qq.com/a/20091127/000644.htm].....
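
In the output above, links 30 and 31 point at the same article URL, so the indexer as written appears to fetch and index that article twice. A minimal sketch of one way to avoid this is to filter and de-duplicate the link list before the indexing loop. It assumes the links have already been converted to strings (for example with urls[i].toString()); NewsLinkFilterSketch and uniqueNewsLinks are hypothetical names introduced for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper: keeps only news article links, each one once, in page order.
class NewsLinkFilterSketch {
    // Same prefix test the indexer uses to recognise news articles.
    static final String NEWS_PREFIX = "http://news.qq.com/a/";

    static List<String> uniqueNewsLinks(String[] urls) {
        // LinkedHashSet drops duplicates while preserving first-seen order.
        Set<String> seen = new LinkedHashSet<>();
        for (String u : urls) {
            if (u != null && u.startsWith(NEWS_PREFIX)) {
                seen.add(u);
            }
        }
        return new ArrayList<>(seen);
    }

    public static void main(String[] args) {
        String[] urls = {
            "http://news.qq.com/china_index.shtml#",
            "http://news.qq.com/a/20091127/000644.htm",
            "http://news.qq.com/a/20091127/000644.htm"
        };
        System.out.println(uniqueNewsLinks(urls));
    }
}
```

Looping over the returned list instead of the raw array would index each article once and make the progress percentage reflect only real news links.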

2009-12-23 16:47