一种非常简单,但是不是很优化的方法,继承Lucene.Net.Analysis.Analyzer,实现了Lucene.Net.Analysis.Analyzer,Lucene.Net.Analysis.Tokenizer,Lucene.Net.Analysis.TokenFilter的子类.参考了Lucene.Net.Analysis.Cn的实现,该项目采用对汉语进行一元分词.
ChineseAnalyer类,继承自Lucene.Net.Analysis.Analyzer
using System;
using System.IO;
using System.Text;
using System.Collections;
using ShootSeg; //分词类的命名空间,该分词组件来源于http://www.shootsoft.net,开源项目,感谢作者
using Lucene.Net.Analysis;
namespace Lucene.Net.Analysis.CnByKing
{
public class ChineseAnalyzer : Analyzer
{
private Segment segment = new Segment(); //这个是自己的中文分词类
public ChineseAnalyzer()
{
segment.InitWordDics(); //在构造函数装载词典
segment.Separator = "|"; //分词间隔符号
}
public override sealed TokenStream TokenStream(String fieldName, TextReader reader)
{
TokenStream result = new ChineseTokenizer(reader,segment); //把分词类传引用进去
result = new ChineseFilter(result); //对处理好的结果进行过滤
return result;
}
}
}
ChineseTokenizer类继承自Lucene.Net.Analysis.Tokenizer
using System;
using System.IO;
using System.Text;
using System.Collections;
using System.Globalization;
using ShootSeg;
using Lucene.Net.Analysis;
namespace Lucene.Net.Analysis.CnByKing
{
public sealed class ChineseTokenizer : Tokenizer
{
private Segment segment;
private string[] Wordlist; //切好的词放入此数组中
private string Allstr; //对传入的流转成此string
private int offset = 0; int start = 0; int step = 0; //offset偏移量,start开始位置,step次数
public ChineseTokenizer(TextReader _in,Segment segment)
{
input = _in;
Allstr = input.ReadToEnd(); //把流读到Allstr
this.segment = segment; //继续传引用,这会才发现写的时候糊涂了,完全可以不用写
Wordlist = segment.SegmentText(Allstr).Split('|'); //把分好的词装入wordlist
}
private Token Flush(string str)
{
if (str.Length > 0)
{
return new Token(str,start, start + str.Length); //返回一个Token 包含词,词在流中的开始位置和结束位置.
}
else
return null;
}
public override Token Next() //重载Next函数,就是返回Token
{
Token token = null;
if (step <= Wordlist.Length)
{
start = Allstr.IndexOf(Wordlist[step], offset); //从Allstr里找每个分出来词汇的开始位置
offset = start + 1; //计算偏移量
token = Flush(Wordlist[step]); //返回已分词汇
step = step + 1; //变量+1,移动到wordlist的下一个词汇
}
return token;
}
}
}
这个ChineseFilter继承自Lucene.Net.Analysis.TokenFilter,完全照抄Lucene.Net.Analysis.Cn工程的同名类(此类过滤了数字及符号,英文助词,需要过滤其他相应增加代码)
using System;
using System.IO;
using System.Collections;
using System.Globalization;
using Lucene.Net.Analysis;
namespace Lucene.Net.Analysis.CnByKing
{
/* ====================================================================
* The Apache Software License, Version 1.1
*
* Copyright (c) 2004 The Apache Software Foundation. All rights
* reserved.
*
* Redistribution and use in source and binary forms, with or without
* modification, are permitted provided that the following conditions
* are met:
*
* 1. Redistributions of source code must retain the above copyright
* notice, this list of conditions and the following disclaimer.
*
* 2. Redistributions in binary form must reproduce the above copyright
* notice, this list of conditions and the following disclaimer in
* the documentation and/or other materials provided with the
* distribution.
*
* 3. The end-user documentation included with the redistribution,
* if any, must include the following acknowledgment:
* "This product includes software developed by the
* Apache Software Foundation (http://www.apache.org/)."
* Alternately, this acknowledgment may appear in the software itself,
* if and wherever such third-party acknowledgments normally appear.
*
* 4. The names "Apache" and "Apache Software Foundation" and
* "Apache Lucene" must not be used to endorse or promote products
* derived from this software without prior written permission. For
* written permission, please contact apache@apache.org.
*
* 5. Products derived from this software may not be called "Apache",
* "Apache Lucene", nor may "Apache" appear in their name, without
* prior written permission of the Apache Software Foundation.
*
* THIS SOFTWARE IS PROVIDED ``AS IS'' AND ANY EXPRESSED OR IMPLIED
* WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES
* OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
* DISCLAIMED. IN NO EVENT SHALL THE APACHE SOFTWARE FOUNDATION OR
* ITS CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
* SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
* LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF
* USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
* ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
* OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT
* OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
* SUCH DAMAGE.
* ====================================================================
*
* This software consists of voluntary contributions made by many
* individuals on behalf of the Apache Software Foundation. For more
* information on the Apache Software Foundation, please see
* <http://www.apache.org/>.
*/
/// <summary>
/// Title: ChineseFilter
/// Description: Filter with a stop word table
/// Rule: No digital is allowed.
/// English word/token should larger than 1 character.
/// One Chinese character as one Chinese word.
/// TO DO:
/// 1. Add Chinese stop words, such as \ue400
/// 2. Dictionary based Chinese word extraction
/// 3. Intelligent Chinese word extraction
///
/// Copyright: Copyright (c) 2001
/// Company:
/// @author Yiyi Sun
/// @version $Id: ChineseFilter.java, v 1.4 2003/01/23 12:49:33 ehatcher Exp $
/// </summary>
public sealed class ChineseFilter : TokenFilter
{
// Only English now, Chinese to be added later.
public static String[] STOP_WORDS =
{
"and", "are", "as", "at", "be", "but", "by",
"for", "if", "in", "into", "is", "it",
"no", "not", "of", "on", "or", "such",
"that", "the", "their", "then", "there", "these",
"they", "this", "to", "was", "will", "with"
};
private Hashtable stopTable;
public ChineseFilter(TokenStream _in)
: base(_in)
{
stopTable = new Hashtable(STOP_WORDS.Length);
for (int i = 0; i < STOP_WORDS.Length; i++)
stopTable[STOP_WORDS[i]] = STOP_WORDS[i];
}
public override Token Next()
{
for (Token token = input.Next(); token != null; token = input.Next())
{
String text = token.TermText();
// why not key off token type here assuming ChineseTokenizer comes first?
if (stopTable[text] == null)
{
switch (Char.GetUnicodeCategory(text[0]))
{
case UnicodeCategory.LowercaseLetter:
case UnicodeCategory.UppercaseLetter:
// English word/token should larger than 1 character.
if (text.Length > 1)
{
return token;
}
break;
case UnicodeCategory.OtherLetter:
// One Chinese character as one Chinese word.
// Chinese word extraction to be added later here.
return token;
}
}
}
return null;
}
}
}
以上基本没什么技术含量,好处就是增加新的中文分词不管什么算法,只需要简单几行代码搞定.中文分词完全和DotLucene/Lucene.net本身无关. 使用的时候用ChineseAnalyzer替换 StandardAnalyzer就OK了.
分享到:
相关推荐
Lucene.Net是一个开源的全文搜索引擎库,它是Apache Lucene项目在.NET平台上的实现,由DotLucene发展而来,广泛应用于各种信息检索和文本挖掘场景。这个资料包包含了Lucene.Net 2.0的源码和相关文档,对于想要深入...
在“Lucene.net与Sql索引源码”这个主题中,我们将探讨如何将Lucene.NET与SQL数据库集成,构建一个强大的混合搜索解决方案。 首先,让我们理解Lucene.NET的工作原理。Lucene.NET是基于倒排索引(Inverted Index)的...
对初学使用dotlucent作站...利用dotlucene为网站做的索引文件的应用程序。 数据库源是SQL Server,项目是用VS.NET2008开发的。 应用程序界面可以配置数据库链接,生成报告,定时执行增量索引,对单条索引进行更新操作。
Lucene结合Sql建立索引Demo源码 Lucene(这里用到的是Lucene.net版本也成为DotLucene)是一个信息检索的函数库(Library),利用它你可以为你的应用...支持简单的中文分词,同时提供了Lucene.Net-2.0-004版本的源码给大家
当我们将视线转向.NET平台,Lucene.Net就成为了一个关键的工具,它为.NET开发者提供了在C#、VB.NET等语言中使用Lucene的强大支持。本文将重点探讨基于.NET的Lucene应用实例——"dotLuceneTest.jar",并解析其中的...
该示例中DotLucene版本为 1.3,Highlighter版本为1.3.2.1,如果下载最新的lucene(Lucene.Net-2.0-004) 【该源码由51aspx提供】 源码 " width="468" resize="true" onerror="this.src='/images/ifnoimg....
Lucene(这里用到的是Lucene.net版本也成为DotLucene)是一个信息检索的函数库(Library),利用它你可以为你的应用加上索引和搜索的功能. ...支持简单的中文分词,同时提供了Lucene.Net-2.0-004版本的源码给大家
可以在这里查看http://www.shootsoft.net/dotlucene在线测试. 支持微软标准IFilter,支持自己写插件. 没有使用自己写的分词程序,时间不是很充足... dotLucene下的高亮显示好像有问题,不是很好用
在这个"DotLucene演示源码"中,我们可以深入理解如何在.NET环境中利用Lucene进行全文检索、智能分词和关键字高亮等操作。 首先,让我们来了解一下DotLucene的核心概念和功能。DotLucene提供了一个高效的索引机制,...
Lucene(这里用到的是Lucene.net版本也成为DotLucene)是一个信息检索的函数库(Library),利用它你可以为你的应用加上索引和搜索的功能. Lucene的使用者不需要深入了解有关全文检索的知识,仅仅学会使用库中的一个类,...
dotLucene,全称为Lucene.NET,是Apache Lucene的.NET版本,它为.NET开发人员提供了一个完整的、高性能的全文检索库。Lucene以其强大的索引和搜索功能著称,它能够快速地处理大量文本数据,构建索引并在短时间内检索...
Lucene(这里用到的是Lucene.net版本也成为DotLucene)是一个信息检索的函数库(Library),利用它你可以为你的应用加上索引和搜索的功能. Lucene的使用者不需要深入了解有关全文检索的知识,仅仅学会使用库中的一个类,...
DotLucene是一款基于.NET平台的全文搜索引擎库,它是Apache Lucene的.NET版本。这个压缩包“商业源码-编程源码-DotLucene演示源码.zip”包含的是使用DotLucene进行开发的示例代码,可以帮助开发者理解如何在实际项目...
DotLucene实际是Lucene的Asp.net版本,也称为lucene.net 该项目的原型为DotLuceneAPISearchDemo-1.1,后经51aspx升级为Asp.net2.0版本并改为WebApplication类型 该demo演示了Lucene的常用功能(智能分词、关键字高亮...
- 探讨Lucene中排序的相关概念和技术,如ScoreDoc、Sort、SortField等。 - 了解如何根据不同的字段对搜索结果进行排序,以及如何实现自定义排序逻辑。 #### 八、过滤 - 过滤器(Filter)允许开发者在查询结果中...
Lucene(这里用到的是Lucene.net版本也成为DotLucene)是一个信息检索的函数库(Library),利用它你可以为你的应用加上索引和搜索的功能. Lucene的使用者不需要深入了解有关全文检索的知识,仅仅学会使用库中的一个类,...
dotlucene 源码 开发包 包括dotlucene1.9.1源码+Demo 学习资料:http://www.cnblogs.com/idior/category/21216.html http://www.cnblogs.com/kwklover/category/62322.html
Lucene结合Sql建立索引Demo源码,Lucene(这里用到的是Lucene.net版本也成为DotLucene)是一个信息检索的函数库(Library),利用它你可以为你的应用加上索引和搜索的功能。Lucene的使用者不需要深入了解有关全文检索的...