Original Chinese Word Segmentation Code, Shared (2.1): The Dictionary-Based Segmentation Interface

billgmh



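Now let's look at the dictionary-based segmentation interface (maximum matching method). First, the segmentation processing interface, SegmentProcessorImpl:

Java code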

/*
 * @author: Hades, created: 2006-11-17
 *
 * Shantou University, Computer Science undergraduate, '03 cohort
 *
 */
package edu.stu.cn.segment.matching.processor;

import java.util.LinkedList;

import edu.stu.cn.segment.matching.dictionary.DictionaryImpl;

/**
 * Chinese word segmentation interface.
 *
 * @author Hades Guan
 */
public interface SegmentProcessorImpl
{
    /**
     * Segments the file srcFile and saves the result to the file tagFile.
     *
     * @param srcFile
     *            the text file to segment
     * @param tagFile
     *            the file the segmentation result is written to
     */
    public void fileProcessor(String srcFile, String tagFile);

    /**
     * @return the dictionary (dic).
     */
    public DictionaryImpl getDic();

    /**
     * @param dic
     *            the dictionary to set.
     */
    public void setDic(DictionaryImpl dic);

    /**
     * Segments the given text and returns the result as a linked list of
     * strings.
     *
     * @param text
     *            the text to segment
     * @return the segmentation result
     */
    public LinkedList<String> textProcess(String text);
}

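The interface defines four methods: setDic sets the dictionary, getDic returns it, fileProcessor segments a source file and writes the result to a target file, and textProcess segments the string text and returns the result as a linked list.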
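As a rough illustration of what a textProcess implementation could do under the maximum matching method mentioned above, here is a minimal forward maximum matching sketch. It is not the author's code and does not use DictionaryImpl, whose methods are not shown in this post. The class name ForwardMaxMatchSketch, the HashSet standing in for the dictionary, the short SEPARATORS string, and the maxLen parameter are all assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedList;
import java.util.Set;

// Hypothetical sketch of forward maximum matching; not the post's implementation.
class ForwardMaxMatchSketch {
    // Assumed, deliberately tiny separator set for the example.
    static final String SEPARATORS = " ,.!?，。、；：？！";

    /**
     * Scans text left to right. At each position it tries the longest window
     * (up to maxLen characters) and shrinks it until the window is a
     * dictionary word or only one character is left.
     */
    public static LinkedList<String> segment(String text, Set<String> dict, int maxLen) {
        LinkedList<String> result = new LinkedList<String>();
        int i = 0;
        while (i < text.length()) {
            // Separator characters are skipped and never emitted.
            if (SEPARATORS.indexOf(text.charAt(i)) >= 0) {
                i++;
                continue;
            }
            // Longest candidate window that still fits inside the text.
            int end = Math.min(i + maxLen, text.length());
            // Shrink from the right until a dictionary word is found;
            // a single character is always accepted as a fallback.
            while (end > i + 1 && !dict.contains(text.substring(i, end))) {
                end--;
            }
            result.add(text.substring(i, end));
            i = end;
        }
        return result;
    }

    public static void main(String[] args) {
        Set<String> dict = new HashSet<String>(
                Arrays.asList("中文", "分词", "中文分词", "代码"));
        System.out.println(segment("中文分词代码，分享", dict, 4));
    }
}
```

With this toy dictionary the call in main yields the list [中文分词, 代码, 分, 享]: the longest entry 中文分词 wins over 中文, and characters with no dictionary entry fall back to single-character tokens.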

Next is the abstract class MatchSegmentProcessor, which implements the SegmentProcessorImpl interface:

Java code

/*
 * @author: Hades, created: 2006-11-17
 *
 * Shantou University, Computer Science undergraduate, '03 cohort
 *
 */
package edu.stu.cn.segment.matching.processor;

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.FileNotFoundException;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;
import java.io.PrintWriter;
import java.util.LinkedList;

import edu.stu.cn.segment.matching.dictionary.DictionaryImpl;

/**
 * Abstract base class for dictionary-matching Chinese word segmentation.
 *
 * @author Hades Guan
 */
public abstract class MatchSegmentProcessor implements SegmentProcessorImpl
{
    /**
     * Dictionary used for lookups.
     */
    protected DictionaryImpl dic = null;

    /**
     * String holding every separator character.
     */
    protected String seperator = null;

    /**
     * English letters and digits.
     */
    protected final String CHAR_AND_NUM = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz";

    /**
     * Initializes the separator string.
     */
    protected void initSeperator()
    {
        // Build the separator set
        StringBuffer buffer = new StringBuffer();
        for (char c = '\u0000'; c <= '\u007F'; c++)
        {
            // English letters and digits are not treated as separators
            if (this.CHAR_AND_NUM.indexOf(c) < 0)
                buffer.append(c);
        }
        // Halfwidth and fullwidth forms block (U+FF00 to U+FFEF)
        for (char c = '\uFF00'; c <= '\uFFEF'; c++)
            buffer.append(c);
        // Additional Chinese punctuation and symbols
        buffer.append("《》？，。、：“；‘’”『』【】－―—─＝÷＋§·～！◎＃￥％…※×（）　");
        this.seperator = buffer.toString();
    }
    /**
     * Segments the file srcFile and saves the result to the file tagFile.
     *
     * @param srcFile
     *            the text file to segment
     * @param tagFile
     *            the file the segmentation result is written to
     */
    public void fileProcessor(String srcFile, String tagFile)
    {
        try
        {
            // Set up input and output
            BufferedReader in = new BufferedReader(new FileReader(srcFile));
            PrintWriter out = new PrintWriter(new BufferedWriter(
                    new FileWriter(tagFile)));
            // Read the whole file; readLine() drops line terminators, so
            // consecutive lines are joined directly
            String line = null;
            StringBuffer buffer = new StringBuffer();
            while ((line = in.readLine()) != null)
            {
                buffer.append(line);
            }
            // Close the input
            in.close();
            // Segment the text
            LinkedList<String> result = this.textProcess(buffer.toString()
                    .trim());
            // Write the result to the file, one word per line
            for (String w : result)
                out.println(w);
            // Close the output
            out.flush();
            out.close();
        }
        catch (FileNotFoundException e)
        {
            // TODO auto-generated catch block
            e.printStackTrace();
        }
        catch (IOException e)
        {
            // TODO auto-generated catch block
            e.printStackTrace();
        }
    }
    /**
     * @return the dictionary (dic).
     */
    public DictionaryImpl getDic()
    {
        return dic;
    }

    /**
     * @param dic
     *            the dictionary to set.
     */
    public void setDic(DictionaryImpl dic)
    {
        this.dic = dic;
    }

    /**
     * Segments the given text and returns the result as a linked list of
     * strings.
     *
     * @param text
     *            the text to segment
     * @return the segmentation result
     */
    abstract public LinkedList<String> textProcess(String text);
}

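The abstract class implements the operations that all concrete implementations share: setDic sets the dictionary, getDic returns it, initSeperator initializes the separator characters (commas, full stops, and so on), and fileProcessor handles the file I/O (it reads the source file into a single string, calls textProcess to segment it, and writes the result to the target file).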
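The read, segment, write flow of fileProcessor can also be sketched with an explicit character encoding, which matters for Chinese text because FileReader and FileWriter fall back to the platform default charset. The following is a variant under stated assumptions, not the author's method: it requires Java 7 or later, assumes the files are UTF-8, puts a newline between input lines instead of joining them directly, and lets IOException propagate. The class FileSegmentSketch and its whitespace-splitting textProcess are hypothetical stand-ins for a real segmenter.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.PrintWriter;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of the fileProcessor flow; not the post's implementation.
class FileSegmentSketch {
    /** Stand-in for a real segmenter: splits on whitespace only. */
    public static List<String> textProcess(String text) {
        List<String> words = new ArrayList<String>();
        for (String w : text.trim().split("\\s+")) {
            if (!w.isEmpty()) {
                words.add(w);
            }
        }
        return words;
    }

    public static void fileProcessor(String srcFile, String tagFile) throws IOException {
        StringBuilder buffer = new StringBuilder();
        // try-with-resources closes the reader even if reading fails.
        try (BufferedReader in = Files.newBufferedReader(
                Paths.get(srcFile), StandardCharsets.UTF_8)) {
            String line;
            while ((line = in.readLine()) != null) {
                // Assumption: keep a line break between lines.
                buffer.append(line).append('\n');
            }
        }
        // Write one token per line, again as UTF-8 (an assumption).
        try (PrintWriter out = new PrintWriter(Files.newBufferedWriter(
                Paths.get(tagFile), StandardCharsets.UTF_8))) {
            for (String w : textProcess(buffer.toString())) {
                out.println(w);
            }
        }
    }

    public static void main(String[] args) throws IOException {
        Path src = Files.createTempFile("src", ".txt");
        Path tag = Files.createTempFile("tag", ".txt");
        Files.write(src, Arrays.asList("中文 分词", "代码 分享"), StandardCharsets.UTF_8);
        fileProcessor(src.toString(), tag.toString());
        System.out.println(Files.readAllLines(tag, StandardCharsets.UTF_8));
    }
}
```

Whether a line break should count as a word boundary depends on the input: hard-wrapped Chinese text can break in the middle of a word, in which case joining lines directly, as the original fileProcessor does, may be the better choice.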


2006-12-28 08:32