Construction Mechanisms for Word-Segmentation Dictionaries (Part 1)

wzhiju

Blog categories: word-segmentation dictionary construction, algorithms

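For anyone new to word segmentation, building the segmentation dictionary is not something to underestimate, because the quality of the dictionary directly affects the performance and running time of the algorithm. A well-built dictionary greatly improves segmentation performance, and even sophisticated segmentation algorithms depend directly on how the dictionary is organized; it is the foundation that segmentation rests on. This series introduces several ways of building the dictionary, one part at a time.
This article covers the most basic method I have used, which is also the one most people think of first: indexing by pinyin.
Below I walk through how I did it in my own application.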
1. hashMap.java
This file builds a LinkedHashMap that maps each pinyin syllable to an integer index, as shown below:
hashMap.put("a", 0);
hashMap.put("ai", 1);
hashMap.put("an", 2);
hashMap.put("ang", 3);
hashMap.put("ao", 4);
hashMap.put("ba", 5);
hashMap.put("bai", 6);
hashMap.put("ban", 7);
…
The syllable "a" is assigned index 0, "ai" index 1, and so on.
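Since the indices simply follow the order of the syllables, the same table could also be generated from an ordered list instead of hand-written put calls. A minimal sketch, assuming the syllable list is available as an array; the class name PinyinIndexSketch and the method buildIndex are made up for illustration, and only the eight syllables shown above are included:

```java
import java.util.LinkedHashMap;
import java.util.Map;

class PinyinIndexSketch {
    // Only the syllables listed in the article; the real table continues.
    static final String[] SYLLABLES = {"a", "ai", "an", "ang", "ao", "ba", "bai", "ban"};

    /** Assigns each syllable its position in the array as its index. */
    static Map<String, Integer> buildIndex(String[] syllables) {
        Map<String, Integer> index = new LinkedHashMap<>();
        for (int i = 0; i < syllables.length; i++) {
            index.put(syllables[i], i);
        }
        return index;
    }

    public static void main(String[] args) {
        Map<String, Integer> index = buildIndex(SYLLABLES);
        System.out.println(index.get("a"));   // 0
        System.out.println(index.get("ban")); // 7
        System.out.println(index.get("zz"));  // null: not a known syllable
    }
}
```

A lookup that returns null is the case the dictionary code later has to route to its "unrecognized" bucket.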
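2. CnToSpell.java
This file converts Chinese characters into their pronunciation; for example, the input “你好” returns “nihao”. It relies on the getCnAscii() method, which maps a character to an int value derived from its national-standard (GB) code, as follows: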
// Requires: import java.io.UnsupportedEncodingException;
public static int getCnAscii(char cn) {
    byte[] bytes;
    try {
        bytes = String.valueOf(cn).getBytes("gbk");
    } catch (UnsupportedEncodingException e) {
        // GBK is not available on this JVM; treat the character as unknown.
        e.printStackTrace();
        return 0;
    }
    if (bytes.length == 1) {
        // Single-byte (ASCII) character: return the byte as-is.
        return bytes[0];
    }
    if (bytes.length == 2) {
        // Mask so that each byte is read as an unsigned value (0-255).
        int highByte = bytes[0] & 0xFF;
        int lowByte = bytes[1] & 0xFF;
        // Subtract 65536 to land in the negative range used by the pinyin table.
        return (256 * highByte + lowByte) - 256 * 256;
    }
    return 0;
}
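To see where the numbers in the pinyin table come from, it helps to work one character through by hand. The character 啊 is encoded in GBK as the bytes 0xB0 0xA1, so its code is 0xB0A1 = 45217, and 45217 - 65536 = -20319. A minimal sketch of the same arithmetic, assuming the JVM ships the GBK charset; the class name GbkCodeSketch and the method gbkCode are made up for illustration:

```java
import java.nio.charset.Charset;

class GbkCodeSketch {
    /** Returns the GBK code of a two-byte character minus 65536, or the byte itself for ASCII. */
    static int gbkCode(char c) {
        byte[] bytes = String.valueOf(c).getBytes(Charset.forName("GBK"));
        if (bytes.length == 1) {
            return bytes[0];
        }
        if (bytes.length != 2) {
            return 0;
        }
        int high = bytes[0] & 0xFF; // unsigned high byte
        int low = bytes[1] & 0xFF;  // unsigned low byte
        return ((high << 8) | low) - 0x10000;
    }

    public static void main(String[] args) {
        System.out.println(gbkCode('\u554A')); // the character 啊: -20319
        System.out.println(gbkCode('A'));      // 65
    }
}
```

The value -20319 is exactly the one paired with "a" in the pinyin table.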

The initialize() method then stores the correspondence between these values and the pinyin syllables in a hash map, declared as:
LinkedHashMap<String, Integer> spellMap = new LinkedHashMap<String, Integer>(400);
spellPut("a", -20319);
spellPut("ai", -20317);
spellPut("an", -20304);
spellPut("ang", -20295);
spellPut("ao", -20292);
spellPut("ba", -20283);
spellPut("bai", -20265);
spellPut("ban", -20257);
spellPut("bang", -20242);
spellPut("bao", -20230);
spellPut("bei", -20051);
spellPut("ben", -20036);
spellPut("beng", -20032);
……

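So to convert a Chinese character to pinyin, first call getCnAscii() to obtain its int value, then use the hash map to find the matching pronunciation.
Note that a polyphonic character is assigned only the first of its readings in the table. Characters whose pinyin cannot be found at all, such as some rare or complex characters, are grouped into a separate class. (There are not many such characters, so this has little effect on performance.)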
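The lookup step can be sketched as a range search. This assumes that each value in the table marks the first code of a contiguous run of characters sharing that pinyin, which is consistent with the values above (啊 is -20319 and the next character, 阿, is -20318, both read "a"). The class SpellLookupSketch and the method spellOf are made up for illustration and are not the original CnToSpell code; a TreeMap is swapped in so that floorEntry can find the nearest starting code at or below the input:

```java
import java.util.Map;
import java.util.TreeMap;

class SpellLookupSketch {
    // Starting code of each pinyin range; truncated to the first entries of the table.
    private static final TreeMap<Integer, String> STARTS = new TreeMap<>();
    static {
        STARTS.put(-20319, "a");
        STARTS.put(-20317, "ai");
        STARTS.put(-20304, "an");
        STARTS.put(-20295, "ang");
        STARTS.put(-20292, "ao");
        STARTS.put(-20283, "ba");
        STARTS.put(-20265, "bai");
    }

    /** Returns the pinyin whose range contains the code, or null if the code precedes the table. */
    static String spellOf(int code) {
        Map.Entry<Integer, String> entry = STARTS.floorEntry(code);
        return entry == null ? null : entry.getValue();
    }

    public static void main(String[] args) {
        System.out.println(spellOf(-20319)); // a
        System.out.println(spellOf(-20318)); // a
        System.out.println(spellOf(-20317)); // ai
        System.out.println(spellOf(-20320)); // null
    }
}
```

Because this table is truncated, only codes below -20265 give meaningful answers here; a complete table would also need an upper bound so that codes past the last range are treated as unknown.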
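3. Dictionary.java
This file reads a plain-text dictionary file line by line and groups the entries into buckets according to the pronunciation of their first character.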
public void makeDictionary() throws IOException {
    // Open the dictionary file; try-with-resources closes it when done.
    try (BufferedReader bf = new BufferedReader(new FileReader("stopword.txt"))) {
        String dicItem; // the entry on the current line
        while ((dicItem = bf.readLine()) != null) {
            if (dicItem.isEmpty()) {
                continue; // skip blank lines; substring(0, 1) would throw on them
            }
            String firstCharacter = dicItem.substring(0, 1); // first character of the entry
            String ziYin = CnToSpell.getFullSpell(firstCharacter); // pinyin of that character
            // Bucket index for this pinyin; bucket 395 collects entries whose pinyin is not recognized.
            Integer mapped = hashMap.get(ziYin);
            int currentZiYinIndex = (mapped != null) ? mapped : 395;
            // Append the entry to its bucket as a space-separated string.
            if (dicStr[currentZiYinIndex] == null) {
                dicStr[currentZiYinIndex] = dicItem + " ";
            } else {
                dicStr[currentZiYinIndex] += dicItem + " ";
            }
        }
    }
}
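One detail worth checking when loading a Chinese word list is the reading step itself: FileReader decodes the file with the platform default charset, so the same file can load differently on different machines. A minimal sketch of a reading helper, assuming one entry per line; the class WordListReaderSketch and the method readWords are made up for illustration, and an in-memory string stands in for the file:

```java
import java.io.BufferedReader;
import java.io.StringReader;
import java.util.List;
import java.util.stream.Collectors;

class WordListReaderSketch {
    /** Reads one entry per line, trimming whitespace and dropping blank lines. */
    static List<String> readWords(BufferedReader reader) {
        return reader.lines()
                .map(String::trim)
                .filter(line -> !line.isEmpty())
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Stand-in for the dictionary file: two entries, a blank line, and stray spaces.
        String text = "\u7684\n\n \u4E86 \n";
        List<String> words = readWords(new BufferedReader(new StringReader(text)));
        System.out.println(words.size()); // 2
    }
}
```

For a real file, a reader with an explicit charset can be obtained from Files.newBufferedReader(Paths.get("stopword.txt"), StandardCharsets.UTF_8); UTF-8 is an assumption here, since the article does not say how the file is encoded.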
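Lookup works the same way: convert the first character of the entry to pinyin, then use that pinyin to narrow the search to a single bucket, which speeds up the query. The code is as follows: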
public boolean lookDictionary(String words) {
    if (words == null || words.isEmpty()) {
        return false; // nothing to look up
    }
    String firstWord = words.substring(0, 1); // first character of the entry
    String ziYin = CnToSpell.getFullSpell(firstWord); // pinyin of that character
    // Bucket index for this pinyin; bucket 395 collects entries whose pinyin is not recognized.
    Integer mapped = hashMap.get(ziYin);
    int index = (mapped != null) ? mapped : 395;
    if (dicStr[index] == null) {
        return false; // empty bucket
    }
    // Split the space-separated bucket into entries and scan for an exact match.
    for (String singleWord : dicStr[index].split(" ")) {
        if (singleWord.equals(words)) {
            return true;
        }
    }
    return false;
}
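The lookup above splits the bucket string and scans it on every query. A variation on the same bucketing idea keeps each bucket as a HashSet, so membership inside a bucket is a single hash lookup. This is a minimal sketch and not the code in the attachment: the class BucketedDictionarySketch is made up for illustration, and bucketKey is a hypothetical stand-in for CnToSpell.getFullSpell that simply uses the first character as the key:

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

class BucketedDictionarySketch {
    private final Map<String, Set<String>> buckets = new HashMap<>();

    /** Hypothetical stand-in for CnToSpell.getFullSpell: the first character is the bucket key. */
    static String bucketKey(String word) {
        return word.substring(0, 1);
    }

    /** Adds an entry to the bucket chosen by its first character. */
    void add(String word) {
        if (word == null || word.isEmpty()) {
            return;
        }
        buckets.computeIfAbsent(bucketKey(word), k -> new HashSet<>()).add(word);
    }

    /** Returns true only if the exact entry was added before. */
    boolean contains(String word) {
        if (word == null || word.isEmpty()) {
            return false;
        }
        Set<String> bucket = buckets.get(bucketKey(word));
        return bucket != null && bucket.contains(word);
    }

    public static void main(String[] args) {
        BucketedDictionarySketch dict = new BucketedDictionarySketch();
        dict.add("\u4F60\u597D"); // 你好
        dict.add("\u4F60\u4EEC"); // 你们
        System.out.println(dict.contains("\u4F60\u597D")); // true
        System.out.println(dict.contains("\u4F60"));       // false: only whole entries match
    }
}
```

Swapping bucketKey for a real pinyin conversion would reproduce the article's grouping, with words whose first characters share a pronunciation landing in the same bucket.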
Of course, there are several other mainstream ways to build a segmentation dictionary; I will discuss them in later articles.

Attachment: dic.rar (6.9 KB)

2010-10-05 16:08