How the paoding word segmenter builds its dictionary

单眼皮大娘

Blog category: word segmentation

paoding, word segmentation, dictionary structure, HashBinaryDictionary, BinaryDictionary

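    However much word segmentation tools differ, they all include a dictionary management module (this applies, of course, to segmenters based on string matching). Even a semantics-based segmenter needs a semantic dictionary, a statistics-based one needs a word-frequency dictionary, and so on.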

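    After surveying mmseg4j, ictclas4j (imdict belongs to the same family as ictclas4j; it simply drops ictclas4j's named-entity recognition for efficiency), IKAnalyzer, paoding and other segmenters, I found that their dictionary management is broadly similar. The rest of this post uses paoding as the example to explain the dictionary management module of a segmenter.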

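    First, paoding's dictionary data structures. The code below is the dictionary interface; both BinaryDictionary and HashBinaryDictionary implement it. This is programming to an interface: the main logic does not have to change, and the design is easy to extend.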

public interface Dictionary {
	public int size();
	public Word get(int index);
	public Hit search(CharSequence input, int offset, int count);
}

Data structure of HashBinaryDictionary

public class HashBinaryDictionary implements Dictionary {

	/**
	 * All words in the dictionary, kept to support {@link #get(int)}
	 */
	private Word[] ascWords;

	/**
	 * Mapping from a leading character to its sub-dictionary
	 */
	private Map/* <Object, SubDictionaryWrap> */subs;
	private final int hashIndex;

	private final int start;
	private final int end;
	private final int count;
}

Data structure of BinaryDictionary

public class BinaryDictionary implements Dictionary {

	// -------------------------------------------------

	private Word[] ascWords;

	private final int start;
	private final int end;
	private final int count;
}

The dictionary file is first loaded into a HashSet, which removes duplicate entries. The entries are then copied into an array, and the call

Arrays.sort(array);

sorts the array in ascending order, which is what makes the later binary search possible.
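The load, deduplicate, sort, search sequence described here can be tried in isolation. The following is a minimal sketch, not paoding's code: it uses plain String entries instead of paoding's Word class, and the class and method names (SortedWordArrayDemo, buildSortedArray, contains) are made up for illustration.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class SortedWordArrayDemo {

    // Collapse duplicates with a HashSet, then sort so that binary search is valid.
    static String[] buildSortedArray(List<String> rawEntries) {
        Set<String> unique = new HashSet<>(rawEntries);
        String[] words = unique.toArray(new String[0]);
        Arrays.sort(words); // natural String order, i.e. by UTF-16 code unit
        return words;
    }

    // True if the word is present in the sorted array.
    static boolean contains(String[] sortedWords, String word) {
        return Arrays.binarySearch(sortedWords, word) >= 0;
    }

    public static void main(String[] args) {
        // Hypothetical entries; a real dictionary file would supply these.
        String[] words = buildSortedArray(Arrays.asList("bee", "ant", "bee", "cow", "ant"));
        System.out.println(Arrays.toString(words));
        System.out.println(contains(words, "cow"));
        System.out.println(contains(words, "dog"));
    }
}
```

Arrays.binarySearch only gives meaningful results on a sorted array, which is why the sort has to happen before any lookup.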

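    Next, let's see how that array is turned into a BinaryDictionary (note: the leaves of a HashBinaryDictionary are ultimately stored as BinaryDictionary instances too).

    First, the following method of the FileDictionaries class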

public synchronized Dictionary getVocabularyDictionary()

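loads the entries of the dictionary file into an array. The HashBinaryDictionary constructor then starts building the hashed structure, using the first character of each word as the key of its sub-dictionary. A sub-dictionary with few entries is stored directly as a BinaryDictionary; one with more entries than a certain threshold becomes another HashBinaryDictionary keyed on the second character, and so on recursively. Here is the relevant code in paoding: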

public HashBinaryDictionary(Word[] ascWords, int hashIndex, int start,
		int end, int initialCapacity, float loadFactor) {
	this.ascWords = ascWords;
	this.start = start;
	this.end = end;
	this.count = end - start;
	this.hashIndex = hashIndex;
	subs = new HashMap(initialCapacity , loadFactor);
	createSubDictionaries();
}

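start is the offset in ascWords at which the sub-dictionary begins, and end is the offset at which it ends (exclusive, since count = end - start). initialCapacity and loadFactor are passed straight to the HashMap that holds the sub-dictionaries and determine its capacity (presumably to save space, since this structure does trade memory for lookup speed).
Now look at createSubDictionaries():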

protected void createSubDictionaries() {
	if (this.start >= ascWords.length) {
		return;
	}

	// Locate the start and end positions of the words that share the same
	// leading character, to delimit each sub-dictionary
	int beginIndex = this.start;
	int endIndex = this.start + 1;

	char beginHashChar = getChar(ascWords[start], hashIndex);
	char endHashChar;
	for (; endIndex < this.end; endIndex++) {
		endHashChar = getChar(ascWords[endIndex], hashIndex);
		if (endHashChar != beginHashChar) {
			addSubDictionary(beginHashChar, beginIndex, endIndex);
			beginIndex = endIndex;
			beginHashChar = endHashChar;
		}
	}
	addSubDictionary(beginHashChar, beginIndex, this.end);
}
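To see what this partitioning loop produces, here is a small standalone sketch that runs the same kind of scan over plain strings and records the [begin, end) range of each run. It is an illustration only: the names FirstCharGroupingDemo, charAt and groupByChar are hypothetical, and returning '\0' for a word shorter than the index is an assumption of this sketch.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class FirstCharGroupingDemo {

    // Character at the given index, or '\0' when the word is too short
    // (an assumption made for this sketch).
    static char charAt(String word, int index) {
        return index < word.length() ? word.charAt(index) : '\0';
    }

    // Splits sortedWords[start, end) into runs of words that share the same
    // character at hashIndex. Each value is {beginIndex, endIndex}, end exclusive.
    static Map<Character, int[]> groupByChar(String[] sortedWords, int start, int end, int hashIndex) {
        Map<Character, int[]> ranges = new LinkedHashMap<>();
        if (start >= end) {
            return ranges;
        }
        int begin = start;
        char current = charAt(sortedWords[start], hashIndex);
        for (int i = start + 1; i < end; i++) {
            char c = charAt(sortedWords[i], hashIndex);
            if (c != current) {
                ranges.put(current, new int[] {begin, i}); // close the previous run
                begin = i;
                current = c;
            }
        }
        ranges.put(current, new int[] {begin, end}); // close the last run
        return ranges;
    }

    public static void main(String[] args) {
        String[] words = {"ant", "apple", "bee", "cat", "cow"}; // must already be sorted
        for (Map.Entry<Character, int[]> e : groupByChar(words, 0, words.length, 0).entrySet()) {
            System.out.println(e.getKey() + " -> [" + e.getValue()[0] + ", " + e.getValue()[1] + ")");
        }
    }
}
```

Because the array is sorted, words with the same character at hashIndex are contiguous, so a single pass is enough and each sub-dictionary can be described by two offsets into the shared array rather than by a copy.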

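Roughly, it compares the words by their first character in order to partition them: all words with the same first character form one sub-dictionary. hashIndex starts at 0, meaning the first character. So when does hashIndex change?
A look at addSubDictionary(beginHashChar, beginIndex, this.end) gives a hint.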

protected void addSubDictionary(char hashChar, int beginIndex, int endIndex) {
	Dictionary subDic = createSubDictionary(ascWords, beginIndex, endIndex);
	SubDictionaryWrap subDicWrap = new SubDictionaryWrap(hashChar,
			subDic, beginIndex);
	subs.put(keyOf(hashChar), subDicWrap);
}

Nothing obvious there, so look at createSubDictionary next:

protected Dictionary createSubDictionary(Word[] ascWords, int beginIndex,
		int endIndex) {
	int count = endIndex - beginIndex;
	if (count < 16) {
		return new BinaryDictionary(ascWords, beginIndex, endIndex);
	} else {
		return new HashBinaryDictionary(ascWords, hashIndex + 1,
				beginIndex, endIndex, getCapacity(count), 0.75f);
	}
}

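Now it is clear: when a sub-dictionary holds more than a certain number of entries, hashIndex is incremented by 1. paoding's author set that number to 15 (the count < 16 test keeps sub-dictionaries of up to 15 entries as a BinaryDictionary). createSubDictionary also shows that building the structure is recursive: a sub-dictionary with more entries than the threshold is split again into smaller dictionaries. A hash lookup maps directly to its target and is much faster than a binary search, but replacing binary search with hashing everywhere would cost a lot of memory, and once the dictionary grew large enough it would sooner or later throw an OOM error. That is a real headache, and my guess is that the author took it into account and chose this compromise.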
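The recursion can be reproduced in a few dozen lines. The sketch below is a simplified stand-in, not paoding's implementation: it works on String rather than Word, supports only an exact-match contains() instead of search() with its Hit result, and every name in it (HybridDictionarySketch, HashNode, BinaryNode, THRESHOLD) is hypothetical. It keeps the same idea: hash on one character per level, and fall back to binary search once a group has fewer than 16 entries.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.HashMap;
import java.util.Map;
import java.util.TreeSet;

final class HybridDictionarySketch {
    static final int THRESHOLD = 16; // same cut-off as the count < 16 test quoted above

    interface Node { boolean contains(String word); }

    private final String[] words; // sorted and duplicate-free
    private final Node root;

    HybridDictionarySketch(Collection<String> entries) {
        // TreeSet both deduplicates and sorts; distinct words guarantee the recursion ends.
        this.words = new TreeSet<>(entries).toArray(new String[0]);
        this.root = new HashNode(0, 0, words.length);
    }

    boolean contains(String word) { return root.contains(word); }

    static char charAt(String w, int i) { return i < w.length() ? w.charAt(i) : '\0'; }

    // Leaf: binary search inside words[begin, end).
    final class BinaryNode implements Node {
        final int begin, end;
        BinaryNode(int begin, int end) { this.begin = begin; this.end = end; }
        public boolean contains(String word) {
            return Arrays.binarySearch(words, begin, end, word) >= 0;
        }
    }

    // Inner node: one child per distinct character at position hashIndex.
    final class HashNode implements Node {
        final int hashIndex;
        final Map<Character, Node> subs = new HashMap<>();

        HashNode(int hashIndex, int start, int end) {
            this.hashIndex = hashIndex;
            int begin = start;
            for (int i = start + 1; i <= end; i++) {
                // A run ends at the end of the range or when the character changes.
                if (i == end || charAt(words[i], hashIndex) != charAt(words[begin], hashIndex)) {
                    Node child = (i - begin) < THRESHOLD
                            ? new BinaryNode(begin, i)
                            : new HashNode(hashIndex + 1, begin, i); // recurse on the next character
                    subs.put(charAt(words[begin], hashIndex), child);
                    begin = i;
                }
            }
        }

        public boolean contains(String word) {
            Node child = subs.get(charAt(word, hashIndex));
            return child != null && child.contains(word);
        }
    }
}
```

A lookup walks one hash level per character until it reaches a leaf, then finishes with a binary search over at most 15 entries. All nodes share the one sorted array and store only offsets, so the extra memory goes into the HashMap instances, which is the cost the threshold is meant to limit.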

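    That essentially completes the construction of the dictionary data structure.

    Afterword: personally I don't think this structure is ideal, because its overhead is too high. ictclas4j from the Chinese Academy of Sciences uses binary search and relies mainly on its dictionary: curate it by hand once and benefit forever. Although ictclas4j stores its entries in an ArrayList, the dictionary was organized by hand and is already ordered by the code value of the Chinese characters, so the list is sorted from the start and needs no further shuffling. This structure has a small memory footprint, but it has one drawback: its extensibility is almost zero.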
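For comparison, the pre-sorted list approach looks roughly like this. It is a generic illustration of binary search over an already ordered ArrayList, not ictclas4j's actual code; PresortedListDemo and its methods are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

final class PresortedListDemo {

    // Lookup in a list that is already in ascending order.
    static boolean contains(List<String> sorted, String word) {
        return Collections.binarySearch(sorted, word) >= 0;
    }

    // Adding a word must preserve the order. On a miss, binarySearch
    // returns -(insertionPoint) - 1, which tells us where the word belongs.
    static boolean insert(List<String> sorted, String word) {
        int pos = Collections.binarySearch(sorted, word);
        if (pos >= 0) {
            return false; // already present
        }
        sorted.add(-pos - 1, word); // shifts every later element: O(n) for an ArrayList
        return true;
    }

    public static void main(String[] args) {
        // Hypothetical entries standing in for a dictionary file that ships sorted.
        List<String> dict = new ArrayList<>(Arrays.asList("ant", "bee", "cow"));
        System.out.println(contains(dict, "bee"));
        insert(dict, "cat");
        System.out.println(dict);
    }
}
```

No extra index is built, so memory stays low, but every addition has to keep the list in order, which is one way to read the extensibility concern raised above.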


2012-05-02 16:58