庖丁分词的源码分析（3）字典文件的生成和使用

huangyunbin

浏览: 2630230 次
性别:
来自: 广州

最近访客更多访客>>

cht的大摩托

xiaoxiaoHer

zzqfsy

为了ta

博主相关

博客

微博

相册

留言

关于我

文章分类

社区版块

存档分类

博客分类：

庖丁分词的源码分析

庖丁分词源码分析

可以看到庖丁分词里有两种字典，一种读取的是编译处理之前的，一种是处理之后的。
基本上看他们的getVocabularyDictionary方法就可以了，这个是获取字典文件的主要方法。

在这里两个字典对象的基本相同，只是处理前的要过滤一些词，而处理过的直接就可以了。

处理后的字典对象的getVocabularyDictionary方法：

/**
	 * 词汇表字典
	 * 
	 * @return
	 */
	public synchronized Dictionary getVocabularyDictionary() {
		if (vocabularyDictionary == null) {
			// 大概有5639个字有词语，故取0x2fff=x^13>8000>8000*0.75=6000>5639
			vocabularyDictionary = new HashBinaryDictionary(
					getVocabularyWords(), 0x2fff, 0.75f);
		}
		return vocabularyDictionary;
	}

其中的关键其实就是这个HashBinaryDictionary了。其实这个就是个多叉树。
用这个方法来生成字典的。

protected void createSubDictionaries() {
		if (this.start >= ascWords.length) {
			return;
		}
		
		// 定位相同头字符词语的开头和结束位置以确认分字典
		int beginIndex = this.start;
		int endIndex = this.start + 1;
		
		char beginHashChar = getChar(ascWords[start], hashIndex);
		char endHashChar;
		for (; endIndex < this.end; endIndex++) {
			endHashChar = getChar(ascWords[endIndex], hashIndex);
			if (endHashChar != beginHashChar) {
				addSubDictionary(beginHashChar, beginIndex, endIndex);
				beginIndex = endIndex;
				beginHashChar = endHashChar;
			}
		}
		addSubDictionary(beginHashChar, beginIndex, this.end);
	}

说明一下，传入字典的这个大数组是排好序的。endHashChar != beginHashChar，说的是相同位置的字不一样的时候，如一一一二二二，第三个字符串二二和前面的两个字的第一个字不一样，前面的两个都是一二二的第一个字是二。这个时候执行addSubDictionary。意思就是说相同位置的字相同，那么久在一个SubDictionary，就是说这些是个分叉树。然后继续这样不断分叉下去。

/**
	 * 将位置在beginIndex和endIndex之间(不包括endIndex)的词语作为一个分词典
	 * 
	 * @param hashChar
	 * @param beginIndex
	 * @param endIndex
	 */
	protected void addSubDictionary(char hashChar, int beginIndex, int endIndex) {
		Dictionary subDic = createSubDictionary(ascWords, beginIndex, endIndex);
		SubDictionaryWrap subDicWrap = new SubDictionaryWrap(hashChar,
				subDic, beginIndex);
		subs.put(keyOf(hashChar), subDicWrap);
	}

注意subDicWrap包装的可能是HashBinaryDictionary，也可能是BinaryDictionary（二叉树），当数目小于16的时候就用二叉树了。

protected Dictionary createSubDictionary(Word[] ascWords, int beginIndex,
			int endIndex) {
		int count = endIndex - beginIndex;
		if (count < 16) {
			return new BinaryDictionary(ascWords, beginIndex, endIndex);
		} else {
			return new HashBinaryDictionary(ascWords, hashIndex + 1,
					beginIndex, endIndex, getCapacity(count), 0.75f);
		}
	}

所以其实整个字典就是个多叉树。

字典的使用：其实就是用个关键词去匹配这个多叉树，
HashBinaryDictionary的查找方法

public Hit search(CharSequence input, int begin, int count) {
		SubDictionaryWrap subDic = (SubDictionaryWrap) subs.get(keyOf(input
				.charAt(hashIndex + begin)));
		if (subDic == null) {
			return Hit.UNDEFINED;
		}
		Dictionary dic = subDic.dic;
		// 对count==hashIndex + 1的处理
		if (count == hashIndex + 1) {
			Word header = dic.get(0);
			if (header.length() == hashIndex + 1) {
				if (subDic.wordIndexOffset + 1 < this.ascWords.length) {
					return new Hit(subDic.wordIndexOffset, header,
							this.ascWords[subDic.wordIndexOffset + 1]);
				} else {
					return new Hit(subDic.wordIndexOffset, header, null);
				}
			} else {
				return new Hit(Hit.UNCLOSED_INDEX, null, header);
			}
		}
		// count > hashIndex + 1
		Hit word = dic.search(input, begin, count);
		if (word.isHit()) {
			int index = subDic.wordIndexOffset + word.getIndex();
			word.setIndex(index);
			if (word.getNext() == null && index < size()) {
				word.setNext(get(index + 1));
			}
		}
		return word;
	}

注意这里的状态有 Hit的几个状态来对应找到：index>0，找不到:UNCLOSED_INDEX，和以这个词开头的都找不到:UNDEFINED。至于为什么要这些状态，这些在分词的环节中是有大用处的。

二叉树的和这个也类似我就不细说了。