论坛首页 → Java企业应用论坛 →

庖丁分词的源码分析（3）字典文件的生成和使用

全部 Hibernate Spring Struts iBATIS 企业应用 Lucene SOA Java综合 Tomcat 设计模式 OO JBoss

浏览 1692 次

锁定老帖子主题：庖丁分词的源码分析（3）字典文件的生成和使用精华帖 (0) :: 良好帖 (0) :: 新手帖 (0) :: 隐藏帖 (0)
作者	正文
huangyunbin 等级: 性别: 文章: 38 积分: 300 来自: 广州	发表时间：2013-04-08 相关推荐: 庖丁分词的源码分析 ElasticSearch最全分词器比较及使用方法 Elasticsearch 默认分词器和中分分词器之间的比较及使用方法 ElasticSearch源码解析（一）:转一篇介绍中文分词的文章 Elasticsearch 系列文章（一）：Elasticsearch 默认分词器和中分分词器之间的比较及使用方法更多相关推荐庖丁分词源码分析可以看到庖丁分词里有两种字典，一种读取的是编译处理之前的，一种是处理之后的。基本上看他们的getVocabularyDictionary方法就可以了，这个是获取字典文件的主要方法。在这里两个字典对象的基本相同，只是处理前的要过滤一些词，而处理过的直接就可以了。处理后的字典对象的getVocabularyDictionary方法： /** * 词汇表字典 * * @return / public synchronized Dictionary getVocabularyDictionary() { if (vocabularyDictionary == null) { // 大概有5639个字有词语，故取0x2fff=x^13>8000>80000.75=6000>5639 vocabularyDictionary = new HashBinaryDictionary( getVocabularyWords(), 0x2fff, 0.75f); } return vocabularyDictionary; } 其中的关键其实就是这个HashBinaryDictionary了。其实这个就是个多叉树。用这个方法来生成字典的。 protected void createSubDictionaries() { if (this.start >= ascWords.length) { return; } // 定位相同头字符词语的开头和结束位置以确认分字典 int beginIndex = this.start; int endIndex = this.start + 1; char beginHashChar = getChar(ascWords[start], hashIndex); char endHashChar; for (; endIndex < this.end; endIndex++) { endHashChar = getChar(ascWords[endIndex], hashIndex); if (endHashChar != beginHashChar) { addSubDictionary(beginHashChar, beginIndex, endIndex); beginIndex = endIndex; beginHashChar = endHashChar; } } addSubDictionary(beginHashChar, beginIndex, this.end); } 说明一下，传入字典的这个大数组是排好序的。endHashChar != beginHashChar，说的是相同位置的字不一样的时候，如一一一二二二，第三个字符串二二和前面的两个字的第一个字不一样，前面的两个都是一二二的第一个字是二。这个时候执行addSubDictionary。意思就是说相同位置的字相同，那么久在一个SubDictionary，就是说这些是个分叉树。然后继续这样不断分叉下去。 /** * 将位置在beginIndex和endIndex之间(不包括endIndex)的词语作为一个分词典 * * @param hashChar * @param beginIndex * @param endIndex */ protected void addSubDictionary(char hashChar, int beginIndex, int endIndex) { Dictionary subDic = createSubDictionary(ascWords, beginIndex, endIndex); SubDictionaryWrap subDicWrap = new SubDictionaryWrap(hashChar, subDic, beginIndex); subs.put(keyOf(hashChar), subDicWrap); } 注意subDicWrap包装的可能是HashBinaryDictionary，也可能是BinaryDictionary（二叉树），当数目小于16的时候就用二叉树了。 protected Dictionary createSubDictionary(Word[] ascWords, int beginIndex, int endIndex) { int count = endIndex - beginIndex; if (count < 16) { return new BinaryDictionary(ascWords, beginIndex, endIndex); } else { return new HashBinaryDictionary(ascWords, hashIndex + 1, beginIndex, endIndex, getCapacity(count), 0.75f); } } 所以其实整个字典就是个多叉树。字典的使用：其实就是用个关键词去匹配这个多叉树， HashBinaryDictionary的查找方法 public Hit search(CharSequence input, int begin, int count) { SubDictionaryWrap subDic = (SubDictionaryWrap) subs.get(keyOf(input .charAt(hashIndex + begin))); if (subDic == null) { return Hit.UNDEFINED; } Dictionary dic = subDic.dic; // 对count==hashIndex + 1的处理 if (count == hashIndex + 1) { Word header = dic.get(0); if (header.length() == hashIndex + 1) { if (subDic.wordIndexOffset + 1 < this.ascWords.length) { return new Hit(subDic.wordIndexOffset, header, this.ascWords[subDic.wordIndexOffset + 1]); } else { return new Hit(subDic.wordIndexOffset, header, null); } } else { return new Hit(Hit.UNCLOSED_INDEX, null, header); } } // count > hashIndex + 1 Hit word = dic.search(input, begin, count); if (word.isHit()) { int index = subDic.wordIndexOffset + word.getIndex(); word.setIndex(index); if (word.getNext() == null && index < size()) { word.setNext(get(index + 1)); } } return word; } 注意这里的状态有 Hit的几个状态来对应找到：index>0，找不到:UNCLOSED_INDEX，和以这个词开头的都找不到:UNDEFINED。至于为什么要这些状态，这些在分词的环节中是有大用处的。二叉树的和这个也类似我就不细说了。大小: 4.8 KB 查看图片附件声明：ITeye文章版权属于作者，受法律保护。没有作者书面许可不得转载。推荐链接
返回顶楼

论坛首页 → Java企业应用版

跳转论坛:

Global site tag (gtag.js) - Google Analytics