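Reposted from: http://yangshangchuan.iteye.com/blog/2056537 (code available for download there)
Evaluating the segmentation quality of the word, ansj, mmseg4j, and ik-analyzer segmenters
word is a Chinese word segmentation component implemented in Java. It provides several dictionary-based segmentation algorithms and uses an ngram model to resolve ambiguity. It accurately recognizes English words, numbers, and quantity expressions such as dates and times, and it can recognize out-of-vocabulary words such as person names, place names, and organization names. Lucene, Solr, and ElasticSearch plugins are also provided.
The evaluation of the word segmenter mainly covers the following 7 segmentation algorithms:
Forward maximum matching: MaximumMatching
Reverse maximum matching: ReverseMaximumMatching
Forward minimum matching: MinimumMatching
Reverse minimum matching: ReverseMinimumMatching
Bidirectional maximum matching: BidirectionalMaximumMatching
Bidirectional minimum matching: BidirectionalMinimumMatching
Bidirectional maximum-minimum matching: BidirectionalMaximumMinimumMatching
All of the bidirectional algorithms use ngram to disambiguate, so the evaluation is run separately with bigram and with trigram.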
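To make the idea of bidirectional matching with ngram disambiguation concrete, here is a minimal sketch. It is not the word library's implementation: the class name, the toy dictionary, the bigram scores, and the unspaced English input are all assumptions chosen for illustration. It runs forward and reverse maximum matching over the same text and keeps whichever candidate has the higher total bigram score.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class BidirectionalSketch {
    // Toy dictionary and bigram scores, both made up for this example.
    static final Set<String> DICT = new HashSet<>(
            Arrays.asList("the", "theta", "table", "bled", "own", "down", "there"));
    static final Map<String, Integer> BIGRAM = new HashMap<>();
    static {
        BIGRAM.put("the table", 10);
        BIGRAM.put("table down", 3);
        BIGRAM.put("down there", 6);
        BIGRAM.put("own there", 1);
    }
    static final int MAX_LEN = 5; // longest dictionary word

    // Forward maximum matching: at each position take the longest dictionary word,
    // falling back to a single character when nothing matches.
    static List<String> forward(String text) {
        List<String> words = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            int len = Math.min(MAX_LEN, text.length() - i);
            while (len > 1 && !DICT.contains(text.substring(i, i + len))) {
                len--;
            }
            words.add(text.substring(i, i + len));
            i += len;
        }
        return words;
    }

    // Reverse maximum matching: same idea, scanning from the end of the text.
    static List<String> reverse(String text) {
        List<String> words = new ArrayList<>();
        int end = text.length();
        while (end > 0) {
            int len = Math.min(MAX_LEN, end);
            while (len > 1 && !DICT.contains(text.substring(end - len, end))) {
                len--;
            }
            words.add(text.substring(end - len, end));
            end -= len;
        }
        Collections.reverse(words);
        return words;
    }

    // Sum of bigram scores over adjacent word pairs; unseen pairs score 0.
    static int score(List<String> words) {
        int total = 0;
        for (int i = 0; i + 1 < words.size(); i++) {
            total += BIGRAM.getOrDefault(words.get(i) + " " + words.get(i + 1), 0);
        }
        return total;
    }

    // Keep the candidate with the higher bigram score; ties go to the forward result.
    static List<String> bidirectional(String text) {
        List<String> f = forward(text);
        List<String> r = reverse(text);
        return score(r) > score(f) ? r : f;
    }

    public static void main(String[] args) {
        String text = "thetabledownthere";
        System.out.println(forward(text));       // [theta, bled, own, there]
        System.out.println(reverse(text));       // [the, table, down, there]
        System.out.println(bidirectional(text)); // [the, table, down, there]
    }
}
```

A trigram variant would score triples of adjacent words instead of pairs; the selection step stays the same.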
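The evaluation uses a test text of 2,533,709 lines and 28,374,490 characters. The standard text corresponds to the test text line by line, and the words in the standard text are separated by spaces. The criterion is strict equality: a line counts as correct only when its segmentation is identical to the standard line. The core evaluation code is as follows: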
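- /**
- * Segmentation quality evaluation
- * @param resultText path to the file with the actual segmentation result
- * @param standardText path to the file with the standard segmentation result
- * @return the evaluation result
- */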
- public static EvaluationResult evaluation(String resultText, String standardText) {
- int perfectLineCount=0;
- int wrongLineCount=0;
- int perfectCharCount=0;
- int wrongCharCount=0;
- try(BufferedReader resultReader = new BufferedReader(new InputStreamReader(new FileInputStream(resultText),"utf-8"));
- BufferedReader standardReader = new BufferedReader(new InputStreamReader(new FileInputStream(standardText),"utf-8"))){
- String result;
- while( (result = resultReader.readLine()) != null ){
- result = result.trim();
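- String standardLine = standardReader.readLine();
- if(standardLine == null){
- //the standard file has fewer lines than the result file: stop comparing
- break;
- }
- String standard = standardLine.trim();
- if(result.equals("")){
- continue;
- }
- if(result.equals(standard)){
- //the segmentation result is identical to the standard
- perfectLineCount++;
- perfectCharCount+=standard.replaceAll("\\s+", "").length();
- }else{
- //the segmentation result differs from the standard
- wrongLineCount++;
- wrongCharCount+=standard.replaceAll("\\s+", "").length();
- }
- }
- } catch (IOException ex) {
- LOGGER.error("Segmentation evaluation failed:", ex);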
- }
- int totalLineCount = perfectLineCount+wrongLineCount;
- int totalCharCount = perfectCharCount+wrongCharCount;
- EvaluationResult er = new EvaluationResult();
- er.setPerfectCharCount(perfectCharCount);
- er.setPerfectLineCount(perfectLineCount);
- er.setTotalCharCount(totalCharCount);
- er.setTotalLineCount(totalLineCount);
- er.setWrongCharCount(wrongCharCount);
- er.setWrongLineCount(wrongLineCount);
- return er;
- }
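- /**
- * Evaluation result for Chinese word segmentation
- * @author 杨尚川
- */
- public class EvaluationResult implements Comparable{
- // The next two fields are used by toString() below, but their declarations are not shown in this excerpt;
- // the types given here are inferred from that usage
- private SegmentationAlgorithm segmentationAlgorithm;
- private float segSpeed;
- private int totalLineCount;
- private int perfectLineCount;
- private int wrongLineCount;
- private int totalCharCount;
- private int perfectCharCount;
- private int wrongCharCount;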
- public float getLinePerfectRate(){
- return perfectLineCount/(float)totalLineCount*100;
- }
- public float getLineWrongRate(){
- return wrongLineCount/(float)totalLineCount*100;
- }
- public float getCharPerfectRate(){
- return perfectCharCount/(float)totalCharCount*100;
- }
- public float getCharWrongRate(){
- return wrongCharCount/(float)totalCharCount*100;
- }
- public int getTotalLineCount() {
- return totalLineCount;
- }
- public void setTotalLineCount(int totalLineCount) {
- this.totalLineCount = totalLineCount;
- }
- public int getPerfectLineCount() {
- return perfectLineCount;
- }
- public void setPerfectLineCount(int perfectLineCount) {
- this.perfectLineCount = perfectLineCount;
- }
- public int getWrongLineCount() {
- return wrongLineCount;
- }
- public void setWrongLineCount(int wrongLineCount) {
- this.wrongLineCount = wrongLineCount;
- }
- public int getTotalCharCount() {
- return totalCharCount;
- }
- public void setTotalCharCount(int totalCharCount) {
- this.totalCharCount = totalCharCount;
- }
- public int getPerfectCharCount() {
- return perfectCharCount;
- }
- public void setPerfectCharCount(int perfectCharCount) {
- this.perfectCharCount = perfectCharCount;
- }
- public int getWrongCharCount() {
- return wrongCharCount;
- }
- public void setWrongCharCount(int wrongCharCount) {
- this.wrongCharCount = wrongCharCount;
- }
- @Override
- public String toString(){
- return segmentationAlgorithm.name()+"("+segmentationAlgorithm.getDes()+"):"
- +"\n"
- +"Segmentation speed: "+segSpeed+" chars/ms"
- +"\n"
- +"Perfect line rate: "+getLinePerfectRate()+"%"
- +" Wrong line rate: "+getLineWrongRate()+"%"
- +" Total lines: "+totalLineCount
- +" Perfect lines: "+perfectLineCount
- +" Wrong lines: "+wrongLineCount
- +"\n"
- +"Perfect char rate: "+getCharPerfectRate()+"%"
- +" Wrong char rate: "+getCharWrongRate()+"%"
- +" Total chars: "+totalCharCount
- +" Perfect chars: "+perfectCharCount
- +" Wrong chars: "+wrongCharCount;
- }
- @Override
- public int compareTo(Object o) {
- EvaluationResult other = (EvaluationResult)o;
- if(other.getLinePerfectRate() - getLinePerfectRate() > 0){
- return 1;
- }
- if(other.getLinePerfectRate() - getLinePerfectRate() < 0){
- return -1;
- }
- return 0;
- }
- }
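The strict line-level criterion can be seen on a tiny in-memory example. The sketch below is an illustration only: the class name, the `evaluate` helper, and the sample lines are hypothetical, and it works on lists rather than files. It mirrors the counting rule of the evaluation method above: a single wrong boundary makes the whole line wrong, character counts always come from the standard line, and lines whose result is empty are skipped entirely.

```java
import java.util.Arrays;
import java.util.List;

public class StrictLineEvaluationSketch {
    /** Returns {perfectLines, wrongLines, perfectChars, wrongChars}. */
    static int[] evaluate(List<String> results, List<String> standards) {
        int[] counts = new int[4];
        for (int i = 0; i < results.size() && i < standards.size(); i++) {
            String result = results.get(i).trim();
            String standard = standards.get(i).trim();
            if (result.isEmpty()) {
                continue; // skipped: the line counts toward neither total
            }
            // Characters are counted on the standard line with whitespace removed.
            int chars = standard.replaceAll("\\s+", "").length();
            if (result.equals(standard)) {
                counts[0]++;
                counts[2] += chars;
            } else {
                counts[1]++;
                counts[3] += chars;
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> standards = Arrays.asList("the table down there", "a b c", "x y");
        List<String> results = Arrays.asList("the table down there", "ab c", "");
        int[] c = evaluate(results, standards);
        int totalLines = c[0] + c[1];
        int totalChars = c[2] + c[3];
        System.out.println("perfect lines " + c[0] + " of " + totalLines); // 1 of 2
        System.out.println("perfect chars " + c[2] + " of " + totalChars); // 17 of 20
    }
}
```

Because empty result lines are skipped, a segmenter that emits nothing for some input lines ends up with a smaller total. That may be why the total line counts reported in the results differ slightly between segmenters (for example 2533709 versus 2533688), although the original post does not say so.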
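Evaluation results for word with trigram:
- BidirectionalMaximumMinimumMatching(bidirectional maximum-minimum matching):
- Segmentation speed: 265.62566 chars/ms
- Perfect line rate: 55.352688% Wrong line rate: 44.647312% Total lines: 2533709 Perfect lines: 1402476 Wrong lines: 1131233
- Perfect char rate: 46.23227% Wrong char rate: 53.76773% Total chars: 28374490 Perfect chars: 13118171 Wrong chars: 15256319
- BidirectionalMaximumMatching(bidirectional maximum matching):
- Segmentation speed: 335.62155 chars/ms
- Perfect line rate: 50.16934% Wrong line rate: 49.83066% Total lines: 2533709 Perfect lines: 1271145 Wrong lines: 1262564
- Perfect char rate: 40.692997% Wrong char rate: 59.307003% Total chars: 28374490 Perfect chars: 11546430 Wrong chars: 16828060
- ReverseMaximumMatching(reverse maximum matching):
- Segmentation speed: 686.71045 chars/ms
- Perfect line rate: 46.723125% Wrong line rate: 53.27688% Total lines: 2533709 Perfect lines: 1183828 Wrong lines: 1349881
- Perfect char rate: 36.67598% Wrong char rate: 63.32402% Total chars: 28374490 Perfect chars: 10406622 Wrong chars: 17967868
- MaximumMatching(forward maximum matching):
- Segmentation speed: 733.9535 chars/ms
- Perfect line rate: 46.661713% Wrong line rate: 53.338287% Total lines: 2533709 Perfect lines: 1182272 Wrong lines: 1351437
- Perfect char rate: 36.72861% Wrong char rate: 63.271393% Total chars: 28374490 Perfect chars: 10421556 Wrong chars: 17952934
- BidirectionalMinimumMatching(bidirectional minimum matching):
- Segmentation speed: 432.87375 chars/ms
- Perfect line rate: 45.863907% Wrong line rate: 54.136093% Total lines: 2533709 Perfect lines: 1162058 Wrong lines: 1371651
- Perfect char rate: 35.942123% Wrong char rate: 64.05788% Total chars: 28374490 Perfect chars: 10198395 Wrong chars: 18176095
- ReverseMinimumMatching(reverse minimum matching):
- Segmentation speed: 1033.58636 chars/ms
- Perfect line rate: 41.776066% Wrong line rate: 58.223934% Total lines: 2533709 Perfect lines: 1058484 Wrong lines: 1475225
- Perfect char rate: 31.678978% Wrong char rate: 68.32102% Total chars: 28374490 Perfect chars: 8988748 Wrong chars: 19385742
- MinimumMatching(forward minimum matching):
- Segmentation speed: 1175.4431 chars/ms
- Perfect line rate: 36.853836% Wrong line rate: 63.146164% Total lines: 2533709 Perfect lines: 933769 Wrong lines: 1599940
- Perfect char rate: 26.859812% Wrong char rate: 73.14019% Total chars: 28374490 Perfect chars: 7621334 Wrong chars: 20753156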
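Evaluation results for word with bigram:
- BidirectionalMaximumMinimumMatching(bidirectional maximum-minimum matching):
- Segmentation speed: 233.49121 chars/ms
- Perfect line rate: 55.31531% Wrong line rate: 44.68469% Total lines: 2533709 Perfect lines: 1401529 Wrong lines: 1132180
- Perfect char rate: 45.834396% Wrong char rate: 54.165604% Total chars: 28374490 Perfect chars: 13005277 Wrong chars: 15369213
- BidirectionalMaximumMatching(bidirectional maximum matching):
- Segmentation speed: 303.59401 chars/ms
- Perfect line rate: 52.007233% Wrong line rate: 47.992767% Total lines: 2533709 Perfect lines: 1317712 Wrong lines: 1215997
- Perfect char rate: 42.424194% Wrong char rate: 57.575806% Total chars: 28374490 Perfect chars: 12037649 Wrong chars: 16336841
- BidirectionalMinimumMatching(bidirectional minimum matching):
- Segmentation speed: 349.67215 chars/ms
- Perfect line rate: 46.766422% Wrong line rate: 53.23358% Total lines: 2533709 Perfect lines: 1184925 Wrong lines: 1348784
- Perfect char rate: 36.52718% Wrong char rate: 63.47282% Total chars: 28374490 Perfect chars: 10364401 Wrong chars: 18010089
- ReverseMaximumMatching(reverse maximum matching):
- Segmentation speed: 598.04272 chars/ms
- Perfect line rate: 46.723125% Wrong line rate: 53.27688% Total lines: 2533709 Perfect lines: 1183828 Wrong lines: 1349881
- Perfect char rate: 36.67598% Wrong char rate: 63.32402% Total chars: 28374490 Perfect chars: 10406622 Wrong chars: 17967868
- MaximumMatching(forward maximum matching):
- Segmentation speed: 676.7993 chars/ms
- Perfect line rate: 46.661713% Wrong line rate: 53.338287% Total lines: 2533709 Perfect lines: 1182272 Wrong lines: 1351437
- Perfect char rate: 36.72861% Wrong char rate: 63.271393% Total chars: 28374490 Perfect chars: 10421556 Wrong chars: 17952934
- ReverseMinimumMatching(reverse minimum matching):
- Segmentation speed: 806.9586 chars/ms
- Perfect line rate: 41.776066% Wrong line rate: 58.223934% Total lines: 2533709 Perfect lines: 1058484 Wrong lines: 1475225
- Perfect char rate: 31.678978% Wrong char rate: 68.32102% Total chars: 28374490 Perfect chars: 8988748 Wrong chars: 19385742
- MinimumMatching(forward minimum matching):
- Segmentation speed: 1020.9208 chars/ms
- Perfect line rate: 36.853836% Wrong line rate: 63.146164% Total lines: 2533709 Perfect lines: 933769 Wrong lines: 1599940
- Perfect char rate: 26.859812% Wrong char rate: 73.14019% Total chars: 28374490 Perfect chars: 7621334 Wrong chars: 20753156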
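Evaluation results for Ansj 0.9:
- Ansj ToAnalysis (precise segmentation):
- Segmentation speed: 495.9188 chars/ms
- Perfect line rate: 58.609295% Wrong line rate: 41.390705% Total lines: 2533709 Perfect lines: 1484989 Wrong lines: 1048720
- Perfect char rate: 50.97614% Wrong char rate: 49.023857% Total chars: 28374490 Perfect chars: 14464220 Wrong chars: 13910270
- Ansj NlpAnalysis (NLP segmentation):
- Segmentation speed: 350.7527 chars/ms
- Perfect line rate: 58.60353% Wrong line rate: 41.396465% Total lines: 2533709 Perfect lines: 1484843 Wrong lines: 1048866
- Perfect char rate: 50.75546% Wrong char rate: 49.244545% Total chars: 28374490 Perfect chars: 14401602 Wrong chars: 13972888
- Ansj BaseAnalysis (basic segmentation):
- Segmentation speed: 532.65424 chars/ms
- Perfect line rate: 54.028584% Wrong line rate: 45.97142% Total lines: 2533709 Perfect lines: 1368927 Wrong lines: 1164782
- Perfect char rate: 46.84512% Wrong char rate: 53.15488% Total chars: 28374490 Perfect chars: 13292064 Wrong chars: 15082426
- Ansj IndexAnalysis (index-oriented segmentation):
- Segmentation speed: 564.6103 chars/ms
- Perfect line rate: 53.510803% Wrong line rate: 46.489197% Total lines: 2533709 Perfect lines: 1355808 Wrong lines: 1177901
- Perfect char rate: 46.355087% Wrong char rate: 53.644913% Total chars: 28374490 Perfect chars: 13153019 Wrong chars: 15221471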
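Evaluation results for Ansj 1.4:
- Ansj ToAnalysis (precise segmentation):
- Segmentation speed: 581.7306 chars/ms
- Perfect line rate: 58.60302% Wrong line rate: 41.39698% Total lines: 2533709 Perfect lines: 1484830 Wrong lines: 1048879
- Perfect char rate: 50.968987% Wrong char rate: 49.031013% Total chars: 28374490 Perfect chars: 14462190 Wrong chars: 13912300
- Ansj NlpAnalysis (NLP segmentation):
- Segmentation speed: 138.81165 chars/ms
- Perfect line rate: 58.1515% Wrong line rate: 41.8485% Total lines: 2533687 Perfect lines: 1473377 Wrong lines: 1060310
- Perfect char rate: 49.806484% Wrong char rate: 50.19352% Total chars: 28374398 Perfect chars: 14132290 Wrong chars: 14242108
- Ansj BaseAnalysis (basic segmentation):
- Segmentation speed: 627.68475 chars/ms
- Perfect line rate: 55.3174% Wrong line rate: 44.6826% Total lines: 2533709 Perfect lines: 1401582 Wrong lines: 1132127
- Perfect char rate: 48.177986% Wrong char rate: 51.822014% Total chars: 28374490 Perfect chars: 13670258 Wrong chars: 14704232
- Ansj IndexAnalysis (index-oriented segmentation):
- Segmentation speed: 715.55176 chars/ms
- Perfect line rate: 50.89444% Wrong line rate: 49.10556% Total lines: 2533709 Perfect lines: 1289517 Wrong lines: 1244192
- Perfect char rate: 42.965115% Wrong char rate: 57.034885% Total chars: 28374490 Perfect chars: 12191132 Wrong chars: 16183358
The Ansj evaluation program is as follows: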
- import java.io.BufferedReader;
- import java.io.BufferedWriter;
- import java.io.FileInputStream;
- import java.io.FileOutputStream;
- import java.io.IOException;
- import java.io.InputStreamReader;
- import java.io.OutputStreamWriter;
- import java.nio.file.Files;
- import java.nio.file.Paths;
- import java.util.ArrayList;
- import java.util.Collections;
- import java.util.List;
- import org.ansj.domain.Term;
- import org.ansj.splitWord.analysis.BaseAnalysis;
- import org.ansj.splitWord.analysis.IndexAnalysis;
- import org.ansj.splitWord.analysis.NlpAnalysis;
- import org.ansj.splitWord.analysis.ToAnalysis;
- /**
- * Segmentation quality evaluation for the Ansj segmenter
- * @author 杨尚川
- */
- public class AnsjEvaluation {
- public static void main(String[] args) throws Exception{
- // The test file d:/test-text.txt and the standard segmentation file d:/standard-text.txt can be downloaded from:
- // http://pan.baidu.com/s/1hqihzjY
- List<EvaluationResult> list = new ArrayList<>();
- // Segment the text
- float rate = seg("d:/test-text.txt", "d:/result-text-BaseAnalysis.txt", "BaseAnalysis");
- // Evaluate the segmentation result
- EvaluationResult result = evaluation("d:/result-text-BaseAnalysis.txt", "d:/standard-text.txt");
- result.setAnalyzer("Ansj BaseAnalysis (basic segmentation)");
- result.setSegSpeed(rate);
- list.add(result);
- // Segment the text
- rate = seg("d:/test-text.txt", "d:/result-text-ToAnalysis.txt", "ToAnalysis");
- // Evaluate the segmentation result
- result = evaluation("d:/result-text-ToAnalysis.txt", "d:/standard-text.txt");
- result.setAnalyzer("Ansj ToAnalysis (precise segmentation)");
- result.setSegSpeed(rate);
- list.add(result);
- // Segment the text
- rate = seg("d:/test-text.txt", "d:/result-text-NlpAnalysis.txt", "NlpAnalysis");
- // Evaluate the segmentation result
- result = evaluation("d:/result-text-NlpAnalysis.txt", "d:/standard-text.txt");
- result.setAnalyzer("Ansj NlpAnalysis (NLP segmentation)");
- result.setSegSpeed(rate);
- list.add(result);
- // Segment the text
- rate = seg("d:/test-text.txt", "d:/result-text-IndexAnalysis.txt", "IndexAnalysis");
- // Evaluate the segmentation result
- result = evaluation("d:/result-text-IndexAnalysis.txt", "d:/standard-text.txt");
- result.setAnalyzer("Ansj IndexAnalysis (index-oriented segmentation)");
- result.setSegSpeed(rate);
- list.add(result);
- // Print the evaluation results
- Collections.sort(list);
- System.out.println("");
- for(EvaluationResult r : list){
- System.out.println(r+"\n");
- }
- }
- private static float seg(final String input, final String output, final String type) throws Exception{
- float rate = 0;
- try(BufferedReader reader = new BufferedReader(new InputStreamReader(new FileInputStream(input),"utf-8"));
- BufferedWriter writer = new BufferedWriter(new OutputStreamWriter(new FileOutputStream(output),"utf-8"))){
- long size = Files.size(Paths.get(input));
- System.out.println("size:"+size);
- System.out.println("File size: "+(float)size/1024/1024+" MB");
- int textLength=0;
- int progress=0;
- long start = System.currentTimeMillis();
- String line = null;
- while((line = reader.readLine()) != null){
- if("".equals(line.trim())){
- writer.write("\n");
- continue;
- }
- textLength += line.length();
- switch(type){
- case "BaseAnalysis":
- for(Term term : BaseAnalysis.parse(line)){
- writer.write(term.getName()+" ");
- }
- break;
- case "ToAnalysis":
- for(Term term : ToAnalysis.parse(line)){
- writer.write(term.getName()+" ");
- }
- break;
- case "NlpAnalysis":
- try{
- for(Term term : NlpAnalysis.parse(line)){
- writer.write(term.getName()+" ");
- }
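- }catch(Exception e){
- // Exceptions thrown by NlpAnalysis are deliberately ignored; the line keeps whatever terms were written before the failure
- }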
- break;
- case "IndexAnalysis":
- for(Term term : IndexAnalysis.parse(line)){
- writer.write(term.getName()+" ");
- }
- break;
- }
- writer.write("\n");
- progress += line.length();
- if( progress > 500000){
- progress = 0;
- // Rough progress estimate: assumes about 2.99 bytes per character in the UTF-8 input file
- System.out.println("Progress: "+(int)(textLength*2.99/size*100)+"%");
- }
- }
- long cost = System.currentTimeMillis() - start;
- rate = textLength/(float)cost;
- System.out.println("Character count: "+textLength);
- System.out.println("Time spent: "+cost+" ms");
- System.out.println("Segmentation speed: "+rate+" chars/ms");
- }
- return rate;
- }
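- /**
- * Segmentation quality evaluation
- * @param resultText path to the file with the actual segmentation result
- * @param standardText path to the file with the standard segmentation result
- * @return the evaluation result
- */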
- private static EvaluationResult evaluation(String resultText, String standardText) {
- int perfectLineCount=0;
- int wrongLineCount=0;
- int perfectCharCount=0;
- int wrongCharCount=0;
- try(BufferedReader resultReader = new BufferedReader(new InputStreamReader(new FileInputStream(resultText),"utf-8"));
- BufferedReader standardReader = new BufferedReader(new InputStreamReader(new FileInputStream(standardText),"utf-8"))){
- String result;
- while( (result = resultReader.readLine()) != null ){
- result = result.trim();
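- String standardLine = standardReader.readLine();
- if(standardLine == null){
- //the standard file has fewer lines than the result file: stop comparing
- break;
- }
- String standard = standardLine.trim();
- if(result.equals("")){
- continue;
- }
- if(result.equals(standard)){
- //the segmentation result is identical to the standard
- perfectLineCount++;
- perfectCharCount+=standard.replaceAll("\\s+", "").length();
- }else{
- //the segmentation result differs from the standard
- wrongLineCount++;
- wrongCharCount+=standard.replaceAll("\\s+", "").length();
- }
- }
- } catch (IOException ex) {
- System.err.println("Segmentation evaluation failed: " + ex.getMessage());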
- }
- int totalLineCount = perfectLineCount+wrongLineCount;
- int totalCharCount = perfectCharCount+wrongCharCount;
- EvaluationResult er = new EvaluationResult();
- er.setPerfectCharCount(perfectCharCount);
- er.setPerfectLineCount(perfectLineCount);
- er.setTotalCharCount(totalCharCount);
- er.setTotalLineCount(totalLineCount);
- er.setWrongCharCount(wrongCharCount);
- er.setWrongLineCount(wrongLineCount);
- return er;
- }
- /**
- * Evaluation result
- */
- private static class EvaluationResult implements Comparable{
- private String analyzer;
- private float segSpeed;
- private int totalLineCount;
- private int perfectLineCount;
- private int wrongLineCount;
- private int totalCharCount;
- private int perfectCharCount;
- private int wrongCharCount;
- public String getAnalyzer() {
- return analyzer;
- }
- public void setAnalyzer(String analyzer) {
- this.analyzer = analyzer;
- }
- public float getSegSpeed() {
- return segSpeed;
- }
- public void setSegSpeed(float segSpeed) {
- this.segSpeed = segSpeed;
- }
- public float getLinePerfectRate(){
- return perfectLineCount/(float)totalLineCount*100;
- }
- public float getLineWrongRate(){
- return wrongLineCount/(float)totalLineCount*100;
- }
- public float getCharPerfectRate(){
- return perfectCharCount/(float)totalCharCount*100;
- }
- public float getCharWrongRate(){
- return wrongCharCount/(float)totalCharCount*100;
- }
- public int getTotalLineCount() {
- return totalLineCount;
- }
- public void setTotalLineCount(int totalLineCount) {
- this.totalLineCount = totalLineCount;
- }
- public int getPerfectLineCount() {
- return perfectLineCount;
- }
- public void setPerfectLineCount(int perfectLineCount) {
- this.perfectLineCount = perfectLineCount;
- }
- public int getWrongLineCount() {
- return wrongLineCount;
- }
- public void setWrongLineCount(int wrongLineCount) {
- this.wrongLineCount = wrongLineCount;
- }
- public int getTotalCharCount() {
- return totalCharCount;
- }
- public void setTotalCharCount(int totalCharCount) {
- this.totalCharCount = totalCharCount;
- }
- public int getPerfectCharCount() {
- return perfectCharCount;
- }
- public void setPerfectCharCount(int perfectCharCount) {
- this.perfectCharCount = perfectCharCount;
- }
- public int getWrongCharCount() {
- return wrongCharCount;
- }
- public void setWrongCharCount(int wrongCharCount) {
- this.wrongCharCount = wrongCharCount;
- }
- @Override
- public String toString(){
- return analyzer+":"
- +"\n"
- +"Segmentation speed: "+segSpeed+" chars/ms"
- +"\n"
- +"Perfect line rate: "+getLinePerfectRate()+"%"
- +" Wrong line rate: "+getLineWrongRate()+"%"
- +" Total lines: "+totalLineCount
- +" Perfect lines: "+perfectLineCount
- +" Wrong lines: "+wrongLineCount
- +"\n"
- +"Perfect char rate: "+getCharPerfectRate()+"%"
- +" Wrong char rate: "+getCharWrongRate()+"%"
- +" Total chars: "+totalCharCount
- +" Perfect chars: "+perfectCharCount
- +" Wrong chars: "+wrongCharCount;
- }
- @Override
- public int compareTo(Object o) {
- EvaluationResult other = (EvaluationResult)o;
- if(other.getLinePerfectRate() - getLinePerfectRate() > 0){
- return 1;
- }
- if(other.getLinePerfectRate() - getLinePerfectRate() < 0){
- return -1;
- }
- return 0;
- }
- }
- }
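The program above orders its output with `Collections.sort(list)` and a raw `Comparable` whose `compareTo` puts the highest perfect line rate first. As a hedged alternative, the same ordering can be expressed with a typed `Comparator`. In the sketch below the `RankingSketch` and `Score` names are hypothetical; the counts are the Ansj 1.4 perfect-line and total-line figures reported earlier.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class RankingSketch {
    // Minimal stand-in for EvaluationResult: only what the ranking needs.
    static final class Score {
        final String analyzer;
        final int perfectLines;
        final int totalLines;

        Score(String analyzer, int perfectLines, int totalLines) {
            this.analyzer = analyzer;
            this.perfectLines = perfectLines;
            this.totalLines = totalLines;
        }

        float linePerfectRate() {
            return perfectLines / (float) totalLines * 100;
        }
    }

    // Sort by perfect line rate, highest first, and return the analyzer names.
    static List<String> rank(List<Score> scores) {
        List<Score> sorted = new ArrayList<>(scores);
        sorted.sort(Comparator.comparingDouble((Score s) -> s.linePerfectRate()).reversed());
        List<String> names = new ArrayList<>();
        for (Score s : sorted) {
            names.add(s.analyzer);
        }
        return names;
    }

    // Ansj 1.4 counts taken from the results listed in this post.
    static List<String> demo() {
        List<Score> scores = new ArrayList<>();
        scores.add(new Score("IndexAnalysis", 1289517, 2533709));
        scores.add(new Score("NlpAnalysis", 1473377, 2533687));
        scores.add(new Score("BaseAnalysis", 1401582, 2533709));
        scores.add(new Score("ToAnalysis", 1484830, 2533709));
        return rank(scores);
    }

    public static void main(String[] args) {
        System.out.println(demo()); // [ToAnalysis, NlpAnalysis, BaseAnalysis, IndexAnalysis]
    }
}
```

The typed comparator avoids the cast from `Object` and keeps the sort key visible at the call site.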
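Evaluation results for MMSeg4j 1.9.1:
- MMSeg4j ComplexSeg:
- Segmentation speed: 794.24805 chars/ms
- Perfect line rate: 38.817604% Wrong line rate: 61.182396% Total lines: 2533688 Perfect lines: 983517 Wrong lines: 1550171
- Perfect char rate: 29.604435% Wrong char rate: 70.39557% Total chars: 28374428 Perfect chars: 8400089 Wrong chars: 19974339
- MMSeg4j SimpleSeg:
- Segmentation speed: 1026.1058 chars/ms
- Perfect line rate: 37.570095% Wrong line rate: 62.429905% Total lines: 2533688 Perfect lines: 951909 Wrong lines: 1581779
- Perfect char rate: 28.455273% Wrong char rate: 71.54473% Total chars: 28374428 Perfect chars: 8074021 Wrong chars: 20300407
- MMSeg4j MaxWordSeg:
- Segmentation speed: 813.0676 chars/ms
- Perfect line rate: 34.27573% Wrong line rate: 65.72427% Total lines: 2533688 Perfect lines: 868440 Wrong lines: 1665248
- Perfect char rate: 25.20896% Wrong char rate: 74.79104% Total chars: 28374428 Perfect chars: 7152898 Wrong chars: 21221530
The MMSeg4j 1.9.1 evaluation program is as follows: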
- import com.chenlb.mmseg4j.ComplexSeg;
- import com.chenlb.mmseg4j.Dictionary;
- import com.chenlb.mmseg4j.MMSeg;
- import com.chenlb.mmseg4j.MaxWordSeg;
- import com.chenlb.mmseg4j.Seg;
- import com.chenlb.mmseg4j.SimpleSeg;
- import com.chenlb.mmseg4j.Word;
- import java.io.BufferedReader;
- import java.io.BufferedWriter;
- import java.io.FileInputStream;
- import java.io.FileOutputStream;
- import java.io.IOException;
- import java.io.InputStreamReader;
- import java.io.OutputStreamWriter;
- import java.io.StringReader;
- import java.nio.file.Files;
- import java.nio.file.Paths;
- import java.util.ArrayList;
- import java.util.Collections;
- import java.util.List;
- /**
- * Segmentation quality evaluation for the MMSeg4j segmenter
- * @author 杨尚川
- */
- public class MMSeg4jEvaluation {
- public static void main(String[] args) throws Exception{
- // The test file d:/test-text.txt and the standard segmentation file d:/standard-text.txt can be downloaded from:
- // http://pan.baidu.com/s/1hqihzjY
- List<EvaluationResult> list = new ArrayList<>();
- Dictionary dic = Dictionary.getInstance();
- // Segment the text
- float rate = seg("d:/test-text.txt", "d:/result-text-ComplexSeg.txt", new ComplexSeg(dic));
- // Evaluate the segmentation result
- EvaluationResult result = evaluation("d:/result-text-ComplexSeg.txt", "d:/standard-text.txt");
- result.setAnalyzer("MMSeg4j ComplexSeg");
- result.setSegSpeed(rate);
- list.add(result);
- // Segment the text
- rate = seg("d:/test-text.txt", "d:/result-text-SimpleSeg.txt", new SimpleSeg(dic));
- // Evaluate the segmentation result
- result = evaluation("d:/result-text-SimpleSeg.txt", "d:/standard-text.txt");
- result.setAnalyzer("MMSeg4j SimpleSeg");
- result.setSegSpeed(rate);
- list.add(result);
- // Segment the text
- rate = seg("d:/test-text.txt", "d:/result-text-MaxWordSeg.txt", new MaxWordSeg(dic));
- // Evaluate the segmentation result
- result = evaluation("d:/result-text-MaxWordSeg.txt", "d:/standard-text.txt");
- result.setAnalyzer("MMSeg4j MaxWordSeg");
- result.setSegSpeed(rate);
- list.add(result);
- // Print the evaluation results
- Collections.sort(list);
- System.out.println("");
- for(EvaluationResult r : list){
- System.out.println(r+"\n");
- }
- }
- private static float seg(final String input, final String output, final Seg seg) throws Exception{
- float rate = 0;
- try(BufferedReader reader = new BufferedReader(new InputStreamReader(new FileInputStream(input),"utf-8"));
- BufferedWriter writer = new BufferedWriter(new OutputStreamWriter(new FileOutputStream(output),"utf-8"))){
- long size = Files.size(Paths.get(input));
- System.out.println("size:"+size);
- System.out.println("File size: "+(float)size/1024/1024+" MB");
- int textLength=0;
- int progress=0;
- long start = System.currentTimeMillis();
- String line = null;
- while((line = reader.readLine()) != null){
- if("".equals(line.trim())){
- writer.write("\n");
- continue;
- }
- textLength += line.length();
- writer.write(seg(line, seg));
- writer.write("\n");
- progress += line.length();
- if( progress > 500000){
- progress = 0;
- // Rough progress estimate: assumes about 2.99 bytes per character in the UTF-8 input file
- System.out.println("Progress: "+(int)(textLength*2.99/size*100)+"%");
- }
- }
- long cost = System.currentTimeMillis() - start;
- rate = textLength/(float)cost;
- System.out.println("Character count: "+textLength);
- System.out.println("Time spent: "+cost+" ms");
- System.out.println("Segmentation speed: "+rate+" chars/ms");
- }
- return rate;
- }
- private static String seg(String text, Seg seg) throws IOException {
- StringBuilder result = new StringBuilder();
- MMSeg mmSeg = new MMSeg(new StringReader(text), seg);
- Word word = null;
- while((word=mmSeg.next())!=null) {
- result.append(word.getString()).append(" ");
- }
- return result.toString().trim();
- }
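- /**
- * Segmentation quality evaluation
- * @param resultText path to the file with the actual segmentation result
- * @param standardText path to the file with the standard segmentation result
- * @return the evaluation result
- */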
- private static EvaluationResult evaluation(String resultText, String standardText) {
- int perfectLineCount=0;
- int wrongLineCount=0;
- int perfectCharCount=0;
- int wrongCharCount=0;
- try(BufferedReader resultReader = new BufferedReader(new InputStreamReader(new FileInputStream(resultText),"utf-8"));
- BufferedReader standardReader = new BufferedReader(new InputStreamReader(new FileInputStream(standardText),"utf-8"))){
- String result;
- while( (result = resultReader.readLine()) != null ){
- result = result.trim();
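- String standardLine = standardReader.readLine();
- if(standardLine == null){
- //the standard file has fewer lines than the result file: stop comparing
- break;
- }
- String standard = standardLine.trim();
- if(result.equals("")){
- continue;
- }
- if(result.equals(standard)){
- //the segmentation result is identical to the standard
- perfectLineCount++;
- perfectCharCount+=standard.replaceAll("\\s+", "").length();
- }else{
- //the segmentation result differs from the standard
- wrongLineCount++;
- wrongCharCount+=standard.replaceAll("\\s+", "").length();
- }
- }
- } catch (IOException ex) {
- System.err.println("Segmentation evaluation failed: " + ex.getMessage());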
- }
- int totalLineCount = perfectLineCount+wrongLineCount;
- int totalCharCount = perfectCharCount+wrongCharCount;
- EvaluationResult er = new EvaluationResult();
- er.setPerfectCharCount(perfectCharCount);
- er.setPerfectLineCount(perfectLineCount);
- er.setTotalCharCount(totalCharCount);
- er.setTotalLineCount(totalLineCount);
- er.setWrongCharCount(wrongCharCount);
- er.setWrongLineCount(wrongLineCount);
- return er;
- }
- /**
- * Evaluation result
- */
- private static class EvaluationResult implements Comparable{
- private String analyzer;
- private float segSpeed;
- private int totalLineCount;
- private int perfectLineCount;
- private int wrongLineCount;
- private int totalCharCount;
- private int perfectCharCount;
- private int wrongCharCount;
- public String getAnalyzer() {
- return analyzer;
- }
- public void setAnalyzer(String analyzer) {
- this.analyzer = analyzer;
- }
- public float getSegSpeed() {
- return segSpeed;
- }
- public void setSegSpeed(float segSpeed) {
- this.segSpeed = segSpeed;
- }
- public float getLinePerfectRate(){
- return perfectLineCount/(float)totalLineCount*100;
- }
- public float getLineWrongRate(){
- return wrongLineCount/(float)totalLineCount*100;
- }
- public float getCharPerfectRate(){
- return perfectCharCount/(float)totalCharCount*100;
- }
- public float getCharWrongRate(){
- return wrongCharCount/(float)totalCharCount*100;
- }
- public int getTotalLineCount() {
- return totalLineCount;
- }
- public void setTotalLineCount(int totalLineCount) {
- this.totalLineCount = totalLineCount;
- }
- public int getPerfectLineCount() {
- return perfectLineCount;
- }
- public void setPerfectLineCount(int perfectLineCount) {
- this.perfectLineCount = perfectLineCount;
- }
- public int getWrongLineCount() {
- return wrongLineCount;
- }
- public void setWrongLineCount(int wrongLineCount) {
- this.wrongLineCount = wrongLineCount;
- }
- public int getTotalCharCount() {
- return totalCharCount;
- }
- public void setTotalCharCount(int totalCharCount) {
- this.totalCharCount = totalCharCount;
- }
- public int getPerfectCharCount() {
- return perfectCharCount;
- }
- public void setPerfectCharCount(int perfectCharCount) {
- this.perfectCharCount = perfectCharCount;
- }
- public int getWrongCharCount() {
- return wrongCharCount;
- }
- public void setWrongCharCount(int wrongCharCount) {
- this.wrongCharCount = wrongCharCount;
- }
- @Override
- public String toString(){
- return analyzer+":"
- +"\n"
- +"Segmentation speed: "+segSpeed+" chars/ms"
- +"\n"
- +"Perfect line rate: "+getLinePerfectRate()+"%"
- +" Wrong line rate: "+getLineWrongRate()+"%"
- +" Total lines: "+totalLineCount
- +" Perfect lines: "+perfectLineCount
- +" Wrong lines: "+wrongLineCount
- +"\n"
- +"Perfect char rate: "+getCharPerfectRate()+"%"
- +" Wrong char rate: "+getCharWrongRate()+"%"
- +" Total chars: "+totalCharCount
- +" Perfect chars: "+perfectCharCount
- +" Wrong chars: "+wrongCharCount;
- }
- @Override
- public int compareTo(Object o) {
- EvaluationResult other = (EvaluationResult)o;
- if(other.getLinePerfectRate() - getLinePerfectRate() > 0){
- return 1;
- }
- if(other.getLinePerfectRate() - getLinePerfectRate() < 0){
- return -1;
- }
- return 0;
- }
- }
- }
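Evaluation results for ik-analyzer 2012_u6:
- IKAnalyzer (smart segmentation):
- Segmentation speed: 178.3516 chars/ms
- Perfect line rate: 37.55943% Wrong line rate: 62.440567% Total lines: 2533686 Perfect lines: 951638 Wrong lines: 1582048
- Perfect char rate: 27.978464% Wrong char rate: 72.02154% Total chars: 28374416 Perfect chars: 7938726 Wrong chars: 20435690
- IKAnalyzer (fine-grained segmentation):
- Segmentation speed: 182.97859 chars/ms
- Perfect line rate: 18.872742% Wrong line rate: 81.12726% Total lines: 2533686 Perfect lines: 478176 Wrong lines: 2055510
- Perfect char rate: 10.936535% Wrong char rate: 89.06347% Total chars: 28374416 Perfect chars: 3103178 Wrong chars: 25271238
The ik-analyzer 2012_u6 evaluation program is as follows: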
- import java.io.BufferedReader;
- import java.io.BufferedWriter;
- import java.io.FileInputStream;
- import java.io.FileOutputStream;
- import java.io.IOException;
- import java.io.InputStreamReader;
- import java.io.OutputStreamWriter;
- import java.io.StringReader;
- import java.nio.file.Files;
- import java.nio.file.Paths;
- import java.util.ArrayList;
- import java.util.Collections;
- import java.util.List;
- import org.wltea.analyzer.core.IKSegmenter;
- import org.wltea.analyzer.core.Lexeme;
- /**
- * Segmentation quality evaluation for the IKAnalyzer segmenter
- * @author 杨尚川
- */
- public class IKAnalyzerEvaluation {
- public static void main(String[] args) throws Exception{
- // The test file d:/test-text.txt and the standard segmentation file d:/standard-text.txt can be downloaded from:
- // http://pan.baidu.com/s/1hqihzjY
- List<EvaluationResult> list = new ArrayList<>();
- // Segment the text
- float rate = seg("d:/test-text.txt", "d:/result-text-ComplexSeg.txt", true);
- // Evaluate the segmentation result
- EvaluationResult result = evaluation("d:/result-text-ComplexSeg.txt", "d:/standard-text.txt");
- result.setAnalyzer("IKAnalyzer (smart segmentation)");
- result.setSegSpeed(rate);
- list.add(result);
- // Segment the text
- rate = seg("d:/test-text.txt", "d:/result-text-SimpleSeg.txt", false);
- // Evaluate the segmentation result
- result = evaluation("d:/result-text-SimpleSeg.txt", "d:/standard-text.txt");
- result.setAnalyzer("IKAnalyzer (fine-grained segmentation)");
- result.setSegSpeed(rate);
- list.add(result);
- // Print the evaluation results
- Collections.sort(list);
- System.out.println("");
- for(EvaluationResult r : list){
- System.out.println(r+"\n");
- }
- }
- private static float seg(final String input, final String output, final boolean useSmart) throws Exception{
- float rate = 0;
- try(BufferedReader reader = new BufferedReader(new InputStreamReader(new FileInputStream(input),"utf-8"));
- BufferedWriter writer = new BufferedWriter(new OutputStreamWriter(new FileOutputStream(output),"utf-8"))){
- long size = Files.size(Paths.get(input));
- System.out.println("size:"+size);
- System.out.println("File size: "+(float)size/1024/1024+" MB");
- int textLength=0;
- int progress=0;
- long start = System.currentTimeMillis();
- String line = null;
- while((line = reader.readLine()) != null){
- if("".equals(line.trim())){
- writer.write("\n");
- continue;
- }
- textLength += line.length();
- writer.write(seg(line, useSmart));
- writer.write("\n");
- progress += line.length();
- if( progress > 500000){
- progress = 0;
- // Rough progress estimate: assumes about 2.99 bytes per character in the UTF-8 input file
- System.out.println("Progress: "+(int)(textLength*2.99/size*100)+"%");
- }
- }
- long cost = System.currentTimeMillis() - start;
- rate = textLength/(float)cost;
- System.out.println("Character count: "+textLength);
- System.out.println("Time spent: "+cost+" ms");
- System.out.println("Segmentation speed: "+rate+" chars/ms");
- }
- return rate;
- }
- private static String seg(String text, boolean useSmart) throws IOException {
- StringBuilder result = new StringBuilder();
- IKSegmenter ik = new IKSegmenter(new StringReader(text), useSmart);
- Lexeme word = null;
- while((word=ik.next())!=null) {
- result.append(word.getLexemeText()).append(" ");
- }
- return result.toString().trim();
- }
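- /**
- * Segmentation quality evaluation
- * @param resultText path to the file with the actual segmentation result
- * @param standardText path to the file with the standard segmentation result
- * @return the evaluation result
- */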
- private static EvaluationResult evaluation(String resultText, String standardText) {
- int perfectLineCount=0;
- int wrongLineCount=0;
- int perfectCharCount=0;
- int wrongCharCount=0;
- try(BufferedReader resultReader = new BufferedReader(new InputStreamReader(new FileInputStream(resultText),"utf-8"));
- BufferedReader standardReader = new BufferedReader(new InputStreamReader(new FileInputStream(standardText),"utf-8"))){
- String result;
- while( (result = resultReader.readLine()) != null ){
- result = result.trim();
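- String standardLine = standardReader.readLine();
- if(standardLine == null){
- //the standard file has fewer lines than the result file: stop comparing
- break;
- }
- String standard = standardLine.trim();
- if(result.equals("")){
- continue;
- }
- if(result.equals(standard)){
- //the segmentation result is identical to the standard
- perfectLineCount++;
- perfectCharCount+=standard.replaceAll("\\s+", "").length();
- }else{
- //the segmentation result differs from the standard
- wrongLineCount++;
- wrongCharCount+=standard.replaceAll("\\s+", "").length();
- }
- }
- } catch (IOException ex) {
- System.err.println("Segmentation evaluation failed: " + ex.getMessage());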
- }
- int totalLineCount = perfectLineCount+wrongLineCount;
- int totalCharCount = perfectCharCount+wrongCharCount;
- EvaluationResult er = new EvaluationResult();
- er.setPerfectCharCount(perfectCharCount);
- er.setPerfectLineCount(perfectLineCount);
- er.setTotalCharCount(totalCharCount);
- er.setTotalLineCount(totalLineCount);
- er.setWrongCharCount(wrongCharCount);
- er.setWrongLineCount(wrongLineCount);
- return er;
- }
- /**
- * Evaluation result
- */
- private static class EvaluationResult implements Comparable{
- private String analyzer;
- private float segSpeed;
- private int totalLineCount;
- private int perfectLineCount;
- private int wrongLineCount;
- private int totalCharCount;
- private int perfectCharCount;
- private int wrongCharCount;
- public String getAnalyzer() {
- return analyzer;
- }
- public void setAnalyzer(String analyzer) {
- this.analyzer = analyzer;
- }
- public float getSegSpeed() {
- return segSpeed;
- }
- public void setSegSpeed(float segSpeed) {
- this.segSpeed = segSpeed;
- }
- public float getLinePerfectRate(){
- return perfectLineCount/(float)totalLineCount*100;
- }
- public float getLineWrongRate(){
- return wrongLineCount/(float)totalLineCount*100;
- }
- public float getCharPerfectRate(){
- return perfectCharCount/(float)totalCharCount*100;
- }
- public float getCharWrongRate(){
- return wrongCharCount/(float)totalCharCount*100;
- }
- public int getTotalLineCount() {
- return totalLineCount;
- }
- public void setTotalLineCount(int totalLineCount) {
- this.totalLineCount = totalLineCount;
- }
- public int getPerfectLineCount() {
- return perfectLineCount;
- }
- public void setPerfectLineCount(int perfectLineCount) {
- this.perfectLineCount = perfectLineCount;
- }
- public int getWrongLineCount() {
- return wrongLineCount;
- }
- public void setWrongLineCount(int wrongLineCount) {
- this.wrongLineCount = wrongLineCount;
- }
- public int getTotalCharCount() {
- return totalCharCount;
- }
- public void setTotalCharCount(int totalCharCount) {
- this.totalCharCount = totalCharCount;
- }
- public int getPerfectCharCount() {
- return perfectCharCount;
- }
- public void setPerfectCharCount(int perfectCharCount) {
- this.perfectCharCount = perfectCharCount;
- }
- public int getWrongCharCount() {
- return wrongCharCount;
- }
- public void setWrongCharCount(int wrongCharCount) {
- this.wrongCharCount = wrongCharCount;
- }
- @Override
- public String toString(){
- return analyzer+":"
- +"\n"
- +"Segmentation speed: "+segSpeed+" chars/ms"
- +"\n"
- +"Perfect line rate: "+getLinePerfectRate()+"%"
- +" Wrong line rate: "+getLineWrongRate()+"%"
- +" Total lines: "+totalLineCount
- +" Perfect lines: "+perfectLineCount
- +" Wrong lines: "+wrongLineCount
- +"\n"
- +"Perfect char rate: "+getCharPerfectRate()+"%"
- +" Wrong char rate: "+getCharWrongRate()+"%"
- +" Total chars: "+totalCharCount
- +" Perfect chars: "+perfectCharCount
- +" Wrong chars: "+wrongCharCount;
- }
- @Override
- public int compareTo(Object o) {
- EvaluationResult other = (EvaluationResult)o;
- if(other.getLinePerfectRate() - getLinePerfectRate() > 0){
- return 1;
- }
- if(other.getLinePerfectRate() - getLinePerfectRate() < 0){
- return -1;
- }
- return 0;
- }
- }
- }
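The evaluation programs for ansj, mmseg4j, and ik-analyzer can be downloaded from the attachment. For word, simply run the evaluation.bat script in the project root directory.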
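The three evaluation programs above repeat the same read, segment, write loop and differ only in how a single line is segmented. As a rough sketch of how that loop could be shared, the harness below takes the segmenter as a `Function<String, String>`. The class and method names are hypothetical, and the character-splitting lambda is a stand-in for illustration, not a real segmenter. In practice the lambda would wrap a call such as the `seg(line, useSmart)` helper shown earlier.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

public class SegmentationHarnessSketch {
    /**
     * Segments the input file line by line, writes one result line per input line,
     * and returns the throughput in characters per millisecond.
     */
    static float segmentFile(Path input, Path output, Function<String, String> segmenter)
            throws IOException {
        long chars = 0;
        long start = System.nanoTime();
        try (BufferedReader reader = Files.newBufferedReader(input, StandardCharsets.UTF_8);
             BufferedWriter writer = Files.newBufferedWriter(output, StandardCharsets.UTF_8)) {
            String line;
            while ((line = reader.readLine()) != null) {
                if (!line.trim().isEmpty()) {
                    chars += line.length();
                    writer.write(segmenter.apply(line).trim());
                }
                // Always emit a newline so result and standard files stay aligned.
                writer.write("\n");
            }
        }
        // Guard against a zero divisor on very small inputs.
        long elapsedMs = Math.max(1, (System.nanoTime() - start) / 1_000_000);
        return chars / (float) elapsedMs;
    }

    // Runs the harness on a small temporary file and returns the result lines.
    static List<String> demo() {
        try {
            Path in = Files.createTempFile("test-text", ".txt");
            Path out = Files.createTempFile("result-text", ".txt");
            Files.write(in, Arrays.asList("ab cd", "", "ef"), StandardCharsets.UTF_8);
            // Stand-in segmenter: treats every character as its own word.
            segmentFile(in, out, line -> String.join(" ", line.replace(" ", "").split("")));
            List<String> lines = Files.readAllLines(out, StandardCharsets.UTF_8);
            Files.delete(in);
            Files.delete(out);
            return lines;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo()); // [a b c d, , e f]
    }
}
```

With the loop factored out, each segmenter needs only a one-line adapter, and the timing and file handling are guaranteed to be identical across the tools being compared.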