yucang52555

ansj_seg source code analysis: user-defined dictionaries

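    Recently I have been using the Chinese word segmenter ansj at work. I had previously integrated it into Elasticsearch (ES), but that setup makes it awkward to study the source, so let's first set up the source code on its own:
    Online demo: http://ansj.sdapp.cn/demo/seg.jsp
    Official site: http://www.ansj.org/
    GitHub: https://github.com/NLPchina/ansj_seg
    Importing the source through Maven is not covered here. The resulting project structure is as follows:
    [Figure: project structure screenshot]
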
    As the structure shows, library.properties is the file that configures the dictionaries. Its initial contents were:
#redress dic file path
ambiguityLibrary=library/ambiguity.dic
#path of userLibrary this is default library
userLibrary=library/default.dic
#set real name
isRealName=true

    After adding one more dictionary file, it becomes:
#redress dic file path
ambiguityLibrary=library/ambiguity.dic
#path of defaultLibrary this is default library
defaultLibrary=library/default.dic
#path of userLibrary this is user library
userLibrary=library/userLibrary.dic
#set real name
isRealName=true

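    As a personal preference, I renamed the original userLibrary key to defaultLibrary. My reasoning is that a user-defined dictionary can hold provisional entries that take part in segmentation right away, and entries that prove useful can later be moved into the default dictionary during maintenance, which gives words an upgrade path.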
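    How such keys override built-in defaults can be sketched in plain Java. This is only an illustration, not ansj's own code: the class `LibraryConfig` and its `load` method are hypothetical, and `java.util.Properties` stands in for the `ResourceBundle` lookup that MyStaticValue performs further down. A key that is missing or misspelled simply leaves the default in place.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

// Hypothetical helper for illustration; not part of ansj_seg.
public class LibraryConfig {
    // Defaults used when a key is absent from the properties text.
    public String defaultLibrary = "library/default.dic";
    public String userLibrary = "library/userLibrary.dic";
    public boolean isRealName = false;

    // Parse properties text and override only the keys that are present.
    public static LibraryConfig load(String propertiesText) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(propertiesText));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        LibraryConfig cfg = new LibraryConfig();
        cfg.defaultLibrary = props.getProperty("defaultLibrary", cfg.defaultLibrary);
        cfg.userLibrary = props.getProperty("userLibrary", cfg.userLibrary);
        cfg.isRealName = Boolean.parseBoolean(
                props.getProperty("isRealName", String.valueOf(cfg.isRealName)));
        return cfg;
    }

    public static void main(String[] args) {
        // defaultLibrary is not set here, so it keeps its built-in value.
        LibraryConfig cfg = load("userLibrary=library/custom.dic\nisRealName=true\n");
        System.out.println(cfg.defaultLibrary + " | " + cfg.userLibrary + " | " + cfg.isRealName);
    }
}
```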
    To read the newly added dictionary into memory, only the following code needs to change, from:
/**
	 * Load the user-defined dictionary and the supplementary dictionary
	 */
	private static void initUserLibrary() {
		// TODO Auto-generated method stub
		try {
			FOREST = new Forest();
			// Load the user-defined dictionary
			String userLibrary = MyStaticValue.userLibrary;
			loadLibrary(FOREST, userLibrary);
		} catch (Exception e) {
			// TODO Auto-generated catch block
			e.printStackTrace();
		}
	}
to:
/**
	 * Load the user-defined dictionary and the supplementary dictionary
	 */
	private static void initUserLibrary() {
		// TODO Auto-generated method stub
		try {
			FOREST = new Forest();
			// Load the default dictionary
			String defaultLibrary = MyStaticValue.defaultLibrary;
			loadLibrary(FOREST, defaultLibrary);
			// Load the user's newly added dictionary
			String userLibrary = MyStaticValue.userLibrary;
			loadLibrary(FOREST, userLibrary);
		} catch (Exception e) {
			// TODO Auto-generated catch block
			e.printStackTrace();
		}
	}
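
    The effect of loading two dictionaries into one shared structure can be sketched without ansj at all. In the sketch below a plain `Map` stands in for the `Forest` trie, and the tab-separated `word, nature, frequency` line layout, as well as the fallback values `userDefine` and `1000`, are assumptions about the dictionary format. `DictionaryMerge` and this `loadLibrary` are hypothetical names for illustration. Entries loaded later overwrite earlier ones here, which may differ from how the real `Forest` (and the `isSkipUserDefine` switch) treats duplicates.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for loading several dictionaries into one structure.
public class DictionaryMerge {
    // Add every non-blank line to the shared dictionary.
    // Assumed line layout: word<TAB>nature<TAB>frequency (nature and frequency optional).
    public static void loadLibrary(Map<String, String[]> forest, List<String> lines) {
        for (String line : lines) {
            if (line == null || line.trim().isEmpty()) {
                continue;
            }
            String[] parts = line.split("\t");
            String nature = parts.length > 1 ? parts[1] : "userDefine";
            String freq = parts.length > 2 ? parts[2] : "1000";
            forest.put(parts[0], new String[] { nature, freq });
        }
    }

    public static void main(String[] args) {
        Map<String, String[]> forest = new LinkedHashMap<>();
        // The default dictionary goes in first, then the user's additions.
        loadLibrary(forest, Arrays.asList("elasticsearch\tn\t1000", "segmenter\tn\t500"));
        loadLibrary(forest, Arrays.asList("segmenter\tv\t2000", "stopword"));
        for (Map.Entry<String, String[]> e : forest.entrySet()) {
            System.out.println(e.getKey() + " -> " + Arrays.toString(e.getValue()));
        }
    }
}
```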

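    I have deliberately left out the class name here, hoping that readers will locate the class themselves by debugging; thanks for your understanding.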

    In addition, let me also point out how stop words are handled:

    They are applied through the FilterModifWord class.
    The source needs a few changes:
package org.ansj.util;

import static org.ansj.util.MyStaticValue.LIBRARYLOG;

import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.UnsupportedEncodingException;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

import org.ansj.domain.Nature;
import org.ansj.domain.Term;
import org.ansj.library.UserDefineLibrary;
import org.nlpcn.commons.lang.tire.domain.Forest;
import org.nlpcn.commons.lang.util.IOUtil;
import org.nlpcn.commons.lang.util.StringUtil;

/*
 * Filters stop words and rewrites each term's nature to the user-defined nature.
 */
public class FilterModifWord {

	private static Set<String> FILTER = new HashSet<String>();

	private static String TAG = "#";

	private static boolean isTag = false;
	
	static{
		String filePath = MyStaticValue.stopWordsLibrary;
		initStopWordsDic(filePath);
	}
	
	/**
	 * Initialize the stop-word dictionary
	 * @param stopWordsPath path to a stop-word file, or to a directory of .dic files
	 */
	private static void initStopWordsDic(String stopWordsPath){
		File file = null;
		if (StringUtil.isNotBlank(stopWordsPath)) {
			file = new File(stopWordsPath);
			if (!file.canRead() || file.isHidden()) {
				LIBRARYLOG.warning("init stopWordsLibrary  warning :" + new File(stopWordsPath).getAbsolutePath() + " because : file not found or failed to read !");
				return;
			}
			if (file.isFile()) {
				loadStopWordsFile(file);
			} else if (file.isDirectory()) {
				File[] files = file.listFiles();
				for (int i = 0; i < files.length; i++) {
					if (files[i].getName().trim().endsWith(".dic")) {
						loadStopWordsFile(files[i]);
					}
				}
			} else {
				LIBRARYLOG.warning("init stopWordsLibrary  error :" + new File(stopWordsPath).getAbsolutePath() + " because : not find that file !");
			}
		}
	}
	
	/**
	 * Load a stop-word file, one stop word per line
	 * @param file the stop-word file to read
	 */
	private static void loadStopWordsFile(File file){
		if (!file.canRead()) {
			LIBRARYLOG.warning("file in path " + file.getAbsolutePath() + " can not be read!");
			return;
		}
		String temp = null;
		BufferedReader br = null;
		try {
			br = IOUtil.getReader(new FileInputStream(file), "UTF-8");
			while ((temp = br.readLine()) != null) {
				if (StringUtil.isBlank(temp)) {
					continue;
				} else {
					insertStopWord(temp);
				}
			}
			LIBRARYLOG.info("init stopWordsLibrary ok path is : " + file.getAbsolutePath());
		} catch (UnsupportedEncodingException e) {
			e.printStackTrace();
		} catch (IOException e) {
			e.printStackTrace();
		} finally {
			IOUtil.close(br);
			br = null;
		}
	}

	public static void insertStopWords(List<String> filterWords) {
		FILTER.addAll(filterWords);
	}

	public static void insertStopWord(String... filterWord) {
		for (String word : filterWord) {
			FILTER.add(word);
		}
	}

	public static void insertStopNatures(String... filterNatures) {
		isTag = true;
		for (String natureStr : filterNatures) {
			FILTER.add(TAG + natureStr);
		}
	}

	/*
	 * Filter stop words and correct natures using the global user dictionary
	 */
	public static List<Term> modifResult(List<Term> all) {
		List<Term> result = new ArrayList<Term>();
		try {
			for (Term term : all) {
				if (FILTER.size() > 0 && (FILTER.contains(term.getName()) || (isTag && FILTER.contains(TAG + term.natrue().natureStr)))) {
					continue;
				}
				String[] params = UserDefineLibrary.getParams(term.getName());
				if (params != null) {
					term.setNature(new Nature(params[0]));
				}
				result.add(term);
			}
		} catch (Exception e) {
			// TODO Auto-generated catch block
			System.err.println("FilterStopWord.updateDic can not be null , " + "you must use set FilterStopWord.setUpdateDic(map) or use method set map");
		}
		return result;
	}

	/*
	 * Filter stop words and correct natures using the given forests
	 */
	public static List<Term> modifResult(List<Term> all, Forest... forests) {
		List<Term> result = new ArrayList<Term>();
		try {
			for (Term term : all) {
				if (FILTER.size() > 0 && (FILTER.contains(term.getName()) || FILTER.contains(TAG + term.natrue().natureStr))) {
					continue;
				}
				for (Forest forest : forests) {
					String[] params = UserDefineLibrary.getParams(forest, term.getName());
					if (params != null) {
						term.setNature(new Nature(params[0]));
					}
				}
				result.add(term);
			}
		} catch (Exception e) {
			// TODO Auto-generated catch block
			System.err.println("FilterStopWord.updateDic can not be null , " + "you must use set FilterStopWord.setUpdateDic(map) or use method set map");
		}
		return result;
	}
}
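
    The filtering rule inside modifResult can be shown in isolation. The sketch below is a simplified stand-in that does not use ansj: `StopWordFilter` and its `Token` class are hypothetical, and only the idea of keeping stop words and "#"-prefixed nature tags in one set is taken from the class above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical, ansj-free sketch of stop-word filtering by word and by nature.
public class StopWordFilter {
    // Prefix that distinguishes nature tags from ordinary words in the set.
    private static final String TAG = "#";
    private final Set<String> filter = new HashSet<>();

    // Minimal stand-in for a segmented term: surface form plus nature string.
    public static final class Token {
        public final String name;
        public final String nature;

        public Token(String name, String nature) {
            this.name = name;
            this.nature = nature;
        }

        @Override
        public String toString() {
            return name + "/" + nature;
        }
    }

    public void addStopWords(String... words) {
        filter.addAll(Arrays.asList(words));
    }

    public void addStopNatures(String... natures) {
        for (String nature : natures) {
            filter.add(TAG + nature);
        }
    }

    // Keep only tokens whose word and nature are both absent from the set.
    public List<Token> apply(List<Token> tokens) {
        List<Token> kept = new ArrayList<>();
        for (Token token : tokens) {
            if (filter.contains(token.name) || filter.contains(TAG + token.nature)) {
                continue;
            }
            kept.add(token);
        }
        return kept;
    }

    public static void main(String[] args) {
        StopWordFilter f = new StopWordFilter();
        f.addStopWords("your");
        f.addStopNatures("w"); // assumed tag for punctuation in this example
        System.out.println(f.apply(Arrays.asList(
                new Token("your", "en"), new Token(",", "w"), new Token("holiday", "n"))));
    }
}
```

    One consequence of sharing a single set is that a literal word beginning with "#" could collide with a nature tag; two separate sets would avoid that at the cost of a second lookup.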



package org.ansj.util;

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.ObjectInputStream;
import java.io.UnsupportedEncodingException;
import java.util.HashMap;
import java.util.Map;
import java.util.ResourceBundle;
import java.util.concurrent.locks.Lock;
import java.util.concurrent.locks.ReentrantLock;
import java.util.logging.Logger;

import org.ansj.app.crf.Model;
import org.ansj.app.crf.SplitWord;
import org.ansj.dic.DicReader;
import org.ansj.domain.AnsjItem;
import org.ansj.library.DATDictionary;
import org.nlpcn.commons.lang.util.IOUtil;
import org.nlpcn.commons.lang.util.StringUtil;

/**
 * Holds shared static configuration values.
 * 
 * @author ansj
 * 
 */
public class MyStaticValue {

	public static final Logger LIBRARYLOG = Logger.getLogger("DICLOG");

	// Whether person-name recognition is enabled
	public static boolean isNameRecognition = true;

	private static final Lock LOCK = new ReentrantLock();

	// Whether number recognition is enabled
	public static boolean isNumRecognition = true;

	// Whether numbers and quantifiers are merged
	public static boolean isQuantifierRecognition = true;

	// CRF model

	// volatile so the unlocked null check in getCRFSplitWord sees a fully published value
	private static volatile SplitWord crfSplitWord = null;

	public static boolean isRealName = false;

	/**
	 * Dictionary locations; if a path is a directory, the .dic files under it are scanned
	 */
	public static String defaultLibrary = "library/default.dic";

	public static String ambiguityLibrary = "library/ambiguity.dic";
	
	public static String userLibrary = "library/userLibrary.dic";
	
	public static String stopWordsLibrary = "src/main/resources/newWord/newWordFilter.dic";

	/**
	 * Whether the user dictionary skips loading words that already exist
	 */
	public static boolean isSkipUserDefine = false;

	static {
		/**
		 * Values read from the configuration file
		 */
		try {
			ResourceBundle rb = ResourceBundle.getBundle("library");
			if (rb.containsKey("defaultLibrary"))
				defaultLibrary = rb.getString("defaultLibrary");
			if (rb.containsKey("ambiguityLibrary"))
				ambiguityLibrary = rb.getString("ambiguityLibrary");
			if (rb.containsKey("userLibrary"))
				userLibrary = rb.getString("userLibrary");
			if (rb.containsKey("stopWordsLibrary"))
				stopWordsLibrary = rb.getString("stopWordsLibrary");
			if (rb.containsKey("isSkipUserDefine"))
				isSkipUserDefine = Boolean.valueOf(rb.getString("isSkipUserDefine"));
			if (rb.containsKey("isRealName"))
				isRealName = Boolean.valueOf(rb.getString("isRealName"));
		} catch (Exception e) {
			LIBRARYLOG.warning("not find library.properties in classpath use it by default !");
		}
	}

	/**
	 * Person-name dictionary
	 * 
	 * @return
	 */
	public static BufferedReader getPersonReader() {
		return DicReader.getReader("person/person.dic");
	}

	/**
	 * Organization-name dictionary
	 * 
	 * @return
	 */
	public static BufferedReader getCompanReader() {
		return DicReader.getReader("company/company.data");
	}

	/**
	 * New-word dictionary
	 * 
	 * @return
	 */
	public static BufferedReader getNewWordReader() {
		return DicReader.getReader("newWord/new_word_freq.dic");
	}

	/**
	 * Core dictionary
	 * 
	 * @return
	 */
	public static BufferedReader getArraysReader() {
		// TODO Auto-generated method stub
		return DicReader.getReader("arrays.dic");
	}

	/**
	 * Number dictionary
	 * 
	 * @return
	 */
	public static BufferedReader getNumberReader() {
		// TODO Auto-generated method stub
		return DicReader.getReader("numberLibrary.dic");
	}

	/**
	 * English dictionary
	 * 
	 * @return
	 */
	public static BufferedReader getEnglishReader() {
		// TODO Auto-generated method stub
		return DicReader.getReader("englishLibrary.dic");
	}

	/**
	 * Nature (part-of-speech) table
	 * 
	 * @return
	 */
	public static BufferedReader getNatureMapReader() {
		// TODO Auto-generated method stub
		return DicReader.getReader("nature/nature.map");
	}

	/**
	 * Nature association table
	 * 
	 * @return
	 */
	public static BufferedReader getNatureTableReader() {
		// TODO Auto-generated method stub
		return DicReader.getReader("nature/nature.table");
	}

	/**
	 * Frequency dictionary of the single characters used in person names
	 * 
	 * @return
	 */
	public static BufferedReader getPersonFreqReader() {
		// TODO Auto-generated method stub
		return DicReader.getReader("person/name_freq.dic");
	}

	/**
	 * Deserializes the person-name frequency map
	 * 
	 * @return
	 */
	@SuppressWarnings("unchecked")
	public static Map<String, int[][]> getPersonFreqMap() {
		InputStream inputStream = null;
		ObjectInputStream objectInputStream = null;
		Map<String, int[][]> map = new HashMap<String, int[][]>(0);
		try {
			inputStream = DicReader.getInputStream("person/asian_name_freq.data");
			objectInputStream = new ObjectInputStream(inputStream);
			map = (Map<String, int[][]>) objectInputStream.readObject();

		} catch (IOException e) {
			// TODO Auto-generated catch block
			e.printStackTrace();
		} catch (ClassNotFoundException e) {
			// TODO Auto-generated catch block
			e.printStackTrace();
		} finally {
			try {
				if (objectInputStream != null)
					objectInputStream.close();
				if (inputStream != null)
					inputStream.close();
			} catch (IOException e) {
				// TODO Auto-generated catch block
				e.printStackTrace();
			}
		}
		return map;
	}

	/**
	 * Loads the word-to-word (bigram) association table
	 */
	public static void initBigramTables() {
		BufferedReader reader = null;
		try {
			reader = IOUtil.getReader(DicReader.getInputStream("bigramdict.dic"), "UTF-8");
			String temp = null;
			String[] strs = null;
			int freq = 0;
			while ((temp = reader.readLine()) != null) {
				if (StringUtil.isBlank(temp)) {
					continue;
				}
				strs = temp.split("\t");
				freq = Integer.parseInt(strs[1]);
				strs = strs[0].split("@");
				AnsjItem fromItem = DATDictionary.getItem(strs[0]);

				AnsjItem toItem = DATDictionary.getItem(strs[1]);

				if (fromItem == AnsjItem.NULL && strs[0].contains("#")) {
					fromItem = AnsjItem.BEGIN;
				}

				if (toItem == AnsjItem.NULL && strs[1].contains("#")) {
					toItem = AnsjItem.END;
				}

				if (fromItem == AnsjItem.NULL || toItem == AnsjItem.NULL) {
					continue;
				}
				
				if(fromItem.bigramEntryMap==null){
					fromItem.bigramEntryMap = new HashMap<Integer, Integer>() ;
				}

				fromItem.bigramEntryMap.put(toItem.index, freq) ;

			}
		} catch (NumberFormatException e) {
			// TODO Auto-generated catch block
			e.printStackTrace();
		} catch (UnsupportedEncodingException e) {
			// TODO Auto-generated catch block
			e.printStackTrace();
		} catch (IOException e) {
			// TODO Auto-generated catch block
			e.printStackTrace();
		} finally {
			IOUtil.close(reader);
		}
		
	}

	/**
	 * Returns the default CRF model, loading it on first use
	 * 
	 * @return
	 */
	public static SplitWord getCRFSplitWord() {
		// Fast path: the model is already loaded
		if (crfSplitWord != null) {
			return crfSplitWord;
		}
		LOCK.lock();
		try {
			// Re-check under the lock so that only one thread loads the model
			if (crfSplitWord == null) {
				long start = System.currentTimeMillis();
				LIBRARYLOG.info("begin init crf model!");
				crfSplitWord = new SplitWord(Model.loadModel(DicReader.getInputStream("crf/crf.model")));
				LIBRARYLOG.info("load crf crf use time:" + (System.currentTimeMillis() - start));
			}
		} catch (Exception e) {
			e.printStackTrace();
		} finally {
			// Always release the lock, including when the model was already loaded
			LOCK.unlock();
		}

		return crfSplitWord;
	}

}
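
    getCRFSplitWord above loads the CRF model lazily behind a lock. A minimal, self-contained sketch of that pattern follows. `LazyModel` and its `Supplier`-based loader are hypothetical names, not ansj classes. The points it illustrates are that every path that acquires the lock releases it in `finally`, and that the cached field is `volatile` so the unlocked first check is safe.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.locks.Lock;
import java.util.concurrent.locks.ReentrantLock;
import java.util.function.Supplier;

// Hypothetical lazy holder: loads a value once, on first access.
public class LazyModel<T> {
    private final Lock lock = new ReentrantLock();
    private final Supplier<T> loader;
    private volatile T instance;

    public LazyModel(Supplier<T> loader) {
        this.loader = loader;
    }

    public T get() {
        // Fast path without locking once the value exists.
        T local = instance;
        if (local != null) {
            return local;
        }
        lock.lock();
        try {
            // Re-check under the lock so the loader runs at most once.
            if (instance == null) {
                instance = loader.get();
            }
            return instance;
        } finally {
            // Runs on every exit path, so the lock can never be leaked.
            lock.unlock();
        }
    }

    public static void main(String[] args) {
        AtomicInteger loads = new AtomicInteger();
        LazyModel<String> model = new LazyModel<>(() -> {
            loads.incrementAndGet();
            return "model";
        });
        model.get();
        model.get();
        System.out.println("loader calls: " + loads.get());
    }
}
```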

    Test case:
package org.ansj.demo;

import java.util.List;

import org.ansj.domain.Term;
import org.ansj.splitWord.analysis.NlpAnalysis;
import org.ansj.util.FilterModifWord;

public class StopWordDemo {
	public static void main(String[] args) {
//        FilterModifWord.insertStopWord("五一");
        List<Term> parseResultList = NlpAnalysis.parse("your五一,劳动节快乐");
        System.out.println(parseResultList);
        parseResultList = FilterModifWord.modifResult(parseResultList);
        System.out.println(parseResultList);
	}
}

