[Repost] ICTCLAS, the Chinese Academy of Sciences word segmentation tool: Java JNI interface

chenwq


Blog category:

Data mining related

ICTCLAS website: http://www.ictclas.org

ICTCLAS 5.0, from the Institute of Computing Technology (ICT), Chinese Academy of Sciences

ICTCLAS stands for:

Institute of Computing Technology, Chinese Lexical Analysis System

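that is, the Chinese lexical analysis system of the Institute of Computing Technology, Chinese Academy of Sciences.

Download for the open-source version: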

http://www.ictclas.org/ictclas_download_more.aspx

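Main features: Chinese word segmentation, part-of-speech (POS) tagging, named entity recognition, new word detection, and support for user dictionaries.

ICTCLAS is based on a Hierarchical Hidden Markov Model.

ICTCLAS is written entirely in C/C++. It runs on Linux, FreeBSD, and the Windows family of operating systems, and can be called from mainstream development languages such as C, C++, C#, Delphi, and Java.

It segments text encoded in GBK (Guo-Biao Kuozhan, Simplified Chinese), UTF-8, and Big5 (Traditional Chinese), and it supports multithreaded segmentation.

Users can define the POS tag set and the format of the output.

Users can also tailor the segmentation system to their own needs.

To date, more than 50,000 ICTCLAS licenses have been issued to companies and academic institutions in China and abroad, including companies such as Tencent, NEC, 中华商务网, 硅谷动力, and Yunnan Daily, and universities such as Peking University, Tsinghua University, South China University of Technology, and MIT.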

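New features in ICTCLAS 5.0:

Multiple character encodings

ICTCLAS supports the common character encodings GB2312, GBK, GB18030, UTF-8, and BIG5. The caller can specify the encoding or let the system detect it automatically.

Traditional Chinese segmentation

The system supports BIG5 encoding, so it can segment Traditional Chinese text.

Multithreaded calls

The kernel has been fully upgraded and can be called from multiple threads.

Character encoding

Enum values: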

enum eCodeType {
    CODE_TYPE_UNKNOWN, // 0: unknown, detected automatically by the system
    CODE_TYPE_ASCII,   // 1: ASCII
    CODE_TYPE_GB,      // 2: GB2312, GBK, GB18030
    CODE_TYPE_UTF8,    // 3: UTF-8
    CODE_TYPE_BIG5     // 4: BIG5
};
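Since the Java methods take the encoding as a plain int, it can help to mirror the enum on the Java side. The following is a minimal sketch, not part of ICTCLAS: the class name CodeType and both helper methods are hypothetical, and the choice of GB18030 as the Java charset for CODE_TYPE_GB is an assumption (GB18030 is a superset of GB2312 and GBK).

```java
import java.nio.charset.Charset;

/** Hypothetical helper mirroring the eCodeType values listed above. */
final class CodeType {
    static final int UNKNOWN = 0; // let ICTCLAS detect the encoding
    static final int ASCII   = 1;
    static final int GB      = 2; // GB2312, GBK, GB18030
    static final int UTF8    = 3;
    static final int BIG5    = 4;

    private CodeType() {}

    /** Java charset name that matches a code type (assumed mapping). */
    static String charsetNameFor(int codeType) {
        switch (codeType) {
            case ASCII: return "US-ASCII";
            case GB:    return "GB18030"; // superset of GB2312 and GBK
            case UTF8:  return "UTF-8";
            case BIG5:  return "Big5";
            default:
                // UNKNOWN has no fixed charset: the caller must pick one.
                throw new IllegalArgumentException("no fixed charset for code type " + codeType);
        }
    }

    /** Encodes text so that its bytes match the code type passed to ICTCLAS. */
    static byte[] encode(String text, int codeType) {
        return text.getBytes(Charset.forName(charsetNameFor(codeType)));
    }
}
```

With this helper, the bytes and the code type handed to a call such as ICTCLAS_ParagraphProcess come from the same constant, which avoids passing UTF-8 bytes together with code type 2 by mistake.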

Java interface

JNI

boolean ICTCLAS_Init(byte[] sPath);

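Initializes the system and returns whether initialization succeeded.

The sPath parameter is a byte array holding a directory path. That directory must contain the configuration file (Configure.xml), the Data folder, and the license file (user.lic).

It can be called like this:

ICTCLAS50 testICTCLAS50 = new ICTCLAS50();
String path = "."; // current directory
testICTCLAS50.ICTCLAS_Init(path.getBytes());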

ICTCLAS_Init must be invoked before any other ICTCLAS operation.

It only needs to be called once for the whole system, before ICTCLAS is first used.

When the system is shutting down and no further operations will be made, call ICTCLAS_Exit to destroy all working buffers.

Every operation fails if initialization did not succeed.

ICTCLAS_Init fails mainly for two reasons:

1) Required data is incompatible or missing.

2) The configuration file is missing or contains invalid parameters.

The log file ictclas.log in the default directory gives more detail.

boolean ICTCLAS_Exit();

Exits the program, frees all resources, and destroys all working buffers used by ICTCLAS.

Returns true on success, false otherwise.

ICTCLAS_Exit must be invoked when the system is shutting down and no further operations will be made.

To restart ICTCLAS afterwards, call ICTCLAS_Init again.
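The pairing of ICTCLAS_Init and ICTCLAS_Exit described above can be enforced with try-with-resources. This is only a sketch under stated assumptions: the interface IctclasLifecycle and the class IctclasSession are hypothetical names introduced here, and ICTCLAS50 does not implement the interface itself, so a small adapter that forwards the two calls would be needed.

```java
/** The two lifecycle calls of ICTCLAS50, pulled into an interface for illustration. */
interface IctclasLifecycle {
    boolean ICTCLAS_Init(byte[] sPath);
    boolean ICTCLAS_Exit();
}

/** Hypothetical wrapper: initializes once and guarantees a single ICTCLAS_Exit. */
final class IctclasSession implements AutoCloseable {
    private final IctclasLifecycle engine;
    private boolean open;

    IctclasSession(IctclasLifecycle engine, String dataDir) {
        this.engine = engine;
        // Fail fast: no other operation works if initialization failed.
        if (!engine.ICTCLAS_Init(dataDir.getBytes())) {
            throw new IllegalStateException(
                "ICTCLAS_Init failed for '" + dataDir + "'; see ictclas.log");
        }
        this.open = true;
    }

    boolean isOpen() {
        return open;
    }

    @Override
    public void close() {
        // Call ICTCLAS_Exit at most once, even if close() is called twice.
        if (open) {
            open = false;
            engine.ICTCLAS_Exit();
        }
    }
}
```

Segmentation calls would then go inside a block of the form try (IctclasSession s = new IctclasSession(adapter, ".")) { ... }, so that ICTCLAS_Exit runs even when an exception is thrown.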

int ICTCLAS_ImportUserDictFile(byte[] sPath, int eCodeType);

Imports a user-defined dictionary from a text file.

Return Value

The number of lexical entries imported successfully.

Parameters

sPath: Text file name of the user dictionary

eCodeType: The character encoding of the dictionary file

Remarks

You only need to invoke this function the first time you use a customized lexicon, or when you change it.

Once the lexicon has been imported and is left unchanged, ICTCLAS loads it automatically if UserDict is set to "on" in the configuration file. If UserDict is set to "off" in Configure.xml, the user-defined lexicon is not applied.

Example call:

// Import the user dictionary
int nCount = 0;
String usrdir = "usrdir.txt"; // path of the user dictionary
byte[] usrdirb = usrdir.getBytes();
// First argument: path of the user dictionary; second argument: its encoding type
// (0: unknown; 1: ASCII; 2: GB2312, GBK, GB18030; 3: UTF-8; 4: BIG5)
nCount = testICTCLAS50.ICTCLAS_ImportUserDictFile(usrdirb, 0);
System.out.println("Number of user words imported: " + nCount);

byte[] ICTCLAS_ParagraphProcess(byte[] sSrc, int eCodeType, int bPOSTagged);

Processes a paragraph of text.

Return Value

The result of the processing, as a byte array.

Parameters

sSrc: The source paragraph

eCodeType: The character encoding of the source paragraph

bPOSTagged: Whether to apply POS tagging: 0 for no tags, 1 for tagging
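The byte array returned by ICTCLAS_ParagraphProcess has to be decoded into a String and then split into words. The sketch below assumes, without confirmation from this post, that the tagged output is a whitespace-separated sequence of word/tag items; TaggedToken and TaggedOutput are hypothetical names introduced for illustration.

```java
import java.util.ArrayList;
import java.util.List;

/** One segmented word with its POS tag (empty when the output is untagged). */
final class TaggedToken {
    final String word;
    final String pos;

    TaggedToken(String word, String pos) {
        this.word = word;
        this.pos = pos;
    }
}

/** Hypothetical parser for output assumed to look like "word/tag word/tag ...". */
final class TaggedOutput {
    private TaggedOutput() {}

    static List<TaggedToken> parse(String tagged) {
        List<TaggedToken> tokens = new ArrayList<TaggedToken>();
        for (String item : tagged.trim().split("\\s+")) {
            if (item.isEmpty()) {
                continue; // empty input yields one empty item
            }
            // Split at the last slash so that a word containing '/' stays whole.
            int slash = item.lastIndexOf('/');
            if (slash <= 0) {
                tokens.add(new TaggedToken(item, ""));
            } else {
                tokens.add(new TaggedToken(item.substring(0, slash), item.substring(slash + 1)));
            }
        }
        return tokens;
    }
}
```

The input String would come from something like new String(resultBytes, "GB2312"), using the same encoding that was used for the source paragraph.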

boolean ICTCLAS_FileProcess(byte[] sSrcFilename, int eCodeType, int bPOSTagged, byte[] sDestFilename);

Processes a text file.

Return Value

Returns true if processing succeeds, false otherwise.

Parameters

sSrcFilename: Path of the source file to be analyzed

eCodeType: The character encoding of the source file

bPOSTagged: Whether to apply POS tagging: 0 for no tags, 1 for tagging

sDestFilename: Name of the file in which the results are stored

Example call:

// Input file name
String Inputfilename = "test.txt";
byte[] Inputfilenameb = Inputfilename.getBytes();
// Output file name for the segmentation result
String Outputfilename = "test_result.txt";
byte[] Outputfilenameb = Outputfilename.getBytes();
// Segment the file. Arguments: input file name, file encoding type,
// whether to tag parts of speech (1 yes, 0 no), output file name
testICTCLAS50.ICTCLAS_FileProcess(Inputfilenameb, 0, 1, Outputfilenameb);

int ICTCLAS_SetPOSmap(int nPOSmap);

Selects which POS (part of speech) map, that is, which tag set, is used.

Return Value

Returns 1 on success, 0 otherwise.

Parameters

nPOSmap: one of the following enum values

ICT_POS_MAP_FIRST = 1: ICT first-level tag set

ICT_POS_MAP_SECOND = 0: ICT second-level tag set

PKU_POS_MAP_SECOND = 2: PKU (Peking University) second-level tag set

PKU_POS_MAP_FIRST = 3: PKU first-level tag set

If the input parameter is not valid, the system chooses a default POS map.

// Example call
testICTCLAS50.ICTCLAS_SetPOSmap(1);
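Because the numeric values do not follow the order in which the tag sets are listed (the ICT first-level set is 1, the second-level set is 0), a named constant is easier to read than a bare integer. A minimal sketch follows; the enum PosMap is a hypothetical Java-side convenience, and only the four numeric values come from the list above.

```java
/** Hypothetical Java mirror of the POS map values accepted by ICTCLAS_SetPOSmap. */
enum PosMap {
    ICT_SECOND(0), // ICT second-level tag set
    ICT_FIRST(1),  // ICT first-level tag set
    PKU_SECOND(2), // PKU second-level tag set
    PKU_FIRST(3);  // PKU first-level tag set

    private final int value;

    PosMap(int value) {
        this.value = value;
    }

    /** The int to pass as nPOSmap. */
    int value() {
        return value;
    }
}
```

The example call then becomes testICTCLAS50.ICTCLAS_SetPOSmap(PosMap.ICT_FIRST.value());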

Data directory:

byte[] nativeProcAPara(byte[] sSrc, int eCodeType, int bPOStagged);

Returns the segmentation result as an array of stResult structures.

In the C header:

#define POS_SIZE 8 // maximum size of a POS tag, in bytes

The result of the processing is transferred as a class type. The class is defined like this:

class stResult
{
    int start;     // start position of the word in the input sentence
    int length;    // length of the word
    int iPOS;      // POS ID
    String sPOS;   // POS tag; char[8] in C
    int word_ID;   // word ID
    int word_type; // whether the word comes from the user dictionary (0: no, 1: yes)
    int weight;    // word weight

    public void setStart(int start) {
        this.start = start;
    }

    public void setLength(int length) {
        this.length = length;
    }

    public void setiPOS(int iPOS) {
        this.iPOS = iPOS;
    }

    public void setsPOS(String sPOS) {
        this.sPOS = sPOS;
    }

    public void setWord_ID(int word_ID) {
        this.word_ID = word_ID;
    }

    public void setWord_type(int word_type) {
        this.word_type = word_type;
    }

    public void setWeight(int weight) {
        this.weight = weight;
    }
}

// Requires: import java.util.ArrayList; import java.util.Arrays; import java.util.List;

// Process the string (sInput is the text to segment)
byte nativeBytes[] = testICTCLAS50.nativeProcAPara(sInput.getBytes("GB2312"), 0, 1);
List<stResult> al = new ArrayList<stResult>();

// Convert the result. Each record is 32 bytes (six 4-byte ints plus an 8-byte tag).
// Every read below advances i itself, so the for statement has no increment;
// an extra i++ per record would misalign every record after the first.
for (int i = 0; i + 32 <= nativeBytes.length; )
{
    // Start position of the word in the input sentence
    byte a[] = Arrays.copyOfRange(nativeBytes, i, i + 4);
    i += 4;
    int start = byteToInt2(a);
    start = Integer.reverseBytes(start);
    System.out.print(" " + start);

    // Length of the word
    byte b[] = Arrays.copyOfRange(nativeBytes, i, i + 4);
    i += 4;
    int length = byteToInt2(b);
    length = Integer.reverseBytes(length);
    System.out.print(" " + length);

    // POS ID
    byte c[] = Arrays.copyOfRange(nativeBytes, i, i + 4);
    i += 4;
    int iPOS = byteToInt2(c);
    iPOS = Integer.reverseBytes(iPOS);
    System.out.print(" " + iPOS);

    // POS tag (a NUL-padded char[8] in C, so trim the padding)
    byte s[] = Arrays.copyOfRange(nativeBytes, i, i + 8);
    i += 8;
    String sPOS = new String(s).trim();
    System.out.print(" " + sPOS);

    // Word ID
    byte j[] = Arrays.copyOfRange(nativeBytes, i, i + 4);
    i += 4;
    int word_ID = byteToInt2(j);
    word_ID = Integer.reverseBytes(word_ID);
    System.out.print(" " + word_ID);

    // Word type: whether the word comes from the user dictionary
    byte k[] = Arrays.copyOfRange(nativeBytes, i, i + 4);
    i += 4;
    int word_type = byteToInt2(k);
    word_type = Integer.reverseBytes(word_type);
    System.out.print(" " + word_type);

    // Word weight
    byte w[] = Arrays.copyOfRange(nativeBytes, i, i + 4);
    i += 4;
    int weight = byteToInt2(w);
    weight = Integer.reverseBytes(weight);
    System.out.print(" " + weight);

    // Copy the values into the result object
    stResult stR = new stResult();
    stR.setStart(start);
    stR.setLength(length);
    stR.setiPOS(iPOS);
    stR.setsPOS(sPOS);
    stR.setWord_ID(word_ID);
    stR.setWord_type(word_type);
    stR.setWeight(weight);
    al.add(stR);
}
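The same decoding can be written more compactly with java.nio.ByteBuffer, which handles the byte order directly. This is a sketch under two assumptions taken from the loop above rather than from the ICTCLAS headers: the fields arrive in the order start, length, iPOS, 8-byte sPOS, word_ID, word_type, weight, and the ints are little-endian. ResultRecord and ResultDecoder are hypothetical names.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

/** Immutable counterpart of stResult, for illustration only. */
final class ResultRecord {
    final int start, length, iPOS, wordId, wordType, weight;
    final String sPOS;

    ResultRecord(int start, int length, int iPOS, String sPOS,
                 int wordId, int wordType, int weight) {
        this.start = start;
        this.length = length;
        this.iPOS = iPOS;
        this.sPOS = sPOS;
        this.wordId = wordId;
        this.wordType = wordType;
        this.weight = weight;
    }
}

final class ResultDecoder {
    static final int POS_SIZE = 8;                   // char[8] in C
    static final int RECORD_SIZE = 6 * 4 + POS_SIZE; // 32 bytes per record

    private ResultDecoder() {}

    static List<ResultRecord> decode(byte[] raw) {
        // Assumed layout: little-endian ints, fields in the order read below.
        ByteBuffer buf = ByteBuffer.wrap(raw).order(ByteOrder.LITTLE_ENDIAN);
        List<ResultRecord> out = new ArrayList<ResultRecord>();
        byte[] pos = new byte[POS_SIZE];
        while (buf.remaining() >= RECORD_SIZE) {
            int start = buf.getInt();
            int length = buf.getInt();
            int iPOS = buf.getInt();
            buf.get(pos);
            int wordId = buf.getInt();
            int wordType = buf.getInt();
            int weight = buf.getInt();
            out.add(new ResultRecord(start, length, iPOS, cString(pos),
                                     wordId, wordType, weight));
        }
        return out;
    }

    /** Reads a NUL-terminated C string from a fixed-size buffer. */
    private static String cString(byte[] bytes) {
        int end = 0;
        while (end < bytes.length && bytes[end] != 0) {
            end++;
        }
        return new String(bytes, 0, end, StandardCharsets.US_ASCII);
    }
}
```

Trailing bytes that do not fill a whole 32-byte record are ignored rather than read past the end of the array.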

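// Convert 4 bytes to an int, most significant byte first (big-endian)
public int byteToInt2(byte[] b) {
    int mask = 0xff;
    int temp = 0;
    int n = 0;
    for (int i = 0; i < 4; i++) {
        n <<= 8;
        temp = b[i] & mask;
        n |= temp;
    }
    return n;
}

Alternatively, the bytes can be read least significant byte first (little-endian). This variant already yields the value that the first version produces only after Integer.reverseBytes, so when it is used, the Integer.reverseBytes calls in the loop must be dropped:

public int byteToInt2(byte[] b) {
    int n = b[0] & 0xFF;
    n |= ((b[1] << 8) & 0xFF00);
    n |= ((b[2] << 16) & 0xFF0000);
    n |= ((b[3] << 24) & 0xFF000000);
    return n;
}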
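The two byteToInt2 variants read the same four bytes in opposite orders, which is easy to overlook. The short sketch below makes the relationship explicit; ByteOrderDemo and its method names are hypothetical and only restate the two helpers side by side.

```java
/** Illustrates how the two byte-to-int helpers relate to each other. */
final class ByteOrderDemo {
    private ByteOrderDemo() {}

    /** Most significant byte first, like the loop-based helper. */
    static int readBigEndian(byte[] b) {
        int n = 0;
        for (int i = 0; i < 4; i++) {
            n = (n << 8) | (b[i] & 0xFF);
        }
        return n;
    }

    /** Least significant byte first, like the shift-based helper. */
    static int readLittleEndian(byte[] b) {
        return (b[0] & 0xFF)
             | ((b[1] & 0xFF) << 8)
             | ((b[2] & 0xFF) << 16)
             | ((b[3] & 0xFF) << 24);
    }

    /** True when reversing the big-endian value gives the little-endian one. */
    static boolean agree(byte[] b) {
        return Integer.reverseBytes(readBigEndian(b)) == readLittleEndian(b);
    }
}
```

For the bytes 0x20 0x00 0x00 0x00, readLittleEndian gives 32 while readBigEndian gives 0x20000000, so applying Integer.reverseBytes on top of the little-endian helper would turn a word length of 32 into 536870912.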

Viewing the exported functions of ICTCLAS50.dll

ICTCLAS50.java

Configuration file Configure.xml

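1. Nouns (1 first-level class, 7 second-level, 5 third-level)

Nouns are divided into the following subclasses:

n noun

nr personal name

nr1 Chinese surname

nr2 Chinese given name

nrj Japanese personal name

nrf transliterated personal name

ns place name

nsf transliterated place name

nt organization or group name

nz other proper noun

nl nominal idiom

ng nominal morpheme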

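2. Time words (1 first-level class, 1 second-level)

t time word

tg time-word morpheme

3. Place words (1 first-level class)

s place word

4. Direction words (1 first-level class)

f direction word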

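5. Verbs (1 first-level class, 9 second-level)

v verb

vd adverbial verb

vn nominal verb

vshi the verb "是"

vyou the verb "有"

vf directional verb

vx formal verb

vi intransitive verb

vl verbal idiom

vg verbal morpheme

6. Adjectives (1 first-level class, 4 second-level)

a adjective

ad adverbial adjective

an nominal adjective

ag adjectival morpheme

al adjectival idiom

7. Distinguishing words (1 first-level class, 2 second-level)

b distinguishing word

bl distinguishing-word idiom

8. State words (1 first-level class)

z state word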

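9. Pronouns (1 first-level class, 4 second-level, 6 third-level)

r pronoun

rr personal pronoun

rz demonstrative pronoun

rzt temporal demonstrative pronoun

rzs locative demonstrative pronoun

rzv predicative demonstrative pronoun

ry interrogative pronoun

ryt temporal interrogative pronoun

rys locative interrogative pronoun

ryv predicative interrogative pronoun

rg pronominal morpheme

10. Numerals (1 first-level class, 1 second-level)

m numeral

mq numeral-classifier compound

11. Classifiers (1 first-level class, 2 second-level)

q classifier

qv verbal classifier

qt temporal classifier

12. Adverbs (1 first-level class)

d adverb

13. Prepositions (1 first-level class, 2 second-level)

p preposition

pba the preposition "把"

pbei the preposition "被"

14. Conjunctions (1 first-level class, 1 second-level)

c conjunction

cc coordinating conjunction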

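15. Particles (1 first-level class, 15 second-level)

u particle

uzhe 着

ule 了 喽

uguo 过

ude1 的 底

ude2 地

ude3 得

usuo 所

udeng 等 等等 云云

uyy 一样 一般 似的 般

udh 的话

uls 来讲 来说 而言 说来

uzhi 之

ulian 连 (as in "连小学生都会")

16. Interjections (1 first-level class)

e interjection

17. Modal particles (1 first-level class)

y modal particle (yg deleted)

18. Onomatopoeia (1 first-level class)

o onomatopoeia

19. Prefixes (1 first-level class)

h prefix

20. Suffixes (1 first-level class)

k suffix

21. Strings (1 first-level class, 2 second-level)

x string

xx non-morpheme character

xu URL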

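22. Punctuation (1 first-level class, 16 second-level)

w punctuation mark

wkz left bracket; full-width: （〔［｛《【〖〈 half-width: ( [ { <

wky right bracket; full-width: ）〕］｝》】〗〉 half-width: ) ] } >

wyz left quotation mark; full-width: “ ‘ 『

wyy right quotation mark; full-width: ” ’ 』

wj full stop; full-width: 。

ww question mark; full-width: ？ half-width: ?

wt exclamation mark; full-width: ！ half-width: !

wd comma; full-width: ， half-width: ,

wf semicolon; full-width: ； half-width: ;

wn enumeration comma; full-width: 、

wm colon; full-width: ： half-width: :

ws ellipsis; full-width: …… …

wp dash; full-width: —— －－ ——－ half-width: --- ----

wb percent and per-mille signs; full-width: ％ ‰ half-width: %

wh unit symbols; full-width: ￥＄￡ ° ℃ half-width: $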
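In the tag set listed above, every second-level and third-level tag begins with the letter of its first-level class (nr1 starts with n, vshi with v, wkz with w). That makes it straightforward to collapse a fine-grained tag to its top-level category. The sketch below relies only on that observation; PosCategory is a hypothetical helper, and the English category names are descriptive labels chosen here.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper: maps any tag from the list above to its first-level class. */
final class PosCategory {
    private static final Map<Character, String> FIRST_LEVEL = new HashMap<Character, String>();

    static {
        FIRST_LEVEL.put('n', "noun");
        FIRST_LEVEL.put('t', "time word");
        FIRST_LEVEL.put('s', "place word");
        FIRST_LEVEL.put('f', "direction word");
        FIRST_LEVEL.put('v', "verb");
        FIRST_LEVEL.put('a', "adjective");
        FIRST_LEVEL.put('b', "distinguishing word");
        FIRST_LEVEL.put('z', "state word");
        FIRST_LEVEL.put('r', "pronoun");
        FIRST_LEVEL.put('m', "numeral");
        FIRST_LEVEL.put('q', "classifier");
        FIRST_LEVEL.put('d', "adverb");
        FIRST_LEVEL.put('p', "preposition");
        FIRST_LEVEL.put('c', "conjunction");
        FIRST_LEVEL.put('u', "particle");
        FIRST_LEVEL.put('e', "interjection");
        FIRST_LEVEL.put('y', "modal particle");
        FIRST_LEVEL.put('o', "onomatopoeia");
        FIRST_LEVEL.put('h', "prefix");
        FIRST_LEVEL.put('k', "suffix");
        FIRST_LEVEL.put('x', "string");
        FIRST_LEVEL.put('w', "punctuation");
    }

    private PosCategory() {}

    /** Returns the first-level class name, or "unknown" for an unlisted tag. */
    static String firstLevel(String tag) {
        if (tag == null || tag.isEmpty()) {
            return "unknown";
        }
        String name = FIRST_LEVEL.get(tag.charAt(0));
        return name == null ? "unknown" : name;
    }
}
```

A typical use is filtering: dropping every token whose first-level class is punctuation, or keeping only nouns and verbs for keyword extraction.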

Original post:

http://my.oschina.net/smilethat/blog/42537


2012-06-04 15:07