Paoding Analysis: Chinese Word Segmentation for Lucene

qpshenggui

Tags: analyzer, paodinganalyzer, 庖丁解牛 (Paoding), lucene

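Paoding Analysis Summary

Paoding's Knives is a Chinese word segmenter that is both extremely efficient and highly extensible. It is built around a metaphor, uses a fully object-oriented design, and is conceptually advanced.

High efficiency: on a personal machine with a PIII processor and 1 GB of memory, it can accurately segment 1 million Chinese characters per second.

It segments text using dictionary files, with no limit on how many there are, which makes it possible to define vocabulary by category.

It can parse unknown words in a reasonable way.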
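To make the dictionary-driven idea concrete, here is a minimal sketch of forward maximum matching over several word lists. This is a swapped-in textbook technique, not Paoding's actual algorithm; the class name `DictionarySegmenterSketch`, the method names, and the single-character fallback for unknown text are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Toy forward-maximum-matching segmenter (hypothetical, NOT Paoding's implementation). */
class DictionarySegmenterSketch {
    private final Set<String> words = new HashSet<String>();
    private int maxWordLength = 1;

    /** Merges one more word list; call it once per dictionary, as often as needed. */
    void addDictionary(Collection<String> dictionary) {
        for (String word : dictionary) {
            if (word.isEmpty()) {
                continue;
            }
            words.add(word);
            maxWordLength = Math.max(maxWordLength, word.length());
        }
    }

    /** At each position, emits the longest dictionary word, or one character if none matches. */
    List<String> segment(String text) {
        List<String> tokens = new ArrayList<String>();
        int pos = 0;
        while (pos < text.length()) {
            int longest = Math.min(maxWordLength, text.length() - pos);
            int matched = 1; // assumed fallback for text no dictionary covers
            for (int len = longest; len > 1; len--) {
                if (words.contains(text.substring(pos, pos + len))) {
                    matched = len;
                    break;
                }
            }
            tokens.add(text.substring(pos, pos + matched));
            pos += matched;
        }
        return tokens;
    }
}
```

Loading a general list containing 中文 and 分词 plus a separate list containing 庖丁解牛, then segmenting 庖丁解牛中文分词器, yields 庖丁解牛, 中文, 分词, 器. Keeping each category in its own list is what lets vocabulary be maintained per category while the matcher sees a single merged set.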

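2010-01-20 Paoding Lucene 3.0 Upgrade Notes

(The code has been committed to svn; the download package will follow a little later.)

The main goal of this upgrade is to support Lucene 3.0. The specific changes are:

(1) Lucene 3.0 is supported. For versions earlier than Lucene 3.0, build from the code at http://paoding.googlecode.com/svn/branches/paoding-for-lucene-2.4/ instead.

(2) The code is compiled with Java 5.0 and no longer supports Java 1.4; new features will be developed on Java 5 from now on.

(3) The calling interface of PaodingAnalyzer is unchanged, but usage has to follow the Lucene 3.0 API. A segmentation example: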

    // Create the analyzer instance
    Analyzer analyzer = new PaodingAnalyzer(properties);

    // Obtain the token stream
    TokenStream stream = analyzer.tokenStream("", reader);

    // Reset to the start of the stream
    stream.reset();

    // Add the attribute helpers
    TermAttribute termAtt = (TermAttribute) stream.addAttribute(TermAttribute.class);
    OffsetAttribute offAtt = (OffsetAttribute) stream.addAttribute(OffsetAttribute.class);

    // Loop over all tokens, printing each term with its start and end offsets
    while (stream.incrementToken()) {
        System.out.println(termAtt.term() + " " + offAtt.startOffset() + " " + offAtt.endOffset());
    }

For detailed usage, see the sample code in the net.paoding.analysis.analyzer.estimate and net.paoding.analysis.examples packages.
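The loop in the example above fetches its attribute objects once, before iterating, and then reads them again after every `incrementToken()` call. The toy model below sketches why that works: the stream overwrites one shared holder in place for each token. The classes `ToyTermAttribute`, `ToyTokenStream`, and `ToyTokenStreamDemo` are hypothetical stand-ins written for this illustration; they are not Lucene classes and leave out everything else the real API does.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

/** Hypothetical stand-in for an attribute: one mutable holder reused for every token. */
class ToyTermAttribute {
    private String term = "";

    String term() {
        return term;
    }

    void setTerm(String term) {
        this.term = term;
    }
}

/** Hypothetical stand-in for a token stream; it is NOT Lucene's TokenStream. */
class ToyTokenStream {
    private final ToyTermAttribute termAtt = new ToyTermAttribute();
    private final Iterator<String> source;

    ToyTokenStream(List<String> tokens) {
        this.source = tokens.iterator();
    }

    /** Always returns the same holder, so callers can fetch it once before looping. */
    ToyTermAttribute termAttribute() {
        return termAtt;
    }

    /** Advances to the next token by overwriting the shared holder in place. */
    boolean incrementToken() {
        if (!source.hasNext()) {
            return false;
        }
        termAtt.setTerm(source.next());
        return true;
    }
}

class ToyTokenStreamDemo {
    /** Copies each term out of the holder before the next call overwrites it. */
    static List<String> collect(ToyTokenStream stream) {
        ToyTermAttribute termAtt = stream.termAttribute();
        List<String> out = new ArrayList<String>();
        while (stream.incrementToken()) {
            out.add(termAtt.term());
        }
        return out;
    }
}
```

The practical consequence is that any value needed after the loop advances has to be copied out first, as `collect` does, because the holder itself only ever describes the current token.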

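    import java.io.IOException;
    import java.io.Reader;
    import java.io.StringReader;
    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.Comparator;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    import net.paoding.analysis.analyzer.PaodingAnalyzer;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.tokenattributes.TermAttribute;
    import org.apache.lucene.analysis.tokenattributes.TypeAttribute;

    /*
     * param: the text to segment
     * Returns the distinct words of two or more characters, most frequent first.
     */
    public List<String> getname(String param) throws IOException {
        // Segment the text with the Paoding analyzer
        Analyzer analyzer = new PaodingAnalyzer();
        List<String> keys = new ArrayList<String>();
        TokenStream ts = null;

        try {
            Reader r = new StringReader(param);
            ts = analyzer.tokenStream("TestField", r);
            // addAttribute returns the attribute if the stream already carries it and
            // registers it otherwise; getAttribute throws when the attribute is missing
            TermAttribute termAtt = ts.addAttribute(TermAttribute.class);
            TypeAttribute typeAtt = ts.addAttribute(TypeAttribute.class);
            while (ts.incrementToken()) {
                if ("word".equals(typeAtt.type())) {
                    String key = termAtt.term();
                    if (key.length() >= 2) {
                        keys.add(key);
                    }
                }
            }
        } catch (IOException e) {
            e.printStackTrace();
        } finally {
            if (ts != null) {
                ts.close();
            }
        }

        // Count how many times each word occurs
        Map<String, Integer> keyMap = new HashMap<String, Integer>();
        for (String key : keys) {
            Integer count = keyMap.get(key);
            keyMap.put(key, count == null ? 1 : count + 1);
        }
        List<Map.Entry<String, Integer>> keyList = new ArrayList<Map.Entry<String, Integer>>(keyMap.entrySet());
        // Sort by count, highest first
        Collections.sort(keyList, new Comparator<Map.Entry<String, Integer>>() {
            public int compare(Map.Entry<String, Integer> o1, Map.Entry<String, Integer> o2) {
                return o2.getValue().compareTo(o1.getValue());
            }
        });
        // Collect the keywords in ranked order
        List<String> list = new ArrayList<String>();
        for (Map.Entry<String, Integer> entry : keyList) {
            // Read the key directly; splitting toString() on '=' breaks if a word contains '='
            list.add(entry.getKey());
            System.out.println("id:" + entry);
        }
        return list;
    }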
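The counting and sorting half of the method above does not depend on Lucene, so it can be tried in isolation. The following is a minimal sketch under that assumption: `KeywordRankSketch` and `rankByFrequency` are hypothetical names, and the input is a plain list of tokens that some analyzer is assumed to have already produced.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper: ranks already-segmented tokens by how often they occur. */
class KeywordRankSketch {
    static List<String> rankByFrequency(List<String> tokens) {
        // LinkedHashMap remembers first-seen order, which later decides ties
        Map<String, Integer> counts = new LinkedHashMap<String, Integer>();
        for (String token : tokens) {
            Integer count = counts.get(token);
            counts.put(token, count == null ? 1 : count + 1);
        }

        List<Map.Entry<String, Integer>> entries =
                new ArrayList<Map.Entry<String, Integer>>(counts.entrySet());
        // Collections.sort is stable, so equal counts keep their first-seen order
        Collections.sort(entries, new Comparator<Map.Entry<String, Integer>>() {
            public int compare(Map.Entry<String, Integer> a, Map.Entry<String, Integer> b) {
                return b.getValue().compareTo(a.getValue());
            }
        });

        List<String> ranked = new ArrayList<String>();
        for (Map.Entry<String, Integer> entry : entries) {
            ranked.add(entry.getKey());
        }
        return ranked;
    }
}
```

Given the tokens 分词, 中文, 分词, 词典, 中文, 分词, the sketch returns 分词, 中文, 词典. Using a `LinkedHashMap` instead of a `HashMap` is a deliberate choice here: with a plain `HashMap` the relative order of words that occur equally often is unspecified.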

Attachments:

中文分词.rar (1.2 MB)

lucene3.0.rar (942.3 KB)

lucene2.9--2.4.rar (1.7 MB)


Posted: 2011-08-26 09:21
Category: Programming languages
