Carrot2 Clustering Algorithm: A Brief Overview

wangwanbao


I. Test environment: Carrot2
Input data type: array
Input values:

// Each entry is { title, body }; the trailing number is the document index.
String[][] documents = new String[][] {
    { "Introduction yourSelf", "上海" }, // 0
    { "KD Nuggets", "中国上海" }, // 1
    { "The Data Mine", "上海" }, // 2
    { "DMG", "上海浦东" }, // 3
    { "Two Crows: Data mining glossary", "中国上海" }, // 4
    { "中国上海浦东",
      "http://www-db.stanford.edu/~ullman/mining/mining.html" }, // 5
    { "中国上海", "http://www.thearling.com/" }, // 6
    { "中国上海浦西",
      "http://www.eco.utexas.edu/~norman/BUS.FOR/course.mat/Alex" }, // 7
    { "CCSU - Data Mining", "中国上海浦东" }, // 8
    { "Data Mining: Practical Machine Learning Tools and Techniques",
      "中国上海浦西" }, // 9
    { "Data Mining - Monografias.com", "中国上海" } // 10
};

Output:

Clusters:

CL-1 中国上海 (4 documents) 
     1: dummy://6 
     2: dummy://10 
     3: dummy://1 
     4: dummy://4 
 
CL-2 Data Mining (4 documents) 
     1: dummy://2 
     2: dummy://10 
     3: dummy://8 
     4: dummy://4 
 
CL-3 中国上海浦西 (2 documents) 
     1: dummy://7 
     2: dummy://9

II. Main classes and methods the algorithm passes through

LocalControllerBase.query → LocalProcessBase.query → LocalInputComponentBase.endProcessing() → LingoLocalFilterComponent → MultilingualClusteringContext.cluster() → MultilingualClusteringContext.extractFeatures() → LsiClusteringStrategy.cluster()

III. Tracing the algorithm
Part 1, feature extraction. Goal: obtain Feature[] singleTerms
1) Join all documents into a single string using specific separators, as follows:

introduct yourself .  上海  | KD nugget .  中国上海  | the data mine .  上海  | 
DMG .  上海浦东  | two crow data mine glossari .  中国上海  |  中国上海浦东  .
ullman mine |  中国上海  . |  中国上海浦西  . norman | CCSU data mine .  中国
上海浦东  | data mine practic machin learn tool and    techniqu .  中国上海浦西
| data mine monografias.com .  中国上海

2) Record a code for every distinct word, as follows:

Map<String, Integer> words = new HashMap<String, Integer>();
{and=23, 中国上海=5, glossari=13, norman=17, techniqu=24,
monografias.com=25, 上海浦东=10, crow=12, nugget=4, tool=22, ccsu=18, kd=3,
ullman=15, the=6, practic=19, dmg=9, machin=20, yourself=1, two=11, data=7,
introduct=0, 中国上海浦东=14, mine=8, learn=21, 上海=2, 中国上海浦西=16}
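
The code assignment above can be reproduced with a small sketch. It assumes codes are handed out in order of first appearance and that map keys are lowercased, which is what the dump suggests (kd=3, dmg=9); the class and method names are hypothetical and are not Carrot2 API.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

public class WordCodes {
    /** Assigns consecutive codes to distinct words in order of first appearance. */
    static Map<String, Integer> assignCodes(String[] tokens) {
        Map<String, Integer> codes = new LinkedHashMap<>();
        for (String token : tokens) {
            // "." and "|" are separators, not words, so they get no code here.
            if (token.equals(".") || token.equals("|")) {
                continue;
            }
            // Assumption: keys are lowercased, as in the dump above.
            String key = token.toLowerCase(Locale.ROOT);
            if (!codes.containsKey(key)) {
                codes.put(key, codes.size());
            }
        }
        return codes;
    }

    public static void main(String[] args) {
        String text = "introduct yourself . 上海 | KD nugget . 中国上海 | the data mine . 上海";
        System.out.println(assignCodes(text.split("\\s+")));
    }
}
```

For the first three documents this yields introduct=0, yourself=1, 上海=2, kd=3, nugget=4, 中国上海=5, the=6, data=7, mine=8, matching the codes listed above.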

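Wrap the keywords, recording how often and where each one occurs.
Pseudocode:
     for each word {
              record the word's position, its code, and the id of its document
     }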

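The runtime variable values are as follows:

// 63 entries in total, covering 26 distinct words; -1 marks the end, and the
// large integers represent punctuation marks or the "|" separator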
intData[63] = [0, 1, 2147483647, 2, 2147483646, 3, 4, 2147483645, 5, 
2147483644, 6, 7, 8, 2147483643, 2, 2147483642, 9, 2147483641, 10, 
2147483640, 11, 12, 7, 8, 13, 2147483639, 5, 2147483638, 14, 2147483637, 
15, 8, 2147483636, 5, 2147483635, 2147483634, 16, 2147483633, 
17,2147483632, 18, 7, 8, 2147483631, 14, 2147483630, 7, 8, 19, 20, 21, 22, 
23, 24, 2147483629, 16, 2147483628, 7, 8, 25, 2147483627, 5, -1]   
// position of each entry within the string
wordPositions[63] = [0, 10, 19, 21, 24, 26, 29, 36, 38, 43, 45, 49, 54, 59, 61, 
64, 66, 70, 72, 77, 79, 83, 88, 93, 98, 107, 109, 114, 116, 123, 125, 132, 137, 
139, 144, 146, 148, 155, 157, 164, 166, 171, 176, 181, 183, 190, 192, 197, 
202, 210, 217, 223, 228, 232, 241, 243, 250, 252, 257, 262, 278, 280, 285] 
// the document in which each entry occurs
 documentIndices[62] = [0, 0, 0, 0, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 
4, 4, 4, 4, 4, 5, 5, 5, 5, 5, 6, 6, 6, 7, 7, 7, 7, 8, 8, 8, 8, 8, 8, 9, 9, 9, 9, 9, 9, 9, 9, 9, 
9, 9, 10, 10, 10, 10, 10, 10]
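
A hedged sketch of how arrays like intData and documentIndices could be produced from the token stream. It assumes that every "." and "|" receives its own sentinel, counting down from Integer.MAX_VALUE, and that a "|" entry is attributed to the document that follows it; both assumptions are read off the values above, not taken from the Carrot2 source. TokenEncoder and Encoded are hypothetical names.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

public class TokenEncoder {
    /** Encoded token stream plus the document index of every entry. */
    static final class Encoded {
        final int[] intData;
        final int[] documentIndices;

        Encoded(int[] intData, int[] documentIndices) {
            this.intData = intData;
            this.documentIndices = documentIndices;
        }
    }

    static Encoded encode(String[] tokens) {
        Map<String, Integer> codes = new HashMap<>();
        List<Integer> data = new ArrayList<>();
        List<Integer> docs = new ArrayList<>();
        int nextSeparator = Integer.MAX_VALUE; // sentinels count down from here
        int doc = 0;
        for (String token : tokens) {
            if (token.equals("|")) {
                doc++; // assumption: "|" is counted under the following document
                data.add(nextSeparator--);
            } else if (token.equals(".")) {
                data.add(nextSeparator--);
            } else {
                String key = token.toLowerCase(Locale.ROOT);
                Integer code = codes.get(key);
                if (code == null) {
                    code = codes.size();
                    codes.put(key, code);
                }
                data.add(code);
            }
            docs.add(doc);
        }
        int[] intData = new int[data.size() + 1];
        int[] documentIndices = new int[docs.size()];
        for (int i = 0; i < data.size(); i++) {
            intData[i] = data.get(i);
            documentIndices[i] = docs.get(i);
        }
        intData[data.size()] = -1; // end-of-data marker, has no document index
        return new Encoded(intData, documentIndices);
    }

    public static void main(String[] args) {
        Encoded e = encode("introduct yourself . 上海 | KD nugget . 中国上海".split("\\s+"));
        System.out.println(java.util.Arrays.toString(e.intData));
        System.out.println(java.util.Arrays.toString(e.documentIndices));
    }
}
```

Run on the first two documents, this reproduces the leading entries of both arrays shown above, including the one-entry difference in length caused by the -1 terminator.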

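3) Record the documents in which each word occurs. The values below are obtained by traversing intData[63]; there is one row per word and one column per document. For example, introduct=0 is word 0 and occurs in document 0.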

[[1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [1, 0, 1, 0, 0, 0, 0, 0, 
0, 0, 0], [0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 1, 0, 0, 1, 
0, 1, 0, 0, 0, 1], [0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 1, 0, 1, 0, 0, 0, 1, 1, 1], [0, 0, 
1, 0, 1, 1, 0, 0, 1, 1, 1], [0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0], 
[0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 1, 0, 0, 0, 0, 
0, 0], [0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0], [0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 
0, 1, 0, 1, 0], [0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0], [0, 0, 0, 
0, 0, 0, 0, 0, 0, 1, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0], 
[0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 
1, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1]]
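
The traversal that fills this term-by-document table might look like the sketch below. It assumes the cells hold occurrence counts (in this data set every count happens to be 0 or 1, so presence flags would give the same table) and that separator sentinels are skipped because they lie outside the word-code range. The class name is hypothetical.

```java
import java.util.Arrays;

public class TermDocumentCounts {
    /** Counts, for every word code, how often it occurs in each document. */
    static int[][] count(int[] intData, int[] documentIndices, int termCount, int docCount) {
        int[][] counts = new int[termCount][docCount];
        // documentIndices is one shorter than intData, so the -1 terminator is never read.
        for (int i = 0; i < documentIndices.length; i++) {
            int code = intData[i];
            // Separator sentinels are huge integers, far outside [0, termCount).
            if (code >= 0 && code < termCount) {
                counts[code][documentIndices[i]]++;
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        int max = Integer.MAX_VALUE;
        int[] intData = {0, 1, max, 2, max - 1, 3, 4, max - 2, 5, -1};
        int[] documentIndices = {0, 0, 0, 0, 1, 1, 1, 1, 1};
        System.out.println(Arrays.deepToString(count(intData, documentIndices, 6, 2)));
    }
}
```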

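4) Populate Feature[] singleTerms = new Feature[termCount]; // holds the attributes of each word,
26 words in total; tf is the number of occurrences across all documents. The result is obtained by traversing all of intData[63].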

[introduct tf=1 tfidf=0.00 ST null, yourself tf=1 tfidf=0.00 ST null,  上海  tf=2 
tfidf=0.00 null, KD tf=1 tfidf=0.00 ST null, nugget tf=1 tfidf=0.00 ST null,  中国
上海  tf=4 tfidf=0.00 ST null, the tf=1 tfidf=0.00 ST null, data tf=5 tfidf=0.00 
ST null, mine tf=6 tfidf=0.00 ST null, DMG tf=1 tfidf=0.00 ST null,  上海浦东 
tf=1 tfidf=0.00 null, two tf=1 tfidf=0.00 ST null, crow tf=1 tfidf=0.00 ST null, 
glossari tf=1 tfidf=0.00 ST null,  中国上海浦东  tf=2 tfidf=0.00 ST null, ullman 
tf=1 tfidf=0.00 null,  中国上海浦西  tf=2 tfidf=0.00 ST null, norman tf=1 
tfidf=0.00 null, CCSU tf=1 tfidf=0.00 ST null, practic tf=1 tfidf=0.00 ST null, 
machin tf=1 tfidf=0.00 ST null, learn tf=1 tfidf=0.00 ST null, tool tf=1 
tfidf=0.00 ST null, and tf=1 tfidf=0.00 ST null, techniqu tf=1 tfidf=0.00 ST null, 
monografias.com tf=1 tfidf=0.00 ST null]

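Next, compute tfidf and check whether each word is a strong word (one that appeared in some title); if it is, its weight is increased. The recomputed result is: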

[introduct tf=2 tfidf=4.80 ST null, yourself tf=2 tfidf=4.80 ST null,  上海  tf=2 
tfidf=3.41 null, KD tf=2 tfidf=4.80 ST null, nugget tf=2 tfidf=4.80 ST null,  中国
上海  tf=10 tfidf=10.12 ST null, the tf=2 tfidf=4.80 ST null, data tf=12 
tfidf=9.46 ST null, mine tf=15 tfidf=9.09 ST null, DMG tf=2 tfidf=4.80 ST null, 
上海浦东  tf=1 tfidf=2.40 null, two tf=2 tfidf=4.80 ST null, crow tf=2 
tfidf=4.80 ST null, glossari tf=2 tfidf=4.80 ST null,  中国上海浦东  tf=5 
tfidf=8.52 ST null, ullman tf=1 tfidf=2.40 null,  中国上海浦西  tf=5 tfidf=8.52 
ST null, norman tf=1 tfidf=2.40 null, CCSU tf=2 tfidf=4.80 ST null, practic tf=2 tfidf=4.80 ST null, machin tf=2 tfidf=4.80 ST null, learn tf=2 tfidf=4.80   
ST null, tool tf=2 tfidf=4.80 ST null, and tf=2 tfidf=4.80 ST null, techniqu tf=2 
tfidf=4.80 ST null, monografias.com tf=2 tfidf=4.80 ST null]
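
The numbers in the two dumps are consistent with a simple rule, sketched below under stated assumptions: the tf of a strong word is multiplied by 2.5 and truncated to an integer, idf is ln(11 / raw occurrence count), and tfidf is the weighted tf times idf. The factor 2.5 is inferred from the values in this post (for example 4 becomes 10 and 5 becomes 12); it is not taken from the Carrot2 source, and the class below is hypothetical.

```java
import java.util.Locale;

public class TfIdfSketch {
    // Assumption: boost inferred from the dumps above, not a documented constant.
    static final double STRONG_WORD_WEIGHT = 2.5;

    /** Boosts tf for strong words, truncating to an integer as the dumps suggest. */
    static int weightedTf(int rawTf, boolean strong) {
        return strong ? (int) (rawTf * STRONG_WORD_WEIGHT) : rawTf;
    }

    /** idf as described for Feature.idf: ln(documentCount / raw occurrence count). */
    static double idf(int documentCount, int rawTf) {
        return Math.log((double) documentCount / rawTf);
    }

    static double tfidf(int documentCount, int rawTf, boolean strong) {
        return weightedTf(rawTf, strong) * idf(documentCount, rawTf);
    }

    public static void main(String[] args) {
        // "中国上海": raw tf=4, strong
        System.out.println(String.format(Locale.ROOT, "%.2f", tfidf(11, 4, true)));
        // "上海": raw tf=2, not strong
        System.out.println(String.format(Locale.ROOT, "%.2f", tfidf(11, 2, false)));
    }
}
```

With 11 documents this gives 10.12 for 中国上海 and 3.41 for 上海, in agreement with the listing above.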

The last step for singleTerms: set the language and restore the display form of each word, for example turning the stem glossari back into Glossary.

[Introduction tf=2 tfidf=4.80 ST en, yourself tf=2 tfidf=4.80 SW ST en,  上海 
tf=2 tfidf=3.41 en, KD tf=2 tfidf=4.80 ST en, Nuggets tf=2 tfidf=4.80 ST en,  中
国上海  tf=10 tfidf=10.12 ST en, the tf=2 tfidf=4.80 SW ST en, Data tf=12 
tfidf=9.46 ST en, Mining tf=15 tfidf=9.09 ST en, DMG tf=2 tfidf=4.80 ST en,  上
海浦东  tf=1 tfidf=2.40 en, two tf=2 tfidf=4.80 SW ST en, Crows tf=2 
tfidf=4.80 ST en, Glossary tf=2 tfidf=4.80 ST en,  中国上海浦东  tf=5 
tfidf=8.52 ST en, Ullman tf=1 tfidf=2.40 en,  中国上海浦西  tf=5 tfidf=8.52 ST 
en, Norman tf=1 tfidf=2.40 en, CCSU tf=2 tfidf=4.80 ST en, Practical tf=2 
tfidf=4.80 ST en, Machine tf=2 tfidf=4.80 ST en, Learning tf=2 tfidf=4.80 ST en, 
Tools tf=2 tfidf=4.80 ST en, and tf=2 tfidf=4.80 SW ST en, Techniques tf=2 
tfidf=4.80 ST en, Monografias.com tf=2 tfidf=4.80 ST en]

At this point it is worth listing the fields of Feature and what they mean:

/** A unique integer code of this feature */
private int code;

/** String representation of this feature */
private String text;

/** ISO code of the language to which Lingo believes this feature belongs */
private String language;

/** The number of occurrences of this feature in all input snippets */
private int tf;

/**
 * Inverse document frequency, set as:
 * singleTerms[i].setIdf(Math.log((double) 11 / n));
 * where 11 is the number of documents and n is the number of occurrences
 * of the current word across all documents.
 */
private double idf;

/** Length (in words) of this feature */
private int length;

/** True if this feature is a stop word */
private boolean stopWord;

/** True if this feature is among query words */
private boolean queryWord;

/** True if this feature appeared in some snippet's title */
private boolean strong;

/**
 * Used for phrase features only. An array pointing to features
 * representing the phrase's individual words.
 */
private int[] phraseFeatureIndices;

/**
 * Indices of the documents in which this feature occurs; for the word
 * "上海", for example, the value is [0, 2].
 */
private int[] snippetIndices;

/**
 * This word's row of the term-document array, e.g.
 * [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], showing which documents the word
 * occurs in.
 */
private int[] snippetTf;

Part 2: Clustering
1) Sort: non-stop words first, then by tf in descending order

[Mining tf=15 tfidf=9.09 ST en, Data tf=12 tfidf=9.46 ST en,  中国上海  tf=10 
tfidf=10.12 ST en,  中国上海浦东  tf=5 tfidf=8.52 ST en,  中国上海 
浦西  tf=5 tfidf=8.52 ST en, CCSU tf=2 tfidf=4.80 ST en, Crows tf=2 tfidf=4.80 
ST en, DMG tf=2 tfidf=4.80 ST en, Glossary tf=2 tfidf=4.80 ST en, Introduction 
tf=2 tfidf=4.80 ST en, KD tf=2 tfidf=4.80 ST en, Learning tf=2 tfidf=4.80 ST en, 
Machine tf=2 tfidf=4.80 ST en, Monografias.com tf=2 tfidf=4.80 ST en, 
Nuggets tf=2 tfidf=4.80 ST en, Practical tf=2 tfidf=4.80 ST en, Techniques tf=2 
tfidf=4.80 ST en, Tools tf=2 tfidf=4.80 ST en,  上海  tf=2 tfidf=3.41 en, 
Norman tf=1 tfidf=2.40 en, Ullman tf=1 tfidf=2.40 en,  上海浦东  tf=1 
tfidf=2.40 en, and tf=2 tfidf=4.80 SW ST en, the tf=2 tfidf=4.80 SW ST en, two 
tf=2 tfidf=4.80 SW ST en, yourself tf=2 tfidf=4.80 SW ST en]
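
An ordering like the one above could be expressed with a comparator, as in this sketch. The first two keys (stop words last, then tf descending) follow the description; the third key, case-insensitive text order for ties, is only an assumption that happens to match the listing. Term is a hypothetical stand-in for Feature.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class TermSort {
    /** Minimal stand-in for Feature; only the fields the sort needs. */
    static final class Term {
        final String text;
        final int tf;
        final boolean stopWord;

        Term(String text, int tf, boolean stopWord) {
            this.text = text;
            this.tf = tf;
            this.stopWord = stopWord;
        }

        @Override
        public String toString() {
            return text;
        }
    }

    // false sorts before true, so non-stop words come first.
    static final Comparator<Term> ORDER =
        Comparator.comparing((Term t) -> t.stopWord)
            .thenComparing(Comparator.comparingInt((Term t) -> t.tf).reversed())
            // Assumption: ties are broken by case-insensitive text order.
            .thenComparing(t -> t.text, String.CASE_INSENSITIVE_ORDER);

    static List<Term> sorted(List<Term> terms) {
        List<Term> copy = new ArrayList<>(terms);
        copy.sort(ORDER);
        return copy;
    }

    public static void main(String[] args) {
        List<Term> terms = Arrays.asList(
            new Term("the", 2, true),
            new Term("上海", 2, false),
            new Term("Mining", 15, false),
            new Term("Data", 12, false),
            new Term("and", 2, true));
        System.out.println(sorted(terms));
    }
}
```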

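2) Some of the 26 keywords above are stop words; the matrix is built after removing them.

The matrix covers non-stop words only. The two-dimensional array here has the dimensions: double[][] tdMatrix = new double[19][11] (11 columns for the 11 documents; the 19 rows are presumably the 22 non-stop words minus the three whose tf is 1).

Matrix-building rule: tdMatrix[term][snippetIndices[doc]] =
features[term].getSnippetTf()[snippetIndices[doc]] * features[term].getIdf();

tdMatrix = new Matrix(matrix); // Matrix comes from Jama-1.0.2, a jar library for matrix computation (see TfidfTdMatrixBuildingStrategy.buildTdMatrix)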
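
The building rule can be illustrated without Jama by filling a plain double[][]. This sketch loops over every document column instead of only the indices in snippetIndices; the result is the same because a term's count is zero in documents where it does not occur. The class name and the sample values are made up for illustration.

```java
import java.util.Arrays;

public class TdMatrixSketch {
    /** Builds the term-document matrix: cell = per-document tf of the term * idf of the term. */
    static double[][] build(int[][] snippetTf, double[] idf) {
        double[][] matrix = new double[snippetTf.length][];
        for (int term = 0; term < snippetTf.length; term++) {
            int docCount = snippetTf[term].length;
            matrix[term] = new double[docCount];
            for (int doc = 0; doc < docCount; doc++) {
                matrix[term][doc] = snippetTf[term][doc] * idf[term];
            }
        }
        return matrix;
    }

    public static void main(String[] args) {
        int[][] snippetTf = {{1, 0, 1}, {0, 1, 0}}; // two terms, three documents
        double[] idf = {0.5, 2.0};                  // illustrative idf values
        System.out.println(Arrays.deepToString(build(snippetTf, idf)));
    }
}
```

The resulting array is what would be passed to the Jama Matrix(double[][]) constructor in the line above.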

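3) A good deal of complex computation is then carried out on the matrix; if you are interested, see the source: LsiClusteringStrategy.createClusters(). Setting the matrix internals aside, this data is already enough for us to write our own algorithm that works out which keyword, or set of keywords, occurs in which documents, and so achieve a clustering effect.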
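
As an illustration of that last remark, here is a deliberately naive sketch that groups documents by shared keyword. It is not what Lingo or LsiClusteringStrategy does (there is no matrix decomposition and no label selection); it only builds an inverted index and keeps keywords found in at least minDocs documents. All names are hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class KeywordGrouping {
    /** Maps each keyword to the documents containing it, dropping rare keywords. */
    static Map<String, List<Integer>> group(String[][] tokenizedDocs, int minDocs) {
        Map<String, List<Integer>> index = new LinkedHashMap<>();
        for (int doc = 0; doc < tokenizedDocs.length; doc++) {
            for (String token : tokenizedDocs[doc]) {
                List<Integer> docs = index.computeIfAbsent(token, k -> new ArrayList<>());
                // Record each document at most once per keyword.
                if (docs.isEmpty() || docs.get(docs.size() - 1) != doc) {
                    docs.add(doc);
                }
            }
        }
        // Keywords seen in too few documents do not form a group.
        index.values().removeIf(docs -> docs.size() < minDocs);
        return index;
    }

    public static void main(String[] args) {
        String[][] docs = {
            {"kd", "nugget", "中国上海"},
            {"the", "data", "mine", "上海"},
            {"data", "mine", "中国上海"}
        };
        System.out.println(group(docs, 2));
    }
}
```

A real implementation would still need to merge overlapping groups (here data and mine cover the same documents) and pick readable labels, which is the part the matrix computation handles in Carrot2.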

2009-05-25 09:41
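#5 yajie 2009-05-31

wangwanbao wrote:

Based on two considerations:
1. Efficiency
2. No need to modify Carrot's source code.
In my application I segment the text first and store the result in Lucene; at clustering time I take the segmented tokens out, separate them with spaces, and hand them to Carrot for processing.

I asked on the mailing list last time, but nobody there answered.
Frustrating.
Looks like I will have to find another way.

#4 wangwanbao 2009-05-31

Based on two considerations:
1. Efficiency
2. No need to modify Carrot's source code.
In my application I segment the text first and store the result in Lucene; at clustering time I take the segmented tokens out, separate them with spaces, and hand them to Carrot for processing.

#3 yajie 2009-05-27

So it tokenizes with WhitespaceAnalyzer. That is workable for English, but for Chinese it is even more of a stopgap. Is there a way to use Chinese word segmentation in carrot2?

#2 wangwanbao 2009-05-27

Carrot2's tokenizer is an English tokenizer that splits on spaces.
It treats each space-separated word in a sentence as one element. So if you want to cluster with it, segment the text first and separate the words with spaces before clustering.
The exception, of course, is if you replace its tokenizer.

#1 yajie 2009-05-27

How did you implement the word segmentation here?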
