Mahout: K-means clustering

ylzhj02

浏览: 248790 次
性别:
来自: 成都

最近访客更多访客>>

daqin

bbpopeye

也许on

learnmore

博主相关

博客

微博

相册

留言

关于我

文章分类

社区版块

存档分类

博客分类：

Mahout

K-means Algorithm

The k-means algorithm will start with an initial set of k centroid points. The algorithm does multiple rounds of processing and refines the centroid locations until the iteration max-limit criterion is reached or until the centroids converge to a fixed point from which they don’t move very much.

K-generation

centroids are randomly generated using RandomSeedGenerator

For good quality clustering using k-means, you’ll need to estimate a value for k. An approximate way of estimating k is to figure it out based on the data you have and the size of clusters you need. In the preceding example, where we have about a million news articles, if there are an average of 500 news articles published about every unique story, you should start your clustering with a k value of about 2,000 (1,000,000/500).

Finding the perfect k using canopy clustering

Canopy clustering’s strength lies in its ability to create clusters extremely quickly—it can do this with a single pass over the data. But its strength is also its weakness. This algorithm may not give accurate and precise clusters. But it can give the optimal number of clusters without even specifying the number of clusters, k, as required by k-means.

The algorithm uses a fast distance measure and two distance thresholds, T1 and T2, with T1 > T2. It begins with a data set of points and an empty list of canopies, and then iterates over the data set, creating canopies in the process. During each iteration, it removes a point from the data set and adds a canopy to the list with that point as the center. It loops through the rest of the points one by one. For each one, it calculates the distances to all the canopy centers in the list. If the distance between the point and any canopy center is within T1, it’s added into that canopy. If the distance is within T2, it’s removed from the list and thereby prevented from forming a new canopy in the subsequent loops. It repeats this process until the list is empty.
This approach prevents all points close to an already existing canopy (distance < T2) from being the center of a new canopy. It’s detrimental to form another redundant canopy in close proximity to an existing one.

mahout canopy 
-i reuters-vectors/tfidf-vectors \
-o reuters-canopy-centroids \
-dm org.apache.mahout.common.distance.EuclideanDistanceMeasure \
-t1 1500 -t2 2000

mahout kmeans 
-i mahout/reuters-vectors/tfidf-vectors 
-o mahout/reuters-kmeans-clusters 
-dm org.apache.mahout.common.distance.TanimotoDistanceMeasure 
-c mahout/reuters-canopy-centroids/clusters-0-final 
-cd 0.1 
-ow 
-x 20 
-cl

 mahout clusterdump 
-i mahout/reuters-kmeans-clusters/clusters-1-final 
-o mahout/reuters-kmeans-clusters-dump 
-dt sequencefile 
-d mahout/reuters-vectors/dictionary.file-* 
-p mahout/reuter-kmeans-clusters/clusteredPoints
-b 10 
-n 10

mahout clusterdump usage

Note: you must add -cl in kmean command otherwise there has no clusteredPoints dir. See

http://lucene.472066.n3.nabble.com/where-are-the-points-in-each-cluster-kmeans-clusterdump-tc838683.html#none

查看图片附件

分享到：

Mahout: An overview of clustering techni ... | Mahout: Clustering - Representing data

2014-06-11 16:06
浏览 819
评论(0)
分类:开源软件
查看更多

发表评论

您还没有登录,请您登录后再发表评论

最近访客更多访客>>

博主相关

文章分类

社区版块

存档分类

最新评论