Dirichlet clustering starts with a data set of points and a ModelDistribution. Think of ModelDistribution as a class that generates different models. The algorithm creates an empty model and tries to assign points to it. As each point is assigned, the model crudely grows or shrinks its parameters to try to fit the data. Once this has been done for all points, the algorithm re-estimates the model's parameters precisely, using all the points weighted by each point's partial probability of belonging to the model.
At the end of each pass, you get a number of samples that contain the probabilities, the models, and the assignment of points to models. These samples can be regarded as clusters, and they provide information about the models and their parameters, such as their shape and size. Moreover, by examining how many models in each sample have points assigned to them, you can estimate how many models (clusters) the data supports. And by examining how often two points are assigned to the same model, you get an approximate measure of how likely those points are to be explained by the same model. Such soft-membership information is a natural by-product of model-based clustering: Dirichlet clustering captures the partial probabilities of points belonging to the various models.
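The soft-assignment and re-estimation loop described above can be sketched in a few lines. Note that this is not Mahout's implementation (which samples models from a Dirichlet process using Gibbs sampling); it is a minimal one-dimensional Gaussian mixture sketch, closer to plain EM, whose function names (`soft_assign`, `reestimate`) are illustrative inventions. It only shows how partial membership probabilities feed the precise re-estimation step:

```python
import math

def gaussian_pdf(x, mean, std):
    """Density of a 1-D Gaussian; stands in for a model's pdf method."""
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2 * math.pi))

def soft_assign(point, models):
    """Partial probability of the point under each (weight, mean, std) model."""
    likelihoods = [w * gaussian_pdf(point, m, s) for (w, m, s) in models]
    total = sum(likelihoods)
    return [lk / total for lk in likelihoods]

def reestimate(points, models):
    """One pass: soft-assign every point, then refit each model's mixing
    weight and mean from the partially weighted points (std held fixed)."""
    resp = [soft_assign(p, models) for p in points]
    new_models = []
    for k, (_, _, std) in enumerate(models):
        weights = [r[k] for r in resp]
        total = sum(weights)
        mean = sum(w * p for w, p in zip(weights, points)) / total
        new_models.append((total / len(points), mean, std))
    return new_models

# Two obvious groups near 1.0 and 5.0; start the models deliberately off-target.
points = [1.0, 1.2, 0.8, 5.0, 5.3, 4.9]
models = [(0.5, 0.0, 1.0), (0.5, 6.0, 1.0)]
for _ in range(10):
    models = reestimate(points, models)
```

After a few passes the model means settle near the two groups, and each point's `soft_assign` vector is the partial-membership information the text describes.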
Dirichlet clustering is a powerful way of getting quality clusters using known data distribution models. In Mahout, the algorithm is a pluggable framework, so different models can be created and tested. As the models become more complex, there's a chance of things slowing down on huge data sets, and at that point you'll have to fall back on other clustering algorithms. But after seeing the output of Dirichlet clustering, you can clearly decide whether the algorithm you choose should be fuzzy or rigid, overlapping or hierarchical, whether the distance measure should be Manhattan or cosine, and what the threshold for convergence should be. Dirichlet clustering is both a data-understanding tool and a great data clustering tool.
You can run Dirichlet clustering over the Reuters TF-IDF vectors from the command line:

bin/mahout dirichlet -i mahout/reuters-vectors/tfidf-vectors -o mahout/reuters-dirichlet-clusters -k 60 -x 10 -a0 1.0 -md org.apache.mahout.clustering.dirichlet.models.GaussianClusterDistribution -mp org.apache.mahout.math.SequentialAccessSparseVector