
Mahout: Clustering - Representing data

 

Transforming data into vectors

In Mahout, vectors are implemented as three different classes:

  • DenseVector can be thought of as an array of doubles, whose size is the number of features in the data. Because all the entries in the array are preallocated regardless of whether the value is 0 or not, we call it dense.
  • RandomAccessSparseVector is implemented as a HashMap from an integer to a double, where only nonzero-valued features are allocated. Hence, it's called a sparse vector.
  • SequentialAccessSparseVector is implemented as two parallel arrays, one of integers and the other of doubles. Only nonzero-valued entries are kept in it. Unlike the RandomAccessSparseVector, which is optimized for random access, this one is optimized for linear reading. A short construction sketch for all three classes follows this list.
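As a concrete illustration, here is a minimal sketch that builds the same three-dimensional vector with each implementation. It assumes Mahout's math module (mahout-math) is on the classpath; the feature values are arbitrary.

import org.apache.mahout.math.DenseVector;
import org.apache.mahout.math.RandomAccessSparseVector;
import org.apache.mahout.math.SequentialAccessSparseVector;
import org.apache.mahout.math.Vector;

public class VectorExamples {
  public static void main(String[] args) {
    // Dense: all 3 dimensions are stored, zeros included.
    Vector dense = new DenseVector(new double[] {1.1, 0.0, 570.0});

    // Random-access sparse: backed by a hash map, only nonzero entries stored.
    Vector randomSparse = new RandomAccessSparseVector(3);
    randomSparse.set(0, 1.1);
    randomSparse.set(2, 570.0);

    // Sequential-access sparse: parallel index/value arrays, best for in-order reads.
    Vector seqSparse = new SequentialAccessSparseVector(3);
    seqSparse.set(0, 1.1);
    seqSparse.set(2, 570.0);

    System.out.println(dense + "\n" + randomSparse + "\n" + seqSparse);
  }
}

RandomAccessSparseVector is the natural choice while a vector is being filled in arbitrary order; SequentialAccessSparseVector pays off when an algorithm reads the vector front to back many times, as the bullet above notes.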



One possible problem with our chosen mapping of features to dimension values is that the values in dimension 1 are much larger than the others. If we applied a simple distance-based metric to determine similarity between these vectors, color differences would dominate the results: a relatively small color difference of 10 nm would count the same as a huge size difference of 10. Weighting the different dimensions solves this problem.
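One simple way to apply such weights, sketched below, is to scale each dimension before measuring distance. The sketch assumes mahout-core's EuclideanDistanceMeasure is available; the feature values and weights are made up for illustration.

import org.apache.mahout.common.distance.EuclideanDistanceMeasure;
import org.apache.mahout.math.DenseVector;
import org.apache.mahout.math.Vector;

public class WeightedDistanceExample {
  public static void main(String[] args) {
    // Dimension 0: weight (kg), dimension 1: color (nm), dimension 2: size.
    Vector apple1 = new DenseVector(new double[] {0.11, 510.0, 1.0});
    Vector apple2 = new DenseVector(new double[] {0.23, 650.0, 3.0});

    // Unweighted: the 140 nm color gap dominates the distance.
    EuclideanDistanceMeasure measure = new EuclideanDistanceMeasure();
    System.out.println("unweighted: " + measure.distance(apple1, apple2));

    // Scale each dimension so the ranges become comparable (weights chosen for illustration).
    Vector weights = new DenseVector(new double[] {10.0, 0.01, 1.0});
    System.out.println("weighted:   "
        + measure.distance(apple1.times(weights), apple2.times(weights)));
  }
}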

 

 

Representing text documents as vectors

The vector space model (VSM) is the common way of vectorizing text documents. First, imagine the set of all words that could be encountered in a series of documents being vectorized. This set might be all words that appear at least once in any of the documents. Imagine each word being assigned a number, which is the dimension it’ll occupy in document vectors.

  • Term frequency (TF): the value of the vector dimension for a word is usually the number of occurrences of the word in the document. This is known as term frequency (TF) weighting.
  • Term frequency–inverse document frequency (TF-IDF): TF-IDF weighting is a widely used improvement on simple term-frequency weighting. The IDF part is the improvement; instead of simply using term frequency as the value in the vector, this value is multiplied by the inverse of the term's document frequency. That is, the value is reduced more for words used frequently across all the documents in the dataset than for infrequently used words. A worked sketch of this weighting follows the list.
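To make the idea concrete, the following sketch computes the classic TF-IDF weight, tf * log(N / df), over a toy, already-tokenized collection. This is plain Java for illustration only; Mahout computes TF-IDF weights internally during vectorization, and its exact formula may differ in detail.

import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;

public class TfIdfSketch {
  public static void main(String[] args) {
    // Toy collection: each "document" is already tokenized and lowercased.
    List<List<String>> docs = Arrays.asList(
        Arrays.asList("coca", "cola", "is", "a", "cola"),
        Arrays.asList("pepsi", "is", "a", "cola"),
        Arrays.asList("water", "is", "not", "a", "cola"));

    // Document frequency: in how many documents does each word appear at least once?
    Map<String, Integer> df = new HashMap<>();
    for (List<String> doc : docs) {
      for (String word : new HashSet<>(doc)) {
        df.merge(word, 1, Integer::sum);
      }
    }

    // TF-IDF for the first document: tf * log(N / df).
    // "cola" occurs in every document, so its IDF (and hence its weight) drops to zero.
    List<String> doc0 = docs.get(0);
    int numDocs = docs.size();
    for (String word : new HashSet<>(doc0)) {
      long tf = doc0.stream().filter(word::equals).count();
      double tfidf = tf * Math.log((double) numDocs / df.get(word));
      System.out.printf("%-5s tf=%d df=%d tfidf=%.3f%n", word, tf, df.get(word), tfidf);
    }
  }
}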



 
The basic assumption of the vector space model (VSM) is that the words are dimensions and therefore are orthogonal to each other. In other words, VSM assumes that the occurrences of words are independent of each other, in the same sense that a point's x coordinate is entirely independent of its y coordinate in two dimensions. By intuition you know that this assumption is wrong in many cases. For example, the word Cola has a higher probability of occurring along with the word Coca, so these words aren't completely independent. Other models try to take word dependencies into account. One well-known technique is latent semantic indexing (LSI), which detects dimensions that seem to go together and merges them into a single one.

 

In Mahout, text documents are converted to vectors with TF-IDF weighting and n-gram collocations by the DictionaryVectorizer class.

 

Generating vectors from documents

 

 

mvn -e -q exec:java \
  -Dexec.mainClass="org.apache.lucene.benchmark.utils.ExtractReuters" \
  -Dexec.args="reuters/ reuters-extracted/"
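This invokes the ExtractReuters utility from the Lucene benchmark module, which splits the downloaded Reuters-21578 SGML files in reuters/ into one plain-text file per article under reuters-extracted/.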

 

 

mahout seqdirectory -c UTF-8 \
  -i examples/reuters-extracted/ -o reuters-seqfiles
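The seqdirectory job packs the extracted text files into Hadoop SequenceFiles, with each file's name as the key and its contents as the value, which is the input format the vectorizer expects.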

 

 

mahout seq2sparse -i reuters-seqfiles/ -o reuters-vectors -ow

 

  • In the first step, the text documents are tokenized: they're split into individual words using the Lucene StandardAnalyzer and stored in the tokenized-documents/ folder.
  • The word-counting step (the n-gram generation step, which in this case only counts unigrams) iterates through the tokenized documents and generates a set of important words from the collection.
  • The third step converts the tokenized documents into vectors using the term-frequency weight, thus creating TF vectors.
  • Because the vectorizer uses TF-IDF weighting by default, two more steps follow: the document-frequency (DF) counting job and the TF-IDF vector creation. A sketch for reading the resulting vectors back follows this list.
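Once seq2sparse finishes, the TF-IDF vectors can be read back as Text/VectorWritable pairs from the output SequenceFiles. The sketch below assumes the output sits under reuters-vectors/tfidf-vectors/ and that the part-file is named part-r-00000; check the actual file names produced by your run.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.mahout.math.VectorWritable;

public class DumpTfIdfVectors {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Assumed path; the part-file name can vary between runs and Hadoop versions.
    Path path = new Path("reuters-vectors/tfidf-vectors/part-r-00000");
    SequenceFile.Reader reader = new SequenceFile.Reader(FileSystem.get(conf), path, conf);
    Text docId = new Text();                      // document name
    VectorWritable vector = new VectorWritable(); // its TF-IDF vector
    while (reader.next(docId, vector)) {
      System.out.println(docId + " => "
          + vector.get().getNumNondefaultElements() + " weighted terms");
    }
    reader.close();
  }
}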

 

 

