Transforming data into vectors
In Mahout, vectors are implemented as three different classes:
- DenseVector can be thought of as an array of doubles whose size is the number of features in the data. Because all entries in the array are preallocated regardless of whether the value is 0 or not, it's called dense.
- RandomAccessSparseVector is implemented as a HashMap from an integer to a double, where only nonzero-valued features are allocated. Hence, it's called a sparse vector.
- SequentialAccessSparseVector is implemented as two parallel arrays, one of integers and one of doubles, holding only the nonzero-valued entries. Unlike RandomAccessSparseVector, which is optimized for random access, this one is optimized for linear reading.
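The three storage schemes can be sketched in plain Java; these are hypothetical stand-ins to show the layouts, not Mahout's actual implementations:

```java
import java.util.HashMap;
import java.util.Map;

// Conceptual stand-ins for the three Mahout vector layouts (assumed
// simplification, not the real classes).
public class VectorStorage {
    // DenseVector: one double per dimension, zeros stored too.
    static double[] dense = {0.0, 1.5, 0.0, 2.0};

    // RandomAccessSparseVector: index -> value map, nonzeros only,
    // fast random lookups.
    static Map<Integer, Double> randomAccess = sparse();

    // SequentialAccessSparseVector: two parallel arrays sorted by
    // index, nonzeros only, fast in-order iteration.
    static int[] indices = {1, 3};
    static double[] values = {1.5, 2.0};

    static Map<Integer, Double> sparse() {
        Map<Integer, Double> m = new HashMap<>();
        m.put(1, 1.5);
        m.put(3, 2.0);
        return m;
    }

    public static void main(String[] args) {
        // All three encodings represent the same 4-dimensional vector.
        for (int k = 0; k < indices.length; k++) {
            System.out.println(indices[k] + " -> " + values[k]);
        }
    }
}
```

The dense layout pays for every zero; the two sparse layouts pay only per nonzero entry, trading lookup speed against iteration speed.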
One possible problem with our chosen mapping to dimension values is that the values in dimension 1 (color, in nanometers) are much larger than the others. If we applied a simple distance-based metric to determine similarity between these vectors, color differences would dominate the results: a relatively small color difference of 10 nm would be treated as equal to a huge size difference of 10. Weighting the different dimensions solves this problem.
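The effect of weighting can be seen with a small sketch; the feature values and weights below are illustrative assumptions, not taken from the text:

```java
// Sketch: per-dimension weights rescale a Euclidean distance so that
// no single dimension dominates (illustrative numbers).
public class WeightedDistance {
    static double distance(double[] a, double[] b, double[] w) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) {
            double d = w[i] * (a[i] - b[i]);  // weight each dimension
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    public static void main(String[] args) {
        double[] apple1 = {610, 2, 8};    // color (nm), weight, size
        double[] apple2 = {600, 2, 1};    // small color shift, big size gap
        double[] uniform = {1, 1, 1};     // unweighted: color dominates
        double[] scaled  = {0.01, 1, 1};  // shrink the nm scale
        System.out.println(distance(apple1, apple2, uniform));
        System.out.println(distance(apple1, apple2, scaled));
    }
}
```

With uniform weights, the 10 nm color gap contributes 100 to the squared sum versus 49 for the size gap; after scaling, the size difference drives the distance as intuition suggests it should.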
Representing text documents as vectors
The vector space model (VSM) is the common way of vectorizing text documents. First, imagine the set of all words that could be encountered in a series of documents being vectorized. This set might be all words that appear at least once in any of the documents. Imagine each word being assigned a number, which is the dimension it’ll occupy in document vectors.
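The word-to-dimension assignment amounts to building a dictionary; a minimal sketch (the class and method names are made up for illustration):

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch: assign each distinct word the next free dimension index,
// in first-seen order (hypothetical helper, not a Mahout API).
public class WordDictionary {
    static Map<String, Integer> build(String... words) {
        Map<String, Integer> dims = new LinkedHashMap<>();
        for (String w : words) {
            dims.putIfAbsent(w, dims.size());  // new word -> new dimension
        }
        return dims;
    }

    public static void main(String[] args) {
        System.out.println(build("wheat", "prices", "wheat", "fall"));
        // {wheat=0, prices=1, fall=2}
    }
}
```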
- Term frequency (TF): The value of the vector dimension for a word is usually the number of occurrences of the word in the document. This is known as term-frequency (TF) weighting.
- Term frequency–inverse document frequency (TF-IDF): This is a widely used improvement on simple term-frequency weighting. The IDF part is the improvement: instead of simply using the term frequency as the value in the vector, this value is multiplied by the inverse of the term's document frequency. That is, its value is reduced more for words used frequently across all the documents in the dataset than for infrequently used words.
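Both weightings above can be sketched in a few lines, using the common tf × log(N/df) form of TF-IDF (Mahout's exact normalization may differ):

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the two weightings: raw term frequency, and TF-IDF in its
// common tf * log(N / df) form (an assumed formula for illustration).
public class Weighting {
    // TF: dimension value = number of occurrences in the document.
    static Map<String, Integer> tf(String doc) {
        Map<String, Integer> counts = new HashMap<>();
        for (String word : doc.toLowerCase().split("\\s+")) {
            counts.merge(word, 1, Integer::sum);
        }
        return counts;
    }

    // TF-IDF: damp words that appear in many of the numDocs documents.
    static double tfidf(int tf, int df, int numDocs) {
        return tf * Math.log((double) numDocs / df);
    }

    public static void main(String[] args) {
        System.out.println(tf("the cat sat on the mat"));
        // A word present in every document scores 0; a rarer word with
        // the same term frequency scores higher.
        System.out.println(tfidf(3, 100, 100));
        System.out.println(tfidf(3, 5, 100));
    }
}
```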
The basic assumption of the vector space model (VSM) is that the words are dimensions and are therefore orthogonal to each other. In other words, VSM assumes that occurrences of words are independent of each other, in the same sense that a point's x coordinate is entirely independent of its y coordinate in two dimensions. Intuition says this assumption is wrong in many cases: for example, the word Cola has a higher probability of occurring alongside the word Coca, so these words aren't completely independent. Other models try to account for word dependencies. One well-known technique is latent semantic indexing (LSI), which detects dimensions that seem to go together and merges them into a single dimension.
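Under the orthogonality assumption, document similarity reduces to a dot product over shared dimensions, as in the standard cosine measure sketched below; co-occurring words like Coca and Cola still count as separate, independent axes:

```java
// Cosine similarity: dot product of the vectors divided by the product
// of their lengths. Each word is its own axis, so documents with no
// words in common score exactly 0.
public class CosineSimilarity {
    static double cosine(double[] a, double[] b) {
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            na += a[i] * a[i];
            nb += b[i] * b[i];
        }
        return dot / (Math.sqrt(na) * Math.sqrt(nb));
    }

    public static void main(String[] args) {
        // dims: [coca, cola, water]
        double[] d1 = {1, 1, 0};
        double[] d2 = {0, 0, 1};
        // No overlap in words means zero similarity, even though a
        // reader might judge the documents related.
        System.out.println(cosine(d1, d2));
    }
}
```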
In Mahout, text documents are converted to vectors with TF-IDF weighting and n-gram collocations by the DictionaryVectorizer class.
Generating vectors from documents
mvn -e -q exec:java -Dexec.mainClass="org.apache.lucene.benchmark.utils.ExtractReuters" -Dexec.args="reuters/ reuters-extracted/"
mahout seqdirectory -c UTF-8 -i examples/reuters-extracted/ -o reuters-seqfiles
mahout seq2sparse -i reuters-seqfiles/ -o reuters-vectors -ow
- In the first step, the text documents are tokenized—they’re split into individual words using the Lucene StandardAnalyzer and stored in the tokenized-documents/ folder.
- The word-counting step—the n-gram generation step (which in this case only counts unigrams)—iterates through the tokenized documents and generates a set of important words from the collection.
- The third step converts the tokenized documents into vectors using the term-frequency weight, thus creating TF vectors. Because the vectorizer uses TF-IDF weighting by default, two more steps follow: the document-frequency (DF) counting job and the TF-IDF vector creation.
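The stages above can be mimicked with a toy in-memory sketch (tokenize, count, TF, DF, TF-IDF); Mahout runs each stage as a MapReduce job over sequence files, and the names and formula here are illustrative assumptions:

```java
import java.util.*;

// Toy in-memory version of the seq2sparse stages; real Mahout runs
// these as MapReduce jobs and may normalize differently.
public class MiniSeq2Sparse {
    // Stage 1: tokenize each document into lowercase words.
    static List<List<String>> tokenize(List<String> docs) {
        List<List<String>> out = new ArrayList<>();
        for (String d : docs) out.add(Arrays.asList(d.toLowerCase().split("\\s+")));
        return out;
    }

    // Stage 4: document frequency = number of documents containing the word.
    static Map<String, Integer> documentFrequency(List<List<String>> docs) {
        Map<String, Integer> df = new HashMap<>();
        for (List<String> doc : docs)
            for (String w : new HashSet<>(doc)) df.merge(w, 1, Integer::sum);
        return df;
    }

    // Stages 3 and 5: build a TF vector, then reweight it to TF-IDF.
    static Map<String, Double> tfidf(List<String> doc, Map<String, Integer> df, int n) {
        Map<String, Double> v = new HashMap<>();
        for (String w : doc) v.merge(w, 1.0, Double::sum);            // TF pass
        v.replaceAll((w, tf) -> tf * Math.log((double) n / df.get(w))); // IDF pass
        return v;
    }

    public static void main(String[] args) {
        List<String> corpus = Arrays.asList("wheat prices rise", "wheat exports fall");
        List<List<String>> toks = tokenize(corpus);
        Map<String, Integer> df = documentFrequency(toks);
        // "wheat" appears in both documents, so its weight drops to 0;
        // the document-specific words keep weight log(2).
        System.out.println(tfidf(toks.get(0), df, corpus.size()));
    }
}
```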