
Mahout: Clustering - Representing data

 

Transforming data into vectors

In Mahout, vectors are implemented as three different classes:

  • DenseVector can be thought of as an array of doubles, whose size is the number of features in the data. Because all the entries in the array are preallocated regardless of whether the value is 0 or not, we call it dense.
  • RandomAccessSparseVector is implemented as a HashMap from an integer to a double, where only nonzero-valued features are allocated. Hence, it's called a sparse vector.
  • SequentialAccessSparseVector is implemented as two parallel arrays, one of integers and the other of doubles. Only nonzero-valued entries are kept in it. Unlike the RandomAccessSparseVector, which is optimized for random access, this one is optimized for linear reading. A short construction sketch for all three classes follows this list.
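As a concrete illustration, here is a minimal sketch that builds the same three-dimensional vector with each implementation. It assumes Mahout's math module (mahout-math) is on the classpath; the feature values are arbitrary.

import org.apache.mahout.math.DenseVector;
import org.apache.mahout.math.RandomAccessSparseVector;
import org.apache.mahout.math.SequentialAccessSparseVector;
import org.apache.mahout.math.Vector;

public class VectorExamples {
  public static void main(String[] args) {
    // Dense: all 3 dimensions are stored, zeros included.
    Vector dense = new DenseVector(new double[] {1.1, 0.0, 570.0});

    // Random-access sparse: backed by a hash map, only nonzero entries stored.
    Vector randomSparse = new RandomAccessSparseVector(3);
    randomSparse.set(0, 1.1);
    randomSparse.set(2, 570.0);

    // Sequential-access sparse: parallel index/value arrays, best for in-order reads.
    Vector seqSparse = new SequentialAccessSparseVector(3);
    seqSparse.set(0, 1.1);
    seqSparse.set(2, 570.0);

    System.out.println(dense + "\n" + randomSparse + "\n" + seqSparse);
  }
}

RandomAccessSparseVector is the natural choice while a vector is being filled in arbitrary order; SequentialAccessSparseVector pays off when an algorithm reads the vector front to back many times, as the bullet above notes.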



One possible problem with our chosen mapping of features to dimension values is that the values in dimension 1 are much larger than the others. If we applied a simple distance-based metric to determine similarity between these vectors, color differences would dominate the results: a relatively small color difference of 10 nm would count the same as a huge size difference of 10. Weighting the different dimensions solves this problem.
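One simple way to apply such weights, sketched below, is to scale each dimension before measuring distance. The sketch assumes mahout-core's EuclideanDistanceMeasure is available; the feature values and weights are made up for illustration.

import org.apache.mahout.common.distance.EuclideanDistanceMeasure;
import org.apache.mahout.math.DenseVector;
import org.apache.mahout.math.Vector;

public class WeightedDistanceExample {
  public static void main(String[] args) {
    // Dimension 0: weight (kg), dimension 1: color (nm), dimension 2: size.
    Vector apple1 = new DenseVector(new double[] {0.11, 510.0, 1.0});
    Vector apple2 = new DenseVector(new double[] {0.23, 650.0, 3.0});

    // Unweighted: the 140 nm color gap dominates the distance.
    EuclideanDistanceMeasure measure = new EuclideanDistanceMeasure();
    System.out.println("unweighted: " + measure.distance(apple1, apple2));

    // Scale each dimension so the ranges become comparable (weights chosen for illustration).
    Vector weights = new DenseVector(new double[] {10.0, 0.01, 1.0});
    System.out.println("weighted:   "
        + measure.distance(apple1.times(weights), apple2.times(weights)));
  }
}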

 

 

Representing text documents as vectors

The vector space model (VSM) is the common way of vectorizing text documents. First, imagine the set of all words that could be encountered in a series of documents being vectorized. This set might be all words that appear at least once in any of the documents. Imagine each word being assigned a number, which is the dimension it’ll occupy in document vectors.

  • Term frequency (TF): the value of the vector dimension for a word is usually the number of occurrences of the word in the document. This is known as term frequency (TF) weighting.
  • Term frequency–inverse document frequency (TF-IDF): TF-IDF weighting is a widely used improvement on simple term-frequency weighting. The IDF part is the improvement; instead of simply using term frequency as the value in the vector, this value is multiplied by the inverse of the term's document frequency. That is, the value is reduced more for words used frequently across all the documents in the dataset than for infrequently used words. A worked sketch of this weighting follows the list.
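To make the idea concrete, the following sketch computes the classic TF-IDF weight, tf * log(N / df), over a toy, already-tokenized collection. This is plain Java for illustration only; Mahout computes TF-IDF weights internally during vectorization, and its exact formula may differ in detail.

import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;

public class TfIdfSketch {
  public static void main(String[] args) {
    // Toy collection: each "document" is already tokenized and lowercased.
    List<List<String>> docs = Arrays.asList(
        Arrays.asList("coca", "cola", "is", "a", "cola"),
        Arrays.asList("pepsi", "is", "a", "cola"),
        Arrays.asList("water", "is", "not", "a", "cola"));

    // Document frequency: in how many documents does each word appear at least once?
    Map<String, Integer> df = new HashMap<>();
    for (List<String> doc : docs) {
      for (String word : new HashSet<>(doc)) {
        df.merge(word, 1, Integer::sum);
      }
    }

    // TF-IDF for the first document: tf * log(N / df).
    // "cola" occurs in every document, so its IDF (and hence its weight) drops to zero.
    List<String> doc0 = docs.get(0);
    int numDocs = docs.size();
    for (String word : new HashSet<>(doc0)) {
      long tf = doc0.stream().filter(word::equals).count();
      double tfidf = tf * Math.log((double) numDocs / df.get(word));
      System.out.printf("%-5s tf=%d df=%d tfidf=%.3f%n", word, tf, df.get(word), tfidf);
    }
  }
}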



 
The basic assumption of the vector space model (VSM) is that the words are dimensions and therefore are orthogonal to each other. In other words, VSM assumes that the occurrences of words are independent of each other, in the same sense that a point's x coordinate is entirely independent of its y coordinate in two dimensions. By intuition you know that this assumption is wrong in many cases. For example, the word Cola has a higher probability of occurring along with the word Coca, so these words aren't completely independent. Other models try to take word dependencies into account. One well-known technique is latent semantic indexing (LSI), which detects dimensions that seem to go together and merges them into a single one.

 

In Mahout, text documents are converted to vectors with TF-IDF weighting and n-gram collocations by the DictionaryVectorizer class.

 

Generating vectors from documents

 

 

mvn -e -q exec:java \
  -Dexec.mainClass="org.apache.lucene.benchmark.utils.ExtractReuters" \
  -Dexec.args="reuters/ reuters-extracted/"
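This invokes the ExtractReuters utility from the Lucene benchmark module, which splits the downloaded Reuters-21578 SGML files in reuters/ into one plain-text file per article under reuters-extracted/.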

 

 

mahout seqdirectory -c UTF-8 \
  -i examples/reuters-extracted/ -o reuters-seqfiles
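The seqdirectory job packs the extracted text files into Hadoop SequenceFiles, with each file's name as the key and its contents as the value, which is the input format the vectorizer expects.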

 

 

mahout seq2sparse -i reuters-seqfiles/ -o reuters-vectors -ow

 

  • In the first step, the text documents are tokenized: they're split into individual words using the Lucene StandardAnalyzer and stored in the tokenized-documents/ folder.
  • The word-counting step (the n-gram generation step, which in this case only counts unigrams) iterates through the tokenized documents and generates a set of important words from the collection.
  • The third step converts the tokenized documents into vectors using the term-frequency weight, thus creating TF vectors.
  • Because the vectorizer uses TF-IDF weighting by default, two more steps follow: the document-frequency (DF) counting job and the TF-IDF vector creation. A sketch for reading the resulting vectors back follows this list.
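Once seq2sparse finishes, the TF-IDF vectors can be read back as Text/VectorWritable pairs from the output SequenceFiles. The sketch below assumes the output sits under reuters-vectors/tfidf-vectors/ and that the part-file is named part-r-00000; check the actual file names produced by your run.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.mahout.math.VectorWritable;

public class DumpTfIdfVectors {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Assumed path; the part-file name can vary between runs and Hadoop versions.
    Path path = new Path("reuters-vectors/tfidf-vectors/part-r-00000");
    SequenceFile.Reader reader = new SequenceFile.Reader(FileSystem.get(conf), path, conf);
    Text docId = new Text();                      // document name
    VectorWritable vector = new VectorWritable(); // its TF-IDF vector
    while (reader.next(docId, vector)) {
      System.out.println(docId + " => "
          + vector.get().getNumNondefaultElements() + " weighted terms");
    }
    reader.close();
  }
}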

 

 

