
Mahout: Topic modeling using latent Dirichlet allocation (LDA)

 

Introduction

To find topics in a particular set of documents, we could modify our clustering code to work with word vectors instead of the document vectors we've used so far. A word vector is simply a vector for each word, whose features are the IDs of the other words that occur alongside it in the corpus and whose weights are the number of documents in which the two words occur together.

 

Latent Dirichlet allocation (LDA) goes beyond this kind of clustering. If two words with the same meaning or form never occur together, clustering can't associate them based on other instances. This is where LDA shines: it sifts through the patterns in which words occur and figures out which words have similar meanings or are used in similar contexts. Such a group of words can be thought of as a concept or a topic.

 

The LDA algorithm works like Dirichlet clustering. It starts with an empty topic model, reads all the documents in a parallel mapper phase, and calculates the probability of each topic for each word in each document. The counts of these probabilities are then sent to the reducer, where they're summed and the whole model is normalized. This process runs repeatedly until the model starts explaining the documents well, that is, until the sum of the log probabilities stops changing. The allowed degree of change is set by a convergence threshold parameter, similar to the threshold in k-means clustering. But instead of measuring the relative change of centroids, LDA estimates how well the model fits the data; if the likelihood doesn't change by more than this threshold between iterations, the iterations stop.

 

TF-IDF vs. LDA

While clustering documents, we used TF-IDF weighting to bring out the important words within a document. One drawback of TF-IDF is that it fails to capture the co-occurrence or correlation between words, as in a phrase like Coca-Cola. Nor can TF-IDF bring out the subtle, intrinsic relations between words based on their occurrence and distribution. LDA derives these relations from raw word frequencies, so it's important to give term-frequency (TF) vectors as input to the algorithm, not TF-IDF vectors.
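As a sketch, the dictionary-based vectorizer can be asked for plain term-frequency weighting via its -wt option. All paths below are made-up placeholders, and the exact flag names may differ slightly between Mahout versions:

```shell
# Convert a directory of text files into SequenceFiles
bin/mahout seqdirectory -i /path/to/corpus -o /path/to/corpus-seq

# Vectorize with plain term frequency (-wt tf), not the default TF-IDF
bin/mahout seq2sparse \
    -i /path/to/corpus-seq \
    -o /path/to/corpus-vectors \
    -wt tf \
    -nv
```

In the 0.x releases I'm aware of, the term-frequency vectors end up in a tf-vectors subdirectory of the output path, alongside the dictionary file that the LDA jobs need.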

 

Tuning the parameters of LDA

LDA's runtime is driven mainly by two parameters:

  • the number of topics
  • the number of words (features) in the corpus

If you need to speed up LDA, apart from decreasing the number of topics, you can keep the number of features to a minimum; but if you need the complete probability distribution of all words over topics, you should leave this parameter alone. If you're interested in a topic model containing only the keywords of a large corpus, you can prune away the high-frequency words while creating the vectors. To do so, lower the maximum-document-frequency percentage parameter (--maxDFPercent) of the dictionary-based vectorizer. A value of 70 removes all words that occur in more than 70 percent of the documents.
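Such pruning could look like this at vectorization time (paths are placeholders; -x is, to my knowledge, the short form of --maxDFPercent in seq2sparse):

```shell
# Drop words that appear in more than 70% of the documents,
# so the topic model is built only from more discriminative terms
bin/mahout seq2sparse \
    -i /path/to/corpus-seq \
    -o /path/to/corpus-vectors-pruned \
    -wt tf \
    -x 70
```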

Invocation and Usage

Mahout's implementation of LDA operates on a collection of SparseVectors of word counts. These word counts should be non-negative integers, though things will probably work fine if you use non-negative reals. (Note that the probabilistic model doesn't make sense for non-integer counts!) To create these vectors, it's recommended that you follow the instructions in Creating Vectors From Text, making sure to use TF and not TFIDF as the scorer.

Invocation takes the form:

bin/mahout cvb \
    -i <input path for document vectors> \
    -dict <path to term-dictionary file(s), glob expression supported> \
    -o <output path for topic-term distributions> \
    -dt <output path for doc-topic distributions> \
    -k <number of latent topics> \
    -nt <number of unique features defined by input document vectors> \
    -mt <path to store model state after each iteration> \
    -maxIter <max number of iterations> \
    -mipd <max number of iterations per doc for learning> \
    -a <smoothing for doc topic distributions> \
    -e <smoothing for term topic distributions> \
    -seed <random seed> \
    -tf <fraction of data to hold for testing> \
    -block <number of iterations per perplexity check, ignored unless test_set_percentage>0

 

Topic smoothing should generally be about 50/K, where K is the number of topics. The vocabulary size can serve as an upper bound for the number of features (-nt), though it shouldn't be too high, for memory reasons.
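Putting this together, a concrete run with 20 topics, applying the 50/K rule to the doc-topic smoothing (50/20 = 2.5), might look as follows. All paths and the vocabulary size of 50000 are made-up placeholders, and depending on your Mahout version you may first need to convert the vector keys to integer row IDs with the rowid job:

```shell
bin/mahout cvb \
    -i /path/to/corpus-vectors/tf-vectors \
    -dict /path/to/corpus-vectors/dictionary.file-0 \
    -o /path/to/lda/topic-term \
    -dt /path/to/lda/doc-topic \
    -k 20 \
    -nt 50000 \
    -mt /path/to/lda/model \
    -maxIter 30 \
    -a 2.5
```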

Choosing the number of topics is more art than science, and it's recommended that you try several values.

After running LDA, you can print the computed topics using the LDAPrintTopics utility:

bin/mahout ldatopics \
    -i <input vectors directory> \
    -d <input dictionary file> \
    -w <optional number of words to print> \
    -o <optional output working directory. Default is to console> \
    -h <print out help> \
    -dt <optional dictionary type (text|sequencefile). Default is text>
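For instance, to print the top 10 words per topic to the console (paths are placeholders; the dictionary produced by seq2sparse is a SequenceFile, hence -dt sequencefile rather than the text default):

```shell
bin/mahout ldatopics \
    -i /path/to/lda/topic-term \
    -d /path/to/corpus-vectors/dictionary.file-0 \
    -w 10 \
    -dt sequencefile
```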

References

http://mahout.apache.org/users/clustering/latent-dirichlet-allocation.html

http://blog.csdn.net/wangran51/article/details/7408399

http://en.wikipedia.org/wiki/Dirichlet_distribution
