Topic models provide a simple way to analyze large volumes of unlabeled text. A "topic" consists of a cluster of words that frequently occur together. Using contextual clues, topic models can connect words with similar meanings and distinguish between uses of words with multiple meanings. For a general introduction to topic modeling, see for example
Probabilistic Topic Models by Steyvers and Griffiths (2007).
For an example showing how to use the Java API to import data, train models, and infer topics for new documents, see the
topic model developer's guide.
The MALLET topic model package includes an extremely fast and highly scalable implementation of Gibbs sampling, efficient methods for document-topic hyperparameter optimization, and tools for inferring topics for new documents given trained models.
Importing Documents: Once MALLET has been
downloaded and installed, the next step is to import text files into MALLET's internal format. The following instructions assume that the documents to be used as input to the topic model are in separate files, in a directory that contains no other files. See the introduction to
importing data in MALLET for more information and other import methods.
Change to the MALLET directory and run the command
bin/mallet import-dir --input /data/topic-input --output topic-input.mallet \
--keep-sequence --remove-stopwords
To learn more about options for the import-dir command, use the --help argument.
Building Topic Models: Once you have imported documents into MALLET format, you can use the
train-topics command to build a topic model, for example:
bin/mallet train-topics --input topic-input.mallet \
--num-topics 100 --output-state topic-state.gz
Use the option --help to get a complete list of options for the train-topics command. Commonly used options include:
--input [FILE] Use this option to specify the MALLET collection file you created in the previous step.
--num-topics [NUMBER] The number of topics to use. The best number depends on what you are looking for in the model. The default (10) will provide a broad overview of the contents of the corpus. The number of topics should depend to some degree on the size of the collection, but a value of 200 to 400 will typically produce reasonably fine-grained results.
--num-iterations [NUMBER] The number of sampling iterations to run. This is a trade-off between the time taken to complete sampling and the quality of the topic model.
Hyperparameter Optimization
--optimize-interval [NUMBER] This option turns on hyperparameter optimization, which allows the model to better fit the data by allowing some topics to be more prominent than others. Optimization every 10 iterations is reasonable.
--optimize-burn-in [NUMBER] The number of iterations before hyperparameter optimization begins. Default is twice the optimize interval.
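Putting these options together, a training run with hyperparameter optimization might look like the following (the iteration counts and filenames here are illustrative values, not defaults):

```shell
bin/mallet train-topics --input topic-input.mallet \
  --num-topics 100 --num-iterations 1000 \
  --optimize-interval 10 --optimize-burn-in 20 \
  --output-state topic-state.gz
```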
Model Output
--output-model [FILENAME] This option specifies a file to write a serialized MALLET topic trainer object. This type of output is appropriate for pausing and restarting training, but does not produce data that can easily be analyzed.
--output-state [FILENAME] Similar to output-model, this option outputs a compressed text file containing the words in the corpus with their topic assignments. This file format can easily be parsed and used by non-Java-based software. Note that the state file will be GZipped, so it is helpful to provide a filename that ends in .gz.
--output-doc-topics [FILENAME] This option specifies a file to write the topic composition of documents. See the --help options for parameters related to this file.
--output-topic-keys [FILENAME] This file contains a "key" consisting of the top k words for each topic (where k is defined by the --num-top-words option). This output can be useful for checking that the model is working, as well as for displaying results of the model. In addition, this file reports the Dirichlet parameter of each topic. If hyperparameter optimization is turned on, this number will be roughly proportional to the overall proportion of the collection assigned to a given topic.
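Because the state file written by --output-state is plain text once decompressed, it can be processed with standard command-line tools. The following is a sketch only, assuming the layout used by recent MALLET releases: comment lines begin with '#', and each remaining line holds the whitespace-separated fields doc source pos typeindex type topic. It builds a tiny synthetic state file and counts tokens per topic:

```shell
# Assumed format: '#'-prefixed comment lines, then one line per token:
#   doc source pos typeindex type topic
# Build a small synthetic state file to demonstrate the parsing.
printf '#doc source pos typeindex type topic\n0 doc1.txt 0 0 model 2\n0 doc1.txt 1 1 topic 2\n1 doc2.txt 0 1 topic 5\n' \
  | gzip > sample-state.gz

# Count tokens assigned to each topic (the topic id is the sixth field).
gunzip -c sample-state.gz \
  | awk '!/^#/ { count[$6]++ } END { for (t in count) print t, count[t] }' \
  | sort -n
# prints:
# 2 2
# 5 1
```

The same awk pattern applied to a real topic-state.gz gives a quick per-topic token count without loading the model into Java.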
Topic Inference
--inferencer-filename [FILENAME] Create a topic inference tool based on the current, trained model. Use the MALLET command bin/mallet infer-topics --help to get information on using topic inference.
Note that you must make sure that the new data is compatible with your training data. Use the option --use-pipe-from [MALLET TRAINING FILE] in the MALLET command bin/mallet import-file or import-dir to specify a training file.
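The two steps above (re-importing with the training pipe, then running inference) might look like the following. The filenames are illustrative, and the inferencer file is assumed to have been written with --inferencer-filename during training; confirm the exact option names with bin/mallet infer-topics --help:

```shell
# Import new documents using the same pipe as the training data.
bin/mallet import-dir --input /data/new-docs --output new-docs.mallet \
  --keep-sequence --remove-stopwords --use-pipe-from topic-input.mallet

# Infer topic proportions for the new documents.
bin/mallet infer-topics --inferencer topic-inferencer.mallet \
  --input new-docs.mallet --output-doc-topics new-doc-topics.txt
```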
Topic Held-out Probability
--evaluator-filename [FILENAME] The previous section describes how to get topic proportions for new documents. We often want to estimate the log probability of new documents, marginalized over all topic configurations. Use the MALLET command bin/mallet evaluate-topics --help to get information on using held-out probability estimation.
As with topic inference, you must make sure that the new data is compatible with your training data. Use the option --use-pipe-from [MALLET TRAINING FILE] in the MALLET command bin/mallet import-file or import-dir to specify a training file.
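A held-out evaluation run might look like the following sketch. The filenames are illustrative, and the evaluator file is assumed to have been written with --evaluator-filename during training; the exact output options should be confirmed with bin/mallet evaluate-topics --help:

```shell
# Import held-out documents using the training pipe.
bin/mallet import-dir --input /data/heldout-docs --output heldout.mallet \
  --keep-sequence --remove-stopwords --use-pipe-from topic-input.mallet

# Estimate the marginal log probability of the held-out documents.
bin/mallet evaluate-topics --evaluator topic-evaluator.mallet \
  --input heldout.mallet --output-doc-probs doc-probs.txt
```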