GibbsLDA++

summerbell


3.4 Case Study

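For example, suppose we want to estimate an LDA model for a collection of documents stored in the file models/casestudy/trndocs.dat, and then use that model to do inference for new data stored in the file models/casestudy/newdocs.dat.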

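We want to estimate 100 topics with alpha = 0.5 and beta = 0.1. We want to perform 1000 Gibbs sampling iterations, save a model every 100 iterations, and, each time a model is saved, print out the 20 most likely words for each topic. Assuming we are in the root directory of GibbsLDA++, we run the following command to estimate the LDA model from scratch: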

$ src/lda -est -alpha 0.5 -beta 0.1 -ntopics 100 -niters 1000 -savestep 100 -twords 20 -dfile models/casestudy/trndocs.dat

Now look into the models/casestudy directory; we can see the following outputs.

Outputs of Gibbs sampling estimation of GibbsLDA++ include the following files:

<model_name>.others

<model_name>.phi

<model_name>.theta

<model_name>.tassign

<model_name>.twords

in which:

<model_name>: the name of an LDA model, corresponding to the iteration at which it was saved to disk. For example, the model saved at the 400th Gibbs sampling iteration is named model-00400, and the model saved at the 1200th iteration is named model-01200. The model saved after the last Gibbs sampling iteration is named model-final.
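The naming rule above can be sketched in Python. This is a minimal illustration, not code taken from GibbsLDA++: the helper names `model_name` and `expected_snapshots` are hypothetical, and the sketch assumes a snapshot is written whenever the iteration count is a multiple of savestep, followed by model-final at the end of the run.

```python
def model_name(iteration=None):
    """Return the snapshot name for an iteration, or model-final if None."""
    if iteration is None:
        return "model-final"
    # Five-digit, zero-padded iteration number, e.g. 400 -> model-00400.
    return f"model-{iteration:05d}"


def expected_snapshots(niters, savestep, start=0):
    """List the model names expected after running niters more iterations.

    Assumption: a snapshot is saved at every iteration that is a
    multiple of savestep, and model-final is saved when the run ends.
    """
    last = start + niters
    names = [model_name(i) for i in range(start + 1, last + 1) if i % savestep == 0]
    names.append(model_name())
    return names


if __name__ == "__main__":
    # Snapshot names for 1000 iterations with a savestep of 100.
    print(expected_snapshots(1000, 100))
```

Passing `start=1000` and `niters=800` models a run that continues from an earlier snapshot, so the last numbered name in that case is model-01800.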

<model_name>.others: This file contains some parameters of the LDA model, such as:

alpha=?

beta=?

ntopics=? # i.e., number of topics

ndocs=? # i.e., number of documents

nwords=? # i.e., the vocabulary size

liter=? # i.e., the Gibbs sampling iteration at which the model was saved
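As a rough sketch, a file with the key=value layout listed above could be read into a dictionary as follows. The function name `parse_others` and the sample values for ndocs and nwords are hypothetical; the sketch assumes one key=value pair per line and treats anything after a `#` as a comment.

```python
def parse_others(text):
    """Parse key=value lines from a .others file into a dict."""
    params = {}
    for line in text.splitlines():
        # Drop an optional trailing comment and surrounding whitespace.
        line = line.split("#", 1)[0].strip()
        if not line or "=" not in line:
            continue
        key, value = line.split("=", 1)
        key, value = key.strip(), value.strip()
        # alpha and beta are real-valued; the remaining fields are counts.
        params[key] = float(value) if key in ("alpha", "beta") else int(value)
    return params


# Made-up sample content; ndocs and nwords are illustrative only.
sample = """alpha=0.5
beta=0.1
ntopics=100
ndocs=3
nwords=50
liter=400
"""

if __name__ == "__main__":
    print(parse_others(sample))
```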

<model_name>.phi: This file contains the word-topic distributions, i.e., p(word_w | topic_t). Each line is a topic, and each column is a word in the vocabulary.
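To make the layout concrete, here is a minimal sketch that reads such a matrix and ranks the words of one topic. It assumes the columns are separated by whitespace and that the column index equals the word ID; the names `parse_phi` and `top_word_ids` and the toy numbers are hypothetical.

```python
def parse_phi(text):
    """Parse .phi content: one topic per line, one probability per word ID."""
    return [[float(x) for x in line.split()]
            for line in text.splitlines() if line.strip()]


def top_word_ids(topic_row, n):
    """Return the n word IDs (column indices) with the highest probability."""
    ranked = sorted(range(len(topic_row)), key=lambda w: topic_row[w], reverse=True)
    return ranked[:n]


# Toy matrix with 2 topics and 3 words (made-up numbers, not real output).
sample = "0.7 0.2 0.1\n0.1 0.3 0.6\n"
phi = parse_phi(sample)

if __name__ == "__main__":
    for t, row in enumerate(phi):
        print(t, top_word_ids(row, 2))
```

Each row is a probability distribution over the vocabulary, so a quick sanity check on a loaded file is that every row sums to approximately 1.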

<model_name>.theta: This file contains the topic-document distributions, i.e., p(topic_t | document_m). Each line is a document, and each column is a topic.
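One common use of this matrix is to find the most probable topic of each document. The sketch below assumes whitespace-separated columns, with the column index being the topic number; the function name `dominant_topics` and the toy numbers are hypothetical.

```python
def dominant_topics(theta_text):
    """For each document line, return (topic_index, probability) of the largest entry."""
    result = []
    for line in theta_text.splitlines():
        if not line.strip():
            continue
        probs = [float(x) for x in line.split()]
        # Index of the largest probability in this document's row.
        best = max(range(len(probs)), key=lambda t: probs[t])
        result.append((best, probs[best]))
    return result


# Toy matrix with 2 documents and 3 topics (made-up numbers).
sample = "0.05 0.90 0.05\n0.60 0.30 0.10\n"

if __name__ == "__main__":
    print(dominant_topics(sample))
```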

<model_name>.tassign: This file contains the topic assignments for words in the training data. Each line is a document that consists of a list of <word_ij>:<topic of word_ij> pairs.
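A line in this format could be split into pairs as sketched below. The sketch assumes that the pairs are separated by whitespace and that both the word and the topic are written as integer IDs; the names `parse_tassign_line` and `topic_counts` and the sample line are hypothetical.

```python
from collections import Counter


def parse_tassign_line(line):
    """Split one document line into (word_id, topic) integer pairs."""
    pairs = []
    for token in line.split():
        word, topic = token.split(":")
        pairs.append((int(word), int(topic)))
    return pairs


def topic_counts(pairs):
    """Count how many word tokens in a document were assigned to each topic."""
    return Counter(topic for _, topic in pairs)


if __name__ == "__main__":
    # Made-up document line: word 12 appears twice, both times under topic 3.
    pairs = parse_tassign_line("12:3 45:7 12:3")
    print(pairs, topic_counts(pairs))
```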

<model_name>.twords: This file contains the twords most likely words of each topic. The value of twords is specified on the command line (see Sections 3.1.1 and 3.1.2).

GibbsLDA++ also saves a file called wordmap.txt that contains the mapping between words and their integer IDs. This is because GibbsLDA++ works internally with integer IDs of words/terms rather than with text strings.
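To turn word IDs back into readable words, the mapping can be loaded into a dictionary. The exact layout of wordmap.txt is not described here, so this sketch assumes a first line holding the vocabulary size followed by one "word id" pair per line; check the file before relying on it. The function name `parse_wordmap` and the sample content are hypothetical.

```python
def parse_wordmap(text):
    """Build an id -> word dict from wordmap-style text.

    Assumed layout: an optional count line, then "word id" pairs.
    """
    id_to_word = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 2:
            # Skip blank lines and the assumed leading count line.
            continue
        word, word_id = parts
        id_to_word[int(word_id)] = word
    return id_to_word


# Made-up sample in the assumed layout.
sample = "3\ntopic 0\nmodel 1\nsampling 2\n"

if __name__ == "__main__":
    print(parse_wordmap(sample))
```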

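Now suppose we want to continue with another 800 Gibbs sampling iterations from the previously estimated model model-01000, with savestep = 100 and twords = 30. We run the following command: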

$ src/lda -estc -dir models/casestudy/ -model model-01000 -niters 800 -savestep 100 -twords 30

Now look into the casestudy directory to see the outputs.

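Now, if we want to do inference (30 Gibbs sampling iterations) for the new data newdocs.dat using one of the previously estimated LDA models, for example model-01800, we run the following command: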

$ src/lda -inf -dir models/casestudy/ -model model-01800 -niters 30 -twords 20 -dfile newdocs.dat

Now look into the casestudy directory; we can see the outputs of the inference:

newdocs.dat.others

newdocs.dat.phi

newdocs.dat.tassign

newdocs.dat.theta

newdocs.dat.twords
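A small check like the following can confirm that a run produced all five files. It is only a convenience sketch: the function name `missing_outputs` is hypothetical, and the suffix list simply mirrors the file names shown above.

```python
import os

# The five output suffixes listed in this section.
SUFFIXES = (".others", ".phi", ".tassign", ".theta", ".twords")


def missing_outputs(directory, prefix):
    """Return the expected output file names that are absent from directory."""
    return [prefix + suffix
            for suffix in SUFFIXES
            if not os.path.isfile(os.path.join(directory, prefix + suffix))]


if __name__ == "__main__":
    # An empty list means every expected inference output is present.
    print(missing_outputs("models/casestudy", "newdocs.dat"))
```

The same check works for estimation outputs by passing a model name such as model-final as the prefix.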

2009-05-14 10:38