Topic Modeling for Java Developers

 

In this example, I import data from a file, train a topic model, and analyze the topic assignments of the first instance. I then create a new instance, which is made up of the words from topic 0, and infer a topic distribution for that instance. Note that this example requires the latest development release, and will not compile under Mallet 2.0.6.

An example input file is available: ap.txt. This is the same example data set provided by David Blei with the lda-c package. The file contains one document per line. Each line has three fields, separated by commas. This is a standard Mallet format. For more information, see the importing data guide. The first field is a name for the document. The second field could contain a document label, as in a classification task, but for this example we won't use that field. It is therefore set to a meaningless placeholder value. The third field contains the full text of the document, with no newline characters.
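To make the three-field layout concrete, here is a minimal sketch that applies the same regular expression the example passes to CsvIterator, using only java.util.regex. The class name FieldSplitSketch, the helper split, and the sample line are assumptions made for illustration; the line is not taken from ap.txt. Because `\S*` also matches commas, a comma written directly after a field stays attached to that field, so the sample line separates its fields with tabs.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FieldSplitSketch {

    // Same pattern the example passes to CsvIterator.
    static final Pattern LINE =
        Pattern.compile("^(\\S*)[\\s,]*(\\S*)[\\s,]*(.*)$");

    /** Returns {name, label, text} for one input line, or null if the line does not match. */
    static String[] split(String line) {
        Matcher m = LINE.matcher(line);
        if (!m.find()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2), m.group(3) };
    }

    public static void main(String[] args) {
        // Hypothetical input line (assumption): name, placeholder label, text.
        String[] fields = split("doc0\tX\tthe quick brown fox");
        System.out.println("name  = " + fields[0]);
        System.out.println("label = " + fields[1]);
        System.out.println("text  = " + fields[2]);
    }
}
```

In the CsvIterator call, the arguments 3, 2, 1 then select group 3 as the data, group 2 as the label, and group 1 as the name.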

The following example is in the cc.mallet.examples package. Annotations are included in comments. You can run this code using the command bin/mallet run cc.mallet.examples.TopicModel [filename].

package cc.mallet.examples;

import cc.mallet.util.*;
import cc.mallet.types.*;
import cc.mallet.pipe.*;
import cc.mallet.pipe.iterator.*;
import cc.mallet.topics.*;

import java.util.*;
import java.util.regex.*;
import java.io.*;

public class TopicModel {

    public static void main(String[] args) throws Exception {

        // Begin by importing documents from text to feature sequences
        ArrayList<Pipe> pipeList = new ArrayList<Pipe>();

        // Pipes: lowercase, tokenize, remove stopwords, map to features
        pipeList.add( new CharSequenceLowercase() );
        pipeList.add( new CharSequence2TokenSequence(Pattern.compile("\\p{L}[\\p{L}\\p{P}]+\\p{L}")) );
        pipeList.add( new TokenSequenceRemoveStopwords(new File("stoplists/en.txt"), "UTF-8", false, false, false) );
        pipeList.add( new TokenSequence2FeatureSequence() );

        InstanceList instances = new InstanceList (new SerialPipes(pipeList));

        Reader fileReader = new InputStreamReader(new FileInputStream(new File(args[0])), "UTF-8");
        instances.addThruPipe(new CsvIterator (fileReader, Pattern.compile("^(\\S*)[\\s,]*(\\S*)[\\s,]*(.*)$"),
                                               3, 2, 1)); // data, label, name fields

        // Create a model with 100 topics, alpha_t = 0.01, beta_w = 0.01
        //  Note that the first parameter is passed as the sum over topics, while
        //  the second is the parameter for a single dimension of the Dirichlet prior.
        int numTopics = 100;
        ParallelTopicModel model = new ParallelTopicModel(numTopics, 1.0, 0.01);

        model.addInstances(instances);

        // Use two parallel samplers, which each look at one half the corpus and combine
        //  statistics after every iteration.
        model.setNumThreads(2);

        // Run the model for 50 iterations and stop (this is for testing only, 
        //  for real applications, use 1000 to 2000 iterations)
        model.setNumIterations(50);
        model.estimate();

        // Show the words and topics in the first instance

        // The data alphabet maps word IDs to strings
        Alphabet dataAlphabet = instances.getDataAlphabet();
        
        FeatureSequence tokens = (FeatureSequence) model.getData().get(0).instance.getData();
        LabelSequence topics = model.getData().get(0).topicSequence;
        
        Formatter out = new Formatter(new StringBuilder(), Locale.US);
        for (int position = 0; position < tokens.getLength(); position++) {
            out.format("%s-%d ", dataAlphabet.lookupObject(tokens.getIndexAtPosition(position)), topics.getIndexAtPosition(position));
        }
        System.out.println(out);
        
        // Estimate the topic distribution of the first instance, 
        //  given the current Gibbs state.
        double[] topicDistribution = model.getTopicProbabilities(0);

        // Get an array of sorted sets of word ID/count pairs
        ArrayList<TreeSet<IDSorter>> topicSortedWords = model.getSortedWords();
        
        // Show top 5 words in topics with proportions for the first document
        for (int topic = 0; topic < numTopics; topic++) {
            Iterator<IDSorter> iterator = topicSortedWords.get(topic).iterator();
            
            out = new Formatter(new StringBuilder(), Locale.US);
            out.format("%d\t%.3f\t", topic, topicDistribution[topic]);
            int rank = 0;
            while (iterator.hasNext() && rank < 5) {
                IDSorter idCountPair = iterator.next();
                out.format("%s (%.0f) ", dataAlphabet.lookupObject(idCountPair.getID()), idCountPair.getWeight());
                rank++;
            }
            System.out.println(out);
        }
        
        // Create a new instance with high probability of topic 0
        StringBuilder topicZeroText = new StringBuilder();
        Iterator<IDSorter> iterator = topicSortedWords.get(0).iterator();

        int rank = 0;
        while (iterator.hasNext() && rank < 5) {
            IDSorter idCountPair = iterator.next();
            topicZeroText.append(dataAlphabet.lookupObject(idCountPair.getID()) + " ");
            rank++;
        }

        // Create a new instance named "test instance" with empty target and source fields.
        InstanceList testing = new InstanceList(instances.getPipe());
        testing.addThruPipe(new Instance(topicZeroText.toString(), null, "test instance", null));

        TopicInferencer inferencer = model.getInferencer();
        double[] testProbabilities = inferencer.getSampledDistribution(testing.get(0), 10, 1, 5);
        System.out.println("0\t" + testProbabilities[0]);
    }

}
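
The example prints a row for every one of the 100 topics, most of which have a very small proportion in any single document. As a rough sketch, a helper like the one below could rank the entries of an array such as topicDistribution and keep only the largest few. The class TopTopicsSketch and the method topTopics are hypothetical names, not part of Mallet, and the probabilities in main are made-up values.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class TopTopicsSketch {

    /** Returns the indices of the k largest entries of probs, highest first. */
    static List<Integer> topTopics(double[] probs, int k) {
        List<Integer> indices = new ArrayList<>();
        for (int i = 0; i < probs.length; i++) {
            indices.add(i);
        }
        // Sort topic indices by descending probability.
        indices.sort(Comparator.comparingDouble((Integer i) -> probs[i]).reversed());
        return new ArrayList<>(indices.subList(0, Math.min(k, indices.size())));
    }

    public static void main(String[] args) {
        // Made-up distribution over four topics (assumption for illustration).
        double[] example = { 0.05, 0.60, 0.10, 0.25 };
        for (int topic : topTopics(example, 2)) {
            System.out.printf("%d\t%.3f%n", topic, example[topic]);
        }
    }
}
```

Passing the array returned by model.getTopicProbabilities(0) in place of the made-up values would restrict the report to the topics that dominate the first document.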

 
