Mahout: TestNaiveBayesDriver Source Code Analysis

wbj0110

The `sequential` parameter decides whether the test runs locally; only the MapReduce execution is covered here.
The source code is as follows:

  private boolean runMapReduce(Map<String, List<String>> parsedArgs) throws IOException,
      InterruptedException, ClassNotFoundException {
    Path model = new Path(getOption("model"));
    HadoopUtil.cacheFiles(model, getConf());
    //the output key is the expected value, the output value are the scores for all the labels
    Job testJob = prepareJob(getInputPath(), getOutputPath(), SequenceFileInputFormat.class, BayesTestMapper.class,
            Text.class, VectorWritable.class, SequenceFileOutputFormat.class);
    boolean complementary = parsedArgs.containsKey("testComplementary");
    testJob.getConfiguration().set(COMPLEMENTARY, String.valueOf(complementary));
    boolean succeeded = testJob.waitForCompletion(true);
    return succeeded;
  }

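First the model is taken from the training output and instantiated, which simply means reading back the vectors that were written during training.
testJob uses only the map phase, as follows: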

  protected void map(Text key, VectorWritable value, Context context) throws IOException, InterruptedException {
    Vector result = classifier.classifyFull(value.get());
    //the key is the expected value
    context.write(new Text(key.toString().split("/")[1]), new VectorWritable(result));
  }

The output key is the class label as Text, and the output value is the input vector's score for every class.
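
To make the key handling concrete, here is a minimal sketch of what `key.toString().split("/")[1]` extracts. The key layout `/<label>/<docId>` and the sample key are assumptions for illustration; the class name `KeyLabelSketch` is hypothetical and not part of Mahout.

```java
// Sketch only: mimics the label extraction done in the mapper.
// Assumption: record keys look like "/<label>/<docId>".
class KeyLabelSketch {

    static String labelOf(String key) {
        // A leading "/" yields an empty first element,
        // so index 1 is the first path segment (the label).
        return key.split("/")[1];
    }

    public static void main(String[] args) {
        System.out.println(labelOf("/alt.atheism/53261"));
    }
}
```

If the key had no leading slash, index 1 would point at the second segment instead, so the extraction depends on how the input keys were written.
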
classifier.classifyFull() computes the score of the input vector for each label:

  public Vector classifyFull(Vector instance) {
    Vector score = model.createScoringVector();
    for (int label = 0; label < model.numLabels(); label++) {
      score.set(label, getScoreForLabelInstance(label, instance));
    }
    return score;
  }

getScoreForLabelInstance is shown below; it sums the feature scores under the given label.

  protected double getScoreForLabelInstance(int label, Vector instance) {
    double result = 0.0;
    Iterator<Element> elements = instance.iterateNonZero();
    while (elements.hasNext()) {
      Element e = elements.next();
      result += e.get() * getScoreForLabelFeature(label, e.index());
    }
    return result;
  }
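
The two methods above amount to a sparse dot product per label. The following is a rough stand-alone sketch of that idea, not Mahout's implementation: the `weights` table stands in for `getScoreForLabelFeature`, a `Map<Integer, Double>` stands in for the sparse `Vector`, and all names are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch only: per-label score as a weighted sum over non-zero features.
class SparseScoreSketch {

    // weights[label][feature] plays the role of getScoreForLabelFeature (assumption).
    static double scoreForLabel(double[][] weights, int label, Map<Integer, Double> instance) {
        double result = 0.0;
        for (Map.Entry<Integer, Double> e : instance.entrySet()) {
            // feature value times the per-label feature score
            result += e.getValue() * weights[label][e.getKey()];
        }
        return result;
    }

    // One score per label, like classifyFull.
    static double[] classifyFull(double[][] weights, Map<Integer, Double> instance) {
        double[] scores = new double[weights.length];
        for (int label = 0; label < weights.length; label++) {
            scores[label] = scoreForLabel(weights, label, instance);
        }
        return scores;
    }

    public static void main(String[] args) {
        double[][] weights = {{-1.0, -2.0, -3.0}, {-3.0, -2.0, -1.0}};
        Map<Integer, Double> instance = new LinkedHashMap<>();
        instance.put(0, 2.0); // feature 0 occurs with value 2
        instance.put(2, 1.0); // feature 2 occurs with value 1
        double[] scores = classifyFull(weights, instance);
        System.out.println(scores[0] + " " + scores[1]);
    }
}
```

Only the non-zero entries are visited, which is why the original code iterates with `iterateNonZero()`.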

getScoreForLabelFeature is computed in one of two ways.
1. Standard Bayes: log[(Wi + alphaI) / (ΣWi + alphaI * N)], where N is the number of features.

  public double getScoreForLabelFeature(int label, int feature) {
    NaiveBayesModel model = getModel();
    return computeWeight(model.weight(label, feature), model.labelWeight(label), model.alphaI(),
        model.numFeatures());
  }
  public static double computeWeight(double featureLabelWeight, double labelWeight, double alphaI,
      double numFeatures) {
    double numerator = featureLabelWeight + alphaI;
    double denominator = labelWeight + alphaI * numFeatures;
    return Math.log(numerator / denominator);
  }
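
A small worked example may help show what the smoothing term does. The sketch below copies the `computeWeight` arithmetic into a hypothetical class `StandardWeightSketch`; the input numbers are made up for illustration.

```java
// Sketch only: the standard weight formula with invented inputs.
class StandardWeightSketch {

    static double computeWeight(double featureLabelWeight, double labelWeight,
                                double alphaI, double numFeatures) {
        double numerator = featureLabelWeight + alphaI;
        double denominator = labelWeight + alphaI * numFeatures;
        return Math.log(numerator / denominator);
    }

    public static void main(String[] args) {
        // Assumed values: label weight 10, alphaI 1, 5 features.
        double seen = computeWeight(3.0, 10.0, 1.0, 5.0);   // log(4 / 15)
        double unseen = computeWeight(0.0, 10.0, 1.0, 5.0); // log(1 / 15)
        System.out.println(seen);
        System.out.println(unseen);
    }
}
```

Because alphaI is added to the numerator, a feature never seen with the label still gets a finite (though lower) score instead of log(0).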

2. Complementary Bayes, which computes the value from all classes other than this one.

  //complementary bayes
  public double getScoreForLabelFeature(int label, int feature) {
    NaiveBayesModel model = getModel();
    return computeWeight(model.featureWeight(feature), model.weight(label, feature),
        model.totalWeightSum(), model.labelWeight(label), model.alphaI(), model.numFeatures());
  }
  public static double computeWeight(double featureWeight, double featureLabelWeight,
      double totalWeight, double labelWeight, double alphaI, double numFeatures) {
    double numerator = featureWeight - featureLabelWeight + alphaI;
    double denominator = totalWeight - labelWeight + alphaI * numFeatures;
    return -Math.log(numerator / denominator);
  }
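
The complementary variant can be checked with the same kind of worked example. The sketch below repeats the arithmetic in a hypothetical class `ComplementaryWeightSketch`; all input numbers are invented.

```java
// Sketch only: the complementary weight formula with invented inputs.
class ComplementaryWeightSketch {

    static double computeWeight(double featureWeight, double featureLabelWeight,
                                double totalWeight, double labelWeight,
                                double alphaI, double numFeatures) {
        // Weight of this feature in all OTHER labels, smoothed by alphaI.
        double numerator = featureWeight - featureLabelWeight + alphaI;
        // Total weight of all OTHER labels, smoothed by alphaI * numFeatures.
        double denominator = totalWeight - labelWeight + alphaI * numFeatures;
        return -Math.log(numerator / denominator);
    }

    public static void main(String[] args) {
        // Assumed values: total weight 30, label weight 10, alphaI 1, 5 features.
        double rareElsewhere = computeWeight(8.0, 3.0, 30.0, 10.0, 1.0, 5.0);    // -log(6 / 25)
        double commonElsewhere = computeWeight(20.0, 3.0, 30.0, 10.0, 1.0, 5.0); // -log(18 / 25)
        System.out.println(rareElsewhere);
        System.out.println(commonElsewhere);
    }
}
```

The negated logarithm means that a feature which is frequent in the other labels lowers the score for this label, while a feature that is rare elsewhere raises it.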

The last step is analyze: for each key, the index of the maximum in the score vector is compared with the label index, which produces the confusion matrix.
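
As a rough illustration of that last step, and not the code Mahout itself uses, the sketch below picks the highest-scoring label for each record and counts (expected, predicted) pairs. The class `ConfusionMatrixSketch`, its methods, and the sample data are all hypothetical.

```java
// Sketch only: argmax over scores, then tally a confusion matrix.
class ConfusionMatrixSketch {

    // Index of the largest score; ties go to the lowest index.
    static int argMax(double[] scores) {
        int best = 0;
        for (int i = 1; i < scores.length; i++) {
            if (scores[i] > scores[best]) {
                best = i;
            }
        }
        return best;
    }

    // matrix[expected][predicted] counts how often each pair occurs.
    static int[][] confusion(int[] expected, double[][] scores, int numLabels) {
        int[][] matrix = new int[numLabels][numLabels];
        for (int i = 0; i < expected.length; i++) {
            matrix[expected[i]][argMax(scores[i])]++;
        }
        return matrix;
    }

    public static void main(String[] args) {
        int[] expected = {0, 1, 1};
        double[][] scores = {{-5.0, -7.0}, {-6.0, -4.0}, {-3.0, -9.0}};
        int[][] m = confusion(expected, scores, 2);
        System.out.println(m[0][0] + " " + m[0][1]);
        System.out.println(m[1][0] + " " + m[1][1]);
    }
}
```

The diagonal holds the correctly classified records; off-diagonal cells show which labels get mistaken for which.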

http://hnote.org/big-data/mahout/mahout-testnaivebayesdriver-testnb

2014-06-19 10:46