[Mahout 1] Meaning of Mahout Command Parameters

bit1129

1. mahout seqdirectory

    $ mahout seqdirectory
        --input (-i) input               Path to job input directory (raw text files).
        --output (-o) output             The directory pathname for output (a <Text,Text> SequenceFile).
        --overwrite (-ow)                If set, overwrite the output directory.

Purpose: convert the raw text dataset into a <Text, Text> SequenceFile.
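To make the <Text, Text> idea concrete, here is a minimal Python sketch of the kind of key/value pairs such a conversion produces. It is only an illustration, not Mahout's implementation: the function name `directory_to_pairs` is hypothetical, the key layout (relative path with a leading slash) is an assumption, and the sketch yields plain Python tuples rather than writing a real Hadoop SequenceFile.

```python
import os


def directory_to_pairs(input_dir):
    """Yield one (key, text) pair per file found under input_dir.

    Assumption for illustration: the key is the file path relative to
    input_dir with a leading '/', and the value is the file's full text.
    """
    for root, _dirs, files in sorted(os.walk(input_dir)):
        for name in sorted(files):
            path = os.path.join(root, name)
            rel = os.path.relpath(path, input_dir).replace(os.sep, "/")
            # errors="replace" keeps the sketch from failing on odd encodings
            with open(path, encoding="utf-8", errors="replace") as fh:
                yield "/" + rel, fh.read()
```

Each document becomes one record, which is why the later steps can treat the whole corpus as a single stream of (document id, document text) pairs.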

2. mahout seq2sparse

Purpose: convert and preprocess the dataset (a <Text,Text> SequenceFile) into a <Text, VectorWritable> SequenceFile containing term frequencies for each document.

In other words, it turns the SequenceFile into TF-IDF vector files.

Note: to use different parsing methods or transformations on the term frequency vectors, supply different options here, e.g. -ng 2 for bigrams or -n 2 for L2 length normalization.

    mahout seq2sparse                         
      --output (-o) output             The directory pathname for output.        
      --input (-i) input               Path to job input directory.              
      --weight (-wt) weight            The kind of weight to use. Currently TF   
                                           or TFIDF. Default: TFIDF                  
      --norm (-n) norm                 The norm to use, expressed as either a    
                                           float or "INF" if you want to use the     
                                           Infinite norm.  Must be greater or equal  
                                           to 0.  The default is not to normalize    
      --overwrite (-ow)                If set, overwrite the output directory    
      --sequentialAccessVector (-seq)  (Optional) Whether output vectors should  
                                           be SequentialAccessVectors. If set true   
                                           else false                                
      --namedVector (-nv)              (Optional) Whether output vectors should  
                                           be NamedVectors. If set true else false

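-i  Directory containing the input SequenceFile

-o  Output directory for the vector files

-wt  Weight type; either TF or TFIDF, default TFIDF

-n  Norm to use, given as a float or "INF" for the infinity norm; must be greater than or equal to 0, and by default no normalization is applied

-ow  If given, overwrite the existing output directory

-seq  If given, the output vectors are SequentialAccessVectors

-nv  If given, the output vectors are NamedVectors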
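The effect of the weight (-wt) and norm (-n) options can be sketched in a few lines of Python. This is a textbook version under stated assumptions, not Mahout's code: it uses the plain `tf * log(N / df)` formula (Mahout's exact weighting may differ), whitespace tokenization, and the function names are made up for this example.

```python
import math
from collections import Counter


def tf_vectors(docs):
    """Term-frequency vector per document (whitespace tokenization)."""
    return [Counter(doc.split()) for doc in docs]


def tfidf_vectors(docs):
    """Textbook TF-IDF: tf * log(N / df). An assumption, not Mahout's formula."""
    tfs = tf_vectors(docs)
    n_docs = len(docs)
    # df[t] = number of documents that contain term t
    df = Counter(term for tf in tfs for term in tf)
    return [{t: c * math.log(n_docs / df[t]) for t, c in tf.items()} for tf in tfs]


def normalize(vec, p):
    """Divide a sparse vector by its p-norm (p > 0, or math.inf)."""
    if not vec:
        return {}
    if p == math.inf:
        norm = max(abs(v) for v in vec.values())
    else:
        norm = sum(abs(v) ** p for v in vec.values()) ** (1.0 / p)
    # Leave all-zero vectors unchanged to avoid dividing by zero
    return {t: v / norm for t, v in vec.items()} if norm else dict(vec)
```

With `p = 2` every non-zero vector ends up with unit Euclidean length, which corresponds to the "-n 2" case mentioned for L2 length normalization; `math.inf` corresponds to "INF".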

3. mahout split

Purpose: split the preprocessed dataset into training and testing sets.

That is, it divides the preprocessed TF-IDF vectors into a training vector set and a testing vector set.

    $ mahout split \
        -i ${WORK_DIR}/20news-vectors/tfidf-vectors \
        --trainingOutput ${WORK_DIR}/20news-train-vectors \
        --testOutput ${WORK_DIR}/20news-test-vectors \
        --randomSelectionPct 40 \
        --overwrite --sequenceFiles -xm sequential

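Note: the command above randomly splits the vector dataset into training data and test data at a 40-60 ratio; with --randomSelectionPct 40, 40% of the vectors are randomly selected as test data and the remaining 60% are used for training.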
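A percentage-based random split can be sketched as follows. This is an illustration only: the shuffle-then-cut approach and the name `random_split` are assumptions for this example, not a description of how Mahout selects the test items internally.

```python
import random


def random_split(items, test_pct, seed=None):
    """Return (train, test), with roughly test_pct percent of items in test."""
    rng = random.Random(seed)  # a seed makes the split reproducible
    shuffled = list(items)
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_pct / 100)
    return shuffled[n_test:], shuffled[:n_test]
```

For example, `random_split(range(10), 40, seed=0)` returns a training list of 6 items and a test list of 4 items, and every input item lands in exactly one of the two lists.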

4. mahout trainnb

Purpose: train the classifier.

    mahout trainnb
      --input (-i) input               Path to job input directory.
      --output (-o) output             The directory pathname for output.
      --alphaI (-a) alphaI             Smoothing parameter. Default is 1.0
      --trainComplementary (-c)        Train complementary? Default is false.
      --labelIndex (-li) labelIndex    The path to store the label index in
      --overwrite (-ow)                If present, overwrite the output directory
                                           before running job
      --help (-h)                      Print out help
      --tempDir tempDir                Intermediate output directory
      --startPhase startPhase          First phase to run
      --endPhase endPhase              Last phase to run

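-i  Input path

-o  Output path

-a  Smoothing parameter, default 1.0

-c  Train the complementary model; default false

-li  Path where the label index is stored

-ow  If given, overwrite the output directory before running the job

tempDir  Directory for the intermediate results of the MapReduce job

startPhase  First phase to run

endPhase  Last phase to run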
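To show what a smoothing parameter such as alphaI does, here is a small textbook multinomial Naive Bayes trainer with additive smoothing. It is a sketch under assumptions, not Mahout's trainer: it works on raw term counts, ignores the complementary variant, and the name `train_nb` and the model layout are hypothetical.

```python
import math
from collections import Counter, defaultdict


def train_nb(labeled_docs, alpha=1.0):
    """Train from an iterable of (label, list_of_terms) pairs.

    Smoothed estimate: P(term | label) = (count + alpha) / (total + alpha * V),
    where V is the vocabulary size.
    """
    term_counts = defaultdict(Counter)
    doc_counts = Counter()
    vocab = set()
    for label, terms in labeled_docs:
        doc_counts[label] += 1
        term_counts[label].update(terms)
        vocab.update(terms)

    n_docs = sum(doc_counts.values())
    model = {}
    for label in doc_counts:
        total = sum(term_counts[label].values())
        denom = total + alpha * len(vocab)
        model[label] = {
            "log_prior": math.log(doc_counts[label] / n_docs),
            "log_likelihood": {
                t: math.log((term_counts[label][t] + alpha) / denom) for t in vocab
            },
        }
    return model
```

With `alpha > 0`, a term that never appeared under a label still gets a small non-zero probability, so a single unseen word cannot force a document's score for that label to zero.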

5. mahout testnb

Purpose: test the Bayes classifier.

    mahout testnb
      --input (-i) input               Path to job input directory.
      --output (-o) output             The directory pathname for output.
      --overwrite (-ow)                If present, overwrite the output directory
                                           before running job
      --model (-m) model               The path to the model built during training
      --testComplementary (-c)         Test complementary? Default is false.
      --runSequential (-seq)           Run sequential?
      --labelIndex (-l) labelIndex     The path to the location of the label index
      --help (-h)                      Print out help
      --tempDir tempDir                Intermediate output directory
      --startPhase startPhase          First phase to run
      --endPhase endPhase              Last phase to run

-i  Input path

-o  Output path

-ow  Overwrite the output directory

-c  Test the complementary model; default false
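Testing a classifier comes down to comparing predicted labels with the true ones. The following sketch shows one common way to summarize that comparison as an accuracy value and a confusion matrix. The function name `evaluate` and the return format are assumptions made for this example and say nothing about the format of Mahout's own report.

```python
from collections import Counter


def evaluate(true_labels, predicted_labels):
    """Return (accuracy, confusion), where confusion[(true, predicted)] is a count."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")
    confusion = Counter(zip(true_labels, predicted_labels))
    correct = sum(n for (t, p), n in confusion.items() if t == p)
    accuracy = correct / len(true_labels) if true_labels else 0.0
    return accuracy, confusion
```

Off-diagonal entries such as `confusion[("a", "b")]` count documents of class "a" that were classified as "b", which shows which classes get mixed up with each other.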

2015-05-23 13:30