Hive: compressing intermediate and final result data

bupt04406

Blog categories:

Hive
hadoop

Hadoop: The Definitive Guide, 2nd Edition, page 79 lists Hadoop's default compression algorithm:
DEFLATE org.apache.hadoop.io.compress.DefaultCodec
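
To give a feel for what DEFLATE does to repetitive data, here is a minimal sketch that uses only the JDK's java.util.zip classes. It is not Hadoop's DefaultCodec; the class name DeflateRoundTrip and its helper methods are made up for illustration.

```java
import java.io.ByteArrayOutputStream;
import java.util.Arrays;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Hypothetical helper, not part of Hadoop or Hive.
final class DeflateRoundTrip {

    // Compress the input with DEFLATE at the default compression level.
    static byte[] compress(byte[] input) {
        Deflater deflater = new Deflater();
        deflater.setInput(input);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[1024];
        while (!deflater.finished()) {
            int n = deflater.deflate(buf);
            out.write(buf, 0, n);
        }
        deflater.end();
        return out.toByteArray();
    }

    // Reverse of compress(); malformed input is reported as an unchecked exception.
    static byte[] decompress(byte[] compressed) {
        Inflater inflater = new Inflater();
        inflater.setInput(compressed);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[1024];
        try {
            while (!inflater.finished()) {
                int n = inflater.inflate(buf);
                if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
                    break; // truncated input: stop instead of looping forever
                }
                out.write(buf, 0, n);
            }
        } catch (DataFormatException e) {
            throw new IllegalStateException("input is not valid DEFLATE data", e);
        } finally {
            inflater.end();
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] input = new byte[10000];
        Arrays.fill(input, (byte) 'a'); // highly repetitive, so it compresses well
        byte[] packed = compress(input);
        System.out.println(input.length + " bytes -> " + packed.length + " bytes");
        System.out.println("round trip ok: " + Arrays.equals(input, decompress(packed)));
    }
}
```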

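Whether compression of the result data is enabled. The value below is true, so it is on.
This setting applies to the final result data of a query: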
<property>
<name>hive.exec.compress.output</name>
<value>true</value>
<description> This controls whether the final outputs of a query (to a local/hdfs file or a hive table) is compressed. The compression codec and other options are determined from hadoop config variables mapred.output.compress* </description>
</property>

The mapred.output.compression.codec option determines the compression algorithm.

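This setting controls whether intermediate result data is compressed. When a single SQL statement is compiled into several MR jobs, the output of every job except the last one can be compressed.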
<property>
<name>hive.exec.compress.intermediate</name>
<value>true</value>
<description> This controls whether intermediate files produced by hive between multiple map-reduce jobs are compressed. The compression codec and other options are determined from hadoop config variables mapred.output.compress* </description>
</property>
Compression algorithm used for intermediate result data:
<property>
<name>hive.intermediate.compression.codec</name>
<value>org.apache.hadoop.io.compress.LzoCodec</value>
</property>
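
The division of labor between the two switches can be sketched in a few lines of plain Java. This is only an illustration of the rule described above (the last job follows hive.exec.compress.output, every earlier job follows hive.exec.compress.intermediate); the class CompressionPlanSketch and its method are hypothetical and do not exist in Hive.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration, not Hive code.
final class CompressionPlanSketch {

    // For a query compiled into jobCount chained MR jobs, returns one flag per job
    // saying whether that job's output would be compressed.
    static List<Boolean> outputCompressed(int jobCount,
                                          boolean compressIntermediate,
                                          boolean compressOutput) {
        List<Boolean> flags = new ArrayList<>();
        for (int i = 0; i < jobCount; i++) {
            boolean isLastJob = (i == jobCount - 1);
            // Only the last job writes the final result; the others write intermediate files.
            flags.add(isLastJob ? compressOutput : compressIntermediate);
        }
        return flags;
    }

    public static void main(String[] args) {
        // Three jobs, intermediate compression on, final output compression off.
        System.out.println(outputCompressed(3, true, false)); // prints [true, true, false]
    }
}
```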
Here the default file format is set to SequenceFile (Hive itself ships with TextFile as the default):
<property>
<name>hive.default.fileformat</name>
<value>SequenceFile</value>
<description>Default file format for CREATE TABLE statement. Options are TextFile and SequenceFile. Users can explicitly say CREATE TABLE ... STORED AS <TEXTFILE|SEQUENCEFILE> to override</description>
</property>

In HiveConf (both switches default to false; the codec and type default to empty strings):
    COMPRESSRESULT("hive.exec.compress.output", false),
    COMPRESSINTERMEDIATE("hive.exec.compress.intermediate", false),
    COMPRESSINTERMEDIATECODEC("hive.intermediate.compression.codec", ""),
    COMPRESSINTERMEDIATETYPE("hive.intermediate.compression.type", ""),

Final result data compression (hive.exec.compress.output):
SemanticAnalyzer:
private Operator genFileSinkPlan(String dest, QB qb, Operator input)
      throws SemanticException {

        Operator output = putOpInsertMap(
        OperatorFactory.getAndMakeChild(
            new FileSinkDesc(
                queryTmpdir,
                table_desc,
                conf.getBoolVar(HiveConf.ConfVars.COMPRESSRESULT), // whether the result data is compressed
                currentTableId,
                rsCtx.isMultiFileSpray(),
                rsCtx.getNumFiles(),
                rsCtx.getTotalFiles(),
                rsCtx.getPartnCols(),
                dpCtx),
            fsRS, input), inputRR);

}

FileSinkOperator:
private void createEmptyBuckets(Configuration hconf, ArrayList<String> paths)
      throws HiveException, IOException {

    for (String p: paths) {
      Path path = new Path(p);
      RecordWriter writer = HiveFileFormatUtils.getRecordWriter(
          jc, hiveOutputFormat, outputClass, isCompressed, tableInfo.getProperties(), path); // create the RecordWriter
      writer.close(false);
      LOG.info("created empty bucket for enforcing bucketing at " + path);
    }

}

HiveFileFormatUtils:
public static RecordWriter getRecordWriter(JobConf jc,
      HiveOutputFormat<?, ?> hiveOutputFormat,
      final Class<? extends Writable> valueClass, boolean isCompressed,
      Properties tableProp, Path outPath) throws IOException, HiveException {
    if (hiveOutputFormat != null) {
      return hiveOutputFormat.getHiveRecordWriter(jc, outPath, valueClass,
          isCompressed, tableProp, null);
    }
    return null;
}

HiveSequenceFileOutputFormat:
public class HiveSequenceFileOutputFormat extends SequenceFileOutputFormat
    implements HiveOutputFormat<WritableComparable, Writable> {

      public RecordWriter getHiveRecordWriter(JobConf jc, Path finalOutPath,
      Class<? extends Writable> valueClass, boolean isCompressed,
      Properties tableProperties, Progressable progress) throws IOException {

         FileSystem fs = finalOutPath.getFileSystem(jc);
         final SequenceFile.Writer outStream = Utilities.createSequenceWriter(jc,
             fs, finalOutPath, BytesWritable.class, valueClass, isCompressed);

      }

}

Utilities:
public static SequenceFile.Writer createSequenceWriter(JobConf jc, FileSystem fs, Path file,
      Class<?> keyClass, Class<?> valClass, boolean isCompressed) throws IOException {
    CompressionCodec codec = null;
    CompressionType compressionType = CompressionType.NONE;
    Class codecClass = null;
    if (isCompressed) {
      compressionType = SequenceFileOutputFormat.getOutputCompressionType(jc);
      codecClass = FileOutputFormat.getOutputCompressorClass(jc, DefaultCodec.class); // the default codec is org.apache.hadoop.io.compress.DefaultCodec
      codec = (CompressionCodec) ReflectionUtils.newInstance(codecClass, jc);
    }
    return (SequenceFile.createWriter(fs, jc, file, keyClass, valClass, compressionType, codec));

}
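
The gate in createSequenceWriter can be mimicked without Hadoop on the classpath. The sketch below is a stand-in, not the real code: WriterSettingsSketch, Settings and resolve are hypothetical names, a plain Map plays the role of JobConf, and the key mapred.output.compression.type with a RECORD fallback is an assumption about what SequenceFileOutputFormat.getOutputCompressionType reads, since the excerpt above does not show it.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for the decision made in Utilities.createSequenceWriter.
final class WriterSettingsSketch {

    // Mirrors the three values of SequenceFile.CompressionType.
    enum CompressionType { NONE, RECORD, BLOCK }

    static final class Settings {
        final CompressionType type;
        final String codecClassName; // null means "no codec"

        Settings(CompressionType type, String codecClassName) {
            this.type = type;
            this.codecClassName = codecClassName;
        }
    }

    static Settings resolve(Map<String, String> conf, boolean isCompressed) {
        if (!isCompressed) {
            // Same starting values as in the excerpt: NONE and a null codec.
            return new Settings(CompressionType.NONE, null);
        }
        // Assumed key and fallback; see the note above the code.
        CompressionType type = CompressionType.valueOf(
            conf.getOrDefault("mapred.output.compression.type", "RECORD"));
        // The codec falls back to DefaultCodec when nothing is configured.
        String codec = conf.getOrDefault("mapred.output.compression.codec",
            "org.apache.hadoop.io.compress.DefaultCodec");
        return new Settings(type, codec);
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        conf.put("mapred.output.compression.type", "BLOCK");
        Settings s = resolve(conf, true);
        System.out.println(s.type + " " + s.codecClassName);
    }
}
```

The point of the sketch is that the codec settings are consulted only when isCompressed is true; with the flag off, whatever is configured under the mapred.output.compression keys has no effect on the writer.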

FileOutputFormat:
public static Class<? extends CompressionCodec>
getOutputCompressorClass(JobConf conf,
                       Class<? extends CompressionCodec> defaultValue) {
    Class<? extends CompressionCodec> codecClass = defaultValue;

    String name = conf.get("mapred.output.compression.codec"); // the codec can be configured through this option
    if (name != null) {
      try {
        codecClass =
        conf.getClassByName(name).asSubclass(CompressionCodec.class);
      } catch (ClassNotFoundException e) {
        throw new IllegalArgumentException("Compression codec " + name +
                                           " was not found.", e);
      }
    }
    return codecClass;
}
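
The lookup pattern used by getOutputCompressorClass (read a class name from the configuration, fall back to a default, load it by name and check its type with asSubclass) can be tried out with the JDK alone. In this sketch the Codec interface, the two codec classes and the Map standing in for JobConf are all hypothetical; only the configuration key and the error message are taken from the excerpt above.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for the lookup in FileOutputFormat.getOutputCompressorClass.
final class CodecLookupSketch {

    interface Codec { String name(); }

    static final class DeflateLikeCodec implements Codec {
        public String name() { return "deflate"; }
    }

    static final class LzoLikeCodec implements Codec {
        public String name() { return "lzo"; }
    }

    // Returns the configured codec class, or defaultValue when the key is unset.
    static Class<? extends Codec> codecClass(Map<String, String> conf,
                                             Class<? extends Codec> defaultValue) {
        String name = conf.get("mapred.output.compression.codec");
        if (name == null) {
            return defaultValue;
        }
        try {
            // asSubclass throws ClassCastException if the class is not a Codec.
            return Class.forName(name).asSubclass(Codec.class);
        } catch (ClassNotFoundException e) {
            throw new IllegalArgumentException(
                "Compression codec " + name + " was not found.", e);
        }
    }

    // Instantiates the resolved class reflectively, with DeflateLikeCodec as the default.
    static Codec newCodec(Map<String, String> conf) {
        try {
            return codecClass(conf, DeflateLikeCodec.class)
                .getDeclaredConstructor().newInstance();
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("cannot instantiate codec", e);
        }
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        System.out.println(newCodec(conf).name()); // key unset, so the default is used
        conf.put("mapred.output.compression.codec", LzoLikeCodec.class.getName());
        System.out.println(newCodec(conf).name()); // configured class is used
    }
}
```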

Intermediate result data compression:
GenMapRedUtils.splitTasks:
public static void splitTasks(Operator<? extends Serializable> op,
      Task<? extends Serializable> parentTask,
      Task<? extends Serializable> childTask, GenMRProcContext opProcCtx,
      boolean setReducer, boolean local, int posn) throws SemanticException {

    // Create a file sink operator for this file name
    boolean compressIntermediate = parseCtx.getConf().getBoolVar(
        HiveConf.ConfVars.COMPRESSINTERMEDIATE);
    FileSinkDesc desc = new FileSinkDesc(taskTmpDir, tt_desc,
        compressIntermediate);
    if (compressIntermediate) {
      desc.setCompressCodec(parseCtx.getConf().getVar(
          HiveConf.ConfVars.COMPRESSINTERMEDIATECODEC));
      desc.setCompressType(parseCtx.getConf().getVar(
          HiveConf.ConfVars.COMPRESSINTERMEDIATETYPE));
    }
    Operator<? extends Serializable> fs_op = putOpInsertMap(OperatorFactory
        .get(desc, parent.getSchema()), null, parseCtx);

}
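
What splitTasks puts into the file sink descriptor can be traced with a small model. The sketch below is hypothetical (IntermediateSinkSketch and SinkDesc are invented names, and a Map replaces HiveConf); the three keys and their defaults are the ones from the HiveConf listing above. How an empty codec or type string is handled further downstream is not shown in the excerpt, so the sketch just copies the values.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical model of the descriptor built in GenMapRedUtils.splitTasks.
final class IntermediateSinkSketch {

    static final class SinkDesc {
        final boolean compressed;
        String compressCodec; // stays null when compression is off
        String compressType;  // stays null when compression is off

        SinkDesc(boolean compressed) {
            this.compressed = compressed;
        }
    }

    static SinkDesc describe(Map<String, String> conf) {
        boolean compress = Boolean.parseBoolean(
            conf.getOrDefault("hive.exec.compress.intermediate", "false"));
        SinkDesc desc = new SinkDesc(compress);
        if (compress) {
            // Codec and type are copied only when the switch is on.
            desc.compressCodec = conf.getOrDefault("hive.intermediate.compression.codec", "");
            desc.compressType = conf.getOrDefault("hive.intermediate.compression.type", "");
        }
        return desc;
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        conf.put("hive.exec.compress.intermediate", "true");
        conf.put("hive.intermediate.compression.codec", "org.apache.hadoop.io.compress.LzoCodec");
        SinkDesc desc = describe(conf);
        System.out.println(desc.compressed + " " + desc.compressCodec);
    }
}
```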


2011-09-13 21:13