
hadoop-compression

 
Further reading:

http://blog.cloudera.com/blog/2009/11/hadoop-at-twitter-part-1-splittable-lzo-compression/

(namely: making Hadoop support splittable LZO compression)

 

 

-What is the flow of uploading a gzip file to DFS?

  The file is split into HDFS blocks one after another, so every block except possibly the last is full, and the bytes in these blocks are not changed in any way:

hadoop@host-08:~$ hadoop fsck /user/hadoop/mr-test-data.zj.tar.gz -blocks -locations  -files

FSCK started by hadoop from /192.168.12.108 for path /user/hadoop/mr-test-data.zj.tar.gz at Mon Oct 26 17:22:24 CST 2015
/user/hadoop/mr-test-data.zj.tar.gz 173826303 bytes, 2 block(s):  OK
0. blk_-6142856910439989465_2680086 len=134217728 repl=3 [192.168.12.148:50010, 192.168.12.110:50010, 192.168.12.132:50010]
1. blk_-9182536886628119965_2680086 len=39608575 repl=3 [192.168.12.110:50010, 192.168.12.134:50010, 192.168.12.140:50010]

 Compared with the raw file, the total is identical (134217728 + 39608575 = 173826303 bytes):

ls -l
-rw-r--r--  1 hadoop hadoopgrp 173826303 Apr 23  2014 mr-test-data.zj.tar.gz
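
 To double-check that upload is a pure byte-range split, here is a minimal sketch (assuming a reachable HDFS and the file above): it prints each block's offset, length, and hosts, and checks that the block lengths simply sum to the file length, i.e. the gzip bytes were split, not re-encoded.

import java.util.Arrays;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLayout {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    FileStatus st = fs.getFileStatus(new Path("/user/hadoop/mr-test-data.zj.tar.gz"));
    long total = 0;
    for (BlockLocation b : fs.getFileBlockLocations(st, 0, st.getLen())) {
      // each block is a contiguous byte range of the original file
      System.out.println("offset=" + b.getOffset() + " len=" + b.getLength()
          + " hosts=" + Arrays.toString(b.getHosts()));
      total += b.getLength();
    }
    System.out.println(total + " == " + st.getLen()); // 173826303 == 173826303
  }
}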

 

-What is the order of operations when writing a gzip file to DFS: split -> compress, or compress -> split?

   TODO: see HBase's source.

 

Conclusion:

-A new record does not always mean one line of text per split; it may be one key/value pair, etc.

-Hadoop's block-level splitting is unrelated to 'splittable': splittability is a property of the file format, not of HDFS block boundaries.

-An LZO-formatted file is splittable only if an LZO index file has been generated for it; the index accompanies the data file and records the offsets of its compressed blocks (see the sketch below).

 This is similar to HBase's HFile format.
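
 A minimal indexing sketch (assuming the hadoop-lzo jar and native liblzo2 are on the classpath; the .lzo path is hypothetical). LzoIndexer writes a .index file next to the data file, recording the start offset of each compressed block so that LzoTextInputFormat can align input splits to those offsets:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import com.hadoop.compression.lzo.LzoIndexer;

public class IndexLzo {
  public static void main(String[] args) throws Exception {
    // writes /user/hadoop/big_file.lzo.index next to the data file
    new LzoIndexer(new Configuration()).index(new Path("/user/hadoop/big_file.lzo"));
  }
}

 The same indexer can be run from the command line (hadoop jar /path/to/hadoop-lzo.jar com.hadoop.compression.lzo.LzoIndexer /user/hadoop/big_file.lzo), as described in the Cloudera post linked above.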

  Gzip is not splittable, so only one map task processes it (see the split check sketched below):

job_201411101612_0397	NORMAL	hadoop	word count	100.00% 1 	0
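
  That single map is decided by the input format's split check; roughly, paraphrasing the newer mapreduce TextInputFormat source:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;
import org.apache.hadoop.io.compress.SplittableCompressionCodec;
import org.apache.hadoop.mapreduce.JobContext;

final class SplitCheck {
  // A file is splittable when it is not compressed at all, or when its
  // codec implements SplittableCompressionCodec (BZip2Codec does, and
  // indexed-LZO input formats override this check; GzipCodec does not).
  static boolean isSplitable(JobContext context, Path file) {
    CompressionCodec codec =
        new CompressionCodecFactory(context.getConfiguration()).getCodec(file);
    return codec == null || codec instanceof SplittableCompressionCodec;
  }
}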

  But for HBase's HFile with Snappy compression there is more than one mapper:

hadoop@host-08:/usr/local/hadoop/hadoop-1.0.3$ hbase hfile  -s -f /hbase/archive/f63235f4a6d84c84722f82ffd8122206/fml/b7e2701a60764f9a940912743b55d4e0
15/10/26 17:51:56 INFO util.ChecksumType: Checksum can use java.util.zip.CRC32
15/10/26 17:51:56 INFO hfile.CacheConfig: Allocating LruBlockCache with maximum size 3.7g
15/10/26 17:51:57 WARN snappy.LoadSnappy: Snappy native library is available

 

job_201411101612_0396	NORMAL	hadoop	word count	100.00% 51	51

   So you can think of it as a plain file: the HFile (with Snappy compression) only compresses the key/value bytes streamed into it, block by block, instead of generating a real whole-file *.snappy archive.
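
 For completeness, a minimal sketch of how such a column family would be declared (hypothetical table/family names, assuming an HBase 1.x client API); the point is that SNAPPY here is a per-HFile-block setting, not a whole-file codec:

import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.io.compress.Compression;

public class SnappyFamily {
  public static HTableDescriptor describe() {
    HTableDescriptor table = new HTableDescriptor(TableName.valueOf("mytable"));
    HColumnDescriptor fam = new HColumnDescriptor("fml");
    fam.setCompressionType(Compression.Algorithm.SNAPPY); // compresses each HFile block
    table.addFamily(fam);
    return table;
  }
}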

 

 
