1. Install LZO
sudo apt-get install liblzo2-dev
Or download lzo from http://www.oberhumer.com/opensource/lzo/download/ :
wget http://www.oberhumer.com/opensource/lzo/download/lzo-2.06.tar.gz
./configure --enable-shared
make
make install
2. Install hadoop-lzo
wget https://github.com/kevinweil/hadoop-lzo/archive/master.zip
or: git clone https://github.com/kevinweil/hadoop-lzo.git
On a 64-bit machine:
export CFLAGS=-m64
export CXXFLAGS=-m64
On a 32-bit machine:
export CFLAGS=-m32
export CXXFLAGS=-m32
Build and package: ant compile-native tar
An error came up during the build:
compile-native:
[mkdir] Created dir: /home/caodaoxi/soft/hadoop-lzo/build/native/Linux-i386-32/lib
[mkdir] Created dir: /home/caodaoxi/soft/hadoop-lzo/build/native/Linux-i386-32/src/com/hadoop/compression/lzo
[javah] Error: class org.apache.hadoop.conf.Configuration not found.
BUILD FAILED
/home/caodaoxi/soft/hadoop-lzo/build.xml:269: compilation failed
Fix: add <classpath refid="classpath"/> to the javah task in build.xml:
<javah classpath="${build.classes}"
       destdir="${build.native}/src/com/hadoop/compression/lzo"
       force="yes" verbose="yes">
  <class name="com.hadoop.compression.lzo.LzoCompressor" />
  <class name="com.hadoop.compression.lzo.LzoDecompressor" />
  <classpath refid="classpath"/>
</javah>
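The 32-/64-bit flag choice above can be automated instead of set by hand; a minimal sketch, assuming `uname -m` reliably reports the target architecture:

```shell
# Select 32- or 64-bit compile flags for the native build
# based on the machine architecture reported by uname -m.
case "$(uname -m)" in
  x86_64|amd64|aarch64) BITS=64 ;;
  *)                    BITS=32 ;;
esac
export CFLAGS="-m${BITS}"
export CXXFLAGS="-m${BITS}"
echo "CFLAGS=${CFLAGS} CXXFLAGS=${CXXFLAGS}"
# ant compile-native tar   # then run the build with these flags exported
```

Running this before `ant compile-native tar` avoids the mismatch described in step 7 below, where a 32-bit native library was built on a 64-bit machine.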
3. Copy the native folder from the hadoop-lzo build directory into Hadoop's lib directory
cp -r /home/hadoop/soft/hadoop-lzo/build/native /home/hadoop/soft/hadoop/lib/
4. Copy the hadoop-lzo jar into Hadoop's lib directory
cp /home/hadoop/soft/hadoop-lzo/build/hadoop-lzo-0.4.15/hadoop-lzo-0.4.15.jar /home/hadoop/soft/hadoop/share/hadoop/lib
5. Configure Hadoop
Edit $HADOOP_HOME/conf/core-site.xml and add the configuration below. (Later testing showed this configuration can be omitted; in fact, adding it caused frameworks such as Sqoop to fail to load LzoCodec.)
<property>
  <name>hadoop.tmp.dir</name>
  <value>/home/hadoop/soft/hadoop/tmp</value>
</property>
<property>
  <name>fs.trash.interval</name>
  <value>1440</value>
  <description>Number of minutes between trash checkpoints. If zero, the trash feature is disabled.</description>
</property>
<property>
  <name>io.compression.codecs</name>
  <value>org.apache.hadoop.io.compress.DefaultCodec,com.hadoop.compression.lzo.LzoCodec,com.hadoop.compression.lzo.LzopCodec,org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.BZip2Codec</value>
</property>
<property>
  <name>io.compression.codec.lzo.class</name>
  <value>com.hadoop.compression.lzo.LzoCodec</value>
</property>
Then edit $HADOOP_HOME/conf/mapred-site.xml:
<property>
  <name>mapreduce.map.output.compress</name>
  <value>true</value>
</property>
<property>
  <name>mapred.child.java.opts</name>
  <value>-Djava.library.path=/home/hadoop/soft/hadoop/lib/native/Linux-i386-32/</value>
</property>
<property>
  <name>mapred.map.output.compression.codec</name>
  <value>com.hadoop.compression.lzo.LzoCodec</value>
</property>
6. Restart the Hadoop cluster
cd /home/hadoop/soft/hadoop/bin
./stop-all.sh
./start-all.sh
7. Test the cluster
a. Testing in the test environment
1. Install lzop
wget http://www.lzop.org/download/lzop-1.03.tar.gz
./configure && make && sudo make install
2. Compress the log file with lzop
Download the original log: hadoop fs -copyToLocal /user/hive/warehouse/labrador.db/pvlog/ptdate=2013-06-04/pv.log
The original log file:
-rw-r--r-- 1 hadoop hadoop 497060688 Jul 1 10:36 pv.log
Compress it with lzop: lzop pv.log
The compressed log file:
-rw-r--r-- 1 hadoop hadoop 497060688 Jul 1 10:36 pv.log
-rw-r--r-- 1 hadoop hadoop 163517168 Jul 1 10:36 pv.log.lzo
Compression ratio: 163517168 / 497060688 ≈ 33%
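The ratio figure can be reproduced from the two byte counts in the listing above; a quick sketch using only awk:

```shell
# Reproduce the compression-ratio figure from the two byte counts above.
ORIG=497060688        # size of pv.log
COMPRESSED=163517168  # size of pv.log.lzo
RATIO=$(awk -v o="$ORIG" -v c="$COMPRESSED" 'BEGIN { printf "%.0f", c / o * 100 }')
echo "compressed size is ${RATIO}% of the original"
```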
hadoop fs -put pv.log.lzo /user/hive/warehouse/labrador.db/pvlog/ptdate=2013-06-04/
Verify the installation: hadoop jar hadoop-lzo-0.4.15.jar com.hadoop.compression.lzo.DistributedLzoIndexer /user/hive/warehouse/labrador.db/pvlog/ptdate=2013-06-04
This failed with:
13/07/01 15:01:35 ERROR lzo.GPLNativeCodeLoader: Could not load native gpl library
java.lang.UnsatisfiedLinkError: no gplcompression in java.library.path
at java.lang.ClassLoader.loadLibrary(ClassLoader.java:1860)
at java.lang.Runtime.loadLibrary0(Runtime.java:845)
at java.lang.System.loadLibrary(System.java:1084)
at com.hadoop.compression.lzo.GPLNativeCodeLoader.<clinit>(GPLNativeCodeLoader.java:32)
at com.hadoop.compression.lzo.LzoCodec.<clinit>(LzoCodec.java:71)
at com.hadoop.compression.lzo.LzoIndexer.<init>(LzoIndexer.java:36)
at com.hadoop.compression.lzo.LzoIndexer.main(LzoIndexer.java:134)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:601)
at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
13/07/01 15:01:35 ERROR lzo.LzoCodec: Cannot load native-lzo without native-hadoop
13/07/01 15:01:36 INFO lzo.LzoIndexer: [INDEX] LZO Indexing file /user/hive/warehouse/labrador.db/pvlog/ptdate=2013-06-14/pv.log.lzo, size 0.05 GB...
Exception in thread "main" java.lang.RuntimeException: native-lzo library not available
at com.hadoop.compression.lzo.LzopCodec.createDecompressor(LzopCodec.java:104)
at com.hadoop.compression.lzo.LzoIndex.createIndex(LzoIndex.java:229)
at com.hadoop.compression.lzo.LzoIndexer.indexSingleFile(LzoIndexer.java:117)
at com.hadoop.compression.lzo.LzoIndexer.indexInternal(LzoIndexer.java:98)
at com.hadoop.compression.lzo.LzoIndexer.index(LzoIndexer.java:52)
at com.hadoop.compression.lzo.LzoIndexer.main(LzoIndexer.java:137)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:601)
at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
This means the native LZO library was not installed correctly.
Fix:
1. After a long round of troubleshooting, it turned out the JDK was 32-bit while the machine was 64-bit, so the Hadoop native libraries were built as 32-bit. Switch to a 64-bit JDK.
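This kind of mismatch can be caught quickly by inspecting the ELF class byte of the compiled library; a hedged sketch (the library path in the comment is an assumption from this setup):

```shell
# The fifth byte of an ELF header encodes the class: 01 = 32-bit, 02 = 64-bit.
# Comparing the native library's class to the machine's catches build mismatches.
elf_bits() {
  case "$(od -An -tx1 -j4 -N1 "$1" | tr -d ' ')" in
    01) echo 32 ;;
    02) echo 64 ;;
    *)  echo unknown ;;
  esac
}
# Hypothetical usage against the built library (path is an assumption):
# elf_bits /home/hadoop/soft/hadoop/lib/native/Linux-i386-32/libgplcompression.so
```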
2. Reading through the hadoop-lzo and Hadoop source turned up the following code fragments.
com.hadoop.compression.lzo.GPLNativeCodeLoader:
try {
  // try to load the lib
  System.loadLibrary("gplcompression");
  nativeLibraryLoaded = true;
  LOG.info("Loaded native gpl library");
} catch (Throwable t) {
  LOG.error("Could not load native gpl library", t);
  nativeLibraryLoaded = false;
}
/home/hadoop/soft/hadoop/bin/hadoop:
HADOOP_OPTS="$HADOOP_OPTS -Djava.library.path=$JAVA_LIBRARY_PATH"
Printing JAVA_LIBRARY_PATH just before this line showed that it did not contain the LZO shared-library directory. For LZO to work, that directory must be on JAVA_LIBRARY_PATH,
so add the following at line 365: JAVA_LIBRARY_PATH=${JAVA_LIBRARY_PATH}:${HADOOP_HOME}/lib/native/${JAVA_PLATFORM}
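The one-line append above is easy to get wrong when JAVA_LIBRARY_PATH starts out empty (a stray leading `:` sneaks in); a defensive sketch of the same append, with the helper name being my own:

```shell
# Append a native-library directory to a colon-separated search path,
# avoiding a stray leading ':' when the path starts out empty.
append_lib_path() {
  path="$1"
  dir="$2"
  if [ -z "$path" ]; then
    echo "$dir"
  else
    echo "${path}:${dir}"
  fi
}
# As it would be used in bin/hadoop (variables come from the surrounding script):
# JAVA_LIBRARY_PATH=$(append_lib_path "$JAVA_LIBRARY_PATH" "${HADOOP_HOME}/lib/native/${JAVA_PLATFORM}")
```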
Restart the Hadoop cluster.
Run the indexer again: hadoop jar hadoop-lzo-0.4.15.jar com.hadoop.compression.lzo.DistributedLzoIndexer /user/hive/warehouse/labrador.db/pvlog/ptdate=2013-06-04
The output:
13/07/01 17:40:53 INFO lzo.GPLNativeCodeLoader: Loaded native gpl library
13/07/01 17:40:53 INFO lzo.LzoCodec: Successfully loaded & initialized native-lzo library [hadoop-lzo rev 6bb1b7f8b9044d8df9b4d2b6641db7658aab3cf8]
13/07/01 17:40:54 INFO lzo.DistributedLzoIndexer: Adding LZO file hdfs://kooxoo1-154.kuxun.cn:9000/user/hive/warehouse/labrador.db/pvlog/ptdate=2013-06-04/pv.log.lzo to indexing list (no index currently exists)
13/07/01 17:40:54 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
13/07/01 17:40:54 INFO input.FileInputFormat: Total input paths to process : 1
13/07/01 17:40:54 INFO mapred.JobClient: Running job: job_201307011738_0001
13/07/01 17:40:55 INFO mapred.JobClient: map 0% reduce 0%
13/07/01 17:41:11 INFO mapred.JobClient: map 100% reduce 0%
13/07/01 17:41:16 INFO mapred.JobClient: Job complete: job_201307011738_0001
13/07/01 17:41:16 INFO mapred.JobClient: Counters: 19
13/07/01 17:41:16 INFO mapred.JobClient: Job Counters
13/07/01 17:41:16 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=15320
13/07/01 17:41:16 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0
13/07/01 17:41:16 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0
13/07/01 17:41:16 INFO mapred.JobClient: Launched map tasks=1
13/07/01 17:41:16 INFO mapred.JobClient: Data-local map tasks=1
13/07/01 17:41:16 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=0
13/07/01 17:41:16 INFO mapred.JobClient: File Output Format Counters
13/07/01 17:41:16 INFO mapred.JobClient: Bytes Written=0
13/07/01 17:41:16 INFO mapred.JobClient: FileSystemCounters
13/07/01 17:41:16 INFO mapred.JobClient: HDFS_BYTES_READ=15388
13/07/01 17:41:16 INFO mapred.JobClient: FILE_BYTES_WRITTEN=21849
13/07/01 17:41:16 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=15176
13/07/01 17:41:16 INFO mapred.JobClient: File Input Format Counters
13/07/01 17:41:16 INFO mapred.JobClient: Bytes Read=15220
13/07/01 17:41:16 INFO mapred.JobClient: Map-Reduce Framework
13/07/01 17:41:16 INFO mapred.JobClient: Map input records=1897
13/07/01 17:41:16 INFO mapred.JobClient: Physical memory (bytes) snapshot=100438016
13/07/01 17:41:16 INFO mapred.JobClient: Spilled Records=0
13/07/01 17:41:16 INFO mapred.JobClient: CPU time spent (ms)=3770
13/07/01 17:41:16 INFO mapred.JobClient: Total committed heap usage (bytes)=189202432
13/07/01 17:41:16 INFO mapred.JobClient: Virtual memory (bytes) snapshot=3543986176
13/07/01 17:41:16 INFO mapred.JobClient: Map output records=1897
13/07/01 17:41:16 INFO mapred.JobClient: SPLIT_RAW_BYTES=164
This shows that hadoop-lzo is now installed correctly. The index files that were created:
-rw-r--r-- 3 hadoop caodx 163517168 2013-07-01 10:55 /user/hive/warehouse/labrador.db/pvlog/ptdate=2013-06-04/pv.log.lzo
-rw-r--r-- 3 hadoop caodx     15176 2013-07-01 17:41 /user/hive/warehouse/labrador.db/pvlog/ptdate=2013-06-04/pv.log.lzo.index
3. MapReduce test:
The core WordCount code fragment:
TextOutputFormat.setCompressOutput(job, true);
TextOutputFormat.setOutputCompressorClass(job, LzopCodec.class);
Run WordCount:
hadoop fs -put soft/hadoop/README.txt /user/hadoop
hadoop jar lzotest.jar org.apache.hadoop.examples.WordCount /user/hadoop/README.txt /user/hadoop/lzo1
13/07/01 18:12:40 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
13/07/01 18:12:40 INFO input.FileInputFormat: Total input paths to process : 1
13/07/01 18:12:40 INFO lzo.GPLNativeCodeLoader: Loaded native gpl library
13/07/01 18:12:40 INFO lzo.LzoCodec: Successfully loaded & initialized native-lzo library [hadoop-lzo rev 6bb1b7f8b9044d8df9b4d2b6641db7658aab3cf8]
13/07/01 18:12:40 INFO mapred.JobClient: Running job: job_201307011738_0004
13/07/01 18:12:41 INFO mapred.JobClient: map 0% reduce 0%
13/07/01 18:12:55 INFO mapred.JobClient: map 100% reduce 0%
13/07/01 18:13:07 INFO mapred.JobClient: map 100% reduce 100%
13/07/01 18:13:12 INFO mapred.JobClient: Job complete: job_201307011738_0004
Check the output file:
hadoop fs -ls /user/hadoop/lzo1
-rw-r--r-- 3 hadoop supergroup 1037 2013-07-01 18:13 /user/hadoop/lzo1/part-r-00000.lzo
The output is indeed compressed.
4. Hive test
Enable compression of intermediate map output:
hive (labrador)> set mapred.compress.map.output=true;
hive (labrador)> set mapred.map.output.compression.codec=com.hadoop.compression.lzo.LzopCodec;
hive (labrador)> select count(*) from pvlog where ptdate='2013-06-04';
5. Performance test (on a 9.1 GB log)
The example job computes PV, UV, and IP counts from the PV log.
hadoop@kooxoo1-155:~$ ll -h
-rw-r--r-- 1 hadoop hadoop 9.1G Jul 2 14:55 pvlog2013-06-04.txt
a. Without compression:
hive (labrador)> select count(*) pv, count(distinct visitsid) uv, count(distinct ip) ip from pvlog where ptdate='2013-06-04';
See the job configuration page http://hadoop154.ikuxun.cn/jobconf.jsp?jobid=job_201307021641_0001 :
Mapper and reducer counts reported at runtime:
Hadoop job information for Stage-1: number of mappers: 37; number of reducers:
Result:
pv uv ip
14569944 946643 685518
Time taken: 204.92 seconds
b. With intermediate map output compression:
Recreate the table (the input and output formats must be specified when the table is created; they cannot be set just before running the HiveQL, or an error occurs):
hive (labrador)> drop table pvlog; (dropping an external table does not delete the table's data)
hive (labrador)> CREATE EXTERNAL TABLE pvlog(ip string, current_date string, current_time string, entry_time string,
hive (labrador)>visitor_id string, url string, first_refer string, last_refer string, fromid string, ifid string, external_source string, internal_source string, pagetype string,
hive (labrador)>global_landing string, channel_landing string, visits_count string, pv_count string, kuxun_id string, utm_source string, utm_medium string, utm_term string,
hive (labrador)>utm_id string, utm_campaign string, pool string, reserve_a string, reserve_b string, reserve_c string, reserve_d string, city string, pvid string,
hive (labrador)>lastpvid string, visitsid string, maxpvcount string, channelpv string, channelleads string)
hive (labrador)>PARTITIONED BY (ptdate string)
hive (labrador)>ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
hive (labrador)>LINES TERMINATED BY '\n'
hive (labrador)>STORED AS INPUTFORMAT "com.hadoop.mapred.DeprecatedLzoTextInputFormat"
hive (labrador)>OUTPUTFORMAT "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat"
hive (labrador)>LOCATION '/user/hive/warehouse/labrador.db/pvlog/';
hive (labrador)>ALTER TABLE pvlog ADD PARTITION (ptdate='2013-06-04');
hive (labrador)> set mapred.compress.map.output=true;
hive (labrador)> set mapred.map.output.compression.codec=com.hadoop.compression.lzo.LzopCodec;
hive (labrador)> set hive.exec.compress.intermediate=true;
hive (labrador)> set io.compression.codecs=com.hadoop.compression.lzo.LzopCodec;
hive (labrador)> select count(*) pv, count(distinct visitsid) uv, count(distinct ip) ip from pvlog where ptdate='2013-06-04';
Mapper and reducer counts reported at runtime:
Hadoop job information for Stage-1: number of mappers: 37; number of reducers:
See the job configuration page http://hadoop154.ikuxun.cn/jobconf.jsp?jobid=job_201307021641_0001 :
Result:
pv uv ip
14569944 946643 685518
Time taken: 184.92 seconds
Execution time improved by a little over 20 seconds, which is not a dramatic gain; this is likely down to the test environment: 4 nodes, all virtual machines, with 6 map slots and 6 reduce slots in total.
c. Test automatic index creation:
Drop and recreate the pvlog table.
Load the data: hive (labrador)> LOAD DATA local INPATH '/home/hadoop/pvlog2013-06-04.txt' INTO TABLE pvlog PARTITION(ptdate='2013-06-04');
Check the loaded data:
hadoop@kooxoo1-155:~$ hadoop fs -ls /user/hive/warehouse/labrador.db/pvlog/ptdate=2013-06-04
-rw-r--r-- 3 hadoop supergroup 9674697618 2013-07-04 10:12 /user/hive/warehouse/labrador.db/pvlog/ptdate=2013-06-04/pvlog2013-06-04.txt
This confirms that LOAD DATA neither compresses the data nor creates an index; to get compressed, indexed data, you must compress and index it yourself.
Script to compress and index manually:
#!/bin/bash
hadoop fs -copyToLocal /user/hive/warehouse/labrador.db/pvlog/ptdate=2013-06-04/ /home/caodx/workspace/hadoopscript/lzo-test/
cd /home/caodx/workspace/hadoopscript/lzo-test/ptdate=2013-06-04
# create the LZO-compressed file
/usr/local/bin/lzop pvlog2013-06-04.txt
hadoop fs -moveFromLocal /home/caodx/workspace/hadoopscript/lzo-test/pvlog2013-06-04.txt.lzo /home/caodx/lzo-test
hadoop fs -rmr /user/hive/warehouse/labrador.db/pvlog/ptdate=2013-06-04/pvlog2013-06-04.txt
cd /home/caodx/workspace/hadoopscript/lzo-test/
# create the index file for the compressed file
hadoop jar hadoop-lzo-0.4.15.jar com.hadoop.compression.lzo.DistributedLzoIndexer /user/hive/warehouse/labrador.db/pvlog/ptdate=2013-06-04/
rm -rf ptdate=2013-06-04
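The script above hard-codes one partition date; a hedged sketch of a parameterized version that only prints the commands it would run (the warehouse path, file naming, and `lzo_partition` helper are assumptions based on this setup, and the `run` mode is untested here):

```shell
#!/bin/sh
# Dry-run sketch: print the compress-and-index commands for one partition date.
# Pass "run" as the second argument to actually execute them.
lzo_partition() {
  ptdate="$1"
  mode="${2:-dry}"
  warehouse="/user/hive/warehouse/labrador.db/pvlog/ptdate=${ptdate}"
  for cmd in \
      "hadoop fs -copyToLocal ${warehouse}/ ." \
      "lzop ptdate=${ptdate}/pvlog${ptdate}.txt" \
      "hadoop fs -put ptdate=${ptdate}/pvlog${ptdate}.txt.lzo ${warehouse}/" \
      "hadoop fs -rmr ${warehouse}/pvlog${ptdate}.txt" \
      "hadoop jar hadoop-lzo-0.4.15.jar com.hadoop.compression.lzo.DistributedLzoIndexer ${warehouse}/"
  do
    echo "$cmd"
    if [ "$mode" = "run" ]; then $cmd; fi
  done
}
lzo_partition 2013-06-04
```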