Hadoop Study 4: Compressing (zipping) Files with Hadoop

goon

Category: Hadoop series

Hadoop 0.20.2

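1. Using the streaming command (excerpted from the Hadoop documentation):

Besides plain-text output, a streaming job can also produce gzip-compressed output. To do so, pass the options '-jobconf mapred.output.compress=true -jobconf mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec' to the streaming job. Note that the codec class is named GzipCodec; writing it as GzipCode, as it sometimes appears, will not resolve to a class.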
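As a rough illustration of where those two options sit in a full streaming invocation, the sketch below assembles the argument list in Java and prints it. The streaming jar location, the input and output paths, and the use of /bin/cat as mapper and reducer are assumptions made for this example, not values from the original post; adjust them to your installation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class StreamingGzipCommand {
    // Builds the argument list for a streaming job whose output is gzip-compressed.
    // streamingJar, input and output are caller-supplied (hypothetical) paths.
    static List<String> build(String streamingJar, String input, String output) {
        List<String> args = new ArrayList<>(Arrays.asList(
                "bin/hadoop", "jar", streamingJar,
                "-input", input,
                "-output", output,
                "-mapper", "/bin/cat",    // assumption: pass-through mapper
                "-reducer", "/bin/cat")); // assumption: pass-through reducer
        // The two options quoted in the text above.
        args.addAll(Arrays.asList("-jobconf", "mapred.output.compress=true"));
        args.addAll(Arrays.asList("-jobconf",
                "mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec"));
        return args;
    }

    public static void main(String[] argv) {
        // Assumed jar path for a 0.20.2 install; it may differ on your machine.
        List<String> cmd = build("contrib/streaming/hadoop-0.20.2-streaming.jar",
                "/temp/in", "/temp/out-streaming");
        System.out.println(String.join(" ", cmd));
    }
}
```

The printed line can be pasted into a shell from the Hadoop home directory, or the list can be handed to a ProcessBuilder if the job should be launched from Java.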

2. Using a program:

Input files:

$ bin/hadoop fs -ls /temp/in
Found 2 items
-rw-r--r--   1 Administrator supergroup         52 2012-02-09 10:02 /temp/in/t1.txt
-rw-r--r--   1 Administrator supergroup         35 2012-02-09 10:02 /temp/in/t2.txt

Test code:

package com.hadoop.test;

import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class ZipFile {

	// Map-only pass-through: emits every input line unchanged as the output key.
	public static class Map extends MapReduceBase implements Mapper<LongWritable, Text, Text, NullWritable> {

		public void map(LongWritable key, Text value,
				OutputCollector<Text, NullWritable> output, Reporter reporter)
				throws IOException {
			// TextOutputFormat writes only the key when the value is a NullWritable.
			output.collect(value, NullWritable.get());
		}
	}

	public static void main(String[] args) {
		JobConf conf = new JobConf(ZipFile.class);

		// Output types must match what the mapper emits.
		conf.setOutputKeyClass(Text.class);
		conf.setOutputValueClass(NullWritable.class);

		// Input and output are DIRECTORIES (not files).
		FileInputFormat.setInputPaths(conf, new Path("/temp/in"));
		FileOutputFormat.setOutputPath(conf, new Path("/temp/out-" + System.currentTimeMillis()));

		conf.setMapperClass(Map.class);

		// Compress the job output with gzip.
		FileOutputFormat.setCompressOutput(conf, true);
		FileOutputFormat.setOutputCompressorClass(conf, GzipCodec.class);

		// No reducers: each map task writes its own part-NNNNN.gz file.
		conf.setNumReduceTasks(0);

		try {
			JobClient.runJob(conf);
		} catch (IOException e) {
			e.printStackTrace();
		}
	}

}

Output files:

$ bin/hadoop fs -ls /temp/out-1328857284203
Found 2 items
-rw-r--r--   3 Administrator supergroup         67 2012-02-10 15:01 /temp/out-1328857284203/part-00000.gz
-rw-r--r--   3 Administrator supergroup         53 2012-02-10 15:01 /temp/out-1328857284203/part-00001.gz

Download one of them with the command:

$ bin/hadoop fs -get /temp/out-1328857284203/part-00000.gz out1.gz

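The file downloaded to the local machine is still a compressed file (gzip format, not zip). After decompressing it, the content matches the original input file.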
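The comparison with the original file can also be done programmatically. The following is a minimal sketch using only java.util.zip from the JDK: it decompresses a gzip file and compares the bytes with a reference file. The class name GzipCheck and the command-line arguments are hypothetical, and the sample text used when no arguments are given is made up, since the post does not show the contents of t1.txt.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

class GzipCheck {
    // Reads a gzip stream to the end and returns the decompressed bytes.
    static byte[] gunzip(InputStream in) throws IOException {
        try (GZIPInputStream gz = new GZIPInputStream(in);
             ByteArrayOutputStream out = new ByteArrayOutputStream()) {
            byte[] buf = new byte[4096];
            int n;
            while ((n = gz.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
            return out.toByteArray();
        }
    }

    // Compresses a byte array into gzip format (used here only for the self-check).
    static byte[] gzip(byte[] data) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
            gz.write(data);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 2) {
            // args[0]: downloaded .gz file, args[1]: local copy of the original file
            byte[] unpacked;
            try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
                unpacked = gunzip(in);
            }
            byte[] original = Files.readAllBytes(Paths.get(args[1]));
            System.out.println(Arrays.equals(unpacked, original) ? "identical" : "different");
        } else {
            // Self-check with made-up sample text: compress, decompress, compare.
            byte[] sample = "hello hadoop\n".getBytes(StandardCharsets.UTF_8);
            byte[] restored = gunzip(new ByteArrayInputStream(gzip(sample)));
            System.out.println(Arrays.equals(sample, restored) ? "identical" : "different");
        }
    }
}
```

For example, `java GzipCheck out1.gz t1.txt` would compare the downloaded file with a local copy of the input, assuming part-00000 was produced from t1.txt. One caveat: the job writes a newline after every line, so an original file without a trailing newline would differ from the decompressed output by that final byte.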


2012-02-10 15:15