RcFile存储和读取操作

smallboby

浏览: 192783 次
性别:
来自: 北京

最近访客更多访客>>

nalnait

lvtt

501311837

ycy12ycy

博主相关

博客

微博

相册

留言

关于我

文章分类

社区版块

存档分类

博客分类：

RcFile

RcFile hadoop 存储

工作中用到了RcFile来存储和读取RcFile格式的文件，记录下。
RcFile是FaceBook开发的一个集行存储和列存储的优点于一身，压缩比更高，读取列更快，它在MapReduce环境中大规模数据处理中扮演着重要的角色。
读取操作：

job信息：
	    Job job = new Job();
            job.setJarByClass(类.class);
		//设定输入文件为RcFile格式
            job.setInputFormatClass(RCFileInputFormat.class);  
		//普通输出
            job.setOutputFormatClass(TextOutputFormat.class);
		//设置输入路径
            RCFileInputFormat.addInputPath(job, new Path(srcpath));
            //MultipleInputs.addInputPath(job, new Path(srcpath), RCFileInputFormat.class);
		// 输出
            TextOutputFormat.setOutputPath(job, new Path(respath));
            // 输出key格式
            job.setOutputKeyClass(Text.class);  
		//输出value格式
            job.setOutputValueClass(NullWritable.class);  
		//设置mapper类
            job.setMapperClass(ReadTestMapper.class);
		//这里没设置reduce，reduce的操作就是读Text类型文件，因为mapper已经给转换了。
            
            code = (job.waitForCompletion(true)) ? 0 : 1;


// mapper 类

pulic class ReadTestMapper extends Mapper<LongWritable, BytesRefArrayWritable, Text, NullWritable> {
        
        @Override
        protected void map(LongWritable key, BytesRefArrayWritable value, Context context) throws IOException, InterruptedException {
            // TODO Auto-generated method stub
            Text txt = new Text(); 
		//因为RcFile行存储和列存储，所以每次进来的一行数据，Value是个列簇，遍历，输出。
            StringBuffer sb = new StringBuffer();
            for (int i = 0; i < value.size(); i++) {
                BytesRefWritable v = value.get(i);
                txt.set(v.getData(), v.getStart(), v.getLength());
                if(i==value.size()-1){
                    sb.append(txt.toString());
                }else{
                    sb.append(txt.toString()+"\t");
                }
            }
            context.write(new Text(sb.toString()),NullWritable.get());
            }
        }

输出压缩为RcFile格式：

job信息：
            Job job = new Job();
            Configuration conf = job.getConfiguration();
		//设置每行的列簇数
            RCFileOutputFormat.setColumnNumber(conf, 4);
            job.setJarByClass(类.class);

            FileInputFormat.setInputPaths(job, new Path(srcpath));
            RCFileOutputFormat.setOutputPath(job, new Path(respath));

            job.setInputFormatClass(TextInputFormat.class);
            job.setOutputFormatClass(RCFileOutputFormat.class);

            job.setMapOutputKeyClass(LongWritable.class);
            job.setMapOutputValueClass(BytesRefArrayWritable.class);

            job.setMapperClass(OutPutTestMapper.class);

            conf.set("date", line.getOptionValue(DATE));
		//设置压缩参数
            conf.setBoolean("mapred.output.compress", true);
            conf.set("mapred.output.compression.codec", "org.apache.hadoop.io.compress.GzipCodec");

            code = (job.waitForCompletion(true)) ? 0 : 1;

mapper类：
    public class OutPutTestMapper extends Mapper<LongWritable, Text, LongWritable, BytesRefArrayWritable> {
        @Override
        public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
            String line = value.toString();
            String day = context.getConfiguration().get("date");
            if (!line.equals("")) {
                String[] lines = line.split(" ", -1);
                if (lines.length > 3) {
                    String time_temp = lines[1];
                    String times = timeStampDate(time_temp);
                    String d = times.substring(0, 10);
                    if (day.equals(d)) {
                        byte[][] record = {lines[0].getBytes("UTF-8"), lines[1].getBytes("UTF-8"),lines[2].getBytes("UTF-8"), lines[3].getBytes("UTF-8")};

                        BytesRefArrayWritable bytes = new BytesRefArrayWritable(record.length);

                        for (int i = 0; i < record.length; i++) {
                            BytesRefWritable cu = new BytesRefWritable(record[i], 0, record[i].length);
                            bytes.set(i, cu);
                        }
                        context.write(key, bytes);
                    }
                }
            }
        }

转载，请超链接形式标明文章原始出处和作者。
永久链接: http://smallboby.iteye.com/blog/1592531。
感谢。不当之处，请指教。

分享到：

HIVE的基本操作 | linux下安装apache，mysql，php5简单过程 ...

2012-07-14 00:11
浏览 10439
评论(2)
分类:编程语言
查看更多

2 楼 yiqi1943 2014-07-14

timeStampDate函数没找到

1 楼 yiqi1943 2014-07-04

您用的hadoop和hive的版本是什么呢？

发表评论

您还没有登录,请您登录后再发表评论

最近访客更多访客>>

博主相关

文章分类

社区版块

存档分类

最新评论

RcFile存储和读取操作

评论

发表评论

相关推荐

最近访客 更多访客>>

博主相关

文章分类

社区版块

存档分类

最新评论

RcFile存储和读取操作

评论

发表评论

相关推荐

普通文本压缩成RcFile的通用类

最近访客更多访客>>