Custom Counters in Hadoop

dajuezhao


I. Environment

1. Hadoop 0.20.2

2. Operating system: Linux

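II. Background

1. While writing MapReduce code recently, I kept wanting to count how often certain bad records occur. Writing those counts into the reduce output is ugly, so I looked for a way to report such statistics separately.

2. While reading Chapter 8, Section 1 of "Hadoop: The Definitive Guide" I found that custom counters are supported, but the book's examples are written against version 0.19. Many of the functions no longer match, and the changes required are fairly large.

3. For these two reasons I am writing this note as a record.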

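III. Implementation

1. Prerequisite: write an input file in which a well-formed record has 3 fields separated by "\t". Two records are malformed: one has 2 fields and one has 4 fields. The content is as follows: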
jim 1 28
kate 0 26
tom 1
kaka 1 22
lily 0 29 22
2. Count the malformed records. I did not write a reducer because the job does not need to output anything. The code is as follows:
package jz; // matches the class name jz.MyCounter used in the launch command

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

public class MyCounter {

    public static class MyCounterMap extends Mapper<LongWritable, Text, Text, Text> {

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // A well-formed record has exactly 3 tab-separated fields.
            String[] fields = value.toString().split("\t");
            if (fields.length > 3) {
                // Counters are looked up by group name and counter name.
                context.getCounter("ErrorCounter", "toolong").increment(1);
            } else if (fields.length < 3) {
                context.getCounter("ErrorCounter", "tooshort").increment(1);
            }
            // No context.write(): the job only counts and emits no records.
        }
    }

    public static void main(String[] args)
            throws IOException, InterruptedException, ClassNotFoundException {
        Configuration conf = new Configuration();
        String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
        if (otherArgs.length != 2) {
            System.err.println("Usage: MyCounter <in> <out>");
            System.exit(2);
        }

        Job job = new Job(conf, "MyCounter");
        job.setJarByClass(MyCounter.class);

        // Only a mapper is set; no reducer class is configured.
        job.setMapperClass(MyCounterMap.class);

        FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
        FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

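3. The launch command is as follows:
hadoop jar /jz/jar/Hadoop_Test.jar jz.MyCounter /jz/* /jz/06
Records with fewer than 3 fields are counted under tooshort, and records with more than 3 fields are counted under toolong.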

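4. The result is as follows (the custom counters are the tooshort and toolong entries in the ErrorCounter group):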
10/08/04 17:29:15 INFO mapred.JobClient: Job complete: job_201008032120_0019
10/08/04 17:29:15 INFO mapred.JobClient: Counters: 18
10/08/04 17:29:15 INFO mapred.JobClient:   Job Counters
10/08/04 17:29:15 INFO mapred.JobClient:     Launched reduce tasks=1
10/08/04 17:29:15 INFO mapred.JobClient:     Rack-local map tasks=1
10/08/04 17:29:15 INFO mapred.JobClient:     Launched map tasks=6
10/08/04 17:29:15 INFO mapred.JobClient:   ErrorCounter
10/08/04 17:29:15 INFO mapred.JobClient:     tooshort=1
10/08/04 17:29:15 INFO mapred.JobClient:     toolong=1
10/08/04 17:29:15 INFO mapred.JobClient:   FileSystemCounters
10/08/04 17:29:15 INFO mapred.JobClient:     FILE_BYTES_READ=6
10/08/04 17:29:15 INFO mapred.JobClient:     HDFS_BYTES_READ=47
10/08/04 17:29:15 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=234
10/08/04 17:29:15 INFO mapred.JobClient:   Map-Reduce Framework
10/08/04 17:29:15 INFO mapred.JobClient:     Reduce input groups=0
10/08/04 17:29:15 INFO mapred.JobClient:     Combine output records=0
10/08/04 17:29:15 INFO mapred.JobClient:     Map input records=5
10/08/04 17:29:15 INFO mapred.JobClient:     Reduce shuffle bytes=36
10/08/04 17:29:15 INFO mapred.JobClient:     Reduce output records=0
10/08/04 17:29:15 INFO mapred.JobClient:     Spilled Records=0
10/08/04 17:29:15 INFO mapred.JobClient:     Map output bytes=0
10/08/04 17:29:15 INFO mapred.JobClient:     Combine input records=0
10/08/04 17:29:15 INFO mapred.JobClient:     Map output records=0
10/08/04 17:29:15 INFO mapred.JobClient:     Reduce input records=0
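
The counter values in the job output can be cross-checked locally without a cluster. The following is a minimal sketch that applies the same field-count rule to the five sample records. The class name FieldCountCheck and the helper methods classify and tally are hypothetical names introduced only for this illustration, and the sample lines assume the fields are separated by real tab characters.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Local sketch (no Hadoop needed): applies the same field-count rule as the mapper.
// FieldCountCheck, classify and tally are illustrative names, not part of Hadoop.
class FieldCountCheck {

    // Returns the counter name a line would increment, or null for a well-formed line.
    static String classify(String line) {
        int fields = line.split("\t").length;
        if (fields > 3) {
            return "toolong";
        } else if (fields < 3) {
            return "tooshort";
        }
        return null;
    }

    // Tallies counter names over all lines, mimicking what the job aggregates.
    static Map<String, Long> tally(String[] lines) {
        Map<String, Long> counts = new LinkedHashMap<>();
        for (String line : lines) {
            String name = classify(line);
            if (name != null) {
                counts.merge(name, 1L, Long::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        String[] sample = {
            "jim\t1\t28", "kate\t0\t26", "tom\t1", "kaka\t1\t22", "lily\t0\t29\t22"
        };
        System.out.println(tally(sample)); // prints {tooshort=1, toolong=1}
    }
}
```

The result agrees with the ErrorCounter group reported by the job: one record is too short and one is too long.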
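IV. Summary
1. "Hadoop: The Definitive Guide" actually explains this clearly, but because the versions differ, many of the methods differ as well. To summarize, the main differences are:
Enum types are no longer required, counter names no longer need to be written in a properties file, and the methods to call are all wrapped in the context.
2. The Definitive Guide also shows how to report a percentage. That takes only a small change to the code: divide the error count by the total record count and express the result as a percentage.
3. If you have questions or find mistakes, you are welcome to email dajuezhao@gmail.com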
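
The percentage mentioned in point 2 could be computed along these lines. This is a sketch under stated assumptions: ErrorRate and errorPercent are hypothetical names, and the counter values are hard-coded from the job output shown earlier (tooshort=1, toolong=1, Map input records=5). In a real driver they would presumably be read after waitForCompletion, for example through job.getCounters().findCounter("ErrorCounter", "tooshort").getValue().

```java
import java.util.Locale;

// Sketch of the percentage calculation; ErrorRate and errorPercent are illustrative names.
class ErrorRate {

    // Share of malformed records as a percentage: 100 * errors / total.
    // Returns 0 when there are no records, to avoid dividing by zero.
    static double errorPercent(long errorCount, long totalRecords) {
        if (totalRecords == 0) {
            return 0.0;
        }
        return 100.0 * errorCount / totalRecords;
    }

    public static void main(String[] args) {
        // Values copied from the job output; assumed here rather than read from a live job.
        long tooShort = 1;
        long tooLong = 1;
        long mapInputRecords = 5;

        double percent = errorPercent(tooShort + tooLong, mapInputRecords);
        System.out.printf(Locale.ROOT, "Malformed records: %.2f%%%n", percent);
        // prints "Malformed records: 40.00%"
    }
}
```

Dividing the error count by the total (not the other way round) keeps the result between 0 and 100.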

2010-10-27 09:40