`

远程调用执行Hadoop Map/Reduce

阅读更多

在Web项目中,由用户下发任务后,后台服务器远程调用JobTracker所在服务器,运行Map/Reduce更符合B/S架构的习惯。

由于网上没有相关资料,所以自己实现了一个,现在分享一下。

注:基于Hadoop1.1.2版本

转发请注明地址:http://sgq0085.iteye.com/admin/blogs/1879442

一个常见的WordCount如下:

 

package com.gqshao.hadoop.remote;

import java.io.IOException;
import java.util.*;

import org.apache.hadoop.conf.*;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.mapreduce.lib.input.*;
import org.apache.hadoop.mapreduce.lib.output.*;
import org.apache.hadoop.util.*;

public class WordCount extends Configured implements Tool {
    public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                word.set(tokenizer.nextToken());
                context.write(word, one);
            }
        }
    }

    public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public int run(String[] args) throws Exception {
        this.getClass().getResource("/hadoop/");
        Configuration conf = getConf();
        Job job = new Job(conf);
        conf.set("mapred.job.tracker", "192.168.0.128:9001");
        conf.set("fs.default.name", "hdfs://192.168.0.128:9000");
        conf.set("hadoop.job.ugi", "hadoop");
        conf.set("Hadoop.tmp.dir", "/user/gqshao/temp/");

        job.setJarByClass(WordCount.class);
        job.setJobName("wordcount");

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        job.setMapperClass(Map.class);
        job.setReducerClass(Reduce.class);

        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        String hdfs = "hdfs://192.168.0.128:9000";
        args = new String[] { hdfs + "/user/gqshao/input/big", hdfs + "/user/gqshao/output/WordCount/" + new Date().getTime() };
        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        boolean success = job.waitForCompletion(true);
        return success ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        int ret = ToolRunner.run(new WordCount(), args);
        System.exit(ret);
    }
}
 在这里输入和输出目录都是指向HDFS上的,但实际运行的时候(一般 -Xms128m -Xmx512m -XX:MaxPermSize=128M)发现输出中有如下信息:
信息: Running job: job_local_0001

证明该Map/Reduce程序运行在Local中。也就是说,这种方式只能提前打好Jar包,放到Cluster服务器上,在通过Jar运行。

转发请注明地址:http://sgq0085.iteye.com/admin/blogs/1879442

如何远程运行Map/Reduce程序,经研究发现两点。

1.需要将Hadoop的配置文件加载到当前进程的ClassLoader中,或将配置文件放到/bin目录下。

通过跟踪 job.waitForCompletion(true);→submit();→info = jobClient.submitJobInternal(conf);→status = jobSubmitClient.submitJob(jobId, submitJobDir.toString(), jobCopy.getCredentials());

发现private JobSubmissionProtocol jobSubmitClient;分别有两个实现

在org.apache.hadoop.mapred.JobClient中init()方法中可以看到如果设置了conf中如果设置了mapred.job.tracker则在Hadoop Cluster中运行,否则是Local

 

  public void init(JobConf conf) throws IOException {
    String tracker = conf.get("mapred.job.tracker", "local");
    tasklogtimeout = conf.getInt(
      TASKLOG_PULL_TIMEOUT_KEY, DEFAULT_TASKLOG_TIMEOUT);
    this.ugi = UserGroupInformation.getCurrentUser();
    if ("local".equals(tracker)) {
      conf.setNumMapTasks(1);
      this.jobSubmitClient = new LocalJobRunner(conf);
    } else {
      this.rpcJobSubmitClient = 
          createRPCProxy(JobTracker.getAddress(conf), conf);
      this.jobSubmitClient = createProxy(this.rpcJobSubmitClient, conf);
    }        
  }

 

所以需要在运行时加载某目录下配置文件

方法如下:

    /**
     * 加载配置文件
     */
    public static void setConf(Class<?> clazz, Thread thread, String path) {
        URL url = clazz.getResource(path);
        try {
            File confDir = new File(url.toURI());
            if (!confDir.exists()) {
                return;
            }
            URL key = confDir.getCanonicalFile().toURI().toURL();
            ClassLoader classLoader = thread.getContextClassLoader();
            classLoader = new URLClassLoader(new URL[] { key }, classLoader);
            thread.setContextClassLoader(classLoader);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

 

2.设置运行时Jar包

继续看jobClient.submitJobInternal(conf);可以发现client在提交作业到Hadoop时需要把作业打包成jar,然后copy到fs的submitJarFile路径中。所以必须指定conf中的运行的Jar包。

方法如下:

    /**
     * 动态生成Jar包
     */
    public static File createJar(Class<?> clazz) throws Exception {
        String fqn = clazz.getName();
        String base = fqn.substring(0, fqn.lastIndexOf("."));
        base = "/" + base.replaceAll("\\.", Matcher.quoteReplacement("/"));
        URL root = clazz.getResource("");

        JarOutputStream out = null;
        final File jar = File.createTempFile("HadoopRunningJar-", ".jar", new File(System.getProperty("java.io.tmpdir")));
        System.out.println(jar.getAbsolutePath());
        Runtime.getRuntime().addShutdownHook(new Thread() {
            public void run() {
                jar.delete();
            }
        });
        try {
            File path = new File(root.toURI());
            Manifest manifest = new Manifest();
            manifest.getMainAttributes().putValue("Manifest-Version", "1.0");
            manifest.getMainAttributes().putValue("Created-By", "RemoteHadoopUtil");
            out = new JarOutputStream(new FileOutputStream(jar), manifest);
            writeBaseFile(out, path, base);
        } finally {
            out.flush();
            out.close();
        }
        return jar;
    }

    /**
     * 递归添加.class文件
     */
    private static void writeBaseFile(JarOutputStream out, File file, String base) throws IOException {
        if (file.isDirectory()) {
            File[] fl = file.listFiles();
            if (base.length() > 0) {
                base = base + "/";
            }
            for (int i = 0; i < fl.length; i++) {
                writeBaseFile(out, fl[i], base + fl[i].getName());
            }
        } else {
            out.putNextEntry(new JarEntry(base));
            FileInputStream in = null;
            try {
                in = new FileInputStream(file);
                byte[] buffer = new byte[1024];
                int n = in.read(buffer);
                while (n != -1) {
                    out.write(buffer, 0, n);
                    n = in.read(buffer);
                }
            } finally {
                in.close();
            }  
        }
    }

 

修改后的WordCount如下:

public class WordCount extends Configured implements Tool {
    public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
            String line = value.toString();
            System.out.println("line===>" + line);
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                word.set(tokenizer.nextToken());
                context.write(word, one);
            }
        }
    }

    public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public int run(String[] args) throws Exception {
        Configuration conf = getConf();
        Job job = new Job(conf);
        System.out.println(conf.get("mapred.job.tracker"));
        System.out.println(conf.get("fs.default.name"));
        /**
         * TODO:调用二
         */
        File jarFile = RemoteHadoopUtil.createJar(WordCount.class);
        ((JobConf) job.getConfiguration()).setJar(jarFile.toString());
        job.setJarByClass(WordCount.class);
        job.setJobName("wordcount");

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        job.setMapperClass(Map.class);
        job.setReducerClass(Reduce.class);

        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        String hdfs = "hdfs://192.168.0.128:9000";
        args = new String[] { hdfs + "/user/gqshao/input/WordCount/", hdfs + "/user/gqshao/output/WordCount/" + new Date().getTime() };
        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        boolean success = job.waitForCompletion(true);
        System.out.println(job.isComplete());
        System.out.println("JobID: " + job.getJobID());
        return success ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        /**
         * TODO:调用一
         */
        RemoteHadoopUtil.setConf(WordCount.class, Thread.currentThread(), "/hadoop");
        int ret = ToolRunner.run(new WordCount(), args);
        System.exit(ret);
    }
}

 转发请注明地址:http://sgq0085.iteye.com/admin/blogs/1879442

 附件中有完整代码和测试用例,欢迎讨论。解压后在文件目录中运行mvn eclipse:clean eclipse:eclipse即可(前提是需要有Maven)

分享到:
评论
6 楼 tonny1228 2017-01-09  
经测试还是运行在local
5 楼 wangjinyang_123 2014-04-15  
  大虾,你好,这个怎么才能在web端调用呢?还是没咋看懂,初学者,见谅。Maven也没玩过。
4 楼 perfri 2014-02-20  
正在学习中,作为小白还是有很多不明白的地方
3 楼 qq690388648 2014-02-18  
楼主好人一生平安!!!
2 楼 xieweijv 2013-12-27  
xieweijv 写道
感谢楼主的无私奉献,我使用你的代码在hadoop2.2.0上无法直接运行,下面是我稍加修改后的WordCount.java,分享一下:
package com.missionsky.hadoop.remote;

/**
 * For hadoop 2.2.0
 */
import java.io.File;
import java.io.IOException;
import java.util.Date;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

import com.missionsky.hadoop.remote.utils.RemoteHadoopUtil;

public class WordCount extends Configured implements Tool {
    public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
            String line = value.toString();
            System.out.println("line===>" + line);
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                word.set(tokenizer.nextToken());
                context.write(word, one);
            }
        }
    }

    public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public int run(String[] args) throws Exception {
        Job job = Job.getInstance();
        
        job.setJobName("job_wordcount");
        
        // Create Jar
        File jarFile = RemoteHadoopUtil.createJar(WordCount.class);
        ((JobConf) job.getConfiguration()).setJar(jarFile.toString());
        
        job.setJarByClass(WordCount.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        job.setMapperClass(Map.class);
        job.setReducerClass(Reduce.class);

        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        
        String hdfs = "hdfs://192.168.0.109:9000";
        FileInputFormat.setInputPaths(job, new Path(hdfs + "/user/input/wordcount/"));
        FileOutputFormat.setOutputPath(job, new Path(hdfs + "/user/output/wordcount/" + new Date().getTime()));
        boolean success = job.waitForCompletion(true);
        
        System.out.println("Job Final Status:" + job.getStatus().getState());
        
        return success ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
    	Configuration configuration = new Configuration();
        int ret = ToolRunner.run(configuration,new WordCount(), args);
        System.exit(ret);
    }
}

因为JobConf已经deprecated,所以job设置jar时候的时候,直接:
File jarFile = RemoteHadoopUtil.createJar(WordCount.class);
        job.setJar(jarFile.toString());
1 楼 xieweijv 2013-12-27  
感谢楼主的无私奉献,我使用你的代码在hadoop2.2.0上无法直接运行,下面是我稍加修改后的WordCount.java,分享一下:
package com.missionsky.hadoop.remote;

/**
 * For hadoop 2.2.0
 */
import java.io.File;
import java.io.IOException;
import java.util.Date;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

import com.missionsky.hadoop.remote.utils.RemoteHadoopUtil;

public class WordCount extends Configured implements Tool {
    public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
            String line = value.toString();
            System.out.println("line===>" + line);
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                word.set(tokenizer.nextToken());
                context.write(word, one);
            }
        }
    }

    public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public int run(String[] args) throws Exception {
        Job job = Job.getInstance();
        
        job.setJobName("job_wordcount");
        
        // Create Jar
        File jarFile = RemoteHadoopUtil.createJar(WordCount.class);
        ((JobConf) job.getConfiguration()).setJar(jarFile.toString());
        
        job.setJarByClass(WordCount.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        job.setMapperClass(Map.class);
        job.setReducerClass(Reduce.class);

        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        
        String hdfs = "hdfs://192.168.0.109:9000";
        FileInputFormat.setInputPaths(job, new Path(hdfs + "/user/input/wordcount/"));
        FileOutputFormat.setOutputPath(job, new Path(hdfs + "/user/output/wordcount/" + new Date().getTime()));
        boolean success = job.waitForCompletion(true);
        
        System.out.println("Job Final Status:" + job.getStatus().getState());
        
        return success ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
    	Configuration configuration = new Configuration();
        int ret = ToolRunner.run(configuration,new WordCount(), args);
        System.exit(ret);
    }
}

相关推荐

    hadoop之map/reduce

    ebsdi-apps则包含具体的MapReduce作业,它们简单调用ebsdi-domain中的接口来执行业务逻辑。 实现MapReduce程序的流程大致如下: 1. 创建输入实体类,确保属性与HDFS中的原始数据字段匹配,并实现MREntity抽象类的...

    第3章 HadoopAPI操作.pdf

    接下来,在Eclipse中可以通过Window -&gt; Preferences配置Hadoop安装路径,然后通过Open Perspective选择Map/Reduce视角,以显示Map/Reduce Locations面板。 连接Hadoop涉及在DFS Locations面板上创建新的location,...

    hadoop2x-eclipse-plugin

    1. 运行MapReduce任务:在项目中右键选择"Run As" -&gt; "Hadoop Job",Eclipse会调用Hadoop的命令行工具提交任务到集群。你可以跟踪任务的状态,查看日志,了解任务运行情况。 2. 调试MapReduce任务:通过"Debug As" ...

    Hadoop学习总结之四:Map-Reduce过程解析

    - **任务调度**:根据资源可用性及优先级等因素,将作业分解成多个任务(Map和Reduce任务),并分配给合适的TaskTracker进行执行。 - **状态监控**:跟踪所有TaskTracker的状态,以及各个任务的执行情况,确保作业...

    hadoop3\bin

    描述中提到的操作是针对Windows环境的,即替换Windows上的Hadoop `bin`目录,并将`hadoop.dll`文件复制到系统目录`C:\Windows\System32`,这是为了让Hadoop在Windows环境下能够正确识别和调用依赖的动态链接库。...

    hadoop-eclipse-plugin含WINDOWS下调试文档

    4. 对于调试,可以利用Eclipse的断点功能,设置在代码中需要检查的地方,当执行到该位置时,程序会暂停,便于查看变量值和调用堆栈。 六、注意事项 1. 确保Hadoop服务在本地或者远程集群上正常运行。 2. 确认Hadoop...

    hadoop 命令大全

    停止Map/Reduce服务则需要执行命令`$bin/stop-mapred.sh`。该命令同样会根据JobTracker上的`$HADOOP_CONF_DIR/slaves`文件,在所有列出的从节点上停止TaskTracker守护进程。 **10. 启动所有服务** 使用命令`$bin/...

    web 工程调用hadoop集群

    web工程调用hadoop集群的实例,包括一个wordcount例子。 输入输入和输出路径点击提交即可提交任务到hadoop集群,同时含有map和reduce过程的监控。 注意点:要把hadoop相关包放入WEB_INF/lib下面;

    java web程序调用hadoop2.6

    Java Web程序调用Hadoop 2.6是一个关键的技术整合,它允许Web应用程序与Hadoop分布式文件系统(HDFS)和MapReduce框架交互,以处理大规模数据。在本示例中,我们将深入探讨如何实现这一集成,以及涉及的关键概念和...

    hadoop_map_reduce:Hadoop Map reduce 示例

    在压缩包文件`hadoop_map_reduce-master`中,可能包含了完整的MapReduce示例代码,包括Mapper、Reducer的实现,以及主程序。你可以通过阅读和运行这些代码来学习如何在实际项目中应用Hadoop MapReduce解决大数据问题...

    eclipse连接hadoop相关工具

    最后,通过Eclipse的“Window” -&gt; “Preferences” -&gt; “Hadoop Map/Reduce”设置Hadoop集群的相关信息,如JobTracker和NameNode的地址。 在实际开发中,使用这些工具可以提高开发效率,减少手动配置的工作量。...

    map-reduce详细

    ### Map-Reduce 实现细节与问题解决 #### 客户端操作流程 Map-Reduce 的启动过程始于客户端向系统提交任务。此过程的核心是通过 `JobClient` 类的 `runJob` 静态方法来实现。具体步骤如下: 1. **JobClient 对象...

    Hadoop plugin for eclipse

    3. **创建Hadoop项目**: 使用Eclipse的New -&gt; Project菜单,选择Hadoop相关的项目类型,如Hadoop Map/Reduce Project,然后按照向导完成项目创建。 4. **编写MapReduce代码**: 在创建的项目中,可以像其他Java项目...

    如何使用eclipse调试Hadoop作业

    Eclipse会通过Hadoop的本地模式运行作业,使得你可以逐行执行代码,查看变量状态,调用栈等信息。这对于找出程序中的逻辑错误非常有帮助。 对于远程调试,你可能需要在Hadoop集群上启动作业时启用调试模式。在启动...

    hadoop-eclipse-plugin-2.7.3.jar

    3. 重启Eclipse,如果在"Window &gt; Preferences &gt; Hadoop Map/Reduce"中能看到配置选项,说明插件已成功安装。 4. 配置Hadoop集群的连接信息,包括Hadoop的安装路径、NameNode和JobTracker的地址。 需要注意的是,...

    使用hadoop-streaming运行Python编写的MapReduce程序.rar

    3. **创建提交脚本**:创建一个提交脚本(通常为bash脚本),用于指定Map和Reduce任务的输入、输出路径,以及调用Hadoop Streaming命令。命令格式如下: ``` hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/...

Global site tag (gtag.js) - Google Analytics