Oozie Workflow Map-Reduce Action

 
3.2.2 Map-Reduce Action


A map-reduce action can be configured to perform file system cleanup and directory creation before starting the map-reduce job. This capability enables Oozie to retry a Hadoop job after a transient failure (Hadoop checks that the job output directory does not exist and creates it when the job starts; thus, a retry without cleaning up the job output directory would fail).

The workflow job will wait until the Hadoop map/reduce job completes before continuing to the next action in the workflow execution path.

The counters of the Hadoop job and the job exit status (FAILED, KILLED, or SUCCEEDED) must be available to the workflow job after the Hadoop job ends. This information can be used within decision nodes and in the configuration of other actions.
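
For example, a downstream decision node could branch on a counter of this action using the hadoop:counters() EL function. The sketch below is only illustrative: it assumes an action named 'myfirstHadoopJob' and uses a hypothetical counter group 'RECORDS' with counter 'MAP_OUT':

<decision name="check-output">
    <switch>
        <!-- branch on a counter reported by the 'myfirstHadoopJob' map-reduce action -->
        <case to="myNextAction">${hadoop:counters('myfirstHadoopJob')['RECORDS']['MAP_OUT'] gt 0}</case>
        <default to="errorCleanup"/>
    </switch>
</decision>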

The map-reduce action has to be configured with all the necessary Hadoop JobConf properties to run the Hadoop map/reduce job.

Hadoop JobConf properties can be specified in a JobConf XML file bundled with the workflow application or they can be indicated inline in the map-reduce action configuration.

The configuration properties are loaded in the following order: streaming, job-xml, and configuration; later values override earlier values.

Streaming and inline property values can be parameterized (templatized) using EL expressions.
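
For example, an inline property value can combine a parameter of the workflow job configuration with a workflow EL function; a minimal sketch, where ${outputBase} is an assumed workflow parameter and wf:id() is the EL function returning the workflow job id:

<configuration>
    <property>
        <name>mapred.output.dir</name>
        <!-- ${outputBase} must be supplied in the workflow job configuration -->
        <value>${outputBase}/${wf:id()}</value>
    </property>
</configuration>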

The Hadoop mapred.job.tracker and fs.default.name properties must not be present in the job-xml and inline configuration.


3.2.2.1 Adding Files and Archives for the Job

The file and archive elements make files and archives available to map-reduce jobs. If the specified path is relative, the file or archive is assumed to be within the application directory, in the corresponding sub-path. If the path is absolute, the file or archive is expected at the given absolute path.

Files specified with the file element will be symbolic links in the home directory of the task.

If a file is a native library (an '.so' or a '.so.#' file), it will be symlinked as an '.so' file in the task running directory, and is thus available to the task JVM.

To force a symlink for a file in the task running directory, use a '#' followed by the symlink name. For example 'mycat.sh#cat'.

Refer to the Hadoop distributed cache documentation for more details on files and archives.
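
A minimal sketch of the two elements inside a map-reduce action (the paths and symlink names are illustrative): the relative file path is resolved within the application directory, and the '#' suffix sets the symlink name in the task running directory:

<map-reduce>
    ...
    <file>scripts/mycat.sh#cat</file>
    <archive>/user/tucu/lib/mylib.jar#mylib</archive>
    ...
</map-reduce>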


3.2.2.2 Streaming

Streaming information can be specified in the streaming element.

The mapper and reducer elements are used to specify the executable/script to be used as mapper and reducer.

User-defined scripts must be bundled with the workflow application and they must be declared in the files element of the streaming configuration. If they are not declared in the files element of the configuration, it is assumed they will be available (and in the command PATH) on the Hadoop slave machines.

Some streaming jobs require files found on HDFS to be available to the mapper/reducer scripts. This is done using the file and archive elements described in the previous section.

The mapper and reducer can be overridden by the mapred.mapper.class or mapred.reducer.class properties in the job-xml file or in the configuration element.
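
A minimal sketch of such an override using the inline configuration element (the mapper class name is a hypothetical example):

<configuration>
    <property>
        <!-- replaces the streaming mapper with a Java mapper class -->
        <name>mapred.mapper.class</name>
        <value>com.example.MyMapper</value>
    </property>
</configuration>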


3.2.2.3 Pipes

Pipes information can be specified in the pipes element.

A subset of the command line options available with the Hadoop Pipes Submitter can be specified via the following elements: map, reduce, inputformat, partitioner, writer, and program.

The program element is used to specify the executable/script to be used.

The user-defined program must be bundled with the workflow application.

Some pipes jobs require files found on HDFS to be available to the mapper/reducer scripts. This is done using the file and archive elements described in the previous section.

Pipes properties can be overridden by specifying them in the job-xml file or in the configuration element.
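
For example, the standard Hadoop Pipes properties can be set this way; a minimal sketch using the inline configuration element:

<configuration>
    <property>
        <!-- use the Java record reader/writer with the C++ pipes program -->
        <name>hadoop.pipes.java.recordreader</name>
        <value>true</value>
    </property>
    <property>
        <name>hadoop.pipes.java.recordwriter</name>
        <value>true</value>
    </property>
</configuration>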

3.2.2.4 Syntax

<workflow-app name="[WF-DEF-NAME]" xmlns="uri:oozie:workflow:0.1">
    ...
    <action name="[NODE-NAME]">
        <map-reduce>
            <job-tracker>[JOB-TRACKER]</job-tracker>
            <name-node>[NAME-NODE]</name-node>
            <prepare>
                <delete path="[PATH]"/>
                ...
                <mkdir path="[PATH]"/>
                ...
            </prepare>
            <streaming>
                <mapper>[MAPPER-PROCESS]</mapper>
                <reducer>[REDUCER-PROCESS]</reducer>
                <record-reader>[RECORD-READER-CLASS]</record-reader>
                <record-reader-mapping>[NAME=VALUE]</record-reader-mapping>
                ...
                <env>[NAME=VALUE]</env>
                ...
            </streaming>
            <!-- Either streaming or pipes can be specified for an action, not both -->
            <pipes>
                <map>[MAPPER]</map>
                <reduce>[REDUCER]</reduce>
                <inputformat>[INPUTFORMAT]</inputformat>
                <partitioner>[PARTITIONER]</partitioner>
                <writer>[OUTPUTFORMAT]</writer>
                <program>[EXECUTABLE]</program>
            </pipes>
            <job-xml>[JOB-XML-FILE]</job-xml>
            <configuration>
                <property>
                    <name>[PROPERTY-NAME]</name>
                    <value>[PROPERTY-VALUE]</value>
                </property>
                ...
            </configuration>
            <file>[FILE-PATH]</file>
            ...
            <archive>[FILE-PATH]</archive>
            ...
        </map-reduce>
        <ok to="[NODE-NAME]"/>
        <error to="[NODE-NAME]"/>
    </action>
    ...
</workflow-app>
The prepare element, if present, indicates a list of paths to delete before starting the job. This should be used exclusively for directory cleanup for the job to be executed. The delete operation will be performed in the fs.default.name filesystem.

The job-xml element, if present, must refer to a Hadoop JobConf job.xml file bundled in the workflow application. The job-xml element is optional and, if present, there can be only one.

The configuration element, if present, contains JobConf properties for the Hadoop job.

Properties specified in the configuration element override properties specified in the file specified in the job-xml element.

The file element, if present, must specify the target symbolic link for binaries by separating the original file and the target with a # (file#target-sym-link). This is not required for libraries.

The mapper and reducer processes for streaming jobs should specify the executable command with URL encoding, e.g. '%' should be replaced by '%25'.
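
For example, if a streaming mapper command contained a literal '%' in one of its arguments, that character would be written URL-encoded; the script name and argument below are illustrative:

<streaming>
    <!-- intended command: /bin/bash filter.sh 50% ; the '%' is encoded as '%25' -->
    <mapper>/bin/bash filter.sh 50%25</mapper>
    <reducer>/bin/cat</reducer>
</streaming>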

Example:

<workflow-app name="foo-wf" xmlns="uri:oozie:workflow:0.1">
    ...
    <action name="myfirstHadoopJob">
        <map-reduce>
            <job-tracker>foo:9001</job-tracker>
            <name-node>bar:9000</name-node>
            <prepare>
                <delete path="hdfs://foo:9000/usr/tucu/output-data"/>
            </prepare>
            <job-xml>/myfirstjob.xml</job-xml>
            <configuration>
                <property>
                    <name>mapred.input.dir</name>
                    <value>/usr/tucu/input-data</value>
                </property>
                <property>
                    <name>mapred.output.dir</name>
                    <value>/usr/tucu/output-data</value>
                </property>
                <property>
                    <name>mapred.reduce.tasks</name>
                    <value>${firstJobReducers}</value>
                </property>
            </configuration>
        </map-reduce>
        <ok to="myNextAction"/>
        <error to="errorCleanup"/>
    </action>
    ...
</workflow-app>
In the above example, the number of Reducers to be used by the Map/Reduce job has to be specified as a parameter of the workflow job configuration when creating the workflow job.
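
For instance, the ${firstJobReducers} parameter could be supplied in the workflow job configuration submitted when creating the job; a minimal sketch, with an assumed value of 10 reducers:

<configuration>
    <property>
        <name>firstJobReducers</name>
        <value>10</value>
    </property>
</configuration>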

Streaming Example:

<workflow-app name="sample-wf" xmlns="uri:oozie:workflow:0.1">
    ...
    <action name="firstjob">
        <map-reduce>
            <job-tracker>foo:9001</job-tracker>
            <name-node>bar:9000</name-node>
            <prepare>
                <delete path="${output}"/>
            </prepare>
            <streaming>
                <mapper>/bin/bash testarchive/bin/mapper.sh testfile</mapper>
                <reducer>/bin/bash testarchive/bin/reducer.sh</reducer>
            </streaming>
            <configuration>
                <property>
                    <name>mapred.input.dir</name>
                    <value>${input}</value>
                </property>
                <property>
                    <name>mapred.output.dir</name>
                    <value>${output}</value>
                </property>
                <property>
                    <name>stream.num.map.output.key.fields</name>
                    <value>3</value>
                </property>
            </configuration>
            <file>/users/blabla/testfile.sh#testfile</file>
            <archive>/users/blabla/testarchive.jar#testarchive</archive>
        </map-reduce>
        <ok to="end"/>
        <error to="kill"/>
    </action>
  ...
</workflow-app>
Pipes Example:

<workflow-app name="sample-wf" xmlns="uri:oozie:workflow:0.1">
    ...
    <action name="firstjob">
        <map-reduce>
            <job-tracker>foo:9001</job-tracker>
            <name-node>bar:9000</name-node>
            <prepare>
                <delete path="${output}"/>
            </prepare>
            <pipes>
                <program>bin/wordcount-simple#wordcount-simple</program>
            </pipes>
            <configuration>
                <property>
                    <name>mapred.input.dir</name>
                    <value>${input}</value>
                </property>
                <property>
                    <name>mapred.output.dir</name>
                    <value>${output}</value>
                </property>
            </configuration>
            <archive>/users/blabla/testarchive.jar#testarchive</archive>
        </map-reduce>
        <ok to="end"/>
        <error to="kill"/>
    </action>
  ...
</workflow-app>