
Tez: 3 Runtime API in Apache Tez


Apache Tez models data processing as a dataflow graph: the vertices represent the processing of data, and the edges represent the movement of data between processing steps. The user logic that analyzes and modifies the data sits in the vertices. Edges determine the consumer of the data, how the data is transferred, and the dependency between the producer and consumer vertices.

For users of MapReduce (MR), the most primitive functionality that Tez provides is the ability to run a chain of Reduce stages, as opposed to the single Reduce stage in the current MR implementation. Through the Task API, Tez can do this and much more: it can run any form of processing logic without that logic being retrofitted into a Map or Reduce task, and it supports multiple options for data transfer between vertices that are not restricted to the MapReduce shuffle transport mechanism.

The Building Blocks of Tez

The Task API provides the building blocks for users to plug their data analysis and modification logic into a vertex, and to augment this processing logic with the plugins needed to transfer and route data between vertices.

Tez models the user logic running in each vertex as a composition of a set of Inputs, a Processor and a set of Outputs.

  • Input: An input represents a pipe through which a processor can accept input data from a data source such as HDFS or the output generated by another vertex.
  • Processor: The entity responsible for consuming one or more Inputs and producing one or more Outputs.
  • Output: An output represents a pipe through which a processor can generate output data for another vertex to consume or to a data sink such as HDFS.
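
The composition described above can be pictured with a small, self-contained sketch. The `Input`, `Output` and `Processor` interfaces below are hypothetical stand-ins invented for this illustration, not the actual Tez interfaces; in-memory lists take the place of HDFS and of other vertices.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.Locale;

// Hypothetical stand-ins for illustration only; these are not the Tez interfaces.
final class VertexCompositionSketch {

    interface Input { Iterator<String> read(); }
    interface Output { void write(String record); }
    interface Processor { void run(List<Input> inputs, List<Output> outputs); }

    // An Input backed by an in-memory list, standing in for HDFS or an upstream vertex.
    static final class ListInput implements Input {
        private final List<String> records;
        ListInput(List<String> records) { this.records = records; }
        @Override public Iterator<String> read() { return records.iterator(); }
    }

    // An Output that collects records in memory, standing in for a data sink.
    static final class ListOutput implements Output {
        final List<String> collected = new ArrayList<>();
        @Override public void write(String record) { collected.add(record); }
    }

    // A Processor that upper-cases each record from every Input and writes it to every Output.
    static final class UpperCaseProcessor implements Processor {
        @Override public void run(List<Input> inputs, List<Output> outputs) {
            for (Input in : inputs) {
                Iterator<String> it = in.read();
                while (it.hasNext()) {
                    String record = it.next().toUpperCase(Locale.ROOT);
                    for (Output out : outputs) {
                        out.write(record);
                    }
                }
            }
        }
    }

    // Wires two Inputs, one Processor and one Output together, as a single task would.
    static List<String> runDemo() {
        ListOutput sink = new ListOutput();
        List<Input> inputs = Arrays.<Input>asList(
                new ListInput(Arrays.asList("a", "b")),
                new ListInput(Arrays.asList("c")));
        List<Output> outputs = Arrays.<Output>asList(sink);
        new UpperCaseProcessor().run(inputs, outputs);
        return sink.collected;
    }

    public static void main(String[] args) {
        System.out.println(runDemo()); // prints [A, B, C]
    }
}
```

The Processor here only knows about the pipe abstractions, so the same logic could be fed from a different source or drained into a different sink without changes.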

An edge in a DAG is a logical entity that represents a number of physical connections between the tasks of two connected vertices. To make it easier for a developer to implement a new Processor, there are two kinds of Inputs and Outputs, which either hide or expose this complexity:

  • Logical: A corresponding pair of a LogicalInput and a LogicalOutput represents the logical edge between 2 vertices. The implementation of Logical objects hides all the underlying physical connections and exposes a single view of the data.
  • Physical: The pair of Physical Input and Output represents the connection between a task of the Source vertex and a task of the Destination vertex.
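
To make the distinction concrete, here is a minimal sketch in which a logical input presents one stream of records while internally holding one physical input per source task. The `PhysicalInput` and `LogicalInput` classes are hypothetical and only borrow the names used above; they do not reproduce the Tez classes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration of logical vs. physical inputs; not the Tez classes.
final class LogicalEdgeSketch {

    // The connection from ONE task of the source vertex to this task.
    static final class PhysicalInput {
        final int sourceTaskIndex;
        final List<String> records;
        PhysicalInput(int sourceTaskIndex, List<String> records) {
            this.sourceTaskIndex = sourceTaskIndex;
            this.records = records;
        }
    }

    // Hides the physical connections of an edge behind a single view of the data.
    static final class LogicalInput {
        private final List<PhysicalInput> physicalInputs;
        LogicalInput(List<PhysicalInput> physicalInputs) {
            this.physicalInputs = physicalInputs;
        }

        int physicalInputCount() { return physicalInputs.size(); }

        // The Processor sees one list of records and never touches a PhysicalInput.
        List<String> readAll() {
            List<String> all = new ArrayList<>();
            for (PhysicalInput p : physicalInputs) {
                all.addAll(p.records);
            }
            return all;
        }
    }

    // One destination task fed by three source tasks.
    static LogicalInput demoEdge() {
        return new LogicalInput(Arrays.asList(
                new PhysicalInput(0, Arrays.asList("x1", "x2")),
                new PhysicalInput(1, Arrays.asList("y1")),
                new PhysicalInput(2, Arrays.asList("z1", "z2"))));
    }

    public static void main(String[] args) {
        LogicalInput input = demoEdge();
        System.out.println(input.physicalInputCount() + " physical inputs");
        System.out.println(input.readAll());
    }
}
```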

An example of the Reduce stage within an MR job would be a Reduce Processor that receives data from the maps via ShuffleInput and generates output to HDFS. Likewise, an intermediate Reduce stage in an MRR chain would be quite similar to the final Reduce stage except for the difference in the Output type.

 

 

 

Tez Runtime API



To implement a new Input, Processor or Output, a user has to implement the appropriate interfaces mentioned above. All objects are given a Context object in their initialize functions. This context is the hook through which these objects communicate with the Tez framework. The Inputs and Outputs are expected to provide implementations of their respective Readers and Writers, which the Processor then uses to read and write data. In a task, after the Tez framework has initialized all the necessary Inputs, Outputs and the Processor, it invokes the Processor’s run function and passes it the appropriate handles to all the Inputs and Outputs for that particular task.
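
The lifecycle described above (initialize everything with a context, then call the Processor's run function with handles to the Inputs and Outputs) can be sketched as follows. All interface names and signatures here are simplified, hypothetical stand-ins rather than the Tez runtime API, and the `runTask` method plays the part of the framework.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-ins; names and signatures are simplified and are not the Tez API.
final class TaskLifecycleSketch {

    interface Context { void report(String message); }
    interface Reader { boolean next(); String current(); }
    interface Writer { void write(String record); }
    interface Input { void initialize(Context ctx); Reader getReader(); }
    interface Output { void initialize(Context ctx); Writer getWriter(); }
    interface Processor {
        void initialize(Context ctx);
        void run(Map<String, Input> inputs, Map<String, Output> outputs);
    }

    static final class InMemoryInput implements Input {
        private final List<String> data;
        InMemoryInput(List<String> data) { this.data = data; }
        @Override public void initialize(Context ctx) { ctx.report("input initialized"); }
        @Override public Reader getReader() {
            final Iterator<String> it = data.iterator();
            return new Reader() {
                private String current;
                @Override public boolean next() {
                    if (!it.hasNext()) return false;
                    current = it.next();
                    return true;
                }
                @Override public String current() { return current; }
            };
        }
    }

    static final class InMemoryOutput implements Output {
        final List<String> written = new ArrayList<>();
        @Override public void initialize(Context ctx) { ctx.report("output initialized"); }
        @Override public Writer getWriter() { return written::add; }
    }

    // Copies every record from the Input named "in" to the Output named "out".
    static final class CopyProcessor implements Processor {
        private Context ctx;
        @Override public void initialize(Context ctx) {
            this.ctx = ctx;
            ctx.report("processor initialized");
        }
        @Override public void run(Map<String, Input> inputs, Map<String, Output> outputs) {
            ctx.report("processor run");
            Reader reader = inputs.get("in").getReader();
            Writer writer = outputs.get("out").getWriter();
            while (reader.next()) {
                writer.write(reader.current());
            }
        }
    }

    // Plays the framework's role: initialize all objects first, then invoke run.
    static List<String> runTask(List<String> lifecycleLog) {
        Context ctx = lifecycleLog::add;
        Map<String, Input> inputs = new LinkedHashMap<>();
        inputs.put("in", new InMemoryInput(Arrays.asList("k1", "k2")));
        InMemoryOutput out = new InMemoryOutput();
        Map<String, Output> outputs = new LinkedHashMap<>();
        outputs.put("out", out);
        Processor processor = new CopyProcessor();

        for (Input input : inputs.values()) input.initialize(ctx);
        for (Output output : outputs.values()) output.initialize(ctx);
        processor.initialize(ctx);
        processor.run(inputs, outputs);
        return out.written;
    }

    public static void main(String[] args) {
        List<String> log = new ArrayList<>();
        System.out.println(runTask(log));
        System.out.println(log);
    }
}
```

Keeping the context as the only channel back to the framework means an Input, Output or Processor needs no knowledge of how the framework itself is implemented.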

Tez allows all inputs and outputs to be pluggable. This requires support for passing of information from the Output of a source vertex to the Input of the destination vertex. For example, let us assume that the Output of a source vertex writes all of its data to a key-value store. The Output would need to communicate the “key” to the Input of the next stage so that the Input can retrieve the correct data from the key-value store. To facilitate this, Tez uses Events.

 

Events in Tez

Events in Tez are a way to pass information amongst different components.

  • The Tez framework uses Events to pass information about system events, such as task failures, to the components that need it.
  • Inputs of a vertex can inform the framework of any failures encountered while retrieving data from the source vertex’s Output; the framework can then use this information to take failure recovery measures.
  • An Output can pass the location of the data it generates to the Inputs of the destination vertex. An example of this is described in the Shuffle Event diagram, which shows how the output of a Map stage informs the Shuffle Input of the Reduce stage of the location of its output via a Data Movement Event.
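
The key-value store scenario mentioned earlier is a convenient way to picture a data movement event: the Output stores its data under a key and announces that key, and the Input uses the announced key to fetch the data. The sketch below is a hypothetical illustration; the `DataMovementEvent` class here is a simplified stand-in with an opaque byte payload, a `HashMap` stands in for the external store, and `runDemo` plays the framework's routing role.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of event passing; not the Tez event classes.
final class KeyValueEventSketch {

    // An opaque payload travelling from an Output to the downstream Input.
    static final class DataMovementEvent {
        final byte[] payload;
        DataMovementEvent(byte[] payload) { this.payload = payload; }
    }

    // Output of the source vertex: stores its data under a key, then announces the key.
    static final class KvStoreOutput {
        private final Map<String, String> store;
        private final String key;
        KvStoreOutput(Map<String, String> store, String key) {
            this.store = store;
            this.key = key;
        }
        DataMovementEvent writeAndClose(String data) {
            store.put(key, data);
            return new DataMovementEvent(key.getBytes(StandardCharsets.UTF_8));
        }
    }

    // Input of the destination vertex: learns the key from the event, then fetches the data.
    static final class KvStoreInput {
        private final Map<String, String> store;
        final List<String> fetched = new ArrayList<>();
        KvStoreInput(Map<String, String> store) { this.store = store; }
        void handleEvent(DataMovementEvent event) {
            String key = new String(event.payload, StandardCharsets.UTF_8);
            String data = store.get(key);
            if (data == null) {
                // A real Input would report the read failure to the framework instead.
                throw new IllegalStateException("no data for key " + key);
            }
            fetched.add(data);
        }
    }

    // Plays the framework's role: collects events from the Outputs and routes them to the Input.
    static List<String> runDemo() {
        Map<String, String> store = new HashMap<>(); // stands in for the external key-value store
        List<DataMovementEvent> events = new ArrayList<>();
        events.add(new KvStoreOutput(store, "task-0").writeAndClose("alpha"));
        events.add(new KvStoreOutput(store, "task-1").writeAndClose("beta"));

        KvStoreInput input = new KvStoreInput(store);
        for (DataMovementEvent event : events) {
            input.handleEvent(event);
        }
        return input.fetched;
    }

    public static void main(String[] args) {
        System.out.println(runDemo()); // prints [alpha, beta]
    }
}
```

Because the payload is opaque to whoever routes it, only the matching Output and Input need to agree on what it contains.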



Another use of Events is to enable run-time changes to the DAG execution plan. For example, depending on the amount of data generated by a Map stage, it may be more efficient to run fewer reduce tasks in the following Reduce stage. Events generated by Outputs are routed to the pluggable Vertex/Edge management modules, allowing them to make the necessary decisions and modify run-time parameters as needed.
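
As a rough illustration of such a run-time decision, the sketch below accumulates output sizes reported by upstream tasks and derives a reducer count from them. The class, the "desired bytes per reducer" parameter and the rounding rule are all assumptions made for this example; they are not taken from the Tez vertex management modules.

```java
// Hypothetical sketch of a run-time parallelism decision; not a Tez vertex manager.
final class ReducerParallelismSketch {

    private final int configuredReducers;      // upper bound from the original plan
    private final long desiredBytesPerReducer; // assumed tuning parameter
    private long totalOutputBytes = 0;

    ReducerParallelismSketch(int configuredReducers, long desiredBytesPerReducer) {
        if (configuredReducers < 1 || desiredBytesPerReducer < 1) {
            throw new IllegalArgumentException("both arguments must be positive");
        }
        this.configuredReducers = configuredReducers;
        this.desiredBytesPerReducer = desiredBytesPerReducer;
    }

    // Called once per event carrying the output size of a finished upstream task.
    void onOutputSizeEvent(long outputBytes) {
        totalOutputBytes += outputBytes;
    }

    // Rounds up to whole reducers, never exceeds the configured count, never drops below 1.
    int decideReducerCount() {
        long needed = (totalOutputBytes + desiredBytesPerReducer - 1) / desiredBytesPerReducer;
        return (int) Math.max(1L, Math.min((long) configuredReducers, needed));
    }

    static int demo() {
        ReducerParallelismSketch manager = new ReducerParallelismSketch(10, 100);
        manager.onOutputSizeEvent(120);
        manager.onOutputSizeEvent(80);
        manager.onOutputSizeEvent(50);
        return manager.decideReducerCount(); // 250 bytes at 100 bytes per reducer
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints 3
    }
}
```

In this sketch the plan is only ever shrunk: the configured reducer count acts as a ceiling, which matches the "run fewer reduce tasks" case described above.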

 

Available implementations of Inputs/Processors/Outputs

The flexibility of Tez allows anyone to implement their own Inputs and Outputs, whether they use blocking or non-blocking transport protocols and whether they handle data as raw bytes, records or key-value pairs, and to build Processors that handle this variety of Inputs and Outputs.

There is already a small repository of various implementations of Inputs/Outputs/Processors:

  • MRInput and MROutput: A basic Input and Output that handle data to/from HDFS and are MapReduce compatible, as they use MapReduce constructs such as InputFormat, RecordReader, OutputFormat and RecordWriter.
  • OnFileSortedOutput and ShuffleMergedInput: A pair of key-value based Input and Output that use the local disk for all I/O and provide the same sort+merge functionality that is required for the “shuffle” edge between the Map and Reduce stages in a MapReduce job.
  • OnFileUnorderedKVOutput and ShuffledUnorderedKVInput: These are similar to the shuffle pair mentioned earlier except that the data is not sorted implicitly. This can be a big performance boost in various situations.
  • MapProcessor and ReduceProcessor: As the names suggest, these processors are available for anyone trying to run a MapReduce job on the Tez execution framework. They can be used to run an MRR chain too.
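
The difference between the sorted and the unordered key-value pairs can be illustrated with a toy example. The two methods below are hypothetical and only contrast the two contracts (records ordered by key versus records passed through in arrival order); they say nothing about how the listed Tez classes are implemented.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical contrast of sorted vs. unordered key-value output; not the Tez classes.
final class KvOutputOrderingSketch {

    // Sorted contract: the consumer receives records ordered by key (costs a sort).
    static List<Map.Entry<String, Integer>> sortedOutput(List<Map.Entry<String, Integer>> records) {
        List<Map.Entry<String, Integer>> copy = new ArrayList<>(records);
        copy.sort(Map.Entry.comparingByKey());
        return copy;
    }

    // Unordered contract: records are passed through in arrival order (no sort).
    static List<Map.Entry<String, Integer>> unorderedOutput(List<Map.Entry<String, Integer>> records) {
        return new ArrayList<>(records);
    }

    static List<String> keysOf(List<Map.Entry<String, Integer>> records) {
        List<String> keys = new ArrayList<>();
        for (Map.Entry<String, Integer> e : records) {
            keys.add(e.getKey());
        }
        return keys;
    }

    static List<Map.Entry<String, Integer>> sample() {
        List<Map.Entry<String, Integer>> records = new ArrayList<>();
        records.add(new AbstractMap.SimpleEntry<>("pear", 1));
        records.add(new AbstractMap.SimpleEntry<>("apple", 2));
        records.add(new AbstractMap.SimpleEntry<>("fig", 3));
        return records;
    }

    public static void main(String[] args) {
        System.out.println(keysOf(sortedOutput(sample())));    // prints [apple, fig, pear]
        System.out.println(keysOf(unorderedOutput(sample()))); // prints [pear, apple, fig]
    }
}
```

When the downstream Processor does not depend on key order, skipping the sort removes work from every producing task, which is where the performance benefit mentioned above would come from.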

 

Original doc:

http://hortonworks.com/blog/task-api-apache-tez/

 

 

 

 

 

 
