
Tez: 2 Data Processing API in Apache Tez


Overview

Apache Tez models data processing as a dataflow graph: vertices represent the processing of data, and edges represent the movement of data between processing steps. User logic that analyzes and modifies the data therefore sits in the vertices. Edges determine the consumer of the data, how the data is transferred, and the dependency between the producer and consumer vertices. This model concisely captures the logical definition of the computation. When a Tez job executes on the cluster, Tez expands this logical graph into a physical graph by adding parallelism at the vertices to scale to the size of the data being processed. Multiple tasks are created per logical vertex to perform the computation in parallel.

DAG Definition API

More technically, the data processing is expressed as a directed acyclic graph (DAG). Processing starts at the root vertices of the DAG and continues down the directed edges until it reaches the leaf vertices. When all the vertices in the DAG have completed, the data processing job is done. The graph has no cycles because the fault tolerance mechanism used by Tez is re-execution of failed tasks. When the input to a task is lost, the producer task of that input is re-executed, so Tez needs to be able to walk up the graph edges to locate a non-failed task from which to restart the computation. Cycles in the graph can make this walk difficult to perform. In some cases, cycles may be handled by unrolling them to create a DAG.
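The walk up the graph described above can be sketched in a few lines of Java. This is a hypothetical model, not Tez code: tasks are plain strings, and the class name `ReexecutionWalk` and its methods are invented for illustration. Starting from a task that must be restarted, it follows edges toward the producers and stops at any producer whose output is still available.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical model, not Tez code: tasks are plain strings.
final class ReexecutionWalk {
    // consumer task -> tasks whose output it reads
    private final Map<String, List<String>> producersOf = new HashMap<>();
    // tasks whose output is no longer available
    private final Set<String> lostOutputs = new HashSet<>();

    void addEdge(String producer, String consumer) {
        producersOf.computeIfAbsent(consumer, k -> new ArrayList<>()).add(producer);
    }

    void markOutputLost(String task) {
        lostOutputs.add(task);
    }

    // Tasks that must run again so that 'task' can be (re)started.
    Set<String> tasksToRerun(String task) {
        Set<String> rerun = new LinkedHashSet<>();
        walkUp(task, rerun);
        return rerun;
    }

    private void walkUp(String task, Set<String> rerun) {
        if (!rerun.add(task)) {
            return; // already visited via another edge
        }
        for (String producer : producersOf.getOrDefault(task, Collections.<String>emptyList())) {
            if (lostOutputs.contains(producer)) {
                walkUp(producer, rerun); // input is gone, so go one level further up
            }
            // otherwise the producer's output still exists and the walk stops here
        }
    }
}
```

For a chain map → reduce → join in which only the output of `reduce` has been lost, `tasksToRerun("join")` returns `join` and `reduce`; the walk stops at `map` because its output is still available.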

Tez defines a simple Java API to express a DAG of data processing. The API has three components:

  • DAG. This defines the overall job. The user creates a DAG object for each data processing job.
  • Vertex. This defines the user logic and the resources and environment needed to execute the user logic. The user creates a Vertex object for each step in the job and adds it to the DAG.
  • Edge. This defines the connection between producer and consumer vertices. The user creates an Edge object and connects the producer and consumer vertices using it.



 The diagram shows a dataflow graph and its definition using the DAG API (simplified). The job consists of 2 vertices performing a “Map” operation on 2 datasets. Their output is consumed by 2 vertices that do a “Reduce” operation. Their output is brought together in the last vertex that does a “Join” operation.
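Since the diagram is not reproduced here, the job it describes can be approximated in code. The classes below (`SketchDag`, `SketchVertex`, `SketchEdge`) are simplified stand-ins invented for this sketch; they mirror the three components named above but are not the Tez classes, and the vertex names are assumptions.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Simplified stand-ins for the three API components; these are NOT the Tez classes.
final class SketchVertex {
    final String name;
    SketchVertex(String name) { this.name = name; }
}

final class SketchEdge {
    final SketchVertex producer;
    final SketchVertex consumer;
    SketchEdge(SketchVertex producer, SketchVertex consumer) {
        this.producer = producer;
        this.consumer = consumer;
    }
}

final class SketchDag {
    final String name;
    private final Map<String, SketchVertex> vertices = new LinkedHashMap<>();
    private final List<SketchEdge> edges = new ArrayList<>();

    SketchDag(String name) { this.name = name; }

    SketchDag addVertex(SketchVertex v) {
        if (vertices.putIfAbsent(v.name, v) != null) {
            throw new IllegalArgumentException("duplicate vertex: " + v.name);
        }
        return this;
    }

    SketchDag addEdge(SketchEdge e) {
        if (!vertices.containsKey(e.producer.name) || !vertices.containsKey(e.consumer.name)) {
            throw new IllegalArgumentException("edge refers to a vertex that was not added");
        }
        edges.add(e);
        return this;
    }

    // Root vertices have no incoming edge; processing starts there.
    Set<String> roots() {
        Set<String> result = new LinkedHashSet<>(vertices.keySet());
        for (SketchEdge e : edges) result.remove(e.consumer.name);
        return result;
    }

    // Leaf vertices have no outgoing edge; processing ends there.
    Set<String> leaves() {
        Set<String> result = new LinkedHashSet<>(vertices.keySet());
        for (SketchEdge e : edges) result.remove(e.producer.name);
        return result;
    }

    // The job from the text: two Map vertices, two Reduce vertices, one Join vertex.
    static SketchDag mapReduceJoin() {
        SketchVertex map1 = new SketchVertex("Map1");
        SketchVertex map2 = new SketchVertex("Map2");
        SketchVertex reduce1 = new SketchVertex("Reduce1");
        SketchVertex reduce2 = new SketchVertex("Reduce2");
        SketchVertex join = new SketchVertex("Join");
        return new SketchDag("MRJoin")
                .addVertex(map1).addVertex(map2)
                .addVertex(reduce1).addVertex(reduce2)
                .addVertex(join)
                .addEdge(new SketchEdge(map1, reduce1))
                .addEdge(new SketchEdge(map2, reduce2))
                .addEdge(new SketchEdge(reduce1, join))
                .addEdge(new SketchEdge(reduce2, join));
    }
}
```

In this sketch `SketchDag.mapReduceJoin().roots()` yields the two Map vertices and `leaves()` yields the Join vertex, matching the root-to-leaf processing order described earlier.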

Tez handles expanding this logical graph at runtime to perform the operations in parallel using multiple tasks. The diagram shows a runtime expansion in which the first M-R pair has a parallelism of 2 while the second has a parallelism of 3. Both branches of computation merge in the Join operation that has a parallelism of 2. Edge properties are at the heart of this runtime activity.
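The expansion from logical vertices to tasks can be illustrated with the parallelism values given above. This is only a counting sketch: the class `PhysicalExpansion`, the vertex names and the `Name[index]` task labels are assumptions made for the example, and the real runtime decides parallelism itself.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of logical-to-physical expansion; not Tez code.
final class PhysicalExpansion {

    // Expands each logical vertex into 'parallelism' tasks, labelled like "Map1[0]".
    static List<String> expand(Map<String, Integer> parallelismByVertex) {
        List<String> tasks = new ArrayList<>();
        for (Map.Entry<String, Integer> entry : parallelismByVertex.entrySet()) {
            for (int i = 0; i < entry.getValue(); i++) {
                tasks.add(entry.getKey() + "[" + i + "]");
            }
        }
        return tasks;
    }

    // Parallelism from the text: first M-R pair 2, second M-R pair 3, Join 2.
    static Map<String, Integer> diagramParallelism() {
        Map<String, Integer> p = new LinkedHashMap<>();
        p.put("Map1", 2);
        p.put("Reduce1", 2);
        p.put("Map2", 3);
        p.put("Reduce2", 3);
        p.put("Join", 2);
        return p;
    }
}
```

With these values the five logical vertices expand into 2 + 2 + 3 + 3 + 2 = 12 tasks.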



 

Edge Properties

The following edge properties enable Tez to instantiate the tasks, configure their inputs and outputs, schedule them appropriately and help route the data between the tasks. The parallelism for each vertex is determined based on user guidance, data size and resources.

  • Data movement. Defines routing of data between tasks
    • One-To-One: Data from the ith producer task routes to the ith consumer task.
    • Broadcast: Data from a producer task routes to all consumer tasks.
    • Scatter-Gather: Producer tasks scatter data into shards and consumer tasks gather the shards. The ith shard from all producer tasks routes to the ith consumer task.
  • Scheduling. Defines when a consumer task is scheduled
    • Sequential: Consumer task may be scheduled after a producer task completes.
    • Concurrent: Consumer task must be co-scheduled with a producer task.
  • Data source. Defines the lifetime/reliability of a task output
    • Persisted: Output will be available after the task exits. Output may be lost later on.
    • Persisted-Reliable: Output is reliably stored and will always be available.
    • Ephemeral: Output is available only while the producer task is running.
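The three data movement types listed above amount to three routing rules. The sketch below states them as a single function; the enum `MovementSketch` and the class `RoutingSketch` are hypothetical names for this example and do not come from the Tez API.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical names for illustration; not the Tez API.
enum MovementSketch { ONE_TO_ONE, BROADCAST, SCATTER_GATHER }

final class RoutingSketch {

    // Returns the consumer task indices that receive shard 'shard'
    // produced by producer task 'producerTask'.
    static List<Integer> destinations(MovementSketch movement, int producerTask,
                                      int shard, int consumerTasks) {
        List<Integer> out = new ArrayList<>();
        switch (movement) {
            case ONE_TO_ONE:
                // The ith producer task feeds the ith consumer task.
                requireIndex(producerTask, consumerTasks);
                out.add(producerTask);
                break;
            case BROADCAST:
                // Every consumer task receives the producer's data.
                for (int c = 0; c < consumerTasks; c++) out.add(c);
                break;
            case SCATTER_GATHER:
                // The ith shard of every producer task goes to the ith consumer task.
                requireIndex(shard, consumerTasks);
                out.add(shard);
                break;
        }
        return out;
    }

    private static void requireIndex(int index, int consumerTasks) {
        if (index < 0 || index >= consumerTasks) {
            throw new IllegalArgumentException("no consumer task with index " + index);
        }
    }
}
```

Note that under scatter-gather the destination depends only on the shard index, not on which producer task emitted it, which is what lets each consumer task gather one shard from every producer.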

Some real-life use cases will help clarify the edge properties. MapReduce would be expressed with the scatter-gather, sequential and persisted edge properties. Map tasks scatter partitions and reduce tasks gather them. Reduce tasks are scheduled after the map tasks complete, and the map task outputs are written to local disk and are hence available after the map tasks have completed. When a vertex checkpoints its output into HDFS, its output edge has a persisted-reliable property. If a producer vertex is streaming data directly to a consumer vertex, the edge between them has ephemeral and concurrent properties. A broadcast property is used on a sampler vertex that produces a global histogram of data ranges for range partitioning.
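Two of these use cases name a full set of properties, so they can be written down as plain values. The enums and the class `EdgePropsSketch` below are hypothetical stand-ins rather than Tez types. The `isCoherent` check is an inference from the property definitions, not a rule stated in the source, and the streaming case takes the data movement type as a parameter because the text does not specify it.

```java
// Hypothetical stand-ins for the three edge property groups; not Tez types.
enum EdgeMovement { ONE_TO_ONE, BROADCAST, SCATTER_GATHER }
enum EdgeScheduling { SEQUENTIAL, CONCURRENT }
enum EdgeDataSource { PERSISTED, PERSISTED_RELIABLE, EPHEMERAL }

final class EdgePropsSketch {
    final EdgeMovement movement;
    final EdgeScheduling scheduling;
    final EdgeDataSource dataSource;

    EdgePropsSketch(EdgeMovement movement, EdgeScheduling scheduling, EdgeDataSource dataSource) {
        this.movement = movement;
        this.scheduling = scheduling;
        this.dataSource = dataSource;
    }

    // MapReduce shuffle as described in the text:
    // scatter-gather, sequential, persisted.
    static EdgePropsSketch mapReduceShuffle() {
        return new EdgePropsSketch(EdgeMovement.SCATTER_GATHER,
                EdgeScheduling.SEQUENTIAL, EdgeDataSource.PERSISTED);
    }

    // Producer streaming directly to a consumer: concurrent and ephemeral.
    // The movement type is left to the caller because the text does not fix it.
    static EdgePropsSketch streaming(EdgeMovement movement) {
        return new EdgePropsSketch(movement,
                EdgeScheduling.CONCURRENT, EdgeDataSource.EPHEMERAL);
    }

    // Inference from the definitions, not a documented rule: ephemeral output exists
    // only while the producer runs, so the consumer has to be co-scheduled with it.
    boolean isCoherent() {
        return dataSource != EdgeDataSource.EPHEMERAL
                || scheduling == EdgeScheduling.CONCURRENT;
    }

    @Override
    public String toString() {
        return movement + "/" + scheduling + "/" + dataSource;
    }
}
```

Keeping the three properties as independent fields reflects how the text presents them: each edge picks one value per group, and combinations such as sequential plus ephemeral can be flagged as questionable before a job is submitted.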

We hope that the Tez dataflow definition API will be able to express a broad spectrum of data processing topologies and enable higher level languages to elegantly transform their queries into Tez jobs.

 

 

 

Original doc: http://hortonworks.com/blog/expressing-data-processing-in-apache-tez/
