sunwinner

Cascading Terminology and Concepts

 

Cascading is a data processing API and processing query planner for defining, sharing, and executing data-processing workflows on a single computing node or a distributed computing cluster. On a single node, Cascading's "local mode" can be used to test code efficiently and to process local files before the application is deployed on a cluster. On a distributed computing cluster running the Apache Hadoop platform, Cascading adds an abstraction layer over the Hadoop API, which greatly simplifies Hadoop application development, job creation, and job scheduling. Java developers can use Cascading to build robust data analytics and data management applications on Apache Hadoop. You can find a kick start example in this blog post.

 

Terminology of Cascading

The Cascading processing model is based on a metaphor of pipes (data streams) and filters (data operations). Thus the Cascading API allows the developer to assemble pipe assemblies that split, merge, group, or join streams of data while applying operations to each data record or groups of records.

 

In Cascading, we call a data record a tuple, a simple chain of pipes without forks or merges a branch, an interconnected set of pipe branches a pipe assembly, and a series of tuples passing through a pipe branch or assembly a tuple stream.

 

Pipe assemblies are specified independently of the data source they are to process. So before a pipe assembly can be executed, it must be bound to taps, i.e., data sources and sinks. The result of binding one or more pipe assemblies to taps is a flow, which is executed on a computer or cluster using the Hadoop framework.
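To make the vocabulary concrete, here is a minimal sketch in plain Java that models these terms with JDK types only. It does not use the Cascading API: `FlowModelSketch`, `assembly`, and `bind` are hypothetical names chosen for illustration. A tuple is a `List<String>`, a tuple stream is a `Stream`, the pipe assembly is a function over tuple streams, and "binding to taps" means supplying a source and a sink.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Consumer;
import java.util.function.Function;
import java.util.function.Supplier;
import java.util.stream.Stream;

// Plain-JDK model of the vocabulary above. None of these names are Cascading classes.
class FlowModelSketch {

    // A "pipe assembly": operations on a tuple stream, written without
    // naming any concrete data source or sink.
    static Function<Stream<List<String>>, Stream<List<String>>> assembly() {
        return tuples -> tuples
                // filter: drop tuples whose second field is empty
                .filter(t -> !t.get(1).isEmpty())
                // function: upper-case the second field of each remaining tuple
                .map(t -> Arrays.asList(t.get(0), t.get(1).toUpperCase()));
    }

    // A "flow": the assembly bound to a source tap and a sink tap.
    // Nothing is read or written until run() is called on the result.
    static Runnable bind(Supplier<Stream<List<String>>> source,
                         Function<Stream<List<String>>, Stream<List<String>>> assembly,
                         Consumer<List<String>> sink) {
        return () -> assembly.apply(source.get()).forEach(sink);
    }

    public static void main(String[] args) {
        List<List<String>> input = Arrays.asList(
                Arrays.asList("1", "alpha"),
                Arrays.asList("2", ""),
                Arrays.asList("3", "gamma"));
        List<List<String>> output = new ArrayList<>();

        // The same assembly could be bound to any other source and sink.
        Runnable flow = bind(input::stream, assembly(), output::add);
        flow.run();
        System.out.println(output);
    }
}
```

The point of the sketch is the separation: `assembly()` never mentions where the tuples come from, so the same assembly can be bound to different sources and sinks.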

 

 

Multiple flows can be grouped together and executed as a single process. In this context, if one flow depends on the output of another, it is not executed until all of its data dependencies are satisfied. Such a collection of flows is called a cascade.
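The dependency rule can be illustrated with a small plain-Java sketch. It is not how Cascading itself plans a cascade; `CascadeOrderSketch` and `FlowSpec` are hypothetical names, and each flow is reduced to a name plus the identifiers of the data it reads and writes. A flow is scheduled only after every flow that produces one of its inputs.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

// Plain-JDK illustration of dependency ordering between flows; not the Cascading planner.
class CascadeOrderSketch {

    // Hypothetical stand-in for a flow: its name and the data identifiers it reads and writes.
    static final class FlowSpec {
        final String name;
        final List<String> sources;
        final List<String> sinks;

        FlowSpec(String name, List<String> sources, List<String> sinks) {
            this.name = name;
            this.sources = sources;
            this.sinks = sinks;
        }
    }

    // Returns flow names ordered so that each flow comes after the producers of its inputs.
    static List<String> executionOrder(List<FlowSpec> flows) {
        // Map every sink identifier to the flow that writes it.
        Map<String, String> producerOf = new HashMap<>();
        for (FlowSpec f : flows) {
            for (String sink : f.sinks) {
                producerOf.put(sink, f.name);
            }
        }

        List<String> order = new ArrayList<>();
        List<FlowSpec> pending = new ArrayList<>(flows);
        while (!pending.isEmpty()) {
            boolean progressed = false;
            for (Iterator<FlowSpec> it = pending.iterator(); it.hasNext();) {
                FlowSpec f = it.next();
                boolean ready = true;
                for (String source : f.sources) {
                    String producer = producerOf.get(source);
                    // An input written by another flow must already be scheduled.
                    if (producer != null && !producer.equals(f.name) && !order.contains(producer)) {
                        ready = false;
                    }
                }
                if (ready) {
                    order.add(f.name);
                    it.remove();
                    progressed = true;
                }
            }
            if (!progressed) {
                throw new IllegalStateException("cyclic dependency between flows");
            }
        }
        return order;
    }

    public static void main(String[] args) {
        List<FlowSpec> flows = Arrays.asList(
                new FlowSpec("report", Arrays.asList("clean"), Arrays.asList("summary")),
                new FlowSpec("clean", Arrays.asList("raw"), Arrays.asList("clean")),
                new FlowSpec("import", Arrays.asList("input.txt"), Arrays.asList("raw")));
        System.out.println(executionOrder(flows));
    }
}
```

In the example, the flows are listed in reverse order on purpose; the returned order places `import` first, then `clean`, then `report`.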

 

Concepts of Cascading

  • Pipe Assemblies:  Pipe assemblies define what work should be done against tuple streams, which are read from tap sources and written to tap sinks. The work performed on the data stream may include actions such as filtering, transforming, organizing, and calculating. Pipe assemblies may use multiple sources and multiple sinks, and may define splits, merges, and joins to manipulate the tuple streams.
  • Pipes:  The base class cascading.pipe.Pipe and its subclasses are shown in the diagram below.

    [Figure: class diagram of cascading.pipe.Pipe and its subclasses]

    The following table summarizes the different types of pipes.

    [Table: summary of pipe types]

    We will talk more about pipes in another blog post.
  • Connector: Cascading supports pluggable planners that allow it to execute on differing platforms. Each planner is invoked by an associated FlowConnector subclass. Currently, two planners are provided, invoked through LocalFlowConnector and HadoopFlowConnector.

    LocalFlowConnector provides a local mode planner for running Cascading completely in memory on the current computer, while HadoopFlowConnector provides a planner for running Cascading on an Apache Hadoop cluster.
  • Tap:  All input data comes in from, and all output data goes out to, some instance of cascading.tap.Tap. A tap can be read from, which makes it a source, or written to, which makes it a sink. More commonly, a tap acts as both a sink and a source when it is shared between flows. Below is the class diagram of Taps:

    [Figure: class diagram of cascading.tap.Tap and its subclasses]

    We won't cover every tap here; please refer to the Cascading Javadoc for details of these tap classes.
  • Scheme: If the Tap is about where the data is and how to access it, the Scheme is about what the data is and how to read it. Every Tap must have a Scheme that describes the data. Cascading provides four Scheme classes: TextLine, TextDelimited, SequenceFile, and WritableSequenceFile. Below is the class diagram of Scheme:

    [Figure: class diagram of the Scheme classes]
  • Field Set:  Cascading applications can perform complex manipulation or "field algebra" on the fields stored in tuples, using Fields sets, a feature of the Fields class that provides a sort of wildcard tool for referencing sets of field values. These predefined Fields sets are constant values on the Fields class. They can be used in many places where the Fields class is expected. 
      /** Field UNKNOWN */
      public static final Fields UNKNOWN = new Fields( Kind.UNKNOWN );
      /** Field NONE represents a wildcard for no fields */
      public static final Fields NONE = new Fields( Kind.NONE );
      /** Field ALL represents a wildcard for all fields */
      public static final Fields ALL = new Fields( Kind.ALL );
      /** Field GROUP represents all fields used as the key for the last grouping */
      public static final Fields GROUP = new Fields( Kind.GROUP );
      /** Field VALUES represents all fields used as values for the last grouping */
      public static final Fields VALUES = new Fields( Kind.VALUES );
      /** Field ARGS represents all fields used as the arguments for the current operation */
      public static final Fields ARGS = new Fields( Kind.ARGS );
      /** Field RESULTS represents all fields returned by the current operation */
      public static final Fields RESULTS = new Fields( Kind.RESULTS );
      /** Field REPLACE represents all incoming fields, and allows their values to be replaced by the current operation results. */
      public static final Fields REPLACE = new Fields( Kind.REPLACE );
      /** Field SWAP represents all fields not used as arguments for the current operation, plus the operation's results. */
      public static final Fields SWAP = new Fields( Kind.SWAP );
      /** Field FIRST represents the first field position, 0 */
      public static final Fields FIRST = new Fields( 0 );
      /** Field LAST represents the last field position, -1 */
      public static final Fields LAST = new Fields( -1 );
    The chart below shows common ways to merge input and result fields for the desired output fields. A few minutes with this chart may help clarify the discussion of fields, tuples, and pipes. Also see Each and Every Pipes for details on the different columns and their relationships to the Each and Every pipes and Functions, Aggregators, and Buffers.

     
  • Flow:  When pipe assemblies are bound to source and sink taps, a Flow is created. Flows are executable in the sense that, once they are created, they can be started and will execute on the specified platform. If the Hadoop platform is specified, the Flow will execute on a Hadoop cluster. A Flow is essentially a data processing pipeline that reads data from sources, processes the data as defined by the pipe assembly, and writes data to the sinks. 

     
  • Cascade:  A Cascade allows multiple Flow instances to be executed as a single logical unit. If there are dependencies between the Flows, they are executed in the correct order. Further, Cascades act like Ant builds or Unix makefiles: a Cascade only executes Flows that have stale sinks (i.e., output data that is older than the input data). 
    CascadeConnector connector = new CascadeConnector();
    Cascade cascade = connector.connect( flowFirst, flowSecond, flowThird );
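The make-like staleness rule described for Cascade can be sketched in plain Java as a comparison of modification times. This is only an illustration, not Cascading's implementation: `StaleSinkSketch` and `needsRun` are hypothetical names, the timestamps are plain `long` values, and treating a sink that does not exist yet as stale is an assumption of the sketch.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Plain-JDK illustration of the "stale sink" rule; not Cascading code.
class StaleSinkSketch {

    // Returns true when any sink is missing or older than the newest source.
    // 'modified' maps a data identifier to its last-modified time.
    static boolean needsRun(Map<String, Long> modified, List<String> sources, List<String> sinks) {
        long newestSource = Long.MIN_VALUE;
        for (String source : sources) {
            Long time = modified.get(source);
            if (time == null) {
                throw new IllegalArgumentException("missing source: " + source);
            }
            newestSource = Math.max(newestSource, time);
        }
        for (String sink : sinks) {
            Long time = modified.get(sink);
            // Assumption of this sketch: a sink that was never written counts as stale.
            if (time == null || time < newestSource) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        Map<String, Long> modified = new HashMap<>();
        modified.put("raw", 100L);
        modified.put("clean", 50L);

        // "clean" is older than "raw", so the flow producing it has to run again.
        System.out.println(needsRun(modified, Arrays.asList("raw"), Arrays.asList("clean")));

        // After a rerun the sink is newer than its source and the flow can be skipped.
        modified.put("clean", 150L);
        System.out.println(needsRun(modified, Arrays.asList("raw"), Arrays.asList("clean")));
    }
}
```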
     
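Returning to the Field Set entry in the list above, the effect of ALL, RESULTS, REPLACE, and SWAP as output selectors can be modelled in plain Java. The sketch follows the Javadoc comments quoted in that entry, but it is not the Cascading implementation: `FieldAlgebraSketch`, `Selector`, and `select` are hypothetical names, a tuple is modelled as an ordered map from field name to value, and rejecting duplicate field names is an assumption of the sketch.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Plain-JDK model of four output selectors; not Cascading code.
class FieldAlgebraSketch {

    enum Selector { ALL, RESULTS, REPLACE, SWAP }

    // Combines the incoming tuple and the operation results according to the selector.
    static Map<String, String> select(Selector selector, Map<String, String> incoming,
                                      List<String> arguments, Map<String, String> results) {
        Map<String, String> out = new LinkedHashMap<>();
        switch (selector) {
            case ALL:       // every incoming field followed by every result field
                out.putAll(incoming);
                appendDistinct(out, results);
                break;
            case RESULTS:   // only the fields returned by the operation
                out.putAll(results);
                break;
            case REPLACE:   // incoming fields, with same-named values overwritten in place
                out.putAll(incoming);
                for (Map.Entry<String, String> e : results.entrySet()) {
                    if (!out.containsKey(e.getKey())) {
                        throw new IllegalArgumentException("no incoming field to replace: " + e.getKey());
                    }
                    out.put(e.getKey(), e.getValue());
                }
                break;
            case SWAP:      // incoming fields minus the arguments, followed by the results
                out.putAll(incoming);
                out.keySet().removeAll(arguments);
                appendDistinct(out, results);
                break;
        }
        return out;
    }

    // Appends result fields, rejecting name clashes (an assumption of this sketch).
    private static void appendDistinct(Map<String, String> out, Map<String, String> results) {
        for (Map.Entry<String, String> e : results.entrySet()) {
            if (out.containsKey(e.getKey())) {
                throw new IllegalArgumentException("duplicate field name: " + e.getKey());
            }
            out.put(e.getKey(), e.getValue());
        }
    }

    public static void main(String[] args) {
        Map<String, String> incoming = new LinkedHashMap<>();
        incoming.put("id", "7");
        incoming.put("line", "hello world");
        List<String> arguments = Arrays.asList("line");
        Map<String, String> length = Collections.singletonMap("length", "11");
        Map<String, String> upper = Collections.singletonMap("line", "HELLO WORLD");

        System.out.println(select(Selector.ALL, incoming, arguments, length));
        System.out.println(select(Selector.RESULTS, incoming, arguments, length));
        System.out.println(select(Selector.SWAP, incoming, arguments, length));
        System.out.println(select(Selector.REPLACE, incoming, arguments, upper));
    }
}
```

With an incoming tuple of `id` and `line`, the argument `line`, and a result field `length`, the model yields the fields `id, line, length` for ALL, `length` for RESULTS, and `id, length` for SWAP. REPLACE keeps `id, line` and overwrites the value of `line`.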

Reference: http://docs.cascading.org/cascading/2.1/userguide/pdf/userguide.pdf
