`
sunwinner
  • 浏览: 204530 次
  • 性别: Icon_minigender_1
  • 来自: 上海
社区版块
存档分类
最新评论

Cascading Terminology and Concepts

 
阅读更多

Cascading is a data processing API and processing query planner used for defining, sharing, and executing data-processing workflows on a single computing node or distributed computing cluster. On a single node, Cascading's "local mode" can be used to efficiently test code and process local files before being deployed on a cluster. On a distributed computing cluster using Apache Hadoop platform, Cascading adds an abstraction layer over the Hadoop API, greatly simplifying Hadoop application development, job creation, and job scheduling. Java developers can leverage Cascading to develop robust Data Analytics and Data Management applications on Apache Hadoop. You can find a kick start example in this blog post.

 

Terminology of Cascading

The Cascading processing model is based on a metaphor of pipes (data streams) and filters (data operations). Thus the Cascading API allows the developer to assemble pipe assemblies that split, merge, group, or join streams of data while applying operations to each data record or groups of records.

 

In Cascading, we call a data record a tuple, a simple chain of pipes without forks or merges a branch, an interconnected set of pipe branches a pipe assembly, and a series of tuples passing through a pipe branch or assembly a tuple stream.

 

Pipe assemblies are specified independently of the data source they are to process. So before a pipe assembly can be executed, it must be bound to taps, i.e., data sources and sinks. The result of binding one or more pipe assemblies to taps is a flow, which is executed on a computer or cluster using the Hadoop framework.

 

 

Multiple flows can be grouped together and executed as a single process. In this context, if one flow depends on the output of another, it is not executed until all of its data dependencies are satisfied. Such a collection of flows is called a cascade.

 

Concepts of Cascading

  • Pipe Assemblies:  Pipe assemblies define what work should be done against tuple streams, which are read from tap sources and written to tap sinks. The work performed on the data stream may include actions such as filtering, transforming, organizing, and calculating. Pipe assemblies may use multiple sources and multiple sinks, and may define splits, merges, and joins to manipulate the tuple streams.
  • Pipes:  The base class cascading.pipe.Pipe and its subclasses are shown in the diagram below.

    The following table summarizes the different types of pipes.

    We will talk more about pipes in another blog post.
  • Connector: Cascading supports pluggable planners that allow it to execute on differing platforms. Planners are invoked by an associated FlowConnector subclass. Currently, only two planners are provided: LocalFlowConnector and HadoopFlowConnector.

    LocalFlowConnector provides a local mode planner for running Cascading completely in memory on the current computer while HadoopFlowConnector provides a planner for running Cascading on an Apache Hadoop cluster.
  • Tap:  All input data comes in from, and all output data goes out to, some instance of cascading.tap.Tap. A tap can be read from, which makes it a source, or written to, which makes it a sink. Or, more commonly, taps act as both sinks and sources when shared between flows. Below is the class diagram of Taps:

    We're not going to talk about all of taps here, please refer to Cascading javadoc for details of these taps classes.
  • Scheme: If the Tap is about where the data is and how to access it, the Scheme is about what the data is and how to read it. Every Tap must have a Scheme that describes the data. Cascading provides four Scheme classes: TextLine, TextDelimited, SequenceFile, WritableSequenceFile, below is the class diagram of Scheme:

     
  • Field Set:  Cascading applications can perform complex manipulation or "field algebra" on the fields stored in tuples, using Fields sets, a feature of the Fields class that provides a sort of wildcard tool for referencing sets of field values. These predefined Fields sets are constant values on the Fields class. They can be used in many places where the Fields class is expected. 
     /** Field UNKNOWN */
      public static final Fields UNKNOWN = new Fields( Kind.UNKNOWN );
      /** Field NONE represents a wildcard for no fields */
      public static final Fields NONE = new Fields( Kind.NONE );
      /** Field ALL represents a wildcard for all fields */
      public static final Fields ALL = new Fields( Kind.ALL );
      /** Field KEYS represents all fields used as they key for the last grouping */
      public static final Fields GROUP = new Fields( Kind.GROUP );
      /** Field VALUES represents all fields used as values for the last grouping */
      public static final Fields VALUES = new Fields( Kind.VALUES );
      /** Field ARGS represents all fields used as the arguments for the current operation */
      public static final Fields ARGS = new Fields( Kind.ARGS );
      /** Field RESULTS represents all fields returned by the current operation */
      public static final Fields RESULTS = new Fields( Kind.RESULTS );
      /** Field REPLACE represents all incoming fields, and allows their values to be replaced by the current operation results. */
      public static final Fields REPLACE = new Fields( Kind.REPLACE );
      /** Field SWAP represents all fields not used as arguments for the current operation and the operations results. */
      public static final Fields SWAP = new Fields( Kind.SWAP );
      /** Field FIRST represents the first field position, 0 */
      public static final Fields FIRST = new Fields( 0 );
      /** Field LAST represents the last field position, -1 */
      public static final Fields LAST = new Fields( -1 );
    The chart below shows common ways to merge input and result fields for the desired output fields. A few minutes with this chart may help clarify the discussion of fields, tuples, and pipes. Also see Each and Every Pipes for details on the different columns and their relationships to the Each and Every pipes and Functions, Aggregators, and Buffers.

     
  • Flow:  When pipe assemblies are bound to source and sink taps, a Flow is created. Flows are executable in the sense that, once they are created, they can be started and will execute on the specified platform. If the Hadoop platform is specified, the Flow will execute on a Hadoop cluster. A Flow is essentially a data processing pipeline that reads data from sources, processes the data as defined by the pipe assembly, and writes data to the sinks. 

     
  • Cascade:  A Cascade allows multiple Flow instances to be executed as a single logical unit. If there are dependencies between the Flows, they are executed in the correct order. Further, Cascades act like Ant builds or Unix make files - that is, a Cascade only executes Flows that have stale sinks (i.e., output data that is older than the input data). 
    CascadeConnector connector = new CascadeConnector();
    Cascade cascade = connector.connect( flowFirst, flowSecond, flowThird );
     

Reference: http://docs.cascading.org/cascading/2.1/userguide/pdf/userguide.pdf

  • 大小: 77.7 KB
  • 大小: 94.9 KB
  • 大小: 118.4 KB
  • 大小: 67.8 KB
  • 大小: 41.3 KB
  • 大小: 171.4 KB
  • 大小: 29.6 KB
分享到:
评论

相关推荐

    Service Support-英文原版

    1.8 Customers and Users....................................................................................................- 10 - 1.9 A Code of Practice for IT Service Management – PD0005...............

    一个基于Qt Creator(qt,C++)实现中国象棋人机对战

    qt 一个基于Qt Creator(qt,C++)实现中国象棋人机对战.

    热带雨林自驾游自然奇观探索.doc

    热带雨林自驾游自然奇观探索

    冰川湖自驾游冰雪交融景象.doc

    冰川湖自驾游冰雪交融景象

    C51 单片机数码管使用 Keil项目C语言源码

    C51 单片机数码管使用 Keil项目C语言源码

    基于智能算法的无人机路径规划研究 附Matlab代码.rar

    1.版本:matlab2014/2019a/2024a 2.附赠案例数据可直接运行matlab程序。 3.代码特点:参数化编程、参数可方便更改、代码编程思路清晰、注释明细。 4.适用对象:计算机,电子信息工程、数学等专业的大学生课程设计、期末大作业和毕业设计。

    前端分析-2023071100789s12

    前端分析-2023071100789s12

    Delphi 12.3控件之Laz-制作了一些窗体和对话框样式.7z

    Laz_制作了一些窗体和对话框样式.7z

    ocaml-docs-4.05.0-6.el7.x64-86.rpm.tar.gz

    1、文件内容:ocaml-docs-4.05.0-6.el7.rpm以及相关依赖 2、文件形式:tar.gz压缩包 3、安装指令: #Step1、解压 tar -zxvf /mnt/data/output/ocaml-docs-4.05.0-6.el7.tar.gz #Step2、进入解压后的目录,执行安装 sudo rpm -ivh *.rpm 4、更多资源/技术支持:公众号禅静编程坊

    学习笔记-沁恒第六讲-米醋

    学习笔记-沁恒第六讲-米醋

    工业机器人技术讲解【36页】.pptx

    工业机器人技术讲解【36页】

    基于CentOS 7和Docker环境下安装和配置Elasticsearch数据库

    内容概要:本文档详细介绍了在 CentOS 7 上利用 Docker 容器化环境来部署和配置 Elasticsearch 数据库的过程。首先概述了 Elasticsearch 的特点及其主要应用场景如全文检索、日志和数据分析等,并强调了其分布式架构带来的高性能与可扩展性。之后针对具体的安装流程进行了讲解,涉及创建所需的工作目录,准备docker-compose.yml文件以及通过docker-compose工具自动化完成镜像下载和服务启动的一系列命令;同时对可能出现的问题提供了应对策略并附带解决了分词功能出现的问题。 适合人群:从事IT运维工作的技术人员或对NoSQL数据库感兴趣的开发者。 使用场景及目标:该教程旨在帮助读者掌握如何在一个Linux系统中使用现代化的应用交付方式搭建企业级搜索引擎解决方案,特别适用于希望深入了解Elastic Stack生态体系的个人研究与团队项目实践中。 阅读建议:建议按照文中给出的具体步骤进行实验验证,尤其是要注意调整相关参数配置适配自身环境。对于初次接触此话题的朋友来说,应该提前熟悉一下Linux操作系统的基础命令行知识和Docker的相关基础知识

    基于CNN和FNN的进化神经元模型的快速响应尖峰神经网络 附Matlab代码.rar

    1.版本:matlab2014/2019a/2024a 2.附赠案例数据可直接运行matlab程序。 3.代码特点:参数化编程、参数可方便更改、代码编程思路清晰、注释明细。 4.适用对象:计算机,电子信息工程、数学等专业的大学生课程设计、期末大作业和毕业设计。

    网络小说的类型创新、情节设计与角色塑造.doc

    网络小说的类型创新、情节设计与角色塑造

    毕业设计-基于springboot+vue开发的学生考勤管理系统【源码+sql+可运行】50311.zip

    毕业设计_基于springboot+vue开发的学生考勤管理系统【源码+sql+可运行】【50311】.zip 全部代码均可运行,亲测可用,尽我所能,为你服务; 1.代码压缩包内容 代码:springboo后端代码+vue前端页面代码 脚本:数据库SQL脚本 效果图:运行结果请看资源详情效果图 2.环境准备: - JDK1.8+ - maven3.6+ - nodejs14+ - mysql5.6+ - redis 3.技术栈 - 后台:springboot+mybatisPlus+Shiro - 前台:vue+iview+Vuex+Axios - 开发工具: idea、navicate 4.功能列表 - 系统设置:用户管理、角色管理、资源管理、系统日志 - 业务管理:班级信息、学生信息、课程信息、考勤记录、假期信息、公告信息 3.运行步骤: 步骤一:修改数据库连接信息(ip、port修改) 步骤二:找到启动类xxxApplication启动 4.若不会,可私信博主!!!

    57页-智慧办公园区智能化设计方案.pdf

    在智慧城市建设的大潮中,智慧园区作为其中的璀璨明珠,正以其独特的魅力引领着产业园区的新一轮变革。想象一下,一个集绿色、高端、智能、创新于一体的未来园区,它不仅融合了科技研发、商业居住、办公文创等多种功能,更通过深度应用信息技术,实现了从传统到智慧的华丽转身。 智慧园区通过“四化”建设——即园区运营精细化、园区体验智能化、园区服务专业化和园区设施信息化,彻底颠覆了传统园区的管理模式。在这里,基础设施的数据收集与分析让管理变得更加主动和高效,从温湿度监控到烟雾报警,从消防水箱液位监测到消防栓防盗水装置,每一处细节都彰显着智能的力量。而远程抄表、空调和变配电的智能化管控,更是在节能降耗的同时,极大地提升了园区的运维效率。更令人兴奋的是,通过智慧监控、人流统计和自动访客系统等高科技手段,园区的安全防范能力得到了质的飞跃,让每一位入驻企业和个人都能享受到“拎包入住”般的便捷与安心。 更令人瞩目的是,智慧园区还构建了集信息服务、企业服务、物业服务于一体的综合服务体系。无论是通过园区门户进行信息查询、投诉反馈,还是享受便捷的电商服务、法律咨询和融资支持,亦或是利用云ERP和云OA系统提升企业的管理水平和运营效率,智慧园区都以其全面、专业、高效的服务,为企业的发展插上了腾飞的翅膀。而这一切的背后,是大数据、云计算、人工智能等前沿技术的深度融合与应用,它们如同智慧的大脑,让园区的管理和服务变得更加聪明、更加贴心。走进智慧园区,就像踏入了一个充满无限可能的未来世界,这里不仅有科技的魅力,更有生活的温度,让人不禁对未来充满了无限的憧憬与期待。

    一种欠定盲源分离方法及其在模态识别中的应用 附Matlab代码.rar

    1.版本:matlab2014/2019a/2024a 2.附赠案例数据可直接运行matlab程序。 3.代码特点:参数化编程、参数可方便更改、代码编程思路清晰、注释明细。 4.适用对象:计算机,电子信息工程、数学等专业的大学生课程设计、期末大作业和毕业设计。

    Matlab实现基于BO贝叶斯优化Transformer结合GRU门控循环单元时间序列预测的详细项目实例(含完整的程序,GUI设计和代码详解)

    内容概要:本文介绍了使用 Matlab 实现基于 BO(贝叶斯优化)的 Transformer 结合 GRU 门控循环单元时间序列预测的具体项目案例。文章首先介绍了时间序列预测的重要性及其现有方法存在的限制,随后深入阐述了该项目的目标、挑战与特色。重点描述了项目中采用的技术手段——结合 Transformer 和 GRU 模型的优点,通过贝叶斯优化进行超参数调整。文中给出了模型的具体实现步骤、代码示例以及完整的项目流程。同时强调了数据预处理、特征提取、窗口化分割、超参数搜索等关键技术点,并讨论了系统的设计部署细节、可视化界面制作等内容。 适合人群:具有一定机器学习基础,尤其是熟悉时间序列预测与深度学习的科研工作者或从业者。 使用场景及目标:适用于金融、医疗、能源等多个行业的高精度时间序列预测。该模型可通过捕捉长时间跨度下的复杂模式,提供更为精准的趋势预判,辅助相关机构作出合理的前瞻规划。 其他说明:此项目还涵盖了从数据采集到模型发布的全流程讲解,以及GUI图形用户界面的设计实现,有助于用户友好性提升和技术应用落地。此外,文档包含了详尽的操作指南和丰富的附录资料,包括完整的程序清单、性能评价指标等,便于读者动手实践。

    漫画与青少年教育关系.doc

    漫画与青少年教育关系

    励志图书的成功案例分享、人生智慧提炼与自我提升策略.doc

    励志图书的成功案例分享、人生智慧提炼与自我提升策略

Global site tag (gtag.js) - Google Analytics