`

Architecture of Flume NG

 
阅读更多

https://blogs.apache.org/flume/entry/flume_ng_architecture

Apache Flume - Architecture of Flume NG

 

 

 

Apache Flume is a distributed, reliable, and available system for efficiently collecting, aggregating and moving large amounts of log data from many different sources to a centralized data store. Flume is currently undergoing incubation at The Apache Software Foundation. More information on this project can be found at http://incubator.apache.org/flume. Flume NG is work related to new major revision of Flume and is the subject of this post.

Prior to entering the incubator, Flume saw incremental releases leading up to  version 0.9.4. As Flume became adopted it became clear that certain design choices would need to be reworked in order to address problems reported in the field. The work necessary to make this change began a few months ago under the JIRA issue FLUME-728. This work currently resides on a separate branch by the name flume-728, and is informally referred to as Flume NG. At the time of writing this post Flume NG had gone through two internal milestones - NG Alpha 1, and NG Alpha 2 and a formal incubator release of Flume NG is in the works.

At a high-level, Flume NG uses a single-hop message delivery guarantee semantics to provide end-to-end reliability for the system. To accomplish this, certain new concepts have been incorporated into its design, while certain other existing concepts have been either redefined, reused or dropped completely.

In this blog post, I will describe the fundamental concepts incorporated in Flume NG and talk about it’s high-level architecture. This is a first in a series of blog posts by Flume team that will go into further details of it’s design and implementation.

 

 

Core Concepts
The purpose of Flume is to provide a distributed, reliable, and available system for efficiently collecting, aggregating and moving large amounts of log data from many different sources to a centralized data store. The architecture of Flume NG is based on a few concepts that together help achieve this objective. Some of these concepts have existed in the past implementation, but have changed drastically. Here is a summary of concepts that Flume NG introduces, redefines, or reuses from earlier implementation:
  • Event: A byte payload with optional string headers that represent the unit of data that Flume can transport from it’s point of origination to it’s final destination.
  • Flow: Movement of events from the point of origin to their final destination is considered a data flow, or simply flow. This is not a rigorous definition and is used only at a high level for description purposes.
  • Client: An interface implementation that operates at the point of origin of events and delivers them to a Flume agent. Clients typically operate in the process space of the application they are consuming data from. For example, Flume Log4j Appender is a client.
  • Agent: An independent process that hosts flume components such as sources, channels and sinks, and thus has the ability to receive, store and forward events to their next-hop destination.
  • Source: An interface implementation that can consume events delivered to it via a specific mechanism. For example, an Avro source is a source implementation that can be used to receive Avro events from clients or other agents in the flow. When a source receives an event, it hands it over to one or more channels.
  • Channel: A transient store for events, where events are delivered to the channel via sources operating within the agent. An event put in a channel stays in that channel until a sink removes it for further transport. An example of channel is the JDBC channel that uses a file-system backed embedded database to persist the events until they are removed by a sink. Channels play an important role in ensuring durability of the flows.
  • Sink: An interface implementation that can remove events from a channel and transmit them to the next agent in the flow, or to the event’s final destination. Sinks that transmit the event to it’s final destination are also known as terminal sinks. The Flume HDFS sink is an example of a terminal sink. Whereas the Flume Avro sink is an example of a regular sink that can transmit messages to other agents that are running an Avro source.

 

These concepts help in simplifying the architecture, implementation, configuration and deployment of Flume.

 

 

Flow Pipeline
A flow in Flume NG starts from the client. The client transmits the event to it’s next hop destination. This destination is an agent. More precisely, the destination is a source operating within the agent. The source receiving this event will then deliver it to one or more channels. The channels that receive the event are drained by one or more sinks operating within the same agent. If the sink is a regular sink, it will forward the event to it’s next-hop destination which will be another agent. If instead it is a terminal sink, it will forward the event to it’s final destination. Channels allow the decoupling of sources from sinks using the familiar producer-consumer model of data exchange. This allows sources and sinks to have different performance and runtime characteristics and yet be able to effectively use the physical resources available to the system.

 

Figure 1 below shows how the various components interact with each other within a flow pipeline.

 

Schematic showing logical components in a flow. The arrows represent the direction in which events travel across the system. This also illustrates how flows can fan-out by having one source write the event out to multiple channels.

 

 

 

Figure 1: Schematic showing logical components in a flow. The arrows represent the direction in which events travel across the system. This also illustrates how flows can fan-out by having one source write the event out to multiple channels.

 

 

By configuring a source to deliver the event to more than one channel, flows can fan-out to more than one destination. This is illustrated in Figure 1 where the source within the operating Agent writes the event out to two channels - Channel 1 and Channel 2.

 

Conversely, flows can be converged by having multiple sources operating within the same agent write to the same channel. A example of the physical layout of a converging flow is show in Figure 2 below.

 

 A simple converging flow on Flume NG.

Figure 2: A simple converging flow on Flume NG.

 

 Reliability and Failure Handling


Flume NG uses channel-based transactions to guarantee reliable message delivery. When a message moves from one agent to another, two transactions are started, one on the agent that delivers the event and the other on the agent that receives the event. In order for the sending agent to commit it’s transaction, it must receive success indication from the receiving agent. The receiving agent only returns a success indication if it’s own transaction commits properly first. This ensures guaranteed delivery semantics between the hops that the flow makes. Figure 3 below shows a sequence diagram that illustrates the relative scope and duration of the transactions operating within the two interacting agents.

Transactional exchange of events between agents.

Figure 3: Transactional exchange of events between agents.

 

This mechanism also forms the basis for failure handling in Flume NG. When a flow that passes through many different agents encounters a communication failure on any leg of the flow, the affected events start getting buffered at the last unaffected agent in the flow. If the failure is not resolved on time, this may lead to the failure of the last unaffected agent, which then would force the agent before it to start buffering the events. Eventually if the failure occurs when the client transmits the event to its first-hop destination, the failure will be reported back to the client which can then allow the application generating the events to take appropriate action.

On the other hand, if the failure is resolved before the first-hop agent fails, the buffered events in various agents downstream will start draining towards their destination. Eventually the flow will be restored to its original characteristic throughput levels. Figure 4 below illustrates a scenario where a flow comprising of two intermediary agents between the client and the central store go through a transient failure. The failure occurs between agent 2 and the central store, resulting in the events getting buffered at the agent 2 itself. Once the failing link has been restored to normal, the buffered events drain out to the central store and the flow is restored to its original throughput characteristics.

Failure handling in flows. In (a) the flow is normal and events can travel from the client to the central store. In (b) a communication failure occurs between Agent 2 and the event store resulting in events being buffered on Agent 2. In (c) the cause of failure was addressed and the flow was restored and any events buffered in Agent 2 were drained to the store.

Figure 4: Failure handling in flows. In (a) the flow is normal and events can travel from the client to the central store. In (b) a communication failure occurs between Agent 2 and the event store resulting in events being buffered on Agent 2. In (c) the cause of failure was addressed and the flow was restored and any events buffered in Agent 2 were drained to the store.

 

 
Wrapping up

In this post I described the various concepts that are a part of Flume NG and its high-level architecture. This is first of a series of posts from the Flume team that will highlight the design and implementation of this system. In the meantime, if you need anymore information, please feel free to drop an email on the project’s user or developer lists, or alternatively file the appropriate JIRA issues. Your contribution in any form is welcome on the project.

 

 

 
 
 
分享到:
评论

相关推荐

    flume-ng安装

    Flume-NG 安装与配置指南 Flume-NG 是一个分布式日志收集系统,能够从各种数据源中实时采集数据,并将其传输到集中式存储系统中。本文将指导您完成 Flume-NG 的安装和基本配置。 安装 Flume-NG 1. 先决条件:...

    Flume ng share

    ### Flume NG 分享资料详解 #### Flume NG 概述 Flume NG 是一个分布式、可靠且可用的服务,用于高效地收集、聚合并移动大量的日志数据。它具有简单而灵活的架构,基于流式数据流。Flume NG 非常健壮且能够容忍...

    Flume-ng资料合集

    Flume NG是Cloudera提供的一个分布式、可靠、可用的系统,它能够将不同数据源的海量日志数据进行高效收集、聚合、移动,最后存储到一个中心化数据存储系统中。由原来的Flume OG到现在的Flume NG,进行了架构重构,...

    FLUME-FlumeNG-210517-1655-5858

    FlumeNG是Apache Flume的一个分支版本,旨在通过重写和重构来解决现有版本中的一些已知问题和限制。Flume是Cloudera开发的一个分布式、可靠且可用的系统,用于有效地收集、聚合和移动大量日志数据。它的主要用途是将...

    Flume-ng在windows环境搭建并测试+log4j日志通过Flume输出到HDFS.docx

    Flume-ng 在 Windows 环境搭建并测试 + Log4j 日志通过 Flume 输出到 HDFS Flume-ng 是一个高可用、可靠、分布式的日志聚合系统,可以实时地从各种数据源(如日志文件、网络 socket、数据库等)中收集数据,并将其...

    mvn flume ng sdk

    `Mvn Flume NG SDK` 是一个用于Apache Flume集成开发的重要工具,它基于Maven构建系统,使得在Java环境中开发、管理和部署Flume插件变得更加便捷。Apache Flume是一款高度可配置的数据收集系统,广泛应用于日志聚合...

    flume-ng-sql-source-1.5.2

    Flume-ng-sql-source-1.5.2是Apache Flume的一个扩展,它允许Flume从SQL数据库中收集数据。Apache Flume是一个分布式、可靠且可用于有效聚合、移动大量日志数据的系统。"ng"代表"next generation",表明这是Flume的...

    flume-ng-1.6.0-cdh5.5.2-src.tar.gz

    《Flume NG 1.6.0 在 CDH 5.5.2 中的应用与解析》 Flume NG,全称为“Next Generation Flume”,是Apache Hadoop项目中用于高效、可靠、分布式地收集、聚合和移动大量日志数据的工具。在CDH(Cloudera Distribution...

    flume-ng-sql-source-1.5.2.jar

    flume-ng-sql-source-1.5.2.jar从数据库中增量读取数据到hdfs中的jar包

    Flume-ng-1.6.0-cdh.zip

    Flume-ng-1.6.0-cdh.zip 内压缩了 3 个项目,分别为:flume-ng-1.6.0-cdh5.5.0.tar.gz、flume-ng-1.6.0-cdh5.7.0.tar.gz 和 flume-ng-1.6.0-cdh5.10.1.tar.gz,选择你需要的版本。

    flume-ng-sql-source-release-1.5.2.zip

    Flume-ng-sql-source是Apache Flume的一个扩展插件,主要功能是允许用户从各种数据库中抽取数据并将其传输到其他目的地,如Apache Kafka。在本案例中,我们讨论的是版本1.5.2的发布包,即"flume-ng-sql-source-...

    flume-ng-1.5.0-cdh5.3.6.rar

    flume-ng-1.5.0-cdh5.3.6.rarflume-ng-1.5.0-cdh5.3.6.rar flume-ng-1.5.0-cdh5.3.6.rar flume-ng-1.5.0-cdh5.3.6.rar flume-ng-1.5.0-cdh5.3.6.rar flume-ng-1.5.0-cdh5.3.6.rar flume-ng-1.5.0-cdh5.3.6.rar flume...

    flume-ng-sql-source-1.5.3.jar

    flume-ng-sql-source-1.5.3.jar,flume采集mysql数据jar包,将此文件拖入FLUME_HOME/lib目录下,如果是CM下CDH版本的flume,则放到/opt/cloudera/parcels/CDH-xxxx/lib/flume-ng/lib下,同样需要的包还有mysql-...

    flume-ng-1.6.0-cdh5.5.0.tar.gz

    "flume-ng-1.6.0-cdh5.5.0.tar.gz" 是 Apache Flume 的一个特定版本,具体来说是 "Next Generation" (ng) 版本的 1.6.0,与 Cloudera Data Hub (CDH) 5.5.0 发行版兼容。CDH 是一个包含多个开源大数据组件的商业发行...

    flume-ng-sql-source-1.5.1

    flume-ng-sql-source-1.5.1 flume连接数据库 很好用的工具

    flumeng-kafka-plugin:flumeng-kafka-plugin

    《Flume与Kafka集成:深入理解flumeng-kafka-plugin》 在大数据处理领域,Apache Flume 和 Apache Kafka 都扮演着至关重要的角色。Flume 是一款用于收集、聚合和移动大量日志数据的工具,而 Kafka 则是一个分布式流...

    flume-ng-1.5.0-cdh5.3.6.tar.gz

    《Flume NG 1.5.0-cdh5.3.6:大数据日志收集利器》 Apache Flume,作为一款高效、可靠且分布式的海量日志聚合工具,是大数据处理领域的重要组件之一。在CDH(Cloudera Distribution Including Apache Hadoop)5.3.6...

    flume-ng-elasticsearch-sink-6.5.4.jar.zip

    《Flume NG与Elasticsearch 6.5.4集成详解》 Flume NG,全称为Apache Flume,是一款由Apache软件基金会开发的数据收集系统,主要用于日志聚合、监控和数据传输。它设计的目标是高效、可靠且易于扩展,特别适合...

    flume-ng-1.6.0-cdh5.12.0.tar.gz

    《Flume NG 1.6.0-cdh5.12.0在大数据生态中的角色与应用》 Apache Flume,作为一个分布式、可靠且可用的数据收集系统,是大数据处理链路中不可或缺的一部分。Flume NG(Next Generation)是其发展到第二代后的版本...

    Flume-ng搭建及sink配置

    Flume­ng简介 Apache Flume是从不同数据源收集、聚合、传输大量数据、日志到数据中心的分布式系统,具有可靠、可伸缩、可定制、高可用、高性能等明显优点。其主要特点有:声明式配置,可动态更新;提供上下文路由,...

Global site tag (gtag.js) - Google Analytics