Introduction to Apache Flink

gaojingsong

I. Introduction to Apache Flink

Apache Flink® is an open-source stream processing framework for distributed, high-performing, always-available, and accurate data streaming applications.

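Flink is an open-source processing engine for both batch and streaming data, and it has become a top-level project of the Apache Software Foundation (ASF). At its core is a streaming dataflow engine that provides data distribution and parallel computation. On top of that core, Flink offers API-level SQL queries, as well as libraries for graph processing and machine learning.

Flink is a distributed processing engine for streaming and batch data, implemented mainly in Java. Its development is driven chiefly by contributions from the open-source community. Flink's primary target is streaming data; batch data is treated as a special, bounded case of a stream. In other words, Flink handles every job as a stream, which is its most distinctive trait. Flink supports fast native iteration, including cyclic iterative jobs. Flink also manages memory in its own customized way. Compared with Spark on this point, Flink does not hand memory over entirely to the application layer, which is one reason Spark has been more prone to OOM (out of memory) errors than Flink. In terms of the framework itself and its use cases, Flink is more similar to Storm. Readers who already know Storm or Flume will probably find Flink's architecture and many of its concepts easier to grasp.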
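The idea that batch is just a bounded stream can be illustrated with a small sketch. This is plain Python, not Flink's API; the operator `running_word_count` and the source `endless_source` are hypothetical names made up for the illustration. The same operator is applied unchanged to a finite list (batch) and to an endless generator (stream).

```python
import itertools
from typing import Dict, Iterable, Iterator, Tuple


def running_word_count(events: Iterable[str]) -> Iterator[Tuple[str, int]]:
    """Streaming operator: emit an updated (word, count) after every event."""
    counts: Dict[str, int] = {}
    for word in events:
        counts[word] = counts.get(word, 0) + 1
        yield word, counts[word]


def endless_source() -> Iterator[str]:
    """Hypothetical unbounded source that repeats a few words forever."""
    return itertools.cycle(["a", "b", "a"])


# Batch: the input is bounded, so the stream ends and the last update
# per word is the final result.
batch_result = dict(running_word_count(["a", "b", "a", "c"]))

# Stream: the input never ends, so we can only look at a prefix of the updates.
stream_prefix = list(itertools.islice(running_word_count(endless_source()), 5))
```

The operator never needs to know whether its input will end; only the caller decides when a result counts as final.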

II. Features of Apache Flink

As a new stream processing system, Apache Flink has the following features:

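1. A low-latency stream processor

2. Rich APIs that help programmers build streaming applications quickly

3. Flexible operator state and streaming windows

4. Efficient fault tolerance for streams and state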
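To make "operator state and streaming windows" more concrete, here is a minimal sketch of a keyed tumbling-window count. It is plain Python rather than Flink code; the function name `tumbling_window_counts` and the sample events are assumptions made for illustration. The dictionary plays the role of the operator's state, keyed by window start and event key.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

Event = Tuple[int, str]  # (event timestamp in seconds, key)


def tumbling_window_counts(
    events: Iterable[Event], size: int
) -> Dict[Tuple[int, str], int]:
    """Count events per key in fixed, non-overlapping windows of `size` seconds."""
    state: Dict[Tuple[int, str], int] = defaultdict(int)
    for ts, key in events:
        window_start = ts - ts % size  # align the timestamp to its window
        state[(window_start, key)] += 1
    return dict(state)


events = [(1, "click"), (4, "click"), (7, "view"), (12, "click"), (13, "click")]
counts = tumbling_window_counts(events, size=10)
```

Here the windows are `[0, 10)` and `[10, 20)`; each event falls into exactly one window, which is what distinguishes tumbling windows from sliding ones.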

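III. The Apache Flink Ecosystem

For a computing framework to thrive in the long run, it must build a complete stack. Otherwise it remains an exercise on paper with little practical value. Only when concrete applications sit on top of the framework and make good use of its strengths will the framework attract more resources and advance faster. For this reason, Flink is also working to build its own stack.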

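IV. A Brief Look at Scheduling in Flink

In a Flink cluster, computing resources are defined as Task Slots. Each TaskManager has one or more slots, and the JobManager schedules tasks in units of slots. A task here, however, differs from what the term means in Hadoop. Flink's JobManager schedules a whole pipeline as one task rather than a single stage. For example, in Hadoop, Map and Reduce are two independently scheduled tasks, and each occupies computing resources. In Flink, a MapReduce job is one pipelined task that occupies a single computing resource. Likewise, an MRR (Map-Reduce-Reduce) pipeline is scheduled in Flink as a single pipelined task. A TaskManager can therefore run several pipelines at the same time, up to the number of slots it owns.

In Flink's Standalone deployment mode this is easy to understand, because Flink itself needs some simple management of computing resources (slots). When Flink is deployed on YARN, however, it does not scale back its own resource management, which means that Flink ends up doing part of YARN's job. From a design standpoint, I consider this questionable. If YARN containers cannot fully isolate CPU resources, configuring multiple slots for a Flink TaskManager is likely to cause unfair resource usage. If Flink wants to share a data center's computing resources smoothly with other frameworks, it should interfere as little as possible with how those resources are allocated and defined.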
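The slot-based scheduling described above can be sketched as a toy model. This is not Flink's scheduler; the classes below merely reuse the names `TaskManager` and `JobManager` for readability, and the first-free-slot placement policy is an assumption. The point it shows is that a whole pipeline, however many stages it has, takes exactly one slot.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class TaskManager:
    name: str
    num_slots: int
    slots: List[Optional[str]] = field(init=False)

    def __post_init__(self) -> None:
        # None marks a free slot; otherwise the slot holds a pipeline label.
        self.slots = [None] * self.num_slots


class JobManager:
    def __init__(self, task_managers: List[TaskManager]) -> None:
        self.task_managers = task_managers

    def schedule(self, pipeline: List[str]) -> Optional[Tuple[str, int]]:
        """Place a whole pipeline (e.g. map -> reduce) into one free slot.

        Returns (task manager name, slot index), or None if no slot is free.
        """
        label = " -> ".join(pipeline)
        for tm in self.task_managers:
            for i, occupant in enumerate(tm.slots):
                if occupant is None:
                    tm.slots[i] = label
                    return tm.name, i
        return None


jm = JobManager([TaskManager("tm-1", 2), TaskManager("tm-2", 1)])
p1 = jm.schedule(["map", "reduce"])            # MapReduce: one slot
p2 = jm.schedule(["map", "reduce", "reduce"])  # MRR: still one slot
p3 = jm.schedule(["map"])
p4 = jm.schedule(["map"])                      # cluster is full
```

With three slots in total, three pipelines are placed and the fourth is rejected, regardless of how many stages each pipeline contains.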

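V. Deploying Flink

Flink has three deployment modes: Local, Standalone Cluster, and YARN Cluster. In Local mode, the JobManager and TaskManager share a single JVM to run the workload, which makes it the most convenient way to verify a simple application. In production, Standalone or YARN Cluster is used in most cases.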

Flink is well-suited for:

A variety of (sometimes unreliable) data sources: When data is generated by millions of different users or devices, it’s safe to assume that some events will arrive out of the order in which they actually occurred, and in the case of more significant upstream failures, some events might come hours later than they’re supposed to. Late data needs to be handled so that results are accurate.
Applications with state: When applications become more complex than simple filtering or enhancing of single data records, managing state within these applications (e.g., counters, windows of past data, state machines, embedded databases) becomes hard. Flink provides tools so that state is efficient, fault tolerant, and manageable from the outside so you don’t have to build these capabilities yourself.
Data that is processed quickly: There is a focus in these use cases on real-time or near-real-time scenarios, where insights from data should be available at nearly the same moment that the data is generated. Flink is fully capable of meeting these latency requirements when necessary.
Data in large volumes: These programs would need to be distributed across many nodes (in some cases, thousands) to support the required scale. Flink can run on large clusters just as seamlessly as it runs on small ones.
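The first point, out-of-order and late data, is usually handled with event-time windows and watermarks. The following is a simplified model in plain Python, not Flink's implementation or API; the function `windowed_counts` and the parameter `max_delay` are hypothetical. It assumes a watermark that trails the largest timestamp seen so far by `max_delay`, fires a window once the watermark passes its end, and sets aside any event whose window has already fired.

```python
from typing import Dict, Iterable, List, Tuple


def windowed_counts(
    events: Iterable[int], size: int, max_delay: int
) -> Tuple[List[Tuple[int, int]], List[int]]:
    """Count event timestamps (given in arrival order) per tumbling window.

    Returns (fired, late): fired is a list of (window_start, count) in
    firing order; late holds timestamps that arrived after their window fired.
    """
    open_windows: Dict[int, int] = {}
    fired: List[Tuple[int, int]] = []
    late: List[int] = []
    watermark = float("-inf")
    for ts in events:
        start = ts - ts % size
        if start + size <= watermark:
            late.append(ts)  # the window was already closed: late data
        else:
            open_windows[start] = open_windows.get(start, 0) + 1
        # Assumed watermark policy: largest timestamp seen minus max_delay.
        watermark = max(watermark, ts - max_delay)
        for s in sorted(w for w in open_windows if w + size <= watermark):
            fired.append((s, open_windows.pop(s)))
    # End of input: flush the windows that are still open.
    for s in sorted(open_windows):
        fired.append((s, open_windows.pop(s)))
    return fired, late


# Timestamp 8 arrives out of order but within the allowed delay;
# timestamp 9 arrives after window [0, 10) has fired and is reported as late.
fired, late = windowed_counts([1, 3, 12, 8, 14, 25, 9], size=10, max_delay=5)
```

What to do with the late events (drop them, update the earlier result, or route them elsewhere) is an application decision; the sketch only separates them out.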