gaojingsong

【Introduction to Apache Samza】


I. What is messaging?

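Apache Samza is a stream processor that LinkedIn recently open-sourced.

Samza is an open-source distributed stream processing system, very similar to Storm. The difference is that it runs on top of Hadoop and uses Kafka, the distributed messaging system that LinkedIn also developed.

It is a small but elegant project from LinkedIn. What makes it elegant?

1. It has only a few thousand lines of code, yet its functionality is comparable to Storm's, although it still has many shortcomings for now.

2. It is tightly integrated with Kafka, which makes processing data more convenient.

3. It runs on YARN.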

Messaging systems are a popular way of implementing near-realtime asynchronous computation. Messages can be added to a message queue (ActiveMQ, RabbitMQ), pub-sub system (Kestrel, Kafka), or log aggregation system (Flume, Scribe) when something happens. Downstream consumers read messages from these systems, and process them or take actions based on the message contents.

 

Suppose you have a website, and every time someone loads a page, you send a “user viewed page” event to a messaging system. You might then have consumers which do any of the following:

 

1) Store the message in Hadoop for future analysis

2) Count page views and update a dashboard

3) Trigger an alert if a page view fails

4) Send an email notification to another user

5) Join the page view event with the user’s profile, and send the message back to the messaging system

A messaging system lets you decouple all of this work from the actual web page serving.
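
To make the decoupling concrete, here is a minimal in-memory sketch. It is not a real messaging client: `Topic`, `serve_page`, and the consumer names `archiver` and `dashboard` are hypothetical names chosen for illustration, and a real system such as Kafka would persist the log and track offsets durably.

```python
from collections import defaultdict


class Topic:
    """A tiny in-memory stand-in for a pub-sub topic: an append-only log."""

    def __init__(self):
        self._log = []
        self._offsets = defaultdict(int)  # consumer name -> next offset to read

    def publish(self, message):
        """Append a message; the producer never waits for any consumer."""
        self._log.append(message)

    def poll(self, consumer):
        """Return the messages this consumer has not read yet and advance its offset."""
        start = self._offsets[consumer]
        self._offsets[consumer] = len(self._log)
        return self._log[start:]


def serve_page(topic, user, url):
    """The web tier only emits an event; all follow-up work happens downstream."""
    topic.publish({"type": "user viewed page", "user": user, "url": url})


page_views = Topic()
serve_page(page_views, "alice", "/home")
serve_page(page_views, "bob", "/home")
serve_page(page_views, "alice", "/pricing")

# Two independent consumers read the same events at their own pace.
archive = list(page_views.poll("archiver"))
counts = defaultdict(int)
for event in page_views.poll("dashboard"):
    counts[event["url"]] += 1

print(len(archive))  # 3
print(dict(counts))  # {'/home': 2, '/pricing': 1}
```

Because each consumer keeps its own read position, adding a new consumer (for example an alerting job) would not require any change to `serve_page`.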

 

II. What is stream processing?

A messaging system is a fairly low-level piece of infrastructure—it stores messages and waits for consumers to consume them. When you start writing code that produces or consumes messages, you quickly find that there are a lot of tricky problems that have to be solved in the processing layer. Samza aims to help with these problems.

 

Consider the counting example above (count page views and update a dashboard). What happens when the machine that your consumer is running on fails, and your current counter values are lost? How do you recover? Where should the processor be run when it restarts? What if the underlying messaging system sends you the same message twice, or loses a message? (Unless you are careful, your counts will be incorrect.) What if you want to count page views grouped by the page URL? How do you distribute the computation across multiple machines if it’s too much for a single machine to handle?

 

Stream processing is a higher level of abstraction on top of messaging systems, and it’s meant to address precisely this category of problems.
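
Two of the questions above, duplicate delivery and spreading a grouped count over several machines, can be sketched in a few lines. This is only an illustration under simple assumptions: `partition_for` and `PageViewCounter` are hypothetical names, each partition is a plain list whose index serves as the message offset, and the "machines" are just objects in one process.

```python
import zlib
from collections import Counter


def partition_for(key, num_partitions):
    """Route a key to a partition with a stable hash (same URL -> same partition)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


class PageViewCounter:
    """Counts page views per URL for one partition, ignoring redelivered messages."""

    def __init__(self):
        self.counts = Counter()
        self.last_offset = -1  # highest offset already applied

    def process(self, offset, url):
        if offset <= self.last_offset:
            return  # duplicate delivery: this offset was already counted
        self.counts[url] += 1
        self.last_offset = offset


NUM_PARTITIONS = 2
logs = [[] for _ in range(NUM_PARTITIONS)]
for url in ["/home", "/pricing", "/home", "/docs"]:
    logs[partition_for(url, NUM_PARTITIONS)].append(url)

counters = [PageViewCounter() for _ in range(NUM_PARTITIONS)]
for p, log in enumerate(logs):
    for offset, url in enumerate(log):
        counters[p].process(offset, url)
        counters[p].process(offset, url)  # simulate the same message arriving twice

total = sum((c.counts for c in counters), Counter())
print(sorted(total.items()))  # [('/docs', 1), ('/home', 2), ('/pricing', 1)]
```

Since every view of a given URL lands in the same partition, each counter can work independently and the per-URL totals are still correct. Recovering the counters after a crash is a separate problem that this sketch does not address.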

 

 

III. What is Samza?

Apache Samza is a distributed stream processing framework. It uses Apache Kafka for messaging, and Apache Hadoop YARN to provide fault tolerance, processor isolation, security, and resource management.

 

1) Simple API: Unlike most low-level messaging system APIs, Samza provides a very simple callback-based “process message” API comparable to MapReduce.

2) Managed state: Samza manages snapshotting and restoration of a stream processor’s state. When the processor is restarted, Samza restores its state to a consistent snapshot. Samza is built to handle large amounts of state (many gigabytes per partition).

3) Fault tolerance: Whenever a machine in the cluster fails, Samza works with YARN to transparently migrate your tasks to another machine.

4) Durability: Samza uses Kafka to guarantee that messages are processed in the order they were written to a partition, and that no messages are ever lost.

5) Scalability: Samza is partitioned and distributed at every level. Kafka provides ordered, partitioned, replayable, fault-tolerant streams. YARN provides a distributed environment for Samza containers to run in.

6) Pluggable: Though Samza works out of the box with Kafka and YARN, Samza provides a pluggable API that lets you run Samza with other messaging systems and execution environments.

7) Processor isolation: Samza works with Apache YARN, which supports Hadoop’s security model, and resource isolation through Linux CGroups.
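
The first two points (a callback-based API and managed state) can be illustrated with a toy model. The sketch below is not Samza's API, which is written in Java; `CountingTask` and `MiniRunner` are hypothetical names, the checkpoint lives in memory rather than in durable storage, and a crash is simulated by stopping early.

```python
import copy


class CountingTask:
    """Hypothetical task: user code is a single process() callback plus local state."""

    def __init__(self):
        self.state = {}  # url -> count; a real framework would manage this store

    def process(self, message, collector):
        url = message["url"]
        self.state[url] = self.state.get(url, 0) + 1
        collector.append({"url": url, "views": self.state[url]})


class MiniRunner:
    """Hypothetical runner: feeds one partition to a task and checkpoints progress."""

    def __init__(self, partition_log, checkpoint_every=2):
        self.log = partition_log
        self.checkpoint_every = checkpoint_every
        self.checkpoint = {"offset": 0, "state": {}}  # durable in a real system

    def run(self, task, collector, stop_before=None):
        """Restore from the last checkpoint, then process messages in log order."""
        task.state = copy.deepcopy(self.checkpoint["state"])
        offset = self.checkpoint["offset"]
        end = len(self.log) if stop_before is None else stop_before
        while offset < end:
            task.process(self.log[offset], collector)
            offset += 1
            if offset % self.checkpoint_every == 0:
                # Offset and state are saved together, so they stay consistent.
                self.checkpoint = {"offset": offset, "state": copy.deepcopy(task.state)}
        return task.state


log = [{"url": u} for u in ["/home", "/docs", "/home", "/home", "/docs"]]
runner = MiniRunner(log)

# First attempt "crashes" after 3 messages; the last checkpoint was taken at offset 2.
runner.run(CountingTask(), collector=[], stop_before=3)

# A fresh task "on another machine" resumes from the checkpoint and replays offset 2.
final_state = runner.run(CountingTask(), collector=[])
print(final_state)  # {'/home': 3, '/docs': 2}
```

The restored counts are correct because the state snapshot and the input offset are saved as one unit. Note that in this sketch the message at offset 2 is processed twice, so its output is emitted twice downstream; only the state is protected from double counting.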

 

 

IV. Architecture

Samza is made up of three layers:

1. A streaming layer.

2. An execution layer.

3. A processing layer.

Samza provides out-of-the-box support for all three layers:

1. Streaming layer --> Kafka

2. Execution layer --> YARN

3. Processing layer --> Samza API

These three pieces fit together to form Samza:

[Figure: Samza architecture, showing the Samza API on top of YARN and Kafka]
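
As a rough mental model of how the three layers divide the work, consider the toy sketch below. Everything in it is a stand-in: the `stream` dictionary plays the role of a partitioned stream, `assign_partitions` is a hypothetical function standing in for the execution layer's decision about where work runs, and `count_task` stands in for user logic written against the processing layer.

```python
# Streaming layer stand-in: a partitioned stream, ordered within each partition.
stream = {
    0: ["/home", "/docs"],
    1: ["/pricing"],
    2: ["/home", "/home"],
}


def assign_partitions(num_partitions, num_containers):
    """Execution layer stand-in: spread partitions across containers round-robin."""
    assignment = {c: [] for c in range(num_containers)}
    for p in range(num_partitions):
        assignment[p % num_containers].append(p)
    return assignment


def count_task(messages):
    """Processing layer stand-in: user logic applied to one partition, in order."""
    counts = {}
    for url in messages:
        counts[url] = counts.get(url, 0) + 1
    return counts


assignment = assign_partitions(num_partitions=len(stream), num_containers=2)
results = {
    container: {p: count_task(stream[p]) for p in partitions}
    for container, partitions in assignment.items()
}

print(assignment)  # {0: [0, 2], 1: [1]}
print(results[0])  # {0: {'/home': 1, '/docs': 1}, 2: {'/home': 2}}
```

The point of the separation is that each piece can change independently: the same `count_task` logic would work with a different number of containers or a different source of partitions.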



 
