Now that we understand a little about how producers and consumers work, let's discuss the semantic guarantees Kafka provides between producer and consumer. Clearly there are multiple possible message delivery guarantees that could be provided:
现在我们了解一些关于生产者和消费者是如何工作的,接下来我们来讨论kafka提供了生产者和消费者之间的担保语义。有多种可能的消息传递保证可以提供:
-
At most once—Messages may be lost but are never redelivered.
最多一次 --- 消息可能丢失,但绝不会重发。 -
At least once—Messages are never lost but may be redelivered.
至少一次 --- 消息绝不会丢失,但有可能重新发送。 -
Exactly once—this is what people actually want, each message is delivered once and only once.
正好一次 --- 这是人们真正想要的,每个消息传递一次且仅一次。
It's worth noting that this breaks down into two problems: the durability guarantees for publishing a message and the guarantees when consuming a message.
可分解成两个问题:发送消息时的耐久性保障和消费消息的保障。
Many systems claim to provide "exactly once" delivery semantics, but it is important to read the fine print, most of these claims are misleading (i.e. they don't translate to the case where consumers or producers can fail, or cases where there are multiple consumer processes, or cases where data written to disk can be lost).
很多消息系统声称提供“正好一次”的传递语义,但是在阅读相关文章时,更多是误导(例如,它们没有解释消费者或生产者可能失败的情况,有多个消费者进程的情况,或写入磁盘的数据可能丢失的情况)
Kafka's semantics are straight-forward. When publishing a message we have a notion of the message being "committed" to the log. Once a published message is committed it will not be lost as long as one broker that replicates the partition to which this message was written remains "alive". The definition of alive as well as a description of which types of failures we attempt to handle will be described in more detail in the next section. For now let's assume a perfect, lossless broker and try to understand the guarantees to the producer and consumer. If a producer attempts to publish a message and experiences a network error it cannot be sure if this error happened before or after the message was committed. This is similar to the semantics of inserting into a database table with an autogenerated key.
kafka的语义是很直接的,我们有一个概念,当发布一条消息时,该消息 “committed(承诺)” 到了日志,一旦发布的消息是”承诺“的,只要副本分区写入了此消息的一个broker仍然"活着”,它就不会丢失。“活着”的定义以及描述的类型,我们处理失败的情况将在下一节中详细描述。现在让我们假设一个完美的不会丢消息的broker,并去了解如何保障生产者和消费者的,如果一个生产者发布消息并且正好遇到网络错误,就不能确定已提交的消息是否是在这个错误发生之前或之后。这类似于用自动生成key插入到一个数据库表。
Prior to 0.11.0.0, if a producer failed to receive a response indicating that a message was committed, it had little choice but to resend the message. This provides at-least-once delivery semantics since the message may be written to the log again during resending if the original request had in fact succeeded. Since 0.11.0.0, the Kafka producer also supports an idempotent delivery option which guarantees that resending will not result in duplicate entries in the log. To achieve this, the broker assigns each producer an ID and deduplicates messages using a sequence number that is sent by the producer along with every message. Also beginning with 0.11.0.0, the producer supports the ability to send messages to multiple topic partitions using transaction-like semantics: i.e. either all messages are successfully written or none of them are. The main use case for this is exactly-once processing between Kafka topics (described below).
在0.11.0.0之前,如果一个生产者没有收到消息提交的响应,那么只能重新发送消息。 这提供了至少一次传递语义,因为如果原始请求实际上已成功,则在重新发送期间再次将消息写入到日志中。自0.11.0.0起,Kafka生产者支持幂等传递选项,保证重新发送不会导致日志中重复。 broker为每个生产者分配一个ID,并通过生产者发送的序列号为每个消息进行去重。从0.11.0.0开始,生产者支持使用类似事务的语义将消息发送到多个topic分区的能力:即所有消息都被成功写入,或者没有。这个主要用于Kafka topic之间“正好一次“处理(如下所述)。
Not all use cases require such strong guarantees. For uses which are latency sensitive we allow the producer to specify the durability level it desires. If the producer specifies that it wants to wait on the message being committed this can take on the order of 10 ms. However the producer can also specify that it wants to perform the send completely asynchronously or that it wants to wait only until the leader (but not necessarily the followers) have the message.
并不是所有的情况都需要这么强力的保障,对于延迟敏感的,我们允许生产者指定它想要的耐用性水平。如生产者可以指定它获取需等待10毫秒量级上的响应。生产者也可以指定异步发送,或只等待leader(不需要副本的响应)有响应,
Now let's describe the semantics from the point-of-view of the consumer. All replicas have the exact same log with the same offsets. The consumer controls its position in this log. If the consumer never crashed it could just store this position in memory, but if the consumer fails and we want this topic partition to be taken over by another process the new process will need to choose an appropriate position from which to start processing. Let's say the consumer reads some messages -- it has several options for processing the messages and updating its position.
现在让我们从消费者的角度描述语义。所有的副本都有相同的日志相同的偏移量。消费者控制offset在日志中的位置。如果消费者永不宕机它可能只是在内存中存储这个位置,但是如果消费者故障,我们希望这个topic分区被另一个进程接管,新进程需要选择一个合适的位置开始处理。我们假设消费者读取了一些消息,几种选项用于处理消息和更新它的位置。
- It can read the messages, then save its position in the log, and finally process the messages. In this case there is a possibility that the consumer process crashes after saving its position but before saving the output of its message processing. In this case the process that took over processing would start at the saved position even though a few messages prior to that position had not been processed. This corresponds to "at-most-once" semantics as in the case of a consumer failure messages may not be processed.
读取消息,然后在日志中保存它的位置,最后处理消息。在这种情况下,有可能消费者保存了位置之后,但是处理消息输出之前崩溃了。在这种情况下,接管处理的进程会在已保存的位置开始,即使该位置之前有几个消息尚未处理。这对应于“最多一次” ,在消费者处理失败消息的情况下,不进行处理。
- It can read the messages, process the messages, and finally save its position. In this case there is a possibility that the consumer process crashes after processing messages but before saving its position. In this case when the new process takes over the first few messages it receives will already have been processed. This corresponds to the "at-least-once" semantics in the case of consumer failure. In many cases messages have a primary key and so the updates are idempotent (receiving the same message twice just overwrites a record with another copy of itself).
读取消息,处理消息,最后保存消息的位置。在这种情况下,可能消费进程处理消息之后,但保存它的位置之前崩溃了。在这种情况下,当新的进程接管了它,这将接收已经被处理的前几个消息。这就符合了“至少一次”的语义。在多数情况下消息有一个主键,以便更新幂等(其任意多次执行所产生的影响均与一次执行的影响相同)。
- So what about exactly once semantics (i.e. the thing you actually want)? When consuming from a Kafka topic and producing to another topic (as in a Kafka Streams application), we can leverage the new transactional producer capabilities in 0.11.0.0 that were mentioned above. The consumer's position is stored as a message in a topic, so we can write the offset to Kafka in the same transaction as the output topics receiving the processed data. If the transaction is aborted, the consumer's position will revert to its old value and the produced data on the output topics will not be visible to other consumers, depending on their "isolation level." In the default "read_uncommitted" isolation level, all messages are visible to consumers even if they were part of an aborted transaction, but in "read_committed," the consumer will only return messages from transactions which were committed (and any messages which were not part of a transaction).
那么什么是“正好一次”语义(也就是你真正想要的东西)? 当从Kafka主题消费并生产到另一个topic时(例如Kafka Stream),我们可以利用之前提到0.11.0.0中的生产者新事物功能。消费者的位置作为消息存储到topic中,因此我们可以与接收处理后的数据的输出topic使用相同的事务写入offset到Kafka。如果事物中断,则消费者的位置将恢复到老的值,根据其”隔离级别“,其他消费者将不会看到输出topic的生成数据,在默认的”读取未提交“隔离级别中,所有消息对消费者都是可见的,即使是被中断的事务的消息。但是在”读取提交“中,消费者将只从已提交的事物中返回消息。
When writing to an external system, the limitation is in the need to coordinate the consumer's position with what is actually stored as output. The classic way of achieving this would be to introduce a two-phase commit between the storage of the consumer position and the storage of the consumers output. But this can be handled more simply and generally by letting the consumer store its offset in the same place as its output. This is better because many of the output systems a consumer might want to write to will not support a two-phase commit. As an example of this, consider a Kafka Connect connector which populates data in HDFS along with the offsets of the data it reads so that it is guaranteed that either data and offsets are both updated or neither is. We follow similar patterns for many other data systems which require these stronger semantics and for which the messages do not have a primary key to allow for deduplication.
当写入到外部系统时,需要将消费者的位置与实际存储为输出的位置进行协调。实现这一目标的典型方法是在消费者位置的存储和消费者输出的存储之间引入两阶段的”提交“。但是,这可以更简单,通过让消费者将其offset存储在与其输出相同的位置。这样最好,因为大多数的输出系统不支持两阶段”提交“。作为一个例子,考虑一个Kafka Connect连接器,它填充HDFS中的数据以及它读取的数据的offset,以保证数据和offset都被更新,或者都不更新。 对于需要这些更强大语义的许多其他数据系统,我们遵循类似的模式,并且消息不具有允许重复数据删除的主键。
So effectively Kafka guarantees at-least-once delivery by default and allows the user to implement at most once delivery by disabling retries on the producer and committing its offset prior to processing a batch of messages. Exactly-once delivery requires co-operation with the destination storage system but Kafka provides the offset which makes implementing this straight-forward.
kafka默认是保证“至少一次”传递,并允许用户通过禁止生产者重试和处理一批消息前提交它的偏移量来实现 “最多一次”传递。而“正好一次”传递需要与目标存储系统合作,但kafka提供了偏移量,所以实现这个很简单。
http://orchome.com/21
相关推荐
通过引入特定的jar包,如ApacheJMeter_kafka-0.2.2.jar,JMeter可以进一步支持对分布式消息传递系统Apache Kafka的测试。 Kafka是一种高吞吐量、低延迟的分布式发布订阅消息系统,广泛应用于大数据实时处理和流数据...
总的来说,“kafka-2.13-3.5.1.tgz”提供的Kafka版本提供了高效、可靠的消息传递能力,适用于各种大数据处理场景。无论是开发实时流处理应用,还是构建事件驱动的微服务架构,Kafka都是一个不可或缺的工具。通过正确...
总结,“Kafka-Grpc_kafka_assignment_”项目展示了如何结合Kafka的分布式消息传递能力和GRPC的高效通信机制,解决特定的作业问题。通过对Kafka主题、分区、生产者消费者模型的深入理解,以及巧妙利用GRPC的特性,...
- **消息队列**:Kafka作为一种消息队列系统,能够实现消息的异步传递,降低系统间的耦合度。 - **实时数据分析**:Kafka可以与流计算引擎(如Apache Flink、Apache Spark Streaming)结合,实现对实时数据流的分析...
用户可以通过解压并配置这些文件,启动Kafka服务,然后就可以开始使用Kafka进行消息传递和流处理了。对于Scala 2.13的支持意味着这个版本的Kafka可以更好地与现代的Scala生态系统兼容,利用最新的语言特性和库。
在本文中,我们将深入探讨如何在Apache Kafka中配置SASL/PLAIN认证机制,并通过具体的密码验证实现安全的通信...这为保障数据的安全传输提供了基础,同时,了解和实践这些配置将有助于我们更好地理解和管理Kafka系统。
### Apache Kafka 知识点概览 #### 一、引言 Apache Kafka 是一款开源的流处理平台,由 LinkedIn 开发并捐赠给 ...无论是日志聚合、消息传递还是流式处理,Kafka 都能发挥出色的作用,为企业级应用带来显著的价值。
首先,Kafka是一种分布式流处理平台,广泛应用于大数据实时处理、消息传递等领域。然而,随着Kafka集群规模的扩大,管理和监控任务变得日益复杂,这就需要一款强大的管理工具,Kafka-Manager应运而生。最新版本kafka...
它由Apache软件基金会开发,使用Scala和Java编写,支持高吞吐量的消息传递,常用于处理网站用户行为数据、日志聚合、流式数据处理等场景。 **发布与订阅** Kafka的核心功能是发布和订阅消息。生产者(Publishers)...
这本书首先介绍Kafka的分布式消息传递模型,让读者对消息系统有一个初步的认识。随后,书中逐步讲解了Kafka的核心概念,例如主题(Topic)与分区(Partition),以及生产者(Producer)和消费者(Consumer)的API...
《Kafka_2.11-1.1.0:深入了解分布式消息系统》 Kafka是Apache软件基金会开发的一个开源流处理平台,最初由LinkedIn设计并贡献,现已成为...无论是日志处理、实时分析还是消息传递,Kafka都能在其中发挥关键作用。
它被设计成能够处理海量数据,提供高吞吐量、低延迟的消息传递能力。Kafka主要用于构建实时数据管道和流应用,将数据从生产者高效地传输到消费者,同时支持数据的持久化和容错。 1. **Kafka应用场景**: Kafka广泛...
在现代大数据处理领域,Apache Kafka作为一个高效、可扩展的消息中间件,广泛应用于实时数据流处理和分布式系统间的数据传递。为了确保Kafka在生产环境中的稳定性和性能,测试工具扮演着至关重要的角色。JMeter,一...
Kafka 在数据处理领域,尤其是在日志管理和消息传递方面表现出色。 **特点**: - **可靠性**: Kafka 使用持久化存储,确保数据不会丢失。 - **高性能**: 即使面对非常大的数据量,Kafka 也能保持高速的数据吞吐率。 ...
Kafka是分布式流处理平台,它被广泛用于日志收集、实时数据处理以及消息传递等场景。Cmak作为Kafka的管理工具,提供了以下关键功能: 1. **集群视图**:Cmak提供了一个清晰的集群概览,显示了集群中的Brokers(节点...
Java Kafka的生产者/消费者示例展示了如何在Java应用程序中使用Kafka进行消息传递。Kafka与Zookeeper的协作确保了集群的稳定运行,而与MySQL的集成则提供了消息的持久化保障。理解并掌握这些基本概念和操作,对于...
它最初是为了处理LinkedIn内部的大规模实时数据流而设计的,如今已被广泛应用于日志收集、消息传递、流处理等多个领域。Kafka因其高吞吐量、低延迟和持久性存储等特点,在大数据领域占有举足轻重的地位。 #### 二、...
消息传递语义(Message Delivery Semantics)描述了Kafka中的消息交付保证级别,包括最多一次(at most once)、至少一次(at least once)和仅一次(exactly once)。 Kafka的设计理念(Design Motivation)部分...
Kafka,作为一款分布式流处理平台,广泛应用于大数据实时处理和消息传递场景。然而,随着Kafka集群规模的扩大,管理和监控的复杂性也随之增加。这时,Kafka Eagle的出现就显得尤为必要。它提供了丰富的可视化界面,...
最新版的Kafka_2.12-2.6.1针对Linux平台进行了优化,为用户提供了一种稳定且高性能的消息传递解决方案。 ### 1. Kafka核心概念 - **生产者(Producer)**: 生产者是数据的来源,负责将消息发送到Kafka的Topic...