1, 关于如何转移分区: 以及如何新增节点的问题, 我们在 Kafka中文文档 中已经有过叙述。详细参考
2, 分析命令的执行过程 : 分区调用的脚本是 kafka-reassign-partitions.sh, 具体内容是:
exec $(dirname $0)/kafka-run-class.sh kafka.admin.ReassignPartitionsCommand $@
3, 分析一下 ReassignPartitionsCommand 的代码, 主要有三个方法,分别对应生成分区
分配的方案, 执行分配方案, 以及验证分配方案执行情况
4, 生成分区分配方案代码如下:
def generateAssignment(zkClient: ZkClient, opts: ReassignPartitionsCommandOptions) { if(!(opts.options.has(opts.topicsToMoveJsonFileOpt) && opts.options.has(opts.brokerListOpt))) CommandLineUtils.printUsageAndDie(opts.parser, "If --generate option is used, command must include both --topics-to-move-json-file and --broker-list options") val topicsToMoveJsonFile = opts.options.valueOf(opts.topicsToMoveJsonFileOpt) val brokerListToReassign = opts.options.valueOf(opts.brokerListOpt).split(',').map(_.toInt) val duplicateReassignments = CoreUtils.duplicates(brokerListToReassign) if (duplicateReassignments.nonEmpty) throw new AdminCommandFailedException("Broker list contains duplicate entries: %s".format(duplicateReassignments.mkString(","))) val topicsToMoveJsonString = Utils.readFileAsString(topicsToMoveJsonFile) val topicsToReassign = ZkUtils.parseTopicsData(topicsToMoveJsonString) val duplicateTopicsToReassign = CoreUtils.duplicates(topicsToReassign) if (duplicateTopicsToReassign.nonEmpty) throw new AdminCommandFailedException("List of topics to reassign contains duplicate entries: %s".format(duplicateTopicsToReassign.mkString(","))) val topicPartitionsToReassign = ZkUtils.getReplicaAssignmentForTopics(zkClient, topicsToReassign) var partitionsToBeReassigned : Map[TopicAndPartition, Seq[Int]] = new mutable.HashMap[TopicAndPartition, List[Int]]() val groupedByTopic = topicPartitionsToReassign.groupBy(tp => tp._1.topic) groupedByTopic.foreach { topicInfo => val assignedReplicas = AdminUtils.assignReplicasToBrokers(brokerListToReassign, topicInfo._2.size, topicInfo._2.head._2.size) partitionsToBeReassigned ++= assignedReplicas.map(replicaInfo => (TopicAndPartition(topicInfo._1, replicaInfo._1) -> replicaInfo._2)) } val currentPartitionReplicaAssignment = ZkUtils.getReplicaAssignmentForTopics(zkClient, partitionsToBeReassigned.map(_._1.topic).toSeq) println("Current partition replica assignment\n\n%s" .format(ZkUtils.getPartitionReassignmentZkData(currentPartitionReplicaAssignment))) println("Proposed partition reassignment configuration\n\n%s".format(ZkUtils.getPartitionReassignmentZkData(partitionsToBeReassigned))) }
实际最终调用那个的是AdminUtils.assignReplicasToBrokers方法, 这个方法的注释如下:
/**
* There are 2 goals of replica assignment:
* 1. Spread the replicas evenly among brokers.
* 2. For partitions assigned to a particular broker, their other replicas are spread over the other brokers.
*
* To achieve this goal, we:
* 1. Assign the first replica of each partition by round-robin, starting from a random position in the broker list.
* 2. Assign the remaining replicas of each partition with an increasing shift.
*
* Here is an example of assigning
* broker-0 broker-1 broker-2 broker-3 broker-4
* p0 p1 p2 p3 p4 (1st replica)
* p5 p6 p7 p8 p9 (1st replica)
* p4 p0 p1 p2 p3 (2nd replica)
* p8 p9 p5 p6 p7 (2nd replica)
* p3 p4 p0 p1 p2 (3nd replica)
* p7 p8 p9 p5 p6 (3nd replica)
*/
可以看到, 具体的算法是根据broker以及副本数来循环分区数,这样就能得到每个broker
上面关于这个的分区结果。generateAssignment方法会循环调用这个方法来处理不同的topic。
具体的返回结果应该是一个map, map的主键是 TopicAndPartition, 键值是关于这个topic的
partition的broker的id的List。然后会把这个分配的结果给打印出来。
5, 执行分配代码分析
def executeAssignment(zkClient: ZkClient, opts: ReassignPartitionsCommandOptions) { if(!opts.options.has(opts.reassignmentJsonFileOpt)) CommandLineUtils.printUsageAndDie(opts.parser, "If --execute option is used, command must include --reassignment-json-file that was output " + "during the --generate option") val reassignmentJsonFile = opts.options.valueOf(opts.reassignmentJsonFileOpt) val reassignmentJsonString = Utils.readFileAsString(reassignmentJsonFile) val partitionsToBeReassigned = ZkUtils.parsePartitionReassignmentDataWithoutDedup(reassignmentJsonString) if (partitionsToBeReassigned.isEmpty) throw new AdminCommandFailedException("Partition reassignment data file %s is empty".format(reassignmentJsonFile)) val duplicateReassignedPartitions = CoreUtils.duplicates(partitionsToBeReassigned.map{ case(tp,replicas) => tp}) if (duplicateReassignedPartitions.nonEmpty) throw new AdminCommandFailedException("Partition reassignment contains duplicate topic partitions: %s".format(duplicateReassignedPartitions.mkString(","))) val duplicateEntries= partitionsToBeReassigned .map{ case(tp,replicas) => (tp, CoreUtils.duplicates(replicas))} .filter{ case (tp,duplicatedReplicas) => duplicatedReplicas.nonEmpty } if (duplicateEntries.nonEmpty) { val duplicatesMsg = duplicateEntries .map{ case (tp,duplicateReplicas) => "%s contains multiple entries for %s".format(tp, duplicateReplicas.mkString(",")) } .mkString(". ") throw new AdminCommandFailedException("Partition replica lists may not contain duplicate entries: %s".format(duplicatesMsg)) } val reassignPartitionsCommand = new ReassignPartitionsCommand(zkClient, partitionsToBeReassigned.toMap) // before starting assignment, output the current replica assignment to facilitate rollback val currentPartitionReplicaAssignment = ZkUtils.getReplicaAssignmentForTopics(zkClient, partitionsToBeReassigned.map(_._1.topic).toSeq) println("Current partition replica assignment\n\n%s\n\nSave this to use as the --reassignment-json-file option during rollback" .format(ZkUtils.getPartitionReassignmentZkData(currentPartitionReplicaAssignment))) // start the reassignment if(reassignPartitionsCommand.reassignPartitions()) println("Successfully started reassignment of partitions %s".format(ZkUtils.getPartitionReassignmentZkData(partitionsToBeReassigned.toMap))) else println("Failed to reassign partitions %s".format(partitionsToBeReassigned)) }
这个方法前面是一些常规的参数检查, 后面世构造了一个 ReassignPartitionsCommand , 然后执行这个
command的reassignPartitionsCommand.reassignPartitions() 的方法。 主要看下 reassignPartitions方法
是如何执行的。
这个方法先是验证参数的有效性, 然后将新的分配方案写入到zookeeper的/admin/reassign_partitions路径
中, 其他的事情也没有做。
6, 如何验证分配是否执行完成
def verifyAssignment(zkClient: ZkClient, opts: ReassignPartitionsCommandOptions) { if(!opts.options.has(opts.reassignmentJsonFileOpt)) CommandLineUtils.printUsageAndDie(opts.parser, "If --verify option is used, command must include --reassignment-json-file that was used during the --execute option") val jsonFile = opts.options.valueOf(opts.reassignmentJsonFileOpt) val jsonString = Utils.readFileAsString(jsonFile) val partitionsToBeReassigned = ZkUtils.parsePartitionReassignmentData(jsonString) println("Status of partition reassignment:") val reassignedPartitionsStatus = checkIfReassignmentSucceeded(zkClient, partitionsToBeReassigned) reassignedPartitionsStatus.foreach { partition => partition._2 match { case ReassignmentCompleted => println("Reassignment of partition %s completed successfully".format(partition._1)) case ReassignmentFailed => println("Reassignment of partition %s failed".format(partition._1)) case ReassignmentInProgress => println("Reassignment of partition %s is still in progress".format(partition._1)) } } }
具体的逻辑也就是查看根据新的分配方案, 查看zookeeper对应的topic中是否有符合新的分配方案的broker的id,如果有的话,则是成功,如果没有则是失败。 默认的状态是进行中。
7, 分配如何进行的
目前看到的结果只是将新的分配方案写入到zookeeper对应的路径上面, 那到底是那个进程来处理这些事情呢?
在 Kafka 启动过程 中我们看到了有调用 kafkaController.startup, 这个过程会去正当controller, 并不是所
有的节点都会成为controller, 只有成为主的controller才会回调 onControllerFailover 这个方法, 这个方法
注册了监听 "/admin/reassign_partitions" 目录的事件, 处理 事件是通过 PartitionsReassignedListener 的
handleDataChange来处理的。 实际上最终处理的是通过 onPartitionReassignment的方法
/** * This callback is invoked by the reassigned partitions listener. When an admin command initiates a partition * reassignment, it creates the /admin/reassign_partitions path that triggers the zookeeper listener. * Reassigning replicas for a partition goes through a few steps listed in the code. * RAR = Reassigned replicas * OAR = Original list of replicas for partition * AR = current assigned replicas * * 1. Update AR in ZK with OAR + RAR. * 2. Send LeaderAndIsr request to every replica in OAR + RAR (with AR as OAR + RAR). We do this by forcing an update * of the leader epoch in zookeeper. * 3. Start new replicas RAR - OAR by moving replicas in RAR - OAR to NewReplica state. * 4. Wait until all replicas in RAR are in sync with the leader. * 5 Move all replicas in RAR to OnlineReplica state. * 6. Set AR to RAR in memory. * 7. If the leader is not in RAR, elect a new leader from RAR. If new leader needs to be elected from RAR, a LeaderAndIsr * will be sent. If not, then leader epoch will be incremented in zookeeper and a LeaderAndIsr request will be sent. * In any case, the LeaderAndIsr request will have AR = RAR. This will prevent the leader from adding any replica in * RAR - OAR back in the isr. * 8. Move all replicas in OAR - RAR to OfflineReplica state. As part of OfflineReplica state change, we shrink the * isr to remove OAR - RAR in zookeeper and sent a LeaderAndIsr ONLY to the Leader to notify it of the shrunk isr. * After that, we send a StopReplica (delete = false) to the replicas in OAR - RAR. * 9. Move all replicas in OAR - RAR to NonExistentReplica state. This will send a StopReplica (delete = false) to * the replicas in OAR - RAR to physically delete the replicas on disk. * 10. Update AR in ZK with RAR. * 11. Update the /admin/reassign_partitions path in ZK to remove this partition. * 12. After electing leader, the replicas and isr information changes. So resend the update metadata request to every broker. * * For example, if OAR = {1, 2, 3} and RAR = {4,5,6}, the values in the assigned replica (AR) and leader/isr path in ZK * may go through the following transition. * AR leader/isr * {1,2,3} 1/{1,2,3} (initial state) * {1,2,3,4,5,6} 1/{1,2,3} (step 2) * {1,2,3,4,5,6} 1/{1,2,3,4,5,6} (step 4) * {1,2,3,4,5,6} 4/{1,2,3,4,5,6} (step 7) * {1,2,3,4,5,6} 4/{4,5,6} (step 8) * {4,5,6} 4/{4,5,6} (step 10) * * Note that we have to update AR in ZK with RAR last since it's the only place where we store OAR persistently. * This way, if the controller crashes before that step, we can still recover. */ def onPartitionReassignment(topicAndPartition: TopicAndPartition, reassignedPartitionContext: ReassignedPartitionsContext) { val reassignedReplicas = reassignedPartitionContext.newReplicas areReplicasInIsr(topicAndPartition.topic, topicAndPartition.partition, reassignedReplicas) match { case false => info("New replicas %s for partition %s being ".format(reassignedReplicas.mkString(","), topicAndPartition) + "reassigned not yet caught up with the leader") val newReplicasNotInOldReplicaList = reassignedReplicas.toSet -- controllerContext.partitionReplicaAssignment(topicAndPartition).toSet val newAndOldReplicas = (reassignedPartitionContext.newReplicas ++ controllerContext.partitionReplicaAssignment(topicAndPartition)).toSet //1. Update AR in ZK with OAR + RAR. updateAssignedReplicasForPartition(topicAndPartition, newAndOldReplicas.toSeq) //2. Send LeaderAndIsr request to every replica in OAR + RAR (with AR as OAR + RAR). updateLeaderEpochAndSendRequest(topicAndPartition, controllerContext.partitionReplicaAssignment(topicAndPartition), newAndOldReplicas.toSeq) //3. replicas in RAR - OAR -> NewReplica startNewReplicasForReassignedPartition(topicAndPartition, reassignedPartitionContext, newReplicasNotInOldReplicaList) info("Waiting for new replicas %s for partition %s being ".format(reassignedReplicas.mkString(","), topicAndPartition) + "reassigned to catch up with the leader") case true => //4. Wait until all replicas in RAR are in sync with the leader. val oldReplicas = controllerContext.partitionReplicaAssignment(topicAndPartition).toSet -- reassignedReplicas.toSet //5. replicas in RAR -> OnlineReplica reassignedReplicas.foreach { replica => replicaStateMachine.handleStateChanges(Set(new PartitionAndReplica(topicAndPartition.topic, topicAndPartition.partition, replica)), OnlineReplica) } //6. Set AR to RAR in memory. //7. Send LeaderAndIsr request with a potential new leader (if current leader not in RAR) and // a new AR (using RAR) and same isr to every broker in RAR moveReassignedPartitionLeaderIfRequired(topicAndPartition, reassignedPartitionContext) //8. replicas in OAR - RAR -> Offline (force those replicas out of isr) //9. replicas in OAR - RAR -> NonExistentReplica (force those replicas to be deleted) stopOldReplicasOfReassignedPartition(topicAndPartition, reassignedPartitionContext, oldReplicas) //10. Update AR in ZK with RAR. updateAssignedReplicasForPartition(topicAndPartition, reassignedReplicas) //11. Update the /admin/reassign_partitions path in ZK to remove this partition. removePartitionFromReassignedPartitions(topicAndPartition) info("Removed partition %s from the list of reassigned partitions in zookeeper".format(topicAndPartition)) controllerContext.partitionsBeingReassigned.remove(topicAndPartition) //12. After electing leader, the replicas and isr information changes, so resend the update metadata request to every broker sendUpdateMetadataRequest(controllerContext.liveOrShuttingDownBrokerIds.toSeq, Set(topicAndPartition)) // signal delete topic thread if reassignment for some partitions belonging to topics being deleted just completed deleteTopicManager.resumeDeletionForTopics(Set(topicAndPartition.topic)) } }
参考例子可以得到,先是新增节点,然后通过replicas机制将新增的节点变为ISR,然后再把需要替换掉的ISR
去除。
相关推荐
3. **副本与故障转移**:Kafka支持分区的多副本,每个分区都有一个主副本和若干个从副本。当主副本故障时,从副本会自动晋升为主副本,确保服务的连续性。 4. **Kafka的Producers API**:生产者API允许开发者向...
3. **容错性**: 通过复制和故障转移机制,Kafka能够容忍服务器故障,保持服务的连续性。 4. **实时处理**: Kafka支持实时数据流处理,消息一旦发布,消费者即可立即消费。 5. **可伸缩性**: Kafka能够轻松添加或...
它的消息模型、分区和复制机制、以及保留策略等特性,使其在大数据和实时分析场景中具有很高的价值。学习和理解 Kafka 的参数配置,对于优化 Kafka 集群的性能和稳定性至关重要。例如,调整消息批次大小、保留策略、...
2. **消费者位移管理**:消费者位移现在默认存储在Kafka主题中,提高了消费者恢复速度和故障转移的可靠性。 3. **连接器(Connectors)**:Kafka Connect允许快速地建立到外部系统的连接,如数据库或文件系统,简化...
- **副本与故障转移**:每个分区可以有多个副本,其中一个作为 Leader,其他为 Follower。当 Leader 故障时,一个 Follower 会自动晋升为新的 Leader。 - **ISR(In-Sync Replicas)**:ISR 是一组与 Leader 保持...
Kafka是一个广泛使用的开源消息系统,由LinkedIn开发并贡献给了Apache软件基金会。作为一个分布式流处理平台,Kafka在大数据...通过深入学习和实践,你可以熟练掌握Kafka,将其应用于大数据处理、实时分析等多个领域。
- **容错性**:通过复制策略,每个分区都有多个副本,可以在副本间进行故障转移。 - **可扩展性**:通过增加节点,Kafka集群可以轻松扩展以处理更多流量。 - **实时性**:Kafka支持实时数据处理,适合大数据分析...
同时,每个分区都有一个副本,主副本负责写入,其他副本则用于故障转移。当主副本出现故障时,其他副本可以接管,保证服务的连续性。 4. **性能优化**: Kafka 的高性能源于其设计中的几个关键点:零拷贝技术减少...
- **高可用性**:通过复制策略,Kafka可以实现故障转移,确保服务不间断。 - **高性能**:Kafka具有极高的吞吐量,可以在秒级别处理数百万条消息。 - **实时处理**:Kafka支持实时数据流处理,适合大数据实时分析...
配合Kafka的自动故障转移机制,CMAK能有效提升集群的容错性。 另外,CMAK还支持用户权限管理,允许设置不同级别的访问控制,确保只有授权的用户才能进行关键操作,增强了系统的安全性。同时,它的日志记录和审计...
3. **容错性**:通过复制机制,Kafka提供了一定程度的故障转移和恢复能力。 4. **实时性**:Kafka能够处理实时数据流,消息的发布和订阅延迟极低。 **Kafka的使用场景** 1. **日志收集**:Kafka常用于收集应用...
3. **高可用性**:通过复制和故障转移,Kafka能够在集群中实现高可用性。 4. **可伸缩性**:Kafka可以通过添加更多的broker节点来水平扩展。 5. **多消费者模型**:支持多个消费组,每个组内的消费者可以并行消费...
它旨在自动化执行Kafka集群中的资源分配、负载均衡、健康监测和故障转移。Cruise Control的架构设计是解决这些挑战的关键,它需要处理来自生产者和消费者的请求,确保集群在故障发生时能够自动恢复,并进行资源的...
- **创建主题**:使用Kafka的命令行工具创建主题,设置分区和副本数量。 - **配置生产者与消费者**:根据实际需求配置生产者和消费者的参数,如acks、batch.size等。 5. **最佳实践** - **合理分区**:根据数据...
- **流式处理**:作为流处理平台,Kafka可以连接到Storm或Spark Streaming等系统,实现实时分析。 - **消息传递**:替代传统的消息队列系统,如RabbitMQ或ActiveMQ。 6. **优化与扩展**: - **调整配置**:如...
- **主题(Topic)**:Kafka中的数据被组织成主题,每个主题可以分为多个分区(Partition)。 - **分区(Partition)**:保证消息顺序的关键,每个分区内部的消息是有序的。 - **生产者(Producer)**:负责向...
分区的特性使得Kafka能提供高吞吐量的读写操作,并且保证消息的顺序。 3. **消费者组** 消费者通过加入消费者组来协同工作。每个主题的消息会被分发给消费者组内的不同消费者。如果一个消费者失败,其分配的任务会...
2. **实时处理**:通过集成Storm、Spark Streaming等实时处理框架,对流入Kafka的数据进行实时分析和处理。 3. **数据集成**:Kafka可以作为不同系统间数据交换的中间件,实现异构系统的数据集成。 4. **消息队列*...
- **复制与容错**:每个分区都有多个副本,主副本处理写操作,从副本用于故障转移。 - **实时处理**:支持实时数据流处理,无需预先批处理所有数据。 - **可伸缩性**:通过添加更多的服务器,可以动态调整集群...