`

Kafka 转移分区分析

 
阅读更多

1, 关于如何转移分区: 以及如何新增节点的问题, 我们在 Kafka中文文档 中已经有过叙述。详细参考

 

2, 分析命令的执行过程 : 分区调用的脚本是 kafka-reassign-partitions.sh, 具体内容是:

exec $(dirname $0)/kafka-run-class.sh kafka.admin.ReassignPartitionsCommand $@

3, 分析一下 ReassignPartitionsCommand 的代码, 主要有三个方法,分别对应生成分区

分配的方案, 执行分配方案, 以及验证分配方案执行情况

 

4,   生成分区分配方案代码如下:

 

 

  def generateAssignment(zkClient: ZkClient, opts: ReassignPartitionsCommandOptions) {
    if(!(opts.options.has(opts.topicsToMoveJsonFileOpt) && opts.options.has(opts.brokerListOpt)))
      CommandLineUtils.printUsageAndDie(opts.parser, "If --generate option is used, command must include both --topics-to-move-json-file and --broker-list options")
    val topicsToMoveJsonFile = opts.options.valueOf(opts.topicsToMoveJsonFileOpt)
    val brokerListToReassign = opts.options.valueOf(opts.brokerListOpt).split(',').map(_.toInt)
    val duplicateReassignments = CoreUtils.duplicates(brokerListToReassign)
    if (duplicateReassignments.nonEmpty)
      throw new AdminCommandFailedException("Broker list contains duplicate entries: %s".format(duplicateReassignments.mkString(",")))
    val topicsToMoveJsonString = Utils.readFileAsString(topicsToMoveJsonFile)
    val topicsToReassign = ZkUtils.parseTopicsData(topicsToMoveJsonString)
    val duplicateTopicsToReassign = CoreUtils.duplicates(topicsToReassign)
    if (duplicateTopicsToReassign.nonEmpty)
      throw new AdminCommandFailedException("List of topics to reassign contains duplicate entries: %s".format(duplicateTopicsToReassign.mkString(",")))
    val topicPartitionsToReassign = ZkUtils.getReplicaAssignmentForTopics(zkClient, topicsToReassign)

    var partitionsToBeReassigned : Map[TopicAndPartition, Seq[Int]] = new mutable.HashMap[TopicAndPartition, List[Int]]()
    val groupedByTopic = topicPartitionsToReassign.groupBy(tp => tp._1.topic)
    groupedByTopic.foreach { topicInfo =>
      val assignedReplicas = AdminUtils.assignReplicasToBrokers(brokerListToReassign, topicInfo._2.size,
        topicInfo._2.head._2.size)
      partitionsToBeReassigned ++= assignedReplicas.map(replicaInfo => (TopicAndPartition(topicInfo._1, replicaInfo._1) -> replicaInfo._2))
    }
    val currentPartitionReplicaAssignment = ZkUtils.getReplicaAssignmentForTopics(zkClient, partitionsToBeReassigned.map(_._1.topic).toSeq)
    println("Current partition replica assignment\n\n%s"
      .format(ZkUtils.getPartitionReassignmentZkData(currentPartitionReplicaAssignment)))
    println("Proposed partition reassignment configuration\n\n%s".format(ZkUtils.getPartitionReassignmentZkData(partitionsToBeReassigned)))
  }
 

 

实际最终调用那个的是AdminUtils.assignReplicasToBrokers方法, 这个方法的注释如下:

 

  /**
   * There are 2 goals of replica assignment:
   * 1. Spread the replicas evenly among brokers.
   * 2. For partitions assigned to a particular broker, their other replicas are spread over the other brokers.
   *
   * To achieve this goal, we:
   * 1. Assign the first replica of each partition by round-robin, starting from a random position in the broker list.
   * 2. Assign the remaining replicas of each partition with an increasing shift.
   *
   * Here is an example of assigning
   * broker-0  broker-1  broker-2  broker-3  broker-4
   * p0        p1        p2        p3        p4       (1st replica)
   * p5        p6        p7        p8        p9       (1st replica)
   * p4        p0        p1        p2        p3       (2nd replica)
   * p8        p9        p5        p6        p7       (2nd replica)
   * p3        p4        p0        p1        p2       (3nd replica)
   * p7        p8        p9        p5        p6       (3nd replica)
   */

可以看到, 具体的算法是根据broker以及副本数来循环分区数,这样就能得到每个broker

上面关于这个的分区结果。generateAssignment方法会循环调用这个方法来处理不同的topic。

具体的返回结果应该是一个map, map的主键是 TopicAndPartition, 键值是关于这个topic的

partition的broker的id的List。然后会把这个分配的结果给打印出来。

 

 

 

5, 执行分配代码分析

 

 

def executeAssignment(zkClient: ZkClient, opts: ReassignPartitionsCommandOptions) {
    if(!opts.options.has(opts.reassignmentJsonFileOpt))
      CommandLineUtils.printUsageAndDie(opts.parser, "If --execute option is used, command must include --reassignment-json-file that was output " + "during the --generate option")
    val reassignmentJsonFile =  opts.options.valueOf(opts.reassignmentJsonFileOpt)
    val reassignmentJsonString = Utils.readFileAsString(reassignmentJsonFile)
    val partitionsToBeReassigned = ZkUtils.parsePartitionReassignmentDataWithoutDedup(reassignmentJsonString)
    if (partitionsToBeReassigned.isEmpty)
      throw new AdminCommandFailedException("Partition reassignment data file %s is empty".format(reassignmentJsonFile))
    val duplicateReassignedPartitions = CoreUtils.duplicates(partitionsToBeReassigned.map{ case(tp,replicas) => tp})
    if (duplicateReassignedPartitions.nonEmpty)
      throw new AdminCommandFailedException("Partition reassignment contains duplicate topic partitions: %s".format(duplicateReassignedPartitions.mkString(",")))
    val duplicateEntries= partitionsToBeReassigned
      .map{ case(tp,replicas) => (tp, CoreUtils.duplicates(replicas))}
      .filter{ case (tp,duplicatedReplicas) => duplicatedReplicas.nonEmpty }
    if (duplicateEntries.nonEmpty) {
      val duplicatesMsg = duplicateEntries
        .map{ case (tp,duplicateReplicas) => "%s contains multiple entries for %s".format(tp, duplicateReplicas.mkString(",")) }
        .mkString(". ")
      throw new AdminCommandFailedException("Partition replica lists may not contain duplicate entries: %s".format(duplicatesMsg))
    }
    val reassignPartitionsCommand = new ReassignPartitionsCommand(zkClient, partitionsToBeReassigned.toMap)
    // before starting assignment, output the current replica assignment to facilitate rollback
    val currentPartitionReplicaAssignment = ZkUtils.getReplicaAssignmentForTopics(zkClient, partitionsToBeReassigned.map(_._1.topic).toSeq)
    println("Current partition replica assignment\n\n%s\n\nSave this to use as the --reassignment-json-file option during rollback"
      .format(ZkUtils.getPartitionReassignmentZkData(currentPartitionReplicaAssignment)))
    // start the reassignment
    if(reassignPartitionsCommand.reassignPartitions())
      println("Successfully started reassignment of partitions %s".format(ZkUtils.getPartitionReassignmentZkData(partitionsToBeReassigned.toMap)))
    else
      println("Failed to reassign partitions %s".format(partitionsToBeReassigned))
  }
 

 

这个方法前面是一些常规的参数检查, 后面世构造了一个 ReassignPartitionsCommand , 然后执行这个

command的reassignPartitionsCommand.reassignPartitions() 的方法。 主要看下 reassignPartitions方法

是如何执行的。

这个方法先是验证参数的有效性, 然后将新的分配方案写入到zookeeper的/admin/reassign_partitions路径

中, 其他的事情也没有做。

 

 

6, 如何验证分配是否执行完成

 

 

  def verifyAssignment(zkClient: ZkClient, opts: ReassignPartitionsCommandOptions) {
    if(!opts.options.has(opts.reassignmentJsonFileOpt))
      CommandLineUtils.printUsageAndDie(opts.parser, "If --verify option is used, command must include --reassignment-json-file that was used during the --execute option")
    val jsonFile = opts.options.valueOf(opts.reassignmentJsonFileOpt)
    val jsonString = Utils.readFileAsString(jsonFile)
    val partitionsToBeReassigned = ZkUtils.parsePartitionReassignmentData(jsonString)

    println("Status of partition reassignment:")
    val reassignedPartitionsStatus = checkIfReassignmentSucceeded(zkClient, partitionsToBeReassigned)
    reassignedPartitionsStatus.foreach { partition =>
      partition._2 match {
        case ReassignmentCompleted =>
          println("Reassignment of partition %s completed successfully".format(partition._1))
        case ReassignmentFailed =>
          println("Reassignment of partition %s failed".format(partition._1))
        case ReassignmentInProgress =>
          println("Reassignment of partition %s is still in progress".format(partition._1))
      }
    }
  }
 
具体的逻辑也就是查看根据新的分配方案, 查看zookeeper对应的topic中是否有符合新的分配方案的broker的id,如果有的话,则是成功,如果没有则是失败。 默认的状态是进行中。

 

 

 

7,  分配如何进行的

 

目前看到的结果只是将新的分配方案写入到zookeeper对应的路径上面, 那到底是那个进程来处理这些事情呢?

 

Kafka 启动过程 中我们看到了有调用 kafkaController.startup, 这个过程会去正当controller, 并不是所

有的节点都会成为controller, 只有成为主的controller才会回调 onControllerFailover 这个方法, 这个方法

注册了监听 "/admin/reassign_partitions" 目录的事件, 处理 事件是通过  PartitionsReassignedListener  的

handleDataChange来处理的。 实际上最终处理的是通过 onPartitionReassignment的方法

/**
   * This callback is invoked by the reassigned partitions listener. When an admin command initiates a partition
   * reassignment, it creates the /admin/reassign_partitions path that triggers the zookeeper listener.
   * Reassigning replicas for a partition goes through a few steps listed in the code.
   * RAR = Reassigned replicas
   * OAR = Original list of replicas for partition
   * AR = current assigned replicas
   *
   * 1. Update AR in ZK with OAR + RAR.
   * 2. Send LeaderAndIsr request to every replica in OAR + RAR (with AR as OAR + RAR). We do this by forcing an update
   *    of the leader epoch in zookeeper.
   * 3. Start new replicas RAR - OAR by moving replicas in RAR - OAR to NewReplica state.
   * 4. Wait until all replicas in RAR are in sync with the leader.
   * 5  Move all replicas in RAR to OnlineReplica state.
   * 6. Set AR to RAR in memory.
   * 7. If the leader is not in RAR, elect a new leader from RAR. If new leader needs to be elected from RAR, a LeaderAndIsr
   *    will be sent. If not, then leader epoch will be incremented in zookeeper and a LeaderAndIsr request will be sent.
   *    In any case, the LeaderAndIsr request will have AR = RAR. This will prevent the leader from adding any replica in
   *    RAR - OAR back in the isr.
   * 8. Move all replicas in OAR - RAR to OfflineReplica state. As part of OfflineReplica state change, we shrink the
   *    isr to remove OAR - RAR in zookeeper and sent a LeaderAndIsr ONLY to the Leader to notify it of the shrunk isr.
   *    After that, we send a StopReplica (delete = false) to the replicas in OAR - RAR.
   * 9. Move all replicas in OAR - RAR to NonExistentReplica state. This will send a StopReplica (delete = false) to
   *    the replicas in OAR - RAR to physically delete the replicas on disk.
   * 10. Update AR in ZK with RAR.
   * 11. Update the /admin/reassign_partitions path in ZK to remove this partition.
   * 12. After electing leader, the replicas and isr information changes. So resend the update metadata request to every broker.
   *
   * For example, if OAR = {1, 2, 3} and RAR = {4,5,6}, the values in the assigned replica (AR) and leader/isr path in ZK
   * may go through the following transition.
   * AR                 leader/isr
   * {1,2,3}            1/{1,2,3}           (initial state)
   * {1,2,3,4,5,6}      1/{1,2,3}           (step 2)
   * {1,2,3,4,5,6}      1/{1,2,3,4,5,6}     (step 4)
   * {1,2,3,4,5,6}      4/{1,2,3,4,5,6}     (step 7)
   * {1,2,3,4,5,6}      4/{4,5,6}           (step 8)
   * {4,5,6}            4/{4,5,6}           (step 10)
   *
   * Note that we have to update AR in ZK with RAR last since it's the only place where we store OAR persistently.
   * This way, if the controller crashes before that step, we can still recover.
   */
  def onPartitionReassignment(topicAndPartition: TopicAndPartition, reassignedPartitionContext: ReassignedPartitionsContext) {
    val reassignedReplicas = reassignedPartitionContext.newReplicas
    areReplicasInIsr(topicAndPartition.topic, topicAndPartition.partition, reassignedReplicas) match {
      case false =>
        info("New replicas %s for partition %s being ".format(reassignedReplicas.mkString(","), topicAndPartition) +
          "reassigned not yet caught up with the leader")
        val newReplicasNotInOldReplicaList = reassignedReplicas.toSet -- controllerContext.partitionReplicaAssignment(topicAndPartition).toSet
        val newAndOldReplicas = (reassignedPartitionContext.newReplicas ++ controllerContext.partitionReplicaAssignment(topicAndPartition)).toSet
        //1. Update AR in ZK with OAR + RAR.
        updateAssignedReplicasForPartition(topicAndPartition, newAndOldReplicas.toSeq)
        //2. Send LeaderAndIsr request to every replica in OAR + RAR (with AR as OAR + RAR).
        updateLeaderEpochAndSendRequest(topicAndPartition, controllerContext.partitionReplicaAssignment(topicAndPartition),
          newAndOldReplicas.toSeq)
        //3. replicas in RAR - OAR -> NewReplica
        startNewReplicasForReassignedPartition(topicAndPartition, reassignedPartitionContext, newReplicasNotInOldReplicaList)
        info("Waiting for new replicas %s for partition %s being ".format(reassignedReplicas.mkString(","), topicAndPartition) +
          "reassigned to catch up with the leader")
      case true =>
        //4. Wait until all replicas in RAR are in sync with the leader.
        val oldReplicas = controllerContext.partitionReplicaAssignment(topicAndPartition).toSet -- reassignedReplicas.toSet
        //5. replicas in RAR -> OnlineReplica
        reassignedReplicas.foreach { replica =>
          replicaStateMachine.handleStateChanges(Set(new PartitionAndReplica(topicAndPartition.topic, topicAndPartition.partition,
            replica)), OnlineReplica)
        }
        //6. Set AR to RAR in memory.
        //7. Send LeaderAndIsr request with a potential new leader (if current leader not in RAR) and
        //   a new AR (using RAR) and same isr to every broker in RAR
        moveReassignedPartitionLeaderIfRequired(topicAndPartition, reassignedPartitionContext)
        //8. replicas in OAR - RAR -> Offline (force those replicas out of isr)
        //9. replicas in OAR - RAR -> NonExistentReplica (force those replicas to be deleted)
        stopOldReplicasOfReassignedPartition(topicAndPartition, reassignedPartitionContext, oldReplicas)
        //10. Update AR in ZK with RAR.
        updateAssignedReplicasForPartition(topicAndPartition, reassignedReplicas)
        //11. Update the /admin/reassign_partitions path in ZK to remove this partition.
        removePartitionFromReassignedPartitions(topicAndPartition)
        info("Removed partition %s from the list of reassigned partitions in zookeeper".format(topicAndPartition))
        controllerContext.partitionsBeingReassigned.remove(topicAndPartition)
        //12. After electing leader, the replicas and isr information changes, so resend the update metadata request to every broker
        sendUpdateMetadataRequest(controllerContext.liveOrShuttingDownBrokerIds.toSeq, Set(topicAndPartition))
        // signal delete topic thread if reassignment for some partitions belonging to topics being deleted just completed
        deleteTopicManager.resumeDeletionForTopics(Set(topicAndPartition.topic))
    }
  }

 

参考例子可以得到,先是新增节点,然后通过replicas机制将新增的节点变为ISR,然后再把需要替换掉的ISR

去除。

 

 

 

 

 

 

分享到:
评论

相关推荐

    Kafka技术内幕:图文详解Kafka源码设计与实现

    3. **副本与故障转移**:Kafka支持分区的多副本,每个分区都有一个主副本和若干个从副本。当主副本故障时,从副本会自动晋升为主副本,确保服务的连续性。 4. **Kafka的Producers API**:生产者API允许开发者向...

    Kafka课程讲义.zip

    3. **容错性**: 通过复制和故障转移机制,Kafka能够容忍服务器故障,保持服务的连续性。 4. **实时处理**: Kafka支持实时数据流处理,消息一旦发布,消费者即可立即消费。 5. **可伸缩性**: Kafka能够轻松添加或...

    kafka笔记参数说明

    它的消息模型、分区和复制机制、以及保留策略等特性,使其在大数据和实时分析场景中具有很高的价值。学习和理解 Kafka 的参数配置,对于优化 Kafka 集群的性能和稳定性至关重要。例如,调整消息批次大小、保留策略、...

    kafka基础简介整合材料

    - **副本与故障转移**:每个分区可以有多个副本,其中一个作为 Leader,其他为 Follower。当 Leader 故障时,一个 Follower 会自动晋升为新的 Leader。 - **ISR(In-Sync Replicas)**:ISR 是一组与 Leader 保持...

    kafka学习资料.zip

    Kafka是一个广泛使用的开源消息系统,由LinkedIn开发并贡献给了Apache软件基金会。作为一个分布式流处理平台,Kafka在大数据...通过深入学习和实践,你可以熟练掌握Kafka,将其应用于大数据处理、实时分析等多个领域。

    kafka-2.12-3.3.2.tgz

    2. **消费者位移管理**:消费者位移现在默认存储在Kafka主题中,提高了消费者恢复速度和故障转移的可靠性。 3. **连接器(Connectors)**:Kafka Connect允许快速地建立到外部系统的连接,如数据库或文件系统,简化...

    kafka完整高清带书签资源

    - **容错性**:通过复制策略,每个分区都有多个副本,可以在副本间进行故障转移。 - **可扩展性**:通过增加节点,Kafka集群可以轻松扩展以处理更多流量。 - **实时性**:Kafka支持实时数据处理,适合大数据分析...

    kafka_2.12-2.6.0

    同时,每个分区都有一个副本,主副本负责写入,其他副本则用于故障转移。当主副本出现故障时,其他副本可以接管,保证服务的连续性。 4. **性能优化**: Kafka 的高性能源于其设计中的几个关键点:零拷贝技术减少...

    尚硅谷大数据视频_Kafka视频教程-笔记.zip

    - **高可用性**:通过复制策略,Kafka可以实现故障转移,确保服务不间断。 - **高性能**:Kafka具有极高的吞吐量,可以在秒级别处理数百万条消息。 - **实时处理**:Kafka支持实时数据流处理,适合大数据实时分析...

    kafka集群企业级管理工具

    配合Kafka的自动故障转移机制,CMAK能有效提升集群的容错性。 另外,CMAK还支持用户权限管理,允许设置不同级别的访问控制,确保只有授权的用户才能进行关键操作,增强了系统的安全性。同时,它的日志记录和审计...

    《Kafka并不难学!入门、进阶、商业实战》_邓杰_2018-11-01

    3. **容错性**:通过复制机制,Kafka提供了一定程度的故障转移和恢复能力。 4. **实时性**:Kafka能够处理实时数据流,消息的发布和订阅延迟极低。 **Kafka的使用场景** 1. **日志收集**:Kafka常用于收集应用...

    Kafka-2.12

    3. **高可用性**:通过复制和故障转移,Kafka能够在集群中实现高可用性。 4. **可伸缩性**:Kafka可以通过添加更多的broker节点来水平扩展。 5. **多消费者模型**:支持多个消费组,每个组内的消费者可以并行消费...

    kafka自动化管理与分布式状态系统导论

    它旨在自动化执行Kafka集群中的资源分配、负载均衡、健康监测和故障转移。Cruise Control的架构设计是解决这些挑战的关键,它需要处理来自生产者和消费者的请求,确保集群在故障发生时能够自动恢复,并进行资源的...

    kafka_2.12-2.2.0.tgz

    - **创建主题**:使用Kafka的命令行工具创建主题,设置分区和副本数量。 - **配置生产者与消费者**:根据实际需求配置生产者和消费者的参数,如acks、batch.size等。 5. **最佳实践** - **合理分区**:根据数据...

    kafka_2.9.1-0.8.2.2.tgz

    - **流式处理**:作为流处理平台,Kafka可以连接到Storm或Spark Streaming等系统,实现实时分析。 - **消息传递**:替代传统的消息队列系统,如RabbitMQ或ActiveMQ。 6. **优化与扩展**: - **调整配置**:如...

    Kafka深度解析PDF 下载

    - **主题(Topic)**:Kafka中的数据被组织成主题,每个主题可以分为多个分区(Partition)。 - **分区(Partition)**:保证消息顺序的关键,每个分区内部的消息是有序的。 - **生产者(Producer)**:负责向...

    Kafka 实战演练 2

    分区的特性使得Kafka能提供高吞吐量的读写操作,并且保证消息的顺序。 3. **消费者组** 消费者通过加入消费者组来协同工作。每个主题的消息会被分发给消费者组内的不同消费者。如果一个消费者失败,其分配的任务会...

    Apache Kafka实战.7z

    2. **实时处理**:通过集成Storm、Spark Streaming等实时处理框架,对流入Kafka的数据进行实时分析和处理。 3. **数据集成**:Kafka可以作为不同系统间数据交换的中间件,实现异构系统的数据集成。 4. **消息队列*...

    kafka_2.10-0.10.2.1.tgz

    - **复制与容错**:每个分区都有多个副本,主副本处理写操作,从副本用于故障转移。 - **实时处理**:支持实时数据流处理,无需预先批处理所有数据。 - **可伸缩性**:通过添加更多的服务器,可以动态调整集群...

Global site tag (gtag.js) - Google Analytics