[HBase]Region assignment

iwinit

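Continuing from the previous post: we create table t1 with column family c1, and the HBase root directory (hbase.rootdir) is /new. When an empty table is created, the system automatically generates one empty region. We use the assignment of this region to see how a region gets created across the HMaster and the region server (rs below). The rough flow is: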

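1. The HMaster draws up an assignment plan: each region is assigned to exactly one rs, and regions are spread evenly across the rs instances.

2. The assignment operations for the different rs instances run concurrently.

3. For each region, a node is first created under /hbase/unassigned in ZooKeeper (zk below) with state 'offline'.

4. The master sends an RPC to the target rs, asking it to open the region.

5. The master then waits for all regions to be assigned; master and rs communicate through the state of the zk node.

6. The rs receives the request and runs the OpenRegion operation asynchronously.

7. The rs first changes the zk node state to 'opening'.

8. The rs opens and initializes the region, which mainly means creating the region's HDFS directory and initializing its Stores.

9. The rs updates the region's record in the meta table.

10. The rs changes the zk node state to 'opened'.

11. The master sees 'opened' and considers the region successfully assigned.

12. Once all regions have succeeded, the master considers the bulk region creation successful.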
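
The zk node states in the steps above form a small state machine. Below is a minimal sketch of it in Java, assuming a hypothetical `ZnodeState` enum and `canTransition` helper that are not HBase classes. It only encodes the transitions named in this walk-through, not everything HBase allows.

```java
import java.util.EnumMap;
import java.util.EnumSet;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration only; these names are not HBase classes.
class ZnodeStateSketch {
    enum ZnodeState { OFFLINE, OPENING, OPENED, FAILED_OPEN }

    private static final Map<ZnodeState, Set<ZnodeState>> ALLOWED =
        new EnumMap<>(ZnodeState.class);
    static {
        // The master creates the node as OFFLINE; the rs moves it to OPENING.
        ALLOWED.put(ZnodeState.OFFLINE, EnumSet.of(ZnodeState.OPENING));
        // OPENING can be refreshed, completed, or marked as failed.
        ALLOWED.put(ZnodeState.OPENING,
            EnumSet.of(ZnodeState.OPENING, ZnodeState.OPENED, ZnodeState.FAILED_OPEN));
        // Terminal states in this sketch: the master takes over from here.
        ALLOWED.put(ZnodeState.OPENED, EnumSet.noneOf(ZnodeState.class));
        ALLOWED.put(ZnodeState.FAILED_OPEN, EnumSet.noneOf(ZnodeState.class));
    }

    static boolean canTransition(ZnodeState from, ZnodeState to) {
        return ALLOWED.get(from).contains(to);
    }

    public static void main(String[] args) {
        System.out.println(canTransition(ZnodeState.OFFLINE, ZnodeState.OPENING));
        System.out.println(canTransition(ZnodeState.OFFLINE, ZnodeState.OPENED));
    }
}
```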

[Figure: rough class diagram]

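On the HMaster side, BulkAssigner assigns regions in bulk. By default regions are distributed randomly and evenly, and the assignment itself is carried out through an RPC call.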

public boolean bulkAssign(boolean sync) throws InterruptedException,
      IOException {
    boolean result = false;
    ThreadFactoryBuilder builder = new ThreadFactoryBuilder();
    builder.setDaemon(true);
    builder.setNameFormat(getThreadNamePrefix() + "-%1$d");
    builder.setUncaughtExceptionHandler(getUncaughtExceptionHandler());
    int threadCount = getThreadCount();
    java.util.concurrent.ExecutorService pool =
      Executors.newFixedThreadPool(threadCount, builder.build());
    try {
      // Submit the tasks; each task is a SingleServerBulkAssigner
      populatePool(pool);
      // How long to wait on empty regions-in-transition.  If we timeout, the
      // RIT monitor should do fixup.
      // Wait for completion
      if (sync) result = waitUntilDone(getTimeoutOnRIT());
    } finally {
      // We're done with the pool.  It'll exit when its done all in queue.
      pool.shutdown();
    }
    return result;
  }
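
The plan that the bulk assigner executes spreads regions evenly over the region servers in random order. Here is a minimal sketch of such a plan, assuming hypothetical names (`BulkPlanSketch`, `plan`) and plain strings in place of HRegionInfo and ServerName. It is a shuffled round-robin for illustration, not HBase's actual balancer code.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

// Hypothetical illustration only; not HBase's balancer implementation.
class BulkPlanSketch {
    /** Maps each server to the regions it should open; every region lands on exactly one server. */
    static Map<String, List<String>> plan(List<String> regions, List<String> servers, Random rnd) {
        if (servers.isEmpty()) {
            throw new IllegalArgumentException("no region servers available");
        }
        // Shuffle so that no server is systematically favoured.
        List<String> shuffled = new ArrayList<>(servers);
        Collections.shuffle(shuffled, rnd);

        Map<String, List<String>> plan = new HashMap<>();
        for (String server : shuffled) {
            plan.put(server, new ArrayList<>());
        }
        // Round-robin keeps the per-server counts within one of each other.
        for (int i = 0; i < regions.size(); i++) {
            plan.get(shuffled.get(i % shuffled.size())).add(regions.get(i));
        }
        return plan;
    }

    public static void main(String[] args) {
        List<String> regions = new ArrayList<>();
        for (int i = 0; i < 7; i++) regions.add("region-" + i);
        List<String> servers = new ArrayList<>();
        Collections.addAll(servers, "rsA", "rsB", "rsC");
        System.out.println(plan(regions, servers, new Random()));
    }
}
```

Each entry of the resulting map corresponds to one per-server task, which is what lets the per-rs assignments run concurrently.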

The wait procedure:

  boolean waitUntilNoRegionsInTransition(final long timeout, Set<HRegionInfo> regions)
  throws InterruptedException {
    // Blocks until there are no regions in transition.
    // Keep waiting while any of the regions to process is still in the transition list.
    // The timeout is set by hbase.bulk.assignment.waiton.empty.rit; the default is 5 minutes.
    long startTime = System.currentTimeMillis();
    long remaining = timeout;
    boolean stillInTransition = true;
    synchronized (regionsInTransition) {
      while (regionsInTransition.size() > 0 && !this.master.isStopped() &&
          remaining > 0 && stillInTransition) {
        int count = 0;
        for (RegionState rs : regionsInTransition.values()) {
          if (regions.contains(rs.getRegion())) {
            count++;
            break;
          }
        }
        if (count == 0) {
          stillInTransition = false;
          break;
        }
        regionsInTransition.wait(remaining);
        remaining = timeout - (System.currentTimeMillis() - startTime);
      }
    }
    return stillInTransition;
  }
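
The wait above is a standard monitor wait with a deadline: the waiter holds the lock on the transition map, re-checks its condition after every wake-up, and recomputes the remaining time. A minimal self-contained sketch of that pattern follows, assuming hypothetical names (`RitWaitSketch`, `waitUntilDone`) and string region names. Its return convention (true means nothing is pending any more) is chosen for this sketch and is not taken from HBase.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration of wait/notify with a timeout; not HBase code.
class RitWaitSketch {
    private final Map<String, String> inTransition = new HashMap<>();

    void add(String region) {
        synchronized (inTransition) {
            inTransition.put(region, "PENDING_OPEN");
        }
    }

    void remove(String region) {
        synchronized (inTransition) {
            if (inTransition.remove(region) != null) {
                inTransition.notifyAll(); // wake up any waiter so it re-checks
            }
        }
    }

    /** Returns true if none of the given regions is in transition before the timeout expires. */
    boolean waitUntilDone(Set<String> regions, long timeoutMs) {
        long deadline = System.currentTimeMillis() + timeoutMs;
        synchronized (inTransition) {
            while (true) {
                boolean pending = false;
                for (String region : regions) {
                    if (inTransition.containsKey(region)) {
                        pending = true;
                        break;
                    }
                }
                if (!pending) return true;
                long remaining = deadline - System.currentTimeMillis();
                if (remaining <= 0) return false;
                try {
                    inTransition.wait(remaining);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return false;
                }
            }
        }
    }
}
```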

AssignmentManager provides assign(final ServerName destination, final List<HRegionInfo> regions) to bulk-assign regions to a single rs:

void assign(final ServerName destination,
      final List<HRegionInfo> regions) {
    ....
    // Force the initial state to offline
    List<RegionState> states = new ArrayList<RegionState>(regions.size());
    synchronized (this.regionsInTransition) {
      for (HRegionInfo region: regions) {
        states.add(forceRegionStateToOffline(region));
      }
    }
    .....
    
    // Presumption is that only this thread will be updating the state at this
    // time; i.e. handlers on backend won't be trying to set it to OPEN, etc.
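    // Create a zk node under /hbase/unassigned for each region to be assigned, with initial state offline.
    // Once the node is created, its callback calls zk exists to set a watcher; the callback of that
    // exists call sets the region state to 'PENDING_OPEN' and increments the counter.
    // This has to succeed for every region.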
    AtomicInteger counter = new AtomicInteger(0);
    CreateUnassignedAsyncCallback cb =
      new CreateUnassignedAsyncCallback(this.watcher, destination, counter);
    for (RegionState state: states) {
      if (!asyncSetOfflineInZooKeeper(state, destination, cb, state)) {
        return;
      }
    }
    // Wait until all unassigned nodes have been put up and watchers set.
    int total = regions.size();
    for (int oldCounter = 0; true;) {
      int count = counter.get();
      if (oldCounter != count) {
        LOG.info(destination.toString() + " unassigned znodes=" + count +
          " of total=" + total);
        oldCounter = count;
      }
      if (count == total) break;
      Threads.sleep(1);
    }
    // Move on to open regions.
    try {
      // Send OPEN RPC. If it fails on a IOE or RemoteException, the
      // TimeoutMonitor will pick up the pieces.
      // Send the RPC request to the rs; if the rpc fails it can be retried, for at most 60s
      long maxWaitTime = System.currentTimeMillis() +
        this.master.getConfiguration().
          getLong("hbase.regionserver.rpc.startup.waittime", 60000);
      while (!this.master.isStopped()) {
        try {
          this.serverManager.sendRegionOpen(destination, regions);
          break;
        } catch (RemoteException e) {
          IOException decodedException = e.unwrapRemoteException();
          if (decodedException instanceof RegionServerStoppedException) {
            LOG.warn("The region server was shut down, ", decodedException);
            // No need to retry, the region server is a goner.
            return;
          } else if (decodedException instanceof ServerNotRunningYetException) {
            // This is the one exception to retry.  For all else we should just fail
            // the startup.
            long now = System.currentTimeMillis();
            if (now > maxWaitTime) throw e;
            LOG.debug("Server is not yet up; waiting up to " +
                (maxWaitTime - now) + "ms", e);
            Thread.sleep(1000);
          }

          throw decodedException;
        }
      }
    } 
	.......
  }
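
The OPEN RPC is retried only for one kind of failure (the server is not up yet) and only until a deadline. A minimal sketch of that retry-until-deadline pattern follows, assuming hypothetical names throughout (`RetryUntilDeadlineSketch`, `NotReadyException`, `callWithRetry`); none of them are HBase APIs. In this sketch the retryable branch loops back explicitly and every other exception propagates at once.

```java
import java.util.concurrent.Callable;

// Hypothetical illustration; none of these names are HBase APIs.
class RetryUntilDeadlineSketch {
    /** Stand-in for a "server not running yet" error that is worth retrying. */
    static class NotReadyException extends Exception {
        NotReadyException(String msg) { super(msg); }
    }

    static <T> T callWithRetry(Callable<T> call, long maxWaitMs, long pauseMs)
            throws Exception {
        long deadline = System.currentTimeMillis() + maxWaitMs;
        while (true) {
            try {
                return call.call();
            } catch (NotReadyException e) {
                // The only retryable case: give up once the deadline has passed.
                if (System.currentTimeMillis() > deadline) throw e;
                Thread.sleep(pauseMs);
            }
            // Any other exception is not caught and reaches the caller immediately.
        }
    }

    /** Simulates a server that rejects the first two calls; returns the attempt count. */
    static int demo() {
        final int[] attempts = {0};
        try {
            return RetryUntilDeadlineSketch.<Integer>callWithRetry(() -> {
                attempts[0]++;
                if (attempts[0] < 3) throw new NotReadyException("not yet up");
                return attempts[0];
            }, 1000, 5);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("succeeded after " + demo() + " attempts");
    }
}
```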

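The rs exposes the RPC interface HRegionInterface.openRegions(final List<HRegionInfo> regions). The rs initializes the region and reports success or failure to the master through the zk node state; this is an asynchronous process.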

For user tables, the open is handled by OpenRegionHandler:

public void process() throws IOException {
    try {
     .....

      // If fails, just return.  Someone stole the region from under us.
      // Calling transitionZookeeperOfflineToOpening initalizes this.version.
      // Change the node state under /hbase/unassigned from 'offline' to 'opening'
      if (!transitionZookeeperOfflineToOpening(encodedName,
          versionOfOfflineNode)) {
        LOG.warn("Region was hijacked? It no longer exists, encodedName=" +
          encodedName);
        return;
      }

      // Open region.  After a successful open, failures in subsequent
      // processing needs to do a close as part of cleanup.
      // Perform the open
      region = openRegion();
      if (region == null) {
        tryTransitionToFailedOpen(regionInfo);
        return;
      }
      boolean failed = true;
      // After a successful open, first refresh the timestamp of the zk node, then update the
      // region's record in the meta table, mainly the serverstartcode and server columns
      if (tickleOpening("post_region_open")) {
        if (updateMeta(region)) {
          failed = false;
        }
      }
      // If the update failed, or the server is stopping, close the region and set the zk node state to 'FAILED_OPEN'
      if (failed || this.server.isStopped() ||
          this.rsServices.isStopping()) {
        cleanupFailedOpen(region);
        tryTransitionToFailedOpen(regionInfo);
        return;
      }
      // Set the zk node state to 'OPENED'; if that fails, close the region
      if (!transitionToOpened(region)) {
        // If we fail to transition to opened, it's because of one of two cases:
        //    (a) we lost our ZK lease
        // OR (b) someone else opened the region before us
        // In either case, we don't need to transition to FAILED_OPEN state.
        // In case (a), the Master will process us as a dead server. In case
        // (b) the region is already being handled elsewhere anyway.
        cleanupFailedOpen(region);
        return;
      }
      // Successful region open, and add it to OnlineRegions
      // Add it to the online list
      this.rsServices.addToOnlineRegions(region);

      .....
  }
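
Every state change the handler makes is conditional: it names the state it expects and the node version it last saw, so an rs that lost the region to someone else fails the transition instead of overwriting newer data. The following is a minimal in-memory sketch of that compare-and-set idea, assuming a hypothetical `VersionedNodeSketch` class that stands in for a zk node. It does not use the ZooKeeper client API.

```java
// Hypothetical in-memory stand-in for a versioned zk node; not the ZooKeeper API.
class VersionedNodeSketch {
    private String state;
    private int version;

    VersionedNodeSketch(String initialState) {
        this.state = initialState;
        this.version = 0;
    }

    synchronized String getState() { return state; }

    synchronized int getVersion() { return version; }

    /**
     * Moves the node from one state to another only if both the current state and
     * the current version match what the caller expects.
     * Returns the new version, or -1 if the caller's view was stale.
     */
    synchronized int transition(String from, String to, int expectedVersion) {
        if (version != expectedVersion || !state.equals(from)) {
            return -1;
        }
        state = to;
        version++;
        return version;
    }

    public static void main(String[] args) {
        VersionedNodeSketch node = new VersionedNodeSketch("OFFLINE");
        int v = node.transition("OFFLINE", "OPENING", 0);
        System.out.println("first attempt, new version: " + v);
        // A second caller still holding version 0 is rejected.
        System.out.println("stale attempt: " + node.transition("OFFLINE", "OPENING", 0));
        System.out.println("to OPENED, new version: " + node.transition("OPENING", "OPENED", v));
    }
}
```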

Region initialization

private long initializeRegionInternals(final CancelableProgressable reporter,
      MonitoredTask status) throws IOException, UnsupportedEncodingException {
    .....

    // Write HRI to a file in case we need to recover .META.
    status.setStatus("Writing region info on filesystem");
    // Write the .regioninfo file; its content is the serialized HRegionInfo, i.e. the region's metadata
    checkRegioninfoOnFilesystem();

    // Remove temporary data left over from old regions
    status.setStatus("Cleaning up temporary data from old regions");
    // Delete the .tmp directory
    cleanupTmpDir();

    // Load in all the HStores.
    //
    // Context: During replay we want to ensure that we do not lose any data. So, we
    // have to be conservative in how we replay logs. For each store, we calculate
    // the maxSeqId up to which the store was flushed. And, skip the edits which
    // is equal to or lower than maxSeqId for each store.
    // Each family gets its own thread to load its store.
    // Once all stores are loaded, take the largest seqId and memstoreTS.
    Map<byte[], Long> maxSeqIdInStores = new TreeMap<byte[], Long>(
        Bytes.BYTES_COMPARATOR);
    long maxSeqId = -1;
    // initialized to -1 so that we pick up MemstoreTS from column families
    long maxMemstoreTS = -1;

    if (this.htableDescriptor != null &&
        !htableDescriptor.getFamilies().isEmpty()) {
      // initialize the thread pool for opening stores in parallel.
      ThreadPoolExecutor storeOpenerThreadPool =
        getStoreOpenAndCloseThreadPool(
          "StoreOpenerThread-" + this.regionInfo.getRegionNameAsString());
      CompletionService<Store> completionService =
        new ExecutorCompletionService<Store>(storeOpenerThreadPool);

      // initialize each store in parallel
      for (final HColumnDescriptor family : htableDescriptor.getFamilies()) {
        status.setStatus("Instantiating store for column family " + family);
        completionService.submit(new Callable<Store>() {
          public Store call() throws IOException {
            return instantiateHStore(tableDir, family);
          }
        });
      }
      try {
        for (int i = 0; i < htableDescriptor.getFamilies().size(); i++) {
          Future<Store> future = completionService.take();
          Store store = future.get();

          this.stores.put(store.getColumnFamilyName().getBytes(), store);
          long storeSeqId = store.getMaxSequenceId();
          maxSeqIdInStores.put(store.getColumnFamilyName().getBytes(),
              storeSeqId);
          if (maxSeqId == -1 || storeSeqId > maxSeqId) {
            maxSeqId = storeSeqId;
          }
          long maxStoreMemstoreTS = store.getMaxMemstoreTS();
          if (maxStoreMemstoreTS > maxMemstoreTS) {
            maxMemstoreTS = maxStoreMemstoreTS;
          }
        }
      ......
    }
    mvcc.initialize(maxMemstoreTS + 1);
    // Recover any edits if available.
    maxSeqId = Math.max(maxSeqId, replayRecoveredEditsIfAny(
        this.regiondir, maxSeqIdInStores, reporter, status));

	.......
 
    this.lastFlushTime = EnvironmentEdgeManager.currentTimeMillis();
    // Use maximum of log sequenceid or that which was found in stores
    // (particularly if no recovered edits, seqid will be -1).
    // Increment the seqid
    long nextSeqid = maxSeqId + 1;
    ......
    return nextSeqid;
  }
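
The store-loading part of the method is a fan-out and collect: one task per column family, results taken in completion order, and a running maximum kept over them. A minimal sketch with the same java.util.concurrent classes follows, assuming a hypothetical `ParallelOpenSketch` class in which each "store" is reduced to the sequence id it would report.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.CompletionService;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorCompletionService;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical illustration; a "store" here is just the sequence id it would report.
class ParallelOpenSketch {
    /** Loads every store in parallel and returns the largest sequence id, or -1 if there are none. */
    static long maxSeqId(List<Long> storeSeqIds) {
        int threads = Math.max(1, Math.min(4, storeSeqIds.size()));
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        CompletionService<Long> completion = new ExecutorCompletionService<>(pool);
        try {
            for (final Long seqId : storeSeqIds) {
                completion.submit(() -> seqId); // stand-in for opening one store
            }
            long max = -1;
            for (int i = 0; i < storeSeqIds.size(); i++) {
                // take() blocks until the next task finishes, in completion order.
                max = Math.max(max, completion.take().get());
            }
            return max;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        long max = maxSeqId(Arrays.asList(5L, 42L, 17L));
        // As in the method above, the next sequence id is the maximum plus one.
        System.out.println("next seqid = " + (max + 1));
    }
}
```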

That is all of the rs-side processing. On the master side, a zk watcher listens for the region state changes made by the rs; AssignmentManager's nodeDataChanged method handles them.

  public void nodeDataChanged(String path) {
    if(path.startsWith(watcher.assignmentZNode)) {
      try {
        Stat stat = new Stat();
        // When the data changes, fetch it and set the watcher again so the next change is handled too
        RegionTransitionData data = ZKAssign.getDataAndWatch(watcher, path, stat);
        if (data == null) {
          return;
        }
        handleRegion(data, stat.getVersion());
      } catch (KeeperException e) {
        master.abort("Unexpected ZK exception reading unassigned node data", e);
      }
    }
  }
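
The reason nodeDataChanged reads the data and sets the watcher in one step is that a ZooKeeper watch fires only once; a handler that wants every change has to re-register each time. A minimal in-memory sketch of that one-shot behaviour follows, assuming a hypothetical `OneShotWatchSketch` class that imitates a node with a single pending watch. It does not use the ZooKeeper client.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical in-memory imitation of a one-shot watch; not the ZooKeeper client API.
class OneShotWatchSketch {
    private String data;
    private Runnable pendingWatch; // cleared as soon as it fires

    OneShotWatchSketch(String initialData) {
        this.data = initialData;
    }

    /** Reads the current data and registers a watch for the next change. */
    synchronized String getDataAndWatch(Runnable watch) {
        pendingWatch = watch;
        return data;
    }

    void setData(String newData) {
        Runnable toFire;
        synchronized (this) {
            data = newData;
            toFire = pendingWatch;
            pendingWatch = null; // one-shot: the watch is consumed
        }
        if (toFire != null) toFire.run();
    }

    /** Returns the states observed by a handler that re-registers itself on every event. */
    static List<String> demo() {
        final OneShotWatchSketch node = new OneShotWatchSketch("OFFLINE");
        final List<String> seen = new ArrayList<>();
        final Runnable[] handler = new Runnable[1];
        // On each event: read the new data and set the watch again.
        handler[0] = () -> seen.add(node.getDataAndWatch(handler[0]));
        node.getDataAndWatch(handler[0]); // initial registration
        node.setData("OPENING");
        node.setData("OPENED");
        return seen;
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

If the handler did not re-register, the second change would go unnoticed, which for the master would mean missing the 'opened' state.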

When the rs sets the region state to 'opening':

case RS_ZK_REGION_OPENING:
          .....
          // Transition to OPENING (or update stamp if already OPENING)
          // Update the timestamp
          regionState.update(RegionState.State.OPENING,
              data.getStamp(), data.getOrigin());
          break;

When the rs sets the region state to 'opened':

case RS_ZK_REGION_OPENED:
          ......
          // Handle OPENED by removing from transition and deleted zk node
          // Change the in-memory state to open
          regionState.update(RegionState.State.OPEN,
              data.getStamp(), data.getOrigin());
          this.executorService.submit(
            new OpenedRegionHandler(master, this, regionState.getRegion(),
              data.getOrigin(), expectedVersion));
          break;

OpenedRegionHandler mainly deletes the region node created earlier under /hbase/unassigned:

  public void process() {
    // Code to defend against case where we get SPLIT before region open
    // processing completes; temporary till we make SPLITs go via zk -- 0.92.
    RegionState regionState = this.assignmentManager.isRegionInTransition(regionInfo);
    boolean openedNodeDeleted = false;
    if (regionState != null
        && regionState.getState().equals(RegionState.State.OPEN)) {
      openedNodeDeleted = deleteOpenedNode(expectedVersion);
      if (!openedNodeDeleted) {
        LOG.error("The znode of region " + regionInfo.getRegionNameAsString()
            + " could not be deleted.");
      }
    } 
	......
  }

Deleting the node triggers another zk notification, which is handled by AssignmentManager's nodeDeleted method:

  public void nodeDeleted(final String path) {
    if (path.startsWith(this.watcher.assignmentZNode)) {
      String regionName = ZKAssign.getRegionName(this.master.getZooKeeper(), path);
      RegionState rs = this.regionsInTransition.get(regionName);
      if (rs != null) {
        HRegionInfo regionInfo = rs.getRegion();
        if (rs.isSplit()) {
          LOG.debug("Ephemeral node deleted, regionserver crashed?, " +
            "clearing from RIT; rs=" + rs);
          regionOffline(rs.getRegion());
        } else {
          LOG.debug("The znode of region " + regionInfo.getRegionNameAsString()
              + " has been deleted.");
          if (rs.isOpened()) {
            makeRegionOnline(rs, regionInfo);
          }
        }
      }
    }
  }

The region goes online: it is removed from the transition list, and the servers and regions lists are updated.

  void regionOnline(HRegionInfo regionInfo, ServerName sn) {
    synchronized (this.regionsInTransition) {
      RegionState rs =
        this.regionsInTransition.remove(regionInfo.getEncodedName());
      if (rs != null) {
        this.regionsInTransition.notifyAll();
      }
    }
    synchronized (this.regions) {
      // Add check
      ServerName oldSn = this.regions.get(regionInfo);
      if (oldSn != null && serverManager.isServerOnline(oldSn)) {
        LOG.warn("Overwriting " + regionInfo.getEncodedName() + " on old:"
            + oldSn + " with new:" + sn);
        // remove region from old server
        Set<HRegionInfo> hris = servers.get(oldSn);
        if (hris != null) {
          hris.remove(regionInfo);
        }
      }
      
      if (isServerOnline(sn)) {
        this.regions.put(regionInfo, sn);
        addToServers(sn, regionInfo);
        this.regions.notifyAll();
      } else {
        LOG.info("The server is not in online servers, ServerName=" + 
          sn.getServerName() + ", region=" + regionInfo.getEncodedName());
      }
    }
    // Remove plan if one.
    clearRegionPlan(regionInfo);
    // Add the server to serversInUpdatingTimer
    addToServersInUpdatingTimer(sn);
  }
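
The bookkeeping in regionOnline keeps two views consistent: region to server, and server to its set of regions. When a region shows up on a new server, it has to leave the old server's set. A minimal sketch of that double index follows, assuming a hypothetical `OnlineRegionsSketch` class with string names. It leaves out the liveness checks, region plans and timers of the real method.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration of the two-way index; not HBase's AssignmentManager.
class OnlineRegionsSketch {
    private final Map<String, String> regionToServer = new HashMap<>();
    private final Map<String, Set<String>> serverToRegions = new HashMap<>();

    synchronized void regionOnline(String region, String server) {
        String oldServer = regionToServer.put(region, server);
        if (oldServer != null && !oldServer.equals(server)) {
            // The region moved: drop it from the old server's set.
            Set<String> oldRegions = serverToRegions.get(oldServer);
            if (oldRegions != null) oldRegions.remove(region);
        }
        serverToRegions.computeIfAbsent(server, k -> new HashSet<>()).add(region);
    }

    synchronized String serverOf(String region) {
        return regionToServer.get(region);
    }

    /** Returns a copy, so callers cannot modify the internal index. */
    synchronized Set<String> regionsOn(String server) {
        Set<String> regions = serverToRegions.get(server);
        return regions == null ? Collections.<String>emptySet() : new HashSet<>(regions);
    }

    public static void main(String[] args) {
        OnlineRegionsSketch index = new OnlineRegionsSketch();
        index.regionOnline("t1-region", "rsA");
        index.regionOnline("t1-region", "rsB");
        System.out.println(index.serverOf("t1-region") + " " + index.regionsOn("rsA"));
    }
}
```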

Summary

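Key points of region assignment:

1. Region load balancing: by default regions are distributed randomly and evenly.

2. The master creates a node for the region under /hbase/unassigned, which it then uses to interact with the rs.

3. The rs initializes the region's files and directories on HDFS, including the .regioninfo file and the family directories.

4. After the rs opens the region it sets the state to 'opened'; the master then considers the region assignment successful, deletes the node, and records the region in its online list.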