
HBase memstore flush source code analysis


The source analyzed is HBase 0.98.1.

HRegionServer starts the MemStoreFlusher thread:

private void initializeThreads() throws IOException {
    // Cache flushing thread.
    this.cacheFlusher = new MemStoreFlusher(conf, this);

    // Compaction thread
    this.compactSplitThread = new CompactSplitThread(this);

    // ...
  }

 

  private void startServiceThreads() throws IOException {
    String n = Thread.currentThread().getName();
    // ...
    this.cacheFlusher.start(uncaughtExceptionHandler);

    Threads.setDaemonThreadRunning(this.compactionChecker.getThread(), n +
      ".compactionChecker", uncaughtExceptionHandler);

    // ...
  }

 

 /*
   * Run init. Sets up hlog and starts up all server threads.
   *
   * @param c Extra configuration.
   */
  protected void handleReportForDutyResponse(final RegionServerStartupResponse c)
  throws IOException {
    // ...

      startServiceThreads();
    // ...
  }

 

  public void run() {
    try {
      // Do pre-registration initializations; zookeeper, lease threads, etc.
      preRegistrationInitialization();
    } catch (Throwable e) {
      abort("Fatal exception during initialization", e);
    }

    try {
      // Try and register with the Master; tell it we are here.  Break if
      // server is stopped or the clusterup flag is down or hdfs went wacky.
      while (keepLooping()) {
        RegionServerStartupResponse w = reportForDuty();
        if (w == null) {
          LOG.warn("reportForDuty failed; sleeping and then retrying.");
          this.sleeper.sleep();
        } else {
          handleReportForDutyResponse(w); // start all HRegionServer service threads
          break;
        }
      }
      // ...

 

 

The key class and method: MemStoreFlusher.flushRegion

 private boolean flushRegion(final HRegion region, final boolean emergencyFlush) {
    synchronized (this.regionsInQueue) {
      FlushRegionEntry fqe = this.regionsInQueue.remove(region);
      if (fqe != null && emergencyFlush) {
        // Need to remove from region from delay queue.  When NOT an
        // emergencyFlush, then item was removed via a flushQueue.poll.
        flushQueue.remove(fqe);
      }
    }
    lock.readLock().lock();
    try {
      boolean shouldCompact = region.flushcache();
      // We just want to check the size
      boolean shouldSplit = region.checkSplit() != null;
      if (shouldSplit) {
        this.server.compactSplitThread.requestSplit(region);
      } else if (shouldCompact) {
        server.compactSplitThread.requestSystemCompaction(
            region, Thread.currentThread().getName());
      }
      // ...

A FlushRegionEntry is taken from the flushQueue and flushed.

With the read lock held:

  1.  ask HRegion to flush; it returns whether a compaction is needed
  2.  ask HRegion whether the region should split
  3.  if (split) split, else if (compact) compact

Each step in detail:

--------------------------------------------------------------------------------------------------------------------

 

1. HRegion

 protected boolean internalFlushcache(
      final HLog wal, final long myseqid, MonitoredTask status)
  throws IOException {
    if (this.rsServices != null && this.rsServices.isAborted()) {
      // Don't flush when server aborting, it's unsafe
      throw new IOException("Aborting flush because server is abortted...");
    }
    final long startTime = EnvironmentEdgeManager.currentTimeMillis();
    // Clear flush flag.
    // If nothing to flush, return and avoid logging start/stop flush.
    if (this.memstoreSize.get() <= 0) {
      if(LOG.isDebugEnabled()) {
        LOG.debug("Empty memstore size for the current region "+this);
      }
      return false;
    }
    if (LOG.isDebugEnabled()) {
      LOG.debug("Started memstore flush for " + this +
        ", current region memstore size " +
        StringUtils.humanReadableInt(this.memstoreSize.get()) +
        ((wal != null)? "": "; wal is null, using passed sequenceid=" + myseqid));
    }

    // Stop updates while we snapshot the memstore of all stores. We only have
    // to do this for a moment.  Its quick.  The subsequent sequence id that
    // goes into the HLog after we've flushed all these snapshots also goes
    // into the info file that sits beside the flushed files.
    // We also set the memstore size to zero here before we allow updates
    // again so its value will represent the size of the updates received
    // during the flush
    MultiVersionConsistencyControl.WriteEntry w = null;

    // We have to take a write lock during snapshot, or else a write could
    // end up in both snapshot and memstore (makes it difficult to do atomic
    // rows then)
    status.setStatus("Obtaining lock to block concurrent updates");
    // block waiting for the lock for internal flush
    this.updatesLock.writeLock().lock();
    long totalFlushableSize = 0;
    status.setStatus("Preparing to flush by snapshotting stores");
    List<StoreFlushContext> storeFlushCtxs = new ArrayList<StoreFlushContext>(stores.size());
    long flushSeqId = -1L;
    try {
      // Record the mvcc for all transactions in progress.
      w = mvcc.beginMemstoreInsert();
      mvcc.advanceMemstore(w);
      // check if it is not closing.
      if (wal != null) {
        if (!wal.startCacheFlush(this.getRegionInfo().getEncodedNameAsBytes())) {
          status.setStatus("Flush will not be started for ["
              + this.getRegionInfo().getEncodedName() + "] - because the WAL is closing.");
          return false;
        }
        flushSeqId = this.sequenceId.incrementAndGet();
      } else {
        // use the provided sequence Id as WAL is not being used for this flush.
        flushSeqId = myseqid;
      }

      for (Store s : stores.values()) {
        totalFlushableSize += s.getFlushableSize();
        storeFlushCtxs.add(s.createFlushContext(flushSeqId));
      }

      // prepare flush (take a snapshot)
      for (StoreFlushContext flush : storeFlushCtxs) {
//Step 1   @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
        flush.prepare(); 
      }
    } finally {
      this.updatesLock.writeLock().unlock();
    }
    String s = "Finished memstore snapshotting " + this +
      ", syncing WAL and waiting on mvcc, flushsize=" + totalFlushableSize;
    status.setStatus(s);
    if (LOG.isTraceEnabled()) LOG.trace(s);

    // sync unflushed WAL changes when deferred log sync is enabled
    // see HBASE-8208 for details
    if (wal != null && !shouldSyncLog()) {
//Step 2  @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
      wal.sync();
    }

    // wait for all in-progress transactions to commit to HLog before
    // we can start the flush. This prevents
    // uncommitted transactions from being written into HFiles.
    // We have to block before we start the flush, otherwise keys that
    // were removed via a rollbackMemstore could be written to Hfiles.
    mvcc.waitForRead(w);

    s = "Flushing stores of " + this;
    status.setStatus(s);
    if (LOG.isTraceEnabled()) LOG.trace(s);

    // Any failure from here on out will be catastrophic requiring server
    // restart so hlog content can be replayed and put back into the memstore.
    // Otherwise, the snapshot content while backed up in the hlog, it will not
    // be part of the current running servers state.
    boolean compactionRequested = false;
    try {
      // A.  Flush memstore to all the HStores.
      // Keep running vector of all store files that includes both old and the
      // just-made new flush store file. The new flushed file is still in the
      // tmp directory.

      for (StoreFlushContext flush : storeFlushCtxs) {
//Step 3   @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
        flush.flushCache(status);
      }

      // Switch snapshot (in memstore) -> new hfile (thus causing
      // all the store scanners to reset/reseek).
      for (StoreFlushContext flush : storeFlushCtxs) {
//Step 4   @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
        boolean needsCompaction = flush.commit(status);
        if (needsCompaction) {
          compactionRequested = true;
        }
      }
      storeFlushCtxs.clear();

      // Set down the memstore size by amount of flush.
      this.addAndGetGlobalMemstoreSize(-totalFlushableSize);
    } catch (Throwable t) {
      // An exception here means that the snapshot was not persisted.
      // The hlog needs to be replayed so its content is restored to memstore.
      // Currently, only a server restart will do this.
      // We used to only catch IOEs but its possible that we'd get other
      // exceptions -- e.g. HBASE-659 was about an NPE -- so now we catch
      // all and sundry.
      if (wal != null) {
        wal.abortCacheFlush(this.getRegionInfo().getEncodedNameAsBytes());
      }
      DroppedSnapshotException dse = new DroppedSnapshotException("region: " +
          Bytes.toStringBinary(getRegionName()));
      dse.initCause(t);
      status.abort("Flush failed: " + StringUtils.stringifyException(t));
      throw dse;
    }

    // If we get to here, the HStores have been written.
    if (wal != null) {
      wal.completeCacheFlush(this.getRegionInfo().getEncodedNameAsBytes());
    }

    // Record latest flush time
    this.lastFlushTime = EnvironmentEdgeManager.currentTimeMillis();

    // Update the last flushed sequence id for region
    completeSequenceId = flushSeqId;

    // C. Finally notify anyone waiting on memstore to clear:
    // e.g. checkResources().
    synchronized (this) {
      notifyAll(); // FindBugs NN_NAKED_NOTIFY
    }

    long time = EnvironmentEdgeManager.currentTimeMillis() - startTime;
    long memstoresize = this.memstoreSize.get();
    String msg = "Finished memstore flush of ~" +
      StringUtils.humanReadableInt(totalFlushableSize) + "/" + totalFlushableSize +
      ", currentsize=" +
      StringUtils.humanReadableInt(memstoresize) + "/" + memstoresize +
      " for region " + this + " in " + time + "ms, sequenceid=" + flushSeqId +
      ", compaction requested=" + compactionRequested +
      ((wal == null)? "; wal=null": "");
    LOG.info(msg);
    status.setStatus(msg);
    this.recentFlushes.add(new Pair<Long,Long>(time/1000, totalFlushableSize));

    return compactionRequested;
  }

 MemStoreFlusher ends up in HRegion's internalFlushcache method (above). Step by step:

 1. HRegion 1661 -> HStore 1941: prepare() (takes the write lock). The MemStore copies kvset into a snapshot, which is the in-memory data this flush writes out (a sketch follows after this list).

 (Each flush covers every store in the region, so the smallest unit of a flush is the region, not the store; this is one reason having many column families is discouraged.)

 2. HRegion 1674: sync the WAL and wait for it to complete.

 3. HRegion 1700: HStore.flushCache writes the snapshot to a tmpfile (one tmpfile per HStore, even though the tmpfiles variable is a List).

 4. HRegion 1706: HStore wraps the newly written tmpfiles as HStoreFiles.

 HStore then calls updateStorefiles: it takes the write lock, adds the new files to the StoreFileManager's list so they start serving reads, and clears the snapshot.

    HStore 951: needsCompaction delegates to RatioBasedCompactionPolicy.needsCompaction to decide whether the store needs a compaction

    (the check: the store's HFile count exceeds the larger of hbase.hstore.compaction.min and hbase.hstore.compactionThreshold, both defaulting to 3; a sketch follows)
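
For the needsCompaction check in step 4, a minimal sketch that follows the description above (simplified; not the exact RatioBasedCompactionPolicy code, which reads both settings from the Configuration):

/** Simplified form of the compaction check described above. */
class NeedsCompactionSketch {
  // Both settings default to 3; per the note above, the larger one wins.
  static boolean needsCompaction(int storefileCount, int filesCompacting,
      int compactionMin, int compactionThreshold) {
    int minFilesToCompact = Math.max(compactionMin, compactionThreshold);
    // Files already being compacted are not candidates.
    return storefileCount - filesCompacting >= minFilesToCompact;
  }
}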
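
Backing up to step 1: prepare() boils down to MemStore's snapshot swap, done while the region holds updatesLock in write mode. A stand-in sketch of that swap (plain String maps instead of HBase's KeyValue sets; not the real MemStore code):

import java.util.concurrent.ConcurrentSkipListMap;

/** Illustrative stand-in for MemStore's kvset -> snapshot swap; not HBase API. */
class MemStoreSketch {
  ConcurrentSkipListMap<String, String> kvset = new ConcurrentSkipListMap<>();
  ConcurrentSkipListMap<String, String> snapshot = new ConcurrentSkipListMap<>();

  // Called with the region's updatesLock held in write mode, so no concurrent
  // write can land in both the snapshot and the fresh active set.
  void snapshot() {
    if (snapshot.isEmpty()) {                // never clobber an unflushed snapshot
      snapshot = kvset;                      // active cells become the read-only snapshot
      kvset = new ConcurrentSkipListMap<>(); // new writes go to a fresh set
    }
  }
}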

 

--------------------------------------------------------------------------------------------------------------------

 

2. HRegion checks whether the region should split; the implementation is the split policy class IncreasingToUpperBoundRegionSplitPolicy

  @Override
  protected boolean shouldSplit() {
    if (region.shouldForceSplit()) return true;
    boolean foundABigStore = false;
    // Get count of regions that have the same common table as this.region
    int tableRegionsCount = getCountOfCommonTableRegions();
    // Get size to check
    long sizeToCheck = getSizeToCheck(tableRegionsCount);

    for (Store store : region.getStores().values()) {
      // If any of the stores is unable to split (eg they contain reference files)
      // then don't split
      if ((!store.canSplit())) {
        return false;
      }

      // Mark if any store is big enough
      long size = store.getSize();
      if (size > sizeToCheck) {
        LOG.debug("ShouldSplit because " + store.getColumnFamilyName() +
          " size=" + size + ", sizeToCheck=" + sizeToCheck +
          ", regionsWithCommonTable=" + tableRegionsCount);
        foundABigStore = true;
      }
    }

    return foundABigStore;
  }

 IncreasingToUpperBoundRegionSplitPolicy 65, shouldSplit, decides whether this region needs to split.

(Again the check is done per region, so multiple column families really are a bad idea.)

((init) initialSize = hbase.increasing.policy.initial.size, a preset initial size, or else hbase.hregion.memstore.flush.size, the memstore flush size)

getCountOfCommonTableRegions gets the number of regions belonging to the same table as this.region; call it regioncount.

When regioncount is between 0 and 100, the size to check is the smaller of hbase.hregion.max.filesize (default 10G) and initialSize * regioncount^3; otherwise it is hbase.hregion.max.filesize (default 10G).

For example, with a single region: 128 * 1^3 = 128M

128 * 2^3 = 1024M

128 * 3^3 = 3456M

128 * 4^3 = 8192M

128 * 5^3 = 16000M (~15G) => 10G; once there are 5 regions the configured maximum takes over (a runnable sketch follows)
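
A runnable sketch of this threshold computation under the assumptions above (the field names here are mine, not the 0.98 ones):

/** Reproduces the table above; maxFileSize and initialSize stand in for the
 *  policy's configured values. */
public class SplitSizeSketch {
  long maxFileSize = 10L * 1024 * 1024 * 1024; // hbase.hregion.max.filesize, default 10G
  long initialSize = 128L * 1024 * 1024;       // e.g. a 128M memstore flush size

  long getSizeToCheck(int tableRegionsCount) {
    // Outside (0, 100), fall back to hbase.hregion.max.filesize.
    if (tableRegionsCount <= 0 || tableRegionsCount >= 100) {
      return maxFileSize;
    }
    return Math.min(maxFileSize,
        initialSize * tableRegionsCount * tableRegionsCount * tableRegionsCount);
  }

  public static void main(String[] args) {
    SplitSizeSketch s = new SplitSizeSketch();
    for (int n = 1; n <= 5; n++) {
      System.out.println(n + " region(s): " + s.getSizeToCheck(n) / (1024 * 1024) + "M");
    }
  }
}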

 

 

--------------------------------------------------------------------------------------------------------------------

 3. if (split) split, else if (compact) compact

 

 http://blackproof.iteye.com/blog/2037159

 

I took notes on this a while ago and had nearly forgotten them.

Here is another write-up, this time of region split.

 

The code that produces the two daughter regions: SplitTransaction.stepsBeforePONR
 public PairOfSameType<HRegion> stepsBeforePONR(final Server server,
      final RegionServerServices services, boolean testing) throws IOException {
    // Set ephemeral SPLITTING znode up in zk.  Mocked servers sometimes don't
    // have zookeeper so don't do zk stuff if server or zookeeper is null
    if (server != null && server.getZooKeeper() != null) {
      try {
    	    //Step 1 @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
        createNodeSplitting(server.getZooKeeper(),
          parent.getRegionInfo(), server.getServerName(), hri_a, hri_b);
      } catch (KeeperException e) {
        throw new IOException("Failed creating PENDING_SPLIT znode on " +
          this.parent.getRegionNameAsString(), e);
      }
    }
    this.journal.add(JournalEntry.SET_SPLITTING_IN_ZK);
    if (server != null && server.getZooKeeper() != null) {
      // After creating the split node, wait for master to transition it
      // from PENDING_SPLIT to SPLITTING so that we can move on. We want master
      // knows about it and won't transition any region which is splitting.
	    //Step 2 @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
      znodeVersion = getZKNode(server, services);
    }

    //Step 3 @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
    this.parent.getRegionFileSystem().createSplitsDir();
    this.journal.add(JournalEntry.CREATE_SPLIT_DIR);

    Map<byte[], List<StoreFile>> hstoreFilesToSplit = null;
    Exception exceptionToThrow = null;
    try{
	    //Step 4 @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
      hstoreFilesToSplit = this.parent.close(false);
    } catch (Exception e) {
      exceptionToThrow = e;
    }
    if (exceptionToThrow == null && hstoreFilesToSplit == null) {
      // The region was closed by a concurrent thread.  We can't continue
      // with the split, instead we must just abandon the split.  If we
      // reopen or split this could cause problems because the region has
      // probably already been moved to a different server, or is in the
      // process of moving to a different server.
      exceptionToThrow = closedByOtherException;
    }
    if (exceptionToThrow != closedByOtherException) {
      this.journal.add(JournalEntry.CLOSED_PARENT_REGION);
    }
    if (exceptionToThrow != null) {
      if (exceptionToThrow instanceof IOException) throw (IOException)exceptionToThrow;
      throw new IOException(exceptionToThrow);
    }
    if (!testing) {
	    //Step 5 @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
      services.removeFromOnlineRegions(this.parent, null);
    }
    this.journal.add(JournalEntry.OFFLINED_PARENT);

    // TODO: If splitStoreFiles were multithreaded would we complete steps in
    // less elapsed time?  St.Ack 20100920
    //
    // splitStoreFiles creates daughter region dirs under the parent splits dir
    // Nothing to unroll here if failure -- clean up of CREATE_SPLIT_DIR will
    // clean this up.
    //Step 6 @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
    splitStoreFiles(hstoreFilesToSplit);

    // Log to the journal that we are creating region A, the first daughter
    // region.  We could fail halfway through.  If we do, we could have left
    // stuff in fs that needs cleanup -- a storefile or two.  Thats why we
    // add entry to journal BEFORE rather than AFTER the change.
    //Step 7 @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
    this.journal.add(JournalEntry.STARTED_REGION_A_CREATION);
    HRegion a = this.parent.createDaughterRegionFromSplits(this.hri_a);

    // Ditto
    this.journal.add(JournalEntry.STARTED_REGION_B_CREATION);
    HRegion b = this.parent.createDaughterRegionFromSplits(this.hri_b);
    return new PairOfSameType<HRegion>(a, b);
  }

  1. RegionSplitPolicy.getSplitPoint() picks the split point for the region split: the midpoint (midkey) of the largest store (a sketch follows after this list).

  2. SplitRequest.run()

            instantiates SplitTransaction

            st.prepare(): pre-split checks: is the region closed, are any of its HFiles references

            st.execute(): carries out the split
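
A minimal sketch of the split-point choice in item 1; the Store interface here is a local stand-in, not org.apache.hadoop.hbase.regionserver.Store:

/** Split point = the midkey of the largest splittable store, mirroring the
 *  idea of RegionSplitPolicy.getSplitPoint(). */
class SplitPointSketch {
  interface Store {
    boolean canSplit();
    long getSize();
    byte[] getSplitPoint(); // roughly the midkey of the store's biggest HFile
  }

  static byte[] getSplitPoint(Iterable<Store> stores) {
    byte[] splitPoint = null;
    long largestSize = 0;
    for (Store s : stores) {
      byte[] p = s.canSplit() ? s.getSplitPoint() : null;
      if (p != null && s.getSize() > largestSize) {
        splitPoint = p;
        largestSize = s.getSize();
      }
    }
    return splitPoint; // null means the region cannot be split
  }
}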

 

 1. createDaughters creates the two daughter regions, taking the parent region's write lock:

      1. Create an ephemeral splitting znode in ZK

      2. Wait until the master transitions this region to the SPLITTING state

      3. Create the splits directory

      4. Wait for the region's flushes and compactions to finish, then close the region

      5. Remove the region from the HRegionServer and add it to the offlined regions

      6. Perform the region split: create a thread pool and split all HFiles (StoreFiles) under the region with the StoreFileSplitter class

      (wherever the split row falls inside an HFile, nothing is rewritten; reference files are created and written under each daughter region; a sketch follows after this list)

      7. Create the left and right daughter regions, delete the parent from meta, build each daughter's regioninfo from the reference files, and write it to HDFS
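
The reference files from step 6, in a hedged sketch (this mirrors the idea of 0.98's HRegionFileSystem.splitStoreFile per parent HFile; the actual file writing is omitted):

import org.apache.hadoop.hbase.io.Reference;

/** One tiny Reference per daughter instead of rewriting the parent HFile:
 *  daughter A gets the bottom half (keys < splitRow), daughter B the top half
 *  (keys >= splitRow). Each Reference is persisted as a small file naming the
 *  parent HFile, under the daughter's column-family directory. */
class SplitReferencesSketch {
  static void referencesFor(byte[] splitRow) {
    Reference bottom = Reference.createBottomReference(splitRow); // daughter A
    Reference top = Reference.createTopReference(splitRow);       // daughter B
  }
}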

 2. stepsAfterPONR calls DaughterOpener.run to open the two daughter regions, which calls initialize:

     a) Write the .regioninfo file to HDFS so the region can be recovered if meta is lost.

     b) Initialize the HStores underneath, mainly the loadStoreFiles function: for each store it constructs StoreFile objects from the paths and files fetched from HDFS, one object per file. Since what this region currently stores are reference files pointing at the corresponding files of its parent region, each StoreFile object opens a HalfStoreFileReader over the parent's file, so every read or write against this region is routed to the parent region.

    The daughter regions are then added to the regionserver's online region list and to the meta table (an illustrative sketch of the half-reader idea follows).
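
A purely illustrative stand-in (not the HBase API) for what HalfStoreFileReader achieves: expose only one half of the parent's sorted data until a compaction writes real files for the daughter:

import java.util.NavigableMap;
import java.util.TreeMap;

/** Each daughter sees just its half of the parent's sorted cells. */
class HalfReaderSketch {
  static NavigableMap<String, String> half(NavigableMap<String, String> parent,
      String splitRow, boolean top) {
    // top half: keys >= splitRow; bottom half: keys < splitRow.
    return top ? parent.tailMap(splitRow, true) : parent.headMap(splitRow, false);
  }

  public static void main(String[] args) {
    NavigableMap<String, String> parent = new TreeMap<>();
    parent.put("a", "1"); parent.put("m", "2"); parent.put("z", "3");
    System.out.println(half(parent, "m", false)); // daughter A: {a=1}
    System.out.println(half(parent, "m", true));  // daughter B: {m=2, z=3}
  }
}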

 

 

 
