HBase memstore flush source code analysis

Source version: HBase 0.98.1

HRegionServer starts the MemStoreFlusher thread:

private void initializeThreads() throws IOException {
    // Cache flushing thread.
    this.cacheFlusher = new MemStoreFlusher(conf, this);

    // Compaction thread
    this.compactSplitThread = new CompactSplitThread(this);

   .......
  private void startServiceThreads() throws IOException {
    String n = Thread.currentThread().getName();
......
    this.cacheFlusher.start(uncaughtExceptionHandler);

    Threads.setDaemonThreadRunning(this.compactionChecker.getThread(), n +
      ".compactionChecker", uncaughtExceptionHandler);

.....
 /*
   * Run init. Sets up hlog and starts up all server threads.
   *
   * @param c Extra configuration.
   */
  protected void handleReportForDutyResponse(final RegionServerStartupResponse c)
  throws IOException {
....

      startServiceThreads();
.....
  public void run() {
    try {
      // Do pre-registration initializations; zookeeper, lease threads, etc.
      preRegistrationInitialization();
    } catch (Throwable e) {
      abort("Fatal exception during initialization", e);
    }

    try {
      // Try and register with the Master; tell it we are here.  Break if
      // server is stopped or the clusterup flag is down or hdfs went wacky.
      while (keepLooping()) {
        RegionServerStartupResponse w = reportForDuty();
        if (w == null) {
          LOG.warn("reportForDuty failed; sleeping and then retrying.");
          this.sleeper.sleep();
        } else {
          handleReportForDutyResponse(w); // starts all HRegionServer service threads
          break;
        }
      }
....

 

The key class and method: MemStoreFlusher.flushRegion

 private boolean flushRegion(final HRegion region, final boolean emergencyFlush) {
    synchronized (this.regionsInQueue) {
      FlushRegionEntry fqe = this.regionsInQueue.remove(region);
      if (fqe != null && emergencyFlush) {
        // Need to remove from region from delay queue.  When NOT an
        // emergencyFlush, then item was removed via a flushQueue.poll.
        flushQueue.remove(fqe);
     }
    }
    lock.readLock().lock();
    try {
      boolean shouldCompact = region.flushcache();
      // We just want to check the size
      boolean shouldSplit = region.checkSplit() != null;
      if (shouldSplit) {
        this.server.compactSplitThread.requestSplit(region);
      } else if (shouldCompact) {
        server.compactSplitThread.requestSystemCompaction(
            region, Thread.currentThread().getName());
      }
......

 Take a FlushRegionEntry from the flushQueue and flush it.

 Acquire the read lock, then:

  1.  Call HRegion to flush the memstore; the call returns whether a compaction is needed
  2.  Call HRegion to check whether the region should split
  3.  if (split) request a split, else if (compact) request a compaction

The details of each step follow:

--------------------------------------------------------------------------------------------------------------------

 

1. HRegion flush (internalFlushcache)

 protected boolean internalFlushcache(
      final HLog wal, final long myseqid, MonitoredTask status)
  throws IOException {
    if (this.rsServices != null && this.rsServices.isAborted()) {
      // Don't flush when server aborting, it's unsafe
      throw new IOException("Aborting flush because server is abortted...");
    }
    final long startTime = EnvironmentEdgeManager.currentTimeMillis();
    // Clear flush flag.
    // If nothing to flush, return and avoid logging start/stop flush.
    if (this.memstoreSize.get() <= 0) {
      if(LOG.isDebugEnabled()) {
        LOG.debug("Empty memstore size for the current region "+this);
      }
      return false;
    }
    if (LOG.isDebugEnabled()) {
      LOG.debug("Started memstore flush for " + this +
        ", current region memstore size " +
        StringUtils.humanReadableInt(this.memstoreSize.get()) +
        ((wal != null)? "": "; wal is null, using passed sequenceid=" + myseqid));
    }

    // Stop updates while we snapshot the memstore of all stores. We only have
    // to do this for a moment.  Its quick.  The subsequent sequence id that
    // goes into the HLog after we've flushed all these snapshots also goes
    // into the info file that sits beside the flushed files.
    // We also set the memstore size to zero here before we allow updates
    // again so its value will represent the size of the updates received
    // during the flush
    MultiVersionConsistencyControl.WriteEntry w = null;

    // We have to take a write lock during snapshot, or else a write could
    // end up in both snapshot and memstore (makes it difficult to do atomic
    // rows then)
    status.setStatus("Obtaining lock to block concurrent updates");
    // block waiting for the lock for internal flush
    this.updatesLock.writeLock().lock();
    long totalFlushableSize = 0;
    status.setStatus("Preparing to flush by snapshotting stores");
    List<StoreFlushContext> storeFlushCtxs = new ArrayList<StoreFlushContext>(stores.size());
    long flushSeqId = -1L;
    try {
      // Record the mvcc for all transactions in progress.
      w = mvcc.beginMemstoreInsert();
      mvcc.advanceMemstore(w);
      // check if it is not closing.
      if (wal != null) {
        if (!wal.startCacheFlush(this.getRegionInfo().getEncodedNameAsBytes())) {
          status.setStatus("Flush will not be started for ["
              + this.getRegionInfo().getEncodedName() + "] - because the WAL is closing.");
          return false;
        }
        flushSeqId = this.sequenceId.incrementAndGet();
      } else {
        // use the provided sequence Id as WAL is not being used for this flush.
        flushSeqId = myseqid;
      }

      for (Store s : stores.values()) {
        totalFlushableSize += s.getFlushableSize();
        storeFlushCtxs.add(s.createFlushContext(flushSeqId));
      }

      // prepare flush (take a snapshot)
      for (StoreFlushContext flush : storeFlushCtxs) {
// Step 1 @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
        flush.prepare(); 
      }
    } finally {
      this.updatesLock.writeLock().unlock();
    }
    String s = "Finished memstore snapshotting " + this +
      ", syncing WAL and waiting on mvcc, flushsize=" + totalFlushableSize;
    status.setStatus(s);
    if (LOG.isTraceEnabled()) LOG.trace(s);

    // sync unflushed WAL changes when deferred log sync is enabled
    // see HBASE-8208 for details
    if (wal != null && !shouldSyncLog()) {
// Step 2 @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
      wal.sync();
    }

    // wait for all in-progress transactions to commit to HLog before
    // we can start the flush. This prevents
    // uncommitted transactions from being written into HFiles.
    // We have to block before we start the flush, otherwise keys that
    // were removed via a rollbackMemstore could be written to Hfiles.
    mvcc.waitForRead(w);

    s = "Flushing stores of " + this;
    status.setStatus(s);
    if (LOG.isTraceEnabled()) LOG.trace(s);

    // Any failure from here on out will be catastrophic requiring server
    // restart so hlog content can be replayed and put back into the memstore.
    // Otherwise, the snapshot content while backed up in the hlog, it will not
    // be part of the current running servers state.
    boolean compactionRequested = false;
    try {
      // A.  Flush memstore to all the HStores.
      // Keep running vector of all store files that includes both old and the
      // just-made new flush store file. The new flushed file is still in the
      // tmp directory.

      for (StoreFlushContext flush : storeFlushCtxs) {
// Step 3 @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
        flush.flushCache(status);
      }

      // Switch snapshot (in memstore) -> new hfile (thus causing
      // all the store scanners to reset/reseek).
      for (StoreFlushContext flush : storeFlushCtxs) {
// Step 4 @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
        boolean needsCompaction = flush.commit(status);
        if (needsCompaction) {
          compactionRequested = true;
        }
      }
      storeFlushCtxs.clear();

      // Set down the memstore size by amount of flush.
      this.addAndGetGlobalMemstoreSize(-totalFlushableSize);
    } catch (Throwable t) {
      // An exception here means that the snapshot was not persisted.
      // The hlog needs to be replayed so its content is restored to memstore.
      // Currently, only a server restart will do this.
      // We used to only catch IOEs but its possible that we'd get other
      // exceptions -- e.g. HBASE-659 was about an NPE -- so now we catch
      // all and sundry.
      if (wal != null) {
        wal.abortCacheFlush(this.getRegionInfo().getEncodedNameAsBytes());
      }
      DroppedSnapshotException dse = new DroppedSnapshotException("region: " +
          Bytes.toStringBinary(getRegionName()));
      dse.initCause(t);
      status.abort("Flush failed: " + StringUtils.stringifyException(t));
      throw dse;
    }

    // If we get to here, the HStores have been written.
    if (wal != null) {
      wal.completeCacheFlush(this.getRegionInfo().getEncodedNameAsBytes());
    }

    // Record latest flush time
    this.lastFlushTime = EnvironmentEdgeManager.currentTimeMillis();

    // Update the last flushed sequence id for region
    completeSequenceId = flushSeqId;

    // C. Finally notify anyone waiting on memstore to clear:
    // e.g. checkResources().
    synchronized (this) {
      notifyAll(); // FindBugs NN_NAKED_NOTIFY
    }

    long time = EnvironmentEdgeManager.currentTimeMillis() - startTime;
    long memstoresize = this.memstoreSize.get();
    String msg = "Finished memstore flush of ~" +
      StringUtils.humanReadableInt(totalFlushableSize) + "/" + totalFlushableSize +
      ", currentsize=" +
      StringUtils.humanReadableInt(memstoresize) + "/" + memstoresize +
      " for region " + this + " in " + time + "ms, sequenceid=" + flushSeqId +
      ", compaction requested=" + compactionRequested +
      ((wal == null)? "; wal=null": "");
    LOG.info(msg);
    status.setStatus(msg);
    this.recentFlushes.add(new Pair<Long,Long>(time/1000, totalFlushableSize));

    return compactionRequested;
  }
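
Before walking through the marked steps, here is a minimal, self-contained sketch of the snapshot-then-flush pattern that Steps 1, 3 and 4 implement (the WAL sync of Step 2 is omitted). MemStoreFlushSketch and its methods are hypothetical simplifications, not HBase classes: writes are blocked only while the active set is swapped into a snapshot; the slow file write then runs against the immutable snapshot while new writes go into a fresh active set.

// Hypothetical simplification of the snapshot-then-flush pattern, not HBase code.
import java.util.concurrent.ConcurrentSkipListMap;
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class MemStoreFlushSketch {
    private volatile ConcurrentSkipListMap<String, String> active = new ConcurrentSkipListMap<>();
    private volatile ConcurrentSkipListMap<String, String> snapshot = new ConcurrentSkipListMap<>();
    private final ReentrantReadWriteLock updatesLock = new ReentrantReadWriteLock();

    public void put(String row, String value) {
        updatesLock.readLock().lock();          // concurrent writers share the read lock
        try {
            active.put(row, value);
        } finally {
            updatesLock.readLock().unlock();
        }
    }

    // Step 1: prepare -- swap active -> snapshot under the write lock (quick).
    public void prepare() {
        updatesLock.writeLock().lock();
        try {
            snapshot = active;
            active = new ConcurrentSkipListMap<>();
        } finally {
            updatesLock.writeLock().unlock();
        }
    }

    // Step 3: flushCache -- write out the immutable snapshot (an HFile in a tmp dir in HBase).
    public int flushCache() {
        return snapshot.size();   // stand-in for the actual file write
    }

    // Step 4: commit -- make the new file visible and drop the snapshot.
    public void commit() {
        snapshot = new ConcurrentSkipListMap<>();
    }

    public static void main(String[] args) {
        MemStoreFlushSketch m = new MemStoreFlushSketch();
        m.put("row1", "v1");
        m.prepare();                 // Step 1
        m.put("row2", "v2");         // new writes land in the fresh active set
        System.out.println("flushed " + m.flushCache() + " cell(s)");  // Step 3
        m.commit();                  // Step 4
    }
}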

 Walking through the internalFlushcache method above, step by step:

 1. HRegion line 1661, then HStore line 1941, prepare (takes the write lock): the MemStore copies its kvset into a snapshot, which is the data this memstore flush will write out.

 (Every flush flushes all of the region's stores, so the smallest unit of flushing is the region, not the store; this is one reason having many column families is discouraged.)

 2. HRegion line 1674: sync the WAL and wait for it to complete.

 3. HRegion line 1700: HStore.flushCache writes a tmp file (one tmp file per HStore, even though tmpfiles is a List).

 4. HRegion line 1706: HStore wraps the newly written tmp files as StoreFile objects.

 HStore.updateStorefiles then takes the write lock, adds them to the StoreFileManager's list so they can serve reads, and clears the snapshot.

    HStore line 951: the needsCompaction method calls RatioBasedCompactionPolicy.needsCompaction to decide whether the store needs a compaction.

    (The check: the number of HFiles exceeds the larger of hbase.hstore.compaction.min and hbase.hstore.compactionThreshold, whose default is 3.)
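
For reference, a minimal sketch of that file-count check as described above; NeedsCompactionSketch and its parameters are a deliberate simplification, not the RatioBasedCompactionPolicy source itself.

// Hypothetical simplification of the post-flush compaction check described above:
// request a compaction once the number of store files not already being compacted
// exceeds the larger of hbase.hstore.compaction.min and hbase.hstore.compactionThreshold.
public class NeedsCompactionSketch {

    static boolean needsCompaction(int storeFileCount, int filesCompacting,
                                   int compactionMin, int compactionThreshold) {
        int minFilesToCompact = Math.max(compactionMin, compactionThreshold);
        return (storeFileCount - filesCompacting) > minFilesToCompact;
    }

    public static void main(String[] args) {
        // With the defaults (3), a store holding 4 flushed HFiles and none compacting
        // triggers a system compaction request right after the flush commits.
        System.out.println(needsCompaction(4, 0, 3, 3)); // true
        System.out.println(needsCompaction(3, 0, 3, 3)); // false
    }
}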

 

--------------------------------------------------------------------------------------------------------------------

 

2. HRegion checks whether the region should split; the split policy implementation is IncreasingToUpperBoundRegionSplitPolicy

  @Override
  protected boolean shouldSplit() {
    if (region.shouldForceSplit()) return true;
    boolean foundABigStore = false;
    // Get count of regions that have the same common table as this.region
    int tableRegionsCount = getCountOfCommonTableRegions();
    // Get size to check
    long sizeToCheck = getSizeToCheck(tableRegionsCount);

    for (Store store : region.getStores().values()) {
      // If any of the stores is unable to split (eg they contain reference files)
      // then don't split
      if ((!store.canSplit())) {
        return false;
      }

      // Mark if any store is big enough
      long size = store.getSize();
      if (size > sizeToCheck) {
        LOG.debug("ShouldSplit because " + store.getColumnFamilyName() +
          " size=" + size + ", sizeToCheck=" + sizeToCheck +
          ", regionsWithCommonTable=" + tableRegionsCount);
        foundABigStore = true;
      }
    }

    return foundABigStore;
  }

 IncreasingToUpperBoundRegionSplitPolicy line 65: the shouldSplit method decides whether this region needs to split.

(Again the decision is made per region, another reason that many column families are a bad idea.)

(At initialization, initialSize = hbase.increasing.policy.initial.size, a pre-configured initial size, if it is set; otherwise hbase.hregion.memstore.flush.size, the memstore flush size.)

getCountOfCommonTableRegions gets the number of regions belonging to the same table as this.region; call it regioncount.

When regioncount is between 0 and 100, the threshold is the minimum of hbase.hregion.max.filesize (default 10 GB) and initialSize * regioncount^3; otherwise the threshold is hbase.hregion.max.filesize (default 10 GB).

For example, with a single region: 128 * 1^3 = 128 MB

128 * 2^3 = 1024 MB

128 * 3^3 = 3456 MB

128 * 4^3 = 8192 MB

128 * 5^3 = 16000 MB (roughly 15 GB) => capped at 10 GB, so once the table has 5 regions the configured maximum applies
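
A minimal sketch of that threshold calculation under the assumptions above (128 MB flush size, 10 GB max file size); SplitSizeSketch is a hypothetical simplification, not the IncreasingToUpperBoundRegionSplitPolicy source.

// Hypothetical simplification of the split-size threshold described above: for a small
// number of regions the limit grows with the cube of the region count, capped at
// hbase.hregion.max.filesize.
public class SplitSizeSketch {

    static long getSizeToCheck(int tableRegionsCount, long initialSize, long maxFileSize) {
        if (tableRegionsCount <= 0 || tableRegionsCount > 100) {
            return maxFileSize;
        }
        return Math.min(maxFileSize,
                initialSize * tableRegionsCount * tableRegionsCount * tableRegionsCount);
    }

    public static void main(String[] args) {
        long initialSize = 128L * 1024 * 1024;        // assumed 128 MB memstore flush size
        long maxFileSize = 10L * 1024 * 1024 * 1024;  // hbase.hregion.max.filesize, default 10 GB
        for (int count = 1; count <= 5; count++) {
            System.out.println(count + " region(s): "
                    + getSizeToCheck(count, initialSize, maxFileSize) / (1024 * 1024) + " MB");
        }
        // Prints 128, 1024, 3456, 8192, 10240 MB, matching the worked examples above.
    }
}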

 

 

--------------------------------------------------------------------------------------------------------------------

 3. if (split) request a split, else if (compact) request a compaction

 

 http://blackproof.iteye.com/blog/2037159

 

I took notes on this before and had nearly forgotten them, so here is another write-up of the region split path.

 

The code that creates the two daughter regions: SplitTransaction.stepsBeforePONR

 public PairOfSameType<HRegion> stepsBeforePONR(final Server server,
      final RegionServerServices services, boolean testing) throws IOException {
    // Set ephemeral SPLITTING znode up in zk.  Mocked servers sometimes don't
    // have zookeeper so don't do zk stuff if server or zookeeper is null
    if (server != null && server.getZooKeeper() != null) {
      try {
            // Step 1 @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
        createNodeSplitting(server.getZooKeeper(),
          parent.getRegionInfo(), server.getServerName(), hri_a, hri_b);
      } catch (KeeperException e) {
        throw new IOException("Failed creating PENDING_SPLIT znode on " +
          this.parent.getRegionNameAsString(), e);
      }
    }
    this.journal.add(JournalEntry.SET_SPLITTING_IN_ZK);
    if (server != null && server.getZooKeeper() != null) {
      // After creating the split node, wait for master to transition it
      // from PENDING_SPLIT to SPLITTING so that we can move on. We want master
      // knows about it and won't transition any region which is splitting.
      // Step 2 @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
      znodeVersion = getZKNode(server, services);
    }

    // Step 3 @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
    this.parent.getRegionFileSystem().createSplitsDir();
    this.journal.add(JournalEntry.CREATE_SPLIT_DIR);

    Map<byte[], List<StoreFile>> hstoreFilesToSplit = null;
    Exception exceptionToThrow = null;
    try{
      // Step 4 @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
      hstoreFilesToSplit = this.parent.close(false);
    } catch (Exception e) {
      exceptionToThrow = e;
    }
    if (exceptionToThrow == null && hstoreFilesToSplit == null) {
      // The region was closed by a concurrent thread.  We can't continue
      // with the split, instead we must just abandon the split.  If we
      // reopen or split this could cause problems because the region has
      // probably already been moved to a different server, or is in the
      // process of moving to a different server.
      exceptionToThrow = closedByOtherException;
    }
    if (exceptionToThrow != closedByOtherException) {
      this.journal.add(JournalEntry.CLOSED_PARENT_REGION);
    }
    if (exceptionToThrow != null) {
      if (exceptionToThrow instanceof IOException) throw (IOException)exceptionToThrow;
      throw new IOException(exceptionToThrow);
    }
    if (!testing) {
      // Step 5 @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
      services.removeFromOnlineRegions(this.parent, null);
    }
    this.journal.add(JournalEntry.OFFLINED_PARENT);

    // TODO: If splitStoreFiles were multithreaded would we complete steps in
    // less elapsed time?  St.Ack 20100920
    //
    // splitStoreFiles creates daughter region dirs under the parent splits dir
    // Nothing to unroll here if failure -- clean up of CREATE_SPLIT_DIR will
    // clean this up.
    // Step 6 @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
    splitStoreFiles(hstoreFilesToSplit);

    // Log to the journal that we are creating region A, the first daughter
    // region.  We could fail halfway through.  If we do, we could have left
    // stuff in fs that needs cleanup -- a storefile or two.  Thats why we
    // add entry to journal BEFORE rather than AFTER the change.
    // Step 7 @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
    this.journal.add(JournalEntry.STARTED_REGION_A_CREATION);
    HRegion a = this.parent.createDaughterRegionFromSplits(this.hri_a);

    // Ditto
    this.journal.add(JournalEntry.STARTED_REGION_B_CREATION);
    HRegion b = this.parent.createDaughterRegionFromSplits(this.hri_b);
    return new PairOfSameType<HRegion>(a, b);
  }

  1.  RegionSplitPolicy.getSplitPoint() returns the region's split point: the mid-key of the largest store is used as the split point (a small sketch of this selection follows after this outline).

  2.  SplitRequest.run()

            instantiates a SplitTransaction;

            st.prepare(): pre-split checks: is the region closed, are any of its HFiles still referenced;

            st.execute(): performs the split, in two phases:

 1. createDaughters creates the two daughter regions, taking the parent region's write lock:

      1. Create an ephemeral splitting znode in ZooKeeper

      2. Wait until the master transitions it so the region is in the splitting state

      3. Create the splits directory

      4. Wait for the region's flushes and compactions to finish, then close the region

      5. Remove the region from the HRegionServer and add it to the set of offlined regions

      6. Perform the region split: create a thread pool and use StoreFileSplitter to split every HFile (StoreFile) under the region

      (the HFiles are not rewritten around the split row; reference files are created instead and written under each daughter region's directory)

      7. Create the left and right daughter regions: delete the parent from meta, build each daughter's regioninfo from the reference files, and write it to HDFS

 2. stepsAfterPONR calls DaughterOpener.run to open the two daughter regions, which calls initialize:

     a) Write the .regioninfo file to HDFS so the region can be recovered if meta is lost

     b) Initialize its HStores, mainly the LoadStoreFiles function: for each store it builds StoreFile objects from the paths and files on HDFS, one StoreFile per file; for each StoreFile it creates a HalfStoreFileReader that reads the corresponding file of the parent region. In other words, what this region currently holds are reference files pointing at the parent's files, so every read or write against this region resolves to the parent region.

    Finally the daughter regions are added to the RS's online region list and to the meta table.
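
As referenced in step 1 of the outline above, here is a minimal sketch of the split-point selection: pick the largest store and use its mid-key as the split point. StoreInfo and SplitPointSketch are hypothetical stand-ins, not the HBase RegionSplitPolicy/HStore API.

// Hypothetical sketch of split-point selection: the mid-key of the biggest store
// becomes the region's split point.
import java.util.Arrays;
import java.util.List;

public class SplitPointSketch {

    static class StoreInfo {
        final long size;       // total size of the store's files
        final byte[] midKey;   // mid-key reported by the store's largest HFile
        StoreInfo(long size, byte[] midKey) { this.size = size; this.midKey = midKey; }
    }

    static byte[] getSplitPoint(List<StoreInfo> stores) {
        StoreInfo largest = null;
        for (StoreInfo s : stores) {
            if (largest == null || s.size > largest.size) {
                largest = s;
            }
        }
        return largest == null ? null : largest.midKey;
    }

    public static void main(String[] args) {
        List<StoreInfo> stores = Arrays.asList(
                new StoreInfo(2_000_000_000L, "row-5000".getBytes()),
                new StoreInfo(500_000_000L, "row-0100".getBytes()));
        System.out.println(new String(getSplitPoint(stores))); // row-5000
    }
}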

 

 

 

Source: blackproof.iteye.com/blog/2185916