HBase Compaction Principles and Online Tuning Practice

Author: Vivo Internet Storage Technology Team - Hang Zhengbo

This article details the principles, process, and throttling strategies of HBase Compaction, presents several online tuning cases, and concludes with a summary of Compaction-related parameters.

1. Introduction to Compaction

HBase is designed around the LSM-Tree (Log-Structured Merge Tree) storage model. A write first goes to the WAL (Write-Ahead Log) and is then written into the MemStore cache. Once certain conditions are met, a Flush operation writes the cached data to disk as an HFile. As writes continue, HFiles accumulate, and too many files increase the number of IOs required to answer a query, hurting HBase's read performance. To optimize reads, small HFiles are merged to reduce the file count; this merging of HFiles is called Compaction. A Compaction selects some HFiles from one Store in a Region and merges them: the KeyValues of the files being merged are read sequentially, sorted from small to large, and written into a new file. The newly generated file then replaces all the merged files and serves subsequent requests.

1.1 Classification of Compaction

HBase divides Compaction into two types according to merge scope: Minor Compaction and Major Compaction.

  • Minor Compaction selects some small, adjacent HFiles and merges them into one larger HFile;

  • Major Compaction merges all HFiles in a Store into a single HFile. This process also cleans up three kinds of obsolete data: TTL-expired data, deleted data, and versions beyond the configured maximum number of versions.

The figure below vividly describes the difference between the two types of Compaction:

[Figure 1: Minor Compaction merges a few adjacent small HFiles; Major Compaction merges all HFiles in a Store into one]

Under normal circumstances, a Major Compaction lasts a long time and consumes a lot of system resources. Therefore, for businesses with a large amount of online data, it is usually recommended to disable automatically triggered Major Compaction and trigger it manually during off-peak hours (or configure a policy that triggers it automatically during off-peak hours).

1.2 Significance of Compaction

  1. Merging small files reduces the file count, improves read performance, and stabilizes random-read latency;

  2. During a merge, data blocks stored on remote DataNodes are read and rewritten to the local DataNode, improving data locality;

  3. Expired and deleted data are cleared, reducing the table's storage footprint.

1.3 Compaction trigger timing

Compaction can be triggered at many points in HBase; the three most common triggers are the background thread's periodic check, MemStore Flush, and manual triggering.

(1) Periodic check by a background thread: the background thread CompactionChecker periodically checks whether a Compaction needs to be executed. The check period is hbase.server.thread.wakefrequency * hbase.server.compactchecker.interval.multiplier. The main consideration here is that without write requests there is no Flush, which would otherwise be the trigger for Compaction; the periodic check covers that case. hbase.server.thread.wakefrequency defaults to 10s (the HBase server thread wake-up interval) and hbase.server.compactchecker.interval.multiplier defaults to 1000 (the multiplier factor for the periodic Compaction check), so the default period is 10s * 1000 = 10,000s, roughly 2h46m40s.

(2) MemStore Flush: the root of Compaction lies in Flush. When the MemStore reaches a threshold, a Flush is triggered and the in-memory data is written to disk as an HFile. As HFiles accumulate, Compaction becomes necessary. HBase therefore checks after every Flush whether a Compaction should be performed, and triggers one as soon as the conditions for a Minor or Major Compaction are met.

(3) Manual trigger: executing commands such as compact and major_compact through the HBase API, the HBase Shell, or the Master UI.

2. Compaction process

With the background covered, let's walk through the whole Compaction process.

  • The RegionServer starts a Compaction check thread that periodically checks the Stores of each Region;

  • Compaction starts from specific trigger conditions; once triggered, HBase hands the Compaction to an independent thread;

  • Select appropriate HFile files from the corresponding Store. This step is the core of the entire Compaction. Many conditions must be observed when selecting files: the number of files should be neither too large nor too small, and the files should not be too big; ideally the selection favors filesets that carry heavy IO yet are small. On this basis HBase implements a variety of file selection strategies, the most common being RatioBasedCompactionPolicy, ExploringCompactionPolicy, and StripeCompactionPolicy; custom Compaction algorithms are also supported;

  • After the files to be merged are selected, the thread pool that will handle them is chosen according to their total size;

  • Perform specific Compaction operations on these files.

The figure below briefly describes the above process.

[Figure 2: Overall Compaction flow, from the periodic check to file selection, thread-pool choice, and execution]

Each specific step in Figure 2 is described in detail below.

2.1 Start the Compaction timing thread

When the RegionServer starts, it initializes the CompactSplitThread thread and the CompactionChecker for periodic checks, which by default runs every 10s.

// Compaction thread
this.compactSplitThread = new CompactSplitThread(this);
// Background thread to check for compactions; needed if region has not gotten updates
// in a while. It will take care of not checking too frequently on store-by-store basis.
this.compactionChecker = new CompactionChecker(this, this.threadWakeFrequency, this);
 
if (this.compactionChecker != null) choreService.scheduleChore(compactionChecker);

CompactSplitThread is the class that implements the Compaction and Split processes, while CompactionChecker periodically checks whether a Compaction needs to be executed.

CompactionChecker is a ScheduledChore; a ScheduledChore is a task that HBase executes periodically.

2.2 Trigger Compaction

The trigger timings of Compaction were introduced above; each of the three trigger mechanisms is detailed below.

2.2.1 Background thread periodic check

The background thread CompactionChecker periodically checks whether a Compaction needs to be executed; the check period is hbase.regionserver.compaction.check.period (default 10s).

(1) First, check whether the number of files exceeds the threshold for executing a Compaction; if it does, a Compaction is triggered.

(2) If not, check whether the Major Compaction period has been reached. If the earliest update time of any HFile in the current Store is earlier than a value mcTime, a Major Compaction is triggered. mcTime is a floating value whose interval defaults to [7 − 7 × 0.2, 7 + 7 × 0.2] days, where 7 (days) comes from the configuration item hbase.hregion.majorcompaction and 0.2 from hbase.hregion.majorcompaction.jitter, so a Major Compaction runs roughly every 7 days. To disable automatic Major Compaction, simply set hbase.hregion.majorcompaction to 0.
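To make the floating period concrete, here is a minimal sketch (not HBase source code, just the arithmetic described above, assuming the default period of 7 days and jitter of 0.2):

long period = 7L * 24 * 60 * 60 * 1000;    // hbase.hregion.majorcompaction, in milliseconds
double jitter = 0.2;                       // hbase.hregion.majorcompaction.jitter
java.util.Random rnd = new java.util.Random();
// The effective period floats in [period * (1 - jitter), period * (1 + jitter)]
long mcTime = period + (long) (period * jitter * (2 * rnd.nextDouble() - 1));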

(3) If the Major Compaction period has been reached:

  • First check how many HFile files there are. If there is only one, check whether it contains expired data and whether its data locality is low; if neither holds, no Major Compaction is done;

  • If there is more than one file, a Major Compaction is performed.

The flow of the background thread periodic check is shown in Figure 3.

[Figure 3: Flow of the background thread's periodic Compaction check]

Here is the key code of this thread:

// The run method of ScheduledChore repeatedly invokes this chore function
@Override
protected void chore() {
  // Iterate over all online regions of this instance for the periodic check;
  // onlineRegions is the set of all online Regions currently served by this HRegionServer
  for (HRegion r : this.instance.onlineRegions.values()) {
    if (r == null)
      continue;
    // Take each Store of the Region
    for (Store s : r.getStores().values()) {
      try {
        // Interval between compaction checks:
        // hbase.server.compactchecker.interval.multiplier * hbase.server.thread.wakefrequency;
        // the multiplier defaults to 1000
        long multiplier = s.getCompactionCheckMultiplier();
        assert multiplier > 0;
        // Skip unless the iteration counter is an integer multiple of the multiplier;
        // the check is only performed every `multiplier` wake-ups
        if (iteration % multiplier != 0) continue;
        // If a compaction is needed, issue a SystemCompaction request. The final test is whether
        // the current HFile count minus the number of files currently compacting exceeds the
        // configured compaction min; if so, a system compaction is executed
        if (s.needsCompaction()) {
          // Queue a compaction. Will recognize if major is needed.
          this.instance.compactSplitThread.requestSystemCompaction(r, s, getName()
              + " requests compaction");
        } else if (s.isMajorCompaction()) {
          if (majorCompactPriority == DEFAULT_PRIORITY
              || majorCompactPriority > r.getCompactPriority()) {
            this.instance.compactSplitThread.requestCompaction(r, s, getName()
                + " requests major compaction; use default priority", null);
          } else {
            this.instance.compactSplitThread.requestCompaction(r, s, getName()
                + " requests major compaction; use configured priority",
              this.majorCompactPriority, null);
          }
        }
      } catch (IOException e) {
        LOG.warn("Failed major compaction check on " + r, e);
      }
    }
  }
  iteration = (iteration == Long.MAX_VALUE) ? 0 : (iteration + 1);
}

2.2.2 Memstore Flush trigger

MemStore Flush generates HFile files, and as files accumulate, Compaction becomes necessary. Therefore, after each Flush the number of files in the current Store is checked; once it exceeds the Compaction threshold, a Compaction is triggered. It should be emphasized that Compaction is performed per Store, and under the Flush trigger every Store in the Region is checked, so Compactions may run several times within a short period. Below is the code through which a Flush triggers a Compaction.

/**
   * Flush a region.
   * @param region Region to flush.
   * @param emergencyFlush Set if we are being force flushed. If true the region
   * needs to be removed from the flush queue. If false, when we were called
   * from the main flusher run loop and we got the entry to flush by calling
   * poll on the flush queue (which removed it).
   * @param forceFlushAllStores whether we want to flush all store.
   * @return true if the region was successfully flushed, false otherwise. If
   * false, there will be accompanying log messages explaining why the region was
   * not flushed.
   */
  private boolean flushRegion(final Region region, final boolean emergencyFlush,
      boolean forceFlushAllStores) {
    synchronized (this.regionsInQueue) {
      FlushRegionEntry fqe = this.regionsInQueue.remove(region);
      // Use the start time of the FlushRegionEntry if available
      if (fqe != null && emergencyFlush) {
        // Need to remove from region from delay queue.  When NOT an
        // emergencyFlush, then item was removed via a flushQueue.poll.
        flushQueue.remove(fqe);
      }
    }
  
    lock.readLock().lock();
    try {
      // flush
      notifyFlushRequest(region, emergencyFlush);
      FlushResult flushResult = region.flush(forceFlushAllStores);
     // Check whether a compaction is needed
      boolean shouldCompact = flushResult.isCompactionNeeded();
      // We just want to check the size
      // Check whether a split is needed
      boolean shouldSplit = ((HRegion)region).checkSplit() != null;
      if (shouldSplit) {
        this.server.compactSplitThread.requestSplit(region);
      } else if (shouldCompact) {
        // Issue a compaction request
        server.compactSplitThread.requestSystemCompaction(
            region, Thread.currentThread().getName());
      }
    } catch (DroppedSnapshotException ex) {
      // Cache flush can fail in a few places. If it fails in a critical
      // section, we get a DroppedSnapshotException and a replay of wal
      // is required. Currently the only way to do this is a restart of
      // the server. Abort because hdfs is probably bad (HBASE-644 is a case
      // where hdfs was bad but passed the hdfs check).
      server.abort("Replay of WAL required. Forcing server shutdown", ex);
      return false;
    } catch (IOException ex) {
      LOG.error("Cache flush failed" + (region != null ? (" for region " +
          Bytes.toStringBinary(region.getRegionInfo().getRegionName())) : ""),
        RemoteExceptionHandler.checkIOException(ex));
      if (!server.checkFileSystem()) {
        return false;
      }
    } finally {
      lock.readLock().unlock();
      wakeUpIfBlocking();
    }
    return true;
  }

2.2.3 Manual trigger

Manual triggering means initiating a Compaction through a command or an API call. There are three common motivations:

  • Many businesses worry that automatic Major Compaction will affect read/write performance, so they choose to trigger it manually during off-peak hours;

  • A user changes the table's TTL attribute and wants it to take effect immediately, so a Major Compaction is triggered manually;

  • When disk capacity is insufficient, a Major Compaction is triggered manually to delete large amounts of expired data.

Most manual triggers stem from the first reason.
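As an illustration, below is a minimal sketch of triggering Compactions through the Java client API; the table name "user_profile" is a placeholder, and both calls are asynchronous requests, equivalent to the compact and major_compact shell commands.

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;

public class ManualCompactionExample {
  public static void main(String[] args) throws Exception {
    try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
         Admin admin = conn.getAdmin()) {
      TableName table = TableName.valueOf("user_profile"); // placeholder table name
      admin.compact(table);      // request a Minor Compaction (asynchronous)
      admin.majorCompact(table); // request a Major Compaction (asynchronous)
    }
  }
}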

2.3 Select the files to be merged

The core of Compaction is selecting the right files to merge: the size of those files and the IO they currently carry directly determine how effective the Compaction is. The ideal file carries many IO requests while being small, so that the Compaction itself does not consume much IO and read performance improves noticeably once the merge completes. In reality, most files are not like this. HBase currently provides several Minor Compaction file selection policies, configured through hbase.hstore.engine.class. Whatever the policy, the files are first filtered to exclude unqualified ones, which reduces the Compaction workload and its impact on reads and writes:

  • Exclude files currently being compacted;

  • If all records in a file have expired, delete the file directly;

  • Exclude single files that are too large: if a file's size exceeds hbase.hstore.compaction.max.size (default Long.MAX_VALUE), it is excluded so that it does not cause heavy IO consumption.

The files remaining after this filtering are called candidate files. It is then judged whether the Major Compaction conditions are met; if so, all files are selected for merging. There are three such conditions, and a Major Compaction is executed as long as one of them holds:

  • The automatic Compaction period has been reached and the number of candidate files is smaller than hbase.hstore.compaction.max (default 10); this path does not apply if automatic Major Compaction is disabled;

  • The Store contains a Reference file, a temporary reference file produced by a Region split; such files are removed during Compaction;

  • The user manually requested a Major Compaction.

If none of the above conditions is met, a Minor Compaction is performed instead. Minor Compaction has multiple policies; the following focuses on the execution policies of RatioBasedCompactionPolicy (the default before 0.98), ExploringCompactionPolicy (the default since 0.98), and StripeCompactionPolicy.

2.3.1 Modeling of Compaction file selection strategy

The so-called compaction file selection strategy can be modeled as the following problem:

[Figure 4: StoreFiles sorted by Sequence ID; file selection picks a contiguous window [Start, End]]

Each number in the figure represents a file's Sequence ID: the larger the number, the newer the file, which likely means it was just flushed and is therefore small. Such files are preferred during Compaction. The StoreFiles under a Store are therefore sorted by Sequence ID from small to large and labeled f[0], f[1] ... f[n-1] in turn; the selection strategy is to determine a contiguous range [Start, End] of StoreFiles to participate in the Compaction.

The purpose of Compaction is to reduce the file count, delete useless data, and optimize read performance; its implementation is to rewrite the contents of the original files into a new file. The larger the files, the longer the Compaction takes and the more pronounced the IO amplification it generates. The screening criterion is therefore: merge with the minimum IO cost while reducing the file count the most.

Compaction relies on two prerequisites:

  • All StoreFiles are sorted by age (oldest first, newest last);

  • The files involved in Compaction must be continuous.

2.3.2 RatioBasedCompactionPolicy

The basic idea: with End fixed at the last file (in the general case), slide Start from the head of the queue toward the tail and stop scanning as soon as Start satisfies one of the following conditions:

  1. Current file size < (sum of the sizes of all files newer than the current file) × ratio, i.e. f[start].size <= ratio * (f[start+1].size + ... + f[end-1].size). Here ratio is variable: 1.2 during peak hours and 5 during off-peak hours, so larger files may be merged off-peak. The peak window is set via hbase.offpeak.start.hour and hbase.offpeak.end.hour. A small worked example follows this list.

  2. The number of remaining candidate files >= hbase.hstore.compaction.min (default 3), ensuring that the Compaction involves at least the configured minimum number of files.
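Here is a tiny worked sketch of rule 1 with illustrative sizes only; it ignores minCompactSize and the max-files window that the real code also applies:

// Candidate StoreFile sizes in MB, sorted oldest -> newest; peak-hours ratio is 1.2.
long[] sizes = {1000, 50, 23, 12, 12};
double ratio = 1.2;
int start = 0;
while (start < sizes.length - 1) {
  long sumNewer = 0;
  for (int i = start + 1; i < sizes.length; i++) sumNewer += sizes[i];
  if (sizes[start] <= ratio * sumNewer) break; // f[start] is small enough: stop sliding
  start++;
}
// Result: start == 1. The 1000MB file is excluded (1000 > 1.2 * 97 = 116.4),
// while 50 <= 1.2 * (23 + 12 + 12) = 56.4, so f[1]..f[4] are selected.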

The core logic of RatioBasedCompactionPolicy is attached below.

/**
  * @param candidates pre-filtrate
  * @return filtered subset
  * -- Default minor compaction selection algorithm:
  * choose CompactSelection from candidates --
  * First exclude bulk-load files if indicated in configuration.
  * Start at the oldest file and stop when you find the first file that
  * meets compaction criteria:
  * (1) a recently-flushed, small file (i.e. <= minCompactSize)
  * OR
  * (2) within the compactRatio of sum(newer_files)
  * Given normal skew, any newer files will also meet this criteria
  * <p/>
  * Additional Note:
  * If fileSizes.size() >> maxFilesToCompact, we will recurse on
  * compact().  Consider the oldest files first to avoid a
  * situation where we always compact [end-threshold,end).  Then, the
  * last file becomes an aggregate of the previous compactions.
  *
  * normal skew:
  *
  *         older ----> newer (increasing seqID)
  *     _
  *    | |   _
  *    | |  | |   _
  *  --|-|- |-|- |-|---_-------_-------  minCompactSize
  *    | |  | |  | |  | |  _  | |
  *    | |  | |  | |  | | | | | |
  *    | |  | |  | |  | | | | | |
  */
ArrayList<StoreFile> applyCompactionPolicy(ArrayList<StoreFile> candidates,
    boolean mayUseOffPeak, boolean mayBeStuck) throws IOException {
  if (candidates.isEmpty()) {
    return candidates;
  }
  
  // we're doing a minor compaction, let's see what files are applicable
  int start = 0;
  // Get the file compaction ratio: parameter hbase.hstore.compaction.ratio, default 1.2
  double ratio = comConf.getCompactionRatio();
  if (mayUseOffPeak) {
    // Parameter hbase.hstore.compaction.ratio.offpeak, default 5.0
    ratio = comConf.getCompactionRatioOffPeak();
    LOG.info("Running an off-peak compaction, selection ratio = " + ratio);
  }
  
  // get store file sizes for incremental compacting selection.
  final int countOfFiles = candidates.size();
  long[] fileSizes = new long[countOfFiles];
  long[] sumSize = new long[countOfFiles];
  for (int i = countOfFiles - 1; i >= 0; --i) {
    StoreFile file = candidates.get(i);
    fileSizes[i] = file.getReader().length();
    // calculate the sum of fileSizes[i,i+maxFilesToCompact-1) for algo
    // tooFar is the file at the position one max-files window ahead of i, i.e. the file that just
    // reaches the maximum file count; files from i to tooFar are the most one compaction may include
    int tooFar = i + comConf.getMaxFilesToCompact() - 1;
    sumSize[i] = fileSizes[i]
      + ((i + 1 < countOfFiles) ? sumSize[i + 1] : 0)
      - ((tooFar < countOfFiles) ? fileSizes[tooFar] : 0);
  }
  // Loop while the remaining file count still meets the configured minimum and the file at start
  // is larger than the greater of minCompactSize and (total size of the next window * ratio);
  // in effect this finds the window of files whose sizes are small enough to merge
  while (countOfFiles - start >= comConf.getMinFilesToCompact() &&
    fileSizes[start] > Math.max(comConf.getMinCompactSize(),
        (long) (sumSize[start + 1] * ratio))) {
    ++start;
  }
  if (start < countOfFiles) {
    LOG.info("Default compaction algorithm has selected " + (countOfFiles - start)
      + " files from " + countOfFiles + " candidates");
  } else if (mayBeStuck) {
    // We may be stuck. Compact the latest files if we can, to keep the minimum file count requirement.
    int filesToLeave = candidates.size() - comConf.getMinFilesToCompact();
    if (filesToLeave >= 0) {
      start = filesToLeave;
    }
  }
  candidates.subList(0, start).clear();
  return candidates;
}

2.3.3 ExploringCompactionPolicy

This policy inherits from RatioBasedCompactionPolicy. The difference is that the Ratio policy stops scanning as soon as it finds a suitable file set, while the Exploring policy examines many sub-ranges of the StoreFile list and picks an optimal one to participate in the Compaction. "Optimal" can be understood as: the combination with the most files, and among combinations with the same file count, the one whose files are smaller, which helps reduce the IO consumed by the Compaction. The algorithm flow is:

  1. Traverse the files from beginning to end and evaluate every combination that meets the conditions;

  2. A combination must contain >= minFiles and <= maxFiles files;

  3. The total size of each combination must be <= maxCompactSize and >= minCompactSize;

  4. Each file in a combination must satisfy FileSize(i) <= (sum of the other files' sizes in the combination) * ratio, which excludes very large files so that each Compaction merges relatively small files;

  5. Among the combinations satisfying conditions 1-4, choose the one with the most files; when file counts are equal, choose the smaller total size. The aim is to merge as many files as possible while keeping the IO pressure of the Compaction as small as possible. A simplified sketch of this comparison follows.
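The comparison in step 5 can be sketched as follows; this is a simplification of the policy's isBetterSelection method, omitting the mightBeStuck handling present in the real code:

import java.util.List;

// Prefer the combination with more files; on a tie, prefer the smaller total size.
static boolean isBetter(List<StoreFile> best, long bestSize,
                        List<StoreFile> candidate, long candidateSize) {
  return candidate.size() > best.size()
      || (candidate.size() == best.size() && candidateSize < bestSize);
}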

The specific logic code of ExploringCompactionPolicy is attached below.

public List<StoreFile> applyCompactionPolicy(final List<StoreFile> candidates,
       boolean mightBeStuck, boolean mayUseOffPeak, int minFiles, int maxFiles) {
  
    final double currentRatio = mayUseOffPeak
        ? comConf.getCompactionRatioOffPeak() : comConf.getCompactionRatio();
  
    // Start off choosing nothing.
    List<StoreFile> bestSelection = new ArrayList<StoreFile>(0);
    List<StoreFile> smallest = mightBeStuck ? new ArrayList<StoreFile>(0) : null;
    long bestSize = 0;
    long smallestSize = Long.MAX_VALUE;
  
    int opts = 0, optsInRatio = 0, bestStart = -1; // for debug logging
    // Consider every starting place. Traverse the files from beginning to end
    for (int start = 0; start < candidates.size(); start++) {
      // Consider every different sub list permutation in between start and end with min files.
      for (int currentEnd = start + minFiles - 1;
          currentEnd < candidates.size(); currentEnd++) {
        List<StoreFile> potentialMatchFiles = candidates.subList(start, currentEnd + 1);
  
        // Sanity checks
        if (potentialMatchFiles.size() < minFiles) {
          continue;
        }
        if (potentialMatchFiles.size() > maxFiles) {
          continue;
        }
  
        // Compute the total size of files that will
        // have to be read if this set of files is compacted
        long size = getTotalStoreSize(potentialMatchFiles);
  
        // Store the smallest set of files. This stored set (the one with the smallest
        // total size) will be used if it looks like the algorithm is stuck
        if (mightBeStuck && size < smallestSize) {
          smallest = potentialMatchFiles;
          smallestSize = size;
        }
  
        if (size > comConf.getMaxCompactSize(mayUseOffPeak)) {
          continue;
        }
  
        ++opts;
        if (size >= comConf.getMinCompactSize()
            && !filesInRatio(potentialMatchFiles, currentRatio)) {
          continue;
        }
  
        ++optsInRatio;
        if (isBetterSelection(bestSelection, bestSize, potentialMatchFiles, size, mightBeStuck)) {
          bestSelection = potentialMatchFiles;
          bestSize = size;
          bestStart = start;
        }
      }
    }
    if (bestSelection.size() == 0 && mightBeStuck) {
      LOG.debug("Exploring compaction algorithm has selected " + smallest.size()
          + " files of size "+ smallestSize + " because the store might be stuck");
      return new ArrayList<StoreFile>(smallest);
    }
    LOG.debug("Exploring compaction algorithm has selected " + bestSelection.size()
        + " files of size " + bestSize + " starting at candidate #" + bestStart +
        " after considering " + opts + " permutations with " + optsInRatio + " in ratio");
    return new ArrayList<StoreFile>(bestSelection);
  }

2.3.4 StripeCompactionPolicy

Stripe Compaction (HBASE-7667) was proposed to reduce the pressure of Major Compaction. The idea: the most direct way to reduce Major Compaction pressure is to reduce Region size; ideally the cluster would consist of many small Regions, so the total size of the files participating in any Compaction stays small. However, if Regions are too small there will be too many of them: HBase's Region-management overhead grows, and more memory must be allocated to MemStores, otherwise RegionServer-level Flushes may occur and block writes for long periods. Simply shrinking Regions therefore cannot fundamentally solve the problem.

(1) Level Compaction

Community developers drew on LevelDB's compaction strategy, Level Compaction. Its design idea is to divide all the data in a Store into many levels, each holding part of the data, as shown in the figure below:

[Figure 5: Level Compaction — data divided into levels of non-overlapping KeyRange blocks]

The data is no longer organized by time but by KeyRange. Each KeyRange contains multiple files, and the keys of all data in those files fall within that range. For example, all data with keys in Key0~KeyN falls in the files of the first KeyRange, all data with keys in KeyN+1~KeyT falls in the files of the second, and so on.

The whole dataset is divided into many levels: the top level (Level 0) holds the newest data and the bottom level (Level 6) the oldest. Except for Level 0, each level consists of many KeyRange blocks with no key overlap between them. The deeper the level, the larger its KeyRange blocks: each level's blocks are 10 times the size of the previous level's. The darker a Range's color in the figure, the larger the corresponding block.

After data is flushed from the MemStore, it first lands in Level 0, where it may contain any possible key. When a Compaction is needed, the KVs in Level 0 are read one by one and inserted into the Level 1 file whose KeyRange block covers each Key; if a Level 1 KeyRange block then exceeds a size threshold, it continues to merge down into the next level.

Level Compaction still has the notion of Major Compaction, but a Major Compaction only needs to merge the files within one Range block rather than all the data files in the Region.

As can be seen, only some files, from top to bottom, need to participate in such a merge; there is no need to compact all files. Level Compaction has another advantage: for businesses that mostly read recently written data, most reads land on Level 0, so SSDs can be used as the upper-level storage medium to further optimize reads. However, because there are so many levels, the number of compactions increases significantly; testing showed that this kind of compaction did not improve IO utilization.

(2) Stripe Compaction

Although the original Level Compaction is not directly applicable to HBase, its idea inspired HBase developers; combined with the small-region strategy mentioned above, it evolved into Stripe Compaction.

Like Level Compaction, Stripe Compaction divides the files of an entire Store into multiple Ranges by Key, called Stripes. The number of Stripes is configurable, and the Keys of adjacent Stripes do not overlap. A Stripe is similar to a Sub-Region: a large Region divided into many small Sub-Regions.

As data is written, MemStore Flushes produce HFiles. These HFiles are not written into the corresponding Stripe immediately but are placed in an area called L0. The number of HFiles allowed in L0 is configurable; once it is exceeded, the system writes those HFiles into the corresponding Stripes: it reads the HFiles' KVs, locates the Stripe for each KV by its Key, and inserts the KV into that Stripe's file, as shown in Figure 6. Since a Stripe is like a small Region, its Compactions do not consume excessive system resources. When reading, the corresponding Stripe is located by Key and the search is performed only within it; because a Stripe holds relatively little data, read performance also improves to some extent.

[Figure 6: Stripe Compaction — flushed HFiles accumulate in L0 and are then written into the corresponding Stripes]
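For reference, enabling Stripe Compaction on a table can be sketched as follows with the HBase 1.x Java API; the table name "orders" is a placeholder, an open Admin handle is assumed (as in the earlier example), and raising the blocking file count follows the HBase reference guide's advice for stripe stores:

HTableDescriptor desc = admin.getTableDescriptor(TableName.valueOf("orders"));
desc.setConfiguration("hbase.hstore.engine.class",
    "org.apache.hadoop.hbase.regionserver.StripeStoreEngine");
desc.setConfiguration("hbase.hstore.blockingStoreFiles", "100");
admin.modifyTable(TableName.valueOf("orders"), desc); // regions reopen with the new engine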

2.4 Execute Compaction operation

After the files to be merged are selected, the actual merge is performed. The merge process consists mainly of the following steps:

  1. Sequentially read the KVs from all HFiles to be merged and sequentially write them to a temporary file under the ./tmp directory;

  2. Move the temporary file into the Region's official data directory;

  3. Encapsulate the Compaction's input file paths and output file path as a KV, write it to the WAL with a Compaction marker, and finally sync the WAL;

  4. Delete all of the Compaction's input files from the Region's data directory.

HBase's handling of the whole Compaction is very thorough: if an error occurs at any of the above four steps, the process remains highly fault-tolerant and idempotent (executing once or several times yields the same result).

  • If the RS fails at or before step 2, the Compaction is simply considered failed. The failure has no effect on the next attempt of the same Compaction, nor on reads and writes; the only impact is one extra copy of redundant data;

  • If the RS fails after step 2 but before step 3 completes, likewise there is only one extra copy of redundant data;

  • If the RS fails after step 3 and before step 4, after the Region is reopened the RS finds the last Compaction record in the WAL. Since the input and output files were already persisted to HDFS, it only needs to remove the Compaction's input files according to that WAL record.

The compact method of Store is attached below.

public List<StoreFile> compact(CompactionContext compaction,
   CompactionThroughputController throughputController, User user) throws IOException {
   assert compaction != null;
   List<StoreFile> sfs = null;
   CompactionRequest cr = compaction.getRequest();
   try {
     // Do all sanity checking in here if we have a valid CompactionRequest
     // because we need to clean up after it on the way out in a finally
     // block below
     long compactionStartTime = EnvironmentEdgeManager.currentTime();
     assert compaction.hasSelection();
     Collection<StoreFile> filesToCompact = cr.getFiles();
     assert !filesToCompact.isEmpty();
     synchronized (filesCompacting) {
       // sanity check: we're compacting files that this store knows about
       // TODO: change this to LOG.error() after more debugging
        // Check again
       Preconditions.checkArgument(filesCompacting.containsAll(filesToCompact));
     }
  
     // Ready to go. Have list of files to compact.
     LOG.info("Starting compaction of " + filesToCompact.size() + " file(s) in "
         + this + " of " + this.getRegionInfo().getRegionNameAsString()
         + " into tmpdir=" + fs.getTempDir() + ", totalSize="
         + TraditionalBinaryPrefix.long2String(cr.getSize(), "", 1));
  
     // Commence the compaction. newFiles are the new files produced by the merge
     List<Path> newFiles = compaction.compact(throughputController, user);
  
     long outputBytes = 0L;
     // TODO: get rid of this!
     if (!this.conf.getBoolean("hbase.hstore.compaction.complete", true)) {
       LOG.warn("hbase.hstore.compaction.complete is set to false");
       sfs = new ArrayList<StoreFile>(newFiles.size());
       final boolean evictOnClose =
           cacheConf != null? cacheConf.shouldEvictOnClose(): true;
       for (Path newFile : newFiles) {
         // Create storefile around what we wrote with a reader on it.
         StoreFile sf = createStoreFileAndReader(newFile);
         sf.closeReader(evictOnClose);
         sfs.add(sf);
       }
       return sfs;
     }
     // Do the steps necessary to complete the compaction.
     // Move newFiles into their new place and return the list of StoreFiles
     sfs = moveCompatedFilesIntoPlace(cr, newFiles, user);
     // Write a Compaction record into the WAL
     writeCompactionWalRecord(filesToCompact, sfs);
     // Replace the compacted files in the StoreFileManager with the newly generated StoreFiles
     replaceStoreFiles(filesToCompact, sfs);
     // Increment the appropriate counters according to the compaction type
     if (cr.isMajor()) {
       majorCompactedCellsCount += getCompactionProgress().totalCompactingKVs;
       majorCompactedCellsSize += getCompactionProgress().totalCompactedSize;
     } else {
       compactedCellsCount += getCompactionProgress().totalCompactingKVs;
       compactedCellsSize += getCompactionProgress().totalCompactedSize;
     }
     for (StoreFile sf : sfs) {
       outputBytes += sf.getReader().length();
     }
     // At this point the store will use new files for all new scanners.
     // Archive the old files
     completeCompaction(filesToCompact, true); // Archive old files & update store size.
     long now = EnvironmentEdgeManager.currentTime();
     if (region.getRegionServerServices() != null
         && region.getRegionServerServices().getMetrics() != null) {
       region.getRegionServerServices().getMetrics().updateCompaction(cr.isMajor(),
         now - compactionStartTime, cr.getFiles().size(), newFiles.size(), cr.getSize(),
         outputBytes);
     }
     // Log the completion message and return
     logCompactionEndMessage(cr, sfs, now, compactionStartTime);
     return sfs;
   } finally {
     finishCompactionRequest(cr);
   }
 }

3. Compaction throttling

The strategies above all tailor file selection to different business scenarios. Their core goal is to reduce the number of files participating in a Compaction and shorten its execution time, indirectly reducing Compaction's IO amplification and its impact on business read/write latency. However, if the read/write throughput of the Compaction execution phase itself is not limited, it can still consume large amounts of system resources in a short time and hurt user latency. HBase therefore throttles Compaction by limiting its speed and its bandwidth.

3.1 Limit Compaction Speed

This scheme senses Compaction pressure and adjusts the system's Compaction throughput automatically: it lowers the merge throughput when pressure is high and raises it when pressure is low.

The basic principle is:

Under normal circumstances, users set the throughput lower bound hbase.hstore.compaction.throughput.lower.bound (default 10MB/sec) and upper bound hbase.hstore.compaction.throughput.higher.bound (default 20MB/sec). The actual working throughput is lower + (higher − lower) × ratio, where ratio is a decimal between 0 and 1 determined by the number of HFiles in the current Store waiting to participate in Compaction: the more files, the larger the ratio, and vice versa (see getCompactionPressure below). For example, with the defaults and ratio = 0.5, the limit is 10 + (20 − 10) × 0.5 = 15MB/sec.

If the number of HFiles in the current Store is too large and exceeds blockingFileCount (configured by hbase.hstore.blockingStoreFiles), all write requests are blocked waiting for the Compaction to finish; in this scenario the above limit automatically becomes ineffective.

3.2 Compaction BandWidth Limit

The principle is basically the same as Limit Compaction Speed. It mainly involves two parameters: compactBwLimit and numOfFilesDisableCompactLimit.

The functions are as follows:

  • compactBwLimit: the maximum bandwidth a Compaction may use; if a Compaction uses more than this, it is forced to sleep for a period of time.

  • numOfFilesDisableCompactLimit: under very heavy write load, limiting Compaction bandwidth inevitably causes HFiles to pile up, which hurts read latency. Hence this parameter: once the number of HFiles in the Store exceeds this value, the bandwidth limit becomes ineffective.

// This method dynamically tunes the Compaction throughput limit
private void tune(double compactionPressure) {
    double maxThroughputToSet;
    // If the pressure is greater than 1, the maximum throughput is unlimited
    if (compactionPressure > 1.0) {
      // set to unlimited if some stores already reach the blocking store file count
      maxThroughputToSet = Double.MAX_VALUE;
     // During off-peak hours, the limit is the configured maximum off-peak Compaction throughput
    } else if (offPeakHours.isOffPeakHour()) {
      maxThroughputToSet = maxThroughputOffpeak;
    } else {
      // compactionPressure is between 0.0 and 1.0, we use a simple linear formula to
      // calculate the throughput limitation.
      // lower + (higher - lower) * ratio
      maxThroughputToSet =
          maxThroughputLowerBound + (maxThroughputHigherBound - maxThroughputLowerBound)
              * compactionPressure;
    }
    if (LOG.isDebugEnabled()) {
      LOG.debug("compactionPressure is " + compactionPressure + ", tune compaction throughput to "
          + throughputDesc(maxThroughputToSet));
    }
    this.maxThroughput = maxThroughputToSet;
  }

Let's look at the getCompactionPressure method that obtains the RS's Compaction pressure: it traverses every Store of every Region and takes the highest pressure found.

@Override
public double getCompactionPressure() {
  double max = 0;
  for (Region region : onlineRegions.values()) {
    for (Store store : region.getStores()) {
      double normCount = store.getCompactionPressure();
      if (normCount > max) {
        max = normCount;
      }
    }
  }
  return max;
}

@Override
public double getCompactionPressure() {
  int storefileCount = getStorefileCount();
  int minFilesToCompact = comConf.getMinFilesToCompact();
  if (storefileCount <= minFilesToCompact) {
    return 0.0;
  }
  return (double) (storefileCount - minFilesToCompact) / (blockingFileCount - minFilesToCompact);
}
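As a worked illustration of the two methods above, using illustrative numbers (minFilesToCompact = 3, blockingFileCount = 15):

// A Store holding 9 HFiles yields a pressure of (9 - 3) / (15 - 3) = 0.5;
double pressure = (9.0 - 3) / (15 - 3);            // 0.5
// with the default bounds of 10MB/s and 20MB/s, tune() would then set:
double limitMbPerSec = 10 + (20 - 10) * pressure;  // 15MB/s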


4. Problems encountered online and tuning methods

Given the complexity of the online environment, we have made further optimizations to the Compaction module. Two typical cases are described below.

4.1 Automatic Major Compaction is disabled, yet the monitored Major Compaction queue is non-zero and read/write performance suffers

Our online clusters disable automatically triggered Major Compaction and instead trigger it with scheduled tasks during off-peak hours. During one incident, the business reported high read/write latency outside the Major Compaction window, and monitoring showed a consistently large Major Compaction queue.

Below are the monitoring graphs of the Major Compaction queue length and the average read/write call latency at the time. Two points are evident:

  • When the Major Compaction queue length is large, read/write latency is high;

  • The Major Compaction queue length correlates with inbound traffic: the larger the inbound traffic, the longer the queue.

This raises a question: with automatic Major Compaction disabled, what triggers these Major Compactions?

[Figure: Major Compaction queue length over time]

[Figure: average read/write call latency over time]

With the above questions in mind, we analyze the problem from the source code level.

1) First, clarify what the Major Compaction queue length metric means: it is the number of tasks waiting in the work queue of the longCompactions thread pool.

@Override
public int getLargeCompactionQueueSize() {
  //The thread could be zero.  if so assume there is no queue.
  if (this.regionServer.compactSplitThread == null) {
    return 0;
  }
  return this.regionServer.compactSplitThread.getLargeCompactionQueueSize();
}
public int getLargeCompactionQueueSize() {
  return longCompactions.getQueue().size();
}

2) Checking the HBase logs confirmed that Major Compaction activity did occur.

[Figure: RegionServer log excerpt showing Major Compaction activity]

3) Next, examine when the long Compaction thread pool is invoked, starting from the source code that routes a selected Compaction to the long or short Compaction queue.

/**
 * @param candidateFiles candidate files, ordered from oldest to newest. All files in store.
 * @return subset copy of candidate list that meets compaction criteria
 * @throws java.io.IOException
 */
public CompactionRequest selectCompaction(Collection<StoreFile> candidateFiles,
    final List<StoreFile> filesCompacting, final boolean isUserCompaction,
    final boolean mayUseOffPeak, final boolean forceMajor) throws IOException {
  // Preliminary compaction subject to filters
  ArrayList<StoreFile> candidateSelection = new ArrayList<StoreFile>(candidateFiles);
  // Stuck and not compacting enough (estimate). It is not guaranteed that we will be
  // able to compact more if stuck and compacting, because ratio policy excludes some
  // non-compacting files from consideration during compaction (see getCurrentEligibleFiles).
  int futureFiles = filesCompacting.isEmpty() ? 0 : 1;
  boolean mayBeStuck = (candidateFiles.size() - filesCompacting.size() + futureFiles)
      >= storeConfigInfo.getBlockingFileCount();
  candidateSelection = getCurrentEligibleFiles(candidateSelection, filesCompacting);
  LOG.debug("Selecting compaction from " + candidateFiles.size() + " store files, " +
      filesCompacting.size() + " compacting, " + candidateSelection.size() +
      " eligible, " + storeConfigInfo.getBlockingFileCount() + " blocking");
 
  // If we can't have all files, we cannot do major anyway
  boolean isAllFiles = candidateFiles.size() == candidateSelection.size();
  if (!(forceMajor && isAllFiles)) {
    // Filter out large files
    candidateSelection = skipLargeFiles(candidateSelection, mayUseOffPeak);
    isAllFiles = candidateFiles.size() == candidateSelection.size();
  }
  ...
}

The skipLargeFiles method filters the candidate files, removing files that are too large. The threshold is maxCompactSize = conf.getLong(HBASE_HSTORE_COMPACTION_MAX_SIZE_KEY, Long.MAX_VALUE), i.e. Long.MAX_VALUE by default.

/**
 * @param candidates pre-filtrate
 * @return filtered subset
 * exclude all files above maxCompactSize
 * Also save all references. We MUST compact them
 */
private ArrayList<StoreFile> skipLargeFiles(ArrayList<StoreFile> candidates,
  boolean mayUseOffpeak) {
  int pos = 0;
  while (pos < candidates.size() && !candidates.get(pos).isReference()
    && (candidates.get(pos).getReader().length() > comConf.getMaxCompactSize(mayUseOffpeak))) {
    ++pos;
  }
  if (pos > 0) {
    LOG.debug("Some files are too large. Excluding " + pos
        + " files from compaction candidates");
    candidates.subList(0, pos).clear();
  }
  return candidates;
}

Then, according to the total size of the files to be merged, either the long or the short Compaction thread pool is chosen:

@Override
public boolean throttleCompaction(long compactionSize) {
  return compactionSize > comConf.getThrottlePoint();
}

The threshold is computed as follows: by default 2 × maxFilesToCompact (10) × the MemStore flush size (128MB) = 2.5GB. That is, if the total size of the files to be merged exceeds 2.5GB, the merge is executed in the long Compaction thread pool.

throttlePoint = conf.getLong("hbase.regionserver.thread.compaction.throttle",
          2 * maxFilesToCompact * storeConfigInfo.getMemstoreFlushSize());

4) Checking the RegionServer logs for that period revealed a large number of Compactions involving files over 2.5GB, which explains why the RS log contained no Major Compaction entries for the period while the long Compaction queue was non-empty.

[Figure: RegionServer log excerpt showing Compactions of files larger than 2.5G]

At this point the cause is clear: the increased inbound traffic produced relatively large individual HFiles. When a Minor Compaction follows a Flush and the total size of the files to merge exceeds 2.5GB (the default threshold), the Minor Compaction is executed in the long Compaction thread pool; merging such large files consumes heavy disk IO, which in turn degrades read and write performance.

5) Measures

We adjusted the Compaction parameter hbase.hstore.compaction.max.size to 2G, meaning that HFiles larger than 2G are excluded from Minor Compactions; files larger than 2G are instead merged during the off-peak window, reducing Compaction's impact on disk IO.
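A hedged sketch of applying this per table through the Java API follows; the table name "metrics_table" is a placeholder, an open Admin handle is assumed, and the parameter can equally be set cluster-wide in hbase-site.xml:

HTableDescriptor desc = admin.getTableDescriptor(TableName.valueOf("metrics_table"));
desc.setConfiguration("hbase.hstore.compaction.max.size",
    String.valueOf(2L * 1024 * 1024 * 1024)); // 2G, in bytes
admin.modifyTable(TableName.valueOf("metrics_table"), desc);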

6) Effect

After the adjustment, the long Compaction thread pool is rarely occupied outside the manually triggered Major Compaction window, and average read/write latency fell below 50ms.

[Figure: long Compaction queue length after the adjustment]

[Figure: average read/write call latency after the adjustment]

4.2 The scheduled, manually triggered Major Compaction tasks take too long to complete

The business reported that reads and writes on a certain table had recently become slow. Monitoring showed the table's storage growing steadily, with a single replica reaching 578TB. The table's TTL is set to 15 days, and its inbound traffic had not grown significantly. The monitoring graphs are below:

[Figure: single-replica storage of the table growing over time]

[Figure: inbound traffic of the table, showing no significant growth]

We therefore suspected that the daily Major Compaction tasks were not completing, so expired data was never fully deleted. The online configuration showed a Major Compaction thread pool size of 1, while the table's data volume is large. We increased the Compaction thread pool size to 10 and configured the cluster's idle window via hbase.offpeak.start.hour and hbase.offpeak.end.hour, allowing larger files to be selected for merging during that window. The before/after comparison shows that the Compaction workload increased significantly after the adjustment.

[Figure: Compaction workload before and after the adjustment]

The table's storage dropped from 578TB to 349TB, a 40% reduction, and business read/write latency returned to normal. Compaction parameters are sensitive: when tuning them, consider the possible impact on the business, observe business latency after each change, and adjust step by step.

5. Introduction of Compaction related parameters

The main Compaction-related parameters covered in this article are summarized below; online environments can adjust them according to the actual situation.

  • hbase.server.thread.wakefrequency: server thread wake-up interval, default 10s;

  • hbase.server.compactchecker.interval.multiplier: multiplier for the periodic Compaction check, default 1000;

  • hbase.hregion.majorcompaction: automatic Major Compaction period, default 7 days; 0 disables automatic Major Compaction;

  • hbase.hregion.majorcompaction.jitter: jitter applied to the Major Compaction period, default 0.2;

  • hbase.hstore.compaction.min: minimum number of files in a Minor Compaction, default 3;

  • hbase.hstore.compaction.max: maximum number of files in a Compaction, default 10;

  • hbase.hstore.compaction.max.size: files larger than this are excluded from Minor Compaction, default Long.MAX_VALUE;

  • hbase.hstore.compaction.ratio / hbase.hstore.compaction.ratio.offpeak: file selection ratios, defaults 1.2 (peak) and 5.0 (off-peak);

  • hbase.offpeak.start.hour / hbase.offpeak.end.hour: the off-peak time window;

  • hbase.hstore.blockingStoreFiles: once a Store has more HFiles than this, writes are blocked until Compaction catches up;

  • hbase.hstore.compaction.throughput.lower.bound / higher.bound: Compaction throughput bounds, defaults 10MB/s and 20MB/s;

  • hbase.regionserver.thread.compaction.throttle: merges larger than this run in the long Compaction pool, default 2 × maxFilesToCompact × MemStore flush size (2.5GB);

  • hbase.hstore.engine.class: the store engine class, which selects the Compaction policy (e.g. StripeStoreEngine).

6. Summary

Compaction is a very important means by which HBase maintains read/write performance, but its logic is complex, and improper use causes write amplification that disturbs normal read and write requests. This article described in detail Compaction's trigger mechanisms, the merge policies that emerged over its evolution, the selection algorithms for files to be merged, Compaction throttling, and the related parameters. Finally, two online cases illustrated concrete analysis approaches and tuning methods; after tuning, performance doubled, keeping the business running efficiently and stably.

Origin: blog.csdn.net/vivo_tech/article/details/131956504