HBase internals: all the details of Region splitting

 This article is published by NetEase Cloud 

Author: Fan Xinxin (This article is for internal sharing only; to reprint it, please contact NetEase for authorization.)

Automatic Region splitting is one of the most important factors behind HBase's excellent scalability, and arguably a prerequisite for any distributed system that aims at near-unlimited scale. How is automatic Region splitting implemented in HBase? Many details are involved: What triggers a Region split? Where is the split point? How is the split carried out so that Region availability is maximized? How are exceptions during a split handled? Does any data move during a split? This article walks through these details. On the one hand, it should give you a deeper understanding of Region splitting in HBase; on the other hand, if you ever need to implement similar functionality, the HBase design can serve as a reference.

Region split trigger policies

As of the latest stable release at the time of writing (1.2.6), HBase ships with as many as six split trigger policies. Each has its own applicable scenarios, and users can choose a split policy per table according to the business. The common policies are as follows:

  • ConstantSizeRegionSplitPolicy: the default split policy before version 0.94. This is the easiest policy to understand, but also the most commonly misunderstood. Read literally, it seems to trigger a split once the region grows beyond a threshold (hbase.hregion.max.filesize). That is not actually the case: in the real implementation the threshold applies to a single store, i.e. a split is triggered only when the largest store in the region exceeds the configured threshold. Another frequent question is whether the store size here means the compressed or the uncompressed total file size. In the implementation, it is the size of the files on disk, i.e. the compressed size when compression is enabled. ConstantSizeRegionSplitPolicy is the policy one would naturally think of first, but in production it has a considerable drawback: it makes no distinction between large and small tables. A large threshold (hbase.hregion.max.filesize) is friendly to large tables, but a small table may never trigger a split; in the extreme case it ends up with a single region, which is bad for the business. A small threshold is friendly to small tables, but a large table then produces a huge number of regions across the cluster, which is bad for cluster management, resource usage, and failover.
  • IncreasingToUpperBoundRegionSplitPolicy: the default split policy from version 0.94 to 2.0. This policy is slightly more involved. The general idea is the same as ConstantSizeRegionSplitPolicy: a split is triggered once the largest store in the region exceeds a threshold. However, the threshold is not a fixed value as in ConstantSizeRegionSplitPolicy; it is adjusted continuously based on the number of regions of the table currently hosted on the same regionserver: threshold = (#regions) * (#regions) * (#regions) * flush size * 2. The threshold does not grow without bound; it is capped at the user-configured MaxRegionFileSize. This policy fixes the shortcoming of ConstantSizeRegionSplitPolicy and adapts to both large and small tables. On large clusters it works very well for the many large tables, but it is not perfect: under this policy, many small tables still generate large numbers of small regions scattered across the cluster, and region moves may also trigger further splits.
  • SteppingSplitPolicy: the default split policy in version 2.0. The split threshold changes yet again, and is simpler than in IncreasingToUpperBoundRegionSplitPolicy. It still depends on the number of regions of the table hosted on the current regionserver: if that number equals 1, the split threshold is flush size * 2, otherwise MaxRegionFileSize. On large clusters this policy is friendlier to both large and small tables than IncreasingToUpperBoundRegionSplitPolicy: small tables no longer generate a flood of small regions, only as many as needed.
In addition there are a few other split policies. DisableSplitPolicy prevents a region from splitting at all, while KeyPrefixRegionSplitPolicy and DelimitedKeyPrefixRegionSplitPolicy still follow the default split policy for deciding when to split, but have their own opinions about the split point: KeyPrefixRegionSplitPolicy, for example, requires that rows sharing the same prefix key stay in the same region.
In terms of usage, the default split policy is usually fine, but a split policy can also be set at the column-family level when creating a table, for example: create 'table', {NAME => 'cf', SPLIT_POLICY => 'org.apache.hadoop.hbase.regionserver.ConstantSizeRegionSplitPolicy'}
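To make the threshold rules above concrete, here is a minimal sketch of the three policies' decision logic. This is illustrative Python, not HBase's actual Java code; all function names are made up, and sizes are in MB for readability.

```python
# Simplified sketch of the three split-trigger rules described above.
# Names and structure are illustrative, not HBase's actual code.

def constant_size_threshold(max_filesize):
    # ConstantSizeRegionSplitPolicy: one fixed threshold for every table
    return max_filesize

def increasing_to_upper_bound_threshold(region_count, flush_size, max_filesize):
    # IncreasingToUpperBoundRegionSplitPolicy:
    # (#regions)^3 * flush_size * 2, capped at MaxRegionFileSize
    return min(region_count ** 3 * flush_size * 2, max_filesize)

def stepping_threshold(region_count, flush_size, max_filesize):
    # SteppingSplitPolicy: flush_size * 2 while the table has a single
    # region on this regionserver, otherwise MaxRegionFileSize
    return flush_size * 2 if region_count == 1 else max_filesize

def should_split(largest_store_size, threshold):
    # In every policy the comparison is against the LARGEST store in the
    # region (not the whole region), using on-disk (compressed) size.
    return largest_store_size > threshold
```

With a 128 MB flush size and a 10 GB MaxRegionFileSize, IncreasingToUpperBoundRegionSplitPolicy yields thresholds of 256 MB, 2 GB, 6.75 GB, and then the 10 GB cap as the table grows from one to four regions on a regionserver, which is how it stays friendly to both small and large tables.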

Preparing for the split: finding the split point

Once a split policy fires and a region split is triggered, the first task is to find the split point. All of the default split policies, whether ConstantSizeRegionSplitPolicy, IncreasingToUpperBoundRegionSplitPolicy, or SteppingSplitPolicy, define the split point in the same way. Of course, when a user triggers a split manually, an explicit split point can be supplied; that case is not discussed here.

So how is the split point located? It is the first rowkey of the middle block of the largest file in the largest store of the region. That sentence is dense and worth reading carefully. In addition, HBase stipulates that if the rowkey located this way is the first or the last rowkey of the entire file, then no split point exists.
When would there be no split point? The most common case is a file with only one block: executing a split then finds nothing to split on. Newcomers often create a fresh table for testing, insert a few rows, flush, then run a split, and are surprised to find that the table does not actually split. This is why. If you read the debug log carefully at that point, you will see a message indicating that the region cannot be split.
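The split-point rule can be sketched roughly as follows. This is an illustrative simplification: the real implementation reads the midkey from the HFile index, while here the block index is modeled simply as a list of each block's first rowkey.

```python
def find_split_point(block_first_keys):
    # block_first_keys: the first rowkey of each block, in order, for the
    # largest file in the largest store (a stand-in for the HFile block index).
    if not block_first_keys:
        return None
    candidate = block_first_keys[len(block_first_keys) // 2]
    # HBase refuses a split point equal to the file's first or last rowkey;
    # a freshly flushed single-block file therefore yields no split point.
    if candidate == block_first_keys[0] or candidate == block_first_keys[-1]:
        return None
    return candidate
```

A test table with a handful of freshly inserted rows typically flushes into a single block, which is exactly the "no split point" case described above.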
The core Region split process
HBase wraps the entire split process in a transaction, with the intent of guaranteeing the atomicity of the split. The split transaction is divided into three phases: prepare, execute, and (on failure) rollback. The template is as follows:
  • Prepare phase: initialize the two daughter regions in memory, concretely by generating two HRegionInfo objects (tableName, regionName, startkey, endkey, and so on). A transaction journal is also created to record the progress of the split; see the rollback phase for details.
  • Execute phase: the core of the split. The flow is as follows (originally illustrated with a diagram from Hortonworks):
1. The regionserver changes the region's state in the ZK node /region-in-transition to SPLITTING.
2. The master notices the state change by watching the node /region-in-transition and updates the region's state in memory. The RIT module on the master page now shows the region in the splitting state.

3. The regionserver creates a temporary .split folder under the parent region's storage directory, which will hold the daughter region information produced by the split.

4. The parent region is closed: it stops accepting writes and triggers a flush so that all data written to the region is persisted to disk. For a short period afterwards, client requests that land on the parent region throw NotServingRegionException.
5. The core split step: under the .split folder, create two new subfolders for daughter A and daughter B, and generate reference files in them that point to the corresponding files in the parent region. This step is the heart of the whole procedure. The log for a generated reference file looks like this:

2017-08-12 11:53:38,158 DEBUG [StoreOpener-0155388346c3c919d3f05d7188e885e0-1] regionserver.StoreFileInfo: reference 'hdfs://hdfscluster/hbase-rsgroup/data/default/music/0155388346c3c919d3f05d7188e885e0/cf/d24415c4fb44427b8f698143e5c4d9dc.00bb6239169411e4d0ecb6ddfdbacf66'

The reference file is named d24415c4fb44427b8f698143e5c4d9dc.00bb6239169411e4d0ecb6ddfdbacf66. The format looks odd; what does this particular file name mean? Look at the parent region file the reference points to: from the log we can see that the parent region being split is 00bb6239169411e4d0ecb6ddfdbacf66 and the parent file being referenced is d24415c4fb44427b8f698143e5c4d9dc. So the reference file name is highly informative: it is the parent store file name, a dot, and the encoded name of the parent region.
You should also pay attention to the content of the reference file. A reference file is exactly that, a reference (not a Linux link file), and its content is clearly not user data. The content is in fact very simple and consists of two parts: a split key, and a boolean (true or false), where true means the reference points at the top half of the parent file and false means it points at the bottom half. Why store these two things? More on that below.
You can use a hadoop command to inspect the content of a reference file yourself:

hadoop dfs -cat /hbase-rsgroup/data/default/music/0155388346c3c919d3f05d7188e885e0/cf/d24415c4fb44427b8f698143e5c4d9dc.00bb6239169411e4d0ecb6ddfdbacf66

6. After the parent region has been split this way, the daughter A and daughter B directories are copied up to the HBase root directory, forming two new regions.
7. The parent region is taken offline after hbase.meta is updated, and no longer serves requests. After going offline, the parent region's entry in the meta table is not deleted immediately; instead its split and offline columns are set to true and its two daughter regions are recorded. Why not delete it right away? More on that below.
8. Daughter regions A and B are opened, hbase.meta is updated accordingly, and the daughters officially begin serving requests.
  • Rollback phase: if an exception occurs during the execute phase, a rollback is performed. To make rollback possible, the split is divided into many sub-stages, and the rollback routine cleans up the appropriate leftover data depending on which sub-stage had been reached. In the code, the enum JournalEntryType identifies each sub-stage.
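The journal-based rollback idea can be sketched like this. It is an illustrative toy: the step names only loosely mirror JournalEntryType, and the cleanup actions are made up for the example.

```python
# Toy illustration of journal-based rollback: each completed sub-step is
# appended to a journal, and on failure only the recorded steps are undone,
# newest first.

CLEANUP = {
    "CLOSED_PARENT_REGION": "reopen parent region",
    "CREATED_SPLIT_DIR": "delete .split directory",
    "CREATED_DAUGHTER_REGIONS": "delete daughter reference files",
}

def run_split(steps, journal, fail_at=None):
    # Execute sub-steps in order, recording each completed one.
    for step in steps:
        if step == fail_at:
            raise RuntimeError("failed at " + step)
        journal.append(step)

def rollback(journal):
    # Undo only what was actually done, in reverse order.
    return [CLEANUP[s] for s in reversed(journal) if s in CLEANUP]
```

The key property is that the rollback never has to guess: whatever phase the split died in, the journal says exactly which garbage needs cleaning up.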
Transactional guarantees of Region splitting

A region split is a fairly complex process involving many sub-steps: splitting the parent region's HFiles, generating the two daughter regions, changing the system meta metadata, and so on. The whole split must therefore be transactional: either the split completes entirely, or it never started at all; under no circumstances may it stop halfway.
To achieve transactionality, HBase uses a state machine (see the SplitTransaction class) to record the state of each sub-step in the split, so that when an exception occurs the system can decide, from the current state, whether and how to roll back. Unfortunately, in the current implementation these intermediate states live only in memory, so if the regionserver crashes mid-split, the split can be left in an intermediate state, i.e. an RIT (region-in-transition) state. In that case you need the hbck tool to inspect the situation and work out a fix. After version 2.0, HBase introduces a new distributed transaction framework, Procedure V2 (HBASE-12439). The new framework uses a log (HLog) to persist the intermediate states of such single-machine transactions (DDL operations, splits, moves, and so on), so even if the participant crashes mid-transaction, the log lets the coordinator roll the transaction back or retry the commit, greatly reducing and potentially eliminating RIT. This is one of the most anticipated availability highlights of 2.0!

The impact of Region splitting on other modules
From the walkthrough above we know that a region split moves no data, so splitting itself is cheap and completes quickly. Right after the split, the daughter regions' files contain no actual user data; each file holds only a little metadata, such as the split point rowkey. This raises several questions: how is data located through a reference file? When is the daughter region's data actually migrated? And once migration is complete, when is the parent region deleted?

1. How is data found through a reference file?

This is where the practical meaning of the reference file's name and content becomes apparent. The lookup works as follows:
(1) From the reference file name (parent region name + real file name), locate the path of the file that holds the real data.
(2) Having located the real data file, can we simply scan the whole file for the KVs we need? Not quite. A reference file normally refers to only half of the data file, bounded by the split point: either the top half or the bottom half. Which half, and what is the split point? Remember the content of the reference file described earlier: exactly these two pieces of information are recorded there.
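The two lookup steps can be sketched as follows. This is a minimal illustration, not HBase's real read path: the example names are hypothetical shortened hashes, and the exact half-selection boundary (>= versus >) is an assumption of the sketch.

```python
def parse_reference_name(ref_name):
    # "<parent store file>.<parent region encoded name>" tells us where
    # the real data lives (names here are hypothetical examples).
    store_file, parent_region = ref_name.rsplit(".", 1)
    return parent_region, store_file

def reference_scan(parent_kvs, split_key, top):
    # The reference file stores the split key and a top/bottom flag, so a
    # scan on the daughter sees only the matching half of the parent file.
    # parent_kvs: (rowkey, value) pairs sorted by rowkey.
    if top:
        return [(k, v) for (k, v) in parent_kvs if k >= split_key]
    return [(k, v) for (k, v) in parent_kvs if k < split_key]
```

This is why the reference file needs exactly two things, the split key and the boolean: together they describe one half of one parent file without copying a single byte of user data.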

2. When is the parent region's data migrated to the daughter region's directory?

The answer is: when a major compaction runs on the daughter region. Recall that a compaction reads all the small files in a store, KV by KV in ascending order, writes them sequentially into one large file, and deletes the small files when done; compaction is therefore inherently a bulk read-and-rewrite of data. When the daughter region runs a major compaction, all the data in the parent directory that belongs to the daughter is read out and written into the daughter region's own data files. Piggybacking data migration on the compaction stage is therefore quite convenient.
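The merge-and-rewrite nature of compaction can be sketched with a simple k-way merge. This is only an illustration of the idea, not HBase's compaction code; in a freshly split daughter, some of the input "files" are really the half-views obtained through reference files.

```python
import heapq

def major_compact(store_files):
    # store_files: several lists of (rowkey, value) pairs, each sorted by
    # rowkey, standing in for the small HFiles of one store.
    merged = list(heapq.merge(*store_files, key=lambda kv: kv[0]))
    # In real HBase the merged result is written as one new large file and
    # the small files (including the reference files) are then deleted;
    # only after that is the daughter independent of the parent's data.
    return merged
```

Once the references are rewritten into a real file and deleted, the daughter no longer depends on the parent directory, which is the precondition for cleaning the parent up.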
3. When will the parent region be deleted?

HMaster starts a thread that periodically scans all parent regions in the splitting state and decides whether each can be cleaned up. The checking thread first finds, in the meta table, all regions whose split column is true, and loads the two daughter regions recorded in the splitA and splitB columns. It then only needs to check whether the two daughters still contain reference files: if neither does, the files of the parent region can be deleted. Looking back at the parent's entry in the meta table above, it should now be roughly clear why that information is stored.
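The cleanup decision can be sketched like this. All names are illustrative, not HBase's real API; the meta row is modeled as a plain dict and the reference check as a callback.

```python
def parent_cleanable(parent_meta_row, daughter_has_references):
    # parent_meta_row: simplified meta entry for the parent after the split.
    # daughter_has_references: callback that checks a daughter's store
    # directories for leftover reference files.
    if not parent_meta_row.get("split"):
        return False
    daughters = (parent_meta_row["splitA"], parent_meta_row["splitB"])
    # Delete the parent only when NEITHER daughter still holds reference
    # files pointing back into the parent's directory.
    return not any(daughter_has_references(d) for d in daughters)
```

This also explains why the split/offline flags and the splitA/splitB columns must survive in the meta table until cleanup: they are exactly the inputs this periodic check needs.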
4. What pitfalls does the split module have in production?

From time to time users report that some regions in the cluster have been in RIT for a long time, with the region state stuck at splitting. The usual advice is to run hbck to see what errors it reports and then repair them with the tools hbck provides. hbck offers several commands for repairing regions stuck in the splitting RIT state, mainly:

-fixSplitParents: Try to force offline split parents to be online.
-removeParents: Try to offline and sideline lingering parents and keep daughter regions.
-fixReferenceFiles: Try to offline lingering reference store files.
The most common problem looks like: ERROR: Found lingering reference file hdfs://mycluster/hbase/news_user_actions/3b3ae24c65fc5094bc2acfebaa7a56de/ In short, this error means that the parent region file pointed to by a reference file no longer exists. Checking the logs, you may see an exception like: java.io.IOException: java.io.IOException: java.io.FileNotFoundException: File does not exist: /hbase/news_user_actions/b7
Why would the parent region file go missing? After discussing it with colleagues, we confirmed it may be caused by an upstream bug; see HBASE-13331 for details. In that issue, when HMaster checks whether a parent directory can be deleted, an IOException thrown while checking a reference file (whether it exists, whether it can be opened normally) causes the function to report that no reference files exist, so the parent region gets deleted. The safe behavior would be to report that references still exist, keep the region, and log the error for manual intervention. If you hit a similar problem, take a look at that issue, and consider backporting the fix to your production version or upgrading.
