HBase source code analysis and LastN best practice exploration

1. Background

In 2006, Google published "Bigtable: A Distributed Storage System for Structured Data". HBase, originally implemented at Powerset and later open-sourced as an Apache project, is a distributed, scalable big-data store modeled on that design. When time-series data is stored in HBase, a very common access pattern is reading the most recent N data points (LastN). Against this background, this article explores how to implement LastN efficiently.

2. Scheme design

The analysis below is based on the company's HBase component, version 1.0.19-kwai. Assume the rowkey is a minute-level timestamp. Since HBase stores rowkeys in ascending lexicographic order, the obvious approach is to set startRow and stopRow and enable the reversed flag so that the newest rows come back first.
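To make the scheme concrete, here is a minimal sketch of a LastN query built on the HBase 1.x client API with a reversed scan. The table name ts_metrics, the zero-padded minute-level rowkey format, and the helper method names are illustrative assumptions only, not part of the component described above.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class LastNReversedScan {

    /** Fetch the most recent n rows, assuming the rowkey is a zero-padded minute-level timestamp. */
    public static List<Result> lastN(Connection conn, TableName tableName,
                                     long nowMinutes, int n) throws IOException {
        Scan scan = new Scan();
        // For a reversed scan the start row is the largest key we want to see.
        scan.setStartRow(Bytes.toBytes(String.format("%012d", nowMinutes)));
        scan.setStopRow(Bytes.toBytes(String.format("%012d", 0L)));
        scan.setReversed(true);   // iterate from newest to oldest
        scan.setCaching(n);       // fetch up to n rows per RPC

        List<Result> results = new ArrayList<>(n);
        try (Table table = conn.getTable(tableName);
             ResultScanner scanner = table.getScanner(scan)) {
            for (Result r : scanner) {
                results.add(r);
                if (results.size() >= n) {
                    break;        // stop after the N most recent rows
                }
            }
        }
        return results;
    }

    public static void main(String[] args) throws IOException {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf)) {
            lastN(conn, TableName.valueOf("ts_metrics"),
                  System.currentTimeMillis() / 60000L, 10);
        }
    }
}
```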

3. Scan and reversed scan

3.1. Scan

Users can express a wide variety of queries through the scan operation of the HBase client. The commonly used settings include startRow, stopRow, Filter, caching, batch, and reversed, and the reversed flag in particular covers several distinct scenarios. The Cloudera documentation describes these scenarios in detail: https://docs.cloudera.com/documentation/enterprise/6/6.3/topics/admin_hbase_scanning.html

The figures below illustrate the scenarios:

Correct scan scenario: startRow and stopRow are not set
[figure]
Incorrect scenario: startRow, stopRow and reversed are set, but used incorrectly
[figure]

3.2. Reversed scan

Set startRow and stopRow and use reversed correctly:
[figure]
Reading through the source code, the call path is scan.next() -> nextRow() -> nextInternal() -> isStopRow(): the scan keeps advancing from startRow until it reaches stopRow, and setting reversed simply flips the comparator so that iteration proceeds from startRow in the opposite direction.

3.3. Reversed efficiency test

Sample          Points          Forward (ms)    Reverse (ms)
Sample data 1   1,000 points    152             152
Sample data 2   30,720 points   1,492           1,964
Sample data 3   30,720 points   1,589           2,080
The tests on larger data sets show that reverse scans are noticeably slower than forward scans.

The Apache JIRA issue discussing the performance gap between reverse and forward reads: https://issues.apache.org/jira/browse/HBASE-4811?jql=text%20~%20%22reverse%22

So why is a reverse scan slower than a forward read? Let's start the analysis.

4. HBase data storage structure

According to HBase's data model, HBase is a column-family-oriented store: data is stored separately per column family. A column family (Column Family) in HBase is essentially an LSM tree. The LSM tree has a memory part and a disk part: for the memory part HBase uses a skip list, and the disk part consists of independent file blocks.

4.1. LSM

Definition:

  1. An LSM tree is a forest of "subtrees" spanning memory and disk.
  2. The LSM tree is divided into Level 0, Level 1, Level 2 ... Level n subtrees, of which only Level 0 is in memory; Levels 1 through n are on disk.
  3. The Level 0 subtree in memory generally uses an ordered data structure such as a balanced search tree (red-black tree/AVL tree), a skip list, or a TreeMap, which makes it easy to later write the data to disk sequentially.
  4. Each Level 1..n subtree on disk is essentially a file written after the data has been sorted; it is merely called a tree.
  5. The subtrees at each level have a size threshold; once the threshold is reached they are merged, and the merged result is written to the next level.
  6. Only data in memory may be updated in place; data on disk is append-only and is never updated in place.

In HBase terms, a column family (CF) corresponds to a StoreScanner, and a StoreScanner is composed of a MemStoreScanner and StoreFileScanners, which handle retrieval from memory and disk respectively. Concretely, the memory part uses a skip list, implemented with ConcurrentSkipListMap, while the on-disk data lives in HFiles.
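To see why reverse iteration is cheap in memory but not on disk, here is a small self-contained sketch (not HBase code) of an ordered in-memory map built on ConcurrentSkipListMap, the same structure the MemStore uses; descendingMap() walks it in reverse just as easily as entrySet() walks it forward. The keys and values are made up for illustration.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentNavigableMap;
import java.util.concurrent.ConcurrentSkipListMap;

public class MemStoreSketch {
    public static void main(String[] args) {
        // Keys are kept in lexicographic order, just as HBase sorts rowkeys.
        ConcurrentNavigableMap<String, String> memstore = new ConcurrentSkipListMap<>();
        memstore.put("202206180901", "v1");
        memstore.put("202206180903", "v3");
        memstore.put("202206180902", "v2");

        // Forward iteration: oldest minute first.
        for (Map.Entry<String, String> e : memstore.entrySet()) {
            System.out.println(e.getKey() + " -> " + e.getValue());
        }

        // Reverse iteration is equally cheap in memory: the skip list simply
        // walks the entries in descending key order.
        for (Map.Entry<String, String> e : memstore.descendingMap().entrySet()) {
            System.out.println(e.getKey() + " -> " + e.getValue());
        }
    }
}
```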

5. HBase write data process and data encoding

5.1. Client processing stage

The client preprocesses the user's write request, locates the RegionServer that owns the target row according to the cluster metadata, and sends the request to that RegionServer.

5.2. Region writing stage

After receiving the write request, the RegionServer parses the data, writes it to the WAL first, and then writes it into the MemStore of the corresponding column family of the Region.

5.3. MemStore flush stage

When the capacity of a MemStore in the Region exceeds a certain threshold, the system asynchronously executes a flush that writes the in-memory data to a file, forming an HFile.

During the flush, the in-memory data is written into an HFile in a specific format.

The basic process is:

MemStore (CellSkipListSet) -> Scanner.next() -> cell (KeyValue) -> appendGeneralBloomFilter(cell) -> appendDeleteFamilyBloomFilter(cell) -> (HFile.Writer) writer.append(cell)

A new Scanner is created to read cells (KeyValues) from the CellSkipListSet; each cell is appended to the general Bloom filter and, for cells marked DeleteFamily or DeleteFamilyVersion, to the delete-family Bloom filter, and is then written into a DataBlock.

Writing a cell into a DataBlock in memory involves two steps:

1. Encoding KeyValue: Use a specific encoding to encode the cell.

The main encoders in HBase include DiffKeyDeltaEncoder, FastDiffDeltaEncoder, and PrefixKeyDeltaEncoder. The basic idea of the encoding is to store only the delta between the previous KeyValue and the current one: the rowkey, column family, and column qualifier are each compared and only the differences are kept. If the rowkeys of two consecutive KeyValues are identical, the current rowkey can be represented by a special flag instead of being stored in full. In many scenarios this greatly reduces storage space.

2. Write the encoded KeyValue to DataOutputStream

As cells keep being written, the current Data Block eventually exceeds its size threshold (64 KB by default). At that point the Data Block flushes the contents of the DataOutputStream to the file, and the block is persisted to disk.

After encoding, the storage footprint can be reduced substantially. Note that the rowkey is no longer stored in full at this point (an example is given below). For massive data volumes the importance of saving IO resources is self-evident: besides the asynchronous compaction that reduces the number of files to improve read performance, compression and encoding optimizations are also applied along the read and write paths.

6. HBase read data process and data decoding

Compared with the write path, the read path in HBase is more complicated, mainly for two reasons:

First, a single range query may involve multiple Regions, multiple caches, and even multiple data files;

Second, the implementation of update and delete operations in HBase is deliberately simple:

  • An update does not modify the original data; instead, the timestamp attribute is used to keep multiple versions;
  • A delete does not physically remove the original data; it only inserts a marker flagged as "deleted" (a tombstone), and the actual removal happens later, when the system runs a Major Compaction asynchronously.

Obviously, this design greatly simplifies updates and deletes, but it means the read path has to remove the resulting noise: reads must filter by version and must also skip data that has been marked as deleted.
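A short hedged sketch of what this looks like from the client API's point of view; the column family cf, qualifier v, rowkey, and timestamps are made-up values for illustration only.

```java
import java.io.IOException;

import org.apache.hadoop.hbase.client.Delete;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class VersionAndDeleteSketch {
    public static void main(String[] args) throws IOException {
        byte[] row = Bytes.toBytes("202206180901");
        byte[] cf  = Bytes.toBytes("cf");
        byte[] col = Bytes.toBytes("v");

        // An "update" is just another Put with a newer timestamp; both versions
        // coexist until version pruning or compaction removes the older one.
        Put older = new Put(row).addColumn(cf, col, 1000L, Bytes.toBytes("old value"));
        Put newer = new Put(row).addColumn(cf, col, 2000L, Bytes.toBytes("new value"));

        // A delete only writes a tombstone marker; the data physically disappears
        // during a later Major Compaction.
        Delete tombstone = new Delete(row).addColumn(cf, col);

        // A read therefore has to filter by version and skip cells masked by tombstones.
        Get latestOnly = new Get(row).setMaxVersions(1);
    }
}
```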

6.1. HBase client to HBase server

1. The client accesses ZooKeeper to obtain the RegionServer that hosts the hbase:meta table.

2. The client sends a read request to the RegionServer according to the rowkey, caching the metadata in memory at the same time; the RegionServer then processes the request.

3. Both get and scan are implemented as scans (a get is a special case of scan).

A scan is not implemented as a single RPC request, because a full-table scan could have two consequences:

  • Transferring a large amount of data in a short time would consume a lot of cluster resources such as network bandwidth, seriously affecting other services in the cluster.
  • The client could easily run out of memory (OOM) because it cannot cache all of the data.

Therefore, a scan is split into multiple RPC requests; each RPC is called a next request, and each one returns only a limited number of results.

For a next() call, the client first checks its local cache. If cached data is available it is returned directly; otherwise the client issues an RPC to the server and, once it succeeds, caches the results in memory.

The number of rows fetched per RPC is controlled by the caching parameter. The default is Integer.MAX_VALUE; setting it too large makes it easy to hit OOM, while setting it too small pays the network cost of many RPCs.

To avoid returning an entire very wide row in one go, the client can use setBatch to limit the number of columns returned per RPC.

setMaxResultSize limits the amount of data (in bytes, not rows) returned by each RPC; the default is 2 MB.
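Putting these knobs together, here is a hedged example of tuning a scan with the HBase 1.x client API; the rowkey bounds and the concrete values are illustrative, not recommendations.

```java
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class ScanTuningSketch {
    public static Scan buildScan() {
        Scan scan = new Scan();
        scan.setStartRow(Bytes.toBytes("202206180000"));
        scan.setStopRow(Bytes.toBytes("202206190000"));
        scan.setCaching(500);                    // rows returned per next() RPC
        scan.setBatch(100);                      // max columns per Result, for very wide rows
        scan.setMaxResultSize(2 * 1024 * 1024L); // byte cap on each RPC response
        return scan;
    }
}
```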

6.2. Server-side scan framework

A scan may span multiple Regions.

For such a scan, the client splits the scan range [startKey, stopKey) into multiple independent query sub-intervals according to the hbase:meta metadata, one per Region. For example, suppose the table currently has 3 Regions whose key ranges are ["a", "c"), ["c", "e"), and ["e", "g"), and the client sets the scan range to ["b", "f"). Because this range clearly spans multiple Regions, it has to be split; the resulting sub-intervals, cut along the Region boundaries, are ["b", "c"), ["c", "e"), and ["e", "f").
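A toy sketch (plain Java, not HBase client code) of how such a range is cut along region boundaries; the region boundaries and scan range are taken from the example above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ScanRangeSplitSketch {

    /** Each region covers [start, stop); returns the sub-range of the scan owned by each region. */
    static List<String[]> split(String startKey, String stopKey, List<String[]> regions) {
        List<String[]> subScans = new ArrayList<>();
        for (String[] region : regions) {
            String lo = max(startKey, region[0]);
            String hi = min(stopKey, region[1]);
            if (lo.compareTo(hi) < 0) {
                subScans.add(new String[]{lo, hi});   // the part of the scan this region serves
            }
        }
        return subScans;
    }

    static String max(String a, String b) { return a.compareTo(b) >= 0 ? a : b; }
    static String min(String a, String b) { return a.compareTo(b) <= 0 ? a : b; }

    public static void main(String[] args) {
        List<String[]> regions = Arrays.asList(
                new String[]{"a", "c"}, new String[]{"c", "e"}, new String[]{"e", "g"});
        for (String[] s : split("b", "f", regions)) {
            System.out.println("[" + s[0] + ", " + s[1] + ")");
        }
        // prints [b, c), [c, e), [e, f) — the sub-intervals from the example above
    }
}
```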

After a RegionServer receives a get/scan request from the client, it does two things: it first builds the Scanner iterator hierarchy, then repeatedly calls next() to fetch KeyValues and applies the query conditions to them.

6.2.1. Build Scanner Iterator system

The Scanner core consists of three layers: RegionScanner, StoreScanner, and MemStoreScanner/StoreFileScanner. They form a hierarchy:

  • A RegionScanner consists of multiple StoreScanners: one StoreScanner per column family in the table, each responsible for retrieving the data of its Store.
  • A StoreScanner consists of a MemStoreScanner and StoreFileScanners. The data of each Store lives in the MemStore in memory and the StoreFiles on disk; accordingly, the StoreScanner constructs one StoreFileScanner per HFile in the current Store to actually search those files, plus a MemStoreScanner to search the Store's MemStore.

Note that RegionScanner and StoreScanner do not perform the actual lookups; they mainly organize and schedule the work, while StoreFileScanner and MemStoreScanner carry out the final KeyValue searches.


As noted earlier, a column family in HBase is essentially an LSM tree, so this is not repeated here.
After constructing the three-layer Scanner hierarchy, the following steps are performed:
1) Filter out Scanners that cannot match the query conditions.
2) Seek each Scanner to startKey. This step seeks to the scan's starting point startKey in each HFile (or MemStore); if startKey is not found in an HFile, the Scanner is positioned at the next KeyValue after it.
3) Merge the KeyValueScanners into a min-heap. All StoreFileScanners and MemStoreScanners in the Store are merged into a heap (a min-heap), which is really a priority queue. In the queue, the KeyValues found by each Scanner's seek are ordered from smallest to largest according to the Scanner ordering rules. Managing the Scanners with a min-heap guarantees that the next KeyValue popped is always the smallest, so the target KeyValues can be obtained in ascending order by popping repeatedly, preserving ordering. A minimal merge sketch follows.
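Here is a minimal sketch of that merge, using a Java PriorityQueue over plain string keys instead of real KeyValueScanners; the scanner contents are made up for illustration.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;

public class ScannerHeapSketch {

    /** Heap entry: the current head key of one scanner plus the scanner itself. */
    static final class Entry {
        String key;
        final Iterator<String> scanner;
        Entry(String key, Iterator<String> scanner) { this.key = key; this.scanner = scanner; }
    }

    /** Merge several individually sorted scanners into one globally sorted stream. */
    static void mergeAndPrint(List<Iterator<String>> scanners) {
        PriorityQueue<Entry> heap = new PriorityQueue<>(Comparator.comparing((Entry e) -> e.key));
        for (Iterator<String> s : scanners) {
            if (s.hasNext()) heap.add(new Entry(s.next(), s));
        }
        while (!heap.isEmpty()) {
            Entry top = heap.poll();             // smallest current key across all scanners
            System.out.print(top.key + " ");
            if (top.scanner.hasNext()) {         // advance that scanner and push it back
                top.key = top.scanner.next();
                heap.add(top);
            }
        }
    }

    public static void main(String[] args) {
        mergeAndPrint(Arrays.asList(
                Arrays.asList("a", "d", "f").iterator(),   // StoreFileScanner over HFile 1
                Arrays.asList("b", "e").iterator(),        // StoreFileScanner over HFile 2
                Arrays.asList("c", "g").iterator()));      // MemStoreScanner
        // prints: a b c d e f g
    }
}
```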

6.2.2. Locate the target block according to the HFile index tree

As described above, data is kept in both the MemStore and HFiles. In practice, with massive data volumes most of the data sits on disk, so the discussion here focuses on locating data on disk.
When a RegionServer opens an HFile, it loads the Trailer and the Load-on-open section of every HFile into memory. The Load-on-open section contains a very important block, the Root Index Block, which is the root node of the index tree.
[Figure: three-level HFile index tree — Root Index Block, Intermediate Index Blocks, Leaf Index Blocks, Data Blocks]

Each box in the three rows of the figure represents an IndexEntry, consisting of three fields: BlockKey, Block Offset, and BlockDataSize. BlockKey is the first rowkey of the block the entry points to; in the Root Index Block, "a", "m", "o", and "u" are BlockKeys. Block Offset is the offset within the HFile of the block the index entry points to. When the data volume is small the HFile index has only the top level; as data grows it splits into multiple levels, up to three.

1) The user queries rowkey 'fb'. A binary search in the Root Index Block locates 'fb' between 'a' and 'm', so the intermediate node pointed to by the 'a' entry must be visited. Because the Root Index Block is resident in memory, this step is very fast.

2) The index block of the intermediate node pointed to by 'a' is loaded into memory; a binary search locates 'fb' between 'd' and 'h', so the leaf node pointed to by 'd' is visited next.

3) Similarly, the index block of the leaf node pointed to by 'd' is loaded into memory; a binary search locates 'fb' between 'f' and 'g', so the Data Block pointed to by 'f' must finally be accessed.

4) The Data Block pointed to by 'f' is loaded into memory, and the matching KeyValue is found by scanning through it.

In the above process the Intermediate Index Block, Leaf Index Block, and Data Block all have to be loaded into memory, so a lookup normally costs 3 IOs. In practice, however, HBase provides a Block cache that keeps frequently used blocks in memory, which speeds up the actual read path further.
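A toy sketch of the binary search performed at each index level: find the last BlockKey less than or equal to the target rowkey and descend into the block it points to. The keys mirror the Root Index Block of the example; this is not the actual HFileBlockIndex code.

```java
import java.util.Arrays;

public class IndexBlockLookupSketch {

    /** Returns the position of the last key <= target, i.e. the child block to descend into. */
    static int locate(String[] blockKeys, String target) {
        int idx = Arrays.binarySearch(blockKeys, target);
        if (idx >= 0) return idx;            // exact match on a block's first key
        int insertion = -idx - 1;            // index of the first key greater than target
        return insertion - 1;                // the block whose first key precedes target
    }

    public static void main(String[] args) {
        String[] rootIndexKeys = {"a", "m", "o", "u"};   // first rowkey of each child block
        System.out.println(locate(rootIndexKeys, "fb")); // 0 -> descend into the block starting at "a"
    }
}
```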

Section 6.2.2 shows how the HFile index tree locates the target Block, and Section 5.3 shows that during a flush the data is first encoded and then compressed with the chosen compression algorithm; symmetrically, data read from disk must first be decompressed and then decoded. As seen in the write-path encoding, the rowkey is not stored in full: to save storage, the underlying layer actually stores deltas. That is, when two keys share a common prefix, they are stored as the common prefix plus the differing suffix rather than as two complete keys. So when reading in reverse order, it is not enough to read the current key; the preceding, smaller keys must also be read to reconstruct the full key from the delta.

The following is the Prefix_Encoding example from the official documentation. The left side shows the format without encoding, and the right side shows Prefix_Encoding.

[Figure: KeyValue layout without encoding (left) vs. with Prefix_Encoding (right)]

As the figure shows, once the HFile index has located the Block, a reverse-order read has to iterate backwards using the Prefix Len and Key Len fields until the full rowkey has been reconstructed. Compared with a forward read this requires extra iterations, and the larger the data volume, the more pronounced the difference becomes.
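The following toy sketch (not the real PrefixKeyDeltaEncoder) illustrates why: each key is stored as the length of the prefix it shares with the previous key plus the remaining suffix, so decoding a key requires the keys before it, which is exactly the extra work a reverse read has to do.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class PrefixEncodingSketch {

    static class EncodedKey {
        final int prefixLen;   // bytes shared with the previous key
        final String suffix;   // bytes that differ from the previous key
        EncodedKey(int prefixLen, String suffix) {
            this.prefixLen = prefixLen;
            this.suffix = suffix;
        }
    }

    /** Encode a sorted key list as (shared-prefix length, suffix) pairs. */
    static List<EncodedKey> encode(List<String> sortedKeys) {
        List<EncodedKey> out = new ArrayList<>();
        String prev = "";
        for (String key : sortedKeys) {
            int p = 0;
            while (p < prev.length() && p < key.length() && prev.charAt(p) == key.charAt(p)) {
                p++;
            }
            out.add(new EncodedKey(p, key.substring(p)));
            prev = key;
        }
        return out;
    }

    /** Decoding the i-th key requires replaying every key before it. */
    static String decode(List<EncodedKey> encoded, int i) {
        String current = "";
        for (int k = 0; k <= i; k++) {
            EncodedKey e = encoded.get(k);
            current = current.substring(0, e.prefixLen) + e.suffix;
        }
        return current;
    }

    public static void main(String[] args) {
        List<String> keys = Arrays.asList("row202206180901", "row202206180902", "row202206180910");
        List<EncodedKey> encoded = encode(keys);
        // A forward scan decodes incrementally; a reverse scan must walk back to
        // earlier entries to rebuild the shared prefix of each key it visits.
        System.out.println(decode(encoded, 2));   // row202206180910
    }
}
```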

7. Conclusion

  1. HBase stores data in multiple layers (Region / Store / LSM / Block), and each layer has a corresponding index, so at the macro level the index structure supports forward and reverse scans equally well.

  2. To save storage space, HBase encodes the data, and this encoding is unfriendly to reverse scans.

  3. If the best performance is required, it is recommended to build the rowkey as Long.MAX_VALUE - timestamp, so that LastN becomes an ordinary forward scan (a minimal sketch follows).
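A minimal sketch of that rowkey design with the HBase 1.x client API; the column family cf, qualifier v, and minute-level granularity are assumptions for illustration. With this key layout the newest row has the smallest key, so LastN is a plain forward scan and no reversed flag is needed.

```java
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class ReverseTimestampRowkeySketch {

    /** Rowkey = Long.MAX_VALUE - minuteTimestamp, so newer data sorts first. */
    static byte[] rowkey(long minuteTimestamp) {
        return Bytes.toBytes(Long.MAX_VALUE - minuteTimestamp);
    }

    /** Writing a data point: the newest minute gets the smallest rowkey. */
    static Put buildPut(long minuteTimestamp, double value) {
        Put put = new Put(rowkey(minuteTimestamp));
        put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("v"), Bytes.toBytes(value));
        return put;
    }

    /** LastN becomes a plain forward scan starting from "now". */
    static Scan buildLastNScan(long nowMinutes, int n) {
        Scan scan = new Scan();
        scan.setStartRow(rowkey(nowMinutes));   // newest first, no setReversed needed
        scan.setCaching(n);                     // the caller stops after reading n rows
        return scan;
    }
}
```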

References

https://www.cs.umb.edu/~poneil/lsmtree.pdf

https://hbase.apache.org/1.4/book.html#data.block.encoding.enable

https://klevas.mif.vu.lt/~ragaisis/ADS2006/skiplists.pdf

https://issues.apache.org/jira/browse/HBASE-4811?jql=text%20~%20%22reverse%22

https://issues.apache.org/jira/browse/HBASE-4676
