Detailed explanation of the HBase storage and reading process

Introduction

This article analyzes how data is stored in HBase and how it is queried and parsed, to help you understand HBase's internal working principles and workflow from the bottom up.

1. Analysis of HBase data storage process

Let's first look at an overview of the HBase storage process. The explanation below is divided into two parts:

  • The client's request submission process
  • The process after the data arrives at the RegionServer

(Figure: schematic diagram of the HBase storage process)

1. The client's request submission process:

(Figure: schematic diagram of the HBase request submission process)

①. Through ZooKeeper, find the address of the RegionServer that hosts the metadata table (the meta table).

②. Using the information in the meta table, find the RegionServer that hosts the Region to which the data's RowKey belongs.

③. The user's Put or Delete requests are added to a local Buffer, and when certain conditions are met, the data is submitted asynchronously in batches to the RegionServer.
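Steps ① and ② amount to a range lookup: the meta table maps each Region's start key to a server, and the client picks the Region whose key range contains the RowKey. The sketch below is a toy model of that idea only, not the HBase client API; the `META` contents, the host names, and the function `locate_region_server` are all hypothetical.

```python
from bisect import bisect_right

# Hypothetical snapshot of the meta table: (region start key, server address),
# sorted by start key. The first region starts at the empty key.
META = [
    ("", "rs1.example.com:16020"),
    ("row-3000", "rs2.example.com:16020"),
    ("row-6000", "rs3.example.com:16020"),
]

def locate_region_server(rowkey, meta=META):
    """Return the server whose region's key range contains rowkey."""
    start_keys = [start for start, _ in meta]
    # The owning region is the last one whose start key is <= rowkey.
    index = bisect_right(start_keys, rowkey) - 1
    return meta[index][1]
```

A real client would also cache these locations so that it does not have to consult ZooKeeper and the meta table on every request; that caching is left out here.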

Note:
HBase's AutoFlush setting defaults to true, in which case each Put request is submitted directly to the server for processing. It can also be set to false, in which case Put requests are placed in the local Buffer and submitted once the Buffer exceeds its threshold (2 MB by default, adjustable through hbase.client.write.buffer).
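The buffering behavior in the note can be illustrated with a small model. This is a minimal sketch under stated assumptions, not the HBase client: the class `BufferedWriter`, its `send` callback, and the way it counts bytes are all hypothetical, and only the 2 MB default comes from the text above.

```python
class BufferedWriter:
    """Toy client-side write buffer (not the HBase client API)."""

    def __init__(self, send, auto_flush=True, buffer_limit=2 * 1024 * 1024):
        self.send = send                  # callable that ships a batch to the server
        self.auto_flush = auto_flush
        self.buffer_limit = buffer_limit  # bytes; 2 MB mirrors the default named above
        self.pending = []
        self.pending_bytes = 0

    def put(self, rowkey, value):
        """Buffer one write; rowkey and value are bytes."""
        self.pending.append((rowkey, value))
        self.pending_bytes += len(rowkey) + len(value)
        # With auto-flush on, every put is sent immediately; otherwise
        # wait until the buffered size exceeds the limit.
        if self.auto_flush or self.pending_bytes > self.buffer_limit:
            self.flush()

    def flush(self):
        """Send everything buffered so far as one batch."""
        if self.pending:
            self.send(list(self.pending))
            self.pending.clear()
            self.pending_bytes = 0
```

For example, with `auto_flush=False` and `buffer_limit=10`, two puts of 6 bytes each are sent to `send` as a single batch of two, because only the second put pushes the buffer past the limit. The trade-off is the usual one: batching saves round trips, but buffered writes are lost if the client dies before flushing.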

2. The process after the data reaches the RegionServer

(Figure: schematic diagram of the process after the data reaches the RegionServer)

①. The Region obtains a row lock (to guarantee atomicity), writes to the HLog (WAL) first, and then writes to the cache, the MemStore. While the row lock is held, the HLog entry is not yet synced to HDFS. (left to middle part of the figure above)

②. After the Region releases the row lock, the HLog Syncer syncs the data to HDFS. Syncing outside the lock shortens the time the row lock is held and improves performance. If the sync fails, the data is removed from the MemStore again and the insert is reported as failed. (middle part of the figure above)

③. When the size of the cache reaches its threshold (64 MB by default in the older releases described here, 128 MB in more recent ones; adjustable through hbase.hregion.memstore.flush.size), an asynchronous thread flushes the data to StoreFiles.

④. When the number of StoreFiles reaches a certain count, a Compaction is performed: multiple StoreFiles are merged into one, and in the process versions are merged and deleted data is purged.

⑤. When a single StoreFile exceeds a certain size threshold, a split is triggered that divides the current Region into two Regions. The HBase Master takes the original Region offline and assigns the two new Regions to two different RegionServers to balance the load. (the far right part of the figure above)
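The ordering in steps ①, ③ and ④ (log first, then cache, then flush, then compact) can be sketched with a toy model. Everything here is an illustrative assumption rather than HBase code: the class `ToyRegion` is hypothetical, thresholds count entries and files instead of bytes, a `None` value stands in for a delete marker, and row locks, the HDFS sync of step ② and the split of step ⑤ are left out.

```python
class ToyRegion:
    """Toy model of the write path: WAL -> MemStore -> StoreFiles -> compaction."""

    def __init__(self, flush_threshold=3, compaction_threshold=3):
        self.wal = []            # append-only log, stands in for the HLog
        self.memstore = {}       # in-memory cache: rowkey -> value
        self.store_files = []    # each flush adds one sorted, immutable file
        self.flush_threshold = flush_threshold            # entries, not bytes
        self.compaction_threshold = compaction_threshold  # number of files

    def put(self, rowkey, value):
        self.wal.append((rowkey, value))   # log first, so the edit is recoverable
        self.memstore[rowkey] = value      # then apply it to the cache
        if len(self.memstore) >= self.flush_threshold:
            self.flush()

    def delete(self, rowkey):
        self.put(rowkey, None)             # None marks a delete (a tombstone)

    def flush(self):
        # Write the cache out as one sorted file and start a fresh cache.
        self.store_files.append(sorted(self.memstore.items()))
        self.memstore = {}
        if len(self.store_files) >= self.compaction_threshold:
            self.compact()

    def compact(self):
        merged = {}
        for store_file in self.store_files:    # oldest first, newer values win
            merged.update(store_file)
        # Drop tombstoned rows while merging everything into one file.
        live = sorted((k, v) for k, v in merged.items() if v is not None)
        self.store_files = [live]
```

The point of the model is the division of labor: a delete is just another write until compaction runs, and only compaction physically removes the data.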

2. Analysis of HBase data reading process

1. The client's request submission process

This is the same as the client's request submission process described in part 1, "Analysis of HBase data storage process", and is not repeated here.

(Figure: flow chart of the client's request submission process)

2. The process after the request reaches the RegionServer

(Figure: Scanner system diagram)

①. After receiving the client's Get or Scan request, the RegionServer builds one RegionScanner. (the first column on the left in the figure above)

②. The RegionScanner builds one StoreScanner per column family (the number of StoreScanners equals the number of column families). (second column from the left in the figure above)

③. Each StoreScanner constructs one StoreFileScanner for every StoreFile currently in its Store, used to read from the files. At the same time, the StoreScanner constructs one MemScanner for the MemStore (only one per Store), used to read the data held in the MemStore.

Both kinds of Scanner are built because some data may not yet have been flushed from memory to an HFile. Together they cover the underlying data files and the data in memory, so that no data is missed when looking for the requested rows. (third column from the left in the figure above)

④. The scanned Key-Values are encapsulated as a ResultSet (result set) and returned to the client. (the fourth and fifth columns from the left in the figure above)
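The read path above is essentially a merge of several sorted sources, one in memory and one per file. The sketch below shows that idea with Python's `heapq.merge`; it is a simplified model and does not use HBase's own scanner classes. The functions `store_scan` and `region_scan` are hypothetical, and each source is assumed to be a list of `(rowkey, timestamp, value)` tuples sorted by rowkey with at most one version per row.

```python
import heapq

def store_scan(memstore, store_files):
    """Merge one in-memory source and one source per file into sorted order.

    Each source is a list of (rowkey, timestamp, value) sorted by rowkey.
    """
    sources = [memstore] + list(store_files)
    # Order by rowkey, then newest timestamp first.
    merged = heapq.merge(*sources, key=lambda kv: (kv[0], -kv[1]))
    last_row = None
    for rowkey, timestamp, value in merged:
        if rowkey != last_row:        # keep only the newest version per row
            yield rowkey, value
            last_row = rowkey

def region_scan(stores):
    """Run one store scan per column family and collect the results."""
    return {family: list(store_scan(mem, files))
            for family, (mem, files) in stores.items()}
```

Because the memory source takes part in the same merge as the files, a row that was updated but not yet flushed is returned with its newest value rather than the stale one on disk.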

Summary

This article has explained the whole process of storing and reading data in HBase, and how the storage structures in HBase work together to complete each access. The important terms used here are hyperlinked to their explanations in the previous articles, so they are not explained again. If you come across a term you don't understand, reading this article together with the previous ones should help you understand it faster.

The next article will explain how to optimize HBase storage and reading. If you like this article, please like and bookmark it.

Origin http://10.200.1.11:23101/article/api/json?id=324119879&siteId=291194637