HBase Architecture: A Complete Explanation

Physically, an HBase cluster is composed of three types of servers in a master-slave architecture:

  • Region Server: responsible for processing data read and write requests; the client interacts directly with the Region Server when requesting data.
  • HBase Master: responsible for region assignment and DDL operations (creating and deleting tables).
  • Zookeeper: a distributed coordination service responsible for maintaining the live state of the cluster.

Of course, the underlying storage is based on Hadoop HDFS:

  • The Hadoop DataNode is responsible for storing the data managed by the Region Servers. All HBase data is stored in HDFS files. Region Servers are usually co-located with HDFS DataNodes, which gives the Region Server data locality (that is, the data is placed as close as possible to where it is needed). HBase data is local when it is written, but when a region is moved, its data is no longer local and stays that way until the next compaction rewrites it locally.
  • The Hadoop NameNode maintains the meta information of all HDFS physical data blocks.
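
As a hedged illustration of how a client ties into this architecture, the Java sketch below (the ZooKeeper host names are made up) shows that a client is configured only with the ZooKeeper quorum; region lookup and Region Server communication happen inside the client library.

```java
import java.util.Arrays;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;

public class HBaseConnectionSketch {
    public static void main(String[] args) throws Exception {
        // The client only needs the ZooKeeper quorum; it discovers the Meta table
        // and the Region Servers from there (host names here are hypothetical).
        Configuration conf = HBaseConfiguration.create();
        conf.set("hbase.zookeeper.quorum", "zk1.example.com,zk2.example.com,zk3.example.com");

        try (Connection connection = ConnectionFactory.createConnection(conf);
             Admin admin = connection.getAdmin()) {
            // Forces contact with the cluster and lists the existing tables.
            System.out.println("Tables: " + Arrays.toString(admin.listTableNames()));
        }
    }
}
```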

Regions

The HBase table (Table) is horizontally split into several regions according to the range of the rowkey. Each region contains all the rows between the start key and end key of the region. Regions are assigned to certain nodes in the cluster for management, namely Region Servers, which are responsible for processing data read and write requests. Each Region Server can manage approximately 1000 regions.
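
To make the rowkey-range split concrete, the hedged Java sketch below (the table name my_table is hypothetical, HBase 2.x client API assumed) uses the client's RegionLocator to list each region's start key, end key, and the Region Server currently hosting it.

```java
import java.io.IOException;
import org.apache.hadoop.hbase.HRegionLocation;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.RegionLocator;
import org.apache.hadoop.hbase.util.Bytes;

public class ListRegions {
    // Prints the [start key, end key) range of every region and its hosting Region Server.
    static void printRegions(Connection connection) throws IOException {
        TableName table = TableName.valueOf("my_table"); // hypothetical table
        try (RegionLocator locator = connection.getRegionLocator(table)) {
            for (HRegionLocation location : locator.getAllRegionLocations()) {
                System.out.printf("region=%s start=%s end=%s server=%s%n",
                        location.getRegion().getEncodedName(),
                        Bytes.toStringBinary(location.getRegion().getStartKey()),
                        Bytes.toStringBinary(location.getRegion().getEndKey()),
                        location.getServerName());
            }
        }
    }
}
```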

HBase Master

Also called HMaster, it is responsible for region assignment, DDL (creating and deleting tables) and other operations:

It coordinates all the Region Servers:

  • Assigns regions at startup, and reassigns regions for failure recovery or load balancing
  • Monitors all Region Server instances in the cluster (receiving notifications from Zookeeper)

Administrator functions:

  • Provides an interface for creating, deleting and updating HBase tables
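
A hedged sketch of this DDL interface as seen from the client side, using the HBase 2.x Admin API (table and column family names are hypothetical); the region assignments resulting from the pre-split boundaries are handled by HMaster.

```java
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.ColumnFamilyDescriptorBuilder;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.TableDescriptorBuilder;
import org.apache.hadoop.hbase.util.Bytes;

public class DdlSketch {
    static void createAndDrop(Connection connection) throws Exception {
        TableName name = TableName.valueOf("my_table"); // hypothetical
        try (Admin admin = connection.getAdmin()) {
            // Create a table with one column family, pre-split into regions
            // at the given rowkey boundaries.
            byte[][] splitKeys = { Bytes.toBytes("g"), Bytes.toBytes("n"), Bytes.toBytes("t") };
            admin.createTable(
                    TableDescriptorBuilder.newBuilder(name)
                            .setColumnFamily(ColumnFamilyDescriptorBuilder.of("cf"))
                            .build(),
                    splitKeys);

            // DDL for dropping a table: it must be disabled first.
            admin.disableTable(name);
            admin.deleteTable(name);
        }
    }
}
```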

Zookeeper

HBase uses Zookeeper as a distributed coordination service to maintain the state of all servers in the cluster. Zookeeper keeps track of which servers are healthy and available, and sends a notification when a server fails. Zookeeper runs a consensus protocol to keep the distributed state consistent; note that this consensus protocol requires an odd number of machines, typically three or five.

How these components work together

Zookeeper is used to coordinate and share cluster state information in the distributed system. The Region Servers and the online HMaster (active HMaster) each maintain a session with Zookeeper, and Zookeeper keeps all of their ephemeral nodes alive through heartbeats.

Each Region Server will create an ephemeral node. HMaster will monitor these nodes to discover available Region Servers, and it will also monitor these nodes for failures.

The HMasters compete to create an ephemeral node, and Zookeeper decides which one is the first HMaster online, ensuring that there is only one active HMaster at a time. The online HMaster (active HMaster) sends heartbeats to Zookeeper, while the standby HMasters (inactive HMasters) watch for failures of the active HMaster, ready to take over.

If a Region Server or the HMaster fails or is otherwise unable to send heartbeats, its session with Zookeeper expires, the corresponding ephemeral node is deleted, and listeners are notified of the change. The active HMaster listens for Region Server offline events and then recovers the failed Region Server and the region data it was responsible for. The inactive HMasters care about the active HMaster's offline event and then compete to become the new active HMaster.

Comment: This paragraph is very important and touches on some core concepts of distributed system design, including cluster state and consistency. As you can see, Zookeeper is the bridge that connects everything. All participants keep a heartbeat session with Zookeeper and obtain the cluster state information they need from Zookeeper in order to manage other nodes and switch roles. This is also an important idea in distributed system design: maintain the distributed cluster's state information in a dedicated service.
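
This ephemeral-node pattern is not HBase-specific. The hedged sketch below uses the plain ZooKeeper Java API to show the idea; the znode path, connect string and timeout are made up, and HBase's real implementation (which registers servers under its own znode hierarchy) differs in its details.

```java
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class EphemeralNodeSketch {
    public static void main(String[] args) throws Exception {
        // Connect to ZooKeeper; the session is kept alive by client heartbeats.
        ZooKeeper zk = new ZooKeeper("zk1.example.com:2181", 30_000, (WatchedEvent e) -> { });

        // An ephemeral node exists only while this session is alive; if the process
        // dies or stops heartbeating, ZooKeeper deletes the node automatically.
        zk.create("/demo-region-server-rs1", new byte[0],
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);

        // A watcher (playing the role of an HMaster-like observer) is told when it disappears.
        zk.exists("/demo-region-server-rs1",
                (WatchedEvent e) -> System.out.println("event: " + e.getType()));

        Thread.sleep(60_000); // keep the session alive for the demo
        zk.close();
    }
}
```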

First read and write operation

There is a special HBase Catalog table called  Meta table (it is actually a special HBase table), which contains the location information of all regions in the cluster. Zookeeper saves the location of this Meta table.

When HBase reads or writes for the first time:

  • The client obtains from Zookeeper which Region Server is responsible for managing the Meta table.
  • The client will query the Region Server that manages the Meta table, and then know which Region Server is responsible for managing the rowkey required for this data request. The client will cache this information, as well as the location information of the Meta table itself.
  • Then the client goes back to visit the Region Server to get the data.

For later read requests, the client can obtain the location of the Meta table (which Region Server holds it) and the location of previously accessed rowkeys (which Region Server serves them) directly from its cache, unless the region has been moved and the cache is invalidated. In that case the client repeats the steps above to look up the location information again and updates its cache.
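
From the application's point of view, all of this lookup and caching is hidden inside the client library; a read is just a Get, as in this hedged sketch (table name, column family and rowkey are hypothetical).

```java
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class FirstRead {
    static String readCell(Connection connection) throws Exception {
        // The client library resolves rowkey -> region -> Region Server via the
        // Meta table (found through ZooKeeper) and caches the result for later calls.
        try (Table table = connection.getTable(TableName.valueOf("my_table"))) {
            Get get = new Get(Bytes.toBytes("row-0001"));
            Result result = table.get(get);
            return Bytes.toString(result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("col")));
        }
    }
}
```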

Comment: A client read or write actually involves two steps: the first step is to locate, via the Meta table, which Region Server the rowkey belongs to; the second step is to go to that Region Server to read or write the data. Two Region Servers are involved here, and it is necessary to understand their respective roles. The Meta table is introduced in detail below.

HBase Meta Table

Meta table is a special HBase table, which stores all the region lists in the system. This table is similar to a b-tree, and the structure is roughly as follows:

  • Key: table, region start key, region id
  • Value: Region Server
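
Because the Meta table is itself an ordinary HBase table (hbase:meta), it can be scanned with the normal client API. The hedged sketch below prints each region row and the info:server column that records its Region Server (column layout of hbase:meta as in HBase 2.x is assumed).

```java
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class ScanMeta {
    static void dumpMeta(Connection connection) throws Exception {
        // hbase:meta rows are keyed by [table, region start key, region id];
        // the "info:server" column holds the hosting Region Server.
        try (Table meta = connection.getTable(TableName.META_TABLE_NAME);
             ResultScanner scanner = meta.getScanner(new Scan())) {
            for (Result row : scanner) {
                byte[] server = row.getValue(Bytes.toBytes("info"), Bytes.toBytes("server"));
                System.out.println(Bytes.toStringBinary(row.getRow())
                        + " -> " + (server == null ? "unassigned" : Bytes.toString(server)));
            }
        }
    }
}
```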

Region Server composition

Region Server runs on HDFS DataNode and consists of the following components:

  • WAL: the Write Ahead Log is a file on the distributed file system used to store new data that has not yet been persisted; it is used for failure recovery.
  • BlockCache: the read cache. It keeps the most frequently accessed data in memory and is an LRU (Least Recently Used) cache.
  • MemStore: the write cache. It holds new data in memory that has not yet been persisted to disk; the data is sorted before being written to disk. Note that each Column Family of each region has its own MemStore.
  • HFile: stores HBase data on disk (HDFS) in the form of ordered KeyValues.

Comment: This paragraph is the most important part. Understanding the composition of a Region Server is crucial to understanding the HBase architecture. You need to fully understand what a Region Server does and the role of each of its components; the behavior and function of these components are unfolded one by one in the following sections.

HBase write data steps

When the client initiates a write data request (Put operation), the first step is to write the data to the WAL:

  • The new data will be appended to the end of the WAL file.
  • WAL is used to recover data that has not been persisted during failure recovery.

After the data is written to the WAL, it is added to the MemStore, the write cache. The server can then return an ack to the client to indicate that the write is complete.

Comment: Pay attention to the order in which the WAL and the MemStore are updated during a write; it cannot be changed: WAL first, then MemStore. If it were the other way around and the MemStore were updated first, a Region Server crash at that moment would lose the in-memory update, since the data had not yet been persisted to the WAL and could not be recovered. In theory, the WAL is a mirror of the data in the MemStore, and the two should be consistent unless the system crashes. Also note that updating the WAL is an append at the end of the file; this kind of disk operation is very fast and does not add much to the overall response time of the request.
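
A hedged client-side sketch of a write (table, column family and rowkey are hypothetical): by default every Put goes through the WAL before the MemStore, and the Durability setting shown here only spells out that default explicitly.

```java
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.Durability;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class WriteSketch {
    static void writeCell(Connection connection) throws Exception {
        try (Table table = connection.getTable(TableName.valueOf("my_table"))) {
            Put put = new Put(Bytes.toBytes("row-0001"));
            put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("col"), Bytes.toBytes("value"));
            // SYNC_WAL (the usual default) means the Region Server appends the edit
            // to the WAL before updating the MemStore and acking the client.
            put.setDurability(Durability.SYNC_WAL);
            table.put(put);
        }
    }
}
```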

HBase MemStore

MemStore caches HBase data updates in memory in the form of ordered KeyValues, which is the same as the storage form in HFile. Each Column Family has a MemStore, and all updates are sorted in units of Column Family.

HBase Region Flush

After enough data has accumulated in the MemStore, the entire ordered data set is written out as a new HFile in HDFS. HBase creates one HFile per Column Family, which stores the actual cells, that is, the KeyValue data. Over time more and more HFiles are created, because KeyValues are continuously flushed from the MemStore to disk.

Note that this is one reason why HBase limits the number of Column Families: each Column Family has its own MemStore, and when one MemStore is full, all the MemStores of the region are flushed to disk. The flush also records the maximum sequence number of the last written data, so that the system knows how much data has been persisted so far.

The maximum sequence number is a piece of meta information stored in each HFile, indicating how far persistence has progressed and where it should continue. When a region starts up, these sequence numbers are read, the largest of them is taken as the base sequence number, and subsequent data updates increment from this value to produce new sequence numbers.

Comment: Note the concept of the sequence number. Every HBase data update is bound to a new, auto-incrementing sequence number, and each HFile stores the maximum sequence number of the data it holds. This piece of meta information is very important: it is effectively a commit point, telling us that all data up to this sequence number has been persisted to disk. It is used not only when a region starts up; during failure recovery it also tells us from where in the WAL we should start replaying the history of data updates.
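
Flushes normally happen automatically once a MemStore reaches the size set by hbase.hregion.memstore.flush.size (typically 128 MB by default), but they can also be requested explicitly. A hedged sketch (the table name is hypothetical):

```java
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;

public class FlushSketch {
    static void flushTable(Connection connection) throws Exception {
        try (Admin admin = connection.getAdmin()) {
            // Forces the MemStores of every region of the table to be written
            // out as new HFiles on HDFS (the same thing an automatic flush does).
            admin.flush(TableName.valueOf("my_table")); // hypothetical table
        }
    }
}
```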

HBase HFile

The data is stored in HFile in the form of Key/Value. When MemStore accumulates enough data, the entire ordered data set will be written into a new HFile file to HDFS. The whole process is a sequential write operation, which is very fast, because it does not need to move the disk head. (Note that HDFS does not support random file modification operations, but supports append operations.)

HBase HFile file structure

HFile uses a multi-level index to query data without having to read the entire file. This multi-level index is similar to a B+ tree:

  • KeyValues are stored in order.
  • The index maps rowkeys to data blocks, which are 64 KB in size.
  • Each block has its own leaf index.
  • The last key of each block is stored in the intermediate index.
  • The root index points to the intermediate index.

The trailer points to the meta information blocks and is written at the end of the HFile when the data is persisted. The trailer also contains information such as the bloom filter and the time range. The bloom filter is used to skip files that cannot contain the requested rowkey, and the time range information is used to skip files that fall outside the requested time range.
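
Bloom filters and the block size are per-column-family settings. The hedged sketch below shows how they might be set with the HBase 2.x descriptor builders; the names are hypothetical, and ROW bloom filters are already the default in recent versions.

```java
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.ColumnFamilyDescriptorBuilder;
import org.apache.hadoop.hbase.client.TableDescriptor;
import org.apache.hadoop.hbase.client.TableDescriptorBuilder;
import org.apache.hadoop.hbase.regionserver.BloomType;
import org.apache.hadoop.hbase.util.Bytes;

public class BloomSketch {
    static TableDescriptor withBloomFilter() {
        return TableDescriptorBuilder.newBuilder(TableName.valueOf("my_table"))
                .setColumnFamily(ColumnFamilyDescriptorBuilder.newBuilder(Bytes.toBytes("cf"))
                        // ROW bloom filters let reads skip HFiles that cannot contain the rowkey.
                        .setBloomFilterType(BloomType.ROW)
                        // 64 KB HFile data blocks, matching the index granularity described above.
                        .setBlocksize(64 * 1024)
                        .build())
                .build();
    }
}
```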

HFile index

The index just discussed is loaded into memory when the HFile is opened, so that a data lookup only needs a single disk read.

HBase Read merge

We have seen that the KeyValue cells of a row may be located in different places: cells that have already been persisted are in HFiles, recently updated cells are in the MemStore, and recently read cells are in the BlockCache. So when you read a row, how does the system return the corresponding cells? A read operation merges the cells from the BlockCache, the MemStore and the HFiles:

  • First, the scanner reads cells from the BlockCache, the read cache. Recently read KeyValues are cached here, and it is an LRU cache.
  • Then the scanner reads the MemStore, the write cache, which contains the most recently written data.
  • If the scanner does not find all the required cells in the BlockCache and MemStore, HBase uses the indexes and bloom filters kept in the BlockCache to load the relevant HFiles into memory and find the requested row cells.

As discussed earlier, there may be many HFiles per MemStore (that is, per Column Family of each region), so a single read request may have to read multiple files, which hurts performance. This is called read amplification.

Comment: Viewed along the timeline, the HFiles are also ordered; in essence they store the update history of each column family of each region. Therefore, the same cell of the same rowkey may have multiple versions distributed across different HFiles, so reading it may require reading several HFiles, and the performance overhead can be large, especially when data locality is not satisfied; in that case the read amplification becomes even more serious. This is also the reason for the compaction discussed below.
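
As a hedged illustration of the comment above (HBase 2.x client API assumed, names hypothetical): asking for several versions of a cell may force the Region Server to merge results from the MemStore and from more than one HFile.

```java
import org.apache.hadoop.hbase.Cell;
import org.apache.hadoop.hbase.CellUtil;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class VersionedRead {
    static void readVersions(Connection connection) throws Exception {
        try (Table table = connection.getTable(TableName.valueOf("my_table"))) {
            Get get = new Get(Bytes.toBytes("row-0001"));
            get.readVersions(3); // up to 3 versions per cell; they may live in different HFiles
            Result result = table.get(get);
            for (Cell cell : result.rawCells()) {
                System.out.println(cell.getTimestamp() + " -> "
                        + Bytes.toString(CellUtil.cloneValue(cell)));
            }
        }
    }
}
```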

HBase Minor Compaction

HBase will automatically merge some small HFiles and rewrite them into a few larger HFiles. This process is called  minor compaction . It uses a merge sort algorithm to merge small files into large files, effectively reducing the number of HFiles.

HBase Major Compaction

Major Compaction merges and rewrites all the HFiles of each Column Family into a single large HFile. In the process, deleted and expired cells are physically removed, which improves read performance. But because major compaction rewrites all HFiles, it generates a lot of disk I/O and network overhead. This is called write amplification.

Major compaction can be scheduled to run automatically. Because of the write amplification problem, it is usually scheduled for weekends or the middle of the night. (MapR Database has made improvements here and does not require compaction.) Major compaction also rewrites, back onto the local Region Server, data that was moved away because of a server crash or load balancing, and thereby restores data locality.
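
Both compaction types can also be triggered manually through the Admin API; a hedged sketch (the table name is hypothetical, and the automatic major-compaction interval is controlled by hbase.hregion.majorcompaction, roughly weekly by default).

```java
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;

public class CompactionSketch {
    static void compact(Connection connection) throws Exception {
        TableName table = TableName.valueOf("my_table"); // hypothetical
        try (Admin admin = connection.getAdmin()) {
            // Minor compaction: merge some small HFiles into fewer, larger ones.
            admin.compact(table);
            // Major compaction: rewrite all HFiles of each column family into one,
            // dropping deleted and expired cells (expensive: write amplification).
            admin.majorCompact(table);
        }
    }
}
```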

HDFS data backup

All reads and writes go through the primary node. HDFS automatically replicates the WAL and HFile blocks; HBase relies on HDFS to guarantee data integrity and safety. When data is written to HDFS, one copy is written to the local node and two more replicas are written to other nodes.

Both the WAL and the HFiles are persisted to disk and replicated. So how does HBase recover the data in the MemStore that has not yet been persisted to an HFile? The next section discusses this question.

HBase failure recovery

When a Region Server crashes, the regions it manages cannot be accessed until the crash is detected and failure recovery completes; only then do these regions become accessible again. Zookeeper relies on heartbeats to detect node failure, and HMaster is then notified that the Region Server has failed.

When HMaster discovers that a Region Server has failed, it reassigns the regions managed by that server to other healthy Region Servers. To recover the data in the failed server's MemStore that had not yet been persisted to an HFile, HMaster splits the WAL into several files and stores them on the new Region Servers. Each Region Server then replays the portion of the WAL it received to rebuild the MemStore for the regions it has been assigned.

The WAL contains a series of modifications; each modification represents a single put or delete operation. Modifications are written sequentially in chronological order and are appended to the end of the WAL file when they are persisted.

What if the data was still in the MemStore and had not been persisted to an HFile? The WAL is replayed: the WAL file is read, the modification records are sorted and applied to the MemStore, and finally the MemStore is flushed to an HFile.

Comment: Failure recovery is an important feature of HBase reliability assurance. WAL plays a key role here. When splitting WAL, data is allocated to the corresponding new region server according to the region, and then the region server is responsible for replaying this part of the data to MemStore.

 

 

 

Origin blog.csdn.net/llwy1428/article/details/106773217