HBase basic summary

Introduction to HBase

HBase (Hadoop Database) is a highly reliable, high-performance, column-oriented, scalable distributed storage system for big data. It offers very good write performance and good read performance. It supports pluggable compression algorithms (users can choose a compression algorithm per column family to suit the characteristics of the data stored there), which makes efficient use of disk space.

 



 

HBase is an open source implementation of Google BigTable. It uses Hadoop HDFS as its file storage, Hadoop MapReduce to process massive data, and Zookeeper for coordination services. With HBase, a large-scale NoSQL storage cluster can be built on inexpensive commodity PC servers. It is worth mentioning that BigTable supports only a primary index (the row key). HBase also indexes natively only by row key, but secondary indexes can be added on top of it, for example with coprocessors or Apache Phoenix.

 

HBase data structure

logical structure
 

  • Table: HBase organizes data into tables.
  • Row: Each row represents one data object and is uniquely identified by a Row Key.
  • Column Family: A grouping of columns in the table. Every column in the table must belong to a column family. Once a column family has been defined, it cannot easily be changed.
  • Column Qualifier: The column name. Data within a column family is addressed by its column qualifier; a column can also be thought of as a key-value pair in which the Column Qualifier is the key.
  • Cell: A row key, column family, and column qualifier together identify a cell, and the data stored in the cell is called the cell data.
  • Timestamp: By default, the data in each cell is versioned with a timestamp when it is inserted.
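
The logical structure above can be pictured as a nested map: row key, then column (family plus qualifier), then timestamp, then value. The following is a minimal in-memory sketch of that idea only; `ToyTable` and its methods are hypothetical names for illustration and are not part of any HBase API.

```python
from collections import defaultdict


class ToyTable:
    """Toy model of a table: row key -> "family:qualifier" -> {timestamp: value}."""

    def __init__(self, families):
        # Column families are fixed when the table is created.
        self.families = set(families)
        self.rows = defaultdict(lambda: defaultdict(dict))

    def put(self, row_key, family, qualifier, value, timestamp):
        if family not in self.families:
            raise KeyError(f"unknown column family: {family}")
        # Each write adds a new version of the cell, keyed by timestamp.
        self.rows[row_key][f"{family}:{qualifier}"][timestamp] = value

    def get(self, row_key, family, qualifier):
        """Return the value with the newest timestamp, or None if the cell is empty."""
        versions = self.rows.get(row_key, {}).get(f"{family}:{qualifier}", {})
        if not versions:
            return None
        return versions[max(versions)]


table = ToyTable(families=["info"])
table.put("user001", "info", "name", "Ann", timestamp=1)
table.put("user001", "info", "name", "Anna", timestamp=2)  # newer version of the same cell
latest = table.get("user001", "info", "name")
print(latest)
```

Reading a cell returns the newest version, while the older version stays in the map; columns that were never written simply have no entry, which is why a sparse table costs nothing for its empty cells.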

 

physical model

system structure

Let's take a look at the system architecture of HBase.



 

 

Client: The interface for accessing HBase. The client caches location information (such as which server hosts a Region) to speed up access. It communicates with the HMaster for management operations and with HRegionServers for read and write operations.

 

Zookeeper: Mainly does three things. It stores the addresses of the -ROOT- table and of the HMaster; it monitors the status of each HRegionServer through the ephemeral (temporary) node mechanism; and it avoids the HMaster single point of failure (multiple HMasters can be started).

 

HMaster: Multiple HMasters can be started, and Zookeeper's election mechanism prevents a single point of failure. The HMaster mainly handles management tasks for Tables and Regions:

(1) Assign Regions to RegionServers
(2) Balance the load across RegionServers
(3) Detect failed RegionServers and reassign their Regions
(4) Manage user operations on Tables, such as creating, deleting, modifying, and querying them

 

RegionServer: Mainly responsible for serving user requests and for reading and writing data on HDFS. It is the core module of HBase.


 

An HRegionServer contains a series of HRegion objects, each of which corresponds to one Region. As described in the physical model section, each Region consists of multiple Stores, and each Store consists of one MemStore (in-memory storage) and multiple StoreFiles (a StoreFile wraps an HFile and is stored on HDFS).

 

Data written by the user first goes into the MemStore. When the MemStore is full, it is flushed to a StoreFile. When the number of StoreFiles reaches a certain threshold, a Compact (merge) operation is triggered, which merges multiple StoreFiles into one. Version merging and data deletion happen during this merge (a delete first writes a marker; the data is not physically removed at that point). HBase therefore only ever appends data, and updates and deletes are actually applied during later compaction. As a result, a user's write can return as soon as it has entered memory (and been recorded in the write-ahead log described below), which gives HBase its high I/O performance.
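
The write path described above (MemStore, flush, compaction, delete markers) can be illustrated with a small in-memory model. This is only a sketch under simplifying assumptions: `ToyStore`, `flush_size`, and `compact_threshold` are hypothetical names, the "files" are plain dicts, and every compaction merges all files, so it is safe to drop delete markers at that point.

```python
TOMBSTONE = object()  # delete marker; the data itself is only dropped during compaction


class ToyStore:
    def __init__(self, flush_size=2, compact_threshold=3):
        self.memstore = {}
        self.store_files = []  # oldest first; each "file" is an immutable snapshot
        self.flush_size = flush_size
        self.compact_threshold = compact_threshold

    def put(self, key, value):
        # Writes only touch memory; nothing on "disk" is modified in place.
        self.memstore[key] = value
        if len(self.memstore) >= self.flush_size:
            self.flush()

    def delete(self, key):
        # A delete is just another write: it records a marker.
        self.put(key, TOMBSTONE)

    def flush(self):
        if self.memstore:
            self.store_files.append(dict(self.memstore))
            self.memstore = {}
        if len(self.store_files) >= self.compact_threshold:
            self.compact()

    def compact(self):
        merged = {}
        for store_file in self.store_files:  # oldest to newest, so newer values win
            merged.update(store_file)
        # All files are merged here, so markers and the data they hide can be dropped.
        live = {k: v for k, v in merged.items() if v is not TOMBSTONE}
        self.store_files = [live]

    def get(self, key):
        # Check the newest data first: MemStore, then files from newest to oldest.
        for layer in [self.memstore, *reversed(self.store_files)]:
            if key in layer:
                value = layer[key]
                return None if value is TOMBSTONE else value
        return None


store = ToyStore()
store.put("a", 1)
store.put("b", 2)      # MemStore full: first flush
store.put("a", 10)     # update is a new write, the old value stays in the first file
store.delete("b")      # writes a marker and triggers the second flush
files_before = len(store.store_files)
store.put("c", 3)
store.put("d", 4)      # third flush reaches the threshold and triggers compaction
```

Before the last flush, the old value of `a` and the deleted `b` still exist in the first file; they only disappear once compaction rewrites the files.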

As StoreFiles are compacted, they gradually grow larger. When the size of a single StoreFile exceeds a certain threshold, a Split operation is triggered and the current Region is split into two Regions. The HMaster assigns each child Region to an HRegionServer, so that the load of the original Region is spread across two Regions.
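
As a rough illustration of a split, the sketch below divides a region's rows at the middle row key. `ToyRegion` and `split_region` are hypothetical names, and choosing the middle key of the sorted rows is an assumption made for simplicity; it is not a description of how HBase itself picks the split point.

```python
from dataclasses import dataclass, field


@dataclass
class ToyRegion:
    start_key: str  # inclusive; "" means "from the beginning of the table"
    end_key: str    # exclusive; "" means "to the end of the table"
    rows: dict = field(default_factory=dict)


def split_region(region):
    """Split a region at its middle row key into two child regions."""
    keys = sorted(region.rows)
    if len(keys) < 2:
        raise ValueError("need at least two rows to split")
    split_key = keys[len(keys) // 2]
    left = ToyRegion(region.start_key, split_key,
                     {k: v for k, v in region.rows.items() if k < split_key})
    right = ToyRegion(split_key, region.end_key,
                      {k: v for k, v in region.rows.items() if k >= split_key})
    return left, right


parent = ToyRegion("", "", {f"row{i:03d}": i for i in range(6)})
left, right = split_region(parent)
print(left.start_key, left.end_key, sorted(left.rows))
print(right.start_key, right.end_key, sorted(right.rows))
```

The two children cover adjacent, non-overlapping key ranges (the end key of the left child is the start key of the right child), so each can be served by a different HRegionServer.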

 

HRegionServer uses a WAL (Write-Ahead Log) mechanism for fault tolerance and data recovery, similar to the binary log in MySQL. Each HRegionServer has one HLog object; HLog is the class that implements the write-ahead log. Every time a user operation writes to the MemStore, a copy of the data is also written to the HLog file. The HLog file is rolled periodically: a new file is started and old files (whose data has already been persisted to StoreFiles) are deleted. When an HRegionServer terminates unexpectedly, the HMaster detects this through Zookeeper. The HMaster first processes the leftover HLog files, splitting the log data by Region and placing it in the corresponding Region directories, and then reassigns the failed Regions. While loading these Regions, the HRegionServers that receive them find that there are historical HLogs to process, so they replay the HLog data into the MemStore and then flush it to StoreFiles, which completes the recovery.
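
The recovery flow can be sketched in a few lines: log every write before applying it to memory, and after a crash split the surviving log by region and replay it. This is a toy model; `ToyRegionServer`, `split_log`, and `replay` are hypothetical names, and a Python list stands in for the HLog file on HDFS.

```python
class ToyRegionServer:
    def __init__(self):
        self.wal = []       # stands in for the HLog file, shared by all regions
        self.memstore = {}  # lost if the server crashes

    def put(self, region, key, value):
        self.wal.append((region, key, value))  # write to the log first
        self.memstore[(region, key)] = value   # then to memory


def split_log(wal):
    """Group leftover log entries by region, keeping their original order."""
    per_region = {}
    for region, key, value in wal:
        per_region.setdefault(region, []).append((key, value))
    return per_region


def replay(edits):
    """Rebuild one region's MemStore contents by re-applying its logged edits."""
    memstore = {}
    for key, value in edits:
        memstore[key] = value
    return memstore


server = ToyRegionServer()
server.put("region-A", "row1", "v1")
server.put("region-B", "row9", "v9")
server.put("region-A", "row1", "v2")

surviving_wal = list(server.wal)  # the log survives the crash
del server                        # the MemStore does not

recovered = {region: replay(edits)
             for region, edits in split_log(surviving_wal).items()}
print(recovered)
```

Because the edits are replayed in log order, the later write to `row1` wins, just as it would have if the server had not crashed.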

 

physical model

(1) The rows in a Table are sorted in lexicographic order of the Row Key.

(2) Each Column Family is in effect a unit of centralized storage, so it is most efficient to put columns with similar I/O characteristics in the same Column Family. The Row Key and Version are stored with every Column Family, and HBase maintains a multi-level index for each value. The columns and values in a column family are stored as Key/Value pairs, so the physical layout is not as sparse as the logical structure.

(3) A Table is divided into multiple Regions along the row direction. A table starts with a single Region; as data grows, the Region is split into two new Regions of roughly equal size, which are distributed to different RegionServers.

 

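(4) HBase has two special tables, -ROOT- and .META. Zookeeper records the location of -ROOT-. -ROOT- records the Region information of the .META. table and can have only one region. The .META. table records the Region information of user tables and can have multiple regions.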



 

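(5) The Region is the smallest unit of distributed storage, but a Region in turn contains multiple Stores, and each Store contains one MemStore (in-memory storage) and multiple StoreFiles (a StoreFile wraps an HFile and is stored on HDFS).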
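
Points (1), (3), and (4) together explain how a row is located: because rows are sorted, each Region owns a contiguous key range, and a lookup walks from Zookeeper to -ROOT- to .META. to the user Region. The sketch below models that chain with sorted lists and binary search. All table contents, server names, and function names are made up for illustration, and it follows the -ROOT-/.META. layout described here; newer HBase releases dropped -ROOT- and find the meta table directly through Zookeeper.

```python
from bisect import bisect_right


def find_entry(entries, key):
    """Return the location whose start key is the greatest one <= key.

    entries is a list of (start_key, location) pairs sorted by start_key.
    """
    start_keys = [start for start, _ in entries]
    index = bisect_right(start_keys, key) - 1
    if index < 0:
        raise KeyError(f"no region covers {key!r}")
    return entries[index][1]


# Hypothetical catalog contents. Keys are (table name, start row key) tuples.
ZOOKEEPER = {"root_location": "server-0"}
ROOT = [(("", ""), "meta-1"), (("orders", ""), "meta-2")]
META = {
    "meta-1": [(("events", ""), "server-1"), (("events", "m"), "server-2")],
    "meta-2": [(("orders", ""), "server-3")],
}


def locate(table, row_key):
    key = (table, row_key)
    _root_server = ZOOKEEPER["root_location"]  # step 1: Zookeeper says where -ROOT- is
    meta_region = find_entry(ROOT, key)        # step 2: -ROOT- says which .META. region
    return find_entry(META[meta_region], key)  # step 3: .META. says which server


print(locate("events", "apple"))
print(locate("events", "zebra"))
print(locate("orders", "42"))

# Lexicographic order is not numeric order: "row10" sorts before "row2".
ordered = sorted(["row2", "row10", "row1"])
print(ordered)
```

A client that caches the result of `locate` can skip all three steps on later requests for the same key range. The last lines show why numeric row keys are often zero-padded: without padding, byte-wise ordering interleaves them unexpectedly.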

 

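Applicable scenarios

(1) Storing large volumes of data with highly concurrent operations on it
(2) Workloads that need random reads and writes on the data
(3) Workloads whose reads and writes are all very simple operations, without the complex queries typical of relational databases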

 
