1. HBase overview
- HBase is a distributed, column-oriented storage system built on HDFS;
- HBase is an important member of the Apache Hadoop ecosystem and is mainly used for massive structured data storage;
- Logically, HBase stores data in tables, rows, and columns.
HDFS is well suited to batch processing, but it has limitations:
- It does not support random lookups of data.
- It is not suitable for incremental data processing.
- It does not support data updates.
Features of HBase tables:
- Large: a table can have billions of rows and millions of columns;
- Schema-free: each row has a sortable primary key and any number of columns. Columns can be added dynamically as needed, and different rows in the same table can have completely different columns;
- Column-oriented: storage and permission control are organized by column (family), and each column (family) is retrieved independently;
- Sparse: empty (null) columns take up no storage space, so a table can be designed to be very sparse;
- Multi-versioned: the data in each cell can have multiple versions. By default the version number is assigned automatically and is the timestamp at which the cell was inserted;
- Single data type: all data in HBase is stored as uninterpreted byte arrays and has no type.
Comparison between row storage and column storage:
Traditional row database:
- Data is stored row by row
- Queries without indexes use a lot of I/O
- Building indexes and materialized views takes a lot of time and resources
- To meet query performance requirements, the database often has to be scaled up substantially.
Column database:
- Data is stored in columns - each column is stored separately
- The data is the index
- Only access the columns involved in the query - significantly reduce system I/O
- Each column is processed by a thread - concurrent processing of queries
- Consistent data types and similar data characteristics - efficient compression
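The I/O difference between the two layouts can be illustrated with a small Python sketch. This is a toy in-memory comparison, not a real storage engine; the sample records and the function names are made up for illustration, and "fields read" is only a rough stand-in for I/O.

```python
# Toy comparison of row-oriented and column-oriented layouts.
rows = [
    {"id": 1, "name": "alice", "age": 30},
    {"id": 2, "name": "bob", "age": 25},
    {"id": 3, "name": "carol", "age": 41},
]

def sum_age_row_layout(records):
    """Row layout: each record is stored as a unit, so every field is touched."""
    fields_read = 0
    total = 0
    for record in records:
        fields_read += len(record)  # the whole row is read
        total += record["age"]
    return total, fields_read

# Column layout: each column is stored separately.
columns = {name: [r[name] for r in rows] for name in rows[0]}

def sum_age_column_layout(cols):
    """Column layout: only the column involved in the query is read."""
    values = cols["age"]
    return sum(values), len(values)

row_total, row_fields_read = sum_age_row_layout(rows)
col_total, col_fields_read = sum_age_column_layout(columns)
print(row_total, row_fields_read)
print(col_total, col_fields_read)
```

Both layouts return the same sum, but the column layout touches one third of the fields for this three-column table.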
2. HBase data model
HBase is modeled on Google's BigTable and is a typical key/value system.
- HBase schema can have multiple Tables, and each table can be composed of multiple Column Families.
- HBase can have Dynamic Column: the column name is encoded in the cell; different cells can have different columns.
Row key and Column Family
- Row Key: the "primary key" of each record in the table, which makes lookups fast. The row key of each row must be unique, and rows do not need to be inserted in increasing key order.
- Column Family: has a name and contains one or more related columns.
- Column: belongs to a column family and is addressed as familyName:columnName.
- Version Number: unique for each row key; defaults to the system timestamp; type Long.
- Value (Cell): a byte array.
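One way to picture this data model is as nested maps: row key -> column (familyName:columnName) -> version -> value. The Python sketch below is a toy model under that assumption; ToyTable and its methods are hypothetical names and are not the HBase client API.

```python
import time

class ToyTable:
    """Toy model: row key -> "family:qualifier" -> {version: value bytes}."""

    def __init__(self, families):
        self.families = set(families)  # column families are declared up front
        self.rows = {}

    def put(self, row_key, column, value, version=None):
        family, _, _qualifier = column.partition(":")
        if family not in self.families:
            raise KeyError(f"unknown column family: {family}")
        if version is None:
            version = int(time.time() * 1000)  # default version: current timestamp
        cells = self.rows.setdefault(row_key, {}).setdefault(column, {})
        cells[version] = value

    def get(self, row_key, column, version=None):
        cells = self.rows.get(row_key, {}).get(column)
        if not cells:
            return None  # an absent column has no entry and takes no space
        if version is None:
            version = max(cells)  # the newest version is returned by default
        return cells.get(version)

table = ToyTable(families=["info"])
table.put(b"row1", "info:name", b"alice", version=1)
table.put(b"row1", "info:name", b"alicia", version=2)
table.put(b"row2", "info:email", b"bob@example.com", version=1)
```

Here row1 and row2 hold different columns in the same family, and row1 keeps two versions of info:name, with the newer one returned unless a version is requested explicitly.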
Operations supported by HBase
- All operations are based on rowkey;
- Support CRUD (Create, Read, Update and Delete) and Scan;
- Single-row operations: Put, Get, Scan
- Multi-row operations: Scan, MultiPut
- There is no built-in join operation; joins can be implemented with MapReduce.
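Because every operation is keyed by row key, a range Scan amounts to a walk over sorted keys. The following Python sketch shows the idea with a toy in-memory store; SortedRows is a hypothetical class, not the HBase API, and the half-open range (start included, stop excluded) is the convention assumed here.

```python
import bisect

class SortedRows:
    """Toy row store that keeps row keys in sorted (byte-wise) order."""

    def __init__(self):
        self.keys = []    # sorted list of row keys
        self.values = {}  # row key -> {column: value}

    def put(self, row_key, column, value):
        if row_key not in self.values:
            bisect.insort(self.keys, row_key)
            self.values[row_key] = {}
        self.values[row_key][column] = value

    def get(self, row_key):
        return self.values.get(row_key)

    def delete(self, row_key):
        if row_key in self.values:
            del self.values[row_key]
            self.keys.pop(bisect.bisect_left(self.keys, row_key))

    def scan(self, start_row, stop_row):
        """Yield rows with start_row <= key < stop_row, in key order."""
        lo = bisect.bisect_left(self.keys, start_row)
        hi = bisect.bisect_left(self.keys, stop_row)
        for key in self.keys[lo:hi]:
            yield key, self.values[key]

store = SortedRows()
for key in [b"user3", b"user1", b"order7", b"user2"]:
    store.put(key, "info:seen", b"1")
store.delete(b"user2")
scanned_keys = [key for key, _ in store.scan(b"user", b"user~")]
```

The scan returns only the keys that fall inside the requested range, in sorted order, regardless of the order in which they were inserted.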
3. HBase physical model
- Each column family is stored separately, in its own files on HDFS;
- The row key and version number are stored once in each column family;
- Null values will not be saved.
- HBase maintains a multi-level index for each value, namely: <key, column family, column name, timestamp>
- 1. All rows in the Table are arranged in dictionary order according to the row key;
- 2. Table is divided into multiple Regions in the row direction;
- 3. Regions are split by size. Each table starts with a single region. As data is added, the region grows; when it reaches a threshold, it is split into two new regions, so the number of regions increases over time;
- 4. Region is the smallest unit of distributed storage and load balancing in HBase. Different Regions are distributed to different RegionServers;
- 5. Although a Region is the smallest unit of distributed storage, it is not the smallest unit of storage (the smallest unit of data storage is the cell).
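The splitting behavior described above can be sketched in Python. This is a simplified model: the threshold is counted in rows rather than bytes, the split point is simply the middle key, and ToyRegion and ToyRegionedTable are hypothetical names.

```python
import bisect

class ToyRegion:
    def __init__(self, start_key, keys=None):
        self.start_key = start_key      # inclusive lower bound of the key range
        self.keys = sorted(keys or [])  # row keys held by this region

class ToyRegionedTable:
    """Toy model: a table starts with one region and splits regions that grow too large."""

    def __init__(self, max_rows_per_region):
        self.max_rows = max_rows_per_region  # hypothetical threshold, in rows
        self.regions = [ToyRegion(b"")]      # one region covering the whole key space

    def _region_index(self, row_key):
        # The owning region is the last one whose start key is <= row_key.
        starts = [r.start_key for r in self.regions]
        return bisect.bisect_right(starts, row_key) - 1

    def put(self, row_key):
        idx = self._region_index(row_key)
        region = self.regions[idx]
        if row_key not in region.keys:
            bisect.insort(region.keys, row_key)
        if len(region.keys) > self.max_rows:
            self._split(idx)

    def _split(self, idx):
        # Split at the middle key: the lower half stays, the upper half moves.
        region = self.regions[idx]
        mid = len(region.keys) // 2
        upper = ToyRegion(region.keys[mid], region.keys[mid:])
        region.keys = region.keys[:mid]
        self.regions.insert(idx + 1, upper)

table = ToyRegionedTable(max_rows_per_region=4)
for i in range(10):
    table.put(f"row{i:02d}".encode())
region_sizes = [len(r.keys) for r in table.regions]
region_starts = [r.start_key for r in table.regions]
```

After ten sequential inserts the single initial region has become four, each covering a contiguous range of row keys.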
- Region consists of one or more Stores, each store stores a column family;
- Each Store consists of one memStore and zero or more StoreFiles;
- The memStore resides in memory, while StoreFiles are stored on HDFS.
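The relationship between a memStore and its StoreFiles can be sketched as follows. This is a toy model in Python: the flush threshold is counted in entries, a plain list stands in for files on HDFS, and ToyStore is a hypothetical name. It assumes that writes go to the memStore first and are flushed to a new StoreFile once the memStore fills up.

```python
class ToyStore:
    """Toy model of one Store: an in-memory memStore plus immutable StoreFiles."""

    def __init__(self, flush_threshold):
        self.flush_threshold = flush_threshold  # hypothetical limit, in entries
        self.mem_store = {}    # recent writes, held in memory
        self.store_files = []  # flushed snapshots; stands in for files on HDFS

    def put(self, key, value):
        self.mem_store[key] = value
        if len(self.mem_store) >= self.flush_threshold:
            self.flush()

    def flush(self):
        if self.mem_store:
            # Write the memStore out as a sorted, immutable StoreFile.
            self.store_files.append(tuple(sorted(self.mem_store.items())))
            self.mem_store = {}

    def get(self, key):
        if key in self.mem_store:  # the newest data is checked first
            return self.mem_store[key]
        for store_file in reversed(self.store_files):  # then newest file to oldest
            for k, v in store_file:
                if k == key:
                    return v
        return None

store = ToyStore(flush_threshold=2)
store.put(b"a", b"1")
store.put(b"b", b"2")  # reaches the threshold and triggers a flush
store.put(b"a", b"3")  # the newer value stays in the memStore
```

A read consults the memStore before any StoreFile, so the latest write for a key is returned even when an older value has already been flushed.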
4. HBase basic architecture
HBase basic components
Client:
- Provides the interfaces for accessing HBase and maintains a cache to speed up access to HBase
ZooKeeper:
- Ensures that there is only one Master in the cluster at any time
- Stores the addressing entry point for all Regions
- Monitors Region servers coming online and going offline in real time, and notifies the Master in real time
- Stores the HBase schema and table metadata
Master:
- Assign region to Region server
- Responsible for load balancing of Region server
- Discover the failed Region server and reallocate the regions on it
- Handle users' requests to create, delete, modify, and query tables
Region Server:
- Region server maintains regions and handles IO requests to these regions
- The Region server is responsible for splitting regions that become too large during operation.
ZooKeeper role
HBase relies on ZooKeeper.
By default, HBase manages the ZooKeeper instances itself, for example starting and stopping ZooKeeper.
The Master and the RegionServers register with ZooKeeper when they start.
With ZooKeeper, the Master is no longer a single point of failure.
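The coordination role described here can be sketched with a toy Python model. ToyCoordinator is a hypothetical stand-in and not the ZooKeeper API; it assumes that additional masters wait as standbys and that the next standby is promoted when the active Master is lost.

```python
class ToyCoordinator:
    """Toy stand-in for the coordination ZooKeeper provides; not the ZooKeeper API."""

    def __init__(self):
        self.active_master = None
        self.standby_masters = []
        self.region_servers = set()

    def register_master(self, name):
        # The first master to register becomes active; later ones wait as standbys.
        if self.active_master is None:
            self.active_master = name
        else:
            self.standby_masters.append(name)
        return self.active_master == name

    def register_region_server(self, name):
        self.region_servers.add(name)

    def master_lost(self):
        # When the active master disappears, promote the next standby, if any.
        if self.standby_masters:
            self.active_master = self.standby_masters.pop(0)
        else:
            self.active_master = None
        return self.active_master

zk = ToyCoordinator()
became_active = [zk.register_master("master-a"), zk.register_master("master-b")]
zk.register_region_server("rs1")
new_master = zk.master_lost()
```

At most one name is ever held in active_master, which mirrors the guarantee that the cluster has only one Master at any time.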
Write-Ahead Log (WAL)
HBase fault tolerance
Master fault tolerance: ZooKeeper elects a new Master
- While there is no Master, data reads proceed as usual;
- While there is no Master, region splitting, load balancing, and similar operations cannot be performed;
RegionServer fault tolerance: each RegionServer periodically sends heartbeats to ZooKeeper; if no heartbeat arrives within a timeout period:
- The Master reassigns the Regions on that RegionServer to other RegionServers;
- The write-ahead log on the failed server is split by the Master and sent to the new RegionServers.
ZooKeeper fault tolerance: ZooKeeper is a reliable service
- Typically, 3 or 5 ZooKeeper instances are configured.
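The RegionServer failure handling above can be sketched as a heartbeat timeout followed by region reassignment. This Python model is a simplification: timestamps are plain numbers, reassignment is round-robin over the surviving servers, write-ahead log splitting is left out, and ToyCluster and its method names are hypothetical.

```python
class ToyCluster:
    """Toy model of heartbeat-based failure detection and region reassignment."""

    def __init__(self, timeout):
        self.timeout = timeout    # hypothetical heartbeat timeout, in seconds
        self.last_heartbeat = {}  # server name -> time of last heartbeat
        self.assignments = {}     # region name -> server name

    def heartbeat(self, server, now):
        self.last_heartbeat[server] = now

    def assign(self, region, server):
        self.assignments[region] = server

    def expire_and_reassign(self, now):
        """Drop servers whose heartbeat is overdue and move their regions."""
        dead = sorted(s for s, t in self.last_heartbeat.items()
                      if now - t > self.timeout)
        for server in dead:
            del self.last_heartbeat[server]
        live = sorted(self.last_heartbeat)
        if not live:
            return dead  # nowhere to move the regions
        orphaned = sorted(r for r, s in self.assignments.items() if s in dead)
        for i, region in enumerate(orphaned):
            self.assignments[region] = live[i % len(live)]  # round-robin
        return dead

cluster = ToyCluster(timeout=30)
for server in ["rs1", "rs2", "rs3"]:
    cluster.heartbeat(server, now=0)
cluster.assign("regionA", "rs1")
cluster.assign("regionB", "rs2")
cluster.assign("regionC", "rs2")
cluster.assign("regionD", "rs3")
cluster.heartbeat("rs1", now=40)
cluster.heartbeat("rs3", now=40)  # rs2 stays silent
dead_servers = cluster.expire_and_reassign(now=50)
```

Only the regions that were hosted on the silent server move; regions on healthy servers keep their assignments.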
Region location
: To find the RegionServer for a row, the lookup goes through ZooKeeper -> -ROOT- (a single Region) -> .META. -> user table
-ROOT-
- This table lists the regions of the .META. table. It only ever has one Region;
- The location of the -ROOT- table is recorded in ZooKeeper.
.META.
- This table lists all user-space regions and the addresses of the RegionServers that host them.
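The lookup chain can be sketched in Python as a series of dictionary lookups plus a client-side cache. Everything in this model is hypothetical: the server names, the table contents, and the locate function. A real client would contact the servers over the network instead of reading local dictionaries.

```python
import bisect

# Hypothetical cluster state; every server name and key below is made up.
zookeeper = {"-ROOT- location": "rs1"}
root_table = [(b"", "rs2")]  # the single -ROOT- region points at the .META. region
meta_table = {               # user table -> sorted (region start key, server) pairs
    "users": [(b"", "rs3"), (b"m", "rs4")],
}

def locate(table, row_key, cache):
    """Return (server, hops) for the region that holds row_key."""
    # The client cache is checked first to avoid repeating the full lookup.
    for (cached_table, start, end), server in cache.items():
        if cached_table == table and start <= row_key and (end is None or row_key < end):
            return server, ["client cache"]
    root_server = zookeeper["-ROOT- location"]  # 1. ZooKeeper gives the -ROOT- location
    meta_server = root_table[0][1]              # 2. -ROOT- gives the .META. location
    regions = meta_table[table]                 # 3. .META. lists the user regions
    starts = [start for start, _ in regions]
    idx = bisect.bisect_right(starts, row_key) - 1
    start, server = regions[idx]
    end = starts[idx + 1] if idx + 1 < len(starts) else None
    cache[(table, start, end)] = server         # remember the region's key range
    hops = ["ZooKeeper", f"-ROOT- on {root_server}", f".META. on {meta_server}"]
    return server, hops

cache = {}
first = locate("users", b"nina", cache)
second = locate("users", b"zoe", cache)
third = locate("users", b"adam", cache)
```

The first lookup walks all three levels, the second is answered from the client cache because it falls in the same region, and the third walks the chain again because its row key belongs to a different region.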