HBase application scenarios, principles and basic architecture

1. HBase overview

  • HBase is a distributed, column-oriented storage system built on HDFS;
  • HBase is an important member of the Apache Hadoop ecosystem and is mainly used for massive structured data storage;
  • Logically, HBase stores data in tables, rows, and columns.

HDFS is suitable for batch processing scenarios, but it has limitations:

  • Does not support random lookup of data;
  • Not suitable for incremental data processing;
  • Does not support data updates.

Features of HBase tables:

  • Large: A table can have billions of rows and millions of columns;
  • Schema-less: Each row has a sortable primary key and any number of columns. Columns can be added dynamically as needed, and different rows in the same table can have completely different columns;
  • Column-oriented: Storage and permission control are organized by column (family), and each column (family) is retrieved independently;
  • Sparse: Empty (null) columns do not occupy storage space, so a table can be designed to be very sparse;
  • Multiple data versions: The data in each cell can have multiple versions. By default the version number is assigned automatically and is the timestamp at which the cell was inserted;
  • Single data type: All data in HBase is stored as uninterpreted byte arrays (often treated as strings) and has no type.

Comparison between row storage and column storage:

Traditional row database:

  • Data is stored row by row
  • Queries without indexes use a lot of I/O
  • Building indexes and materialized views takes a lot of time and resources
  • To meet query performance requirements, the database often has to be scaled up massively.

Column database:

  • Data is stored in columns - each column is stored separately
  • The data is the index
  • Only the columns involved in a query are accessed - system I/O is significantly reduced
  • Each column is processed by a thread - concurrent processing of queries
  • Consistent data types and similar data characteristics - efficient compression
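
The I/O difference between the two layouts can be illustrated with a small Python sketch. The sample data and the function names are made up for illustration, and counting "values read" is only a rough stand-in for disk I/O.

```python
# Hypothetical sample data: three records with three fields each.
rows = [
    {"id": 1, "name": "alice", "age": 30},
    {"id": 2, "name": "bob", "age": 25},
    {"id": 3, "name": "carol", "age": 41},
]

# Column layout: one list per column, in the same order as the rows.
columns = {field: [r[field] for r in rows] for field in rows[0]}


def sum_ages_row_store(table):
    """Row store: every full record is read to get a single field."""
    values_read = 0
    total = 0
    for record in table:
        values_read += len(record)  # the whole row is fetched
        total += record["age"]
    return total, values_read


def sum_ages_column_store(cols):
    """Column store: only the 'age' column is read."""
    ages = cols["age"]
    return sum(ages), len(ages)


print(sum_ages_row_store(rows))        # (96, 9)
print(sum_ages_column_store(columns))  # (96, 3)
```

Both functions return the same sum, but the column layout touches 3 values instead of 9.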

2. HBase data model

HBase is modeled on Google's BigTable and is a typical key/value system.

  • An HBase schema can contain multiple tables, and each table can be composed of multiple column families.
  • HBase supports dynamic columns: the column name is encoded in the cell, so different cells can have different columns.

Rowkey and Column Family

Row Key: The "primary key" of each record in the table, which makes lookups fast. The rowkey of each row must be unique, and rows do not need to be inserted in increasing order.
Column Family: Has a name and contains one or more related columns.
Column: Belongs to a column family and is written as familyName:columnName.
Version Number: Unique for each rowkey and column; the default value is the system timestamp, type Long.
Value (Cell): Byte array.
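
One way to picture this model is as a nested map: rowkey -> column family -> column -> timestamp -> value. The sketch below is plain Python and not the HBase client API. SimpleTable, put and get are hypothetical names, and timestamps are passed explicitly to keep the example deterministic.

```python
from collections import defaultdict


class SimpleTable:
    """Toy model of a table: nested maps keyed by
    rowkey -> family -> column -> timestamp -> value (bytes)."""

    def __init__(self, families):
        # Assumption of this sketch: families are declared up front.
        self.families = set(families)
        self.rows = defaultdict(
            lambda: defaultdict(lambda: defaultdict(dict)))

    def put(self, rowkey, column, value, ts):
        family, qualifier = column.split(":", 1)
        if family not in self.families:
            raise KeyError(f"unknown column family: {family}")
        # Columns inside a family are created on first write.
        self.rows[rowkey][family][qualifier][ts] = value

    def get(self, rowkey, column, ts=None):
        """Return the newest version, or the newest at or before ts."""
        family, qualifier = column.split(":", 1)
        versions = (self.rows.get(rowkey, {})
                    .get(family, {})
                    .get(qualifier, {}))
        candidates = [t for t in versions if ts is None or t <= ts]
        return versions[max(candidates)] if candidates else None


t = SimpleTable(["info"])
t.put(b"row1", "info:name", b"alice", ts=1)
t.put(b"row1", "info:name", b"alicia", ts=2)             # second version
t.put(b"row2", "info:email", b"bob@example.com", ts=1)   # different column
print(t.get(b"row1", "info:name"))        # b'alicia'
print(t.get(b"row1", "info:name", ts=1))  # b'alice'
print(t.get(b"row2", "info:name"))        # None (nothing stored)
```

The last lookup returns None without any placeholder having been stored, which is the sparse behaviour described earlier.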

Operations supported by HBase

  • All operations are based on rowkey;
  • Support CRUD (Create, Read, Update and Delete) and Scan;
  • Single-row operations: Put, Get, Scan
  • Multi-row operations: Scan, MultiPut
  • There is no built-in join operation; joins can be implemented with MapReduce.
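
As a rough illustration of rowkey-based operations, the following Python sketch keeps rowkeys in sorted order and supports put, get, delete and a range scan. It is a toy in-memory model, not the HBase client API. RowStore and its methods are hypothetical names.

```python
import bisect


class RowStore:
    """Toy rowkey-ordered store supporting put, get, delete and scan."""

    def __init__(self):
        self._keys = []   # rowkeys kept in sorted (byte-wise) order
        self._data = {}   # rowkey -> {column: value}

    def put(self, rowkey, column, value):
        if rowkey not in self._data:
            bisect.insort(self._keys, rowkey)
            self._data[rowkey] = {}
        self._data[rowkey][column] = value

    def get(self, rowkey):
        return self._data.get(rowkey)

    def delete(self, rowkey):
        if rowkey in self._data:
            del self._data[rowkey]
            self._keys.remove(rowkey)

    def scan(self, start=b"", stop=None):
        """Yield (rowkey, row) for start <= rowkey < stop, in key order."""
        i = bisect.bisect_left(self._keys, start)
        while i < len(self._keys):
            key = self._keys[i]
            if stop is not None and key >= stop:
                break
            yield key, self._data[key]
            i += 1


store = RowStore()
for key in (b"user3", b"user1", b"user2", b"video1"):
    store.put(key, "info:kind", key[:-1])
store.delete(b"user2")
print([k for k, _ in store.scan(b"user", b"user~")])  # [b'user1', b'user3']
```

Because every operation is addressed by rowkey and rows are kept sorted, a scan over a key prefix only needs a start and stop key.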

3. HBase physical model

  • Each column family is stored in a separate file on HDFS;
  • The row key and version number are stored once in each column family;
  • Null values are not stored;
  • HBase maintains a multi-level index for each value, namely: <key, column family, column name, timestamp>
  • 1. All rows in the Table are arranged in dictionary order according to the row key;
  • 2. Table is divided into multiple Regions in the row direction;
  • 3. Regions are split by size. Each table starts with only one region. As data is added the region grows, and when it reaches a threshold it is split into two new regions, so the number of regions keeps increasing;
  • 4. Region is the smallest unit of distributed storage and load balancing in HBase. Different Regions are distributed to different RegionServers;
  • 5. Although a Region is the smallest unit of distributed storage, it is not the smallest unit of storage (the smallest unit of data storage is the cell).
    • A Region consists of one or more Stores, and each Store holds one column family;
    • Each Store is composed of one memStore and zero or more StoreFiles;
    • The memStore is kept in memory and StoreFiles are stored on HDFS.
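
The splitting behaviour described in points 1 to 3 can be sketched in Python as follows. This is a simplified model under stated assumptions: row count stands in for on-disk size, the split point is the middle rowkey, and Region, Table and max_region_rows are hypothetical names rather than HBase classes or settings.

```python
class Region:
    """Toy region covering the rowkey range [start, end)."""

    def __init__(self, start, end, rows=None):
        self.start, self.end = start, end   # end=None means unbounded
        self.rows = dict(rows or {})

    def size(self):
        return len(self.rows)  # row count stands in for bytes on disk

    def split(self):
        """Split at the middle rowkey into two adjacent regions."""
        keys = sorted(self.rows)
        mid = keys[len(keys) // 2]
        left = {k: v for k, v in self.rows.items() if k < mid}
        right = {k: v for k, v in self.rows.items() if k >= mid}
        return Region(self.start, mid, left), Region(mid, self.end, right)


class Table:
    def __init__(self, max_region_rows):
        self.max_region_rows = max_region_rows  # hypothetical threshold
        self.regions = [Region(b"", None)]      # starts with one region

    def _find(self, rowkey):
        for i, r in enumerate(self.regions):
            if r.start <= rowkey and (r.end is None or rowkey < r.end):
                return i
        raise KeyError(rowkey)

    def put(self, rowkey, value):
        i = self._find(rowkey)
        region = self.regions[i]
        region.rows[rowkey] = value
        if region.size() > self.max_region_rows:
            # Replace the oversized region with its two halves.
            self.regions[i:i + 1] = region.split()


table = Table(max_region_rows=4)
for n in range(10):
    table.put(b"row%02d" % n, b"v")
for r in table.regions:
    print(r.start, r.end, r.size())
```

After ten sequential inserts the single starting region has been split into four adjacent key ranges, each of which could be served by a different RegionServer.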

4. HBase basic architecture

HBase basic components

Client:

  • Contains the interfaces for accessing HBase and maintains a cache to speed up access to HBase

Zookeeper:

  • Ensure that there is only one master in the cluster at any time
  • Store the addressing entries of all Regions
  • Monitor Region servers going online and offline in real time, and notify the Master
  • Store HBase schema and table metadata

Master:

  • Assign region to Region server
  • Responsible for load balancing of Region server
  • Discover the failed Region server and reallocate the regions on it
  • Handle users' create, delete, modify, and query operations on tables

Region Server:

  • Region server maintains regions and handles IO requests to these regions
  • The Region server is responsible for splitting regions that become too large during operation.

Zookeeper role

HBase relies on ZooKeeper.
By default, HBase manages the ZooKeeper instances itself, for example starting and stopping ZooKeeper.
The Master and RegionServers register with ZooKeeper when they start.
The introduction of ZooKeeper means the Master is no longer a single point of failure.

Write-Ahead Log (WAL)

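A write-ahead log generally works by appending each edit to a durable log before applying it to the in-memory store, so that edits not yet flushed to files can be replayed after a crash. The Python sketch below illustrates that idea together with the memStore and StoreFile roles described in the physical model. It is a toy model under stated assumptions: WritePath, flush_threshold and crash_and_recover are hypothetical names, a Python list stands in for the log on HDFS, and the log is kept per object here only for simplicity.

```python
class WritePath:
    """Toy write path: append to a log first, then update memory."""

    def __init__(self, flush_threshold):
        self.flush_threshold = flush_threshold  # hypothetical, in entries
        self.wal = []         # stands in for the durable log
        self.memstore = {}    # in-memory, lost on a crash
        self.storefiles = []  # immutable flushed files (newest last)

    def put(self, key, value):
        self.wal.append((key, value))  # 1. log the edit
        self.memstore[key] = value     # 2. apply it in memory
        if len(self.memstore) >= self.flush_threshold:
            self.flush()

    def flush(self):
        # Write the sorted memstore contents out as a new file.
        self.storefiles.append(dict(sorted(self.memstore.items())))
        self.memstore = {}
        self.wal = []  # flushed edits no longer need replaying

    def get(self, key):
        if key in self.memstore:
            return self.memstore[key]
        for f in reversed(self.storefiles):  # newest file wins
            if key in f:
                return f[key]
        return None

    def crash_and_recover(self):
        self.memstore = {}            # memory is gone
        for key, value in self.wal:   # replay unflushed edits
            self.memstore[key] = value


s = WritePath(flush_threshold=2)
s.put(b"a", b"1")
s.put(b"b", b"2")   # second entry triggers a flush
s.put(b"a", b"3")   # lives only in the log and the memstore
s.crash_and_recover()
print(s.get(b"a"), s.get(b"b"))  # b'3' b'2'
```

The update to key "a" survives the simulated crash only because it was logged before being applied in memory.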

HBase fault tolerance

Master fault tolerance: ZooKeeper elects a new Master

  • While there is no Master, data reads proceed as usual;
  • While there is no Master, region splitting, load balancing, etc. cannot be performed;

RegionServer fault tolerance: each RegionServer regularly reports heartbeats to ZooKeeper; if no heartbeat arrives within a certain time:

  • The Master reassigns the Regions on that RegionServer to other RegionServers;
  • The write-ahead log on the failed server is split by the Master and sent to the new RegionServers.

Zookeeper fault tolerance: Zookeeper is a reliable service

  • Generally, 3 or 5 Zookeeper instances are configured.
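
The RegionServer failure handling described above can be sketched in Python. This is a simplified model, not HBase or ZooKeeper code: the function names, the timeout value and the round-robin reassignment policy are assumptions made for illustration.

```python
def find_dead_servers(last_heartbeat, now, timeout):
    """Servers whose last heartbeat is older than the timeout."""
    return sorted(s for s, t in last_heartbeat.items() if now - t > timeout)


def reassign_regions(assignments, dead, live):
    """Move regions from dead servers to live ones, round-robin."""
    result = {s: list(regions) for s, regions in assignments.items()
              if s not in dead}
    for s in live:
        result.setdefault(s, [])
    orphaned = [r for s in dead for r in assignments.get(s, [])]
    for i, region in enumerate(orphaned):
        result[live[i % len(live)]].append(region)
    return result


# Hypothetical cluster state: server -> time of last heartbeat (seconds).
last_heartbeat = {"rs1": 100.0, "rs2": 70.0, "rs3": 99.0}
assignments = {"rs1": ["r1"], "rs2": ["r2", "r3"], "rs3": ["r4"]}

dead = find_dead_servers(last_heartbeat, now=101.0, timeout=30.0)
live = sorted(set(assignments) - set(dead))
print(dead)  # ['rs2']
print(reassign_regions(assignments, dead, live))
# {'rs1': ['r1', 'r2'], 'rs3': ['r4', 'r3']}
```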

Region location: Finding the RegionServer -> (ZooKeeper, -ROOT- (single Region), .META., user table)

-ROOT-

  • Contains the list of regions of the .META. table. The -ROOT- table itself only ever has one Region;
  • The location of the -ROOT- table is recorded in Zookeeper.

.META.

  • Contains a list of all user-space regions and the addresses of the RegionServers that host them.
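
The lookup chain ZooKeeper -> -ROOT- -> .META. -> user table can be sketched in Python as below. The cluster contents, the "root-region-server" key and the simplified (table, start rowkey) catalog keys are assumptions for illustration. Real catalog rows are encoded differently, and this layout applies to older HBase releases (the -ROOT- table was removed in 0.96).

```python
import bisect


def locate(catalog, key):
    """catalog: sorted list of (start_key, target) entries.
    Return the target of the last entry whose start_key <= key."""
    starts = [start for start, _ in catalog]
    i = bisect.bisect_right(starts, key) - 1
    return catalog[i][1]


# Hypothetical cluster state.
zookeeper = {"root-region-server": "rs1"}
# -ROOT- (hosted on rs1) points at the server hosting .META.
root_table = {"rs1": [(("", b""), "rs2")]}
# .META. (hosted on rs2) points at the servers hosting user regions.
meta_table = {"rs2": [(("users", b""), "rs3"),
                      (("users", b"m"), "rs4")]}


def find_region_server(table, rowkey):
    key = (table, rowkey)
    root_server = zookeeper["root-region-server"]       # step 1: ZooKeeper
    meta_server = locate(root_table[root_server], key)  # step 2: -ROOT-
    return locate(meta_table[meta_server], key)         # step 3: .META.


print(find_region_server("users", b"alice"))  # rs3
print(find_region_server("users", b"zoe"))    # rs4
```

A client would typically cache the result of such a lookup so that later requests for the same key range go straight to the RegionServer.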

Origin blog.csdn.net/m0_49447718/article/details/129994834