HBase’s functional principles, design ideas, architecture design and source code analysis

Author: Zen and the Art of Computer Programming

1 Introduction

1.1 What is HBase?
HBase is an open source NoSQL data storage system under the Apache Foundation. It runs in a Hadoop environment and provides highly reliable, high-performance data read and write services. HBase has a flexible column-family structure, supports random access to massive data sets, and is suitable for a wide range of non-relational data analysis scenarios.
HBase began life in 2007 as a contrib module of Hadoop (itself one of Apache's top-level projects), passed through the Apache Incubator, and graduated to a top-level Apache project of its own in 2010. Along the way it has carved out a distinctive position among open source storage systems, both commercially and in terms of user demand.
1.2 Why should we learn HBase source code?
After learning the basics of HBase, the next step is to understand its design ideas, architecture, and source code. Reading the HBase source code helps us understand how HBase works internally and deepens our understanding as developers. For example, a beginner who knows Java but is not yet familiar with Hadoop, ZooKeeper, or HDFS can pick up HBase's basic architecture and principles quickly by reading the source. Since HBase is written in Java, a solid command of the language is essential for following the code.
1.3 The learning objectives of this series of tutorials
This series of tutorials focuses on the HBase source code. By analyzing HBase's functional principles, design ideas, architecture, and source code, it aims to help readers understand HBase well enough to apply what they learn to real problems. The specific learning objectives are as follows:

  • Understand the functional overview and features of HBase.
  • Master the working principles of HBase, including cluster architecture, data model, table design and sharding strategy.
  • Learn the Java API of HBase and its implementation principles.
  • In-depth understanding of the internal mechanisms of HBase, including load balancing, RegionServer splitting and merging, data consistency protocols, transaction processing, etc.
  • Expand your programming skills through source code reading and practice.

2. Core concepts and terminology

2.1 Basic concepts

2.1.1 NoSQL

NoSQL (Not Only SQL) means "not just SQL" and refers to non-relational databases. Unlike traditional relational databases, NoSQL systems decouple how data is stored from how it is processed, making data models more flexible and easier to scale out. Many NoSQL systems store data as key-value pairs, so a value can be fetched with a simple key lookup; because no schema has to be defined up front, they can absorb large volumes of heterogeneous data. NoSQL is commonly used to build distributed key-value stores such as Apache Cassandra, MongoDB, and Redis.

2.1.2 Hadoop

Hadoop is a framework that abstracts underlying hardware resources behind a simple, unified set of operations, enabling distributed computation and storage over large-scale data sets. Hadoop rests on three pillars: HDFS, MapReduce, and YARN. HDFS (Hadoop Distributed File System) is a distributed file system for storing massive files. MapReduce is a distributed computing framework that lets users express jobs as Map and Reduce phases, which the framework then runs in parallel. YARN (Yet Another Resource Negotiator) is a resource management framework that allocates cluster resources among applications.

2.1.3 HBase

HBase is an open source NoSQL data storage system under the Apache Foundation, built on top of Hadoop. HBase provides distributed, scalable, low-latency, high-throughput storage and access, and is suitable for storing and querying massive data in distributed environments. HBase supports column-family storage, batch writes, real-time queries, and other features, and it uses Hadoop's HDFS as its underlying storage, so it can run directly on an existing Hadoop server cluster.

2.1.4 Bigtable

Google's Bigtable is a NoSQL database: a distributed, structured data store used for structured, semi-structured, and unstructured data. Bigtable's design divides data by row, column, and timestamp while guaranteeing high availability and high performance. Its underlying storage layer, GFS, is the system after which Hadoop's HDFS was modeled. Although Bigtable has limitations, such as being unable to query arbitrary columns the way a relational database can, its strengths are efficiency, scalability, and reliability.

2.1.5 Cassandra

Apache Cassandra is an open source distributed NoSQL database originally developed at Facebook, and one of the leading NoSQL databases. Cassandra offers high availability, distribution, elastic scaling, and tunable consistency. It uses a replication-based multi-master model (every node can accept reads and writes), delivering high throughput and low latency. Cassandra is written in Java, is flexible and easy to use, and is now a top-level project of the Apache Foundation.

2.1.6 Hypertable

Hypertable is a distributed NoSQL database modeled on Google's Bigtable design and implemented in C++; it was originally developed at Zvents. Hypertable keeps hot data in memory and achieves high read and write speeds, especially when backed by SSDs. It supports high availability and replication, and its storage architecture, layered over a distributed file system such as HDFS, mirrors Bigtable's layering over GFS.

2.1.7 Memcached

Memcached is a high-performance distributed memory object caching system. First developed by Danga Interactive, it is an in-memory key-value store used for caching or session storage. Its speed comes from keeping all data in RAM, which also makes it well suited to multi-threaded applications, and it uses a libevent-based event-driven model. Memcached is released under a BSD-style license and uses a simple, lightweight protocol for message passing. It is now packaged as a standard service on most Linux distributions.

2.1.8 Redis

Redis is an open source, high-performance key-value database. It supports data persistence, master-replica synchronization, high availability, and more. As a memory-based key-value store, it offers data types such as strings, hashes, lists, sets, and sorted sets, and fits scenarios like caching, message queues, counters, and leaderboards. It is written in C and is extremely fast: a single instance can serve on the order of 100,000 requests per second. Redis is currently among the most popular NoSQL databases.

2.2 Data model and table design

The HBase data model looks superficially like a traditional relational table. Each table declares one or more column families (Column Family) when it is created, and every column belongs to exactly one family; families are conventionally given short names such as "cf". Each row is identified by a row key (Row Key) that uniquely determines the row, and rows are stored physically sorted by row key, which is what makes efficient range scans possible. All column families have equal status; HBase has no notion of a "primary" versus "non-primary" column family, and a table typically keeps the number of families small.
A column family can hold any number of columns, and absent cells simply occupy no storage. On disk, each column family of a row is stored separately (one Store per family), so families should group columns that are accessed together. A piece of data is located through the combination of row key, column family, and column qualifier.
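To make the addressing model concrete, here is a minimal sketch using the HBase Java client (2.x API). The table name "user" and column family "cf" are hypothetical; a cell is written and read back via its (row key, family, qualifier) coordinates.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class PutGetExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("user"))) {
            // Write one cell: row key "row1", column cf:name
            Put put = new Put(Bytes.toBytes("row1"));
            put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("name"), Bytes.toBytes("Alice"));
            table.put(put);

            // Read it back by (row key, column family, qualifier)
            Get get = new Get(Bytes.toBytes("row1"));
            Result result = table.get(get);
            byte[] value = result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("name"));
            System.out.println(Bytes.toString(value)); // prints "Alice"
        }
    }
}
```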

2.3 HDFS architecture

HDFS (Hadoop Distributed File System) is a distributed file system that provides highly fault-tolerant, high-throughput file storage. HDFS is a sub-project of the Hadoop project and consists of two main components: the NameNode and the DataNodes. The NameNode manages the file system namespace and, unless NameNode HA is configured, is a single point of failure; DataNodes store the file data blocks. HDFS supports streaming and random reads, but files are write-once: existing contents cannot be modified in place (appending is the only supported mutation). All data is stored in blocks, and the file system exposes data as streams. HDFS is designed for storing very large files and offers high fault tolerance, high reliability, and automatic rack awareness.
HDFS has the following important features:

  • High fault tolerance: HDFS replicates every block across multiple DataNodes. When a DataNode fails, the surviving replicas keep serving reads, and the NameNode re-replicates the lost blocks.
  • High throughput: HDFS can handle a large number of read and write requests with high aggregate throughput.
  • Reliable data transmission: HDFS transfers blocks over TCP/IP connections and protects data integrity with checksums (CRC) that are verified on read.
  • Suitable for batch processing: HDFS is designed for large files that are written once and never modified, which fits batch workloads; large numbers of small files are an anti-pattern, since every file's metadata burdens the NameNode. A minimal usage sketch follows this list.
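As a small illustration of the write-once, stream-oriented access model, here is a sketch using Hadoop's standard FileSystem API. The path /tmp/demo.txt is hypothetical, and fs.defaultFS is assumed to point at a reachable HDFS cluster.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsStreamExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        try (FileSystem fs = FileSystem.get(conf)) {
            Path path = new Path("/tmp/demo.txt");

            // Write once: HDFS files are written as a stream, not modified in place
            try (FSDataOutputStream out = fs.create(path, true)) {
                out.writeBytes("hello hdfs\n");
            }

            // Read many times: random seeks are supported on the read path
            try (FSDataInputStream in = fs.open(path)) {
                in.seek(6); // jump to an arbitrary byte offset
                byte[] buf = new byte[16];
                int n = in.read(buf);
                System.out.println(new String(buf, 0, n)); // prints "hdfs"
            }
        }
    }
}
```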

2.4 ZooKeeper

Apache ZooKeeper, originally a sub-project of Apache Hadoop and now a top-level project, is a distributed coordination service. It provides consistency services for distributed applications and is used to keep distributed processes correct. As an open source coordination tool it offers configuration maintenance, naming, soft routing, failover, notifications, and more. In a distributed system, ZooKeeper maintains shared configuration, detects whether nodes are alive, and notifies other servers so that cluster state stays consistent. ZooKeeper runs as an ensemble of servers consisting of one leader and several followers; when the leader fails, the ensemble elects a new one. The system guarantees eventual consistency: the state of all servers eventually converges to a consistent view.
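The liveness detection described above is typically built on ephemeral znodes. Below is a minimal sketch with the standard ZooKeeper Java client, assuming a ZooKeeper server on localhost:2181 and a pre-created persistent parent znode /servers; the child path /servers/rs1 is hypothetical.

```java
import java.util.concurrent.CountDownLatch;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class LivenessExample {
    public static void main(String[] args) throws Exception {
        CountDownLatch connected = new CountDownLatch(1);
        ZooKeeper zk = new ZooKeeper("localhost:2181", 30_000, event -> {
            if (event.getState() == Watcher.Event.KeeperState.SyncConnected) {
                connected.countDown();
            }
        });
        connected.await();

        // An ephemeral znode vanishes automatically when this session dies,
        // which is how peers detect that the server behind it is gone.
        // Assumes the persistent parent znode /servers already exists.
        zk.create("/servers/rs1", "host:port".getBytes(),
                  ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);

        // Another process can watch the parent to be notified of membership changes.
        zk.getChildren("/servers", event -> System.out.println("membership changed: " + event));

        Thread.sleep(Long.MAX_VALUE); // keep the session (and the znode) alive
    }
}
```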

3. Design principles

3.1 Distributed design

HBase is a distributed database: not a stand-alone system but a highly available service built from multiple servers. HBase scales horizontally, meaning performance can be improved simply by adding nodes. An HBase cluster contains a Master node, which monitors RegionServer status and assigns Regions, and a set of RegionServer nodes, which store and serve the data.

3.1.1 Master

The Master node of HBase mainly has the following functions:

  • Metadata management: The Master tracks how data is distributed, i.e. which Regions belong to which tables and which RegionServers currently serve them.
  • Namespace management: The Master tracks the latest status of all tables in HBase. It can create new tables, delete existing tables, and update table properties.
  • Region assignment rather than query routing: The Master does not sit on the read/write path. Clients find the RegionServer serving a given row by consulting the hbase:meta catalog table (and caching the answer), so reads and writes go directly to RegionServers; the Master's job is to decide which server each Region is assigned to.
  • Load balancing: The Master periodically runs a balancer that moves Regions between RegionServers to even out load across the cluster.

3.1.2 RegionServer

The RegionServer node of HBase stores the data in HBase. It is mainly responsible for the following tasks:

  • Data storage: The RegionServer serves the data of the Regions assigned to it, organized by Region so that storage is used efficiently. The underlying files can reside on HDFS over local disks or on a remote storage system such as Amazon S3.
  • Region splitting: The RegionServer splits data dynamically to make full use of the cluster's resources. When a Region grows past its configured maximum size, the RegionServer splits it into two daughter Regions, each covering half of the parent's row-key range (splits can also be requested manually; see the Admin sketch after this list).
  • Replica management: HBase delegates data redundancy to the underlying HDFS, which keeps multiple replicas of every block on different servers, ensuring redundancy and reliability.
  • Request handling: When a RegionServer receives a client request, it locates the target Region and serves the read or write against it.
  • Failover: When a RegionServer fails, the Master reassigns its Regions to other RegionServers, which replay the failed server's WAL so that no acknowledged write is lost.
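Some of these maintenance actions can also be triggered by hand through the Admin API. A small sketch (HBase 2.x API, hypothetical table "user"): flush forces MemStore contents into HFiles, and split asks for the table's regions to be split rather than waiting for the automatic size-based trigger.

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;

public class AdminOpsExample {
    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Admin admin = conn.getAdmin()) {
            TableName table = TableName.valueOf("user"); // hypothetical table

            // Flush the in-memory data (MemStore) of the table's regions to HFiles
            admin.flush(table);

            // Request a split of the table's regions; normally splits happen
            // automatically when a region grows beyond its configured maximum
            admin.split(table);
        }
    }
}
```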

3.1.3 Communication protocol

Client requests are not funneled through the Master. A client first looks up, in the hbase:meta catalog table, which RegionServer owns the Region containing the requested row (caching the location for later), then sends the request directly to that RegionServer, which processes it and returns the result. HBase nodes talk to each other over HBase's own RPC framework, whose messages have been encoded with Protocol Buffers since HBase 0.96; Protocol Buffers maps structured data to a compact binary encoding that keeps network bandwidth consumption low. A separate Thrift gateway is available so that non-Java clients can also access HBase.

3.2 Sharding mechanism

Data in HBase is distributed across RegionServers so that cluster resources are used efficiently. A table is divided into Regions by row-key range, and the Regions are spread over the RegionServers. Inside a Region there is one Store per column family, and each Store consists of an in-memory MemStore plus a set of immutable StoreFiles (HFiles) on disk. The StoreFile is the smallest physical storage unit in HBase; a Region usually holds multiple StoreFiles, which can live on HDFS over local disks or on a remote storage system such as Amazon S3.
When a record is inserted into HBase, it is routed to the Region whose row-key range contains it. The Region logs the edit to the WAL, buffers it in the MemStore, and later flushes the MemStore to a new StoreFile on disk or remote storage. Because StoreFiles are immutable, an update does not modify an existing file: it simply writes a newer version of the cell, and a delete writes a tombstone marker; background compactions later merge StoreFiles and discard superseded or deleted cells. This log-structured design keeps writes sequential while preserving data integrity and high availability.
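Because row-key ranges define the shards, a table can also be pre-split at creation time so that load is spread over RegionServers immediately. A sketch with the HBase 2.x Admin API; the table name "user", family "cf", and split keys are hypothetical.

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class PreSplitExample {
    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Admin admin = conn.getAdmin()) {
            TableDescriptor desc = TableDescriptorBuilder
                    .newBuilder(TableName.valueOf("user"))
                    .setColumnFamily(ColumnFamilyDescriptorBuilder.of("cf"))
                    .build();

            // Pre-split into four regions at these row-key boundaries, so load is
            // spread across RegionServers from the start instead of waiting for
            // automatic size-based splits.
            byte[][] splitKeys = {
                Bytes.toBytes("g"), Bytes.toBytes("n"), Bytes.toBytes("t")
            };
            admin.createTable(desc, splitKeys);
        }
    }
}
```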

3.3 Consistency protocol

In a distributed system, the state seen by different nodes can diverge, so some consistency mechanism is required. HBase does not run Paxos or a two-phase commit protocol on its data path. Instead, it obtains strong consistency for single-row operations from a simpler invariant: every Region is served by exactly one RegionServer at a time, so all reads and writes for a given row pass through a single node, and every mutation is made durable in the WAL before it is acknowledged to the client.
Consensus does appear in HBase, but one layer down. Cluster coordination is delegated to ZooKeeper, whose ZAB atomic-broadcast protocol plays the role that Paxos plays in other systems. ZooKeeper elects the active Master (backup Masters take over if the active one fails) and tracks cluster membership: each RegionServer registers an ephemeral znode when it starts, and when a RegionServer dies its znode expires, which is how the Master learns that the server's Regions must be reassigned.
Transactional scope is deliberately limited: all mutations to a single row are applied atomically, but HBase does not provide general cross-row or cross-Region distributed transactions.

3.4 Scan command

The Scan command retrieves data from a table over a range of rows. A Scan specifies the table, a start and stop row, filters, which columns to return, caching and batching hints, and so on. The client sends RPC requests to each RegionServer whose Regions intersect the row range; each RegionServer evaluates the scan, applying server-side filters, and streams back the matching rows in row-key order. The client merges the per-Region results, applies any remaining client-side processing, and returns the rows to the user.
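A minimal client-side sketch of such a range scan (HBase 2.x API, hypothetical table "user" and family "cf"); note that the start row is inclusive and the stop row exclusive.

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class ScanExample {
    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table table = conn.getTable(TableName.valueOf("user"))) {
            Scan scan = new Scan()
                    .withStartRow(Bytes.toBytes("row100"))  // inclusive
                    .withStopRow(Bytes.toBytes("row200"))   // exclusive
                    .addColumn(Bytes.toBytes("cf"), Bytes.toBytes("name"));

            // Rows come back in sorted row-key order, merged across regions
            try (ResultScanner scanner = table.getScanner(scan)) {
                for (Result result : scanner) {
                    byte[] v = result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("name"));
                    System.out.println(Bytes.toString(result.getRow()) + " -> " + Bytes.toString(v));
                }
            }
        }
    }
}
```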

4. Source code analysis

This section analyzes the HBase source code in detail. First, we introduce several important modules: HMaster, HRegionServer, HLog, the WAL, and the Client. Then we describe HBase's storage architecture and walk through its storage and write paths. Finally, we explain the read/write flow and the locking mechanisms involved.

4.1 Module introduction

4.1.1 HMaster

HMaster is the Master process of HBase, which is the central controller of the entire HBase cluster. Its main responsibilities are as follows:

  • Maintain metadata information of the HBase cluster, including table location, table status, RegionServer status, etc.
  • Assign Regions to RegionServers and rebalance them; the Master does not itself serve client reads and writes, which go directly to RegionServers.
  • Make decisions on Region splits and merges.
  • Perform HBase-related configuration modifications.
  • Manage HBase services.

4.1.2 HRegionServer

HRegionServer is a RegionServer process of HBase. Its main responsibilities are as follows:

  • Store data in HBase tables.
  • Respond to client read and write requests.
  • Manage Region.
  • Perform failover.

4.1.3 HLog

HLog is the log file HBase uses for its WAL (Write-Ahead Log): the log written in advance of every data update. HLog serves two main purposes:

  • Durability: every mutation is appended to the HLog before it is acknowledged, so an edit survives a RegionServer crash even while it exists only in the MemStore.
  • Failure recovery: when a RegionServer dies, its HLog is split and replayed by the servers that take over its Regions, restoring any edits that had not yet been flushed to HFiles.

4.1.4 WAL

WAL (Write-Ahead Log) names the general mechanism of which HLog is HBase's implementation: write the log first, apply the change afterwards. When a RegionServer receives an update, it appends the edit to the WAL and syncs it before applying the edit to the in-memory MemStore; the MemStore is flushed to disk as HFiles only later. If the RegionServer crashes, any edits that were still only in memory are recovered by replaying the WAL.
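Clients can tune this durability/speed trade-off per mutation. In the sketch below (HBase 2.x API, same hypothetical "user" table as earlier), one Put keeps the default synced-WAL durability while another skips the WAL entirely, trading crash safety for write speed.

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class WalDurabilityExample {
    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table table = conn.getTable(TableName.valueOf("user"))) {
            // Default behavior: the edit is appended to the WAL and synced
            // before the client is acknowledged
            Put safe = new Put(Bytes.toBytes("row1"));
            safe.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("name"), Bytes.toBytes("Alice"));
            safe.setDurability(Durability.SYNC_WAL);
            table.put(safe);

            // Opt out of the WAL for bulk or re-creatable data: faster writes,
            // but edits still only in the MemStore are lost on a crash
            Put risky = new Put(Bytes.toBytes("row2"));
            risky.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("name"), Bytes.toBytes("Bob"));
            risky.setDurability(Durability.SKIP_WAL);
            table.put(risky);
        }
    }
}
```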

4.1.5 Client

Client is HBase's client library. The native client interface is Java; other languages such as Python are typically served through the Thrift or REST gateways. Its main responsibilities are as follows:

  • Encapsulates the HBase client interface.
  • Perform data operations.

4.2 HBase storage architecture

Data in HBase is addressed as a RowKey plus ColumnFamily:Qualifier. The RowKey locates the row, the ColumnFamily is the logical unit in which columns are grouped and physically stored, and the Qualifier names a column within its family. HBase data lives on RegionServers organized as tables, and a table consists of multiple Regions. A Region is a contiguous, sorted range of row keys: it stores every row between its start key and end key. A row is made up of Cells; each Cell is identified by (row key, column family, qualifier, timestamp) and carries a value, with the timestamp doubling as the cell's version number. HBase uses the BlockCache to cache recently accessed data blocks, which reduces the number of HDFS reads and improves performance.
HBase storage architecture in outline:

  • A table (Table) is divided into multiple Regions by row-key range.
  • A Region contains one Store per column family (ColumnFamily); each StoreFile belongs to exactly one family.
  • A Store consists of an in-memory MemStore plus zero or more StoreFiles (HFiles), the physical storage units of HBase.
  • StoreFiles reside in the underlying filesystem, typically HDFS on local disks, or a remote store such as Amazon S3.
  • Each column family can hold any number of columns (Column).
  • Each Cell carries a timestamp that serves as its version number (demonstrated in the sketch below).
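Versions can be observed directly from the client. A sketch (HBase 2.x API, hypothetical table "user"); note that the column family must have been created with more than one retained version (VERSIONS > 1) for multiple cells to come back.

```java
import java.util.List;
import org.apache.hadoop.hbase.Cell;
import org.apache.hadoop.hbase.CellUtil;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class VersionsExample {
    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table table = conn.getTable(TableName.valueOf("user"))) {
            // Ask for up to 3 versions of cf:name; each version is a separate Cell
            Get get = new Get(Bytes.toBytes("row1")).readVersions(3);
            Result result = table.get(get);

            List<Cell> cells = result.getColumnCells(Bytes.toBytes("cf"), Bytes.toBytes("name"));
            for (Cell cell : cells) {
                // Each Cell pairs a timestamp (its version) with a value
                System.out.println(cell.getTimestamp() + " -> "
                        + Bytes.toString(CellUtil.cloneValue(cell)));
            }
        }
    }
}
```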

4.3 Storage process

When a client initiates a read or write, the request does not pass through the Master. The client locates the owning RegionServer via the hbase:meta table (the location is cached after the first lookup) and sends the request there. The RegionServer finds the target Region for the requested operation (get, put, scan) and checks that the Region is online; if it does not exist or is closed (for example, mid-split or mid-move), the RegionServer returns an exception and the client retries after refreshing its cached location. Otherwise the RegionServer performs the operation; for a write, it first appends the edit to the WAL and persists the WAL.
The processing flow of write operations is as follows:

1. The RegionServer appends the client's edit to the WAL (HLog) and syncs it to disk.
2. If the current WAL file has reached its size limit, the RegionServer rolls it and starts a new HLog file.
3. The RegionServer applies the edit to the target Region's MemStore, the in-memory write buffer.
4. The RegionServer acknowledges success to the client; at this point the edit is durable in the WAL and visible via the MemStore.
5. When the MemStore reaches its flush threshold (hbase.hregion.memstore.flush.size, 128 MB by default in recent versions), the RegionServer flushes it in the background to a new HFile (StoreFile) on HDFS.
6. When a Store accumulates too many HFiles, a compaction merges them into fewer, larger files.

The processing flow of the read operation is as follows:

1. The client sends the request to the RegionServer that serves the row's Region.
2. The RegionServer looks for matching cells in the BlockCache, the MemStore, and the Region's HFiles.
3. The RegionServer merges results from the MemStore and the HFiles, with newer versions shadowing older ones; since every row belongs to exactly one Region, no other node needs to be consulted.
4. The RegionServer returns the merged result to the client.
5. The client receives the data and hands it to the caller.

The processing flow of the Scan operation is as follows:

1. The client sends Scan requests to each RegionServer whose Regions intersect the scan's row range.
2. The RegionServer reads the matching data from the MemStore and HFiles of each Region.
3. The RegionServer applies filtering and returns rows in sorted row-key order.
4. The client merges the per-Region results and returns them to the caller.

4.4 Writing process

Write flow chart (figure not reproduced here). Detailed steps of the write operation:

  • The client looks up the Region's location in the hbase:meta table (cached after the first lookup) and sends the Put directly to the owning RegionServer; the Master is not on the write path.
  • The RegionServer appends the edit to the WAL (HLog) and syncs the log to disk, rolling to a new HLog file if the current one is full.
  • The RegionServer writes the edit into the Region's MemStore, an in-memory buffer rather than a file.
  • The RegionServer acknowledges the Put to the client.
  • When the MemStore exceeds its flush threshold, the RegionServer flushes it in the background to a new HFile on HDFS.
  • When too many HFiles accumulate, a compaction merges them into fewer, larger files.

4.5 Reading and writing process

Read and write flow chart (figure not reproduced here). Detailed steps of the read and write operations:

  • The client looks up the Region's location in the hbase:meta table and sends the Get/Put directly to the owning RegionServer.
  • For a read, the RegionServer first checks the MemStore and the BlockCache; if the requested cells are found there, it answers without touching disk.
  • Otherwise, the RegionServer reads the relevant StoreFiles in HDFS, merges their contents with the MemStore data, and returns the result to the client.
  • If no matching data exists, the RegionServer returns an empty Result; a miss is not reported as an exception in HBase (see the sketch after this list).
  • The client receives the response and returns the result to the caller.
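A sketch of the miss behavior (HBase 2.x API, hypothetical table "user"): a Get for a missing row simply yields an empty Result.

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class MissingRowExample {
    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table table = conn.getTable(TableName.valueOf("user"))) {
            Result result = table.get(new Get(Bytes.toBytes("no-such-row")));
            // A miss comes back as an empty Result rather than an error
            System.out.println(result.isEmpty() ? "row not found" : result);
        }
    }
}
```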

4.6 Lock mechanism

To prevent conflicts when multiple clients update the same data concurrently, HBase combines pessimistic and optimistic techniques.

  • Pessimistic locking: inside the RegionServer, every mutation briefly acquires an internal row lock, so only one writer can modify a given row at a time; the lock is released as soon as the mutation has been applied. These locks are scoped to a single row; HBase does not expose global, column-family-level, or user-defined locks.
  • Optimistic concurrency: for read-modify-write patterns, clients use atomic compare-and-set operations such as checkAndPut/checkAndMutate, which apply a mutation only if a cell still holds an expected value, or server-side atomic operations such as Increment and Append (a sketch follows this list).
  • MVCC: for readers, HBase uses multi-version concurrency control, so each read observes a consistent snapshot of a row without blocking concurrent writers.
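The optimistic path is exposed to applications through check-and-mutate. A sketch using the HBase 2.x builder API (hypothetical "user" table): the Put is applied on the server only if cf:state still holds the expected value.

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class CheckAndPutExample {
    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table table = conn.getTable(TableName.valueOf("user"))) {
            byte[] row = Bytes.toBytes("row1");
            byte[] cf  = Bytes.toBytes("cf");

            Put put = new Put(row);
            put.addColumn(cf, Bytes.toBytes("state"), Bytes.toBytes("paid"));

            // Atomic compare-and-set on the server: the Put is applied only if
            // cf:state still equals "pending"; otherwise another writer won the race.
            boolean applied = table.checkAndMutate(row, cf)
                    .qualifier(Bytes.toBytes("state"))
                    .ifEquals(Bytes.toBytes("pending"))
                    .thenPut(put);
            System.out.println(applied ? "updated" : "lost the race, retry");
        }
    }
}
```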

Summary

This article has surveyed HBase's functionality, key terminology, and storage architecture, and, through the source code, examined its functional principles, design ideas, architecture, storage path, write path, read/write flow, and locking mechanisms. I hope the material proves inspiring and that readers can apply what they have learned to practical problems in their daily work.
