Big Data Ecosystem Products (3) - HBase Architecture and High-Performance Storage

1. The birth of HBase

  Google published three papers, on GFS, MapReduce, and BigTable, collectively known as the "Troika", which opened the era of big data. This article introduces HBase, the NoSQL system corresponding to BigTable, and how it handles large-scale data.

1.1 Design Model

  In the field of computer data storage, relational databases have long been dominant. Many application systems are designed around the database, which leads to the object model being held hostage by the relational model, and in turn to the long-running dispute between the anemic domain model and the rich domain model for business objects.

  Today, almost all web projects are built on the anemic-model style of development; even the official demos of the Java Spring framework follow it. There are three main reasons why this style has been accepted by the majority of programmers:

  1. In most cases, the business logic of the systems we develop is relatively simple;
  2. The rich model is harder to design than the anemic model, and over-design should be avoided;
  3. Habits of thinking have solidified, and changing them has a cost.

  If we instead apply a DDD development model based on the rich domain model, the development process changes completely. In this mode, we must clarify all the business up front and define the attributes and methods of the domain model. The domain model acts as a reusable business middle layer, and new features are developed on top of these previously defined domain models.

1.2 Non-relational database NoSQL

  To address the shortcomings of relational databases, the industry proposed many alternatives, the most famous being object databases. The fate of these databases only served to further demonstrate the entrenched position of relational databases.

  Things did not improve until people ran into the insurmountable flaws of relational databases: poor capacity for massive data and rigid design constraints. Starting with Google's BigTable, a series of databases capable of storing and accessing massive data were designed, and the concept of NoSQL was proposed.

  NoSQL mainly refers to non-relational, distributed database designs that support massive data storage. Many experts hold that NoSQL is a supplement to relational databases rather than a replacement. HBase is an outstanding representative of NoSQL systems.

  HBase can handle massive data because its design philosophy differs from that of traditional relational databases. Relational databases impose many constraints on the data they store: to learn them you must learn the database normal forms, and part of the business logic ends up embedded in the data storage itself. NoSQL databases, by contrast, take the blunt position that a database stores data and business logic belongs in the application.

2. Scalable architecture of HBase

  HBase is designed for scalable massive data storage, and its scalability mainly depends on its splittable HRegion and scalable distributed file system HDFS.
[Figure: HBase scalable architecture]

2.1 HRegion

  HRegion is the basic unit in HBase responsible for data storage. Applications read and write data by communicating with an HRegion.

  Data is managed in units of HRegion. To access a piece of data, an application must first locate the HRegion holding it, then submit the read or write operation to that HRegion, which carries out the operation at the storage level.

2.2 HRegionServer

  HRegionServer is the server process running on each physical machine; multiple HRegion instances can be hosted on each HRegionServer.

  When too much data has been written to an HRegion and a configured threshold is reached, the HRegion is split into two, and HRegions are migrated across the cluster to balance the load among HRegionServers.
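The split decision described above can be sketched in a few lines. This is a toy model, not HBase's actual implementation: the threshold here counts rows (real HBase uses a byte size such as `hbase.hregion.max.filesize`), and the split point is simply the middle key of the sorted row keys.

```python
# Hypothetical sketch of an HRegion split: once a region exceeds a
# configured threshold it is split at its midpoint key into two
# daughter regions covering the lower and upper halves of the range.
SPLIT_THRESHOLD = 4  # rows; real HBase thresholds are sizes in bytes

def maybe_split(region):
    """region: dict with 'rows' mapping row key -> value."""
    keys = sorted(region["rows"])
    if len(keys) <= SPLIT_THRESHOLD:
        return [region]                 # below threshold: no split
    mid = keys[len(keys) // 2]          # split point (middle key)
    left = {k: v for k, v in region["rows"].items() if k < mid}
    right = {k: v for k, v in region["rows"].items() if k >= mid}
    return [{"rows": left}, {"rows": right}]

r = {"rows": {f"k{i}": i for i in range(6)}}
a, b = maybe_split(r)
print(sorted(a["rows"]), sorted(b["rows"]))  # ['k0','k1','k2'] ['k3','k4','k5']
```

After a real split, the two daughter regions may then be moved to less-loaded HRegionServers, which is the load balancing the text describes.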

2.3 HMaster

  Each HRegion stores the data whose keys fall within a half-open range [key1, key2).

  Information about every HRegion, including its key range and the address and port of its HRegionServer, is recorded on the HMaster server.

  To ensure high availability of the HMaster, HBase starts multiple HMaster instances and elects the active master through ZooKeeper.

  An application obtains the address of the active HMaster from ZooKeeper, submits a key to obtain the address of the HRegionServer holding that key, and then requests the HRegion on that HRegionServer for the data.
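The key-to-region routing step above amounts to a binary search over sorted region start keys. The sketch below is a hypothetical in-memory model of the routing metadata (region names and server addresses are invented); it only illustrates the half-open-range lookup, not HBase's real META-table mechanics.

```python
import bisect

# Hypothetical routing metadata: each HRegion covers a half-open key
# range [start, end) and lives on some HRegionServer address.
REGIONS = [
    ("",     "key2", "rs1:16020"),   # (-inf, key2)
    ("key2", "key5", "rs2:16020"),   # [key2, key5)
    ("key5", "",     "rs3:16020"),   # [key5, +inf)
]
START_KEYS = [r[0] for r in REGIONS]  # must stay sorted

def locate_region(key):
    """Return (start, end, server) for the HRegion holding `key`."""
    # Rightmost region whose start key is <= key.
    idx = bisect.bisect_right(START_KEYS, key) - 1
    return REGIONS[idx]

print(locate_region("key3"))  # ('key2', 'key5', 'rs2:16020')
```

Once the server address is known, the client caches it and talks to that HRegionServer directly, so the HMaster lookup is not on every request's hot path.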

  The sequence is as follows:
[Figure: sequence diagram of an HBase key lookup via HMaster]

2.4 Data writing process

  As with reading, a write must first locate the target HRegion before the operation can proceed.

  An HRegion stores its data in a number of files in HFile format, which are kept on the HDFS distributed file system and are therefore distributed and highly available across the cluster.

  When an HRegion accumulates too much data, the HRegion and its HFiles are split into two HRegions, which are then migrated according to server load in the cluster.

  If a new server, that is, a new HRegionServer, joins the cluster, its low load causes HRegions to be migrated to it and recorded in HMaster. This is how HBase achieves linear scalability.

3. Scalable data model of HBase

  To guarantee the correctness of relational operations (via SQL statements), traditional relational databases require the table structure, including field names and data types, to be specified at design time, following specific normal forms. These constraints make the schema hard to extend; even pre-designing redundant fields cannot keep up with incremental requirements.

  The ColumnFamily design used by NoSQL databases is one solution. Column families were first used in Google's BigTable, which is a sparse-matrix storage format organized by column family.
[Figure: HBase column-family layout of a student information table]
  The figure above shows a table of students' basic information. Different students have different contact methods and elective courses, and more of both may be added to the table in the future.

  With a NoSQL database that supports column families, you only need to specify the column-family names when creating a table, not the fields; the fields are specified when writing the data. A table can therefore contain millions of fields, and the application's data structure can be extended at will.

  Such a database is also convenient to query: you can query by specifying any field name and value.

  In the column-family design, HBase actually stores the field name together with the field value as a key-value pair. At write time the field name can be chosen freely, so even millions of fields are easy to handle.
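A toy model of this layout makes the point concrete. The class below is an illustrative sketch, not HBase's API: only the column families are fixed at table-creation time, while qualifiers (field names) are invented per write and stored alongside the value as key-value pairs, so the "schema" grows with the data.

```python
from collections import defaultdict

class ColumnFamilyTable:
    """Toy sparse column-family store: schema fixes families only."""

    def __init__(self, *families):
        self.families = set(families)       # declared at table creation
        self.rows = defaultdict(dict)       # row -> {(family, qualifier): value}

    def put(self, row, family, qualifier, value):
        if family not in self.families:
            raise KeyError(f"unknown column family: {family}")
        # The qualifier is free-form and stored with the value, key-value style.
        self.rows[row][(family, qualifier)] = value

    def get(self, row, family, qualifier):
        return self.rows[row].get((family, qualifier))

t = ColumnFamilyTable("info", "contact")
t.put("student1", "contact", "qq", "123456")    # field invented at write time
t.put("student2", "contact", "wechat", "abc")   # different field, same family
print(t.get("student1", "contact", "qq"))       # 123456
```

Note that `student1` and `student2` carry different fields with no NULL padding, which is exactly the sparse-matrix property the text attributes to BigTable's column families.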

4. High performance storage of HBase

  We know the access characteristics of traditional mechanical disks: sequential reads and writes are fast, while random reads and writes are slow, because disk seeks take a long time. If data is not stored contiguously, the head must keep moving, wasting a great deal of time.

  To improve write speed, HBase stores data using a structure called the LSM tree, short for Log-Structured Merge Tree.

4.1 Data storage

  When data is written, it is appended sequentially in log style, and the multiple LSM trees on disk are merged asynchronously.
[Figure: LSM tree structure in HBase]
  An LSM tree can be viewed as an N-level merge tree. Writes are always performed in memory: an insert creates a new record, an update writes a new version of the data, and a delete writes a deletion marker.

  A sorted tree is kept in memory; when its data volume exceeds a configured threshold, it is merged with the most recent sorted tree on disk.

  When that on-disk sorted tree in turn exceeds its threshold, it is merged with the sorted tree at the next level down on disk.

  During merging, old data is overwritten by the most recently updated data.

4.2 Data reading

  When data is read, the in-memory sorted tree is searched first; if the key is not found there, the sorted trees on disk are searched in order.

  An update on an LSM tree requires no disk access and completes entirely in memory.

  When the workload is write-dominated and reads concentrate on recently written data, an LSM tree greatly reduces the number of disk accesses and speeds up data access.
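The read path can be sketched the same way. This toy assumes a small hand-built state (an in-memory table plus two sorted on-disk runs, newest first) and shows why recent data is cheap to read: a key in the memtable is returned without touching any run, and the newest run shadows older values.

```python
import bisect

# Hypothetical state: one in-memory table, two sorted runs (newest first).
memtable = {"k5": "new"}
disk_runs = [
    [("k1", "v1"), ("k5", "old")],   # newer run: shadows older values
    [("k2", "v2")],                  # older run
]

def get(key):
    if key in memtable:                      # 1. in-memory sorted tree first
        return memtable[key]
    for run in disk_runs:                    # 2. disk runs, newest to oldest
        keys = [k for k, _ in run]
        i = bisect.bisect_left(keys, key)    # binary search within a sorted run
        if i < len(keys) and keys[i] == key:
            return run[i][1]
    return None                              # not found anywhere

print(get("k5"))  # 'new': the memtable shadows the older on-disk value
```

A recently written key like `"k5"` is answered from memory, while a cold key like `"k2"` costs one probe per run; this is the access pattern under which the text says LSM trees shine.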


Origin blog.csdn.net/initiallht/article/details/124933608