Learn HBase quickly and in more depth; you deserve it!

Chapter 1 Introduction to HBase

1.1 HBase definition

HBase is a distributed, scalable NoSQL database that supports massive data storage.

1.2 HBase data model

Logically, HBase's data model is very similar to that of a relational database: data is stored in a table with rows and columns. But from the perspective of HBase's underlying physical storage structure (key-value), HBase is more like a multi-dimensional map.
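To make the contrast concrete, here is a hedged sketch of how the shell displays data (the table name student, the column family info, and all values and timestamps are made up for illustration). Each logical row is stored physically as one key-value entry per column, keyed by the RowKey, the column family:column qualifier, and a timestamp:

hbase(main):001:0> scan 'student'
ROW                 COLUMN+CELL
 1001               column=info:age, timestamp=1616127000000, value=18
 1001               column=info:name, timestamp=1616127000000, value=zhangsan
 1002               column=info:name, timestamp=1616127001000, value=lisi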

1.2.1 HBase logical structure

 

1.2.2 HBase physical storage structure

 

1.2.3 Data Model

1) Name Space: Similar to the concept of a database in a relational database. There are multiple tables under each namespace. HBase has two built-in namespaces, hbase and default: the hbase namespace stores HBase's built-in tables, and the default namespace is the one used by default for tables created by users.

2) Region: Similar to the table concept in a relational database. The difference is that when defining a table, HBase only requires the column families to be declared; there is no need to declare specific columns. This means that when writing data to HBase, fields can be specified dynamically and on demand. Therefore, compared with relational databases, HBase can easily cope with field changes.

3) Row: Each row of data in an HBase table consists of one RowKey and multiple Columns. Data is stored in lexicographical order of the RowKey, and data can only be retrieved by RowKey when querying, so the design of the RowKey is very important.

4) Column: Each column in HBase is qualified by a Column Family and a Column Qualifier, for example info:name or info:age. When creating a table, only the column families need to be specified; column qualifiers do not need to be defined in advance.

5) Time Stamp: Used to identify different versions of the data. When a piece of data is written, if no timestamp is specified, the system automatically adds this field, with its value being the time the data was written to HBase.

6) Cell: The unit uniquely determined by {rowkey, column family:column qualifier, timestamp}. The data in a cell has no type and is stored as raw bytes. (A small shell example of these concepts follows this list.)
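As a minimal, hedged shell sketch of these concepts (the namespace bigdata, the table student, and the column family info are assumed names used only for illustration):

hbase(main):001:0> create_namespace 'bigdata'
hbase(main):002:0> create 'bigdata:student', 'info'
hbase(main):003:0> put 'bigdata:student', '1001', 'info:name', 'zhangsan'
hbase(main):004:0> put 'bigdata:student', '1001', 'info:age', '18'
hbase(main):005:0> get 'bigdata:student', '1001', 'info:name'

Only the column family info was declared when the table was created; the qualifiers name and age were supplied on the fly at write time, and each put also records a timestamp that identifies the version of the cell.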

1.3 HBase basic architecture

(Figure: HBase basic architecture, incomplete version)

Architectural roles:

1) Region Server: The RegionServer is the manager of Regions; its implementation class is HRegionServer. Its main functions are: operations on data: get, put, delete; operations on Regions: splitRegion, compactRegion.

2) Master: The Master is the manager of all RegionServers; its implementation class is HMaster. Its main functions are: operations on tables: create, delete, alter; operations on RegionServers: assigning Regions to each RegionServer, monitoring the status of each RegionServer, load balancing, and failover.

3) Zookeeper: HBase uses Zookeeper for Master high availability, RegionServer monitoring, the entry point to metadata (the location of hbase:meta), and maintenance of the cluster configuration.

4) HDFS: HDFS provides the underlying data storage service for HBase and, at the same time, provides high-availability support for the data. (See the shell sketch below for a quick way to check these roles.)
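A quick, hedged way to see these roles on a running cluster is the shell's status command (the output below is only an illustration; the numbers depend on your cluster):

hbase(main):001:0> status
1 active master, 0 backup masters, 3 servers, 0 dead, 2.0000 average load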

Chapter 2 HBase Quick Start

2.1 HBase installation and deployment

You can refer to my blog:

Big data platform-HBase installation and configuration

Tip: If the clocks of the cluster nodes are not synchronized, the RegionServer cannot start and a ClockOutOfSyncException will be thrown.

2.2 HBase Shell operation

2.2.1 Basic operation

1. Enter the HBase client command line 

[root@m1 bin]# hbase shell

2. View help commands 

hbase(main):001:0> help

3. List the tables in the current database

hbase(main):002:0> list

2.2.2 Table operation

You can refer to my blog:

Must-master [HBase Shell]
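For convenience, here is a hedged sketch of the most common table operations (the table student and the column family info are assumed names; see the linked post for details and full output):

hbase(main):001:0> create 'student', 'info'
hbase(main):002:0> describe 'student'
hbase(main):003:0> put 'student', '1001', 'info:name', 'zhangsan'
hbase(main):004:0> scan 'student'
hbase(main):005:0> get 'student', '1001'
hbase(main):006:0> delete 'student', '1001', 'info:name'
hbase(main):007:0> disable 'student'
hbase(main):008:0> drop 'student'

Note that a table must be disabled before it can be dropped.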

 

Chapter 3 HBase Advanced

3.1 Architecture Principle

1) StoreFile: The physical file that stores the actual data. StoreFiles are stored on HDFS in HFile format. Each Store has one or more StoreFiles (HFiles), and the data within each StoreFile is ordered.

2) MemStore: The write cache. Since the data in an HFile is required to be ordered, data is first stored in the MemStore and sorted there; when the flush time is reached, it is flushed to an HFile, and each flush produces a new HFile.

3) WAL: Data must be sorted in the MemStore before being flushed to an HFile, but keeping data only in memory carries a high risk of data loss. To solve this problem, the data is first written to a file called the Write-Ahead Log (WAL) and then written to the MemStore, so that when the system fails the data can be rebuilt from this log file. (The sketch below shows where these files live on HDFS.)
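A hedged way to see these files on disk, assuming the default hbase.rootdir of /hbase and a user table named student in the default namespace (paths may differ on your cluster):

[root@m1 bin]# hdfs dfs -ls /hbase/data/default/student
[root@m1 bin]# hdfs dfs -ls /hbase/WALs

The first path lists the Region directories of the table; inside each Region there is one folder per column family holding its HFiles (StoreFiles). The second path holds the Write-Ahead Logs, one directory per RegionServer.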

3.2 Writing process

Writing process:

1) The client first accesses Zookeeper to obtain which RegionServer hosts the hbase:meta table.

2) Access the corresponding RegionServer, read the hbase:meta table, and, according to the namespace:table/rowkey of the request, find out which Region of which RegionServer holds the target data. The Region information of the table and the location of the meta table are cached in the client's meta cache to speed up the next access.

3) Communicate with the target RegionServer;

4) Write (append) data sequentially to WAL;

5) Write the data to the corresponding MemStore, where the data will be sorted;

6) Send an ack to the client;

7) After the MemStore flush time is reached, flush the data to an HFile. (The sketch below shows how to inspect the routing metadata used in steps 1 and 2.)
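To peek at the routing metadata used in steps 1 and 2, the following hedged shell sketch can help (the LIMIT value is arbitrary; zk_dump prints the Zookeeper view of the cluster, including where hbase:meta is hosted):

hbase(main):001:0> zk_dump
hbase(main):002:0> scan 'hbase:meta', {LIMIT => 3}

Each row of hbase:meta describes one Region of a user table along with the RegionServer that currently serves it.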

3.3 MemStore Flush

MemStore flush timing:

1. When the size of one of a Region's MemStores reaches hbase.hregion.memstore.flush.size (default 128M), all MemStores in that Region will be flushed. When the size of the MemStore reaches hbase.hregion.memstore.flush.size (default 128M) * hbase.hregion.memstore.block.multiplier (default 4), further writes to the MemStore are blocked.

2. When the total size of the MemStores in a RegionServer reaches java_heapsize * hbase.regionserver.global.memstore.size (default 0.4) * hbase.regionserver.global.memstore.size.lower.limit (default 0.95), Regions are flushed in order of their MemStore size (from largest to smallest) until the total MemStore size in the RegionServer falls below that value. When the total size of the MemStores in a RegionServer reaches java_heapsize * hbase.regionserver.global.memstore.size (default 0.4), further writes to all MemStores are blocked.

3. A MemStore flush is also triggered when the automatic flush interval is reached. The interval is configured by hbase.regionserver.optionalcacheflushinterval (default 1 hour).

4. When the number of WAL files exceeds hbase.regionserver.max.logs, Regions are flushed in chronological order until the number of WAL files drops below hbase.regionserver.max.logs (this property is now obsolete and no longer needs to be set manually; its maximum value is 32). (A flush can also be triggered manually, as sketched below.)
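As a rough worked example with the default values (and an assumed 10G RegionServer heap): writes to a Region are blocked once its MemStores reach 128M * 4 = 512M, the RegionServer-wide flush kicks in at about 10G * 0.4 * 0.95 = 3.8G, and all writes are blocked at 10G * 0.4 = 4G. A flush can also be triggered manually from the shell (the table name student is assumed):

hbase(main):001:0> flush 'student'

flush also accepts a Region name if only a single Region needs to be flushed.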

3.4 Reading process

Reading process:

1) The client first accesses Zookeeper to obtain which RegionServer hosts the hbase:meta table.

2) Access the corresponding RegionServer, read the hbase:meta table, and, according to the namespace:table/rowkey of the read request, find out which Region of which RegionServer holds the target data. The Region information of the table and the location of the meta table are cached in the client's meta cache to speed up the next access.

3) Communicate with the target RegionServer;

4) Query the target data in the BlockCache (read cache), the MemStore, and the StoreFiles (HFiles), and merge all the data found. Here, "all the data" refers to the different versions (timestamps) or different types (Put/Delete) of the same piece of data.

5) Cache the data blocks (Block, the unit of HFile data storage, 64KB by default) read from the files into the BlockCache.

6) Return the merged final result to the client. (The sketch below shows how multiple versions can be read back.)
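A hedged sketch of reading several versions back, assuming the student table and info column family from earlier (the column family must keep more than one version for this to return several entries):

hbase(main):001:0> alter 'student', {NAME => 'info', VERSIONS => 3}
hbase(main):002:0> put 'student', '1001', 'info:name', 'zhangsan'
hbase(main):003:0> put 'student', '1001', 'info:name', 'lisi'
hbase(main):004:0> get 'student', '1001', {COLUMN => 'info:name', VERSIONS => 3}

The get merges what is currently in the MemStore with what is already in the StoreFiles, which illustrates the merge described in step 4.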

3.5 StoreFile Compaction

Since each MemStore flush generates a new HFile, and different versions (timestamps) and different types (Put/Delete) of the same field may be spread across different HFiles, a query has to traverse all the HFiles.

To reduce the number of HFiles and to clean up expired and deleted data, StoreFile Compaction is performed. Compaction comes in two types: Minor Compaction and Major Compaction. A Minor Compaction merges several adjacent, smaller HFiles into one larger HFile, but does not clean up expired or deleted data. A Major Compaction merges all the HFiles under a Store into one large HFile and does clean up expired and deleted data.
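Both kinds of compaction can also be requested manually from the shell; a hedged sketch, again assuming a table named student (the commands only queue the compaction, the actual work happens in the background):

hbase(main):001:0> compact 'student'
hbase(main):002:0> major_compact 'student'

compact queues a Minor Compaction and major_compact queues a Major Compaction for every Store of the table.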


3.6 Region Split

By default, each Table has only one Region at the beginning. As data is continuously written, the Region splits automatically. When splitting, both daughter Regions stay on the current RegionServer, but for load-balancing reasons the HMaster may later move a Region to another RegionServer.

Region Split timing:

1. When the total size of all StoreFiles under one Store of a Region exceeds hbase.hregion.max.filesize, that Region will split (before version 0.94).

2. When the total size of all StoreFiles under one Store of a Region exceeds Min(R^2 * hbase.hregion.memstore.flush.size, hbase.hregion.max.filesize), the Region will split, where R is the number of Regions of the current Table on the current RegionServer (since version 0.94). (A worked example and a manual split are sketched below.)
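As a rough worked example with the default values (flush size 128M, max file size 10G): the first split of a table happens around 1^2 * 128M = 128M, the second around 2^2 * 128M = 512M, the third around 3^2 * 128M = 1152M, and so on, until R^2 * 128M exceeds 10G (around R = 9), after which hbase.hregion.max.filesize governs. A Region can also be split manually; a hedged shell sketch with an assumed table student and an assumed split key '1005':

hbase(main):001:0> split 'student'
hbase(main):002:0> split 'student', '1005'
hbase(main):003:0> alter 'student', MAX_FILESIZE => '10737418240'

The alter line shows a per-table override of hbase.hregion.max.filesize (10G expressed in bytes).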


 

Chapter 4 HBase API

To be continued



Origin blog.csdn.net/qq_46009608/article/details/110951222