Huawei Cloud HBase Hot and Cold Separation Best Practices

This article is shared from the Huawei Cloud Community: "Huawei Cloud HBase Hot and Cold Separation Best Practices" by pippo.

Introduction to HBase

HBase is short for Hadoop Database. It is a distributed, column-oriented database built on the Hadoop Distributed File System (HDFS). It is highly reliable, high-performance, and scalable, and it provides fast random access to massive data sets.

HBase adopts a Master/Slave architecture consisting of HMaster nodes, RegionServer nodes, and a ZooKeeper cluster. The underlying data is stored on HDFS.

The overall architecture is shown in the figure:

HMaster is mainly responsible for:

  • In HA mode, HMaster consists of an active Master and a standby Master.
  • Active Master: Manages the RegionServers in HBase. This includes creating, deleting, modifying, and querying tables; balancing load across RegionServers and adjusting Region distribution; splitting Regions and assigning the resulting Regions; and migrating Regions after a RegionServer fails.
  • Standby Master: When the active Master fails, the standby Master takes over and provides external services. After the fault is recovered, the original active Master becomes the standby.

RegionServer is mainly responsible for:

  • Stores and manages its local HRegions.
  • Provides services such as reading and writing table data. It is the data processing and computing unit of HBase and interacts directly with the Client.
  • Is generally deployed together with the DataNodes of the HDFS cluster to implement data storage. It reads from and writes to HDFS and manages the data in tables.

The ZooKeeper cluster is mainly responsible for:

  • Stores the metadata and cluster status information of the entire HBase cluster.
  • Implements failover between the active and standby HMaster nodes.

The HDFS cluster is mainly responsible for:

  • HDFS provides highly reliable file storage services for HBase, and all HBase data is stored in HDFS.

Structure description:

Store

  • A Region consists of one or more Stores, and each Store corresponds to a Column Family in the figure.

MemStore

  • A Store contains a MemStore. The MemStore caches the data inserted by the client into the Region. When the MemStore size in the RegionServer reaches the configured capacity limit, the RegionServer will "flush" the data in the MemStore to HDFS.

StoreFile

  • MemStore data becomes a StoreFile after being flushed to HDFS. As data is inserted, one Store generates multiple StoreFiles. When the number of StoreFiles reaches the configured threshold, the RegionServer merges (compacts) multiple StoreFiles into one large StoreFile.
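The flush and merge behaviour described for MemStore and StoreFile can be illustrated with a small toy model. This is only a sketch under simplifying assumptions: the class `ToyStore` and its parameters `flush_size` and `compaction_threshold` are hypothetical names, thresholds are counted in entries rather than bytes, and nothing is written to HDFS.

```python
class ToyStore:
    """Toy model of one Store: a MemStore plus a list of StoreFiles.

    Hypothetical illustration only. Real HBase flushes on byte size and
    writes HFiles to HDFS; here everything stays in memory.
    """

    def __init__(self, flush_size=2, compaction_threshold=3):
        self.flush_size = flush_size                      # entries per flush (assumption)
        self.compaction_threshold = compaction_threshold  # StoreFiles that trigger a merge
        self.memstore = {}                                # in-memory write cache
        self.store_files = []                             # oldest file first

    def put(self, row_key, value):
        """Cache a write and flush once the MemStore is full."""
        self.memstore[row_key] = value
        if len(self.memstore) >= self.flush_size:
            self.flush()

    def flush(self):
        """Turn the MemStore into one sorted, immutable StoreFile."""
        if not self.memstore:
            return
        self.store_files.append(sorted(self.memstore.items()))
        self.memstore = {}
        if len(self.store_files) >= self.compaction_threshold:
            self.compact()

    def compact(self):
        """Merge all StoreFiles into one; newer files override older ones."""
        merged = {}
        for store_file in self.store_files:
            merged.update(store_file)
        self.store_files = [sorted(merged.items())]

    def get(self, row_key):
        """Read from the MemStore first, then from the newest StoreFile back."""
        if row_key in self.memstore:
            return self.memstore[row_key]
        for store_file in reversed(self.store_files):
            for key, value in store_file:
                if key == row_key:
                    return value
        return None
```

With `flush_size=2` and `compaction_threshold=3`, six distinct puts produce three flushes, and the third flush triggers a merge that leaves a single StoreFile holding all six rows.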

HFile

  • HFile defines the storage format of StoreFile in the file system. It is the specific implementation of StoreFile in the current HBase system.

HLog (WAL)

  • HLog ensures that data written by users is not lost when a RegionServer fails. All Regions on a RegionServer share the same HLog.

HBase provides two APIs to write data.

  • Put: Data is sent directly to RegionServer.
  • BulkLoad: Load HFile directly into the table storage path.

HBase hot and cold separation requirements

In massive-data scenarios, part of the business data in a table is kept only as archive data, or is accessed less and less frequently over time. This historical data, such as order data or monitoring data, is often very large. Reducing the storage cost of this part of the data can save an enterprise a great deal of money.

The hot and cold separation function supports storing hot and cold data on different media. The storage type of cold data is ordinary IO storage, and the storage type of hot data is ultra-high IO storage. The price of ordinary IO storage is only 30% of that of ultra-high IO storage, which greatly reduces storage costs.
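As a rough illustration of the saving, the 30% price ratio can be plugged into a simple cost estimate. This is a sketch only: the function name `storage_costs`, the data volume, the cold fraction, and the unit price in the example are made-up inputs, not Huawei Cloud pricing.

```python
def storage_costs(total_gb, cold_fraction, hot_price_per_gb, cold_price_ratio=0.30):
    """Return (all_hot_cost, separated_cost) for one billing period.

    cold_price_ratio is the cold-storage price as a fraction of the hot
    price; 0.30 is the ratio quoted in the article. All other inputs are
    illustrative assumptions.
    """
    if not 0.0 <= cold_fraction <= 1.0:
        raise ValueError("cold_fraction must be between 0 and 1")
    cold_gb = total_gb * cold_fraction
    hot_gb = total_gb - cold_gb
    all_hot_cost = total_gb * hot_price_per_gb
    separated_cost = (hot_gb * hot_price_per_gb
                      + cold_gb * hot_price_per_gb * cold_price_ratio)
    return all_hot_cost, separated_cost


# Hypothetical example: 10,000 GB, 80% of it cold, 1.0 currency unit per GB.
all_hot, separated = storage_costs(10_000, 0.8, 1.0)
saving = 1 - separated / all_hot
```

For these example inputs, the separated layout costs about 4,400 units instead of 10,000, a saving of roughly 56%. The larger the cold fraction, the closer the saving gets to the 70% upper bound implied by the price ratio.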

Introduction to HBase hot and cold separation

HBase supports storing hot and cold data separately within the same table. After the user configures a hot/cold time boundary on the table, HBase uses the timestamp (in milliseconds) of the written data together with that boundary to decide whether the data is hot or cold. Data is initially stored in hot storage and migrates to cold storage as it ages. Users can change the boundary at any time, and data can then move from hot storage to cold storage or from cold storage back to hot storage.

The overall architecture is shown in the figure:

[Figure: HBase hot and cold separation architecture]

Command introduction

Set the hot and cold dividing line of the table

Create a hot and cold separation table:

hbase(main):002:0> create 'hot_cold_table', {NAME=>'f', COLD_BOUNDARY=>'86400'}

Parameter Description:

NAME: Column family that requires hot and cold separation.

COLD_BOUNDARY: The hot/cold boundary, in seconds (s). For example, a COLD_BOUNDARY of 86400 means that data written more than 86400 seconds (one day) ago is automatically archived to cold storage.

Cancel hot and cold separation.

hbase(main):004:0> alter 'hot_cold_table', {NAME=>'f', COLD_BOUNDARY=>""}

Set hot and cold separation for an existing table, or modify the hot and cold separation dividing line, in seconds.

hbase(main):005:0> alter 'hot_cold_table', {NAME=>'f', COLD_BOUNDARY=>'86400'}

Check whether hot and cold separation is set or modified successfully

hbase(main):005:0> desc 'hot_cold_table'

Data writing

Data is written to a hot and cold separated table in exactly the same way as to an ordinary table. Data is first stored in hot storage (ultra-high IO). Over time, once a row satisfies the condition current time - timestamp of the data > COLD_BOUNDARY, it is archived to cold storage (ordinary IO) the next time a Compaction runs.
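The archiving condition can be sketched in a few lines. This is a simplified model, not the CloudTable implementation: the function names `is_cold` and `split_on_compaction` are hypothetical, and cells are represented as plain `(row_key, timestamp_ms, value)` tuples. Note the unit conversion: timestamps are in milliseconds, while COLD_BOUNDARY is in seconds.

```python
import time


def is_cold(cell_ts_ms, cold_boundary_s, now_ms=None):
    """True if current time - cell timestamp > COLD_BOUNDARY.

    cell_ts_ms and now_ms are epoch milliseconds; cold_boundary_s is in
    seconds, as in the COLD_BOUNDARY table attribute.
    """
    if now_ms is None:
        now_ms = int(time.time() * 1000)
    return now_ms - cell_ts_ms > cold_boundary_s * 1000


def split_on_compaction(cells, cold_boundary_s, now_ms):
    """Partition (row_key, timestamp_ms, value) cells into hot and cold lists,
    mimicking what a Compaction would decide at time now_ms."""
    hot, cold = [], []
    for cell in cells:
        _, ts_ms, _ = cell
        (cold if is_cold(ts_ms, cold_boundary_s, now_ms) else hot).append(cell)
    return hot, cold
```

Because the split only happens during Compaction, data that has already aged past the boundary stays in hot storage until the next Compaction runs.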

Insert record

To insert a record with the "put" command, specify the table name, the row key, the column, and the value to insert. The column family must be one that exists in the table; 'hot_cold_table' was created above with the column family 'f'.

hbase(main):004:0> put 'hot_cold_table','row1','f:a','value1'

Parameter Description:

hot_cold_table: The name of the table.

row1: The row key (primary key).

f:a: The column, in the form column family:qualifier.

value1: The inserted value.

Data query

Because hot and cold data are in the same table, all queries run against a single table. When querying, it is recommended to specify the time range with TimeRange. The system chooses the query mode based on that range: hot storage only, cold storage only, or both. If no time range is given, cold data is queried as well, and query throughput is then limited by cold storage.

Random query

Query data without specifying the HOT_ONLY parameter. In this case, the data in cold storage is also queried.

hbase(main):001:0> get 'hot_cold_table', 'row1'

Query data by specifying the HOT_ONLY parameter. In this case, only the data in hot storage will be queried.

hbase(main):002:0> get 'hot_cold_table', 'row1', {HOT_ONLY=>true}

Query data by specifying the TimeRange parameter. In this case, CloudTable compares the TimeRange with the hot and cold boundary value to determine whether to query only hot storage, only cold storage, or both.

hbase(main):003:0> get 'hot_cold_table', 'row1', {TIMERANGE => [0, 1568203111265]}
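The comparison between a TimeRange and the boundary can be pictured with the following sketch. It is an assumption about the logic, not the CloudTable source code: `storage_to_query` is a hypothetical name, the range is treated as half-open `[min, max)`, and the model ignores data that has aged past the boundary but has not yet been moved by a Compaction.

```python
def storage_to_query(min_ts_ms, max_ts_ms, cold_boundary_s, now_ms):
    """Return 'HOT', 'COLD', or 'BOTH' for a time range [min_ts_ms, max_ts_ms).

    A timestamp counts as cold when now - ts > COLD_BOUNDARY, i.e. when
    ts is strictly below the cutoff computed here.
    """
    if min_ts_ms >= max_ts_ms:
        raise ValueError("empty time range")
    cutoff_ms = now_ms - cold_boundary_s * 1000
    if max_ts_ms <= cutoff_ms:
        return "COLD"   # every timestamp in the range is older than the cutoff
    if min_ts_ms >= cutoff_ms:
        return "HOT"    # every timestamp in the range is at or after the cutoff
    return "BOTH"       # the range straddles the cutoff
```

Under this model, a range such as `[0, 1568203111265]` with a one-day boundary resolves to cold storage only whenever the query runs more than a day after the upper bound.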

Range query

Query data without specifying the HOT_ONLY parameter. In this case, the data in cold storage is also queried.

hbase(main):001:0> scan 'hot_cold_table', {STARTROW =>'row1', STOPROW=>'row9'}

Query data by specifying the HOT_ONLY parameter. In this case, only the data in hot storage will be queried.

hbase(main):002:0> scan 'hot_cold_table', {STARTROW =>'row1', STOPROW=>'row9', HOT_ONLY=>true}

Query data by specifying the TimeRange parameter. In this case, CloudTable compares the TimeRange with the hot and cold boundary value to determine whether to query only hot storage, only cold storage, or both.

hbase(main):003:0> scan 'hot_cold_table', {STARTROW =>'row1', STOPROW=>'row9', TIMERANGE => [0, 1568203111265]}
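The TIMERANGE bounds in these commands are epoch timestamps in milliseconds, whereas COLD_BOUNDARY is given in seconds, so it helps to convert explicitly. The helper below is a small sketch; `to_epoch_millis` is a hypothetical name, and it uses integer arithmetic to avoid floating-point rounding.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def to_epoch_millis(dt):
    """Convert a timezone-aware datetime to whole epoch milliseconds."""
    if dt.tzinfo is None:
        raise ValueError("datetime must be timezone-aware")
    # timedelta // timedelta yields an exact integer count of milliseconds
    return (dt - EPOCH) // timedelta(milliseconds=1)


# The upper bound used in the examples above corresponds to this instant (UTC).
upper = to_epoch_millis(datetime(2019, 9, 11, 11, 58, 31, 265000, tzinfo=timezone.utc))
```

Here `upper` equals 1568203111265, the value used as the upper TIMERANGE bound in the shell examples.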

Data merge

  • Merge the hot data areas of all partitions of the table.

    hbase(main):002:0> major_compact 'hot_cold_table', nil, 'NORMAL', 'HOT'

  • Merge the cold data areas of all partitions of the table.

    hbase(main):002:0> major_compact 'hot_cold_table', nil, 'NORMAL', 'COLD'

  • Merge the hot and cold data areas of all partitions of the table.

    hbase(main):002:0> major_compact 'hot_cold_table', nil, 'NORMAL', 'ALL'

HBase hot and cold separation effect

[Figure: HBase hot and cold separation effect]


Origin my.oschina.net/u/4526289/blog/10114074