Big Data HBase Study Bible: a book to help you achieve freedom in learning HBase

Learning objective: become a three-in-one architect

This article is the V1 version of the "Big Data HBase Study Bible", a companion to "Nien's Big Data Interview Collection".

A special note: since the first release of the 5 special-topic PDFs of "Nien's Big Data Interview Guide", it has collected hundreds of questions and a large amount of useful, authentic material for interviews at large companies. "Nien's Big Data Interview Guide" is a collection of interview questions that is becoming a must-read for big data learning and interviewing.

Therefore, the Nien architecture team struck while the iron was hot and launched the "Big Data Flink Study Bible" and the "Big Data HBase Study Bible" (this article).

The "Big Data HBase Study Bible" will be continuously upgraded and iterated, and it will become a must-read book for studying and interviewing in the field of big data.

In the end, we hope to help everyone grow into a three-in-one architect, join a big company, and earn a high salary.

For the PDFs of "Nien Architecture Notes", "Nien High Concurrency Trilogy" and "Nien Java Interview Collection", please go to the official account [Technical Freedom Circle] to obtain them.

"Java+Big Data" Amphibious Architecture Success Case

Success story 1:

Shocking counterattack: Unemployed for 4 months, 3-year-old guy likes to propose a structure offer for 1 month, and he is older and cross-line, super awesome

Success story 2:

Quickly get offers: Ali P6 landed quickly after being laid off, and within 1 month, I would like to mention 2 high-quality offers (including Didi)

Article directory

1. Introduction

In today's digital era, data has become a key resource to promote business, scientific research and social development. With the rapid development of the Internet, IoT and sensor technology, the generation of large-scale data has exploded. This data trend has surpassed the processing capabilities of traditional relational databases. In this new data landscape, distributed NoSQL databases have gradually emerged and become a powerful tool to solve big data storage and processing problems.

1.1 The Value and Challenges of Data

Data has become gold in today's world. Companies use data analysis to gain insight into market trends and predict customer behavior. Scientists use data to study important issues such as climate change and disease spread. However, this massive influx of data also brings huge challenges. Traditional relational databases are often unable to cope with the rapid expansion of data scale, and their data models and architectures cannot meet the needs of large-scale data storage and high-performance processing.

1.2 The rise of NoSQL databases

To meet this challenge, distributed NoSQL (Not Only SQL) databases emerged. Different from traditional relational databases, NoSQL databases adopt a more flexible data model and distributed architecture, can effectively handle massive data, and can be expanded horizontally to meet growing needs. Mainstream NoSQL databases such as MongoDB, Cassandra, and HBase each have unique characteristics and are suitable for different application scenarios.

1.3 Introducing HBase

Among many NoSQL databases, HBase has attracted much attention for its excellent big data storage and real-time query capabilities. HBase is an open source distributed, scalable, high-performance NoSQL database built on the Hadoop ecosystem, known for its performance in processing massive data and supporting random access. By using HBase, users can easily store, manage and retrieve massive data, thereby gaining more business and scientific research value in the big data era.

1.4 Objectives of this article

This article aims to provide beginners with basic knowledge about HBase and help them understand the characteristics, applicable scenarios and basic operations of HBase. From an overview of HBase to advanced operations, we will gradually guide you to gain an in-depth understanding of this powerful distributed NoSQL database, and provide support and guidance on your journey of exploration in the field of big data.

2. HBase Overview

2.1 What is HBase?

HBase (abbreviation for Hadoop Database) is an open source distributed, scalable, high-performance NoSQL database. It is designed based on Google's Bigtable paper and built on the Hadoop ecosystem. HBase is designed to process massive amounts of data and achieve efficient real-time random access on this data. Compared with traditional relational databases, HBase provides a data model and architecture that is more suitable for large-scale data processing.

2.2 Characteristics of HBase

HBase has many unique characteristics that make it ideal for handling large-scale data:

  • Distributed architecture: HBase uses a distributed architecture, and data is divided into multiple Regions and distributed on multiple RegionServers. This enables HBase to scale horizontally and support the storage and processing of massive data.
  • Columnar storage: HBase adopts columnar storage, and data is stored on disk by column, which helps save storage space and improve query efficiency.
  • Sparse data: HBase supports sparse data, which means that each row of data does not need to contain the same columns, which is useful for processing data with different properties.
  • Real-time random access: HBase supports real-time random read and write operations, making it suitable for application scenarios that require low latency, such as real-time analysis and data query.
  • Strong consistency: HBase provides strongly consistent data access to ensure data accuracy and consistency.

2.3 Differences between HBase and traditional relational databases

There are significant differences in the data model and architecture between HBase and traditional relational databases:

  • Data model: Traditional relational databases use a tabular model, and data is stored in structured rows and columns. HBase uses the Bigtable model to store data according to column families, and each column family can contain multiple columns.
  • Architecture: Traditional relational databases are usually based on a single machine. As data grows, vertical expansion may be required. HBase adopts a distributed architecture, supports horizontal expansion, and can easily handle large-scale data.
  • Query language: Traditional relational databases use SQL for query, but HBase does not provide SQL query language. Querying HBase data often requires writing code in Java or other programming languages.
  • Flexibility: HBase is more flexible in data model and architecture, and is suitable for storing and processing various types of data, including structured, semi-structured and unstructured data.

2.4 HBase application scenarios

As a distributed, high-performance NoSQL database, HBase is suitable for a variety of application scenarios; it plays an especially important role where large-scale data must be processed with real-time random access.

2.4.1 Big data storage and processing

HBase's distributed architecture makes it ideal for storing and processing large-scale data. In big data applications, the amount of data may reach or even exceed PB levels, making it difficult for traditional relational databases to cope. HBase's distributed storage and automatic horizontal expansion capabilities enable it to easily cope with the storage and query needs of such large-scale data.

2.4.2 Real-time data analysis

HBase also has advantages for scenarios that require real-time data analysis. Real-time data analysis requires the system to query and obtain data quickly, and HBase supports real-time random read and write operations, allowing it to analyze data immediately when it arrives and draw valuable conclusions.

2.4.3 Log data storage

Many applications generate large amounts of log data, which is largely unstructured and needs to be retained for a long time for subsequent analysis. HBase's sparse data model and efficient storage capabilities make it an ideal choice for storing this log data. With HBase, you can easily store, retrieve, and analyze massive log data.

2.4.4 Time series data storage

Time series data is data indexed by time, such as sensor readings, stock prices, and weather data. HBase's distributed architecture and real-time query capabilities make it very suitable for storing and processing time series data. You can perform quick queries based on timestamps, supporting fast historical data review and real-time monitoring.

2.4.5 High concurrent random access

Some applications need to support high-concurrency random access, and traditional relational databases often cannot meet this demand. One of the design goals of HBase is to achieve high-performance real-time random access. Its distributed architecture and columnar storage allow it to easily cope with highly concurrent read and write requests.

2.4.6 Full text search

Although HBase is not a dedicated full-text search engine, in some cases it can also be used to store full-text index data. By storing index data in HBase, you can achieve fast keyword-based retrieval.

In short, HBase has a wide range of application scenarios, especially in scenarios where large-scale data is processed, real-time requirements are high, and random access is frequent, where it can exert its powerful characteristics. From storing log data to real-time data analysis, from time-series data storage to high-concurrency random access, HBase can provide you with reliable solutions. Next, we will delve into how to install and configure HBase to build an efficient data storage and processing environment for you.

3. Installation and configuration

Reference: Apache HBase Configuration (HBase Chinese documentation)

Installation of HBase is the first step for you to start using this powerful database. In this section, we will introduce you to how to install HBase and explain in detail how to install and configure it in different modes.

3.1 Overview of HBase installation methods

The installation of HBase can be divided into the following methods:

  1. Local mode (Standalone Mode): Local mode is the simplest installation method and is suitable for development and testing on a local standalone machine. In local mode, HBase will run in a single Java process and data is stored in the local file system.
  2. Fully-Distributed Mode: Fully-Distributed Mode is a way to deploy HBase in a real distributed environment. In fully distributed mode, the various components of HBase are distributed across multiple computers to achieve high availability, fault tolerance, and performance scaling.

In the following sections, we will explain in detail the steps and configuration methods of each installation method to help you choose the appropriate installation method according to your actual needs.

3.2 Local mode installation and configuration

Local mode installation is ideal for beginners, as it allows you to quickly experience the basic functionality of HBase without investing in too much configuration. In local mode, HBase will run in a single Java process and data is stored in the local file system.

HBase single-node configuration simulates HBase's distributed storage and computing without multiple computer nodes. By decompressing the HBase installation package on one node and then configuring the relevant files, HBase runs on a single machine and supports testing of data storage and computation. By default, HBase runs in standalone mode. In standalone mode, HBase uses the local file system instead of HDFS.

Steps:

  1. Prepare the environment :
    Make sure the Java Development Kit (JDK) is installed on your system. HBase requires a Java runtime environment.
  2. Download HBase :
    Index of /dist/hbase/2.5.5 (apache.org)
  3. Decompress the HBase package :
    Decompress the downloaded HBase package into a directory of your choice. You can use the following command (assuming the tarball is hbase-2.5.5-bin.tar.gz):
tar -xzvf hbase-2.5.5-bin.tar.gz
  4. Configure HBase :
    Enter the decompressed HBase directory and edit the conf/hbase-site.xml file for configuration. Here is an example configuration:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
    <property>
        <name>hbase.rootdir</name>
        <value>file:///home/docker/hbase/data</value>
    </property>
    <property>
        <name>hbase.zookeeper.property.dataDir</name>
        <value>/home/docker/hbase/zookeeper</value>
    </property>
</configuration>

Please replace the path in the above example with the path where you actually want to store your data.

  5. Start HBase :
    Open a terminal, navigate to the HBase directory, and run the following command to start HBase:
./bin/start-hbase.sh
  6. Access the HBase Shell :
    You can use the HBase Shell to interact with HBase in local mode. In the terminal, run the following command:
./bin/hbase shell

This will open the HBase Shell where you can execute HBase commands.

  7. Stop HBase :
    To stop HBase, go back to the terminal, navigate to the HBase directory, and run the following command:
./bin/stop-hbase.sh

These steps will install and run HBase in local mode. Please note that HBase in local mode is not suitable for production environments, but can be used for learning and development purposes. If you want to use HBase in a distributed environment, more detailed configuration and setup is required.
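To quickly verify a local-mode installation from code, here is a minimal Java sketch (not part of the official steps above). It assumes the hbase-client dependency is on the classpath and that local-mode HBase is running its embedded ZooKeeper on localhost:2181 (the default); it simply lists the existing tables.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;

public class LocalModeSmokeTest {
    public static void main(String[] args) throws Exception {
        // Point the client at the embedded ZooKeeper started by local-mode HBase
        Configuration conf = HBaseConfiguration.create();
        conf.set("hbase.zookeeper.quorum", "localhost");
        conf.set("hbase.zookeeper.property.clientPort", "2181");

        try (Connection connection = ConnectionFactory.createConnection(conf);
             Admin admin = connection.getAdmin()) {
            // Listing table names is a simple end-to-end connectivity check
            for (TableName name : admin.listTableNames()) {
                System.out.println("Table: " + name.getNameAsString());
            }
        }
    }
}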

3.3 Fully distributed mode installation and configuration

Fully distributed mode is a way to deploy HBase in a real distributed environment. It is suitable for scenarios that require processing large-scale data and achieving high availability. In fully distributed mode, the various components of HBase are distributed across multiple computers and configured to achieve high availability, fault tolerance, and performance expansion.

  1. Deploy docker
# Install the yum-config-manager configuration tool
yum -y install yum-utils

# The Aliyun yum mirror is recommended:
#yum-config-manager --add-repo https://download.docker.com/linux/centos/docker-ce.repo
yum-config-manager --add-repo http://mirrors.aliyun.com/docker-ce/linux/centos/docker-ce.repo

# Install docker-ce
yum install -y docker-ce
# Start Docker and enable it at boot
systemctl enable --now docker
docker --version
  2. Deploy docker-compose
curl -SL https://github.com/docker/compose/releases/download/v2.16.0/docker-compose-linux-x86_64 -o /usr/local/bin/docker-compose

chmod +x /usr/local/bin/docker-compose
docker-compose --version
  3. Create network
# Create the network. Note: do not use the name hadoop_network, otherwise the hs2 service will fail to start!!!
docker network create hadoop-network

# Check
docker network ls
  4. Deploy zookeeper
    Create the directories and files:
[root@cdh1 zookeeper]# tree
.
├── docker-compose.yml
├── zk1
├── zk2
└── zk3

3 directories, 1 file

docker-compose.yml

version: '3.7'

# Give the zk cluster a network named hadoop-network
networks:
  hadoop-network:
    external: true

# zk cluster configuration
# Each child entry under services corresponds to one zk node's docker container
services:
  zk1:
    # docker image used by the container
    image: zookeeper
    hostname: zk1
    container_name: zk1
    restart: always
    # Port mapping between the container and the host
    ports:
      - 2181:2181
      - 28081:8080
    # Environment variables for the container
    environment:
      # id of the current zk instance
      ZOO_MY_ID: 1
      # host/port list of the whole zk cluster
      ZOO_SERVERS: server.1=0.0.0.0:2888:3888;2181 server.2=zk2:2888:3888;2181 server.3=zk3:2888:3888;2181
    # Mount container paths onto the host so data is shared between the host and the container
    volumes:
      - ./zk1/data:/data
      - ./zk1/datalog:/datalog
    # Join the container to the hadoop-network
    networks:
      - hadoop-network
   
  zk2:
    image: zookeeper
    hostname: zk2
    container_name: zk2
    restart: always
    ports:
      - 2182:2181
      - 28082:8080
    environment:
      ZOO_MY_ID: 2
      ZOO_SERVERS: server.1=zk1:2888:3888;2181 server.2=0.0.0.0:2888:3888;2181 server.3=zk3:2888:3888;2181
    volumes:
      - ./zk2/data:/data
      - ./zk2/datalog:/datalog
    networks:
      - hadoop-network
   
  zk3:
    image: zookeeper
    hostname: zk3
    container_name: zk3
    restart: always
    ports:
      - 2183:2181
      - 28083:8080
    environment:
      ZOO_MY_ID: 3
      ZOO_SERVERS: server.1=zk1:2888:3888;2181 server.2=zk2:2888:3888;2181 server.3=0.0.0.0:2888:3888;2181
    volumes:
      - ./zk3/data:/data
      - ./zk3/datalog:/datalog
    networks:
      - hadoop-network
wget https://mirrors.tuna.tsinghua.edu.cn/apache/zookeeper/zookeeper-3.8.2/apache-zookeeper-3.8.2-bin.tar.gz --no-check-certificate

Start up:

[root@cdh1 zookeeper]# docker-compose up -d
Creating zk3 ... done
Creating zk2 ... done
Creating zk1 ... done
  5. Download the Hadoop deployment package
git clone https://gitee.com/hadoop-bigdata/docker-compose-hadoop.git
  6. Install and deploy MySQL 5.7

    Here MySQL is mainly used by Hive to store its metadata.

cd docker-compose-hadoop/mysql

docker-compose -f mysql-compose.yaml up -d

docker-compose -f mysql-compose.yaml ps

# The root password is 123456; the login command is below. Note: at work you generally must not type a password in plain text on the command line, or the security team will flag it. Remember this!!!
docker exec -it mysql mysql -uroot -p123456
  7. Install Hadoop and Hive
cd docker-compose-hadoop/hadoop_hive

docker-compose -f docker-compose.yaml up -d

# Check
docker-compose -f docker-compose.yaml ps

# hive
docker exec -it hive-hiveserver2 hive -e "show databases;"

# hiveserver2
docker exec -it hive-hiveserver2 beeline -u jdbc:hive2://hive-hiveserver2:10000  -n hadoop -e "show databases;"

After startup, if you find that the Hadoop historyserver container has not started healthily, you can execute the following commands:

docker exec -it hadoop-hdfs-nn hdfs dfs -chmod 777 /tmp
docker restart hadoop-mr-historyserver 

To refresh the HDFS node lists, you can execute the following commands:

[root@cdh1 ~]# docker exec -it hadoop-hdfs-nn hdfs dfsadmin -refreshNodes
Refresh nodes successful
[root@cdh1 ~]# docker exec -it hadoop-hdfs-dn-0 hdfs dfsadmin -fs hdfs://hadoop-hdfs-nn:9000 -refreshNodes
Refresh nodes successful
[root@cdh1 ~]# docker exec -it hadoop-hdfs-dn-1 hdfs dfsadmin -fs hdfs://hadoop-hdfs-nn:9000 -refreshNodes
Refresh nodes successful
[root@cdh1 ~]# docker exec -it hadoop-hdfs-dn-2 hdfs dfsadmin -fs hdfs://hadoop-hdfs-nn:9000 -refreshNodes
Refresh nodes successful

You can check the HDFS status through the web UI at http://cdh1:30070,

and visit http://cdh1:30888/cluster to check the YARN resource status.

  8. Configure HBase parameters
    mkdir conf
  • conf/hbase-env.sh
export JAVA_HOME=/opt/apache/jdk
export HBASE_CLASSPATH=/opt/apache/hbase/conf
export HBASE_MANAGES_ZK=false
  • conf/hbase-site.xml
<configuration>
    <property>
        <name>hbase.rootdir</name>
        <value>hdfs://hadoop-hdfs-nn:9000/hbase</value>
        <!-- hdfs://ns1/hbase would correspond to the dfs.nameservices value in hdfs-site.xml -->
    </property>

    <property>
        <name>hbase.cluster.distributed</name>
        <value>true</value>
    </property>
    <property>
        <name>hbase.zookeeper.quorum</name>
        <value>zk1,zk2,zk3</value>
    </property>
    <property>
        <name>hbase.zookeeper.property.clientPort</name>
        <value>2181</value>
    </property>

    <property>
        <name>hbase.master</name>
        <value>60000</value>
        <description>In standalone mode configure the hostname/IP and port; in HA mode only the port is needed</description>
    </property>
    <property>
        <name>hbase.master.info.bindAddress</name>
        <value>0.0.0.0</value>
    </property>
    <property>
        <name>hbase.master.port</name>
        <value>16000</value>
    </property>
    <property>
        <name>hbase.master.info.port</name>
        <value>16010</value>
    </property>
    <property>
        <name>hbase.regionserver.port</name>
        <value>16020</value>
    </property>
    <property>
        <name>hbase.regionserver.info.port</name>
        <value>16030</value>
    </property>

    <property>
        <name>hbase.wal.provider</name>
        <value>filesystem</value> <!-- multiwal can also be used -->
    </property>
</configuration>
  • conf/backup-masters
hbase-master-2
  • conf/regionservers
hbase-regionserver-1
hbase-regionserver-2
hbase-regionserver-3
  • conf/hadoop/core-site.xml
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
    Licensed under the Apache License, Version 2.0 (the "License");
    you may not use this file except in compliance with the License.
    You may obtain a copy of the License at

      http://www.apache.org/licenses/LICENSE-2.0

    Unless required by applicable law or agreed to in writing, software
    distributed under the License is distributed on an "AS IS" BASIS,
    WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    See the License for the specific language governing permissions and
    limitations under the License. See accompanying LICENSE file.
  -->

<!-- Put site-specific property overrides in this file. -->

<configuration>
    <!-- NameNode address -->
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://hadoop-hdfs-nn:9000</value>
    </property>

    <!-- File I/O buffer size (128KB); the default is 4KB -->
    <property>
        <name>io.file.buffer.size</name>
        <value>131072</value>
    </property>

    <!-- How long the filesystem trash is kept -->
    <property>
        <name>fs.trash.interval</name>
        <value>1440</value>
    </property>

    <!-- Hadoop temporary directory used for metadata; make sure /opt/apache/hadoop/data/hdfs/ has been created manually, the tmp subdirectory is created automatically -->
    <property>
        <name>hadoop.tmp.dir</name>
        <value>/opt/apache/hadoop/data/hdfs/tmp</value>
    </property>

    <!-- Static user for the HDFS web UI: root -->
    <property>
        <name>hadoop.http.staticuser.user</name>
        <value>root</value>
    </property>

    <!-- Hosts from which the root superuser may act as a proxy -->
    <property>
        <name>hadoop.proxyuser.root.hosts</name>
        <value>*</value>
    </property>

    <!-- Groups whose users root may impersonate -->
    <property>
        <name>hadoop.proxyuser.root.groups</name>
        <value>*</value>
    </property>

    <!-- Users that root may impersonate -->
    <property>
        <name>hadoop.proxyuser.root.user</name>
        <value>*</value>
    </property>

    <!-- Hosts from which the hive proxy user may act as a proxy -->
    <property>
        <name>hadoop.proxyuser.hive.hosts</name>
        <value>*</value>
    </property>

    <!-- Groups whose users hive may impersonate -->
    <property>
        <name>hadoop.proxyuser.hive.groups</name>
        <value>*</value>
    </property>

    <!-- Hosts from which the hadoop proxy user may act as a proxy -->
    <property>
        <name>hadoop.proxyuser.hadoop.hosts</name>
        <value>*</value>
    </property>

    <!-- Groups whose users the hadoop proxy user may impersonate -->
    <property>
        <name>hadoop.proxyuser.hadoop.groups</name>
        <value>*</value>
    </property>
</configuration>
  • conf/hadoop/hdfs-site.xml
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
    Licensed under the Apache License, Version 2.0 (the "License");
    you may not use this file except in compliance with the License.
    You may obtain a copy of the License at

      http://www.apache.org/licenses/LICENSE-2.0

    Unless required by applicable law or agreed to in writing, software
    distributed under the License is distributed on an "AS IS" BASIS,
    WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    See the License for the specific language governing permissions and
    limitations under the License. See accompanying LICENSE file.
  -->

<!-- Put site-specific property overrides in this file. -->

<configuration>
    <!-- NameNode web UI address -->
    <property>
        <name>dfs.namenode.http-address</name>
        <value>0.0.0.0:9870</value>
    </property>

    <!-- dfs.webhdfs.enabled must be set to true, otherwise WebHDFS commands such as LISTSTATUS and LISTFILESTATUS, which list file/directory status, cannot be used, because this information is held by the NameNode. -->
    <property>
        <name>dfs.webhdfs.enabled</name>
        <value>true</value>
    </property>

    <property>
        <name>dfs.namenode.name.dir</name>
        <value>/opt/apache/hadoop/data/hdfs/namenode</value>
    </property>
    <property>
        <name>dfs.datanode.data.dir</name>
        <value>/opt/apache/hadoop/data/hdfs/datanode/data1,/opt/apache/hadoop/data/hdfs/datanode/data2,/opt/apache/hadoop/data/hdfs/datanode/data3</value>
    </property>

    <property>
        <name>dfs.replication</name>
        <value>3</value>
    </property>

    <!-- Host on which the SecondaryNameNode process runs -->
    <property>
        <name>dfs.namenode.secondary.http-address</name>
        <value>hadoop-hdfs-nn2:9868</value>
    </property>

    <property>
        <name>dfs.namenode.datanode.registration.ip-hostname-check</name>
        <value>false</value>
    </property>

    <!-- Whitelist -->
    <property>
        <name>dfs.hosts</name>
        <value>/opt/apache/hadoop/etc/hadoop/dfs.hosts</value>
    </property>

    <!-- Blacklist -->
    <property>
        <name>dfs.hosts.exclude</name>
        <value>/opt/apache/hadoop/etc/hadoop/dfs.hosts.exclude</value>
    </property>

</configuration>

After completing the conf configuration, you need to set read and write permissions

chmod -R 777 conf/
  9. Write the .env environment file
HBASE_MASTER_PORT=16000
HBASE_MASTER_INFO_PORT=16010
HBASE_HOME=/opt/apache/hbase
HBASE_REGIONSERVER_PORT=16020
  10. Orchestrate docker-compose.yaml
version: '3'
services:
  hbase-master-1:
    image: registry.cn-hangzhou.aliyuncs.com/bigdata_cloudnative/hbase:2.5.4
    user: "hadoop:hadoop"
    container_name: hbase-master-1
    hostname: hbase-master-1
    restart: always
    privileged: true
    env_file:
      - .env
    volumes:
      - ./conf/hbase-env.sh:${HBASE_HOME}/conf/hbase-env.sh
      - ./conf/hbase-site.xml:${HBASE_HOME}/conf/hbase-site.xml
      - ./conf/backup-masters:${HBASE_HOME}/conf/backup-masters
      - ./conf/regionservers:${HBASE_HOME}/conf/regionservers
      - ./conf/hadoop/core-site.xml:${HBASE_HOME}/conf/core-site.xml
      - ./conf/hadoop/hdfs-site.xml:${HBASE_HOME}/conf/hdfs-site.xml
    ports:
      - "36010:${HBASE_MASTER_PORT}"
      - "36020:${HBASE_MASTER_INFO_PORT}"
    command: ["sh","-c","/opt/apache/bootstrap.sh hbase-master"]
    networks:
      - hadoop-network
    healthcheck:
      test: ["CMD-SHELL", "netstat -tnlp|grep :${HBASE_MASTER_PORT} || exit 1"]
      interval: 10s
      timeout: 20s
      retries: 3
  hbase-master-2:
    image: registry.cn-hangzhou.aliyuncs.com/bigdata_cloudnative/hbase:2.5.4
    user: "hadoop:hadoop"
    container_name: hbase-master-2
    hostname: hbase-master-2
    restart: always
    privileged: true
    env_file:
      - .env
    volumes:
      - ./conf/hbase-env.sh:${HBASE_HOME}/conf/hbase-env.sh
      - ./conf/hbase-site.xml:${HBASE_HOME}/conf/hbase-site.xml
      - ./conf/backup-masters:${HBASE_HOME}/conf/backup-masters
      - ./conf/regionservers:${HBASE_HOME}/conf/regionservers
      - ./conf/hadoop/core-site.xml:${HBASE_HOME}/conf/core-site.xml
      - ./conf/hadoop/hdfs-site.xml:${HBASE_HOME}/conf/hdfs-site.xml
    ports:
      - "36011:${HBASE_MASTER_PORT}"
      - "36021:${HBASE_MASTER_INFO_PORT}"
    command: ["sh","-c","/opt/apache/bootstrap.sh hbase-master hbase-master-1 ${HBASE_MASTER_PORT}"]
    networks:
      - hadoop-network
    healthcheck:
      test: ["CMD-SHELL", "netstat -tnlp|grep :${HBASE_MASTER_PORT} || exit 1"]
      interval: 10s
      timeout: 20s
      retries: 3
  hbase-regionserver-1:
    image: registry.cn-hangzhou.aliyuncs.com/bigdata_cloudnative/hbase:2.5.4
    user: "hadoop:hadoop"
    container_name: hbase-regionserver-1
    hostname: hbase-regionserver-1
    restart: always
    privileged: true
    env_file:
      - .env
    volumes:
      - ./conf/hbase-env.sh:${HBASE_HOME}/conf/hbase-env.sh
      - ./conf/hbase-site.xml:${HBASE_HOME}/conf/hbase-site.xml
      - ./conf/backup-masters:${HBASE_HOME}/conf/backup-masters
      - ./conf/regionservers:${HBASE_HOME}/conf/regionservers
      - ./conf/hadoop/core-site.xml:${HBASE_HOME}/conf/core-site.xml
      - ./conf/hadoop/hdfs-site.xml:${HBASE_HOME}/conf/hdfs-site.xml
    ports:
      - "36030:${HBASE_REGIONSERVER_PORT}"
    command: ["sh","-c","/opt/apache/bootstrap.sh hbase-regionserver hbase-master-1 ${HBASE_MASTER_PORT}"]
    networks:
      - hadoop-network
    healthcheck:
      test: ["CMD-SHELL", "netstat -tnlp|grep :${HBASE_REGIONSERVER_PORT} || exit 1"]
      interval: 10s
      timeout: 10s
      retries: 3
  hbase-regionserver-2:
    image: registry.cn-hangzhou.aliyuncs.com/bigdata_cloudnative/hbase:2.5.4
    user: "hadoop:hadoop"
    container_name: hbase-regionserver-2
    hostname: hbase-regionserver-2
    restart: always
    privileged: true
    env_file:
      - .env
    volumes:
      - ./conf/hbase-env.sh:${HBASE_HOME}/conf/hbase-env.sh
      - ./conf/hbase-site.xml:${HBASE_HOME}/conf/hbase-site.xml
      - ./conf/backup-masters:${HBASE_HOME}/conf/backup-masters
      - ./conf/regionservers:${HBASE_HOME}/conf/regionservers
      - ./conf/hadoop/core-site.xml:${HBASE_HOME}/conf/core-site.xml
      - ./conf/hadoop/hdfs-site.xml:${HBASE_HOME}/conf/hdfs-site.xml
    ports:
      - "36031:${HBASE_REGIONSERVER_PORT}"
    command: ["sh","-c","/opt/apache/bootstrap.sh hbase-regionserver hbase-master-1 ${HBASE_MASTER_PORT}"]
    networks:
      - hadoop-network
    healthcheck:
      test: ["CMD-SHELL", "netstat -tnlp|grep :${HBASE_REGIONSERVER_PORT} || exit 1"]
      interval: 10s
      timeout: 10s
      retries: 3
  hbase-regionserver-3:
    image: registry.cn-hangzhou.aliyuncs.com/bigdata_cloudnative/hbase:2.5.4
    user: "hadoop:hadoop"
    container_name: hbase-regionserver-3
    hostname: hbase-regionserver-3
    restart: always
    privileged: true
    env_file:
      - .env
    volumes:
      - ./conf/hbase-env.sh:${HBASE_HOME}/conf/hbase-env.sh
      - ./conf/hbase-site.xml:${HBASE_HOME}/conf/hbase-site.xml
      - ./conf/backup-masters:${HBASE_HOME}/conf/backup-masters
      - ./conf/regionservers:${HBASE_HOME}/conf/regionservers
      - ./conf/hadoop/core-site.xml:${HBASE_HOME}/conf/core-site.xml
      - ./conf/hadoop/hdfs-site.xml:${HBASE_HOME}/conf/hdfs-site.xml
    ports:
      - "36032:${HBASE_REGIONSERVER_PORT}"
    command: ["sh","-c","/opt/apache/bootstrap.sh hbase-regionserver hbase-master-1 ${HBASE_MASTER_PORT}"]
    networks:
      - hadoop-network
    healthcheck:
      test: ["CMD-SHELL", "netstat -tnlp|grep :${HBASE_REGIONSERVER_PORT} || exit 1"]
      interval: 10s
      timeout: 10s
      retries: 3
   
# Connect to the external network
networks:
  hadoop-network:
    external: true
  11. Start deployment.
    The current directory structure is as follows:
[root@cdh1 hbase]# tree
.
├── .env
├── conf
│   ├── backup-masters
│   ├── hadoop
│   │   ├── core-site.xml
│   │   └── hdfs-site.xml
│   ├── hbase-env.sh
│   ├── hbase-site.xml
│   └── regionservers
├── docker-compose.yaml

Start up:

docker-compose -f docker-compose.yaml up -d

# Check
docker-compose -f docker-compose.yaml ps

[root@cdh1 hbase]# docker-compose ps
Name                      Command                  State                                                 Ports
-----------------------------------------------------------------------------------------------------------------------------------------------------------------
hbase-master-1         sh -c /opt/apache/bootstra ...   Up (healthy)   0.0.0.0:36010->16000/tcp,:::36010->16000/tcp, 0.0.0.0:36020->16010/tcp,:::36020->16010/tcp
hbase-master-2         sh -c /opt/apache/bootstra ...   Up (healthy)   0.0.0.0:36011->16000/tcp,:::36011->16000/tcp, 0.0.0.0:36021->16010/tcp,:::36021->16010/tcp
hbase-regionserver-1   sh -c /opt/apache/bootstra ...   Up (healthy)   0.0.0.0:36030->16020/tcp,:::36030->16020/tcp
hbase-regionserver-2   sh -c /opt/apache/bootstra ...   Up (healthy)   0.0.0.0:36031->16020/tcp,:::36031->16020/tcp
hbase-regionserver-3   sh -c /opt/apache/bootstra ...   Up (healthy)   0.0.0.0:36032->16020/tcp,:::36032->16020/tcp

Access cluster information through the Master web UI on hbase-master-1 (host port 36020 maps to the Master info port 16010).

  12. Shell test
### Enter the container
[root@cdh1 hbase]# docker exec -it hbase-master-1 bash

### Enter the shell environment
[hadoop@hbase-master-1 hbase-2.5.4]$ hbase shell
HBase Shell
Use "help" to get list of supported commands.
Use "exit" to quit this interactive shell.
For Reference, please visit: http://hbase.apache.org/2.0/book.html#shell
Version 2.5.4, r2e426ab69d126e683577b6e94f890800c5122910, Thu Apr  6 09:11:53 PDT 2023
Took 0.0012 seconds

### Create a simple table
hbase:001:0> create 'user1', 'info', 'data'
Created table user1
Took 1.6409 seconds
=> Hbase::Table - user1

### View the table info
hbase:002:0> desc 'user1'
Table user1 is ENABLED
user1, {TABLE_ATTRIBUTES => {METADATA => {'hbase.store.file-tracker.impl' => 'DEFAULT'}}}
COLUMN FAMILIES DESCRIPTION
{NAME => 'data', INDEX_BLOCK_ENCODING => 'NONE', VERSIONS => '1', KEEP_DELETED_CELLS => 'FALSE', DATA_BLOCK_ENCODING => 'NONE', TTL => 'FOREVER', MIN_VERSIONS => '0', REPLICATION_SCOPE => '0', BLOOMFILTER => 'ROW', IN_MEMORY => 'false', COMPRESSION => 'NONE', BLOCKCACHE => 'true', BLOCKSIZE => '65536 B (64KB)'}

{NAME => 'info', INDEX_BLOCK_ENCODING => 'NONE', VERSIONS => '1', KEEP_DELETED_CELLS => 'FALSE', DATA_BLOCK_ENCODING => 'NONE', TTL => 'FOREVER', MIN_VERSIONS => '0', REPLICATION_SCOPE => '0', BLOOMFILTER => 'ROW', IN_MEMORY => 'false', COMPRESSION => 'NONE', BLOCKCACHE => 'true', BLOCKSIZE => '65536 B (64KB)'}

2 row(s)
Quota is disabled
Took 0.2303 seconds

### Check the status
hbase:003:0> status
1 active master, 1 backup masters, 3 servers, 0 dead, 1.0000 average load
Took 0.0853 seconds    

Note: if an HBase container fails to start or behaves abnormally, you can execute the following steps:

  1. Stop the HBase-related containers
docker-compose down
  2. Delete the hbase node in ZooKeeper
  3. Delete the hbase directory in HDFS
docker exec -it hadoop-hdfs-nn hdfs dfs -rm -r /hbase
  4. Restart the HBase containers

4. Data operations

As a distributed NoSQL database, HBase provides rich data operation methods, which can be executed through Shell interaction or Java API. In this section, we will introduce in detail how to operate tables in HBase, add, delete, modify and query data, and how to use counters to meet the counting requirements in high concurrency environments.

4.1 Table operations

In HBase, we can perform various table operations through Shell interaction or Java API. Here are detailed examples of each operation:

  • Create a table : Use the create command to create a new table and specify the table name, column families and other information.
create 'my_table', 'cf1', 'cf2'

This will create a table named my_table with two column families, cf1 and cf2.

  • Delete a table : First use the disable command to disable the table, and then use the drop command to delete it.
disable 'my_table'
drop 'my_table'

This will first disable the table named my_table and then drop it.

  • Modify the table structure : Modify the table structure with the alter command, for example to add column families or change settings.
alter 'my_table', NAME => 'cf3'

This will add a new column family cf3 to the table my_table.

  • Disabling and enabling tables : Use the disable command to disable a table and the enable command to enable it.
disable 'my_table'

This disables the table my_table, preventing operations on it.

enable 'my_table'

This will enable a previously disabled table my_table, making it available again.

  • List tables : Use the list command to list all tables.
list

This will display a list of all tables currently present in the HBase cluster.
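The same table operations can be performed from the Java Admin API. The following is a hedged sketch using the HBase 2.x descriptor builders; it assumes an already-created Connection (for example, obtained as in the local-mode example in section 3.2), and the table and column family names are just illustrations.

import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.ColumnFamilyDescriptorBuilder;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.TableDescriptorBuilder;

public class TableAdminExample {
    // Assumes an already-created HBase Connection
    public static void manageTable(Connection connection) throws Exception {
        try (Admin admin = connection.getAdmin()) {
            TableName tableName = TableName.valueOf("my_table");

            // Equivalent of: create 'my_table', 'cf1', 'cf2'
            admin.createTable(TableDescriptorBuilder.newBuilder(tableName)
                    .setColumnFamily(ColumnFamilyDescriptorBuilder.of("cf1"))
                    .setColumnFamily(ColumnFamilyDescriptorBuilder.of("cf2"))
                    .build());

            // Equivalent of: alter 'my_table', NAME => 'cf3'
            admin.addColumnFamily(tableName, ColumnFamilyDescriptorBuilder.of("cf3"));

            // Equivalent of: list
            for (TableName name : admin.listTableNames()) {
                System.out.println(name.getNameAsString());
            }

            // Equivalent of: disable 'my_table' followed by drop 'my_table'
            admin.disableTable(tableName);
            admin.deleteTable(tableName);
        }
    }
}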

4.2 Data operations

HBase provides a variety of data operation methods, covering operations such as data insertion, retrieval, scanning, updating, and deletion. Here are detailed instructions and examples of these operations:

  • Insert data : Insert data into the table using the put command or the Java API, specifying the row key, column family, column qualifier, and value.
put 'my_table', 'row1', 'cf1:column1', 'value1'

This will insert the value value1 into column column1 of column family cf1 in row row1 of the table my_table.

  • Get data : Use the get command or the Java API to get the data of a specified row or cell by row key.
get 'my_table', 'row1'

This will get all the data for row row1 in the table my_table.

  • Scan data : Use the scan command or the Java API to perform range scans, which can filter by rows, column families, and column qualifiers.
scan 'my_table', {COLUMNS => 'cf1'}

This will scan the table my_table and display only the data for the column family cf1.

  • Update data : Use the put command or the Java API to update the value of existing data.
put 'my_table', 'row1', 'cf1:column1', 'new_value'

This will update column column1 of column family cf1 in row row1 of the table my_table to the value new_value.

  • Delete data : Use the delete command or the Java API to delete data in a specified cell or row.
delete 'my_table', 'row1', 'cf1:column1'

This will delete the data in column column1 of column family cf1 for row row1 in the table my_table.
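For reference, the same put/get/scan/delete operations look roughly as follows in the Java API. This is a sketch assuming an existing Connection and the my_table/cf1/column1 names used in the shell examples above.

import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.Delete;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class DataOpsExample {
    // Assumes an already-created Connection and an existing table 'my_table' with column family 'cf1'
    public static void crud(Connection connection) throws Exception {
        try (Table table = connection.getTable(TableName.valueOf("my_table"))) {
            // put 'my_table', 'row1', 'cf1:column1', 'value1'
            Put put = new Put(Bytes.toBytes("row1"));
            put.addColumn(Bytes.toBytes("cf1"), Bytes.toBytes("column1"), Bytes.toBytes("value1"));
            table.put(put);

            // get 'my_table', 'row1'
            Result result = table.get(new Get(Bytes.toBytes("row1")));
            System.out.println("column1 = "
                    + Bytes.toString(result.getValue(Bytes.toBytes("cf1"), Bytes.toBytes("column1"))));

            // scan 'my_table', {COLUMNS => 'cf1'}
            Scan scan = new Scan().addFamily(Bytes.toBytes("cf1"));
            try (ResultScanner scanner = table.getScanner(scan)) {
                for (Result row : scanner) {
                    System.out.println("row key = " + Bytes.toString(row.getRow()));
                }
            }

            // delete 'my_table', 'row1', 'cf1:column1'
            Delete delete = new Delete(Bytes.toBytes("row1"));
            delete.addColumn(Bytes.toBytes("cf1"), Bytes.toBytes("column1"));
            table.delete(delete);
        }
    }
}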

4.3 Counter

Counter is a special data type in HBase used for counting requirements in high concurrency environments. In many applications, it is necessary to increase or decrease certain data in real time, such as user points, inventory quantity, number of clicks, etc. In these scenarios, counters can provide an efficient and atomic way to manage these count values.

In HBase, counters are implemented through the incrementColumnValue method (and the Increment operation), which lets you atomically increment or decrement the value in a specified cell. By specifying the row key, column family, column qualifier, and the amount to increment or decrement, HBase ensures the atomicity and consistency of counting operations, maintaining correctness even under high concurrency.

The following is a specific example of counter operation:

Suppose we have an HBase table named user_scores, which contains user's points information. The table structure is as follows:

  • Table name: user_scores
  • Column family: cf
  • Column qualifier: score

Now, we want to add points to a specific user. You can use the following commands to perform counter operations:

# Assume we are already connected to HBase and the corresponding table exists

# Define the row key, column family, and column qualifier (the HBase shell is JRuby, so Ruby variables work)
row_key = 'user123'
column_family = 'cf'
column_qualifier = 'score'

# Perform the counter operation, incrementing by 1
incr 'user_scores', row_key, "#{column_family}:#{column_qualifier}", 1

The corresponding Java code is as follows:

import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.Increment;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

// Assume an HBase Connection has been established and a Table instance obtained
Table table = connection.getTable(TableName.valueOf("user_scores"));

// Define the counter operation
Increment increment = new Increment(Bytes.toBytes("user123")); // row key
increment.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("score"), 1L); // column family, column qualifier, amount to add

// Execute the counter operation
table.increment(increment);

// Close the table (and eventually the connection)
table.close();

In this example, we perform an increment operation on the score column under the cf column family of the user_scores table, for the user whose row key is user123, increasing it by 1 count unit. Whenever this operation is called, HBase ensures atomicity, thus avoiding concurrent counting issues.

This example demonstrates how to use counters to manage high concurrent counting requirements in a distributed environment, ensuring data accuracy and consistency.
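For a single column, the Table interface also exposes the incrementColumnValue helper mentioned above, which performs the same atomic increment and returns the new value. Below is a minimal sketch under the same assumptions (an existing Connection and the user_scores table with a cf:score column):

import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class CounterIncrementExample {
    // Assumes an already-established HBase Connection
    public static long addPoint(Connection connection) throws Exception {
        try (Table table = connection.getTable(TableName.valueOf("user_scores"))) {
            // Atomically add 1 to cf:score for row 'user123' and return the new counter value
            return table.incrementColumnValue(
                    Bytes.toBytes("user123"), // row key
                    Bytes.toBytes("cf"),      // column family
                    Bytes.toBytes("score"),   // column qualifier
                    1L);                      // amount to add
        }
    }
}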

Through the above data operation methods, we can easily manage tables and operate data in HBase, and meet high concurrency counting requirements.

5. Data model and architecture

Reference: Data Model (HBase Chinese documentation)

The HBase architecture is divided into HMaster and multiple RegionServers. HMaster is responsible for monitoring RegionServer and processing DDL operations. RegionServer is used to actually store data and is responsible for processing DML operations. In addition, there is Zookeeper for state maintenance.

The core concepts in HBase include: Table, Row Key, Column Family, Column Qualifier, Timestamp, and Cell. The row key and column form a unique index, which is stored in lexicographic order of the row key. The column family is determined when the table is defined, and column qualifiers can be dynamically added or deleted. Timestamps are used for version control.

5.1 HBase system architecture


The system architecture of HBase is the core of the distributed database, which consists of multiple components working together to achieve high-performance data storage and query. In this section, we will delve into the system architecture of HBase and introduce the roles and interrelationships of key components.

5.1.1 Overview of HBase components


The system architecture of HBase mainly consists of the following core components:

HMaster:

HMaster is the "brain" of the HBase cluster and is responsible for managing and coordinating the metadata information of the entire cluster, such as table structure and Region distribution.

  • Receive the client's DDL operations (such as creating tables, modifying table structures) and convert them into corresponding RegionServer operations.
  • Maintain cluster metadata information
  • Discover failed Regions and reassign them to healthy RegionServers

RegionServer:

RegionServer is the working node of the HBase cluster and is responsible for actual data storage and query operations.

Each RegionServer is responsible for managing multiple Regions, and each Region represents a part of the data in the table.

RegionServer is responsible for handling client DML operations (such as addition, deletion, modification, and query) as well as Region load balancing and automatic splitting.

Zookeeper:

Zookeeper is a distributed coordination service that HBase uses to manage the status information and configuration information of each component in the cluster. HBase uses Zookeeper to coordinate HMaster election, RegionServer registration and metadata distribution.

HDFS:

HDFS (Hadoop Distributed File System) is the underlying storage layer of HBase, used to persistently store table data and metadata. HBase data is stored on DataNodes as HDFS files and accessed through the read and write interfaces provided by HBase.

5.1.2 Component Relationships and Collaboration


HBase components implement data storage and query through complex collaboration. The following is a brief description of the relationships and collaboration processes of these components:

  1. The client sends DDL operations (such as creating tables) to HMaster, and HMaster stores metadata information in Zookeeper and notifies RegionServer to update the metadata.
  2. The client sends a DML operation to the RegionServer, and the RegionServer performs the corresponding operation according to the operation type. For write operations, RegionServer writes data to memory (MemStore) and flushes it to disk periodically.
  3. When the MemStore size of a Region reaches the threshold, the RegionServer writes the data in the MemStore to the StoreFile in HDFS and creates a new MemStore. If the number of StoreFiles in a Region reaches a certain number, the Region will be automatically split into two Regions to ensure data balance and efficient query.
  4. The client sends a query request to the RegionServer, and the RegionServer retrieves data from StoreFile and MemStore, and returns the result to the client.
  5. When a RegionServer fails or a Region is added, Zookeeper will notify other components and perform corresponding processing, such as electing a new HMaster.

Through the collaboration of these components, HBase achieves features such as high availability, distributed storage, and real-time random access. An in-depth understanding of the system architecture of HBase will help you better optimize configuration and tune performance to meet the needs of various application scenarios. Next, we will explain the core concepts of HBase in depth to lay a stronger foundation for you.

5.2 Analysis of core concepts

In HBase, there are some core concepts that you must understand that form the basis of data storage and access. In this section, we will explain these core concepts in detail to help you better understand how HBase works.

5.2.1 Table

In HBase, data is organized into a series of tables. Each table contains multiple rows of data, and each row of data is identified by a unique RowKey. The structure of a table in HBase is dynamic, you can add column families or columns at any time without defining the structure of the table beforehand. Tables are often used to store data with the same structure.

5.2.2 RowKey

RowKey is a key that uniquely identifies a row of data in a table. It is a byte array and can be of any data type. The design of RowKey is very important because it directly affects the distribution of data and query performance. A better RowKey design can achieve balanced storage of data and efficient query.
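As an illustration of RowKey design (a sketch of one common pattern, not a rule), a composite key such as a user id plus a reversed timestamp keeps all rows of one user contiguous while sorting that user's newest records first; the userId and timestamp fields here are hypothetical.

import org.apache.hadoop.hbase.util.Bytes;

public class RowKeyExample {
    // Hypothetical composite row key: <userId>#<Long.MAX_VALUE - timestamp>
    // Rows of the same user stay together, and newer events sort before older ones
    // (the reversed timestamp has a fixed digit count for current epoch-millisecond values).
    public static byte[] buildRowKey(String userId, long timestampMillis) {
        long reversedTs = Long.MAX_VALUE - timestampMillis;
        return Bytes.toBytes(userId + "#" + reversedTs);
    }
}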

5.2.3 Column Family

A Column Family is a collection of related columns in a table. Each column family has a name, and columns within the column family can be added dynamically. In HBase, a column family is the physical storage unit of data, and all data belonging to the same column family is stored together to achieve higher storage and query efficiency.

5.2.4 Column Qualifier

Column Qualifier is the name of a specific column in a specified column family. It is a byte array that identifies the different columns within the column family. By combining Column Family and Column Qualifier, you can uniquely determine a cell (Cell) in the table.

5.2.5 Cell

Cell is the smallest data unit in HBase, which stores the actual data value and the corresponding timestamp. Each Cell consists of RowKey, Column Family, Column Qualifier and Timestamp. In the table, the unique identifier of each cell is a combination of RowKey, Column Family and Column Qualifier.

5.2.6 Timestamp

Timestamp is an important attribute of Cell, used to identify the version of data. HBase supports multi-version data storage, and each cell can have multiple versions with different timestamps. With Timestamp, you can query and obtain data versions at different points in time.
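To see timestamps and versions from a client, the following is a hedged Java sketch; it assumes an existing Connection and a table my_table whose column family cf1 retains more than one version (VERSIONS > 1).

import org.apache.hadoop.hbase.Cell;
import org.apache.hadoop.hbase.CellUtil;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class VersionReadExample {
    // Assumes a table 'my_table' whose column family 'cf1' keeps multiple versions
    public static void readVersions(Connection connection) throws Exception {
        try (Table table = connection.getTable(TableName.valueOf("my_table"))) {
            Get get = new Get(Bytes.toBytes("row1"));
            get.readVersions(3); // request up to 3 versions per cell
            Result result = table.get(get);
            // Each returned Cell carries its own timestamp
            for (Cell cell : result.getColumnCells(Bytes.toBytes("cf1"), Bytes.toBytes("column1"))) {
                System.out.println(cell.getTimestamp() + " -> " + Bytes.toString(CellUtil.cloneValue(cell)));
            }
        }
    }
}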

5.2.7 Region

Region is the distribution unit of data in HBase. Each table can be divided into multiple Regions, and each Region consists of a series of consecutive rows. Region is the key to HBase's distributed storage and load balancing.

By understanding these core concepts, you will be able to better design table structures, optimize data access and queries, and lay a solid foundation for subsequent data operations. Next, we will delve into the data operations of HBase to help you understand how to add, delete, modify and query data in HBase.

5.3 Internal implementation principles

The internal implementation principle of HBase is the key to its high performance and scalability. In this section, we will reveal the internal implementation principles of HBase such as data storage format, compression algorithm, and caching mechanism to help you deeply understand its working mechanism.

5.3.1 HBase data reading process


HBase is a distributed, column-oriented NoSQL database system that provides high performance for random reading of massive data. The following is the detailed process of HBase reading data:

  1. Data location and access :
    When the client needs to read data, it first determines which Region holds the row key: Regions are split by row key ranges, and the client looks up the hbase:meta table (whose location is obtained from ZooKeeper) to get the table's Region information and the address of the RegionServer hosting that Region.
  2. Communicate with RegionServer :
    The client will establish a connection with the RegionServer responsible for the data region based on the metadata information. This connection will be used to request data and receive responses.
  3. MemStore and StoreFile query :
    On RegionServer, data will first be queried in MemStore. If the data you want to find is in MemStore, read it directly from memory. If the data is not in MemStore, HBase will query the StoreFile in chronological order and find the appropriate StoreFile for reading.
  4. Block Cache utilization :
    If the data is in StoreFile, HBase will read the StoreFile in blocks. During the read process, HBase will first check the Block Cache, which is an in-memory cache used to store the most commonly used data blocks to improve read performance. If the required data block exists in the Block Cache, it can be read directly from the cache without reading the StoreFile from the disk.
  5. Data filtering and assembly :
    Once the data block is read, HBase will filter the data according to the column family and column qualifier specified in the request, and only return the required part of the data. The data is then assembled into the appropriate format, usually in the form of key-value pairs.
  6. Return data to the client :
    Finally, RegionServer will return the requested data to the client. The client can use this data for further processing and analysis.

HBase's read path makes full use of caching and indexing: it checks the in-memory MemStore, uses the Block Cache, queries StoreFiles, and filters data, providing efficient random read performance. This architecture makes HBase suitable for real-time query needs on large-scale data sets.

5.3.2 HBase data writing process


HBase is a columnar storage system based on Hadoop Distributed File System (HDFS) and is designed to handle large-scale sparse data sets. It provides efficient random read and write capabilities, and has a unique set of internal principles and processes when writing data.

The following are the internal principles and processes of HBase writing data:

  1. Writing process initialization :
    When the client wants to write data, it first locates the target Region by consulting ZooKeeper and the hbase:meta table to obtain the table's Region (partition) information. Then, the client establishes a connection with the RegionServer hosting that Region and prepares for the write operation.
  2. Preparation before writing :
    Before writing, HBase will preprocess the data to be written, such as generating a write operation log (Write Ahead Log, WAL) to ensure data durability and recovery capabilities.
  3. Data distribution and positioning :
    The HBase table is horizontally divided into multiple Regions, and each Region is responsible for managing a certain range of row keys. When writing data, HBase locates the Region whose row key range contains the row key being written.
  4. Write to MemStore :
    When the Region to be written is determined, the data will first be written to the memory storage area of the Region, called MemStore. MemStore stores data in a sorted manner, and new data will be inserted into the appropriate position to make subsequent read and write operations more efficient.
  5. Persistence to HLog :
    After writing to MemStore, the data will be written to HLog at the same time as a persistent write log. This ensures that even if the RegionServer fails, the data can be recovered from the HLog.
  6. MemStore flushing :
    When the data in MemStore reaches a certain size threshold, HBase will trigger the flushing operation of MemStore. This will write the data in MemStore to a temporary storage file in HDFS called StoreFile.
  7. Compaction :
    As time goes by, multiple StoreFiles will continue to be generated. In order to maintain data continuity and improve query performance, HBase regularly performs Compaction operations to merge multiple StoreFiles into a larger file and delete expired or duplicate data.
  8. Data persistence :
    After the Compaction operation is completed, the new StoreFile generated will be renamed and moved to the final storage location. In this way, the data is persisted to HDFS.
  9. Acknowledgment :
    Once the data is successfully written and persisted, RegionServer will send a confirmation message to the client indicating that the data has been successfully written.

In general, the writing process of HBase involves steps such as MemStore in memory, persistent HLog, StoreFile generation and Compaction, ensuring data persistence and efficient performance. This process takes full advantage of the advantages of distributed storage and columnar storage, allowing HBase to have excellent performance in large-scale data processing.
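On the client side, the WAL behavior described in steps 2 and 5 can be influenced per mutation through the Durability setting; below is a hedged sketch (the table and column names are just examples, and SKIP_WAL would trade away the durability described above, so it should be used with care).

import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.Durability;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class DurabilityExample {
    // Assumes an existing Connection and a table 'my_table' with column family 'cf1'
    public static void writeWithSyncWal(Connection connection) throws Exception {
        try (Table table = connection.getTable(TableName.valueOf("my_table"))) {
            Put put = new Put(Bytes.toBytes("row1"));
            put.addColumn(Bytes.toBytes("cf1"), Bytes.toBytes("column1"), Bytes.toBytes("value1"));
            // Require the WAL entry to be synced before the write is acknowledged
            // (alternatives include ASYNC_WAL and, with caution, SKIP_WAL)
            put.setDurability(Durability.SYNC_WAL);
            table.put(put);
        }
    }
}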

5.3.3 HBase flush and compaction mechanisms

HBase's Flush and Compaction are important mechanisms for maintaining data consistency, reducing storage space usage, and improving read performance. The following is a detailed introduction to HBase’s Flush and Compaction mechanisms:

Flush mechanism:

Flush refers to the process of writing data in memory to persistent storage (HDFS). In HBase, data is first written to MemStore, which is a sorted data structure located in RegionServer memory to improve write performance. However, MemStore data is not directly written to HDFS during the writing process. Instead, a flush operation is performed after certain conditions are met to persist the data to the StoreFile on the disk.

The process of Flush mechanism is as follows:

  1. MemStore flush trigger : When the data in MemStore reaches a certain size threshold (specified through configuration parameters), or within a specified time interval, HBase will trigger a Flush operation.
  2. Generate StoreFile : During Flush, the data in MemStore will be written to a new StoreFile instead of directly writing to HDFS. This StoreFile is a persistent data file, sorted in key order.
  3. StoreFile written to HDFS : The generated StoreFile is written to the data directory of the corresponding Region in HDFS and becomes persistent storage. Durability of the original writes is already guaranteed by the HLog (Write Ahead Log); once the flush completes, the corresponding WAL entries can be discarded.

Flush trigger configuration:

When the size of a MemStore exceeds this value, it is flushed to disk. The default is 128MB.

<property>
    <name>hbase.hregion.memstore.flush.size</name>
    <value>134217728</value>
</property>

When the data in a MemStore has not been flushed for longer than this interval (1 hour here), it is flushed to disk.

<property>
    <name>hbase.regionserver.optionalcacheflushinterval</name>
    <value>3600000</value>
</property>

The global MemStore size limit of an HRegionServer. Exceeding this size triggers flushes to disk. The default is 40% of the heap size.

<property>
    <name>hbase.regionserver.global.memstore.size</name>
    <value>0.4</value>
</property>

Manual flush:

flush 'tableName'

Compaction mechanism:

Compaction refers to the process of merging StoreFiles to optimize data storage, improve query performance, and delete expired data. Since the write operation of HBase will generate multiple StoreFiles, and the data may exist in multiple versions, in order to maintain the continuity of the data and delete expired data, HBase will perform Compaction periodically.

HBase supports two types of compaction: the smaller-scale Minor Compaction and the larger-scale Major Compaction .

  • Minor Compaction : A smaller-scale compaction that merges several adjacent StoreFiles, but not all of them. This helps reduce data fragmentation and improves query performance.
  • Major Compaction : A larger-scale compaction that merges all StoreFiles in a Region, removes data marked for deletion, removes expired data, and merges different versions of the same row key. Major Compaction fully optimizes data storage, reduces storage space usage, and improves read performance.

The process of the Compaction mechanism is as follows:

  1. Select StoreFiles to merge : HBase will select some StoreFiles to merge, usually files that are adjacent or have overlapping row key ranges.
  2. Merger process : During the Compaction process, HBase will merge the selected StoreFiles and generate a new StoreFile. During this process, duplicate data will be merged, expired data will be deleted, and different versions of data will be merged.
  3. Generate a new StoreFile : The new StoreFile generated after merging will be written to HDFS and replace the old StoreFile.
  4. Delete the old StoreFile : After the merged new StoreFile is generated, the original old StoreFile will be deleted to free up storage space.

The combined use of Flush and Compaction mechanisms ensures the consistency, reliability and reading performance of HBase data.
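
A minor or major compaction can also be requested explicitly through the Admin API, which is often done during off-peak hours for write-heavy tables. A minimal sketch (hosts and table name are illustrative); note that both calls only request the compaction and the actual work is done asynchronously by the RegionServers:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;

public class CompactionExample {
    public static void main(String[] args) throws IOException {
        Configuration conf = HBaseConfiguration.create();
        conf.set("hbase.zookeeper.quorum", "linux121,linux122"); // illustrative hosts
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Admin admin = conn.getAdmin()) {
            // request a minor compaction of the table (merges some StoreFiles)
            admin.compact(TableName.valueOf("teacher"));
            // request a major compaction (merges all StoreFiles and removes deleted/expired cells)
            admin.majorCompact(TableName.valueOf("teacher"));
        }
    }
}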

5.3.4 Caching mechanism

HBase's caching mechanism is designed to improve read performance and reduce the frequency of access to the underlying storage. HBase uses two main caching mechanisms: the Block Cache, which serves reads, and the MemStore (an in-heap write buffer, referred to below as the MemStore heap cache). The following is a detailed introduction to these two mechanisms:

1. Block Cache:

The Block Cache is one of HBase's main caching mechanisms and is used to cache data blocks read from the StoreFiles on HDFS. A StoreFile is divided into fixed-size data blocks (Blocks); the default block size is 64 KB.

The basic principles of how Block Cache works are as follows:

  • Caching strategy : The Block Cache uses an LRU (Least Recently Used) strategy to manage cached blocks. Recently accessed blocks stay in the cache, while blocks that have not been accessed for a while may be evicted.
  • Block granularity : Block Cache caches in blocks, so that when reading data, cache hits can be made at the block level, thereby improving read performance.
  • Cache location : Block Cache is located in the memory of each RegionServer and is used to cache the data blocks of all Regions that the RegionServer is responsible for.
  • Cache applicability : Block Cache is suitable for frequently queried data, such as hotspot data. Since the data is cached in memory, read performance is significantly improved.
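
On the read path, whether a request should populate the Block Cache can be controlled per operation. For example, a one-off full table scan would evict hot blocks used by normal point reads, so block caching is usually disabled for such scans. A minimal sketch (hosts and table name are illustrative):

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;

public class ScanWithoutBlockCache {
    public static void main(String[] args) throws IOException {
        Configuration conf = HBaseConfiguration.create();
        conf.set("hbase.zookeeper.quorum", "linux121,linux122"); // illustrative hosts
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("teacher"))) {
            Scan scan = new Scan();
            // do not load the scanned blocks into the Block Cache,
            // so hot blocks used by normal point reads are not evicted
            scan.setCacheBlocks(false);
            try (ResultScanner scanner = table.getScanner(scan)) {
                for (Result result : scanner) {
                    System.out.println(result);
                }
            }
        }
    }
}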

2. MemStore heap cache:

The MemStore itself acts as an in-heap write cache. When data is written to HBase, it is first recorded in the WAL and then placed in the MemStore, which lives on the RegionServer heap; the data stays there until a Flush writes it out.

  • Caching purpose : The main purpose is to improve write performance. Because writing to disk is relatively slow, buffering writes in memory and flushing them out in large, sorted batches reduces the frequency of disk writes. The MemStore also serves reads of recently written data that has not yet been flushed.

  • Flush trigger : When the MemStore reaches a size threshold or a time interval elapses, a Flush operation is triggered to write the buffered data to a new StoreFile on HDFS.

HBase's caching mechanism plays a key role in both reading and writing: the Block Cache provides efficient read caching, while the MemStore buffers writes in memory. Used together, they allow HBase to deliver excellent performance for large-scale data processing.

5.3.5 Garbage collection

Omitted here due to character limit

For complete content, please refer to "Big Data HBase Study Bible", pdf, get it from Nien

To sum up, HBase's data garbage collection mechanism mainly cleans invalid data through deletion marks, version deletion, Minor Compaction and Major Compaction, reduces storage space occupation, and maintains data consistency and reliability.

By understanding these internal implementation principles, you will be able to better understand the performance characteristics and working mechanism of HBase. This is very helpful for optimizing configuration, tuning performance, and addressing the needs of various application scenarios. Next, we will introduce the data operation of HBase in depth, and provide you with rich operation examples and guidance.

6. HBase advanced operations

In this chapter, we will delve into the advanced operations of HBase, including the use of filters for read control, efficient data import methods, and how to combine MapReduce for complex distributed computing.

This section covers some of the more advanced and complex use cases of HBase, with detailed instructions and concrete operation examples:

6.1 Advanced use cases

Omitted here due to character limit

For complete content, please refer to "Big Data HBase Study Bible", pdf, get it from Nien

6.2 Use of filters

Omitted here due to character limit

For complete content, please refer to "Big Data HBase Study Bible", pdf, get it from Nien

6.3 Data import method

Omitted here due to character limit

For complete content, please refer to "Big Data HBase Study Bible", pdf, get it from Nien

In the next chapter, we will introduce how to develop HBase applications through the Java client API, as well as some optimizations and best practices. This will help you better leverage HBase to build applications.

7. Client Development

In this chapter, we will introduce in detail how to use HBase's Java client API for application development, starting with the dependency configuration and then walking through complete, runnable Java code examples.

7.1 Dependency configuration

First, you need to add the HBase Java client API dependency to the project. If Maven is used as the build tool, add the following dependency to the pom.xml file:

<dependency>
    <groupId>org.apache.hbase</groupId>
    <artifactId>hbase-client</artifactId>
    <version>2.4.7</version> <!-- adjust to match your HBase version -->
</dependency>

7.2 User information example

The following is a complete, runnable set of Java examples that demonstrates how to use HBase's Java client API to connect to an HBase cluster, create a table, insert data, query data, and release resources. The connection configuration (ZooKeeper quorum and client port) is set directly in the code.

  1. Initialize client
// imports used by all the examples in this chapter
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.Cell;
import org.apache.hadoop.hbase.CellUtil;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;
import org.junit.After;
import org.junit.Before;
import org.junit.Test;

public class HbaseClientDemo {

    Configuration conf = null;
    Connection conn = null;
    HBaseAdmin admin = null;

    @Before
    public void init() throws IOException {
        conf = HBaseConfiguration.create();
        // ZooKeeper quorum and client port of the HBase cluster
        conf.set("hbase.zookeeper.quorum", "linux121,linux122");
        conf.set("hbase.zookeeper.property.clientPort", "2181");
        conn = ConnectionFactory.createConnection(conf);
    }

    @After
    public void destroy() {
        // release the admin and connection resources after each test
        if (admin != null) {
            try {
                admin.close();
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
        if (conn != null) {
            try {
                conn.close();
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }
}
  2. Create table
@Test
public void createTable() throws IOException {
    admin = (HBaseAdmin) conn.getAdmin();
    // create the table descriptor
    HTableDescriptor teacher = new HTableDescriptor(TableName.valueOf("teacher"));
    // add the column family descriptor
    teacher.addFamily(new HColumnDescriptor("info"));
    // create the table
    admin.createTable(teacher);
    System.out.println("Table teacher created successfully!");
}
  3. Insert data
// insert a single row
@Test
public void putData() throws IOException {
    // get a Table object
    Table t = conn.getTable(TableName.valueOf("teacher"));
    // set the rowkey
    Put put = new Put(Bytes.toBytes("110"));
    // column family, column, value
    put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("addr"), Bytes.toBytes("beijing"));
    // execute the insert
    t.put(put);
    // t.put(List<Put>) can be used for batch inserts (see the batch example below)
    // close the Table object
    t.close();
    System.out.println("Insert succeeded!");
}
  4. Delete data
// delete a single row
@Test
public void deleteData() throws IOException {
    // get a Table object
    final Table worker = conn.getTable(TableName.valueOf("worker"));
    // prepare the Delete object with the rowkey to remove
    final Delete delete = new Delete(Bytes.toBytes("110"));
    // execute the delete
    worker.delete(delete);
    // close the Table object
    worker.close();
    System.out.println("Delete succeeded!");
}
  5. Query data of a column family
// query the data of one column family for a single row
@Test
public void getDataByCF() throws IOException {
    // get a Table object
    HTable teacher = (HTable) conn.getTable(TableName.valueOf("teacher"));
    // create the Get object for the query
    Get get = new Get(Bytes.toBytes("110"));
    // restrict the query to one column family
    // get.addColumn(Bytes.toBytes("info"), Bytes.toBytes("sex")); // or to a single column
    get.addFamily(Bytes.toBytes("info"));
    // execute the query
    Result res = teacher.get(get);
    Cell[] cells = res.rawCells(); // all cells of this row
    for (Cell cell : cells) {
        // extract rowkey, column family, column and value from each cell
        String cf = Bytes.toString(CellUtil.cloneFamily(cell));
        String column = Bytes.toString(CellUtil.cloneQualifier(cell));
        String value = Bytes.toString(CellUtil.cloneValue(cell));
        String rowkey = Bytes.toString(CellUtil.cloneRow(cell));
        System.out.println(rowkey + "----" + cf + "---" + column + "---" + value);
    }
    teacher.close(); // release the Table resource
}
  6. Full table scan via Scan
/**
 * full table scan
 */
@Test
public void scanAllData() throws IOException {
    HTable teacher = (HTable) conn.getTable(TableName.valueOf("teacher"));
    Scan scan = new Scan();
    ResultScanner resultScanner = teacher.getScanner(scan);
    for (Result result : resultScanner) {
        Cell[] cells = result.rawCells(); // all cells of this row
        for (Cell cell : cells) {
            // extract rowkey, column family, column and value from each cell
            String cf = Bytes.toString(CellUtil.cloneFamily(cell));
            String column = Bytes.toString(CellUtil.cloneQualifier(cell));
            String value = Bytes.toString(CellUtil.cloneValue(cell));
            String rowkey = Bytes.toString(CellUtil.cloneRow(cell));
            System.out.println(rowkey + "----" + cf + "--" + column + "---" + value);
        }
    }
    teacher.close();
}
  7. Range scan via startRow and stopRow
/**
 * scan a row range between startRow (inclusive) and stopRow (exclusive)
 */
@Test
public void scanRowKey() throws IOException {
    HTable teacher = (HTable) conn.getTable(TableName.valueOf("teacher"));
    Scan scan = new Scan();
    scan.setStartRow("0001".getBytes());
    scan.setStopRow("2".getBytes());
    ResultScanner resultScanner = teacher.getScanner(scan);
    for (Result result : resultScanner) {
        Cell[] cells = result.rawCells(); // all cells of this row
        for (Cell cell : cells) {
            // extract rowkey, column family, column and value from each cell
            String cf = Bytes.toString(CellUtil.cloneFamily(cell));
            String column = Bytes.toString(CellUtil.cloneQualifier(cell));
            String value = Bytes.toString(CellUtil.cloneValue(cell));
            String rowkey = Bytes.toString(CellUtil.cloneRow(cell));
            System.out.println(rowkey + "----" + cf + "--" + column + "---" + value);
        }
    }
    teacher.close();
}
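  8. Batch insert via put(List<Put>)

As noted in the insert example above, Table.put can also take a list of Put objects to write multiple rows in one batch. A minimal sketch, reusing the conn created in the init() method and the same teacher table (row keys and values are illustrative):

// batch insert; requires java.util.ArrayList and java.util.List in the imports
@Test
public void putBatchData() throws IOException {
    Table t = conn.getTable(TableName.valueOf("teacher"));
    List<Put> puts = new ArrayList<>();
    for (int i = 0; i < 3; i++) {
        Put put = new Put(Bytes.toBytes("11" + i)); // rowkeys 110, 111, 112
        put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("addr"), Bytes.toBytes("beijing"));
        puts.add(put);
    }
    // send all puts to the server in one batch
    t.put(puts);
    t.close();
    System.out.println("Batch insert succeeded!");
}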

7.3 SpringBoot integrates HBase

To integrate HBase in Spring Boot, you can use the HBase support provided by the Spring Data Hadoop project. Here are the steps to demonstrate how to integrate HBase in Spring Boot:

1. Create a Spring Boot project:

First, create a new Spring Boot project. You can use Spring Initializr (https://start.spring.io/) to quickly generate a basic Spring Boot project, selecting the dependencies you need, such as Web and Lombok (optional); the HBase-related dependency is usually added to the pom.xml by hand afterwards.

2. Configure application.properties:

In the application.properties file under the src/main/resources directory, add the HBase and ZooKeeper configuration:

# HBase configuration
spring.data.hbase.quorum=zkHost1,zkHost2,zkHost3  # replace with your ZooKeeper hostnames
spring.data.hbase.zk-port=2181
spring.data.hbase.zk-znode-parent=/hbase
spring.data.hbase.rootdir=hdfs://localhost:9000/hbase

# HBase auto start
spring.data.hbase.auto-startup=true

3. Create HBase entity class:

Create a Java class to represent an entity of the HBase table. Use the @Table annotation to specify the table name, the @RowKey annotation to mark the row key, and the @Column annotation to map the columns.

import org.springframework.data.annotation.Id;
import org.springframework.data.hadoop.hbase.RowKey;
import org.springframework.data.hadoop.hbase.Table;

@Table("users")
public class User {

    @Id
    @RowKey
    private String id;

    @Column("userInfo:name")
    private String name;

    @Column("userInfo:age")
    private String age;

    // Getters and setters
}

4. Create HBase Repository:

Create an HBase Repository interface, inherited from org.springframework.data.repository.CrudRepository. You can use inherited methods to implement basic CRUD operations.

import org.springframework.data.repository.CrudRepository;

public interface UserRepository extends CrudRepository<User, String> {
}

5. Create the service layer:

Create a service layer that can inject the Repository and then use it in the application to access HBase data.

import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.stereotype.Service;

@Service
public class UserService {

    private final UserRepository userRepository;

    @Autowired
    public UserService(UserRepository userRepository) {
        this.userRepository = userRepository;
    }

    public void saveUser(User user) {
        userRepository.save(user);
    }

    public User getUserById(String id) {
        return userRepository.findById(id).orElse(null);
    }

    // other business logic...
}

6. Create the controller:

Create a controller class to handle HTTP requests.

import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.web.bind.annotation.*;

@RestController
@RequestMapping("/users")
public class UserController {

    private final UserService userService;

    @Autowired
    public UserController(UserService userService) {
        this.userService = userService;
    }

    @PostMapping
    public void addUser(@RequestBody User user) {
        userService.saveUser(user);
    }

    @GetMapping("/{id}")
    public User getUser(@PathVariable String id) {
        return userService.getUserById(id);
    }

    // other request handlers...
}

7. Run the application:

Run the Spring Boot application; it uses the embedded Tomcat container by default. Then access the endpoints defined in the controller via HTTP requests.
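
If the project was generated with Spring Initializr, a standard startup class is already present; otherwise a minimal one looks like this (the class name is illustrative):

import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;

@SpringBootApplication
public class HbaseDemoApplication {
    public static void main(String[] args) {
        // starts the embedded Tomcat container and exposes the /users endpoints
        SpringApplication.run(HbaseDemoApplication.class, args);
    }
}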

To sum up, you now have HBase integrated in Spring Boot. By using Spring Data Hadoop's HBase support, you can more easily perform HBase operations in Spring Boot applications without having to deal with the underlying connection and configuration details.

Next, in the next chapter, we will discuss in depth how to optimize the performance of HBase, and some common application cases.

8. HBase optimization

In this chapter, we will dive into how to optimize the performance of HBase to ensure that your application remains efficient when processing large-scale data. HBase optimization can start from many aspects, let us introduce them one by one.

8.1 Rowkey design

In HBase, the design of RowKey is very important because it directly affects the storage, access and performance of data. A good RowKey design can improve query efficiency, reduce data skew problems, and help optimize HBase performance. The following are some principles and optimization suggestions for designing RowKey:

1. Uniqueness: RowKey must be unique to ensure that each row can be correctly identified. Usually, timestamp, UUID, business-related unique identifier, etc. can be used as a part of RowKey.

2. Hash distribution: A good RowKey design should achieve uniform distribution in the HBase table. This helps avoid hot data, improve load balancing, and reduce data skew issues. A common approach is to hash the RowKey using a hash function to ensure a more even distribution of data.

3. Data locality: Store related data in adjacent rows to minimize disk access when scanning range queries. This can be achieved by storing similar data under the same prefix of the RowKey. For example, for a table that stores user transaction records, you can use the user ID as a prefix to store transaction records for the same user in adjacent rows.

4. Minimize byte length: Shorter RowKey can reduce storage space and IO overhead. However, do not overly pursue a short length and make the RowKey unreadable. A trade-off should be made between readability and length.

5. Avoid frequent changes: After designing RowKey, try to avoid frequent modifications, because modifying RowKey will cause data to be redistributed within HBase. Frequent changes can impact performance and data locality.

6. Consider query patterns: Based on your query needs, design RowKey to support common query patterns. If you often need to query data based on a time range, you can use the timestamp as part of the RowKey.

7. Avoid sequential write hotspots: If RowKey is designed so that new data is always written to the same area, it may cause sequential write hotspots and affect performance. This can be mitigated by adding randomness to the RowKey, or using a hash distribution.

8. Use byte encoding: If the RowKey contains numbers or dates, use a fixed-length, sort-preserving byte encoding so that byte-wise comparison matches the logical order (a sketch is given in example 4 below).

9. Row partitioning: For large tables, you can consider partitioning the table by rows and spreading the data to different RegionServers. This can be achieved by adding the partition ID in the RowKey.

10. Testing and optimization: The designed RowKey should be tested to ensure that it performs well in actual scenarios. After design, it can be verified through performance testing and optimized.

When designing the RowKey of an HBase table, you can develop a strategy based on the above principles and suggestions.

Here are some examples showing how to design RowKey based on different scenarios and illustrating how to implement it in Java.

1. Uniqueness and Hash Distribution Example:

In this example, we use a UUID as part of the RowKey to ensure uniqueness. At the same time, we hash the RowKey to achieve a more even data distribution.

import org.apache.hadoop.hbase.util.Bytes;

import java.util.UUID;

public class RowKeyExample {

    public static byte[] generateRowKey() {
        UUID uuid = UUID.randomUUID();
        int hash = uuid.hashCode();
        return Bytes.toBytes(hash + "-" + uuid.toString());
    }

    public static void main(String[] args) {
        byte[] rowKey = generateRowKey();
        System.out.println("Generated RowKey: " + Bytes.toString(rowKey));
    }
}

2. Data locality example:

In this example, we use the user ID as the prefix of the RowKey to store data for the same user in adjacent rows.

import org.apache.hadoop.hbase.util.Bytes;

public class UserRowKeyExample {

    public static byte[] generateRowKey(String userId) {
        return Bytes.toBytes(userId + "-" + System.currentTimeMillis());
    }

    public static void main(String[] args) {
        String userId = "user123";
        byte[] rowKey = generateRowKey(userId);
        System.out.println("Generated RowKey: " + Bytes.toString(rowKey));
    }
}

3. Avoiding sequential write hotspots example:

In this example, we prepend a small random salt to the timestamp-based RowKey, so that consecutive writes are spread across different Regions instead of all landing on the same one.

import org.apache.hadoop.hbase.util.Bytes;

public class AvoidHotspotRowKeyExample {

    public static byte[] generateRowKey() {
        long timestamp = System.currentTimeMillis();
        int salt = (int) (Math.random() * 1000); // random salt value
        // use the salt as the prefix: if the timestamp came first, the keys would still be
        // monotonically increasing and all new writes would hit the same Region
        return Bytes.toBytes(salt + "-" + timestamp);
    }

    public static void main(String[] args) {
        byte[] rowKey = generateRowKey();
        System.out.println("Generated RowKey: " + Bytes.toString(rowKey));
    }
}
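
4. Sort-preserving byte encoding example:

This sketch complements principle 8 (byte encoding) and principle 6 (query patterns): a fixed-length, big-endian encoding of a non-negative timestamp keeps the byte-wise sort order of the RowKey consistent with numeric order, and subtracting the timestamp from Long.MAX_VALUE makes the newest rows sort first. The class and method names are illustrative.

import org.apache.hadoop.hbase.util.Bytes;

public class SortableRowKeyExample {

    public static byte[] generateRowKey(String userId, long timestamp) {
        // Bytes.toBytes(long) produces a fixed-length (8 byte), big-endian encoding, so the
        // lexicographic RowKey order matches the numeric order of the (non-negative) timestamp
        byte[] reversedTs = Bytes.toBytes(Long.MAX_VALUE - timestamp); // newest rows sort first
        return Bytes.add(Bytes.toBytes(userId + "-"), reversedTs);
    }

    public static void main(String[] args) {
        byte[] rowKey = generateRowKey("user123", System.currentTimeMillis());
        System.out.println("Generated RowKey length: " + rowKey.length);
    }
}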

In practical applications, the design of RowKey depends on your business needs and query patterns. The above example is just to demonstrate how to design RowKey based on different scenarios, not an absolute best practice. You should customize the design, test and optimize according to the specific situation to obtain the best performance and data distribution effect.

In short, HBase's RowKey design is an important decision that requires careful consideration, as it directly affects data storage and query efficiency. Based on specific business needs, query patterns, and performance requirements, you can choose an appropriate RowKey design strategy.

8.2 Memory optimization

Properly configuring memory parameters is an important step in optimizing HBase performance. The following is an extended description of memory optimization, mainly focusing on MemStore optimization and block cache optimization.

1. MemStore optimization:

MemStore is the memory area used by HBase to cache written data. Excessive data accumulation may cause write performance to decrease. Here are some optimization suggestions:

  • Control the flush frequency: the hbase.hregion.memstore.flush.size parameter determines the MemStore size at which data is flushed to a new StoreFile. Adjusting this threshold controls how often flushes happen and avoids overly frequent flush operations (and the many small files they produce).
  • Set the blocking multiplier reasonably: hbase.hregion.memstore.block.multiplier controls when writes to a Region are blocked. If a Region's MemStore grows to the flush size multiplied by this factor (for example because flushes cannot keep up with the write rate), new writes to that Region are blocked until the MemStore has been flushed back below the limit. If the system has plenty of heap, the multiplier and/or the flush size can be increased so that write bursts are less likely to be blocked.

2. Block cache optimization:

The block cache is used to cache data blocks in HFile, thereby speeding up read operations. Here are some suggestions for block cache optimization:

  • Allocate memory according to the table's read pattern: hfile.block.cache.size controls the fraction of the RegionServer heap given to the Block Cache, while hbase.regionserver.global.memstore.size controls the fraction given to all MemStores; together they partition most of the heap. For read-heavy workloads, the Block Cache fraction can be increased and the global MemStore fraction reduced.
  • Use hbase.regionserver.global.memstore.size.lower.limit: when total MemStore usage exceeds this fraction of the global MemStore limit, HBase starts forcing flushes; if usage reaches the full limit, writes are blocked until enough data has been flushed. Tuning this value controls how aggressively memory is reclaimed from MemStores under write pressure.

When optimizing memory, it is recommended to conduct experiments and performance tests to find the best memory configuration based on actual conditions. At the same time, the usage of other resources of the system must also be considered to avoid affecting the normal operation of other services due to excessive memory allocation.

8.3 Compression algorithm selection

In HBase, choosing an appropriate compression algorithm can significantly reduce storage costs, improve data transmission efficiency, and affect overall performance. The following is an extended explanation of compression algorithm selection:

1. Snappy compression algorithm:

  • Advantages: Snappy is a fast compression algorithm that can effectively reduce data storage space while maintaining a high decompression speed. It is particularly suitable for high throughput and low latency application scenarios.
  • Applicable scenarios: If your main focus is compression speed and low latency, consider using Snappy. It is suitable for scenarios with high requirements on read and write performance, such as real-time analysis and fast query.

2. LZO compression algorithm:

  • Advantages: LZO is an efficient compression algorithm with fast decompression speed and high compression ratio. It performs well in big data analysis scenarios, especially for jobs such as MapReduce and Hive.
  • Applicable scenarios: If you use HBase in big data analysis, you can consider using the LZO compression algorithm. It can reduce disk IO and network transmission and speed up job execution.

3. Gzip compression algorithm:

  • Advantages: Gzip is a general-purpose compression algorithm that performs well in terms of compression ratio. It can store more data in smaller storage space.
  • Applicable scenarios: If your main concern is compression ratio, consider using Gzip. However, it should be noted that Gzip's compression and decompression speed is relatively slow, which may affect read and write performance.

When selecting a compression algorithm, you need to comprehensively consider the characteristics of the data, access patterns, and hardware conditions. You can also choose different compression algorithms for different column families or tables to meet different needs. Note that compression is a column-family-level schema attribute of the table, not an hbase-site.xml property. For example, it can be set in the HBase shell:

alter 'tableName', {NAME => 'familyName', COMPRESSION => 'SNAPPY'}
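
The same setting can also be applied from the Java client when creating or altering a table. A minimal sketch (hosts, table and column family names are illustrative; the chosen codec must be installed on the RegionServers):

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.ColumnFamilyDescriptor;
import org.apache.hadoop.hbase.client.ColumnFamilyDescriptorBuilder;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.TableDescriptor;
import org.apache.hadoop.hbase.client.TableDescriptorBuilder;
import org.apache.hadoop.hbase.io.compress.Compression;
import org.apache.hadoop.hbase.util.Bytes;

public class CompressionExample {
    public static void main(String[] args) throws IOException {
        Configuration conf = HBaseConfiguration.create();
        conf.set("hbase.zookeeper.quorum", "linux121,linux122"); // illustrative hosts
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Admin admin = conn.getAdmin()) {
            // column family "info" stored with Snappy compression
            ColumnFamilyDescriptor cf = ColumnFamilyDescriptorBuilder
                    .newBuilder(Bytes.toBytes("info"))
                    .setCompressionType(Compression.Algorithm.SNAPPY)
                    .build();
            TableDescriptor table = TableDescriptorBuilder
                    .newBuilder(TableName.valueOf("logs"))
                    .setColumnFamily(cf)
                    .build();
            admin.createTable(table);
        }
    }
}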

In actual applications, it is recommended to conduct performance testing to observe the performance of different compression algorithms in your specific scenario, and make the best choice based on the test results.

8.4 Using Bloom Filter

Bloom Filter is a hash-based data structure used to quickly determine whether an element exists in a set. In HBase, Bloom Filter can be used to improve reading efficiency and reduce unnecessary disk IO. By enabling Bloom Filter on certain column families, unnecessary disk seeking can be reduced during queries, thereby improving performance.

How Bloom Filter works:

Bloom Filter uses a series of hash functions to map elements into a bit array. When it is necessary to determine whether an element exists, the same hash calculation is performed on the element and the corresponding bits in the bit array are checked to see if they are all 1. If any bit is 0, the element definitely does not exist; if all bits are 1, the element may exist, but it may also be a misjudgment.
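
The following toy implementation only illustrates the principle described above (a bit array plus several hash functions); it is deliberately simplified and is not how HBase implements its Bloom Filters internally:

import java.util.BitSet;

public class SimpleBloomFilter {

    private final BitSet bits;
    private final int size;
    private final int[] seeds = {7, 11, 13};   // one simple hash function per seed

    public SimpleBloomFilter(int size) {
        this.size = size;
        this.bits = new BitSet(size);
    }

    private int hash(String key, int seed) {
        int h = 0;
        for (int i = 0; i < key.length(); i++) {
            h = seed * h + key.charAt(i);
        }
        return Math.abs(h % size);
    }

    public void add(String key) {
        for (int seed : seeds) {
            bits.set(hash(key, seed));         // set every bit the hash functions map to
        }
    }

    public boolean mightContain(String key) {
        for (int seed : seeds) {
            if (!bits.get(hash(key, seed))) {  // any 0 bit => the key is definitely absent
                return false;
            }
        }
        return true;                           // all bits set => possibly present (may be a false positive)
    }

    public static void main(String[] args) {
        SimpleBloomFilter filter = new SimpleBloomFilter(1024);
        filter.add("row-110");
        System.out.println(filter.mightContain("row-110"));  // true
        System.out.println(filter.mightContain("row-999"));  // almost certainly false
    }
}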

Using Bloom Filter in HBase:

In HBase, you can configure whether to enable a Bloom Filter at the column family level. By enabling a Bloom Filter, HBase can quickly determine at query time whether a row may exist in an HFile, thus avoiding unnecessary disk IO. Like compression, the Bloom Filter type is a column-family schema attribute rather than an hbase-site.xml property; for example, in the HBase shell:

alter 'tableName', {NAME => 'familyName', BLOOMFILTER => 'ROW'}

In the above example, replace familyName with the actual column family name and ROW with the desired Bloom Filter type. The supported types are ROW, ROWCOL and NONE.
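
The Bloom Filter type can also be set from the Java client, for example when modifying an existing column family. A minimal sketch (hosts, table and column family names are illustrative):

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.ColumnFamilyDescriptor;
import org.apache.hadoop.hbase.client.ColumnFamilyDescriptorBuilder;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.regionserver.BloomType;
import org.apache.hadoop.hbase.util.Bytes;

public class BloomFilterExample {
    public static void main(String[] args) throws IOException {
        Configuration conf = HBaseConfiguration.create();
        conf.set("hbase.zookeeper.quorum", "linux121,linux122"); // illustrative hosts
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Admin admin = conn.getAdmin()) {
            // enable a ROW-level Bloom Filter on the "info" column family of the "teacher" table
            ColumnFamilyDescriptor cf = ColumnFamilyDescriptorBuilder
                    .newBuilder(Bytes.toBytes("info"))
                    .setBloomFilterType(BloomType.ROW)
                    .build();
            admin.modifyColumnFamily(TableName.valueOf("teacher"), cf);
        }
    }
}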

Applicable scenarios:

Using Bloom Filter can achieve faster read operations on certain column families, especially in random read scenarios. Applicable scenarios include:

  • Fast query in large tables: When performing query operations in large tables, Bloom Filter can help exclude rows that do not meet the conditions, reduce disk IO, and improve query performance.
  • Skipping StoreFiles during reads: when a Get or Scan looks up a specific row (or row and column), the Bloom Filter allows HBase to skip StoreFiles that definitely do not contain the requested key, so fewer files need to be opened and seeked.

Precautions:

  • Bloom Filter is a probabilistic data structure and misjudgments (false positives) may occur. Therefore, it is suitable for scenarios where a certain error is acceptable.
  • Enabling Bloom Filter will occupy a certain amount of memory resources, so the memory usage needs to be weighed based on the actual situation.

In short, by enabling Bloom Filter in HBase, you can reduce unnecessary disk IO and improve read efficiency during query operations. Depending on your query pattern and performance needs, you may choose to enable Bloom Filter on the appropriate column family.

8.5 Index optimization

In HBase, although there is no database index in the traditional sense, you can use some methods to implement index-like functions to improve query efficiency. Here are extended instructions on index optimization:

Omitted here due to character limit

For complete content, please refer to "Big Data HBase Study Bible", pdf, get it from Nien

Through the above optimization measures, you can greatly improve the performance and efficiency of HBase, making it more suitable for processing large-scale data and high-concurrency application scenarios.

In the next chapter, we will further demonstrate the application of HBase in different fields through actual application cases.

9. HBase application cases

9.1 Scenario 1: Storing log data

Omitted here due to character limit

For complete content, please refer to "Big Data HBase Study Bible", pdf, get it from Nien

Through the above optimization strategies, we can build an efficient distributed log storage and analysis platform. The platform can not only store massive access log data, but also enable real-time analysis and query to provide support for user behavior optimization. At the same time, taking advantage of the high scalability of HBase, we can easily cope with the growing amount of data.

9.2 Scenario 2: Time series data storage

Omitted here due to character limit

For complete content, please refer to "Big Data HBase Study Bible", pdf, get it from Nien

Through the above optimization strategies, HBase can support high-throughput time series data writing, and can also perform complex real-time analysis to meet the needs of telecommunications billing, monitoring data analysis, etc. The distributed nature and scalability of HBase make it a powerful tool for time series data storage and analysis.

10. Integrate with the ecosystem

As a powerful distributed NoSQL database, HBase can be tightly integrated with other frameworks in the big data ecosystem to build a complete big data processing solution. In this chapter, we will introduce how to integrate HBase with frameworks such as Hadoop, Hive, and Spark to leverage their advantages and achieve more powerful data processing capabilities.

10.1 HBase and MapReduce integration

Omitted here due to character limit

For complete content, please refer to "Big Data HBase Study Bible", pdf, get it from Nien

In summary, through the combination of HBase and MapReduce, powerful distributed computing capabilities can be achieved for complex distributed computing tasks such as data aggregation, connection operations, and data cleaning.

10.2 HBase and Hive integration

Omitted here due to character limit

For complete content, please refer to "Big Data HBase Study Bible", pdf, get it from Nien

By executing these queries, you can analyze and process data from HBase in Hive without writing complex Java code.

Through the integration of HBase and Hive, you can give full play to the powerful analysis capabilities of Hive, while taking advantage of the high-performance storage of HBase to provide a more flexible and convenient way for data analysis and report generation.

10.3 HBase and Spark integration

Omitted here due to character limit

For complete content, please refer to "Big Data HBase Study Bible", pdf, get it from Nien

Through the above integration examples, you can give full play to the synergy between HBase and the big data ecosystem to build more powerful data processing and analysis solutions.

In the final chapter, we will review the entire tutorial and provide you with suggestions for further study and practice.

11. Summary

Through this tutorial, I believe that everyone has a certain understanding of the basic concepts, architecture, installation configuration, data model, operation API and optimization of HBase. HBase, as a powerful distributed and scalable NoSQL database, provides important support for processing large-scale data.

Of course, to master HBase in depth, you also need continuous practice and optimization in actual projects. If you want to continue learning more advanced topics, you can refer to the references provided in this tutorial and continue to gain experience in real projects.

12. References

[1] HBase Chinese official documentation: http://hbase.org.cn/

[2] HBase authoritative guide: https://book.douban.com/subject/26560706/

[3] HBase technology insider: https://book.douban.com/subject/26649202/

Say it later

This article is the V1 version of the "Big Data HBase Study Bible" and a companion volume to the "Nien Big Data Interview Guide".

Here is a special note: since the first release of the 5 special-topic PDFs of the "Nien Big Data Interview Guide", it has collected hundreds of questions and a large amount of useful, authentic material for interviews at big companies. The "Nien Big Data Interview Guide" is a collection of interview questions that has become a must-read book for big data learning and interviews.

Therefore, the Nien architecture team struck while the iron was hot and launched the "Big Data Flink Study Bible" and the "Big Data HBase Study Bible" (this article).

For the complete pdf, you can follow Nien’s official account [Technical Freedom Circle].

Moreover, "Big Data HBASE Study Bible", "Big Data Flink Study Bible", and "Nion Big Data Interview Guide" will continue to be iterated and updated to absorb the latest interview questions. For the latest version, please see the official account at the end of the article [ Technical freedom circle]

about the author

First author: Andy , senior architect, one of the authors of "Java High Concurrency Core Programming Enhanced Edition".

Second author: Nien , a 41-year-old senior architect, senior writer in the IT field, and well-known blogger. The creator of "Java High Concurrency Core Programming Enhanced Edition" Volumes 1, 2 and 3, and the author of 11 PDF "bibles" including the "K8S Study Bible", "Docker Study Bible" and "Go Study Bible". He is also a senior architecture mentor and architecture transformation mentor who has successfully guided many intermediate and senior Java engineers into architect positions; his most successful student received an annual salary of nearly 1 million.

recommended reading

" Message Push Architecture Design "

" Alibaba 2: How many nodes do you deploy?" How to deploy 1000W concurrency? "

" Meituan 2 Sides: Five Nines High Availability 99.999%. How to achieve it?" "

" NetEase side: Single node 2000Wtps, how does Kafka do it?" "

" Byte Side: What is the relationship between transaction compensation and transaction retry?" "

" NetEase side: 25Wqps high throughput writing Mysql, 100W data is written in 4 seconds, how to achieve it?" "

" How to structure billion-level short videos? " "

" Blow up, rely on "bragging" to get through JD.com, monthly salary 40K "

" It's so fierce, I rely on "bragging" to get through SF Express, and my monthly salary is 30K "

" It exploded...Jingdong asked for 40 questions on one side, and after passing it, it was 500,000+ "

" I'm so tired of asking questions... Ali asked 27 questions while asking for his life, and after passing it, it's 600,000+ "

" After 3 hours of crazy asking on Baidu, I got an offer from a big company. This guy is so cruel!" "

" Ele.me is too cruel: Face an advanced Java, how hard and cruel work it is "

" After an hour of crazy asking by Byte, the guy got the offer, it's so cruel!" "

" Accept Didi Offer: From three experiences as a young man, see what you need to learn?" "

"Nien Architecture Notes", "Nien High Concurrency Trilogy", "Nien Java Interview Guide" PDF, please go to the following official account [Technical Freedom Circle] to get ↓↓↓
