HBase: Hadoop's Distributed Database for Big Data

1. The limitations of Hadoop


To understand why HBase was created, you first need to understand the limitations of Hadoop. Hadoop can store structured, semi-structured and even unstructured data through HDFS. It complements traditional databases and is the best option for massive data storage: it is optimized for both batch and streaming access to large files, and it solves the disaster recovery problem through multiple replicas.

But the flaw of Hadoop is that it can only perform batch processing and can only access data sequentially, which means that even for the simplest job the entire data set must be scanned; random access to the data is not possible. Random access is what traditional relational databases are good at, but they cannot store massive data. So a new kind of solution was needed for massive data storage with random access, and HBase is one of them (HBase, Cassandra, CouchDB, Dynamo and MongoDB can all store massive data and support random access).

Note: Data structure classification:

  • Structured data: data managed in the form of relational database tables;
  • Semi-structured data: data that does not follow the relational model but has a basically fixed structure, such as log files, XML documents, JSON documents, emails, etc.;
  • Unstructured data: data without a fixed schema, such as Word, PDF, PPT and Excel files, as well as pictures and videos in various formats.

2. Introduction to HBase

HBase is a column-oriented database management system built on top of the Hadoop file system.

HBase's data model is similar to Google's Bigtable, and it is part of the Hadoop ecosystem. It stores its data on HDFS, and clients can randomly access that data through HBase. It has the following characteristics:

  • Complex transactions are not supported, only row-level transactions: reads and writes of a single row are atomic;
  • Since HDFS is used as the underlying storage, it supports structured, semi-structured and unstructured storage just like HDFS;
  • Support horizontal expansion by adding machines;
  • Support data fragmentation;
  • Support automatic failover between RegionServers;
  • Easy-to-use Java client API;
  • Support BlockCache and Bloom filter;
  • Filters support predicate pushdown.

3. HBase Table

HBase is a column-oriented database management system; more precisely, it is column-family-oriented. A table schema only defines the column families. A table has multiple column families, each column family can contain any number of columns, and each column consists of multiple cells. A cell can store multiple versions of a value, distinguished by timestamp.

Consider the following example of an HBase table:

  • RowKey is the unique identifier of the row, and all rows are sorted according to the lexicographical order of RowKey;
  • The table has two column families, personal and office;
  • The column family personal has three columns: name, city and phone; the column family office has two columns: tel and addresses.

HBase tables have the following characteristics:
  • Large capacity: a table can have billions of rows and millions of columns;
  • Column-oriented: data is stored by column (column family), with each column family stored separately; since the data itself serves as the index, a query only reads the specified columns, which effectively reduces the system's I/O burden;
  • Sparsity: empty (null) columns do not occupy storage space, so tables can be designed to be very sparse;
  • Multiple versions of data: each cell can hold multiple versions of its value, sorted by timestamp with the newest version first (see the sketch after this list);
  • Storage type: The underlying storage format of all data is byte array (byte[]).
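
To make the multi-version behaviour concrete, here is a minimal Java sketch of reading several versions of one cell. It is only an illustration, assuming an existing Connection named connection and the Student table with the baseInfo column family used in the shell examples later in this article (its VERSIONS limit must be greater than 1, as set by the alter command shown there):

```java
import java.io.IOException;
import org.apache.hadoop.hbase.Cell;
import org.apache.hadoop.hbase.CellUtil;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class MultiVersionReadExample {

    // Reads up to three versions of the baseInfo:age cell for one row key.
    public static void printVersions(Connection connection) throws IOException {
        try (Table table = connection.getTable(TableName.valueOf("Student"))) {
            Get get = new Get(Bytes.toBytes("rowkey1"));
            get.setMaxVersions(3); // ask for up to 3 versions instead of only the latest

            Result result = table.get(get);
            // Cells are returned newest first; the timestamp is the version
            for (Cell cell : result.getColumnCells(Bytes.toBytes("baseInfo"), Bytes.toBytes("age"))) {
                System.out.println(cell.getTimestamp() + " -> "
                        + Bytes.toString(CellUtil.cloneValue(cell)));
            }
        }
    }
}
```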

4. Brief description of the data reading and writing process

4.1 Process of writing data

  • Client submits a write request to Region Server;
  • Region Server finds the target Region;
  • Region checks whether the data is consistent with the Schema;
  • If the client does not specify a version, the current system time is used as the data version (see the sketch after this list);
  • Write the update to the WAL (write-ahead log);
  • Write the update to the MemStore;
  • Check whether the MemStore is full; if it is, it is flushed to disk as an HFile in the Store.
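
The version step above can be illustrated with a small Java sketch (the Student table, the baseInfo column family and the timestamp value are only illustrative): a client can either let the Region Server assign the current time as the version, or pass a timestamp explicitly in the Put.

```java
import java.io.IOException;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class PutVersionExample {

    public static void writeWithVersions(Connection connection) throws IOException {
        try (Table table = connection.getTable(TableName.valueOf("Student"))) {
            // No timestamp given: the RegionServer uses the current system time as the version
            Put autoVersion = new Put(Bytes.toBytes("rowkey1"));
            autoVersion.addColumn(Bytes.toBytes("baseInfo"), Bytes.toBytes("age"), Bytes.toBytes("29"));
            table.put(autoVersion);

            // Explicit timestamp: this value becomes the cell's version
            Put explicitVersion = new Put(Bytes.toBytes("rowkey1"));
            explicitVersion.addColumn(Bytes.toBytes("baseInfo"), Bytes.toBytes("age"),
                    1265875194289L, Bytes.toBytes("30"));
            table.put(explicitVersion);
        }
    }
}
```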

4.2 Process of reading data

The following is the process when a client reads or writes data in HBase for the first time:

  • The client obtains the Region Server where the META table is located from Zookeeper;
  • The client accesses that Region Server and looks up in the META table which Region Server holds the requested row key; the client then caches this information together with the location of the META table;
  • The client gets data from the Region Server where the row key is located.

On subsequent reads, the client gets the Region Server for the row key from its cache, so it does not need to query the META table again, unless the cache is invalidated because the Region has moved; in that case the client re-queries the META table and updates the cache.

Note: The META table is a special HBase table that stores the location information of all Regions; the location of the META table itself is stored in ZooKeeper.
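
The META table (hbase:meta) is itself an ordinary HBase table, so it can be inspected with a normal scan. Below is a minimal sketch, assuming an existing Connection named connection; each row of hbase:meta describes one Region and the Region Server that serves it:

```java
import java.io.IOException;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class MetaTableExample {

    // Prints the row keys of hbase:meta; each row corresponds to one Region.
    public static void listRegions(Connection connection) throws IOException {
        try (Table meta = connection.getTable(TableName.META_TABLE_NAME);
             ResultScanner scanner = meta.getScanner(new Scan())) {
            for (Result result : scanner) {
                System.out.println(Bytes.toString(result.getRow()));
            }
        }
    }
}
```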

5. Basic usage of HBase Java API 1.0

5.1 Create a new Maven project and import project dependencies
To use the Java API to operate HBase, you need to introduce hbase-client. The HBase Client version chosen here is 1.2.0.

<dependency>
    <groupId>org.apache.hbase</groupId>
    <artifactId>hbase-client</artifactId>
    <version>1.2.0</version>
</dependency>
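
The HBaseUtils helper class used in the tests below is not reproduced here (see the tool-class column the author references later). As a hedged sketch of how such a class typically obtains the shared Connection it reuses (the ZooKeeper address is a placeholder, not the author's actual configuration):

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;

public class HBaseConnectionHolder {

    private static Connection connection;

    static {
        try {
            Configuration configuration = HBaseConfiguration.create();
            // Placeholder addresses; point these at your own ZooKeeper quorum
            configuration.set("hbase.zookeeper.quorum", "hadoop001");
            configuration.set("hbase.zookeeper.property.clientPort", "2181");
            connection = ConnectionFactory.createConnection(configuration);
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    public static Connection getConnection() {
        return connection;
    }
}
```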

5.2 Unit testing

Test the API encapsulated above in the form of unit tests (for the tool class itself, refer to my tool class column).

import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.filter.CompareFilter;
import org.apache.hadoop.hbase.filter.FilterList;
import org.apache.hadoop.hbase.filter.SingleColumnValueFilter;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.hbase.util.Pair;
import org.junit.Test;

import java.util.Arrays;
import java.util.List;

public class HBaseUtilsTest {

    private static final String TABLE_NAME = "class";
    private static final String TEACHER = "teacher";
    private static final String STUDENT = "student";

    @Test
    public void createTable() {
        // Create the table with two column families
        List<String> columnFamilies = Arrays.asList(TEACHER, STUDENT);
        boolean table = HBaseUtils.createTable(TABLE_NAME, columnFamilies);
        System.out.println("Table creation result: " + table);
    }

    @Test
    public void insertData() {
        List<Pair<String, String>> pairs1 = Arrays.asList(new Pair<>("name", "Tom"),
                new Pair<>("age", "22"),
                new Pair<>("gender", "1"));
        HBaseUtils.putRow(TABLE_NAME, "rowKey1", STUDENT, pairs1);

        List<Pair<String, String>> pairs2 = Arrays.asList(new Pair<>("name", "Jack"),
                new Pair<>("age", "33"),
                new Pair<>("gender", "2"));
        HBaseUtils.putRow(TABLE_NAME, "rowKey2", STUDENT, pairs2);

        List<Pair<String, String>> pairs3 = Arrays.asList(new Pair<>("name", "Mike"),
                new Pair<>("age", "44"),
                new Pair<>("gender", "1"));
        HBaseUtils.putRow(TABLE_NAME, "rowKey3", STUDENT, pairs3);
    }

    @Test
    public void getRow() {
        Result result = HBaseUtils.getRow(TABLE_NAME, "rowKey1");
        if (result != null) {
            System.out.println(Bytes
                    .toString(result.getValue(Bytes.toBytes(STUDENT), Bytes.toBytes("name"))));
        }
    }

    @Test
    public void getCell() {
        String cell = HBaseUtils.getCell(TABLE_NAME, "rowKey2", STUDENT, "age");
        System.out.println("cell age :" + cell);
    }

    @Test
    public void getScanner() {
        ResultScanner scanner = HBaseUtils.getScanner(TABLE_NAME);
        if (scanner != null) {
            scanner.forEach(result -> System.out.println(Bytes.toString(result.getRow()) + "->" + Bytes
                    .toString(result.getValue(Bytes.toBytes(STUDENT), Bytes.toBytes("name")))));
            scanner.close();
        }
    }

    @Test
    public void getScannerWithFilter() {
        // All filters in the list must pass (logical AND)
        FilterList filterList = new FilterList(FilterList.Operator.MUST_PASS_ALL);
        // In the 1.x client the comparison is expressed with CompareFilter.CompareOp
        // (CompareOperator is the 2.x equivalent)
        SingleColumnValueFilter nameFilter = new SingleColumnValueFilter(Bytes.toBytes(STUDENT),
                Bytes.toBytes("name"), CompareFilter.CompareOp.EQUAL, Bytes.toBytes("Jack"));
        filterList.addFilter(nameFilter);
        ResultScanner scanner = HBaseUtils.getScanner(TABLE_NAME, filterList);
        if (scanner != null) {
            scanner.forEach(result -> System.out.println(Bytes.toString(result.getRow()) + "->" + Bytes
                    .toString(result.getValue(Bytes.toBytes(STUDENT), Bytes.toBytes("name")))));
            scanner.close();
        }
    }

    @Test
    public void deleteColumn() {
        boolean b = HBaseUtils.deleteColumn(TABLE_NAME, "rowKey2", STUDENT, "age");
        System.out.println("Deletion result: " + b);
    }

    @Test
    public void deleteRow() {
        boolean b = HBaseUtils.deleteRow(TABLE_NAME, "rowKey2");
        System.out.println("Deletion result: " + b);
    }

    @Test
    public void deleteTable() {
        boolean b = HBaseUtils.deleteTable(TABLE_NAME);
        System.out.println("Deletion result: " + b);
    }
}

6. Basic use of HBase Java API 2.0

6.1 Create a new Maven project and import project dependencies

The version of HBase Client selected here is the latest 2.1.4.

<dependency>
    <groupId>org.apache.hbase</groupId>
    <artifactId>hbase-client</artifactId>
    <version>2.1.4</version>
</dependency>

For the implementation of the 2.x utility class, refer to my tools column.
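
As with the 1.0 section, the utility class itself is not shown here. As a hedged illustration of the main difference in the 2.x client, table administration now goes through builder-style descriptors instead of HTableDescriptor/HColumnDescriptor. A minimal sketch, assuming an existing Connection named connection and the same class table used earlier:

```java
import java.io.IOException;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.ColumnFamilyDescriptorBuilder;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.TableDescriptor;
import org.apache.hadoop.hbase.client.TableDescriptorBuilder;

public class HBase2CreateTableExample {

    // Creates a 'class' table with 'teacher' and 'student' column families using the 2.x API.
    public static void createTable(Connection connection) throws IOException {
        try (Admin admin = connection.getAdmin()) {
            TableName tableName = TableName.valueOf("class");
            if (admin.tableExists(tableName)) {
                return; // nothing to do
            }
            TableDescriptor descriptor = TableDescriptorBuilder.newBuilder(tableName)
                    .setColumnFamily(ColumnFamilyDescriptorBuilder.of("teacher"))
                    .setColumnFamily(ColumnFamilyDescriptorBuilder.of("student"))
                    .build();
            admin.createTable(descriptor);
        }
    }
}
```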

7. Connecting to HBase correctly

In the code above, the Connection is initialized when the class is loaded and then reused by every subsequent method. You might wonder whether a custom connection pool would give better performance; in fact, this is not necessary.

HBase clients need to connect to three different service roles:

  • Zookeeper: Mainly used to obtain the location information of the meta table and Master information;
  • HBase Master: It is mainly used to perform some operations of the HBaseAdmin interface, such as creating tables, etc.;
  • HBase RegionServer: used to read and write data.


HBase provides three resource pool implementations: Reusable, RoundRobin and ThreadLocal. The implementation can be chosen with the hbase.client.ipc.pool.type configuration item (the default is Reusable), and the pool size with hbase.client.ipc.pool.size (the default is 1, i.e. one connection per server). These can also be set in code:

config.set("hbase.client.ipc.pool.type",...);
config.set("hbase.client.ipc.pool.size",...);
connection = ConnectionFactory.createConnection(config);
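
In other words, the recommended pattern is to share a single Connection across the whole application (it is thread-safe but expensive to create) and to obtain a lightweight Table instance per operation, since Table is not thread-safe. A minimal sketch of this pattern (the ZooKeeper address is a placeholder):

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class SharedConnectionExample {

    // One Connection for the whole application: thread-safe, expensive to create
    private static final Connection CONNECTION = createConnection();

    private static Connection createConnection() {
        try {
            Configuration configuration = HBaseConfiguration.create();
            configuration.set("hbase.zookeeper.quorum", "hadoop001"); // placeholder
            return ConnectionFactory.createConnection(configuration);
        } catch (IOException e) {
            throw new RuntimeException("Failed to create HBase connection", e);
        }
    }

    // A Table per operation: lightweight, not thread-safe, so close it after use
    public static Result get(String tableName, String rowKey) throws IOException {
        try (Table table = CONNECTION.getTable(TableName.valueOf(tableName))) {
            return table.get(new Get(Bytes.toBytes(rowKey)));
        }
    }
}
```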

8. HBase Common Shell Commands

1. Basic commands
Open the HBase shell:
# hbase shell
1.1 Get help
# get help
help
# get detailed information about a command
help 'status'
1.2 Check the server status
status
1.3 Check the version
version
2. Table operations
2.1 List all tables
list
2.2 Create a table
Command format: create 'table name', 'column family 1', 'column family 2', ..., 'column family N'
# create a table named Student with two column families: basic information (baseInfo) and school information (schoolInfo)
create 'Student','baseInfo','schoolInfo'
2.3 View basic table information
Command format: desc 'table name'

describe 'Student'
2.4 Enable/disable a table
enable and disable enable or disable a table; is_enabled and is_disabled check whether a table is enabled or disabled
# disable the table
disable 'Student'
# check whether the table is disabled
is_disabled 'Student'
# enable the table
enable 'Student'
# check whether the table is enabled
is_enabled 'Student'
2.5 Check whether a table exists
exists 'Student'
2.6 Drop a table
# a table must be disabled before it can be dropped
disable 'Student'
# drop the table
drop 'Student'
3. Inserting, updating and deleting data
3.1 Add a column family
Command format: alter 'table name', 'column family name'

alter 'Student', 'teacherInfo'
3.2 Delete a column family
Command format: alter 'table name', {NAME => 'column family name', METHOD => 'delete'}

alter 'Student', {NAME => 'teacherInfo', METHOD => 'delete'}

3.3 Change the number of versions stored by a column family
By default a column family stores only one version of the data. To store multiple versions, modify the column family's properties; the change can be checked afterwards with the desc command.

alter 'Student', {NAME => 'baseInfo', VERSIONS => 3}

3.4 Insert data
Command format: put 'table name', 'row key', 'column family:column', 'value'

Note: If the row key, column family and column of the new data are exactly the same as existing data, the put is effectively an update operation.

put 'Student', 'rowkey1','baseInfo:name','tom'
put 'Student', 'rowkey1','baseInfo:birthday','1990-01-09'
put 'Student', 'rowkey1','baseInfo:age','29'
put 'Student', 'rowkey1','schoolInfo:name','Harvard'
put 'Student', 'rowkey1','schoolInfo:location','Boston'

put 'Student', 'rowkey2','baseInfo:name','jack'
put 'Student', 'rowkey2','baseInfo:birthday','1998-08-22'
put 'Student', 'rowkey2','baseInfo:age','21'
put 'Student', 'rowkey2','schoolInfo:name','yale'
put 'Student', 'rowkey2','schoolInfo:location','New Haven'

put 'Student', 'rowkey3','baseInfo:name','maike'
put 'Student', 'rowkey3','baseInfo:birthday','1995-01-22'
put 'Student', 'rowkey3','baseInfo:age','24'
put 'Student', 'rowkey3','schoolInfo:name','yale'
put 'Student', 'rowkey3','schoolInfo:location','New Haven'

put 'Student', 'wrowkey4','baseInfo:name','maike-jack'


3.5 Get a specified row, or a column family or column within a row

# get all columns of the specified row
get 'Student','rowkey3'
# get all columns under the specified column family of the specified row
get 'Student','rowkey3','baseInfo'
# get the specified column of the specified row
get 'Student','rowkey3','baseInfo:name'

3.6 Delete a specified row, or a column within a row

# delete the specified row
delete 'Student','rowkey3'
# delete the specified column of the specified row
delete 'Student','rowkey3','baseInfo:name'

4. Query
There are two basic ways to access data in HBase:

  • Get data by specified rowkey: get method;
  • Get data according to specified conditions: scan method.

scan can be given start and stop parameters to access all data within a range; a get is essentially a special scan in which the start and stop rows are equal.

4.1 Get query

# get all columns of the specified row
get 'Student','rowkey3'
# get all columns under the specified column family of the specified row
get 'Student','rowkey3','baseInfo'
# get the specified column of the specified row
get 'Student','rowkey3','baseInfo:name'
4.2 Scan the whole table
scan 'Student'
4.3 Scan a specified column family
scan 'Student', {COLUMN=>'baseInfo'}
4.4 Conditional queries
# query a specified column
scan 'Student', {COLUMNS=> 'baseInfo:birthday'}
Besides the COLUMNS modifier, HBase also supports LIMIT (limit the number of result rows), STARTROW (the starting row key; the region is first located from this key and then scanned forward), STOPROW (the ending row key), TIMERANGE (restrict the timestamp range), VERSIONS (the number of versions) and FILTER (filter rows by condition).

For example, the following starts from the row key rowkey2 and retrieves the latest 3 versions of the name column for the next two rows:

scan 'Student', {COLUMNS=> 'baseInfo:name',STARTROW => 'rowkey2',STOPROW => 'wrowkey4',LIMIT=>2, VERSIONS=>3}

4.5 Conditional filtering

A Filter can specify a series of conditions for filtering. For example, to query all cells whose value equals 24:

scan 'Student', FILTER=>"ValueFilter(=,'binary:24')"
All cells whose value contains yale:

scan 'Student', FILTER=>"ValueFilter(=,'substring:yale')"
All columns whose name has the prefix birth:

scan 'Student', FILTER=>"ColumnPrefixFilter('birth')"
FILTER supports combining multiple conditions with parentheses, AND and OR:

# columns whose name has the prefix birth and whose value contains 1998
scan 'Student', FILTER=>"ColumnPrefixFilter('birth') AND ValueFilter(=,'substring:1998')"
PrefixFilter filters on the row key prefix:

scan 'Student', FILTER=>"PrefixFilter('wr')"

9. HBase disaster recovery and backup

1. Preface
This section introduces three simple disaster recovery and backup schemes commonly used with HBase: CopyTable, Export/Import, and Snapshot. They are introduced in turn below.

2. CopyTable
2.1 Introduction
CopyTable copies the data of an existing table into a new table, and has the following characteristics:

  • It supports restricting by time range and row range, renaming the table, renaming column families, and choosing whether to copy deleted data;
  • Before running the command, you need to create a new table with the same structure as the original table;
  • CopyTable is built on the HBase client API: it reads with scan and writes with put.

2.2 Command format

Usage: CopyTable [general options] [--starttime=X] [--endtime=Y] [--new.name=NEW] [--peer.adr=ADR] <tablename>

2.3 Common commands
CopyTable under the same cluster

hbase org.apache.hadoop.hbase.mapreduce.CopyTable --new.name=tableCopy  tableOrig

CopyTable under different clusters

# when the two tables have the same name
hbase org.apache.hadoop.hbase.mapreduce.CopyTable \
--peer.adr=dstClusterZK:2181:/hbase tableOrig
# you can also specify a new table name
hbase org.apache.hadoop.hbase.mapreduce.CopyTable \
--peer.adr=dstClusterZK:2181:/hbase \
--new.name=tableCopy tableOrig

Below is a fairly complete example from the official documentation; it specifies the start and end times, the destination cluster address, and copies only the specified column families:

hbase org.apache.hadoop.hbase.mapreduce.CopyTable \
--starttime=1265875194289 \
--endtime=1265878794289 \
--peer.adr=server1,server2,server3:2181:/hbase \
--families=myOldCf:myNewCf,cf2,cf3 TestTable

2.4 More parameters
More supported parameters can be viewed with --help:

# hbase org.apache.hadoop.hbase.mapreduce.CopyTable --help

3. Export/Import
3.1 Introduction
Export supports exporting data to HDFS, and Import supports importing data back from HDFS. Export also allows specifying the start and end time of the exported data, so it can be used for incremental backups.
Like CopyTable, Export relies on HBase's scan operation.

3.2 Command format

# Export
hbase org.apache.hadoop.hbase.mapreduce.Export <tablename> <outputdir> [<versions> [<starttime> [<endtime>]]]
# Import
hbase org.apache.hadoop.hbase.mapreduce.Import <tablename> <inputdir>
The output directory (outputdir) does not need to be created in advance; it is created automatically. After the export completes, the exported files are owned by the user who ran the export command.
By default only the latest version of each cell is exported, regardless of its history. To export multiple versions, replace the <versions> parameter with the desired number of versions.

3.3 Common commands

Export command
hbase org.apache.hadoop.hbase.mapreduce.Export tableName  <hdfs path>/tableName.db
Import command
hbase org.apache.hadoop.hbase.mapreduce.Import tableName  <hdfs path>/tableName.db

4. Snapshot
4.1 Introduction
HBase's snapshot feature lets you obtain a copy of a table (both its content and its metadata) with very little performance overhead, because a snapshot stores only the table metadata and references to the HFiles. A snapshot's clone operation creates a new table from the snapshot, and its restore operation rolls the table back to the state it had when the snapshot was taken. Neither clone nor restore copies any data: the underlying HFiles (the files containing the HBase table data) are not modified, only the table's metadata is changed.

4.2 Configuration
The snapshot feature is not enabled by default. To enable snapshots, add the following configuration item to the hbase-site.xml file:

<property>
    <name>hbase.snapshot.enabled</name>
    <value>true</value>
</property>

4.3 Common commands
All snapshot commands must be executed in the HBase shell interactive command line.

  1. Take a snapshot
# take a snapshot
hbase> snapshot 'table name', 'snapshot name'

By default the in-memory data is flushed before the snapshot is taken, to ensure that data still in memory is included in the snapshot. If you do not want to include the in-memory data, you can skip the flush with the SKIP_FLUSH option.

# skip the memory flush
hbase> snapshot 'table name', 'snapshot name', {SKIP_FLUSH => true}
  2. List snapshots
# get the list of snapshots
hbase> list_snapshots
  3. Delete a snapshot
# delete a snapshot
hbase> delete_snapshot 'snapshot name'
  4. Clone a table from a snapshot
# create a new table from an existing snapshot
hbase> clone_snapshot 'snapshot name', 'new table name'
  5. Restore a snapshot
Restoring rolls the table back to the snapshot point; the table must be disabled before the restore operation.
hbase> disable 'table name'
hbase> restore_snapshot 'snapshot name'

Note that if HBase is configured with Replication-based master-slave replication, the replica cluster will end up in a different state from the master after a restore, because Replication works at the log level while snapshots work at the file-system level. In that case you can stop replication first and re-establish it after all servers have been restored to a consistent data point.

10. Importing the Phoenix core JAR package (Phoenix is a middle layer on top of HBase that lets you operate on the database through JDBC)

If it is a Maven project, find the corresponding version in the Maven central repository and import the dependency:

 <!-- https://mvnrepository.com/artifact/org.apache.phoenix/phoenix-core -->
    <dependency>
      <groupId>org.apache.phoenix</groupId>
      <artifactId>phoenix-core</artifactId>
      <version>4.14.0-cdh5.14.2</version>
    </dependency>

If it is an ordinary (non-Maven) project, you can find the corresponding JAR package in the Phoenix installation directory and import it manually.

Code example:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;


public class PhoenixJavaApi {

    public static void main(String[] args) throws Exception {

        // Load the database driver
        Class.forName("org.apache.phoenix.jdbc.PhoenixDriver");

        /*
         * Specify the database address in the format jdbc:phoenix:<ZooKeeper address>.
         * If HBase runs in standalone or pseudo-distributed mode, it uses the built-in
         * ZooKeeper by default, listening on port 2181.
         */
        Connection connection = DriverManager.getConnection("jdbc:phoenix:192.168.200.226:2181");

        PreparedStatement statement = connection.prepareStatement("SELECT * FROM us_population");

        ResultSet resultSet = statement.executeQuery();

        while (resultSet.next()) {
            System.out.println(resultSet.getString("city") + " "
                    + resultSet.getInt("population"));
        }

        statement.close();
        connection.close();
    }
}
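
Phoenix is not limited to queries: writes go through the UPSERT statement over the same JDBC connection. The following is a minimal sketch, assuming the us_population table from the Phoenix quick-start (columns state, city, population); note that Phoenix connections do not auto-commit by default, so an explicit commit() is required.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

public class PhoenixUpsertExample {

    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.phoenix.jdbc.PhoenixDriver");
        Connection connection = DriverManager.getConnection("jdbc:phoenix:192.168.200.226:2181");

        // Phoenix uses UPSERT for both inserts and updates
        PreparedStatement statement = connection.prepareStatement(
                "UPSERT INTO us_population VALUES (?, ?, ?)");
        statement.setString(1, "NY");
        statement.setString(2, "New York");
        statement.setLong(3, 8143197L);
        statement.executeUpdate();

        // Auto-commit is off by default in Phoenix, so the write must be committed explicitly
        connection.commit();

        statement.close();
        connection.close();
    }
}
```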

Phoenix can also be integrated with MyBatis to operate on the data.
