Basic concepts of Apache HBase and the Java API

1. A basic overview of Apache HBase

Apache HBase is a Hadoop-based database. It is reliable, multi-versioned, and distributed, and is suitable for storing structured big data. Apache HBase is an open-source implementation of Google's Bigtable: a non-relational database based on column-oriented storage.

(1) The difference between row storage and column storage

Column storage and row storage refer to how data is organized on the storage medium.
- Relational databases (row storage): Oracle, MySQL, etc.
- Non-relational databases (column storage): HBase, Redis

(2) HBase data model and concepts

(1) Rowkey (primary key): the unique identifier of a row of data; it cannot be repeated, rows are automatically sorted by rowkey in lexicographic (dictionary) order, and it is stored as byte[] at the bottom layer.
(2) Column family: a collection of columns; columns with similar functions or belonging to the same business are usually grouped into one column family.
(3) Cell: located by rowkey + column family + column; a cell can hold multiple versions, one by default.
(4) Multiple versions: a cell can be configured to keep more than one version of its data.
(5) Version number: the system's current timestamp by default; unless a version is specified, the cell value with the latest timestamp is returned to the user.
(6) Column: a field within a column family used to store one category of data.
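Because rowkeys are compared as raw bytes in lexicographic order, numeric suffixes need fixed-width padding to sort numerically. A minimal dependency-free sketch (plain Java, using `Arrays.compareUnsigned` in place of HBase's `Bytes.compareTo`, which compares the same way):

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class RowKeyOrder {
    // Compare two rowkeys the way HBase does: unsigned, byte by byte.
    static int compare(String a, String b) {
        return Arrays.compareUnsigned(
                a.getBytes(StandardCharsets.UTF_8),
                b.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        // Unpadded numbers sort lexicographically, not numerically:
        System.out.println(compare("order2", "order10") > 0);   // true: "order2" sorts AFTER "order10"
        // Zero-padding restores the expected numeric order:
        System.out.println(compare("order002", "order010") < 0); // true
    }
}
```

This is why rowkeys such as order001, order002, ... are preferred over order1, order2, ... when range scans in insertion order are needed.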

(3) HBase features:

(1) Large: a table can have tens of billions of rows and millions of columns.
(2) Column-oriented: storage and access control are organized per column (column family).
(3) Sparse: empty (NULL) columns occupy no storage space, so tables can be designed to be very sparse.
(4) Schema-free: each row has a sortable primary key and any number of columns; columns can be added dynamically as needed, and different rows in the same table can have completely different columns.
(5) Multi-versioned data: the data in each cell can have multiple versions. By default, the version number is assigned automatically and is the timestamp when the cell was written.
(6) Single data type: data in HBase is stored as byte[] at the bottom layer, so it can hold data of any type.
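Because everything is stored as byte[], the client is responsible for serialization; HBase ships the helper class `Bytes` for this. A dependency-free sketch of the same int round-trip using `java.nio.ByteBuffer` (which produces the same 4-byte big-endian layout as `Bytes.toBytes(int)`):

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

public class ByteCodec {
    // Serialize an int to a 4-byte big-endian array, like Bytes.toBytes(int).
    static byte[] toBytes(int v) {
        return ByteBuffer.allocate(4).putInt(v).array();
    }

    // Deserialize 4 big-endian bytes back to an int, like Bytes.toInt(byte[]).
    static int toInt(byte[] b) {
        return ByteBuffer.wrap(b).getInt();
    }

    public static void main(String[] args) {
        byte[] raw = toBytes(123);
        System.out.println(raw.length); // 4
        System.out.println(toInt(raw)); // 123
        // Strings are just UTF-8 bytes:
        byte[] s = "order101".getBytes(StandardCharsets.UTF_8);
        System.out.println(new String(s, StandardCharsets.UTF_8)); // order101
    }
}
```

The important consequence is that HBase itself cannot tell an int from a string: reading a cell with the wrong decoder silently produces garbage, so the writer and reader must agree on the encoding.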

(4) HBase architecture and the fully distributed structure in detail

Fully distributed structure:
ZooKeeper serves as the entrance to the HBase cluster (as the Java API shows later, clients connect to HBase through ZooKeeper).

Detailed explanation of the HBase architecture:
Each table in HBase is divided by rowkey range into multiple sub-tables (HRegions). By default, an HRegion that grows beyond 256 MB is split into two. This splitting is handled by the HRegionServer, while the allocation of HRegions is managed by the HMaster.

The role of the HMaster:

(1) Assign HRegions to HRegionServers
(2) Perform load balancing across HRegionServers
(3) Discover failed HRegionServers and reassign their HRegions
(4) Collect garbage files on HDFS
(5) Process schema update requests

The role of the HRegionServer:

(1) Maintain the HRegions assigned to it by the HMaster and handle IO requests for those HRegions
(2) Split HRegions that have grown too large during operation

As this shows, a Client does not need the HMaster to access data on HBase: it addresses regions via ZooKeeper and the HRegionServers, and reads and writes data through the HRegionServers. The HMaster only maintains the metadata of Tables and HRegions (the Table metadata is stored on ZooKeeper), so its load is very low.

When an HRegionServer opens a sub-table, it creates an HRegion object and then one HStore object for each column family of the table. Each HStore has one MemStore and 0 or more StoreFiles, and each StoreFile corresponds to one HFile, the actual storage file. An HRegion therefore contains as many HStores as the table has column families, and one HRegionServer holds multiple HRegions and a single HLog.

HRegion

A Table is divided into multiple HRegions in the row direction. The HRegion is the smallest unit of distributed storage and load balancing in HBase: different HRegions can live on different HRegionServers, but a single HRegion is never split across multiple HRegionServers. HRegions are split by size. Each table initially has only one HRegion; as data is continuously inserted into the table, the HRegion keeps growing.
When a column family of an HRegion reaches a threshold (256 MB by default), the HRegion is split into two new HRegions.

An HRegion is identified as follows:
1. By the triple <table name, StartRowKey, creation time>.
2. The catalog tables (-ROOT- and .META.) record each Region's EndRowKey.

HRegion location: which HRegionServer an HRegion is assigned to is completely dynamic, so a mechanism is needed to find the HRegionServer that currently holds a given HRegion. HBase uses a three-level structure to locate an HRegion:
    1. Read the file /hbase/rs in ZooKeeper to get the location of the -ROOT- table. The -ROOT- table has only one region.
    2. Query the -ROOT- table for the location of the .META. region that covers the target. In fact, -ROOT- is the first region of the .META. table;
         every region of the .META. table is one row of records in -ROOT-.
    3. Query the .META. table for the location of the desired user-table HRegion. Every HRegion of a user table is one row of records in .META..

    The -ROOT- table is never split into multiple HRegions, which guarantees that any region can be located in at most three hops. The Client caches the location information it looks up, and the cache is never invalidated proactively;
    as a result, if all the cache entries on the Client have become stale, six network round trips are needed to locate the correct HRegion: three to discover that the cached entries are stale, and another three to fetch the new locations.
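The caching behaviour described above can be illustrated with a toy model. This is illustrative Java only, not HBase client code; the class, fields, and the fake lookup are all made up:

```java
import java.util.HashMap;
import java.util.Map;

// Toy model of the client-side region-location cache:
// lookups hit the cache first and fall back to the 3-hop path on a miss.
public class LocationCache {
    private final Map<String, String> cache = new HashMap<>();
    int remoteLookups = 0; // counts the simulated ZK / -ROOT- / .META. hops

    String locate(String rowKey) {
        String server = cache.get(rowKey);
        if (server == null) {
            remoteLookups += 3;                        // worst case: ZK -> -ROOT- -> .META.
            server = "rs-" + (rowKey.hashCode() & 1);  // pretend lookup result
            cache.put(rowKey, server);
        }
        return server;
    }

    public static void main(String[] args) {
        LocationCache c = new LocationCache();
        c.locate("order101");                 // miss: 3 simulated hops
        c.locate("order101");                 // hit: served from cache, no hops
        System.out.println(c.remoteLookups);  // 3
    }
}
```

The model also makes the stale-cache cost visible: a cached entry pointing at a moved region would first fail (three hops to discover this) and then require a fresh three-hop lookup, giving the six round trips mentioned above.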

HStore
Each HRegion is composed of one or more HStores (at least one). Data access in HBase goes through the HStore: each column family maps to one HStore, so a table with several column families has that many HStores per HRegion. A Store consists of one MemStore and 0 or more StoreFiles. HBase uses the size of a Store to decide whether the HRegion should be split.

MemStore
The MemStore lives in memory and holds the modified data as KeyValues. When the size of a MemStore reaches a threshold (64 MB by default), the MemStore is flushed to a file, i.e. a snapshot is generated. HBase has a dedicated thread responsible for flushing MemStores.
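The flush and split thresholds quoted above are configurable. A sketch of the relevant hbase-site.xml properties (property names from HBase 1.x; the values shown are the old defaults this article quotes — newer releases default to 128 MB for the flush and 10 GB for the split):

```xml
<configuration>
  <property>
    <name>hbase.hregion.memstore.flush.size</name>
    <value>67108864</value> <!-- 64 MB: flush a MemStore to a StoreFile past this size -->
  </property>
  <property>
    <name>hbase.hregion.max.filesize</name>
    <value>268435456</value> <!-- 256 MB: split the HRegion past this size -->
  </property>
</configuration>
```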


StoreFile
  The in-memory data of a MemStore is written out to a file as a StoreFile; at the bottom layer, a StoreFile is saved in the HFile format.

HLog
  HLog (WAL): WAL means write-ahead log, and it is used for disaster recovery. The HLog records all data changes; once a region server goes down, the data can be recovered from the log.

A brief description of the HBase failure-recovery process:
The HMaster discovers that an HRegionServer is unavailable and starts failure recovery. It obtains the failed HRegionServer's HLog, splits out the write records belonging to each HRegion, reassigns those HRegions to other HRegionServers, and restores each HRegion's data there by replaying the split write records to rebuild the MemStore; data that was already persisted can be read directly from HDFS.

2. HBase Java API

First add the Maven dependencies:

<dependency>
    <groupId>org.apache.hbase</groupId>
    <artifactId>hbase-client</artifactId>
    <version>1.2.4</version>
</dependency>
<dependency>
    <groupId>org.apache.hbase</groupId>
    <artifactId>hbase-common</artifactId>
    <version>1.2.4</version>
</dependency>
<dependency>
    <groupId>org.apache.hbase</groupId>
    <artifactId>hbase-protocol</artifactId>
    <version>1.2.4</version>
</dependency>
<dependency>
    <groupId>org.apache.hbase</groupId>
    <artifactId>hbase-server</artifactId>
    <version>1.2.4</version>
</dependency>
<dependency>
    <groupId>junit</groupId>
    <artifactId>junit</artifactId>
    <version>4.12</version>
</dependency>
package com.learn;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.*;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;
import org.junit.After;
import org.junit.Before;
import org.junit.Test;

import java.io.IOException;
import java.util.ArrayList;
import java.util.Iterator;

public class HbaseAPI {
    private Admin admin;
    private Connection connection;

    @Before
    public void doBefore() throws IOException {
        Configuration configuration = HBaseConfiguration.create();
        configuration.set(HConstants.ZOOKEEPER_QUORUM,"192.168.139.156:2181");
        connection = ConnectionFactory.createConnection(configuration);
        admin = connection.getAdmin();
    }

    @Test
    public void testCreateNameSpace() throws IOException {
        NamespaceDescriptor namespaceDescriptor = NamespaceDescriptor.create("gaoj").addConfiguration("author", "gaojian").build();
        admin.createNamespace(namespaceDescriptor);
    }

    @Test
    public void testCreateTable() throws IOException {
        HTableDescriptor hTableDescriptor = new HTableDescriptor(TableName.valueOf("gaoj:t_order"));
        HColumnDescriptor cf1 = new HColumnDescriptor("cf1");
        cf1.setMaxVersions(3);
        HColumnDescriptor cf2 = new HColumnDescriptor("cf2");
        cf2.setTimeToLive(1800);
        hTableDescriptor.addFamily(cf1);
        hTableDescriptor.addFamily(cf2);
        admin.createTable(hTableDescriptor);
    }

    @Test
    public void testInsert() throws IOException {
        Table table = connection.getTable(TableName.valueOf("gaoj:t_order"));
        Put put = new Put(Bytes.toBytes("order101"));
        put.addColumn(Bytes.toBytes("cf1"),Bytes.toBytes("count"),Bytes.toBytes(123));
        table.put(put);
    }

    @Test
    public void testSelect() throws IOException {
        Table table = connection.getTable(TableName.valueOf("gaoj:t_order"));
        Get get = new Get(Bytes.toBytes("order101"));
        Result result = table.get(get);
        // testInsert wrote cf1:count as an int, so read the same column back with Bytes.toInt
        int count = Bytes.toInt(result.getValue(Bytes.toBytes("cf1"), Bytes.toBytes("count")));
        System.out.println(count);
    }

    @Test
    public void testDelete() throws IOException {
        Table table = connection.getTable(TableName.valueOf("gaoj:t_order"));
        Delete delete = new Delete(Bytes.toBytes("order101"));
        ArrayList<Delete> list = new ArrayList<Delete>();
        list.add(delete);
        table.delete(list);
    }

    @Test
    public void testScan() throws IOException {
        Table table = connection.getTable(TableName.valueOf("gaoj:t_order"));
        Scan scan = new Scan();
        ResultScanner scanner = table.getScanner(scan);
        Iterator<Result> iterator = scanner.iterator();
        while (iterator.hasNext()){
            Result next = iterator.next();
            String row = Bytes.toString(next.getRow());
            // Read the cf1:count column written by testInsert; rows without it return null
            byte[] value = next.getValue(Bytes.toBytes("cf1"), Bytes.toBytes("count"));
            System.out.println(row + "*" + (value == null ? null : Bytes.toInt(value)));
        }
    }
    @After
    public void doAfter() throws IOException {
        if(admin != null){
            admin.close();
        }
        if(connection != null){
            connection.close();
        }
    }
}

The API calls above implement basic insert, delete, update, and query operations on data. The HBase API offers much finer control than this; during learning, the help command in the HBase shell is a useful aid.

Origin blog.csdn.net/qq_44962429/article/details/108702353