Explaining the Core Knowledge of Big Data HBase in Plain Language, Lao Liu Is Really Attentive (3)


Lao Liu is currently working hard toward next year's campus recruitment. The main purpose of these articles is to explain, in plain language, the big data knowledge points he has been reviewing; he refuses to mechanically repeat the study materials and insists on forming his own understanding!

01 HBase knowledge points (3)

Point 13: Hot issues of HBase tables

What is a hot spot problem?

When we retrieve data from HBase, we first locate the data row through its rowkey. The problem is that, depending on how the rowkey is designed, a table's data may end up distributed on only one or a few nodes of the HBase cluster.

When a large number of clients then access that data, a small number of RegionServers receive most of the read and write requests and become overloaded, while the other RegionServers carry very little load. This is a hot spot.

What exactly causes hot spot problems?

① Data in HBase is sorted lexicographically by rowkey. When a large number of consecutive rowkeys are written, they all land in a few individual regions, so data is not balanced across regions.

② The table was created without pre-splitting. A newly created table has only one region by default, so all the data is written to that single region.

③ The table was pre-split in advance, but the rowkey design follows no rules and does not include a hashed (salted) field.
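The lexicographic sorting in ① is easy to see with a small self-contained Java demo (plain string sorting stands in here for HBase's byte-wise rowkey comparison):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class LexicographicOrderDemo {
    public static void main(String[] args) {
        // Sequential rowkeys sort next to each other lexicographically,
        // so a burst of sequential writes lands in the same region.
        List<String> rowkeys = new ArrayList<>(Arrays.asList("user_0003", "user_0001", "user_0002"));
        rowkeys.sort(String::compareTo);
        System.out.println(rowkeys); // [user_0001, user_0002, user_0003]

        // Note: lexicographic order is not numeric order.
        List<String> mixed = new ArrayList<>(Arrays.asList("row2", "row10", "row1"));
        mixed.sort(String::compareTo);
        System.out.println(mixed); // [row1, row10, row2]
    }
}
```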

Solutions to hot spot problems

Pre-splitting (pre-partitioning)

The purpose of pre-splitting is to distribute table data evenly across the cluster, instead of having just one region on one node by default.
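As a minimal sketch (an assumption for illustration, not from the original article): if rowkeys start with a digit 0-9, evenly spaced split keys "1" through "9" would pre-split the table into 10 regions. In real code these keys would be passed to `Admin.createTable(tableDescriptor, splitKeys)`.

```java
public class SplitKeyDemo {
    // Generate 9 split keys "1".."9"; together with the implicit first
    // and last regions this yields 10 regions for digit-prefixed rowkeys.
    public static byte[][] digitSplitKeys() {
        byte[][] splits = new byte[9][];
        for (int i = 1; i <= 9; i++) {
            splits[i - 1] = String.valueOf(i).getBytes();
        }
        return splits;
    }

    public static void main(String[] args) {
        byte[][] splits = digitSplitKeys();
        System.out.println(splits.length + " split keys -> " + (splits.length + 1) + " regions");
    }
}
```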

Salting

Specifically, add a random or hashed prefix to the rowkey so that rowkeys that would otherwise share the same beginning are spread across different regions.
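A minimal salting sketch, assuming 10 salt buckets (the bucket count and the `salt` helper are illustrative, not an HBase API):

```java
public class RowkeySaltDemo {
    // Prefix each rowkey with hash(rowkey) % buckets so consecutive
    // keys spread across regions instead of piling into one.
    public static String salt(String rowkey, int buckets) {
        int bucket = Math.abs(rowkey.hashCode()) % buckets;
        return bucket + "_" + rowkey;
    }

    public static void main(String[] args) {
        for (int i = 1; i <= 3; i++) {
            System.out.println(salt("user_000" + i, 10));
        }
    }
}
```

A deterministic hash prefix (rather than a truly random one) is used here so that a reader can recompute the prefix from the original rowkey when querying; with a truly random salt, a read would have to check every bucket.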

Point 14: rowkey design

Point 13 already explained that a badly designed rowkey causes hot spot problems. So how exactly should a rowkey be designed?

There are three principles of rowkey design. This is not something Lao Liu decided; it is how the study materials put it.

The first is the rowkey length principle: keep the rowkey as short as possible, but not so short that many rowkeys end up sharing the same prefix. Because rowkeys are sorted lexicographically, rowkeys with the same prefix are stored together as much as possible, which brings the hot spot problem right back.

The second is the rowkey hashing principle: add a random or hashed prefix to the rowkey so that rowkeys differ from each other as much as possible, which helps avoid hot spot problems.

Finally, the rowkey uniqueness principle: the rowkey must uniquely identify a row.
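The three principles can be sketched together in one hypothetical rowkey builder (the 2-digit bucket, the field order, and the reversed timestamp are all assumptions for illustration, not a standard HBase API):

```java
public class RowkeyDesignDemo {
    // Combines the three principles: a hashed prefix (hashing principle),
    // a short fixed layout (length principle), and a userId component
    // that keeps each row distinct (uniqueness principle).
    public static String buildRowkey(String userId, long timestamp) {
        int bucket = Math.abs(userId.hashCode()) % 100;                  // hashing
        String reversedTs = String.valueOf(Long.MAX_VALUE - timestamp);  // newest rows sort first
        return String.format("%02d_%s_%s", bucket, userId, reversedTs);  // short, fixed layout
    }

    public static void main(String[] args) {
        System.out.println(buildRowkey("u1001", 1600000000000L));
    }
}
```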

Point 15: HBase data backup

There are two kinds of HBase data backup that Lao Liu knows. The first uses the classes provided by HBase to export a table's data to HDFS, and then import it back into an HBase table as needed.

① Export from HBase table to HDFS

hbase org.apache.hadoop.hbase.mapreduce.Export myuser /hbase_data/myuser_bak

② If necessary, create a backup target table in HBase

create 'myuser_bak','f1','f2'

③ Import the data on HDFS into the backup target table

hbase org.apache.hadoop.hbase.mapreduce.Driver import myuser_bak /hbase_data/myuser_bak/

That's all for the first method. The second method uses snapshots to back up tables. Material online says that migrating and copying HBase data through snapshots is the most recommended migration method. Lao Liu doesn't understand this method very well yet, so he will cover it in detail later; it is mentioned here first so that everyone has an impression.

Point 16: HBase secondary index

Why do we need a secondary index?

① In HBase, the only way to locate a row precisely is to query by rowkey. If you do not query by rowkey, you must perform a full table scan. For a large table, the cost of a full table scan is too high, so another approach is needed.

In many cases we need to query data by several different dimensions. For example, to look up a student's information, you might query by name, ID number, or student number, but it is almost impossible to cram all of these into the rowkey (neither the flexibility of the business nor the length of the rowkey allows it).

So a secondary index is needed to accomplish this.

② It also makes HBase queries more efficient. For example, to get sub-second responses when retrieving by a non-rowkey field, you may need to build a secondary index on HBase.
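Conceptually, a secondary index is just a second table keyed by the indexed column, mapping it back to the rowkey of the data table. Here is a toy in-memory sketch of that idea (an assumption for illustration only, not HBase or Phoenix code):

```java
import java.util.HashMap;
import java.util.Map;

public class SecondaryIndexSketch {
    private final Map<String, String> dataTable = new HashMap<>(); // rowkey -> name
    private final Map<String, String> nameIndex = new HashMap<>(); // name -> rowkey

    public void put(String rowkey, String name) {
        dataTable.put(rowkey, name);
        nameIndex.put(name, rowkey); // every write also updates the index
    }

    public String getRowkeyByName(String name) {
        return nameIndex.get(name);  // point lookup instead of a full table scan
    }

    public static void main(String[] args) {
        SecondaryIndexSketch idx = new SecondaryIndexSketch();
        idx.put("001", "zhangsan");
        System.out.println(idx.getRowkeyByName("zhangsan")); // 001
    }
}
```

This also shows why index maintenance has a write cost: every `put` must update two structures, which is exactly the overhead discussed for Phoenix's global index below.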

How do you build a secondary index?

It is recommended to use Phoenix to build secondary indexes. Lao Liu will briefly go over Phoenix's index concepts.

Phoenix provides two kinds of index:

① Global index, suitable for read-heavy, write-light business scenarios.

Using a global index makes writes very expensive, because every update to the data table triggers an update to the index table, and the index table is distributed on different data nodes; the cross-node data transfer brings large performance overhead.

When reading data, Phoenix chooses the index table to reduce query time. By default, if the field you query on is not an index field, the index table is not used, meaning there is no speed-up.

② Local index, suitable for write-heavy scenarios with limited storage space.

Like with a global index, Phoenix automatically decides whether to use the index when executing a query.

With a local index, the index data and the table data are stored on the same server, which avoids the extra overhead of writing index entries to index tables on different servers during write operations.

With a local index, even if the queried field is not an index field, the index table is still used, which speeds up the query. This is a difference from the global index.

Point 17: HBase namespace

The concept of a namespace

In HBase, a namespace is equivalent to a database in a relational database. It is a logical grouping of tables, meaning that tables in the same group serve similar purposes. In other words, namespaces let us easily partition tables by business.

Basic namespace operations

Create a namespace
hbase> create_namespace 'nametest'

Describe a namespace
hbase> describe_namespace 'nametest'

List all namespaces
hbase> list_namespace

Create a table under a namespace
hbase> create 'nametest:testtable', 'fm1'

List the tables under a namespace
hbase> list_namespace_tables 'nametest'

Drop a namespace
hbase> drop_namespace 'nametest'

Point 18: HBase data versions and TTL

In HBase, we can set upper and lower bounds on the number of data versions, which is really just defining how many historical versions of a cell are kept. By customizing the number of versions kept, we can query multiple historical versions of the data.

Lower bound

The default lower bound on versions is 0, which means the feature is disabled. The minimum number of row versions works together with the time-to-live (TTL, Time To Live): it is the number of versions that must be kept even after the TTL has expired them. We can set it to 0 or more as needed; with 0, expired versions can be removed entirely from the cell.

Upper bound

In earlier HBase releases the default upper bound is 3, meaning a cell keeps up to 3 versions; in current releases the default is 1.

TTL of data

In real work we often run into data that is no longer needed after some time. We could delete it with scheduled tasks, or we can use HBase's TTL (Time To Live) feature to let the data be cleared automatically when it expires.

Below, Lao Liu writes code that sets the data version bounds and the TTL:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.Cell;
import org.apache.hadoop.hbase.CellUtil;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseVersionsAndTTL {

    public static void main(String[] args) throws IOException {
        // Build the configuration and point it at the ZooKeeper quorum
        Configuration configuration = HBaseConfiguration.create();
        configuration.set("hbase.zookeeper.quorum", "node01:2181,node02:2181,node03:2181");

        // Create the connection object
        Connection connection = ConnectionFactory.createConnection(configuration);

        // Use Admin to create the table
        Admin admin = connection.getAdmin();

        if (!admin.tableExists(TableName.valueOf("version_hbase"))) {
            // Table descriptor with the table name
            HTableDescriptor version_hbase = new HTableDescriptor(TableName.valueOf("version_hbase"));

            HColumnDescriptor f1 = new HColumnDescriptor("f1");
            // Minimum versions, maximum versions, and TTL
            f1.setMinVersions(3);
            f1.setMaxVersions(5);
            f1.setTimeToLive(30); // 30s

            version_hbase.addFamily(f1);
            admin.createTable(version_hbase);
        }

        // Insert data: six writes to the same cell, producing six versions
        Table version_hbase = connection.getTable(TableName.valueOf("version_hbase"));

        for (int i = 0; i < 6; i++) {
            Put put = new Put("001".getBytes());
            put.addColumn("f1".getBytes(), "name".getBytes(), ("zhangsan" + i).getBytes());
            version_hbase.put(put);
        }

        // Query
        Get get = new Get("001".getBytes());
        // Return all stored versions, not just the latest one
        get.setMaxVersions();

        Result result = version_hbase.get(get);
        Cell[] cells = result.rawCells();

        for (Cell cell : cells) {
            System.out.println(Bytes.toString(CellUtil.cloneValue(cell)));
        }

        // Close the connections
        admin.close();
        version_hbase.close();
        connection.close();
    }
}

02 HBase summary

Well, that more or less wraps up the knowledge points of big data HBase. There is quite a lot of content; you need to digest it carefully and strive to explain these knowledge points in your own words.

Finally, if you think anything is missing or wrong, feel free to contact the official account "Lao Liu who works hard" and discuss! Lao Liu hopes this helps students interested in big data development, and hopes to receive their guidance.


Origin blog.csdn.net/qq_36780184/article/details/110681882