HBase pre-partitioning and rowkey design principles

1. HBase pre-partitioning

1.1 Why pre-partition?

  • Improves the efficiency of reading and writing data
  • Balances load and prevents data skew
  • Makes it easier for the cluster to schedule regions for disaster recovery
  • Optimizes the number of Map tasks

1.2 How to pre-partition?

Each region maintains a startRow and an endRow key. If the rowkey of newly written data falls within the range a region maintains, that region stores the data.
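The routing rule described above can be sketched in plain Java. This is a toy model and not HBase code: the class name RegionLookupSketch and the labels region-0, region-1, and so on are made up for illustration, and it assumes ASCII rowkeys so that Java string order matches HBase's byte order.

```java
import java.util.TreeMap;

/** Toy model of routing a rowkey to a pre-split region (illustrative, not HBase code). */
class RegionLookupSketch {
    // Maps each region's start key to a label; "" is the start key of the first region.
    private final TreeMap<String, String> regionsByStartKey = new TreeMap<>();

    /** splitKeys are assumed to be passed in ascending order. */
    RegionLookupSketch(String... splitKeys) {
        regionsByStartKey.put("", "region-0");
        for (int i = 0; i < splitKeys.length; i++) {
            regionsByStartKey.put(splitKeys[i], "region-" + (i + 1));
        }
    }

    /** Returns the region whose [startKey, endKey) range contains the rowkey. */
    String regionFor(String rowKey) {
        // floorEntry finds the greatest start key that is <= rowKey.
        return regionsByStartKey.floorEntry(rowKey).getValue();
    }

    public static void main(String[] args) {
        RegionLookupSketch regions = new RegionLookupSketch("1000", "2000", "3000", "4000");
        for (String key : new String[] {"0999", "1000", "15", "2500", "9"}) {
            System.out.println(key + " -> " + regions.regionFor(key));
        }
    }
}
```

Note that the comparison is lexicographic, not numeric: in this model the key "15" sorts after "1000" and before "2000", so it lands in the second region.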

1.3 How to set up pre-partitioning?

1.3.1 Manually specify the pre-partitions
hbase(main):001:0> create 'staff','info','partition1',SPLITS => ['1000','2000','3000','4000']

1.3.2 Generate pre-partitions with the hexadecimal split algorithm
hbase(main):003:0> create 'staff2','info','partition2',{NUMREGIONS => 15, SPLITALGO => 'HexStringSplit'}
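To get a feel for what a hexadecimal split produces, the sketch below computes evenly spaced 8-digit hex split points. It is an approximation written for illustration and not HBase's actual HexStringSplit implementation. The class name HexSplitSketch and the choice of the range 00000000 to ffffffff are assumptions, and boundary rounding may differ from what HBase generates.

```java
import java.util.ArrayList;
import java.util.List;

/** Approximates evenly spaced hex split points (illustrative, not HBase's HexStringSplit). */
class HexSplitSketch {
    /** Returns numRegions - 1 split points as 8-digit lowercase hex strings. */
    static List<String> splitPoints(int numRegions) {
        long max = 0xFFFFFFFFL; // assumed upper bound of the key space
        List<String> points = new ArrayList<>();
        for (int i = 1; i < numRegions; i++) {
            long point = max * i / numRegions;
            points.add(String.format("%08x", point));
        }
        return points;
    }

    public static void main(String[] args) {
        // 15 regions need 14 split points.
        System.out.println(splitPoints(15));
    }
}
```

With 15 regions this sketch yields 14 split points, starting at 11111111 and ending at eeeeeeee, so rowkeys that begin with uniformly distributed hex characters spread evenly over the regions.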

1.3.3 Define the split rules in a file

Create a splits.txt file:

cd /export/servers/
vim splits.txt

Add the following contents:

aaaa
bbbb
cccc
dddd

Then execute:

hbase(main):004:0> create 'staff3','partition2',SPLITS_FILE => '/export/servers/splits.txt'

1.3.4 Create pre-partitions with the Java API

The Java code is as follows:

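import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.junit.Test;

// The class name is illustrative; the original shows only the test method.
public class HBaseSplitTest {

    /**
     * Create an HBase table with pre-partitions through the Java API.
     */
    @Test
    public void hbaseSplit() throws IOException {
        // Get a connection
        Configuration configuration = HBaseConfiguration.create();
        configuration.set("hbase.zookeeper.quorum", "node01:2181,node02:2181,node03:2181");
        Connection connection = ConnectionFactory.createConnection(configuration);
        Admin admin = connection.getAdmin();
        // Split keys held in a two-dimensional byte array, one inner array per split point
        // (a custom algorithm could generate a series of hash values here instead)
        byte[][] splitKeys = {{1,2,3,4,5},{'a','b','c','d','e'}};

        // HTableDescriptor holds the table settings, including the table name and column families
        // (HTableDescriptor and HColumnDescriptor are deprecated in HBase 2.x)
        HTableDescriptor hTableDescriptor = new HTableDescriptor(TableName.valueOf("stuff4"));
        // Add column family f1
        hTableDescriptor.addFamily(new HColumnDescriptor("f1"));
        // Add column family f2
        hTableDescriptor.addFamily(new HColumnDescriptor("f2"));
        // Create the table, pre-split at the given keys
        admin.createTable(hTableDescriptor, splitKeys);
        admin.close();
        connection.close();
    }
}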

2. HBase rowkey design tips

HBase stores data in three-dimensional sorted order. Data can be located quickly through three dimensions: the rowkey (row key), the column key (column family and qualifier), and the timestamp.
A rowkey uniquely identifies a row in HBase. HBase can be queried in the following ways:
1. get: fetch the single record with the specified rowkey
2. scan: match a range by setting the startRow and stopRow parameters
3. Full table scan: scan all rows of the entire table directly
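The three access paths can be modeled with a sorted map, which mirrors how HBase keeps rows ordered by rowkey. This is a conceptual sketch using only the JDK, not the HBase client API: the class SortedTableSketch and its method names are invented for illustration. The range scan treats startRow as inclusive and stopRow as exclusive, which is the default behavior of an HBase scan.

```java
import java.util.SortedMap;
import java.util.TreeMap;

/** Models a table as a map sorted by rowkey (illustrative, not the HBase client API). */
class SortedTableSketch {
    private final TreeMap<String, String> rows = new TreeMap<>();

    void put(String rowKey, String value) {
        rows.put(rowKey, value);
    }

    /** 1. get: exact lookup by rowkey. */
    String get(String rowKey) {
        return rows.get(rowKey);
    }

    /** 2. scan: rows in [startRow, stopRow), startRow inclusive and stopRow exclusive. */
    SortedMap<String, String> scan(String startRow, String stopRow) {
        return rows.subMap(startRow, stopRow);
    }

    /** 3. Full table scan: every row, in rowkey order. */
    SortedMap<String, String> scanAll() {
        return rows;
    }

    public static void main(String[] args) {
        SortedTableSketch table = new SortedTableSketch();
        table.put("user_001", "a");
        table.put("user_002", "b");
        table.put("user_003", "c");
        table.put("order_001", "d");
        System.out.println(table.get("user_002"));
        System.out.println(table.scan("user_001", "user_003").keySet());
        System.out.println(table.scanAll().keySet());
    }
}
```

Because rows are sorted, a range scan only touches the contiguous slice between the two keys, which is why the rowkey layout matters so much for read performance.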

2.1 rowkey length principle

A rowkey is a binary stream and can be any string, with a maximum length of 64 KB. In practice it is generally 10 to 100 bytes, stored as a byte[], and usually designed with a fixed length. Shorter is better, preferably no more than 16 bytes, for the following reasons:

  • HFile, the persistent data file, stores data as KeyValues. If the rowkey is too long, for example 100 bytes, then for 10 million rows the rowkeys alone occupy 100 × 10,000,000 = 1,000,000,000 bytes, roughly 1 GB, which greatly hurts HFile storage efficiency;
  • MemStore caches part of the data in memory. If the rowkey field is too long, effective memory utilization drops and the system can cache less data, which reduces retrieval efficiency.
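The storage arithmetic above is easy to check. The sketch below compares a 100-byte rowkey with a 16-byte one for 10 million cells. The class and method names are illustrative, and it assumes one cell per row; since each KeyValue carries its own copy of the rowkey, rows with several columns multiply the cost accordingly.

```java
/** Back-of-the-envelope rowkey storage cost (illustrative helper, not part of HBase). */
class RowKeyOverheadSketch {
    /** Bytes spent on rowkeys alone when every stored cell repeats its rowkey. */
    static long rowKeyBytes(int rowKeyLengthBytes, long cellCount) {
        return (long) rowKeyLengthBytes * cellCount;
    }

    public static void main(String[] args) {
        long cells = 10_000_000L; // 10 million rows, one cell per row assumed
        System.out.println("100-byte rowkey: " + rowKeyBytes(100, cells) + " bytes");
        System.out.println("16-byte rowkey:  " + rowKeyBytes(16, cells) + " bytes");
    }
}
```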

2.2 rowkey hash principles

If the rowkey increases with a timestamp, do not put the time at the front of the binary key. Instead, use the high-order part of the rowkey as a hash field generated randomly by the program, and put the time field in the low-order part. This raises the chance that data is evenly distributed across the RegionServers and that load is balanced. Without a hash field, if the first field is the time itself, all data concentrates on one RegionServer. Load during retrieval then falls on individual RegionServers, which creates hot spots and lowers query efficiency.

2.3 Rowkey uniqueness principle

Uniqueness must be guaranteed by design. Rowkeys are stored sorted in lexicographic order, so when designing a rowkey, take full advantage of this ordering: store data that is often read together in one place, and keep data that is likely to be accessed soon together.

2.4 What is a hot spot

HBase rows are sorted lexicographically by rowkey. This design optimizes scan operations, because related rows, and rows that are read together, are stored near each other. However, poor rowkey design is the source of hot spots.
A hot spot occurs when a large number of clients directly access one node, or a very small number of nodes, of the cluster (the access may be reads, writes, or other operations). The heavy traffic can push the single machine hosting the hot region beyond its capacity, degrading performance or even making the region unavailable. It also affects other regions on the same RegionServer, because the host can no longer serve their requests.
A well-designed data access pattern keeps the cluster fully and evenly utilized. To avoid write hot spots, design rowkeys so that rows that truly belong together stay in the same region, while in the larger picture data is written to multiple regions of the cluster rather than one. Here are some common ways to avoid hot spots, with their advantages and disadvantages:

2.4.1 Salting

The salt here is not the salt of cryptography. It means adding a random number in front of the rowkey, specifically assigning the rowkey a random prefix so that it begins differently than before. The number of distinct prefixes should match the number of regions you want to spread the data across. After salting, rowkeys are scattered to different regions according to their randomly generated prefixes, which avoids hot spots.
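A minimal salting sketch in plain Java follows. The two-digit prefix format, the underscore separator, and the names SaltedKeySketch, salt, and candidateKeys are assumptions made for illustration, and the bucket count is assumed to be between 1 and 100 so that the prefix stays two digits wide.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ThreadLocalRandom;

/** Random-prefix salting (illustrative; prefix format and names are assumptions). */
class SaltedKeySketch {
    /** Prepends a random two-digit bucket prefix; buckets is assumed to be 1..100. */
    static String salt(String rowKey, int buckets) {
        int bucket = ThreadLocalRandom.current().nextInt(buckets);
        return String.format("%02d_%s", bucket, rowKey);
    }

    /** A reader cannot recompute a random prefix, so it has to try every bucket. */
    static List<String> candidateKeys(String rowKey, int buckets) {
        List<String> keys = new ArrayList<>();
        for (int bucket = 0; bucket < buckets; bucket++) {
            keys.add(String.format("%02d_%s", bucket, rowKey));
        }
        return keys;
    }

    public static void main(String[] args) {
        // The timestamp-like rowkey is a made-up example value.
        System.out.println(salt("20200324120000", 4));
        System.out.println(candidateKeys("20200324120000", 4));
    }
}
```

The trade-off shows up in candidateKeys: writes are spread evenly, but fetching one logical row costs one lookup per bucket because the prefix was chosen at random.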

2.4.2 Hash

Hashing always salts the same row with the same prefix. It also spreads the load across the cluster, but keeps reads predictable. With a deterministic hash, a client can reconstruct the complete rowkey and use a get operation to fetch exactly one row.
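One possible way to derive such a deterministic prefix is sketched below, using MD5 from the JDK. The choice of MD5, the 4-hex-digit prefix length, the underscore separator, and the name HashedKeySketch are all assumptions for illustration; any stable hash would serve the same purpose.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Deterministic hash prefix (illustrative; MD5 and the prefix length are assumptions). */
class HashedKeySketch {
    /** Prefixes the rowkey with the first 4 hex digits of its MD5 digest. */
    static String hashPrefixed(String rowKey) {
        try {
            byte[] digest = MessageDigest.getInstance("MD5")
                    .digest(rowKey.getBytes(StandardCharsets.UTF_8));
            String prefix = String.format("%02x%02x", digest[0] & 0xff, digest[1] & 0xff);
            return prefix + "_" + rowKey;
        } catch (NoSuchAlgorithmException e) {
            // MD5 is one of the algorithms every Java platform must provide.
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    public static void main(String[] args) {
        // The same input always maps to the same full rowkey, so a get can target it directly.
        String first = hashPrefixed("user_10001");
        String second = hashPrefixed("user_10001");
        System.out.println(first + " equals " + second + ": " + first.equals(second));
    }
}
```

Unlike random salting, the reader recomputes the prefix from the original key, so a single get is enough.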

2.4.3 Reversing

A third way to prevent hot spots is to reverse a fixed-length or numeric rowkey, so that the part that changes most often (the least significant part) comes first. This effectively randomizes rowkeys, but at the expense of rowkey ordering.
For example, when a phone number is used as the rowkey, the reversed phone number string can be used instead. This avoids the hot spots caused by phone numbers that start with a relatively fixed prefix.
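Reversing is a one-liner in Java. The sketch below uses a made-up phone number, and the class name ReversedKeySketch is illustrative.

```java
/** Reverses a fixed-length rowkey so its fastest-changing digits come first (illustrative). */
class ReversedKeySketch {
    static String reverse(String rowKey) {
        return new StringBuilder(rowKey).reverse().toString();
    }

    public static void main(String[] args) {
        String phone = "13812345678"; // made-up example number
        String rowKey = reverse(phone);
        System.out.println(rowKey);
        // Reversing again restores the original value when reading.
        System.out.println(reverse(rowKey));
    }
}
```

Numbers that share the prefix 138 no longer sort next to each other after reversing, because their trailing digits, which vary the most, now lead the key.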

Example: reversed timestamp
A common data processing problem is quick access to the latest version of the data. Using a reversed timestamp as part of the rowkey is very useful here: append Long.MAX_VALUE - timestamp to the end of the key, for example [key][reverse_timestamp]. The latest value of [key] can then be obtained by scanning [key] and taking the first record, because HBase rowkeys are sorted and the first record is the most recently written one.
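The [key][reverse_timestamp] layout can be sketched as follows. The string encoding is an assumption made so the example runs without HBase: the reversed value is zero-padded to 19 digits so that lexicographic order matches numeric order, and an underscore separates the two parts. Timestamps are assumed to be non-negative milliseconds, and the class name is illustrative.

```java
import java.util.TreeSet;

/** Builds [key]_[Long.MAX_VALUE - timestamp] rowkeys (illustrative string encoding). */
class ReverseTimestampSketch {
    static String rowKey(String key, long timestampMillis) {
        long reversed = Long.MAX_VALUE - timestampMillis;
        // Zero-pad to 19 digits (the width of Long.MAX_VALUE) so string order equals numeric order.
        return key + "_" + String.format("%019d", reversed);
    }

    public static void main(String[] args) {
        // A TreeSet stands in for HBase's sorted rowkey order.
        TreeSet<String> sortedKeys = new TreeSet<>();
        sortedKeys.add(rowKey("sensor1", 1_000L));
        sortedKeys.add(rowKey("sensor1", 3_000L));
        sortedKeys.add(rowKey("sensor1", 2_000L));
        // The first entry in sorted order is the one with the newest timestamp.
        System.out.println(sortedKeys.first().equals(rowKey("sensor1", 3_000L)));
    }
}
```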
Some other recommendations:

  • Keep rowkeys and column family names as small as possible. In HBase a value always travels with its key: whenever a value is transferred between systems, its rowkey, column name, and timestamp are transferred with it. If your rowkeys and column names are large, they take up a lot of storage space.
  • Keep column family names as short as possible, preferably a single character.
  • Long attribute names are more readable, but shorter attribute names are stored more efficiently in HBase.

Origin blog.csdn.net/Mr_Yang888/article/details/105056650