HBase pre-partitioning & Phoenix salting

HBase hotspot issues

When an HBase table is first created, it has only one Region by default, managed by a single Region Server. When the amount of data reaches a certain threshold, a split is triggered, and the table keeps splitting into more Regions, which are managed by different Region Servers. Each Region manages a contiguous range of row keys, bounded by a start row key and an end row key. This leads to two problems:

  1. The advantages of distributed concurrent processing cannot be fully exploited: you must wait for the Region to split automatically into multiple Regions, and this process may take a long time
  2. Since each Region manages a contiguous range of row keys, if reads and writes are not random enough, for example when the row key contains a self-incrementing ID, a large number of operations may concentrate on a narrow range of row keys, putting all the pressure on the same Region

Region split strategy

The split policy is defined in the hbase-site.xml file:

<property>
  <name>hbase.regionserver.region.split.policy</name>
  <value>org.apache.hadoop.hbase.regionserver.IncreasingToUpperBoundRegionSplitPolicy</value>
  <description>
    A split policy determines when a region should be split. The various other split policies that
    are available currently are ConstantSizeRegionSplitPolicy, DisabledRegionSplitPolicy,
    DelimitedKeyPrefixRegionSplitPolicy, KeyPrefixRegionSplitPolicy, etc.
  </description>
</property>
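
Besides this cluster-wide setting, the split policy can also be overridden for an individual table. A minimal sketch using the HBase 2.x Java client (the table name, column family, and the policy chosen here are only placeholders):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;

public class SplitPolicyExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Admin admin = conn.getAdmin()) {
            // The per-table policy overrides hbase.regionserver.region.split.policy
            TableDescriptor desc = TableDescriptorBuilder
                .newBuilder(TableName.valueOf("tablename"))
                .setColumnFamily(ColumnFamilyDescriptorBuilder.of("f1"))
                .setRegionSplitPolicyClassName(
                    "org.apache.hadoop.hbase.regionserver.ConstantSizeRegionSplitPolicy")
                .build();
            admin.createTable(desc);
        }
    }
}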

The default policy is IncreasingToUpperBoundRegionSplitPolicy. In HBase 1.2, this policy splits a Region when its size reaches the cube of the current number of Regions, multiplied by hbase.hregion.memstore.flush.size (128 MB by default), multiplied by 2, or when it reaches hbase.hregion.max.filesize (10 GB by default), whichever is smaller:

Size that triggers the 1st split: 1^3 * 128MB * 2 = 256MB
Size that triggers the 2nd split: 2^3 * 128MB * 2 = 2048MB
Size that triggers the 3rd split: 3^3 * 128MB * 2 = 6912MB
Size that triggers the 4th split: 4^3 * 128MB * 2 = 16384MB, which exceeds 10GB, so 10GB is used instead
All subsequent splits are triggered at 10GB

It can be seen that when more nodes are available, it may take a long time before all of them are fully utilized.
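
The thresholds above follow the formula min(regionCount^3 * 2 * flushSize, maxFileSize); a small sketch that reproduces the numbers:

public class SplitSizeCalc {
    public static void main(String[] args) {
        long flushSize = 128L * 1024 * 1024;          // hbase.hregion.memstore.flush.size
        long maxFileSize = 10L * 1024 * 1024 * 1024;  // hbase.hregion.max.filesize
        for (int regions = 1; regions <= 5; regions++) {
            // The threshold grows with the cube of the region count, capped at the max file size
            long threshold = Math.min((long) Math.pow(regions, 3) * 2 * flushSize, maxFileSize);
            System.out.printf("regions=%d -> split at %d MB%n", regions, threshold / (1024 * 1024));
        }
    }
}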

Pre-partitioning

The first pre-partitioning method

hbase org.apache.hadoop.hbase.util.RegionSplitter tablename HexStringSplit -c 10 -f f1:f2:f3

The above command creates a table named tablename, pre-split into 10 Regions, with three column families: f1, f2, and f3. The pre-split algorithm is HexStringSplit; UniformSplit can also be chosen. HexStringSplit is suitable when the row key prefix is a hexadecimal string, while UniformSplit is suitable when the row key prefix is completely random. After pre-partitioning, even consecutive row keys are spread across different Regions by the algorithm, giving a uniform distribution and avoiding hotspots
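
The same pre-split layout can also be produced from the Java client by asking the HexStringSplit algorithm for the split keys and passing them to createTable. A minimal sketch, assuming the HBase 2.x client API:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.RegionSplitter;

public class PreSplitHexExample {
    public static void main(String[] args) throws Exception {
        // 10 regions -> 9 split keys, generated by the same HexStringSplit algorithm
        byte[][] splits = new RegionSplitter.HexStringSplit().split(10);
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Admin admin = conn.getAdmin()) {
            TableDescriptor desc = TableDescriptorBuilder
                .newBuilder(TableName.valueOf("tablename"))
                .setColumnFamily(ColumnFamilyDescriptorBuilder.of("f1"))
                .setColumnFamily(ColumnFamilyDescriptorBuilder.of("f2"))
                .setColumnFamily(ColumnFamilyDescriptorBuilder.of("f3"))
                .build();
            admin.createTable(desc, splits);
        }
    }
}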


The second pre-partitioning method

hbase shell > create 'tablename', 'f1', SPLITS=> ['10', '20', '30', '40']

When the distribution of row keys is known in advance, you can specify the split points of the pre-partitioned Regions yourself. The table created by the command above has 5 Regions:

Region 1 : row keys whose first two characters fall in min ~ 10
Region 2 : row keys whose first two characters fall in 10 ~ 20
Region 3 : row keys whose first two characters fall in 20 ~ 30
Region 4 : row keys whose first two characters fall in 30 ~ 40
Region 5 : row keys whose first two characters fall in 40 ~ max

Note that the split points are compared as strings, not just as digits; for example, a row key starting with 1a falls into Region 2
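
The same table can be created from the Java client when the split points are known; a minimal sketch equivalent to the shell command above:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class PreSplitKeysExample {
    public static void main(String[] args) throws Exception {
        // Explicit split points -> 5 regions, same as the shell example
        byte[][] splits = {
            Bytes.toBytes("10"), Bytes.toBytes("20"),
            Bytes.toBytes("30"), Bytes.toBytes("40")
        };
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Admin admin = conn.getAdmin()) {
            TableDescriptor desc = TableDescriptorBuilder
                .newBuilder(TableName.valueOf("tablename"))
                .setColumnFamily(ColumnFamilyDescriptorBuilder.of("f1"))
                .build();
            admin.createTable(desc, splits);
        }
    }
}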


You can also force a split on an existing table:

hbase shell > split 'table', 'split point'
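
The same forced split can be issued through the Java Admin API; a short sketch (the table name and split point are placeholders):

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class ForceSplitExample {
    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Admin admin = conn.getAdmin()) {
            // Split the region containing row key "25" at that point
            admin.split(TableName.valueOf("tablename"), Bytes.toBytes("25"));
        }
    }
}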


You can also design your own split method

Phoenix salting

CREATE TABLE IF NOT EXISTS Product (
    id           VARCHAR not null,
    time         VARCHAR not null,
    price        FLOAT,
    sale         INTEGER,
    inventory    INTEGER,

    CONSTRAINT pk PRIMARY KEY (id, time)
) COMPRESSION = 'GZ', SALT_BUCKETS = 6

Essentially, Phoenix hashes the row key of the underlying HBase table, takes the result modulo SALT_BUCKETS, and prepends that value (0 ~ 5 in the example above) as the first byte of the row key, so data is distributed into different Regions according to this value. Because the salt is stored as a single byte, the maximum value SALT_BUCKETS can take is 256. Rows with the same salt byte end up on the same Region Server, so SALT_BUCKETS is usually set to the number of Region Servers
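
A rough sketch of the idea (the hash below is only an illustration, not Phoenix's exact salting function): hash the row key, take it modulo SALT_BUCKETS, and use the result as the leading byte.

import java.nio.charset.StandardCharsets;

public class SaltSketch {
    static final int SALT_BUCKETS = 6;

    // Illustrative hash-mod-buckets salt; Phoenix uses its own internal hash
    static byte saltByte(String rowKey) {
        int hash = 1;
        for (byte b : rowKey.getBytes(StandardCharsets.UTF_8)) {
            hash = 31 * hash + b;
        }
        return (byte) Math.floorMod(hash, SALT_BUCKETS);
    }

    public static void main(String[] args) {
        for (String id : new String[]{"p-1001", "p-1002", "p-1003"}) {
            System.out.println(id + " -> salt bucket " + saltByte(id));
        }
    }
}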


Because the salt adds one extra byte at the front of the row key, by default the data returned from different Region Servers is not merged back into the original row key order. If you need to guarantee ordering, you have to change a configuration:

phoenix.query.force.rowkeyorder = true
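
This setting normally goes into the client-side hbase-site.xml; depending on the Phoenix version it may also be accepted as a connection property by the Phoenix JDBC driver. A minimal sketch querying the salted table above (the ZooKeeper host is a placeholder):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;
import java.util.Properties;

public class SaltedQueryExample {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        // Ask Phoenix to merge-sort results from the salt buckets back into row key order
        props.setProperty("phoenix.query.force.rowkeyorder", "true");
        try (Connection conn = DriverManager.getConnection("jdbc:phoenix:zk-host", props);
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery("SELECT id, time, price FROM Product WHERE id = 'p-1001'")) {
            while (rs.next()) {
                System.out.println(rs.getString("id") + " " + rs.getString("time") + " " + rs.getFloat("price"));
            }
        }
    }
}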

