How to pre-split regions when creating an HBase table for an expected data volume

Recently I imported more than 10 TB of data into an HBase table that relied on automatic region splitting, and found that the table occupied only 55 regions, which was hard to believe. The store files of each region should be capped at 10 GB (the value set by hbase.hregion.max.filesize), and the table uses "SNAPPY" compression, which gives a compression ratio of about 60%. By that calculation (10 TB x 60% / 10 GB per region), the table should have been split into at least 600 regions. In reality, checking the file size of each region in the table showed that some regions held as much as 600 GB of data, which is seriously abnormal.
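The estimate above can be checked with a few lines of arithmetic. This is only a sketch: the function name expected_regions is made up for illustration, and it assumes that "about 60%" means the compressed data takes roughly 60% of the raw size and that 1 TB = 1024 GB.

```python
import math


def expected_regions(raw_tb, compressed_percent, max_region_gb):
    """Estimate how many regions a table needs if every region is filled
    up to the configured maximum store file size.

    Assumptions (not from HBase itself): compressed_percent is the on-disk
    size as a percentage of the raw size, and 1 TB = 1024 GB.
    """
    compressed_gb = raw_tb * 1024 * compressed_percent / 100
    return math.ceil(compressed_gb / max_region_gb)


# 10 TB of raw data, ~60% after SNAPPY, 10 GB per region
print(expected_regions(10, 60, 10))
```

With these inputs the result is a little over 600, which matches the "at least 600 regions" figure and is far from the 55 regions actually observed.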


After several days of investigation (including the maximum number of regions that each RegionServer can host) and various tests and verifications in the test-environment cluster, I created the table with a specified set of split points. The data was then imported successfully into the pre-split regions, and the data distribution was very balanced. (How the split points were determined: divide the full set of rowkeys, in sorted order, into 600 parts, then take the last rowkey of each part as a split point.)
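The way the split points were chosen can be sketched as follows. The function name pick_split_points is hypothetical, and the sketch assumes the rowkeys are already sorted, contain no duplicates, and fit in memory; for a data set of this size the same idea would more likely be applied to a sample or done in a batch job.

```python
def pick_split_points(sorted_rowkeys, num_regions):
    """Cut a sorted list of unique rowkeys into num_regions equal parts and
    return the last rowkey of every part except the final one.

    num_regions parts need num_regions - 1 split points.
    """
    total = len(sorted_rowkeys)
    if num_regions < 1 or num_regions > total:
        raise ValueError("num_regions must be between 1 and the number of rowkeys")
    split_points = []
    for part in range(1, num_regions):
        end = part * total // num_regions  # exclusive end index of this part
        split_points.append(sorted_rowkeys[end - 1])
    return split_points


# Toy data: 1200 zero-padded keys cut into 600 parts gives 599 split points
keys = ["%04d" % i for i in range(1200)]
points = pick_split_points(keys, 600)
print(len(points), points[:3])
```

Note that HBase treats a split key as the first key of the following region, so the row used as a split point itself lands in the next region; with parts of this size the effect on balance is negligible.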

The investigation mainly relied on the meta information of a table whose data had previously been imported successfully.
To view the meta information of a table, run: echo "scan 'hbase:meta'" | hbase shell | grep 'hbase_table_name' , and then observe the startKey and endKey of each region.


According to the HBase table-creation manual, there are two ways to pre-split regions when creating an HBase table:

1. If the entire data set to be imported is known, and the distribution of all its rowkeys is also known, pre-split by the startKey and endKey of each region. This approach can fully guarantee a balanced distribution of the data stored in each region. There are two ways to create the table with this approach. If there are only a few split points, you can specify them directly in the create statement.
1.1, for example:
create 'card_active_quota', {NAME =>'n',VERSIONS => 1, COMPRESSION => 'SNAPPY'}, {SPLITS => ['10', '20', '30', '40']}
With the split points above, the first region has startKey => '' and endKey => '10', the second region has startKey => '10' and endKey => '20', and so on; the fifth region has startKey => '40' and endKey => ''. The rule is that the first region has no startKey and the last region has no endKey.
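The mapping from split points to region boundaries can be written out as a small sketch. The function name region_boundaries is made up for illustration; an empty string stands for "no startKey" or "no endKey".

```python
def region_boundaries(split_points):
    """Return the (startKey, endKey) pair of every region produced by the
    given sorted split points. N split points yield N + 1 regions."""
    starts = [""] + list(split_points)  # first region has no startKey
    ends = list(split_points) + [""]    # last region has no endKey
    return list(zip(starts, ends))


for start, end in region_boundaries(["10", "20", "30", "40"]):
    print("startKey => %r, endKey => %r" % (start, end))
```

For the four split points in example 1.1 this lists five regions, from ('', '10') through ('40', '').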

If there are many split points, write them to a split file; the number of pre-split regions is then determined by the number of lines in the file (one more region than there are lines). For example, our calculation this time called for 599 split points, so we wrote the split points into a file (each line in the file represents one split point) and referenced that file directly when creating the table.
1.2, for example:
create 'card_active_quota', {NAME => 'n', VERSIONS => 1, COMPRESSION => 'SNAPPY'}, {SPLITS_FILE => '/home/part/splits.txt'}

Note: format of the splits.txt content.
Each line is treated as one split point:
400258AD77AD659C7D9B8BB2D718488A016D9074DD39F2AC391AB573C2908017
7FF9001D147A700B73BDD18378C62C47C8D22680718503A7F6E078186086029A
BFEDF91AFE392EDF60CE378C8D2E5CAFDB8D6F0B249CC9A8AF4962788B1D8108
......


Note: If you start the hbase shell from the directory containing splits.txt, you can use SPLITS_FILE => 'splits.txt' in the create statement; you can also specify the absolute path of the local file.
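A split file in this format can be produced with a short script. This is a minimal sketch: the function name write_splits_file is hypothetical, and it assumes the split points are plain ASCII strings, so that Python's string ordering matches byte ordering.

```python
def write_splits_file(path, split_points):
    """Write one split point per line and return the number of regions
    the resulting table would be pre-split into.

    Raises ValueError if the split points are not strictly increasing,
    which also rules out duplicates.
    """
    for previous, current in zip(split_points, split_points[1:]):
        if not previous < current:
            raise ValueError("split points must be strictly increasing: %r, %r"
                             % (previous, current))
    with open(path, "w", encoding="ascii") as handle:
        for point in split_points:
            handle.write(point + "\n")
    return len(split_points) + 1


if __name__ == "__main__":
    regions = write_splits_file("splits.txt", ["10", "20", "30", "40"])
    print("splits.txt written, table would have", regions, "regions")
```

Checking the order before writing is a design choice of this sketch: it catches unsorted or repeated split points before the file is handed to the create statement.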

2. If the rowkeys of the imported data set follow a regular pattern, the table can be pre-split with one of HBase's split algorithms. Two are currently provided by the system, HexStringSplit and UniformSplit; a third option is to pre-split with a custom implementation class. If the data to be stored uses hexadecimal rowkeys, use "HexStringSplit" as the pre-split algorithm.
Create table statement:
create 'card_active_quota', {NAME => 'n', VERSIONS => 1, COMPRESSION => 'SNAPPY'}, {NUMREGIONS => 3, SPLITALGO => 'HexStringSplit'}
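As a rough illustration of what an evenly spaced hexadecimal pre-split looks like, the sketch below divides an 8-hex-digit key space into equal ranges. The function name hex_split_points and the 8-digit width are assumptions for illustration; this is not HBase's own code, and the split keys HBase actually generates may differ in detail (for example in letter case or key width), so it is worth checking the real region boundaries in hbase:meta after creating the table.

```python
def hex_split_points(num_regions, width=8):
    """Divide the key space 00...0 to FF...F (width hex digits) into
    num_regions equal ranges and return the num_regions - 1 boundaries
    as zero-padded lowercase hex strings."""
    if num_regions < 1:
        raise ValueError("num_regions must be at least 1")
    key_space = 16 ** width
    step = key_space // num_regions
    return [format(step * i, "0%dx" % width) for i in range(1, num_regions)]


# Three regions need two split points
print(hex_split_points(3))
```

One thing to watch for with any hex-based pre-split: uppercase and lowercase hex digits sort differently as bytes, so the case of the split keys should match the case of the rowkeys being stored.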

