1. The principle of uniqueness

A rowkey must be unique: it identifies a row, so writing to an existing rowkey updates that row rather than creating a new one. Rows are stored sorted by rowkey in lexicographical (byte) order. When designing a rowkey, take advantage of this ordering so that data frequently read together is stored close together, ideally in the same block.
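The effect of lexicographical ordering is easy to overlook, so here is a minimal sketch of it in plain Java (no HBase dependency, Java 9 or later for Arrays.compareUnsigned). The class and method names are made up for illustration, and the comparison only approximates how rowkeys sort by treating each key as unsigned UTF-8 bytes.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: imitates byte-wise lexicographical ordering of rowkeys.
final class RowkeyOrderDemo {

    // Returns a copy of the keys sorted by unsigned comparison of their UTF-8 bytes.
    static List<String> sortLikeRowkeys(List<String> keys) {
        List<String> sorted = new ArrayList<>(keys);
        sorted.sort((a, b) -> Arrays.compareUnsigned(
                a.getBytes(StandardCharsets.UTF_8),
                b.getBytes(StandardCharsets.UTF_8)));
        return sorted;
    }

    // Left-pads a non-negative number with zeros so byte order matches numeric order.
    static String padded(String prefix, long n, int width) {
        return prefix + String.format("%0" + width + "d", n);
    }

    public static void main(String[] args) {
        // Unpadded numbers sort as text: row10 lands before row2.
        System.out.println(sortLikeRowkeys(List.of("row2", "row10", "row1")));
        // Zero-padded numbers sort in numeric order.
        System.out.println(sortLikeRowkeys(List.of(
                padded("row", 2, 3), padded("row", 10, 3), padded("row", 1, 3))));
    }
}
```

Zero-padding numeric fields to a fixed width is one simple way to make the sort order match the order in which the data is usually read.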

2. Length principle

A rowkey is a binary byte stream stored as a byte[]. It can hold any string, and its maximum length is commonly cited as 64 KB. In practice it is usually 10 to 100 bytes and is generally designed with a fixed length. Shorter is better, and no more than 16 bytes is recommended. The rowkey is stored with every cell, so an overly long rowkey lowers effective memory utilization, lets the system cache less data, and reduces retrieval efficiency.

Most current operating systems are 64-bit, with memory aligned to 8 bytes. Keeping the rowkey at 16 bytes, an integer multiple of 8, fits that alignment well.
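As a rough illustration of a fixed-length 16-byte rowkey, the sketch below packs two long fields into a byte[] with java.nio.ByteBuffer. The layout (userId followed by a timestamp) and all names are assumptions made for this example, not a layout the text prescribes. It also assumes non-negative values, because negative longs would sort after positive ones in unsigned byte order.

```java
import java.nio.ByteBuffer;

// Hypothetical layout: [8-byte userId][8-byte timestamp], big-endian.
final class FixedLengthRowkey {
    static final int LENGTH = 2 * Long.BYTES; // 16 bytes

    // Packs both fields into one fixed-length key.
    static byte[] encode(long userId, long timestampMillis) {
        return ByteBuffer.allocate(LENGTH)
                .putLong(userId)
                .putLong(timestampMillis)
                .array();
    }

    // Reads the first 8 bytes back as the userId.
    static long userIdOf(byte[] rowkey) {
        return ByteBuffer.wrap(rowkey).getLong(0);
    }

    // Reads the last 8 bytes back as the timestamp.
    static long timestampOf(byte[] rowkey) {
        return ByteBuffer.wrap(rowkey).getLong(Long.BYTES);
    }
}
```

For comparison, the same two values written as the text "1234567890123_1700000000000" take 27 bytes, so the binary form is both shorter and fixed in length.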

3. Hashing principle

It is recommended to use the high-order part of the rowkey as a hash field, randomly generated by the program, and to put the time field in the low-order part. This distributes data more evenly across RegionServers and helps balance load. Without a hash field, if the first field is the time information, all new data is concentrated on one RegionServer. Load then concentrates on individual RegionServers during data retrieval as well, causing hotspots and reducing query efficiency.

Methods for avoiding hotspots

3.1. Salting

Salting means adding a random prefix to the rowkey so that it starts differently from the rowkeys written before it. The number of distinct prefixes should match the number of regions you want to spread the data across. Salted rowkeys are then scattered across regions according to their randomly generated prefixes.
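A minimal sketch of salting in plain Java follows. The two-digit "NN-" prefix format, the bucket count, and the class name are illustrative assumptions (the format only works for up to 100 buckets). The second method shows the cost of a random prefix: a reader does not know which prefix was chosen, so it has to try every bucket.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ThreadLocalRandom;

// Hypothetical helper: random bucket prefix, one bucket per target region.
final class SaltedRowkey {

    // Prepends a random two-digit bucket number, for example "03-".
    static String salt(String rowkey, int buckets) {
        int bucket = ThreadLocalRandom.current().nextInt(buckets);
        return String.format("%02d-%s", bucket, rowkey);
    }

    // Lists every key the row could have been written under.
    static List<String> candidates(String rowkey, int buckets) {
        List<String> keys = new ArrayList<>();
        for (int b = 0; b < buckets; b++) {
            keys.add(String.format("%02d-%s", b, rowkey));
        }
        return keys;
    }
}
```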

3.2. Hashing

Hashing always gives the same row the same prefix. Like salting, it spreads load across the whole cluster, but reads stay predictable: a deterministic hash lets the client reconstruct the full rowkey and fetch the row directly.
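The difference from salting can be sketched as follows: the prefix is derived from the key itself, so it can be recomputed at read time. String.hashCode() is used here only because it is deterministic and needs no extra imports. The hash function, the "NN-" prefix format, and the names are assumptions for this example, and a real design might prefer a stronger hash such as MD5.

```java
// Hypothetical helper: deterministic bucket prefix derived from the key.
final class HashedRowkey {

    // The same rowkey always maps to the same bucket, so a reader can rebuild the full key.
    static String withHashPrefix(String rowkey, int buckets) {
        // floorMod keeps the bucket non-negative even when hashCode() is negative.
        int bucket = Math.floorMod(rowkey.hashCode(), buckets);
        return String.format("%02d-%s", bucket, rowkey);
    }
}
```

A writer and a reader that both call withHashPrefix("user123", 16) arrive at the same key, which is what makes single-row reads possible without scanning every bucket.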

3.3. Reversing

Reverse a fixed-length or numeric rowkey so that the part that changes most often (the least significant part) comes first. This effectively randomizes the rowkeys, at the cost of losing their original sort order.

StringBuilder has a reverse() method, String does not.
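A short sketch of reversing with StringBuilder is shown below. The sequential IDs are made-up sample values, and the class name is hypothetical.

```java
// Hypothetical helper: reverses a fixed-length key so the fast-changing digits lead.
final class ReversedRowkey {

    static String reverse(String fixedLengthKey) {
        return new StringBuilder(fixedLengthKey).reverse().toString();
    }

    public static void main(String[] args) {
        // Consecutive IDs share a prefix before reversal and differ in the first character after it.
        System.out.println(reverse("1000123")); // prints 3210001
        System.out.println(reverse("1000124")); // prints 4210001
    }
}
```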

3.4. Timestamp reversal

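A common problem is quickly finding the most recent version of a value. Using a reversed timestamp, Long.MAX_VALUE - timestamp, as part of the rowkey is very useful here.

With a rowkey of the form [key][reverse_timestamp], the latest value for [key] is found by scanning for [key] and taking the first record. Because rowkeys in HBase are sorted, the most recently written entry has the smallest reverse_timestamp and therefore comes first.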

For example, suppose you need to store a user's operation records sorted in reverse order of operation time. The rowkey can be designed as [reversed userId][Long.MAX_VALUE - timestamp]. To query all of a user's operation records, specify the reversed userId directly: startRow is [reversed userId][000000000000] and stopRow is [reversed userId][Long.MAX_VALUE - timestamp].
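The [reversed userId][Long.MAX_VALUE - timestamp] layout could be built roughly as follows. This is a sketch under stated assumptions: the key is encoded as text, timestamps are non-negative milliseconds, and the reversed timestamp is zero-padded to 19 digits (the number of decimal digits in Long.MAX_VALUE) so that text order matches numeric order. The class and method names are hypothetical.

```java
// Hypothetical helper: builds [reversed userId][Long.MAX_VALUE - timestamp] as text.
final class ReverseTimestampRowkey {

    // Newer timestamps yield smaller values, so they sort first.
    static String reverseTimestamp(long timestampMillis) {
        return String.format("%019d", Long.MAX_VALUE - timestampMillis);
    }

    static String rowkey(String userId, long timestampMillis) {
        String reversedUserId = new StringBuilder(userId).reverse().toString();
        return reversedUserId + reverseTimestamp(timestampMillis);
    }

    public static void main(String[] args) {
        String older = rowkey("1000123", 1_700_000_000_000L);
        String newer = rowkey("1000123", 1_700_000_060_000L);
        // The newer record sorts before the older one.
        System.out.println(newer.compareTo(older) < 0); // prints true
    }
}
```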

Three Principles of Rowkey Design