What are the design principles of HBase RowKey?

Analysis & Answers


Rowkey

A RowKey uniquely identifies a row of data in a table. It is stored as a byte array and plays a role similar to the primary key of a table in a relational database. HBase keeps rows sorted by RowKey in strict lexicographic (byte-by-byte) order.
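Because the ordering is byte-wise rather than numeric, numeric parts of a key usually need a fixed-width encoding. The following is a minimal Python sketch of that effect; the key names are made up for illustration, and Python's comparison of `bytes` values is used here only as a stand-in for how HBase orders keys.

```python
# Row keys are compared byte by byte, not as numbers.
keys = [b"row2", b"row10", b"row1"]
sorted_keys = sorted(keys)  # bytes compare lexicographically
print(sorted_keys)  # [b'row1', b'row10', b'row2']

# Zero-padding to a fixed width makes byte order agree with numeric order.
padded = sorted(b"row%04d" % n for n in (2, 10, 1))
print(padded)  # [b'row0001', b'row0002', b'row0010']
```

Note that `row10` sorts before `row2` in the first list, which is rarely what a reader of the data expects.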

RowKey design principles

Uniqueness principle

Uniqueness must be guaranteed by design. HBase stores data as key-value pairs, so writing a row whose RowKey already exists in the table does not create a second row: the new values overwrite the existing ones for the same columns (when the column family keeps only one version). RowKeys must therefore be unique.

For example: if an HBase table stores one row per person, the RowKey should contain the person's ID card number to guarantee uniqueness.
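The overwrite behavior can be illustrated with a toy in-memory model. This is only a sketch: `ToyTable`, the column name `info:phone`, and all names and ID values are hypothetical, and the class is not the HBase client API.

```python
from collections import defaultdict, deque

class ToyTable:
    """In-memory stand-in for a table (hypothetical; not the HBase client API)."""

    def __init__(self, max_versions=1):
        # Each (rowkey, column) cell keeps at most max_versions values.
        self.cells = defaultdict(lambda: deque(maxlen=max_versions))

    def put(self, rowkey, column, value):
        # Newest value goes to the front; the oldest falls off the back when full.
        self.cells[(rowkey, column)].appendleft(value)

    def get(self, rowkey, column):
        versions = self.cells.get((rowkey, column))
        return versions[0] if versions else None

table = ToyTable(max_versions=1)

# Name as RowKey: two different people named "li_wei" collide.
table.put(b"li_wei", "info:phone", "555-0100")
table.put(b"li_wei", "info:phone", "555-0199")
by_name = table.get(b"li_wei", "info:phone")  # only the second write is visible

# ID number as RowKey (made-up IDs): both rows survive.
table.put(b"ID0001", "info:phone", "555-0100")
table.put(b"ID0002", "info:phone", "555-0199")
by_id = (table.get(b"ID0001", "info:phone"), table.get(b"ID0002", "info:phone"))
```

With the name as the key, the first person's phone number is silently lost; with the ID number as the key, both records remain readable.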

Sorting principle

HBase stores rows sorted by RowKey in byte-wise lexicographic order, which for ASCII strings is simply ASCII order. Make full use of this when designing a RowKey.

For example: when designing a customer consumption record table, put the customer ID at the start of the RowKey. All records for one customer are then stored contiguously, typically in the same region, so a query by customer ID becomes an efficient range scan.
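A small Python sketch of why this works: in a sorted key space, all keys that share a prefix form one contiguous slice. The key layout (`customerId_timestamp`), the helper names `make_key` and `prefix_scan`, and the sample values are assumptions for illustration; a sorted list stands in for the table, and the keys are assumed to be ASCII.

```python
import bisect

def make_key(customer_id: str, epoch_seconds: int) -> bytes:
    # Customer ID first, then a zero-padded timestamp so records sort by time.
    return f"{customer_id}_{epoch_seconds:010d}".encode("ascii")

def prefix_scan(sorted_keys, prefix: bytes):
    # Keys sharing a prefix are contiguous; b"\xff" sorts after any ASCII byte.
    start = bisect.bisect_left(sorted_keys, prefix)
    stop = bisect.bisect_left(sorted_keys, prefix + b"\xff")
    return sorted_keys[start:stop]

keys = sorted([
    make_key("C002", 1700000300),
    make_key("C001", 1700000200),
    make_key("C003", 1700000100),
    make_key("C001", 1700000100),
])

c001_records = prefix_scan(keys, b"C001_")
print(c001_records)  # [b'C001_1700000100', b'C001_1700000200']
```

The scan touches only the slice for customer `C001` and returns that customer's records already ordered by time.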

Hashing principle

RowKeys should be designed so that data is distributed evenly across the HBase nodes (RegionServers).

Take the common timestamp as an example. If the RowKey begins with the system timestamp, keys increase monotonically, so all new data lands in the same region and accumulates on a single RegionServer, creating a hotspot.

This is commonly referred to as the region hotspot problem. A hotspot occurs when a large share of client requests (reads, writes, or other operations) is concentrated on one or a few RegionServers. The affected machine becomes overloaded, which degrades performance and can even make regions unavailable.

Common symptoms are JVM full GCs or "region too busy" exceptions. Other regions hosted on the same RegionServer are affected as well.

For example: when an HBase table stores human behavior data, do not put the timestamp at the front of the RowKey. Putting the ID number at the beginning is a good choice.
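The difference can be seen in a rough simulation. This sketch assumes a table with four regions and fixed, hypothetical split points (`SPLIT_POINTS`); real HBase regions split dynamically, and the two-digit user IDs are made up. It only counts which region each write would land in.

```python
import bisect
from collections import Counter

# Hypothetical split points: four regions divided by the first key character.
SPLIT_POINTS = [b"3", b"5", b"7"]

def region_of(rowkey: bytes) -> int:
    # A key belongs to the region whose [start, end) range contains it.
    return bisect.bisect_right(SPLIT_POINTS, rowkey)

def write_distribution(keys):
    return Counter(region_of(k) for k in keys)

timestamps = range(1700000000, 1700001000)  # 1000 consecutive seconds
user_ids = [f"{(i * 37) % 100:02d}" for i in range(1000)]  # made-up two-digit IDs

ts_first = [f"{ts}_{uid}".encode() for ts, uid in zip(timestamps, user_ids)]
id_first = [f"{uid}_{ts}".encode() for ts, uid in zip(timestamps, user_ids)]

ts_dist = write_distribution(ts_first)
id_dist = write_distribution(id_first)
print(dict(ts_dist))  # every write lands in region 0
print(dict(id_dist))  # writes are spread over all four regions
```

With the timestamp first, every key starts with `17` and all 1000 writes hit one region; with the ID first, the same writes are spread across all four.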

Length principle

A RowKey is a byte array; the shorter, the better.

Reason: First, HBase's persistent files (HFiles) store data as KeyValues, and each KeyValue carries the RowKey. If the RowKey is too long, say 500 bytes, then across 10 million cells the RowKeys alone occupy 500 × 10 million = 5 billion bytes, nearly 5 GB, which greatly reduces HFile storage efficiency. Second, the MemStore buffers data in memory. With long RowKeys, less of that memory holds actual data, so the system can cache fewer rows and retrieval efficiency drops.
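The arithmetic can be checked directly. This is a back-of-the-envelope sketch that counts only the RowKey bytes, before any compression or block encoding; the 16-byte alternative is an arbitrary value chosen for comparison.

```python
rowkey_bytes = 500
cells = 10_000_000  # each stored KeyValue carries the full RowKey

total_bytes = rowkey_bytes * cells
print(total_bytes)            # 5000000000
print(total_bytes / 1024**3)  # about 4.66 (GiB)

# The same number of cells with a 16-byte RowKey.
short_total = 16 * cells
print(short_total / 1024**2)  # about 152.6 (MiB)
```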

For example: do not put a field in the RowKey unless queries use it. If you only query by ID number, there is no need to include name, phone number, address, or other information in the RowKey.

RowKey field selection

The most basic principle when selecting RowKey fields is uniqueness: the RowKey must uniquely identify a row of data. Beyond that, choose fields based on the most frequent query scenarios.

Databases are usually designed to read and consume data efficiently, rather than merely to store it.

Depending on the specific load characteristics, the selected field values may need to be transformed. When the RowKey combines several fields, the order of the fields also matters.

Ways to avoid data hotspots

The RowKey design should prevent data from concentrating in a single region, so that writes are spread as evenly as possible across all regions.

If field selection alone cannot avoid hotspots, check whether the tail of the RowKey is well distributed; if it is, you can reverse the RowKey. Alternatively, prepending a fixed-length random prefix to the RowKey also distributes the data evenly. Both methods, however, sacrifice the natural ordering of the data.
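Both techniques can be sketched in a few lines of Python. Note one deliberate substitution: instead of a truly random prefix, this sketch derives the prefix from a CRC32 hash of the key, so that a reader can recompute it for point lookups; with a purely random prefix, a reader would have to check every bucket. `NUM_BUCKETS` and the function names are hypothetical.

```python
import zlib

def reversed_key(rowkey: bytes) -> bytes:
    # Reversal moves the fast-changing tail to the front of the key.
    return rowkey[::-1]

NUM_BUCKETS = 4  # hypothetical bucket count, e.g. one per pre-split region

def salted_key(rowkey: bytes) -> bytes:
    # Hash-derived prefix (an assumption, not a random number): deterministic,
    # fixed length, and recomputable from the original key.
    bucket = zlib.crc32(rowkey) % NUM_BUCKETS
    return b"%02d_" % bucket + rowkey

print(reversed_key(b"1700000123"))  # b'3210000071'
print(reversed_key(b"1700000124"))  # b'4210000071'
print(salted_key(b"1700000123"))    # two-digit bucket, "_", then the original key
```

Consecutive timestamps now start with different bytes after reversal, so they no longer sort next to each other; that is exactly the loss of ordering described above, and range scans over time become correspondingly harder.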

Reflect & Expand


Summary:

  • HBase data operations are closely tied to the RowKey. A well-designed RowKey maximizes HBase performance.
  • Follow the four principles (uniqueness, sorting, hashing, length) when designing a RowKey. Choose the fields that are queried most often, and watch for hotspots so that data is not concentrated in a few regions.


Origin blog.csdn.net/jjclove/article/details/124923095#comments_28149380