Shock! The only head-on dissection of the essence of range sharding on the whole web [applicable to HBase, TiDB, etc.]

1. Foreword

Chatting with friends last week, I felt the title of my previous article was a bit too plain. Although the goal is not to be a clickbait writer, this time I decided to try the "shock style"!! Let's see how it works~~

In the previous articles, we described the routing models of hash sharding. After analyzing the "basic (weighted) round robin", "virtual bucket", and "consistent hashing" models, it is not hard to see that hash sharding's routing model distributes the client's read and write requests as evenly as possible across the backend nodes, and the latter two schemes in particular use different techniques to ease the shard-migration problem when nodes go online or offline. Does that mean hash sharding has no drawbacks? Are there sharding methods other than hashing? With these questions in mind, let's get to the focus of this article, the second installment of routing sharding: range sharding.

2. Pain points of hash sharding

Let's step outside the recent hash-sharding mindset for a moment and look at routing sharding again. Hash sharding processes a key through a hash function and assigns it to a partition, and the partition in turn lands on a machine through certain mapping rules. Therefore, to look up the value of a key, you must go through the key-partition mapping table and then the partition-machine mapping table, that is, address twice. As shown in the figure above, when a large number of keys are queried, the request is still split into many single-key routes, and the two-step addressing is repeated for each one, even for two adjacent keys such as Key1 and Key2. Now imagine thousands of keys in a single query. This leads to hash sharding's biggest pain point: constrained by its routing model, hash sharding is only suitable for point queries, not range queries.
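To make the pain point concrete, here is a minimal sketch of hash-shard routing (the partition count, node names, and `locate` helper are all hypothetical, just for illustration). Note how adjacent keys scatter across partitions, so a "range" over key1..key4 degenerates into four independent point lookups, each paying the full two-step addressing cost:

```python
import hashlib

NUM_PARTITIONS = 8
# Second routing level: partition -> machine mapping (illustrative).
partition_to_machine = {p: f"node-{p % 3}" for p in range(NUM_PARTITIONS)}

def locate(key):
    """Two-step addressing: key -> partition, then partition -> machine."""
    digest = int(hashlib.md5(key.encode()).hexdigest(), 16)
    partition = digest % NUM_PARTITIONS
    return partition, partition_to_machine[partition]

# Adjacent keys land on unrelated partitions; a scan over them is
# just N separate point lookups.
for k in ["key1", "key2", "key3", "key4"]:
    print(k, "->", locate(k))
```

Each call repeats the same hash-then-map work, which is exactly the repeated addressing described above.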

3. Breaking the problem - constructing range fragmentation

What kind of routing-sharding architecture is suitable for range queries? Next, building on the ideas of "Big Data Foundation - So This is Routing Sharding", we will evolve the point-query model into a range-query model step by step.

  • Clarifying the relationship between point queries and range queries

A point query is usually a single-key query; a range query is usually a multi-key query (the keys may be adjacent or far apart). As shown in the figure above, the relationship between them can be summed up in one sentence: multiple ordered point queries are approximately equal to a range query. The reason it is only "approximately equal" is that, under this reading, a range query is just a wrapper around point queries: each range query still performs many addressings, so nothing essential has changed. If we count each key query as 1 (in fact each key query implies two addressings), then point queries and range queries relate as 1+1=2.


  • Adding a sharding layer for point and range queries

In hash sharding, after the hash function is applied, multiple keys that were originally ordered end up stored across multiple shards; in particular, adjacent keys may not land in the same shard. The more keys a query involves, the more shards it must touch. If we could reduce the number of shards queried, the efficiency of the key-partition mapping would also rise (why? I'll keep you in suspense and come back to it later; feel free to think it over first). Talking in the abstract is hard to follow, so let's combine this with the model from the first article, "Big Data Foundation - So This is Routing Fragmentation", as shown below: define the shard corresponding to Key1 as shard a, the shard corresponding to Key2 as shard h, and the shard corresponding to Key(n) as shard m. In the hash-sharding model, multiple keys may map to the same shard through the hash function; as an example, let Key(n-1) also correspond to shard a. So for the key-partition lookup, these 4 keys are actually routed to 3 shards. At this point we perform two operations:

  1. Reorganize the three existing shards

Split and reorganize the three shards in key order: shard a (containing Key1, Key(n-1)), shard h (containing Key2), and shard m (containing Keyn) are reorganized into shard a (containing Key(n-1)), shard m (containing Keyn), and a new shard x (containing Key1 and Key2). Of course, if all the keys were contiguous, merging them into a single shard would be the ideal case. Extending beyond this example: what do we gain from splitting and merging shards?

From unordered to ordered: according to certain rules, similar keys from the originally unordered shards are merged together. Contiguous keys that previously had to be looked up across many shards can now be found in just a few, narrowing the search scope and improving query efficiency.

  2. Optimize the routing table

Reduce the key-partition metadata from 4 entries (Key1 -> shard a, Key2 -> shard h, Key(n-1) -> shard a, Keyn -> shard m) to 3 entries (Key1 -> new shard x, Key(n-1) -> shard a, Keyn -> shard m). Shrinking the 4 metadata entries to 3 turns a dense index into a sparse index. What are dense and sparse indexes?

In the routing-sharding model, the key-partition mapping is generally implemented through a routing table (consistent hashing is in effect an alternative routing table too, just not solidified into an index structure; it must be computed on every lookup). In the routing table, each key has its shard location recorded: Key1 -> shard a, Key2 -> shard h, Key(n-1) -> shard a, and so on. An index that records every key's shard location in the routing table is called a dense index. After the optimization, since the keys are ordered, only Key1 -> new shard x needs to be recorded; Key2 can then be found inside new shard x. This kind of index is called a sparse index.
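A sparse routing table over ordered keys can be looked up with a binary search over shard start keys. Here is a minimal sketch (the shard names and boundary keys are made up for illustration); a dense index would instead need one routing entry per key:

```python
import bisect

# Sparse routing table: one entry per shard (its start key), kept sorted.
shard_starts = ["a", "g", "n", "t"]
shard_names = ["shard-1", "shard-2", "shard-3", "shard-4"]

def route(key):
    """Find the shard whose [start, next_start) range covers the key."""
    i = bisect.bisect_right(shard_starts, key) - 1
    if i < 0:
        raise KeyError(f"{key!r} precedes the first shard")
    return shard_names[i]

print(route("apple"))  # falls in ["a", "g") -> shard-1
print(route("kiwi"))   # falls in ["g", "n") -> shard-2
```

With n shards instead of n keys in the table, the metadata stays small and a whole contiguous key range resolves to at most a handful of shards.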

The splitting and merging at the shard layer described above, together with the index-type conversion of the routing-table metadata, is better suited to range queries over contiguous keys. In terms of range-addressing efficiency, compared with issuing the same keys as separate point queries, this can be loosely understood as reaching a 1+1<2 state.


  • Adding a machine layer for point and range queries

  • IO efficiency gains. The routing model here is consistent with the basic routing-sharding model from the first article, so each shard is hosted by a corresponding machine. At first glance there may seem to be no obvious difference between point and range queries at this layer. But recall the previous section: because the data within a range-sharded shard is ordered, reads and writes are sequential IO, unlike the random IO of point queries, and for big-data workloads the benefit of sequential IO is especially pronounced.

Why is sequential IO more efficient than random IO? Reading or writing a disk usually involves 3 steps:

  1. Seek time: the head moves to the specified track

  2. Rotation Delay Time: Wait for the specified sector to rotate past under the head

  3. Data transfer time: the actual transfer of data between disk and memory

With sequential IO there is no need to seek frequently, so most of the time is spent on data transfer; random IO, by contrast, spends a great deal of time seeking and waiting for sectors to rotate past. In addition, Linux read-ahead caches multiple consecutive pages in memory, and reading memory is far faster than reading disk. By comparison, range queries make very good use of read-ahead pages, while point queries barely use them at all.
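You can get a rough feel for this with a small experiment: read the same file once sequentially and once at random offsets. This is only an illustrative sketch, and the measured gap depends heavily on the hardware; on an SSD or with a warm page cache the difference may be small:

```python
import os
import random
import tempfile
import time

BLOCK = 4096
NBLOCKS = 2048  # 8 MiB test file

# Create a temporary file of random bytes.
fd, path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    f.write(os.urandom(BLOCK * NBLOCKS))

def read_blocks(offsets):
    """Read one 4 KiB block at each offset; return elapsed seconds."""
    t0 = time.perf_counter()
    with open(path, "rb") as f:
        for off in offsets:
            f.seek(off)
            f.read(BLOCK)
    return time.perf_counter() - t0

offsets = [i * BLOCK for i in range(NBLOCKS)]
seq = read_blocks(offsets)                          # sequential scan
rnd = read_blocks(random.sample(offsets, NBLOCKS))  # random access
print(f"sequential: {seq:.4f}s, random: {rnd:.4f}s")
os.remove(path)
```

On spinning disks with a cold cache, the sequential pass benefits from both the absence of seeks and the kernel's read-ahead, exactly as described above.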

  • Shard locations are well suited to client-side caching

With point queries, a large number of keys are involved and their locations are scattered, so it is clearly unrealistic for the client to cache every shard's location. After shards are split and merged as above, however, the client can cache the locations of the merged shards, which genuinely improves the cache hit rate.

4. Advantages of range sharding

The evolution process of range sharding is described in detail above, and the advantages are also obvious:

  • When reading data in batches, the client can hit cached shard locations directly, improving the read hit rate.

  • Physically, sequential IO is used to improve batch read and write efficiency compared to random IO.

  • Metadata (routing table) optimization, using sparse indexes, reducing metadata pressure.

  • Range sharding is more flexible: no longer constrained by a hash function, each shard's size (via shard splits) and location can be adjusted freely.

5. Theoretical Best Practices

So which existing systems use range sharding for routing? Quite a few, actually, as shown in the image below.

You may never have used some of these services, but that does not stop us from understanding their core sharding logic. It also confirms, from the side, a saying: from a purely technical point of view, the more you learn, the more you realize you don't know; on the other hand, what we are doing now is stepping outside any single technology and looking at each service from a global perspective.

Here we take HBase as an example for analysis.


  • The HBase sharding model

  1. In HBase, shards are sorted by rowkey and split into ranges, each a left-closed, right-open interval [startKey, endKey); each shard is called a Region. An HBase cluster holds multiple tables; each table contains one or more Regions, and each Region is mapped to exactly one machine. Conversely, each machine hosts zero or more Regions. These machines are called RegionServers in HBase.

  2. Since the rowkeys within a Region are already sorted, the startKey of a Region is in fact the endKey of the previous Region. The first Region has no startKey, and likewise the last Region has no endKey. Taken together, all the Regions cover every possible rowkey value in the table.
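The interval layout above can be sketched in a few lines (a toy model with string rowkeys and made-up boundaries, not HBase's actual code): `None` marks the open ends, and together the [startKey, endKey) intervals cover the whole key space with no gaps or overlaps.

```python
# Regions as (startKey, endKey) pairs; None marks the open ends.
regions = [
    (None, "g"),   # first region: no startKey
    ("g", "n"),
    ("n", "t"),
    ("t", None),   # last region: no endKey
]

def find_region(rowkey):
    """Return the index of the [startKey, endKey) region holding rowkey."""
    for i, (start, end) in enumerate(regions):
        if (start is None or start <= rowkey) and (end is None or rowkey < end):
            return i
    raise RuntimeError("regions do not cover the key space")

print(find_region("apple"))  # before "g" -> region 0
print(find_region("g"))      # left-closed: a startKey belongs to its region
print(find_region("zebra"))  # past "t" -> last region
```

Because the intervals tile the key space, every rowkey resolves to exactly one Region, which is what makes the sparse meta index in the next section possible.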


  • Metadata routing strategy

  1. Data queries in HBase involve two levels of routing: one from rowkey to Region, the other from Region to RegionServer. Both levels of routing information are stored in the meta table. The meta table is in fact a sparse index that records only the startKey and endKey values; the Region holding a given key can be located through this sparse index.


  • The LSM storage structure and its optimizations

  1. HBase uses the LSM (Log-Structured Merge Tree) storage structure, converting random disk IO into sequential IO to improve batch read/write performance, at the cost of point-query performance.

  2. When a piece of data is written to HBase, it is first written to the WAL and then to the MemStore. When the MemStore meets certain conditions, it flushes its data to disk. As writes accumulate, the on-disk HFiles keep multiplying; because a key's location is uncertain, a read may have to examine all the HFiles, which is why the LSM tree's point-read performance lags the B+ tree's (and the main reason HBase trails MySQL on point queries). HBase mitigates this by periodically compacting several HFiles, merging multiple files into one to improve read performance.
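The write path just described (memtable, flush to immutable sorted runs, newest-first reads, periodic compaction) can be captured in a deliberately tiny sketch. This is a hypothetical toy in Python, not HBase's actual implementation, and it omits the WAL for brevity:

```python
class TinyLSM:
    """Toy LSM tree: memtable + immutable sorted runs ("HFiles")."""

    def __init__(self, memtable_limit=2):
        self.memtable = {}
        self.runs = []  # flushed runs, oldest first
        self.limit = memtable_limit

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.limit:
            # Flush: a sequential write of one sorted immutable run.
            self.runs.append(dict(sorted(self.memtable.items())))
            self.memtable = {}

    def get(self, key):
        # Check memtable first, then runs newest-first: reads may touch
        # every run, which is the LSM point-query cost.
        if key in self.memtable:
            return self.memtable[key]
        for run in reversed(self.runs):
            if key in run:
                return run[key]
        return None

    def compact(self):
        # Merge all runs into one; later runs overwrite earlier entries.
        merged = {}
        for run in self.runs:
            merged.update(run)
        self.runs = [dict(sorted(merged.items()))]

db = TinyLSM()
db.put("k1", "v1"); db.put("k2", "v2")   # second put triggers a flush
db.put("k1", "v1b"); db.put("k3", "v3")  # a second run is flushed
print(db.get("k1"))   # newest run wins: "v1b"
db.compact()          # many runs become one, cheapening future reads
print(len(db.runs))
```

Compaction here plays the same role as HBase's HFile merges: it trades periodic sequential rewrite work for fewer files each read must check.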


  • More flexible scheduling

  1. In HBase, each Region is internally ordered. When a Region grows too large or a hot key appears, the Region is split according to the corresponding rules. With no hash function to constrain it, Regions can be split and migrated freely.

  2. HBase actually separates the storage layer from the computing layer, which is also the mainstream architecture today. When a Region is migrated, the physical data does not need to move, so the migration cost is very low.
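The flexibility of splitting in a range-sharded system comes down to the fact that a Region is just a key interval: splitting only requires choosing a new boundary key, not rehashing every key as hash sharding would. A minimal sketch (the midpoint-of-sample split rule and helper name are hypothetical):

```python
def split_region(region, keys):
    """Split a (start, end) region at the median of a sample of its keys."""
    keys = sorted(keys)
    mid = keys[len(keys) // 2]  # split point: the median key
    start, end = region
    return (start, mid), (mid, end)

# A hot or oversized region ["g", "n") splits into two smaller intervals;
# no data outside this region is affected.
left, right = split_region(("g", "n"), ["gamma", "golf", "hotel", "india"])
print(left, right)
```

The two child intervals still tile the parent's range exactly, so the sparse routing table only needs one new entry.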

6. Epilogue

This article started from hash sharding, introduced its pain points, transitioned step by step to range sharding on top of the general routing-sharding model, explained the advantages of range sharding, and finally used HBase's best practices to confirm the theory. To close, let me pull out one sentence from the article: from a purely technical point of view, the more you learn, the more you realize you don't know; on the other hand, what we are doing now is stepping outside any single technology and looking at each service from a global perspective. Believe that persistence always pays off!!


Originality is not easy. If you feel you have gained something, please like or share this article; your support is what keeps me writing.


Origin blog.csdn.net/weixin_47158466/article/details/108026562