Database Sharding Series (5): A Sharding Expansion Scheme That Supports Free Planning with No Data Migration and No Routing Code Changes

As a horizontal scaling solution at the data-storage level, database sharding has a long history, and many systems handling massive data volumes have gone through a sharding transformation at some point in their evolution. Simply put, sharding splits an originally single database according to certain rules and distributes its data across multiple physical machines (which we call Shards), breaking through the limits of a single machine and allowing the system to scale out as data volumes grow. The split is transparent to the upper-layer application: the physically distributed databases still behave logically as one database. Implementing sharding requires solving a series of key technical problems, including the splitting strategy, node routing, global primary key generation, cross-node sorting/grouping/joins, transactions across multiple data sources, and database expansion. For these topics, see the author's blog column at http://blog.csdn.net/column/details/sharding.html. This article focuses on "database expansion" and proposes a Sharding expansion scheme that allows free planning while avoiding both data migration and routing code modification.

 

Sharding Expansion: An Unbearable Burden for System Maintenance

 

For any sharding system, after running for a period of time the data will approach the upper limit that the current set of nodes can carry. At that point the database must be expanded, that is, new physical nodes must be added to spread the data. If the system routes by hashing the ID, the team has to recompute the target shard of every existing record against the new node count and migrate the data accordingly, which is a huge maintenance burden. If the system instead routes by incremental intervals (for example, every 10 million records, or every month of data, goes to a new node), data migration is avoided, but "hot spot" problems may appear: recent reads and writes all concentrate on the newly added node (many systems read and write new data far more often than old data), which hurts performance. Faced with this dilemma, Sharding expansion becomes extremely difficult.

 

In general, the "ideal" expansion plan should strive to meet the following requirements:

  1. Preferably no data migration at all (data migration is always a heavy burden on the team)
  2. Allow the expansion scale and per-node storage load to be planned freely according to the available hardware resources
  3. Distribute data reads and writes evenly to avoid "hot spots"
  4. Ensure that no data is written to nodes that have already reached their storage limit

 

At present there are few good solutions that avoid data migration; two are relatively feasible. The first maintains a mapping table that records the correspondence between data IDs and target shards: on write, the data goes to a newly added shard and the ID together with its target node is written into the mapping table; on read, the mapping table is consulted first and the query is then executed against the target shard. This approach is simple and effective, but both reads and writes require two database accesses, and the mapping table itself can easily become a performance bottleneck. To mitigate this, a distributed cache has to be introduced to cache the mapping data, yet the double access on writes remains, and the cache resources consumed by a huge volume of mapping data, plus the cost of introducing a distributed cache just for this purpose, all have to be weighed. The second solution comes from Taobao's integrated business platform team. It exploits the forward compatibility of taking remainders by powers of 2 (for example, a number whose remainder modulo 4 is 1 also has remainder 1 modulo 2) to allocate data, which avoids row-level migration; however, table-level migration is still required, and both the expansion scale and the number of sub-tables are constrained. Overall these solutions are not ideal and each has shortcomings, which itself reflects how hard Sharding expansion is.
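For concreteness, here is a minimal Java sketch of the first (mapping-table) approach; the class name and the MappingDao/ShardDao interfaces are hypothetical, and the point is only to show where the extra database round trip comes from on both the write and the read path.

```java
// Illustrative sketch of the ID-to-shard mapping-table scheme (not the article's
// proposal). MappingDao and ShardDao are hypothetical abstractions.
public class MappingTableRouter {

    interface MappingDao {                 // backed by the mapping table (or a cache in front of it)
        void save(long id, String shard);  // record which shard holds this id
        String findShard(long id);         // look up the shard for this id
    }

    interface ShardDao {                   // executes SQL against one physical shard
        void insert(String shard, String sql);
        Object query(String shard, String sql);
    }

    private final MappingDao mappingDao;
    private final ShardDao shardDao;
    private final String writableShard;    // the newly added, still-writable shard

    public MappingTableRouter(MappingDao m, ShardDao s, String writableShard) {
        this.mappingDao = m;
        this.shardDao = s;
        this.writableShard = writableShard;
    }

    public void write(long id, String insertSql) {
        shardDao.insert(writableShard, insertSql);   // 1st DB access: the data itself
        mappingDao.save(id, writableShard);          // 2nd DB access: the mapping row
    }

    public Object read(long id, String querySql) {
        String shard = mappingDao.findShard(id);     // 1st access: mapping lookup (or cache hit)
        return shardDao.query(shard, querySql);      // 2nd access: the actual query
    }
}
```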

 

Combining the Strengths of Both Approaches: An Ideal Sharding Expansion Scheme

 

As noted above, Sharding expansion is closely tied to the routing rule the system uses. Hash-based routing spreads data evenly but requires migration and cannot stop writing new data to nodes that have reached their limit; interval-based routing naturally avoids migration and unbounded writes to a single node, but suffers from "hot spots". The intent of our design is to combine the advantages of the two routing rules and discard their drawbacks, producing an expansion approach close to the "ideal". In short: globally, data is distributed by incremental intervals and capacity grows incrementally, so there is no data migration; locally, hashing spreads reads and writes to solve the "hot spot" problem; and by modeling the Sharding topology and using a single consistent routing algorithm, expansion only appends node metadata and never modifies the hashing logic in code.

 

Principle

 

First, as the foundation of the scheme, we need a programming model that describes the Sharding topology, so that the system is aware of the Shards and can route based on their distribution. Following the usual splitting principles, a single database is first split vertically; vertical splitting merely groups closely related tables together, and we call such a group of tables a Partition. Next, if a table inside a Partition is large and growing quickly, it is split horizontally, distributing the table's data across multiple Shards either by incremental intervals or by hashing.

In our scheme the two are combined. Globally, data is distributed by incremental intervals, but each interval is sized not by the capacity of a single Shard but by the total capacity of a group of Shards, which we call a ShardGroup. Locally, that is, within a ShardGroup, records are spread evenly across the group's Shards by hashing. Routing a record therefore first determines the ShardGroup from the interval its ID falls into, and then hashes the ID to hit a target Shard within that group. At every expansion we introduce a new group of Shards as a new ShardGroup, assign it a fresh incremental interval and mark it "writable", while the existing ShardGroups are marked "non-writable"; new data is written to the new ShardGroup and old data never needs to migrate. Inside a ShardGroup, hashing spreads reads and writes across the Shards, which in turn avoids the "hot spot" problem.

Finally, inside a Shard, once a single table reaches a certain size its read and write performance drops sharply even though the database as a whole has not hit its storage or load limits. To make full use of the server, we usually create several tables with identical structure and continue writing into the new ones; we call such tables "Fragment Tables". Introducing fragment tables means that before execution every SQL statement must have its table name replaced, according to the ID, with the real fragment table name, which undeniably makes Sharding harder to implement, and harder still if the system uses an ORM framework. Many databases now offer a "partitioning" mechanism similar to fragment tables but without these side effects, and teams can choose between fragment tables and partitioning according to their situation. Based on the splitting principles above, we arrive at the following domain model of the Sharding topology:

 

Figure 1. Domain model of the Sharding topology

 

 

A few details in this model deserve attention. The writable attribute of ShardGroup indicates whether the group currently accepts writes; at any moment a Partition has exactly one writable ShardGroup, normally the one introduced by the most recent expansion. The startId and endId attributes mark the ShardGroup's ID interval. The hashValue attribute of Shard indicates which hash values that node accepts. The startId and endId of FragmentTable mark the ID range of the data stored in that fragment table.

 

Once the model is established, we store the node metadata either in a configuration file or in corresponding database tables, so that the topology of the whole storage system is persisted; at startup the system loads the current Sharding topology from the file or the database and uses it for routing. (A configuration file is enough when the number of nodes is small; with a very large number of nodes, dedicated tables should hold the metadata. In the "very large-scale architecture reference" section of Oracle's white paper "MySQL Reference Architectures for Massively Scalable Web Infrastructure", the architecture diagram includes a dedicated server called the Shard Catalog, which is exactly such a database holding node configuration.) When expanding, we only add the relevant node information to the file or tables and restart the system; no routing code has to be modified.
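To make the model and the routing path concrete, here is a minimal Java sketch. The class and attribute names (Partition, ShardGroup, Shard, FragmentTable, writable, startId, endId, hashValue) follow the article; everything else, in particular computing the hash as id % hashSlots, is an illustrative assumption rather than the article's exact implementation.

```java
import java.util.ArrayList;
import java.util.List;

// Minimal sketch of the Sharding topology model and its routing algorithm.
// Field names follow the article; the hash function (id % hashSlots) is an assumption.
public class ShardingTopology {

    static class FragmentTable {
        String tableName;           // physical table name on the shard
        long startId, endId;        // ID range stored in this fragment table [startId, endId)
    }

    static class Shard {
        String nodeName;            // physical node / data source identifier
        List<Integer> hashValues = new ArrayList<>();          // hash values routed to this shard
        List<FragmentTable> fragmentTables = new ArrayList<>();
    }

    static class ShardGroup {
        long startId, endId;        // ID interval served by this group
        boolean writable;           // only one group per Partition is writable at a time
        int hashSlots;              // assumption: number of hash slots shared by the group's shards
        List<Shard> shards = new ArrayList<>();
    }

    static class Partition {
        List<ShardGroup> shardGroups = new ArrayList<>();
    }

    static class RouteResult {
        final String nodeName, tableName;
        RouteResult(String nodeName, String tableName) {
            this.nodeName = nodeName;
            this.tableName = tableName;
        }
    }

    // Routing: ID interval -> ShardGroup, hash -> Shard, ID range -> FragmentTable.
    // Expansion only appends metadata (new ShardGroups/Shards); this code never changes.
    static RouteResult route(Partition partition, long id) {
        for (ShardGroup group : partition.shardGroups) {
            if (id < group.startId || id >= group.endId) continue;
            int hash = (int) (id % group.hashSlots);
            for (Shard shard : group.shards) {
                if (!shard.hashValues.contains(hash)) continue;
                for (FragmentTable table : shard.fragmentTables) {
                    if (id >= table.startId && id < table.endId) {
                        return new RouteResult(shard.nodeName, table.tableName);
                    }
                }
            }
        }
        throw new IllegalArgumentException("No shard found for id " + id);
    }
}
```

Because the routing method only walks the loaded metadata, adding a ShardGroup or Shard row is all an expansion requires, which is exactly the property the scheme is after.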

 

Example

 

Let's walk through an example to see how the scheme works.

 

Phase 1: Initial Launch

 

Suppose a system goes live with a plan to provide storage for 40 million records in a given table. With a single-table limit of 10 million rows and a single-database limit of 20 million rows, 2 Shards are needed, each holding two fragment tables. The ShardGroup's incremental interval is 0 to 40 million, and records are spread over the 2 Shards by ID modulo 2. The detailed plan is as follows:

Figure 2. Planning scheme for the initial capacity of 40 million records

 

The corresponding Sharding topology metadata is as follows:

Figure 3. Corresponding Sharding metadata
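Since Figure 3 is an image, the sketch below reconstructs the same phase-1 metadata in code, reusing the ShardingTopology classes from the earlier sketch; the table names and the per-fragment-table ID ranges (each 20 million wide, holding roughly 10 million rows after the modulo-2 split) are assumptions based on the numbers in the text, not values copied from the figure.

```java
// Phase-1 metadata: one ShardGroup for IDs 0 - 40M, two shards, id % 2.
static ShardingTopology.Partition buildPhase1Topology() {
    ShardingTopology.Partition partition = new ShardingTopology.Partition();

    ShardingTopology.ShardGroup group0 = new ShardingTopology.ShardGroup();
    group0.startId = 0;
    group0.endId = 40_000_000L;
    group0.writable = true;            // the only writable group at this point
    group0.hashSlots = 2;              // route by id % 2

    for (int i = 0; i < 2; i++) {      // Shard0 and Shard1
        ShardingTopology.Shard shard = new ShardingTopology.Shard();
        shard.nodeName = "Shard" + i;
        shard.hashValues.add(i);       // Shard0 takes hash value 0, Shard1 takes 1

        ShardingTopology.FragmentTable t0 = new ShardingTopology.FragmentTable();
        t0.tableName = "t_order_0";    // hypothetical table name
        t0.startId = 0;
        t0.endId = 20_000_000L;

        ShardingTopology.FragmentTable t1 = new ShardingTopology.FragmentTable();
        t1.tableName = "t_order_1";
        t1.startId = 20_000_000L;
        t1.endId = 40_000_000L;

        shard.fragmentTables.add(t0);
        shard.fragmentTables.add(t1);
        group0.shards.add(shard);
    }
    partition.shardGroups.add(group0);
    return partition;
}
// Example: ShardingTopology.route(buildPhase1Topology(), 25_000_001L)
// resolves to Shard1 / t_order_1.
```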

 

 

Phase 2: System Expansion

 

After the system has run for a while and the table's total volume approaches the 40-million limit, it is time to expand. To demonstrate the flexibility of the scheme, suppose we now have three servers, Shard2, Shard3, and Shard4, whose performance and storage capacity rank Shard2 < Shard3 < Shard4. We assign Shard2 to store 10 million records, Shard3 20 million, and Shard4 30 million, raising the table's total capacity from 40 million records before expansion to 100 million. The detailed plan is as follows:

Figure 4. Planning scheme for the second expansion, adding 60 million records of capacity

 

The corresponding topology table data is as follows:

 

Figure 5. Corresponding Sharding metadata
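The interesting detail of this phase is that the three shards have different capacities, so the in-group hash must be weighted. The sketch below shows one plausible assignment, 6 hash slots split 1:2:3 across Shard2, Shard3, and Shard4; the authoritative values are whatever Figure 5 records, and the helper method is purely illustrative.

```java
// Second expansion: append a new ShardGroup for IDs 40M - 100M and stop writes
// to the old one. The weighted hash (id % 6, slots split 1:2:3) is one plausible
// way to match the 10M / 20M / 30M capacities.
static void expandPhase2(ShardingTopology.Partition partition) {
    partition.shardGroups.get(0).writable = false;   // ShardGroup0: no new writes

    ShardingTopology.ShardGroup group1 = new ShardingTopology.ShardGroup();
    group1.startId = 40_000_000L;
    group1.endId = 100_000_000L;
    group1.writable = true;
    group1.hashSlots = 6;                            // route by id % 6

    addShard(group1, "Shard2", new int[] {0});       // 1/6 of writes -> ~10M rows
    addShard(group1, "Shard3", new int[] {1, 2});    // 2/6 of writes -> ~20M rows
    addShard(group1, "Shard4", new int[] {3, 4, 5}); // 3/6 of writes -> ~30M rows

    partition.shardGroups.add(group1);
}

static void addShard(ShardingTopology.ShardGroup group, String name, int[] hashValues) {
    ShardingTopology.Shard shard = new ShardingTopology.Shard();
    shard.nodeName = name;
    for (int h : hashValues) shard.hashValues.add(h);
    // Fragment tables (10M rows each) would be added here as in phase 1; omitted for brevity.
    group.shards.add(shard);
}
```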

 

 

This expansion case shows that the scheme allows planning to follow the actual hardware: there are no hard constraints on the expansion scale or the number of nodes, making it a very flexible way to expand.

 

Enhancement

 

Next, an advanced topic: making use of "regenerated" storage space. In most systems historical data is fairly stable and is rarely updated or deleted, so the volume on historical Shards stays roughly constant. Some systems, however, delete data regardless of age, or even delete older data more aggressively; for them, historical Shards shrink over time and eventually free a large amount of storage, which we call "regenerated" storage space. Using it effectively is something such systems must consider when designing an expansion scheme. In our scheme, a simple upgrade is enough: promote the single ID interval of ShardGroup and FragmentTable to multiple ID intervals. We extract the ID-interval attributes of ShardGroup and FragmentTable into separate entities, ShardGroupInterval and FragmentTableIdInterval, each in a one-to-many relationship with its owner.

 

Figure 6. Enhanced domain model of the Sharding topology
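A minimal sketch of how the enhancement changes the earlier classes: the single startId/endId pair becomes a list of intervals, and the interval check in routing scans that list. The entity names ShardGroupInterval and FragmentTableIdInterval come from the article; representing both with one IdInterval shape is a simplification made for the sketch.

```java
import java.util.ArrayList;
import java.util.List;

// Enhanced model: ShardGroup and FragmentTable own a list of ID intervals
// instead of a single startId/endId pair (one-to-many, as in Figure 6).
public class EnhancedTopology {

    static class IdInterval {      // shared shape of ShardGroupInterval / FragmentTableIdInterval
        long startId, endId;
        boolean contains(long id) { return id >= startId && id < endId; }
    }

    static class ShardGroup {
        boolean writable;
        List<IdInterval> intervals = new ArrayList<>();   // ShardGroupInterval rows
        // shards, hashSlots, ... as in the basic model
    }

    static class FragmentTable {
        String tableName;
        List<IdInterval> intervals = new ArrayList<>();   // FragmentTableIdInterval rows
    }

    // Interval matching now scans the list; the rest of the routing is unchanged.
    static boolean matches(List<IdInterval> intervals, long id) {
        for (IdInterval interval : intervals) {
            if (interval.contains(id)) return true;
        }
        return false;
    }
}
```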

 

 

Let's go through an example to understand how the upgraded scheme works.

 

Phase 3: Reusing Regenerated Storage Space Without Further Expansion

 

Suppose that after the system has run for some time, the 60 million records of capacity added by the second expansion are about to be exhausted. Because of the system's deletion pattern, however, much of the early data has been removed, and Shard0 and Shard1 have each freed half of their storage, giving ShardGroup0 a total of 20 million records' worth of reusable space. We therefore re-mark ShardGroup0 as writable=true and add a new ID interval to it, 100 million to 120 million, arriving at the following plan:

Figure 7. Planning scheme for reusing 20 million records of regenerated storage space

 

 

The metadata for the corresponding topology is as follows:

Figure 8. Corresponding Sharding metadata
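With the enhanced model, phase 3 is again nothing more than a metadata change. The sketch below shows the idea with the EnhancedTopology classes above; marking ShardGroup1 read-only at the same time is an inference from the rule that only one ShardGroup per Partition may be writable.

```java
// Phase 3: reuse ShardGroup0's regenerated space without adding nodes.
// Both parameters stand for the already-loaded metadata rows (hypothetical variables).
static void reuseRegeneratedSpace(EnhancedTopology.ShardGroup shardGroup0,
                                  EnhancedTopology.ShardGroup shardGroup1) {
    shardGroup1.writable = false;                 // the second-expansion group is now full

    EnhancedTopology.IdInterval reused = new EnhancedTopology.IdInterval();
    reused.startId = 100_000_000L;                // new IDs 100M - 120M land on the old nodes
    reused.endId = 120_000_000L;
    shardGroup0.intervals.add(reused);            // keep the original 0 - 40M interval as well
    shardGroup0.writable = true;                  // ShardGroup0 accepts writes again
}
```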

 

Summary

 

This scheme combines the advantages of incremental-interval routing and hash routing to avoid both data migration and the "hot spot" problem. By modeling the Sharding topology and using a single consistent routing algorithm, it also avoids any modification of routing code during expansion, making it a close-to-ideal Sharding expansion scheme.
