Database splitting - the basic idea of sharding

Reprinted from: http://blog.csdn.net/bluishglc/article/details/6161475

This article focuses on the basic idea of sharding and the theoretical segmentation strategy. For more detailed implementation strategies and reference examples, please refer to my other blog post: Database sharding series (1) Split implementation strategy and demo

 

1. Basic idea

      The basic idea of sharding is to divide one database into multiple parts and place them on different databases (servers), so as to relieve the performance problems of a single database. Loosely speaking, if a database holds massive data because it has many tables, vertical segmentation is suitable: closely related tables (for example, those of the same module) are split out and placed on one server. If there are not many tables but each table holds a lot of data, horizontal segmentation is suitable: the rows of a table are distributed across multiple databases (servers) according to some rule (such as hashing by ID). In reality the two situations are usually mixed, and you have to choose according to the actual case. You may also combine vertical and horizontal segmentation, dividing the original database into a matrix-like array of databases (servers) that can be expanded without limit. Vertical and horizontal segmentation are introduced in detail below.

      The biggest feature of vertical segmentation is that the rules are simple and the implementation is convenient. It is especially suitable for systems in which the business modules have very little mutual influence and the business logic is very clear. In such a system it is easy to split the tables used by different business modules into different databases. Splitting by table has less impact on the application, and the splitting rules are simpler and clearer. (This is also called "share nothing".)
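As a rough illustration of how simple the vertical rule can be, the sketch below routes a query purely by table name. The module names, table names, and database names are hypothetical and are not taken from the original article.

```python
# Hypothetical layout: each business module owns one database,
# and every table belongs to exactly one module.
MODULE_DATABASES = {
    "user": "db_user",
    "forum": "db_forum",
}

TABLE_MODULES = {
    "users": "user",
    "user_profiles": "user",
    "forums": "forum",
    "posts": "forum",
    "replies": "forum",
}


def database_for_table(table: str) -> str:
    """Return the database that holds a table under vertical segmentation."""
    module = TABLE_MODULES[table]  # KeyError for an unknown table
    return MODULE_DATABASES[module]
```

Because the lookup depends only on the table name, the application needs no knowledge of the row being read or written.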



      Horizontal segmentation is somewhat more complicated than vertical segmentation. Because different rows of the same table must be split into different databases, the splitting rule is more complicated for the application than splitting by table name, and later data maintenance is also more complicated.
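The "hash by ID" rule mentioned earlier can be sketched as a modulo over the row's ID. This is a minimal example that assumes integer IDs and a fixed shard count; the function names are illustrative only.

```python
from typing import Dict, Iterable, List


def shard_for_id(record_id: int, shard_count: int) -> int:
    """Map a row ID to a shard index by taking it modulo the shard count."""
    if shard_count <= 0:
        raise ValueError("shard_count must be positive")
    return record_id % shard_count


def distribute(record_ids: Iterable[int], shard_count: int) -> Dict[int, List[int]]:
    """Group row IDs by the shard they would be stored on."""
    shards: Dict[int, List[int]] = {i: [] for i in range(shard_count)}
    for record_id in record_ids:
        shards[shard_for_id(record_id, shard_count)].append(record_id)
    return shards
```

Unlike table-name routing, every read and write must now carry the ID, and changing `shard_count` later moves rows between shards, which is part of the extra maintenance cost described above.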



      Let us consider data segmentation in the general case. On the one hand, it is usually impossible for all the tables in a database to be linked together through a single table. This implies that horizontal segmentation almost always targets a small group of closely related tables (in effect, one block produced by vertical segmentation) and cannot be applied to all tables at once. On the other hand, in some systems with very high load, even a single table is more than one database host can bear, which means that vertical segmentation alone cannot fully solve the problem. Therefore most systems combine the two: the system is first segmented vertically, and then horizontal segmentation is applied selectively according to the situation of each small group of tables. The entire database is thereby divided into a distributed matrix.
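One way to picture the resulting matrix is a two-step lookup: the table name selects the vertical group, and the ID selects the horizontal shard within that group. The sketch below assumes a hypothetical layout in which some groups are sharded and others are not.

```python
from typing import Optional, Tuple

# Hypothetical layout: number of horizontal shards per vertical group.
# A group with one shard has not been segmented horizontally.
GROUP_SHARDS = {"user": 4, "forum": 2, "config": 1}

TABLE_GROUP = {
    "users": "user",
    "posts": "forum",
    "replies": "forum",
    "settings": "config",
}


def locate(table: str, root_id: Optional[int] = None) -> Tuple[str, int]:
    """Return (vertical group, horizontal shard index) for a row."""
    group = TABLE_GROUP[table]
    shard_count = GROUP_SHARDS[group]
    if shard_count == 1:
        return group, 0
    if root_id is None:
        raise ValueError(f"table {table!r} is sharded; an ID is required")
    return group, root_id % shard_count
```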

 

2. Segmentation strategy

      As mentioned earlier, segmentation proceeds in two steps: vertical first, then horizontal. The result of the vertical split prepares the ground for the horizontal split. The idea of vertical segmentation is to analyze the aggregation relationships between tables and put closely related tables together. In most cases these belong to the same module, or the same "aggregate", in the sense used in domain-driven design. Within each vertically segmented group of tables, find the "root element" (the "aggregate root" of domain-driven design) and segment horizontally by it: starting from the root element, put all data directly or indirectly associated with it into one shard. This makes cross-shard associations very unlikely, and the application does not have to break the existing relationships between tables. For example, on a social networking site almost all data is ultimately associated with a user, so segmenting by user is the best choice. Another example is a forum system: the user and forum modules should be placed in two separate shards by the vertical split, and within the forum module, Forum is obviously the aggregate root, so all posts and replies naturally go into the same shard as their Forum.
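For the forum example, routing by the aggregate root could look like the sketch below. It assumes that every post and reply carries the ID of its Forum (here a hypothetical `forum_id` field) so that the shard can be computed without looking up the parent row; the shard count is also an assumption.

```python
from dataclasses import dataclass

SHARD_COUNT = 4  # assumed number of shards for the forum aggregate


@dataclass
class Post:
    post_id: int
    forum_id: int  # ID of the aggregate root


@dataclass
class Reply:
    reply_id: int
    post_id: int
    forum_id: int  # root ID repeated on the reply so it can be routed directly


def shard_of(entity) -> int:
    """Route any member of the forum aggregate by the root's ID, not its own."""
    return entity.forum_id % SHARD_COUNT
```

A post and all of its replies then land on the same shard as their forum, so a join between them never crosses shards.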

      For shared data, if it is a read-only dictionary table, keeping a copy in each shard is a good choice, because associations with it then do not have to be broken. Cross-node associations between ordinary data, however, must be broken.

 

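      It should be noted in particular that when vertical and horizontal segmentation are performed together, the segmentation strategy changes subtly. For example, when only vertical segmentation is considered, the tables grouped together may keep arbitrary relationships among themselves, so you can divide tables by "functional module". Once horizontal segmentation is introduced, however, the relationships between tables are heavily constrained: usually only one master table (the table whose ID is used for hashing) and its secondary tables can keep their relationships. In other words, when vertical and horizontal segmentation are performed together, the vertical split is no longer made by "functional module" but needs to be finer-grained. This granularity coincides with, and can even be said to be identical to, the concept of an "aggregate" in domain-driven design: the master table of each shard is exactly the aggregate root of an aggregate! After splitting this way you will find that the database has been divided too finely (there are many shards, but few tables in each shard). To avoid managing too many data sources and to make full use of each database server's resources, consider placing two or more shards that are close in business terms and have similar data growth rates (master table sizes of the same order of magnitude) in the same data source. Each shard remains independent: it has its own master table and hashes on its own master table ID. The only requirement is that their hash modulus (that is, the number of nodes) must be the same.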
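The co-location rule can be sketched as follows. Two hypothetical aggregates, "forum" and "album", each hash on their own master table ID, but they share one modulus so that shard k of each aggregate sits in data source k. The aggregate names and the `ds_` naming are assumptions made for illustration.

```python
# Hypothetical aggregates that share the same set of data sources.
COLOCATED_AGGREGATES = {"forum", "album"}

# Shared hash modulus, equal to the number of nodes (data sources).
NODE_COUNT = 3


def data_source_for(aggregate: str, root_id: int) -> str:
    """Return the data source holding the shard for this aggregate root ID."""
    if aggregate not in COLOCATED_AGGREGATES:
        raise KeyError(aggregate)
    # Each aggregate hashes on its own root ID, but with the same modulus.
    return f"ds_{root_id % NODE_COUNT}"
```

The two aggregates stay independent: a forum with ID 7 and an album with ID 7 are unrelated rows that merely happen to be stored on the same server.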



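1. Transaction problem:
There are currently two feasible solutions to the transaction problem: distributed transactions, and transactions controlled jointly by the application and the database. A brief comparison of the two follows.
Solution 1: Use distributed transactions
    Advantages: managed by the database, simple and effective
    Disadvantages: high performance cost, especially as the number of shards grows
Solution 2: Joint control by the application and the database
     Principle: split a distributed transaction that spans multiple databases into several small transactions, each confined to a single database, and let the application coordinate the small transactions.
     Advantages: better performance
     Disadvantages: the application needs a flexible design for transaction control. If Spring's transaction management is used, making this change will be somewhat difficult.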
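A minimal sketch of Solution 2, using two in-memory SQLite databases as stand-ins for two shards. The `account` table and the transfer scenario are invented for illustration. Note that this is a best-effort scheme, not an atomic one: if the second commit fails after the first has succeeded, the application must compensate on its own.

```python
import sqlite3


def make_db(balance: int) -> sqlite3.Connection:
    """Create a stand-in shard holding a single account row."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE account (id INTEGER PRIMARY KEY, balance INTEGER)")
    db.execute("INSERT INTO account VALUES (1, ?)", (balance,))
    db.commit()
    return db


def balance(db: sqlite3.Connection) -> int:
    return db.execute("SELECT balance FROM account WHERE id = 1").fetchone()[0]


def transfer(src_db: sqlite3.Connection, dst_db: sqlite3.Connection, amount: int) -> bool:
    """Run one small local transaction per database, coordinated by the application."""
    try:
        src_db.execute("UPDATE account SET balance = balance - ? WHERE id = 1", (amount,))
        dst_db.execute("UPDATE account SET balance = balance + ? WHERE id = 1", (amount,))
        if balance(src_db) < 0:
            raise ValueError("insufficient funds")
    except Exception:
        # Any failure before commit: undo both local transactions.
        src_db.rollback()
        dst_db.rollback()
        return False
    # Both local transactions succeeded; commit them one after the other.
    src_db.commit()
    dst_db.commit()
    return True
```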
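2. Cross-node Join problem
      As long as data is segmented, cross-node joins are unavoidable, but good design and segmentation can reduce how often they occur. The common solution is to use two queries: find the IDs of the associated data in the result set of the first query, then issue a second request with those IDs to fetch the associated data.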
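The two-query approach might look like the sketch below, again with in-memory SQLite databases standing in for two nodes. The `posts` and `users` tables and their columns are hypothetical.

```python
import sqlite3
from typing import List, Optional, Tuple


def posts_with_authors(
    forum_db: sqlite3.Connection, user_db: sqlite3.Connection, forum_id: int
) -> List[Tuple[int, str, Optional[str]]]:
    """Join posts (on one node) with user names (on another) in the application."""
    # First query: fetch the posts and collect the IDs of the associated users.
    posts = forum_db.execute(
        "SELECT id, user_id, title FROM posts WHERE forum_id = ? ORDER BY id",
        (forum_id,),
    ).fetchall()
    user_ids = sorted({user_id for _, user_id, _ in posts})
    if not user_ids:
        return []
    # Second query: fetch the associated rows from the other node by ID.
    placeholders = ",".join("?" for _ in user_ids)
    rows = user_db.execute(
        f"SELECT id, name FROM users WHERE id IN ({placeholders})", user_ids
    ).fetchall()
    names = dict(rows)
    # Stitch the two result sets together in memory.
    return [(post_id, title, names.get(user_id)) for post_id, user_id, title in posts]
```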

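3. Cross-node count, order by, group by and aggregate function problems
      These are one class of problem, because they all need to be computed over the entire data set. Most proxies do not handle the merging automatically. Solution: similar to the cross-node join case, get the results on each node and then merge them in the application. Unlike a join, the queries on each node can run in parallel, so this is often much faster than querying a single large table. If the result set is large, however, the application's memory consumption becomes a problem.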
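The merge step can be sketched as below. Plain Python lists stand in for the shards, and the row layout `(user_id, city, score)` is invented for illustration. Each shard computes its partial count, sorted rows, and group counts in parallel, and the application combines them.

```python
import heapq
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Hypothetical per-shard data; each row is (user_id, city, score).
SHARDS = [
    [(1, "Beijing", 90), (4, "Shanghai", 70)],
    [(2, "Shanghai", 85), (5, "Beijing", 60)],
    [(3, "Beijing", 75)],
]


def query_shard(rows):
    """Partial results a single node would return for count, order by and group by."""
    return {
        "count": len(rows),
        "sorted": sorted(rows, key=lambda r: r[2], reverse=True),
        "by_city": Counter(r[1] for r in rows),
    }


def merged(shards, limit):
    # Query all shards in parallel, then merge in the application.
    with ThreadPoolExecutor() as pool:
        partials = list(pool.map(query_shard, shards))
    total = sum(p["count"] for p in partials)  # count(*)
    ordered = heapq.merge(  # order by score desc across shards
        *(p["sorted"] for p in partials), key=lambda r: r[2], reverse=True
    )
    top = list(ordered)[:limit]
    by_city = sum((p["by_city"] for p in partials), Counter())  # group by city
    return total, top, dict(by_city)
```

Note that `list(ordered)` materialises every row before truncating, which is exactly the memory cost the paragraph above warns about for large result sets.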

 

References:

《MySQL性能调优与架构设计》 (MySQL Performance Tuning and Architecture Design)

 

 

Note: the images in this article are taken from the book 《MySQL性能调优与架构设计》 (MySQL Performance Tuning and Architecture Design).

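Related reading:

Database sharding series (5) A sharding expansion scheme that supports free planning without data migration or routing code changes

Database sharding series (4) Transaction processing across multiple data sources

Database sharding series (3) Considerations on using a framework versus in-house development, and on the layer at which sharding is implemented

Database sharding series (2) Global primary key generation strategy

Database sharding series (1) Split implementation strategy and demo

On the granularity of vertical sharding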

 
