Database and table sharding: application scenarios (repost)

Application scenarios for database and table sharding

Database sharding and table sharding address two scenarios that are common on today's Internet: large data volumes and high concurrency. Splitting is usually either vertical or horizontal.

Vertical splitting divides a database (table) into multiple databases (tables) along business lines, for example by placing frequently accessed and infrequently accessed fields in different databases or tables. Because vertical splitting is tightly coupled to the business, current sharding products all implement horizontal splitting.

Horizontal splitting divides a database (table) into multiple databases (tables) according to a sharding algorithm. For example, take the last digit of the ID modulo 3: rows with a remainder of 1 go into the first database (table), rows with a remainder of 2 go into the second database (table), and so on.
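The last-digit rule can be sketched in a few lines. This is only an illustration in Python, not Sharding-JDBC code: the function names and the `t_order_N` naming scheme are assumptions made for the example.

```python
def shard_index(record_id: int, shard_count: int = 3) -> int:
    """Return the remainder of the ID's last decimal digit divided by shard_count."""
    last_digit = abs(record_id) % 10
    return last_digit % shard_count


def physical_table(logical_table: str, record_id: int, shard_count: int = 3) -> str:
    """Append the remainder to the logical table name (hypothetical naming scheme)."""
    return f"{logical_table}_{shard_index(record_id, shard_count)}"
```

Under these assumptions, an ID ending in 4 leaves a remainder of 1 and maps to `t_order_1`. The rule is not perfectly even: the digits 0, 3, 6, and 9 all map to remainder 0, while the other two remainders receive three digits each.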

The retrieval performance of a relational database drops sharply once the data volume passes a certain size. With Internet-scale data, storing everything in one table easily exceeds the volume that a single table can handle. That threshold has to be measured by actual testing, because it varies with the database and the level of concurrency.

Simple table sharding solves slow retrieval caused by excessive data volume, but it does not help when too many concurrent requests hit the same database and slow its response. Horizontal splitting therefore usually needs at least database sharding, so that large data volume and high concurrency are addressed together. This is why some open source sharding middleware supports only database sharding.

Table sharding nevertheless has scenarios in which it cannot be replaced. The most common problem that sharding has to deal with is transactions. Within a single database there is no need to consider distributed transactions, so making good use of different tables in the same database effectively avoids the trouble that distributed transactions bring. Because of their performance cost, strongly consistent distributed transactions do not necessarily make a sharded system faster than an unsharded one, and most implementations today use flexible transactions with eventual consistency instead. Another reason for table sharding is that too many database instances make operation and maintenance harder. In short, the best practice is a sensible combination of database sharding and table sharding.

Introduction to Sharding-JDBC

Sharding-JDBC is a horizontal database sharding framework extracted from dd-rdb, the relational database module of Dangdang's application framework ddframe. It provides transparent access to sharded databases and tables. Sharding-JDBC is the third open source project in the ddframe series, after dubbox and elastic-job.

Sharding-JDBC wraps the JDBC API directly and can be understood as an enhanced JDBC driver. The cost of migrating existing code is almost zero:

  • It works with any Java-based ORM framework, such as JPA, Hibernate, MyBatis, or Spring JDBC Template, or with JDBC used directly.
  • It works with any third-party database connection pool, such as DBCP, C3P0, BoneCP, or Druid.
  • In theory, any database that implements the JDBC specification can be supported. Only MySQL is supported at present, but support for databases such as Oracle and SQL Server is planned.

Sharding-JDBC is positioned as a lightweight Java framework. The client connects directly to the database, and the framework is delivered as a jar. There is no proxy layer, nothing extra to deploy, and no other dependencies, and DBAs do not need to change how they operate and maintain the database.

Sharding-JDBC has flexible sharding strategies. It supports multi-dimensional sharding on the =, between, and in operators, as well as multiple shard keys.

Its SQL parsing is comprehensive: it supports queries with aggregation, grouping, sorting, limit, and or, as well as Binding Table and Cartesian-product table queries.

Comparison with common open source products

Out of respect for other open source projects, we do not comment on projects that are still being updated. Table 1 compares only a few projects that are no longer updated but remain influential in the field of database sharding.

Table 1 Comparison of database sharding tools

 

As Table 1 shows, Cobar is a middle-tier solution that places a Proxy layer between the application and MySQL. Because the middle tier sits between the application and the database, every request is forwarded one extra time. A JDBC-based solution has no extra forwarding: the application connects directly to the database, which gives it a slight performance advantage. This does not mean that a middle tier is necessarily inferior to a direct client connection. Performance is only one of many factors, and a middle tier makes functions such as monitoring, data migration, and connection management easier to implement.

Cobar-Client, TDDL, and Sharding-JDBC are all direct client connection solutions. Their advantages are that they are lightweight, compatible, fast, and have little impact on DBAs. Cobar-Client is implemented on top of an ORM framework (MyBatis), so its compatibility and extensibility fall short of the latter two, which are based on the JDBC protocol.

Implementation principle

As mentioned above, Sharding-JDBC is a jar that implements the JDBC protocol. An implementation based on the JDBC protocol differs slightly from a middle tier based on a database protocol such as MySQL's.

Whichever architecture is used, the core logic is very similar. Apart from the protocol implementation layer (JDBC or a database protocol), both are divided into modules for sharding rule configuration, SQL parsing, SQL rewriting, SQL routing, SQL execution, and result merging.

See Figure 1 for the overall architecture diagram of Sharding-JDBC.

Figure 1 Overall architecture diagram of Sharding-JDBC

 

Sharding rule configuration

The sharding logic of Sharding-JDBC is very flexible: it supports custom sharding strategies, composite shard keys, and sharding on multiple operators.

For example, a combined strategy can shard databases by user ID and tables by order ID. A multi-key strategy can shard databases by year and tables by month plus user region ID.
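A combined strategy of this kind (database by user ID, table by order ID) might look like the following sketch. It is written in Python for illustration and is not Sharding-JDBC's configuration API. The modulo rule, the shard counts, and the `ds_N` and `t_order_N` names are all assumptions.

```python
from typing import Tuple

DB_COUNT = 2        # hypothetical number of databases
TABLES_PER_DB = 4   # hypothetical number of order tables in each database


def locate_order(user_id: int, order_id: int) -> Tuple[str, str]:
    """Pick the database from the user ID and the table from the order ID."""
    database = f"ds_{user_id % DB_COUNT}"
    table = f"t_order_{order_id % TABLES_PER_DB}"
    return database, table
```

With these assumed counts, all orders of one user stay in the same database, while that user's orders spread over the four tables inside it.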

In addition to the equality operator, Sharding-JDBC also supports sharding on the in and between operators, which makes its sharding more powerful.

Sharding-JDBC provides a Spring namespace to simplify configuration and a rule engine to simplify writing strategies. Because the core sharding logic has only just been open-sourced, these two modules have not been released yet. Other modules will be open-sourced once the core is stable.

JDBC specification rewrite

Sharding-JDBC rewrites the JDBC specification by wrapping five core interfaces: DataSource, Connection, Statement, PreparedStatement, and ResultSet. The sets of real JDBC implementation classes (such as the MySQL JDBC implementation or the DBCP JDBC implementation) are placed under the management of Sharding-JDBC's own implementation classes.

Sharding-JDBC implements as much of the JDBC protocol as possible, including batch update functions such as addBatch, which JPA uses. A sharded JDBC is still different from native JDBC, however, so some interfaces remain unimplemented. These include Connection functions related to cursors, stored procedures, and savepoints, ResultSet scrolling back to earlier rows and modification of results, and other rarely used functions. In addition, to preserve compatibility, interfaces released in JDBC 4.1 and later are not implemented (for example, DBCP 1.x does not support JDBC 4.1).

SQL parsing

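SQL parsing is the core of any sharding product, and performance and compatibility are its most important metrics. The common SQL parsers today are fdb, jsqlparser, and Druid. Sharding-JDBC uses Druid as its SQL parser. In actual tests, Druid parsed SQL dozens of times faster than the other two parsers.

Sharding-JDBC currently supports parsing of complex SQL such as join, aggregation (including avg), order by, group by, limit, and even or queries. It does not currently support union, some subqueries, sharding inside functions, and other SQL that should rarely appear in a sharding scenario.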

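SQL rewriting

SQL rewriting has two parts. The first replaces the logical table name of a sharded table with the real table name. The second uses the SQL parsing result to replace functions that would give incorrect results in a sharded environment. Here are two examples: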

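The first example is the avg calculation. In a sharded environment, computing the average as (avg1 + avg2 + avg3) / 3 is incorrect. It must be rewritten as (sum1 + sum2 + sum3) / (count1 + count2 + count3). This requires rewriting SQL that contains avg into sum and count, and then recomputing the average during result merging.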
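A small numeric sketch shows why the rewrite is needed. The Python functions below are hypothetical helpers written for this illustration. They only model the arithmetic and are not the way Sharding-JDBC implements the merge.

```python
from typing import List, Optional, Tuple


def naive_avg(shard_avgs: List[float]) -> float:
    """Incorrect in general: averages the per-shard averages."""
    return sum(shard_avgs) / len(shard_avgs)


def merged_avg(partials: List[Tuple[float, int]]) -> Optional[float]:
    """Correct: divide the total of the per-shard sums by the total of the counts.

    Each element of partials is a (sum, count) pair returned by one shard.
    Returns None when no rows matched on any shard.
    """
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count if count else None
```

Suppose one shard holds a single row with value 10 and another holds four rows that sum to 20. The per-shard averages are 10 and 5, so the naive result is 7.5, while the true average over all five rows is 30 / 5 = 6.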

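The second example is pagination. Suppose each page holds 10 rows and page 2 is requested. In a sharded environment, fetching limit 10, 10 from each shard and then taking the first 10 rows after merging by the sort condition gives an incorrect result. The correct approach is to rewrite the pagination condition to limit 0, 20, fetch all rows of the first two pages, and then compute the correct rows using the sort condition. The further back a page is, the less efficient limit pagination becomes and the more memory it wastes. There are many ways to avoid paginating with limit, such as building a secondary index that records row counts and row offsets, or using the last ID of the previous page as a condition of the next query.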
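The pagination rewrite can be checked with a small model in which each shard is a Python list that is already sorted. The helper names are hypothetical, and list slicing stands in for executing limit on a shard. This is a sketch of the idea rather than Sharding-JDBC's implementation.

```python
import heapq
from itertools import islice
from typing import List, Tuple


def rewrite_limit(offset: int, row_count: int) -> Tuple[int, int]:
    """Rewrite 'limit offset, row_count' into the form that every shard must run."""
    return 0, offset + row_count


def naive_page(shards: List[List[int]], offset: int, row_count: int) -> List[int]:
    """Incorrect: applies the original offset on every shard, then merges."""
    partials = [rows[offset:offset + row_count] for rows in shards]
    return list(islice(heapq.merge(*partials), row_count))


def merged_page(shards: List[List[int]], offset: int, row_count: int) -> List[int]:
    """Correct: fetch the first offset + row_count rows per shard, merge, then skip."""
    start, fetch = rewrite_limit(offset, row_count)
    partials = [rows[start:start + fetch] for rows in shards]
    return list(islice(heapq.merge(*partials), offset, offset + row_count))
```

With shards `[1, 3, 5, 7, 9]` and `[2, 4, 6, 8, 10]`, a page size of 2, and an offset of 2, the rewritten query returns 3 and 4, which is the true second page. The naive version returns 5 and 6. The sketch also makes the cost visible: every shard has to return `offset + row_count` rows, which grows with the page number.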

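SQL routing

SQL routing uses the sharding rule configuration to direct SQL to the real data sources. It is divided into single-table routing, Binding-table routing, and Cartesian-product routing.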

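Single-table routing is the simplest, but the result does not necessarily land in a single database (table). Because sharding on operators such as between and in is supported, the final result may still span multiple databases (tables).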
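The following sketch illustrates how one condition on the shard key can resolve to one shard or to several. It assumes a simple modulo rule over four shards. The function names are made up for the example and do not correspond to Sharding-JDBC classes.

```python
from typing import Iterable, Set

SHARD_COUNT = 4  # hypothetical number of shards; shard = key % SHARD_COUNT


def route_equal(value: int) -> Set[int]:
    """An equality condition always resolves to exactly one shard."""
    return {value % SHARD_COUNT}


def route_in(values: Iterable[int]) -> Set[int]:
    """An in condition resolves to the union of the shards of its values."""
    return {value % SHARD_COUNT for value in values}


def route_between(low: int, high: int) -> Set[int]:
    """A between condition covers every shard once the range spans SHARD_COUNT keys."""
    if high - low + 1 >= SHARD_COUNT:
        return set(range(SHARD_COUNT))
    return {value % SHARD_COUNT for value in range(low, high + 1)}
```

Under a modulo rule, a wide between range touches every shard, so it gains nothing from routing. A range-based sharding rule would narrow such queries better, which is one reason the choice of sharding algorithm depends on the query pattern.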

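Binding tables can be understood as parent and child tables whose database and table sharding rules are exactly the same. For example, the order table and the order detail table both use the order ID as the shard key, so their sharding logic is identical at all times. A join between such tables is comparable to a single-table query in difficulty and performance.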

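Cartesian-product queries are the most complex. Because there is no Binding relationship to establish that the sharding rules are consistent, a join between non-Binding tables must be broken down into a Cartesian product of table combinations and executed as such. Query performance is low and the number of database connections is high, so use this with caution.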
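The difference in cost between the two kinds of join can be seen by counting the table pairs that have to be queried. The sketch below uses hypothetical table names and assumes that Binding tables with the same suffix always hold matching rows.

```python
from itertools import product
from typing import List, Tuple


def binding_route(left: List[str], right: List[str]) -> List[Tuple[str, str]]:
    """Binding tables: shard i of one table only ever joins shard i of the other."""
    return list(zip(left, right))


def cartesian_route(left: List[str], right: List[str]) -> List[Tuple[str, str]]:
    """Non-Binding tables: every shard of one table may match every shard of the other."""
    return list(product(left, right))
```

For two tables with n shards each, a Binding join runs n statements and a Cartesian join runs n * n, which is where the extra connections and the lower performance come from.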

SQL execution

After routing to the real data sources, Sharding-JDBC executes the SQL concurrently on multiple threads and handles batch methods such as addBatch.
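Concurrent execution across routed data sources can be pictured with a thread pool. In this Python sketch, `run_on_shard` and the in-memory `FAKE_SHARDS` data are stand-ins invented for the example. A real implementation would issue JDBC calls instead, and the source does not describe how Sharding-JDBC manages its threads.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List

# Hypothetical in-memory stand-in for two physical data sources.
FAKE_SHARDS: Dict[str, List[int]] = {"ds_0": [1, 3], "ds_1": [2, 4]}


def run_on_shard(data_source: str, sql: str) -> List[int]:
    """Stand-in for executing sql on one data source and reading its rows."""
    return list(FAKE_SHARDS[data_source])


def execute_concurrently(data_sources: List[str], sql: str) -> List[List[int]]:
    """Run the same SQL on every routed data source, one thread per data source."""
    with ThreadPoolExecutor(max_workers=max(1, len(data_sources))) as pool:
        # map() returns results in the order of data_sources, not completion order.
        return list(pool.map(lambda ds: run_on_shard(ds, sql), data_sources))
```

The per-shard results come back as separate lists, which is the input that the result merging step works on.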

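Result merging

Result merging falls into four types: plain iteration, sorting, aggregation, and grouping. Each type first skips the rows that are not needed, according to the pagination result.

Plain iteration is the simplest: it only needs to traverse the collection of ResultSets in order.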

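Sorted results must be ordered before they are output. Because each shard has already sorted its own result by the ordering condition, a merge sort algorithm is used to combine them into the final result.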
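Merging several lists that are already sorted is a k-way merge, which Python's standard library provides as `heapq.merge`. The sketch below uses it to illustrate the idea. The row layout of `(id, amount)` tuples and the function name are assumptions made for the example.

```python
import heapq
from operator import itemgetter
from typing import Callable, Iterator, List, Tuple

Row = Tuple[int, int]  # hypothetical row layout: (id, amount)


def merge_sorted(shard_results: List[List[Row]],
                 key: Callable[[Row], int],
                 descending: bool = False) -> Iterator[Row]:
    """Lazily merge per-shard results that are each already sorted by key."""
    return heapq.merge(*shard_results, key=key, reverse=descending)


by_id = itemgetter(0)
```

Because the merge is lazy, only one pending row per shard has to be held at a time, so a caller that needs the first few rows does not have to load every shard's result into memory.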

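Aggregation has three kinds: comparison, accumulation, and average. Comparison covers max and min, and returns only the largest (smallest) result. Accumulation covers sum and count, whose results are added together before being returned. Averages are computed from the sum and count produced by SQL rewriting. This is covered in the SQL rewriting section and is not repeated here.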
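The comparison and accumulation cases reduce to very small functions. This sketch is illustrative only, and the `merge_aggregate` name and its string argument are assumptions. It takes one partial value from each shard and combines them.

```python
from typing import List


def merge_aggregate(kind: str, partials: List[int]) -> int:
    """Combine one partial aggregate value per shard into the final value."""
    if kind == "max":
        return max(partials)   # comparison type: keep the largest
    if kind == "min":
        return min(partials)   # comparison type: keep the smallest
    if kind in ("sum", "count"):
        return sum(partials)   # accumulation type: add the partial results
    raise ValueError(f"unsupported aggregate: {kind}")
```

Note that count is merged by adding the per-shard counts, not by counting the number of partial results.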

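Grouping is the most complex. All ResultSet results must be loaded into memory and grouped with a map-reduce algorithm, and then processed according to the sorting and aggregation conditions. This is the part that consumes the most memory and costs the most performance. Consider using limit to keep the size of the grouped data reasonable.

Result merging does not yet use a pipelined approach. More improvements are planned here.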
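An in-memory grouping merge can be sketched with a dictionary keyed by the group value. The example assumes each shard returns `(group_key, partial_sum)` rows for a query of the form sum ... group by. The function name is hypothetical, and the final sort stands in for the ordering step.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def merge_groups(shard_results: List[List[Tuple[str, int]]]) -> List[Tuple[str, int]]:
    """Add up per-shard partial sums for each group key, then order by key."""
    totals: Dict[str, int] = defaultdict(int)
    for rows in shard_results:            # every shard's rows are held in memory
        for group_key, partial_sum in rows:
            totals[group_key] += partial_sum
    return sorted(totals.items())
```

The dictionary grows with the number of distinct groups across all shards, which matches the memory cost described above.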

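Performance

Performance test results when the route resolves to a single database and a single table:

Query: the TPS of Sharding-JDBC is 99.8% of the TPS of JDBC;
Insert: the TPS of Sharding-JDBC is 90.2% of the TPS of JDBC;
Update: the TPS of Sharding-JDBC is 93.1% of the TPS of JDBC;
As these figures show, the performance loss of Sharding-JDBC is very low.

Performance test results when the route resolves to multiple databases and multiple tables:

Query: TPS with two databases is about 94% higher than with a single database;
Insert: TPS with two databases is about 60% higher than with a single database;
Update: TPS with two databases is about 89% higher than with a single database;
The results show that Sharding-JDBC can make effective use of multithreading and distributed resources to improve performance substantially.
For more details, see the Sharding-JDBC performance test report.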

Roadmap

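Sharding-JDBC is currently focused on developing the core sharding logic. Once the functionality is stable, it will continue to be updated along the following roadmap:

  • Read/write splitting;
  • Flexible distributed transactions;
  • Distributed primary key generation strategies;
  • SQL rewriting optimization to further improve performance;
  • SQL Hint, which lets a given SQL statement be executed on a specific database and table, routed by business rules instead of SQL parsing;
  • Small-table broadcast;
  • HA-related features;
  • Traffic control;
  • Database table creation tools;
  • Data migration;
  • Support for parsing complex SQL, such as subqueries and stored procedures;
  • Oracle and SQL Server support;
  • Configuration center;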

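Open source philosophy

Many open source products in China today are first tested over time inside a company, then stripped of business logic and sensitive code, and only then contributed to the community as open source. The advantage is that the open source product is relatively mature. The drawbacks are unavoidable, however. The main ones are:

  1. Lack of follow-up support. The product already meets the needs of the company's business scenarios, so there is little motivation to improve it further. Documentation and support are also relatively scarce, and documentation can even fall out of sync with the code.
  2. Heavy coupling to the company's business scenarios. Most framework products are built to solve specific problems. For example, some companies may not need table sharding at all, and some need only a few sharding strategies.
  3. Incomplete open sourcing. The parts that are tightly coupled to the company's business are not open-sourced.
  4. Lack of stickiness. A fairly mature project has many features and a complex code structure, so community volunteers find it hard to extend or modify the core logic. If test coverage is insufficient, the quality of modified code is hard to guarantee. Together, these problems mean the project does not hold the community's interest, and volunteers to collaborate on development are hard to find.
  5. Many branches that are hard to maintain. After open sourcing, the company has little motivation to keep improving the product, and features that matter little to that company are neglected, so each company develops its own branch. The open source project injects fresh ideas into the community at first, but in the end it does not absorb the best of the community. For example, Dubbo attracted considerable attention as soon as it appeared, and companies each maintained their own version, such as Dangdang's DubboX, but in the end Dubbo did not continue to develop.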

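We are trying a new open source strategy: as soon as the first version of Sharding-JDBC was finished, we promoted it to the community and inside Dangdang at the same time. The benefits are:

  • Solid follow-up support. Sharding-JDBC is tied to its adoption inside Dangdang, so support will be provided inside Dangdang and to the community at the same time. Although we cannot promise that community needs will take priority over Dangdang's internal needs, we will weigh community and internal needs together and, from a broader perspective, do our best to integrate and optimize the upgrade path.
  • Complete open sourcing. Snapshot versions of the code will always appear on GitHub first.
  • Growing together. The Sharding-JDBC code is currently fairly simple, which lets open source enthusiasts in the community understand the core of the code more easily and lays a foundation for continued development. Sharding-JDBC will also absorb the best of the community and let more enthusiasts contribute code.

Finally, to be clear: although it has not yet stood the test of time, Sharding-JDBC is not a bug-ridden, unusable project. Test coverage currently exceeds 90%, and the detailed features and unsupported items are listed explicitly in the documentation on GitHub, so that users know what to expect.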
