Database and table sharding application scenarios (repost)

Database and table sharding application scenarios

Database sharding and table sharding address two common situations on today's Internet: large data volumes and high concurrency. Sharding is usually divided into vertical splitting and horizontal splitting.

Vertical splitting divides one database (table) into several databases (tables) along business lines, for example by moving frequently accessed and rarely accessed fields into different databases or tables. Because vertical splitting is so closely tied to the business, current sharding products all use horizontal splitting.

Horizontal splitting divides one database (table) into several databases (tables) according to a sharding algorithm. For example, take the last digit of the ID modulo 3: rows with a remainder of 1 go to the first database (table), rows with a remainder of 2 go to the second, and so on.
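The modulo rule described above can be sketched in a few lines of Java. This is only an illustration: the class and method names are hypothetical and are not part of Sharding-JDBC, and the sketch returns a zero-based shard index, so how a remainder maps to "the first" or "the second" database is a matter of convention.

```java
// Hypothetical helper for illustration; not part of any sharding framework.
final class ModuloSharding {
    private ModuloSharding() {}

    /** Returns a zero-based shard index: (last decimal digit of id) mod shardCount. */
    static int shardFor(long id, int shardCount) {
        if (shardCount <= 0) {
            throw new IllegalArgumentException("shardCount must be positive");
        }
        // id % 10 lies in -9..9, so Math.abs is safe here.
        int lastDigit = (int) Math.abs(id % 10);
        return lastDigit % shardCount;
    }
}
```

With three shards, ModuloSharding.shardFor(1234567L, 3) looks at the last digit 7 and returns 7 mod 3 = 1. Note that using only the last digit with a divisor of 3 spreads data unevenly, because four digits (0, 3, 6, 9) map to remainder 0 while only three digits map to each of the other remainders.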

The retrieval performance of a relational database drops sharply once the amount of data exceeds a certain threshold. With the massive data volumes found on the Internet, storing all data in one table easily exceeds the volume a single table can handle. That threshold depends on the database product and on the concurrency level, so it has to be determined by actual testing.

Table sharding alone solves slow retrieval caused by excessive data volume, but it does not solve slow database response caused by too many concurrent requests hitting the same database. Horizontal splitting therefore usually needs at least database sharding, so that large data volume and high concurrency are handled together. This is why some open source sharding middleware supports only database sharding.

Table sharding nevertheless has scenarios of its own where it cannot be replaced. The most common reason for table sharding is transactions: tables in the same database do not need distributed transactions, so making good use of different tables within one database effectively avoids the trouble that distributed transactions bring. Because of their performance problems, strongly consistent distributed transactions do not necessarily make a sharded system faster than one that is not sharded at all, and most flexible transactions today settle for eventual consistency. Another reason for table sharding is that too many database instances make operations and maintenance harder. In short, the best practice is a sensible combination of database sharding and table sharding.

Introduction to Sharding-JDBC

Sharding-JDBC is a horizontal database sharding framework extracted from dd-rdb, the relational database module of Dangdang's application framework ddframe. It provides transparent access to sharded databases and tables. Sharding-JDBC is the third open source project in the ddframe series, after dubbox and elastic-job.

Sharding-JDBC directly encapsulates the JDBC API, which can be understood as an enhanced version of the JDBC driver. The migration cost of old code is almost zero:

  • Works with any Java-based ORM framework, such as JPA, Hibernate, MyBatis, or Spring JDBC Template, or with JDBC used directly.
  • Works with any third-party database connection pool, such as DBCP, C3P0, BoneCP, or Druid.
  • In theory supports any database that implements the JDBC specification. Only MySQL is supported at present, but support for databases such as Oracle and SQL Server is planned.

Sharding-JDBC is positioned as a lightweight Java framework. The client connects directly to the database and the framework is delivered as a jar: there is no proxy layer, no extra deployment, and no other dependency, and DBAs do not need to change how they operate and maintain the databases.

Sharding-JDBC has a flexible sharding strategy. It supports sharding on the =, BETWEEN, and IN operators, as well as multiple sharding keys.

SQL parsing is comprehensive: it supports aggregation, grouping, sorting, LIMIT, OR, and other queries, as well as Binding Table and Cartesian product table queries.

Comparison with common open source products

Out of respect for other open source projects, we do not comment on projects that are still being updated. Table 1 compares only a few projects that are no longer updated but remain very influential in the field of database sharding.

Table 1 Comparison of database sharding tools

As Table 1 shows, Cobar is a middle-tier solution that places a Proxy layer between the application and MySQL. Because the middle tier sits between the application and the database, every request has to be forwarded once. A solution based on the JDBC protocol has no such extra hop: the application connects directly to the database, which gives a slight performance advantage. This does not mean the middle tier is necessarily inferior to a direct client connection. Many factors besides performance have to be considered, and a middle tier makes functions such as monitoring, data migration, and connection management easier to implement.

Cobar-Client, TDDL, and Sharding-JDBC are all client-side direct connection solutions. Their advantages are that they are lightweight, compatible, fast, and have little impact on DBAs. Cobar-Client is implemented on top of an ORM framework (MyBatis), so its compatibility and extensibility are not as good as those of the latter two, which are based on the JDBC protocol.

Implementation principle

As mentioned above, Sharding-JDBC is a jar file that implements the JDBC protocol. The implementation based on the JDBC protocol is slightly different from the middle layer based on database protocols such as MySQL.

Whichever architecture is used, the core logic is very similar. Apart from the protocol implementation layer (JDBC or a database protocol), both are divided into modules such as sharding rule configuration, SQL parsing, SQL rewriting, SQL routing, SQL execution, and result merging.

See Figure 1 for the overall architecture diagram of Sharding-JDBC.

Figure 1 Overall architecture diagram of Sharding-JDBC

Sharding rule configuration

The sharding logic of Sharding-JDBC is very flexible, and supports functions such as sharding strategy customization, complex sharding keys, and multi-operator sharding.

For example, a strategy can shard databases by user ID and tables by order ID. Multi-key sharding is also possible, such as databases by year and tables by month plus user region ID.
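A combined strategy of this kind, with the database chosen by user ID and the table by order ID, might look like the following sketch. It assumes simple modulo sharding on both keys; the class name and the ds_ and t_order_ naming patterns are hypothetical and do not reflect how Sharding-JDBC itself is configured.

```java
// Hypothetical rule for illustration; names and patterns are assumptions.
final class OrderShardingRule {
    private final int databaseCount;
    private final int tablesPerDatabase;

    OrderShardingRule(int databaseCount, int tablesPerDatabase) {
        if (databaseCount <= 0 || tablesPerDatabase <= 0) {
            throw new IllegalArgumentException("counts must be positive");
        }
        this.databaseCount = databaseCount;
        this.tablesPerDatabase = tablesPerDatabase;
    }

    /** The database is chosen by user ID, the table inside it by order ID. */
    String route(long userId, long orderId) {
        // floorMod keeps the index non-negative even for negative keys.
        long database = Math.floorMod(userId, (long) databaseCount);
        long table = Math.floorMod(orderId, (long) tablesPerDatabase);
        return "ds_" + database + ".t_order_" + table;
    }
}
```

With two databases of four tables each, new OrderShardingRule(2, 4).route(11L, 1001L) yields ds_1.t_order_1, because 11 mod 2 = 1 and 1001 mod 4 = 1.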

Besides sharding on the = operator, Sharding-JDBC also supports sharding on the IN and BETWEEN operators, which makes its sharding functions more powerful.

Sharding-JDBC provides a Spring namespace to simplify configuration and a rule engine to simplify writing sharding strategies. Because only the core sharding logic has been open-sourced so far, these two modules are not yet open source; they and other modules will be released once the core is stable.

JDBC specification rewrite

Sharding-JDBC rewrites the JDBC specification by wrapping five core interfaces: DataSource, Connection, Statement, PreparedStatement, and ResultSet. The collections of real JDBC implementation classes (such as the MySQL JDBC implementation or the DBCP JDBC implementation) are then managed by the Sharding-JDBC implementation classes.

Sharding-JDBC implements as much of the JDBC protocol as possible, including addBatch, the batch update method that JPA uses. A sharded JDBC is still different from native JDBC, however, so some interfaces remain unimplemented. These include Connection functions related to cursors, stored procedures, and savePoint, ResultSet forward traversal and modification, and other rarely used functions. In addition, to preserve compatibility, interfaces introduced in JDBC 4.1 and later are not implemented (for example, DBCP 1.x does not support JDBC 4.1).

SQL parsing

SQL parsing is the core of any sharding product, and performance and compatibility are its most important metrics. The common SQL parsers at present are fdb, jsqlparser, and Druid. Sharding-JDBC uses Druid as its SQL parser. In actual tests, Druid parsed dozens of times faster than the other two.

Sharding-JDBC currently supports parsing complex SQL such as JOIN, aggregation (including AVG), ORDER BY, GROUP BY, LIMIT, and even OR queries. SQL that should not occur in sharding scenarios, such as UNION, some subqueries, and sharding inside functions, is not supported at present.

SQL rewrite

The SQL rewriting is divided into two parts. One part is to replace the logical table name of the sub-table with the real table name. The other part is to replace some functions that are not correct in a sharded environment based on the SQL parsing results. Here are two examples:

The first example is the AVG calculation. In a sharded environment it is incorrect to compute the average as (avg1 + avg2 + avg3) / 3; it must be computed as (sum1 + sum2 + sum3) / (count1 + count2 + count3). The SQL containing AVG therefore has to be rewritten to use SUM and COUNT, and the average is recalculated when the results are merged.
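The difference between the two formulas is easy to see with concrete numbers. The sketch below is not Sharding-JDBC code: the AvgMerger and Partial names are hypothetical, and each Partial stands for the SUM and COUNT that one shard would return after the rewrite.

```java
import java.util.List;
import java.util.OptionalDouble;

// Hypothetical merger for illustration only.
final class AvgMerger {
    private AvgMerger() {}

    /** Per-shard result of the rewritten query (SUM and COUNT instead of AVG). */
    static final class Partial {
        final double sum;
        final long count;

        Partial(double sum, long count) {
            this.sum = sum;
            this.count = count;
        }
    }

    /** Correct merge: total sum divided by total count; empty if there are no rows. */
    static OptionalDouble mergedAverage(List<Partial> partials) {
        double totalSum = 0;
        long totalCount = 0;
        for (Partial p : partials) {
            totalSum += p.sum;
            totalCount += p.count;
        }
        return totalCount == 0 ? OptionalDouble.empty() : OptionalDouble.of(totalSum / totalCount);
    }

    /** Incorrect merge, shown for contrast: the plain mean of the per-shard averages. */
    static OptionalDouble averageOfAverages(List<Partial> partials) {
        return partials.stream()
                .filter(p -> p.count > 0)
                .mapToDouble(p -> p.sum / p.count)
                .average();
    }
}
```

For three shards returning (sum 10, count 1), (sum 20, count 4), and (sum 30, count 5), the correct average is 60 / 10 = 6.0, while the average of the per-shard averages 10, 5, and 6 is 7.0.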

The second example is pagination. Suppose each page holds 10 rows and the second page is requested. Running LIMIT 10, 10 on every shard and then taking the first 10 rows of the merged result according to the sort conditions gives an incorrect result. The correct approach is to rewrite the per-shard condition to LIMIT 0, 20, fetch all of the first two pages from every shard, and then compute the correct rows from the merged data using the sort conditions. It follows that the deeper the page, the less efficient LIMIT pagination becomes and the more memory it wastes. There are many ways to avoid paginating with LIMIT, such as building a secondary index that records row counts and row offsets, or using the last ID of the previous page as a condition for the next query.
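A small simulation makes the pagination problem concrete. In this sketch, which uses hypothetical names and is not taken from Sharding-JDBC, integers stand in for sorted rows, each inner list is one shard, and limit imitates what ORDER BY with LIMIT offset, count returns on a single shard.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical helper for illustration; integers stand in for sorted rows.
final class ShardedPagination {
    private ShardedPagination() {}

    /** Simulates "LIMIT offset, count" on rows that are already sorted. */
    static List<Integer> limit(List<Integer> sortedRows, int offset, int count) {
        int from = Math.min(offset, sortedRows.size());
        int to = Math.min(offset + count, sortedRows.size());
        return new ArrayList<>(sortedRows.subList(from, to));
    }

    /** Correct: query every shard with LIMIT 0, offset + count, merge, sort, then cut the page. */
    static List<Integer> page(List<List<Integer>> shards, int offset, int count) {
        List<Integer> merged = new ArrayList<>();
        for (List<Integer> shard : shards) {
            merged.addAll(limit(shard, 0, offset + count)); // rewritten per-shard limit
        }
        Collections.sort(merged);
        return limit(merged, offset, count);
    }

    /** Incorrect, for contrast: push the original LIMIT down to every shard unchanged. */
    static List<Integer> naivePage(List<List<Integer>> shards, int offset, int count) {
        List<Integer> merged = new ArrayList<>();
        for (List<Integer> shard : shards) {
            merged.addAll(limit(shard, offset, count));
        }
        Collections.sort(merged);
        return limit(merged, 0, count);
    }
}
```

With shards [1, 3, 5, 7] and [2, 4, 6, 8] and a page size of 2, the second page (offset 2) should be [3, 4]. The page method returns exactly that, while naivePage returns [5, 6], because each shard has already discarded rows that belong to earlier global pages.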

SQL routing

SQL routing uses the sharding rule configuration to direct SQL to the real data sources. It is divided into single-table routing, Binding Table routing, and Cartesian product routing.

Single-table routing is the simplest, but the result does not necessarily fall into a single database (table): because sharding on operators such as BETWEEN and IN is supported, the final result may still span several databases (tables).
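The following sketch shows why IN and BETWEEN conditions can resolve to more than one table. It assumes plain modulo sharding on a numeric key; SingleTableRouter and the t_order naming are hypothetical, and the real routing logic in Sharding-JDBC may differ.

```java
import java.util.Collection;
import java.util.SortedSet;
import java.util.TreeSet;

// Hypothetical router for illustration; assumes modulo sharding on a numeric key.
final class SingleTableRouter {
    private final String logicTable;
    private final int tableCount;

    SingleTableRouter(String logicTable, int tableCount) {
        if (tableCount <= 0) {
            throw new IllegalArgumentException("tableCount must be positive");
        }
        this.logicTable = logicTable;
        this.tableCount = tableCount;
    }

    /** key = ? always resolves to exactly one real table. */
    String routeEqual(long key) {
        return logicTable + "_" + Math.floorMod(key, (long) tableCount);
    }

    /** key IN (...) resolves to the union of the tables of all listed keys. */
    SortedSet<String> routeIn(Collection<Long> keys) {
        SortedSet<String> targets = new TreeSet<>();
        for (long key : keys) {
            targets.add(routeEqual(key));
        }
        return targets;
    }

    /** key BETWEEN lower AND upper: tableCount consecutive keys already cover every table. */
    SortedSet<String> routeBetween(long lower, long upper) {
        SortedSet<String> targets = new TreeSet<>();
        for (int i = 0; i < tableCount; i++) {
            long key = lower + i;
            if (key > upper || key < lower) {
                break; // past the end of the range (or overflowed)
            }
            targets.add(routeEqual(key));
        }
        return targets;
    }
}
```

With four tables, an equality on key 10 routes only to t_order_2, BETWEEN 6 AND 7 routes to t_order_2 and t_order_3, and any range of four or more consecutive keys routes to all four tables.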

A Binding Table can be understood as a group of parent and child tables whose sharding rules are exactly the same. For example, the order table and the order details table both use the order ID as the sharding key, and their sharding logic is identical at all times. A join between such tables is similar to a single-table query in difficulty and performance.

Cartesian product queries are the most complicated. Without a binding relationship there is no guarantee that the sharding rules of the joined tables line up, so a join between non-Binding tables has to be broken down and executed as a Cartesian product of table combinations. Query performance is low and the number of database connections is high, so use it with caution.
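The cost difference between the two kinds of join routing can be illustrated by counting the table combinations each one produces. This is a simplified sketch with hypothetical names; it only enumerates the pairs of real tables and does not generate complete SQL.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; it lists table pairs, not full SQL.
final class JoinRouting {
    private JoinRouting() {}

    /** Binding tables share sharding rules, so table i only ever joins table i. */
    static List<String> bindingPairs(List<String> left, List<String> right) {
        if (left.size() != right.size()) {
            throw new IllegalArgumentException("binding tables must have the same shard count");
        }
        List<String> pairs = new ArrayList<>();
        for (int i = 0; i < left.size(); i++) {
            pairs.add(left.get(i) + " JOIN " + right.get(i));
        }
        return pairs;
    }

    /** Without a binding relationship, every left table is joined with every right table. */
    static List<String> cartesianPairs(List<String> left, List<String> right) {
        List<String> pairs = new ArrayList<>();
        for (String l : left) {
            for (String r : right) {
                pairs.add(l + " JOIN " + r);
            }
        }
        return pairs;
    }
}
```

For two tables sharded N ways each, a binding join needs N statements while a Cartesian product join needs N × N, which is where the extra connections and the lower performance come from.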

SQL execution

After routing to the real data source, Sharding-JDBC will execute SQL concurrently with multiple threads, and complete the processing of batch methods such as addBatch.
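Concurrent execution of this kind can be sketched with the standard java.util.concurrent API. The example below is an assumption about the general approach rather than Sharding-JDBC's actual executor: ConcurrentShardExecutor is a hypothetical name, and each Callable stands for running the rewritten SQL against one routed data source.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical executor for illustration only.
final class ConcurrentShardExecutor {
    private ConcurrentShardExecutor() {}

    /** Runs one task per routed data source in parallel; results keep the task order. */
    static <T> List<T> executeAll(List<Callable<T>> shardTasks) {
        List<T> results = new ArrayList<>();
        if (shardTasks.isEmpty()) {
            return results;
        }
        ExecutorService pool = Executors.newFixedThreadPool(shardTasks.size());
        try {
            // invokeAll blocks until every task has finished.
            for (Future<T> future : pool.invokeAll(shardTasks)) {
                results.add(future.get());
            }
            return results;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for shards", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("a shard task failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }
}
```

A production implementation would normally reuse a shared thread pool rather than create one per statement; the per-call pool here only keeps the sketch self-contained.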

Result merging

Result merging falls into four types: plain traversal, sorting, aggregation, and grouping. Each type first skips the rows that pagination makes unnecessary.

The plain traversal type is the simplest: it just iterates over the collection of ResultSets in order.

The sorting type has to output results in sorted order. Because the results of each shard are already sorted by the same conditions, a merge sort is used to combine them into the final result.
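One common way to merge already-sorted shard results is a k-way merge driven by a priority queue. The sketch below illustrates that technique with integers standing in for rows; SortedShardMerger is a hypothetical name, and the article does not say which data structure Sharding-JDBC actually uses.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical k-way merge for illustration; integers stand in for sorted rows.
final class SortedShardMerger {
    private SortedShardMerger() {}

    /** Tracks the current head row of one shard's sorted result. */
    private static final class Cursor implements Comparable<Cursor> {
        int head;
        final Iterator<Integer> rest;

        Cursor(Iterator<Integer> rows) {
            this.rest = rows;
            this.head = rows.next();
        }

        @Override
        public int compareTo(Cursor other) {
            return Integer.compare(head, other.head);
        }
    }

    /** Merges ascending per-shard lists into one ascending list. */
    static List<Integer> merge(List<List<Integer>> sortedShards) {
        PriorityQueue<Cursor> queue = new PriorityQueue<>();
        for (List<Integer> shard : sortedShards) {
            Iterator<Integer> rows = shard.iterator();
            if (rows.hasNext()) {
                queue.add(new Cursor(rows));
            }
        }
        List<Integer> merged = new ArrayList<>();
        while (!queue.isEmpty()) {
            Cursor smallest = queue.poll(); // shard whose head row sorts first
            merged.add(smallest.head);
            if (smallest.rest.hasNext()) {
                smallest.head = smallest.rest.next();
                queue.add(smallest); // re-insert with its next row as the head
            }
        }
        return merged;
    }
}
```

Only one row per shard has to be held in the queue at any time, so the merge does not need to load whole result sets into memory before producing output.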

Aggregation comes in three types: comparative, cumulative, and average. The comparative type covers MAX and MIN and returns only the largest (or smallest) result. The cumulative type covers SUM and COUNT, whose results are added up before being returned. The average is computed from the SUM and COUNT produced by SQL rewriting, which is covered in the SQL rewrite section and is not repeated here.
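The comparative and cumulative merges reduce to very small functions. In this sketch, whose names are hypothetical, each argument is the value one shard returned for the same aggregate column.

```java
import java.util.OptionalLong;
import java.util.stream.LongStream;

// Hypothetical merger for illustration; each argument is one shard's aggregate value.
final class AggregationMerger {
    private AggregationMerger() {}

    /** Comparative type: the global MAX is the largest per-shard MAX. */
    static OptionalLong mergeMax(long... shardMaxima) {
        return LongStream.of(shardMaxima).max();
    }

    /** Comparative type: the global MIN is the smallest per-shard MIN. */
    static OptionalLong mergeMin(long... shardMinima) {
        return LongStream.of(shardMinima).min();
    }

    /** Cumulative type: the global SUM is the sum of the per-shard SUMs. */
    static long mergeSum(long... shardSums) {
        return LongStream.of(shardSums).sum();
    }

    /** Cumulative type: the global COUNT is the sum, not the count, of the per-shard COUNTs. */
    static long mergeCount(long... shardCounts) {
        return mergeSum(shardCounts);
    }
}
```

The point worth noting is COUNT: merging three shard counts of 3, 9, and 5 must give 17 rows, not 3.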

The grouping type is the most complex. It has to load all ResultSet results into memory, group them with a map-reduce algorithm, and finally process them according to the sorting and aggregation conditions. This is the part that consumes the most memory and costs the most performance. Consider using LIMIT to keep the size of the grouped data reasonable.
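An in-memory grouping merge can be sketched as follows for the simple case of a COUNT per group. The names are hypothetical and the sketch is not Sharding-JDBC's implementation; each Row stands for one line of a shard's GROUP BY result, and a TreeMap is used so that the merged groups come out ordered by key.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical merger for illustration; handles COUNT per group only.
final class GroupByMerger {
    private GroupByMerger() {}

    /** One row of a per-shard "SELECT key, COUNT(*) ... GROUP BY key" result. */
    static final class Row {
        final String key;
        final long count;

        Row(String key, long count) {
            this.key = key;
            this.count = count;
        }
    }

    /** Loads every shard's rows into one map and adds up the counts of equal keys. */
    static Map<String, Long> merge(List<List<Row>> shardResults) {
        Map<String, Long> totals = new TreeMap<>(); // sorted by group key
        for (List<Row> shard : shardResults) {
            for (Row row : shard) {
                totals.merge(row.key, row.count, Long::sum);
            }
        }
        return totals;
    }
}
```

The map holds one entry per distinct group across all shards, which is why memory use grows with the number of groups.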

The result merging part does not currently use the pipeline parsing method, and more improvements will be made here in the future.

Performance

Performance test results when the routing result is a single database and a single table:

Query operation: TPS of Sharding-JDBC is 99.8% of the TPS of JDBC;
Insert operation: TPS of Sharding-JDBC is 90.2% of the TPS of JDBC;
Update operation: TPS of Sharding-JDBC is 93.1% of the TPS of JDBC.
As these figures show, the performance overhead of Sharding-JDBC is very low.

Performance test results when the routing result spans multiple databases and multiple tables:

Query operation: TPS with two databases is about 94% higher than with a single database;
Insert operation: TPS with two databases is about 60% higher than with a single database;
Update operation: TPS with two databases is about 89% higher than with a single database.
The results show that Sharding-JDBC makes effective use of multi-threading and distributed resources to improve performance greatly. For more details, please refer to the performance test report of Sharding-JDBC.

Roadmap

Sharding-JDBC currently focuses on the core logic of database and table sharding. Once that functionality has stabilized, development will continue along the following lines:

  • Read-write separation;
  • Flexible distributed transactions;
  • Distributed primary key generation strategy;
  • SQL rewrite optimization to further improve performance;
  • SQL Hint: specify that a given SQL statement runs against a particular database and table, routed by business rules rather than by SQL parsing;
  • Small table broadcast;
  • HA-related features;
  • Flow control;
  • Database and table creation tool;
  • Data migration;
  • Support for parsing complex SQL, such as subqueries and stored procedures;
  • Oracle and SQL Server support;
  • Configuration center.

Open source philosophy

At present, many open source products in China are first proven over time inside a company, then stripped of business logic and sensitive code, and only then open-sourced and contributed to the community. The advantage is that such products are relatively mature. The disadvantages, however, are unavoidable. The main ones are:

  1. Lack of follow-up support. The product already meets the needs of the company's business scenarios, so there is little motivation to keep improving it. Documentation and support tend to be sparse, and documentation and code may even fall out of sync.
  2. Tight coupling with the company's business scenarios. Most framework products are designed to solve specific problems. For example, some companies may not need sharding at all, and some need only a few sharding strategies.
  3. Incomplete open sourcing. Parts that are tightly coupled with the company's business are not open-sourced.
  4. Lack of stickiness. A mature project has many features and a complex code structure, so it is hard for community volunteers to extend or modify the core logic, and without sufficient test coverage it is hard to guarantee the quality of modified code. These problems make the project less attractive to the community, and it becomes hard to find volunteers to collaborate on development.
  5. Many branches that are hard to maintain. Because the company has little motivation to keep improving the project after open-sourcing it, features that matter little to the company are neglected, and other companies end up developing their own branches. Such a project injects fresh ideas into the community at first but in the end does not absorb the best of the community in return. Dubbo, for example, attracted a lot of attention as soon as it appeared, and each company has its own version, such as Dangdang's DubboX, but in the end Dubbo itself failed to keep developing.

We are trying a brand-new open source strategy: as soon as the first version of Sharding-JDBC was complete, we promoted it to the community and within Dangdang at the same time. The benefits are:

  • Solid follow-up support. Sharding-JDBC is tied to Dangdang and will be supported both inside Dangdang and in the community. We cannot promise that community needs will take priority over Dangdang's internal needs, but we will weigh both and try to plan the upgrade path from a higher-level perspective.
  • Completely open source. Snapshot versions of the code appear on GitHub first.
  • Joint development. The current Sharding-JDBC code is relatively simple, which makes it easier for open source enthusiasts in the community to understand the core of the code and lays the foundation for sustainable development. Sharding-JDBC will also absorb the best of the community and let more enthusiasts contribute code.

Finally, it should be made clear that although Sharding-JDBC has not yet stood the test of time, it is not a bug-ridden, unusable project. Test coverage currently exceeds 90%, and the supported functions and unsupported items are listed in detail in the GitHub documentation so that users know what to expect.


Origin http://43.154.161.224:23101/article/api/json?id=326533969&siteId=291194637