MySQL database and table sharding: an introduction to Dangdang's Sharding-JDBC

[Big Data] Interpretation of the data structure of 100,000 transactions per second

Study notes based on the introduction on the official website:

http://dangdangdotcom.github.io/sharding-jdbc/00-overview/

 

What Sharding-JDBC can and cannot do

http://blog.csdn.net/Farrell_zeng/article/details/52958181

 

Alibaba's database sharding framework Cobar-Client is a thin wrapper around iBATIS's SqlMapClientTemplate, packaged as CobarSqlMapClientTemplate, so that sharding is transparent to users performing CRUD operations. It has been a mature database sharding solution for most companies. However, it now faces some problems:
(1) It does not support table sharding.
(2) It is based on iBATIS and has received essentially no maintenance or upgrades since 2013, so most companies have written their own implementation based on the same idea.

Dangdang has open-sourced Sharding-JDBC. Official repository: https://github.com/dangdangdotcom/sharding-jdbc
Interpretation of the database and table sharding middleware Sharding-JDBC
Reference: http://blog.csdn.net/u4110122855/article/details/50670503
Author: Zhang Liang, architect at Dangdang.com

Database and table sharding has been a hot topic since the beginning of the Internet era. Even today, with NoSQL everywhere, relational databases remain the first choice of most companies because of their stability, flexible queries, and compatibility. Using database and table sharding sensibly to cope with massive data volumes and high concurrency is therefore unavoidable for major Internet companies. Although many companies have developed their own sharding middleware, there is still no perfect open source solution covering this field.

Sharding addresses two scenarios: large data volume and high concurrency. It is usually divided into vertical splitting and horizontal splitting.

(1) Vertical splitting splits a database (table) into multiple databases (tables) according to business. For example, frequently and infrequently accessed fields are placed in different databases or tables. Because vertical splitting is closely tied to the business, current sharding products all use horizontal splitting.







(2) Horizontal splitting splits a database (table) into multiple databases (tables) according to a sharding algorithm. For example, take the last digit of the ID modulo 3: rows with a remainder of 1 go into the first database (table), rows with a remainder of 2 go into the second database (table), and so on.
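The modulo rule above can be sketched in a few lines of Java. This is only an illustration, not Sharding-JDBC code: the class name ModuloSharding, the logical table name t_order, and the convention of appending the remainder as a table suffix are all assumptions made for the example.

```java
// Hypothetical helper, for illustration only (not part of Sharding-JDBC).
class ModuloSharding {

    /** Shard index for an ID: its last decimal digit, modulo the shard count. */
    static int shardIndex(long id, int shardCount) {
        if (id < 0 || shardCount <= 0) {
            throw new IllegalArgumentException("id must be >= 0 and shardCount > 0");
        }
        long lastDigit = id % 10;
        return (int) (lastDigit % shardCount);
    }

    /** Builds a physical table name by appending the shard index (assumed naming scheme). */
    static String actualTable(String logicalTable, long id, int shardCount) {
        return logicalTable + "_" + shardIndex(id, shardCount);
    }

    public static void main(String[] args) {
        System.out.println(actualTable("t_order", 1001, 3)); // last digit 1, remainder 1
        System.out.println(actualTable("t_order", 1008, 3)); // last digit 8, remainder 2
    }
}
```

The same function can pick a database instead of a table; only the name being built changes.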

The retrieval performance of a relational database drops sharply once the amount of data exceeds a certain threshold. With Internet-scale data, storing everything in one table easily exceeds the volume a single table can bear. That threshold has to be determined by actual testing, because it varies with the database and the level of concurrency.

(1) "Table sharding only": it solves slow retrieval caused by excessive data volume, but it cannot solve slow database response caused by too many concurrent requests hitting the same database. Horizontal splitting therefore usually needs at least database sharding to handle large data volume and high concurrency at the same time. This is why some open source sharding middleware supports only database sharding.

(2) "Table sharding only" also has scenarios where it cannot be replaced. The most common reason for it is transactions. Within a single database there is no need to consider distributed transactions, so making good use of multiple tables in the same database avoids the trouble that distributed transactions bring. Strongly consistent distributed transactions currently have performance problems, so a sharded system that relies on them is not necessarily faster than an unsharded one. Most systems today use flexible transactions with eventual consistency instead.
Another reason for table sharding is that too many database instances make operation and maintenance harder.

To sum up, the best practice is a sensible combination of database sharding and table sharding.


[Introduction to Sharding-JDBC]
Sharding-JDBC is a horizontal sharding framework extracted from dd-rdb, the relational database module of Dangdang's application framework ddframe. It provides transparent access to sharded databases and tables.
Sharding-JDBC is the third open source project in the ddframe series, after dubbox and elastic-job.

Sharding-JDBC wraps the JDBC API directly and can be understood as an enhanced JDBC driver. The cost of migrating old code is almost zero:
It works with any Java-based ORM framework, such as JPA, Hibernate, MyBatis, and Spring JDBC Template, or with JDBC used directly.
It works with any third-party database connection pool, such as DBCP, C3P0, BoneCP, and Druid.
In theory it can support any database that implements the JDBC specification. Only MySQL is supported at present, but support for databases such as Oracle and SQL Server is planned.
Sharding-JDBC is positioned as a lightweight Java framework. The client connects directly to the database, and the framework is delivered as a jar. There is no proxy layer, no additional deployment, and no other dependency, and DBAs do not need to change their existing operation and maintenance practices.

Sharding-JDBC has a flexible sharding strategy. It supports sharding on multiple operators such as =, BETWEEN, and IN, as well as multiple sharding keys.

SQL parsing is full-featured. It supports aggregation, grouping, sorting, LIMIT, OR, and other queries, as well as Binding Table and Cartesian product table queries.

Comparison with common open source products
Out of respect for other open source projects, we do not comment on projects that are still being updated. Table 1 lists only a few projects that are no longer updated but remain very influential in the field of database sharding.

Table 1 Comparison of database sharding tools
As the table shows, Cobar is a middle-tier solution: a proxy layer sits between the application and MySQL, so every request is forwarded once more. Solutions based on the JDBC protocol have no extra forwarding; the application connects directly to the database, which gives a slight performance advantage. This does not mean that a middle tier is necessarily inferior to a direct client connection. Factors other than performance matter too, and a middle tier makes monitoring, data migration, and connection management easier to implement.

Cobar-Client, TDDL, and Sharding-JDBC are all client-side direct-connection solutions. Their advantages are that they are lightweight, compatible, fast, and have little impact on DBAs. Cobar-Client is implemented on top of an ORM framework (MyBatis), so its compatibility and extensibility are not as good as the other two, which are based on the JDBC protocol.

Implementation Principle
As mentioned above, Sharding-JDBC is a jar that implements the JDBC protocol. An implementation based on the JDBC protocol differs slightly from a middle tier based on a database protocol such as MySQL's.

Whichever architecture is used, the core logic is very similar. Apart from the protocol implementation layer (JDBC or database protocol), it is divided into modules for sharding rule configuration, SQL parsing, SQL rewriting, SQL routing, SQL execution, and result merging.

See Figure 1 for the overall architecture of Sharding-JDBC.
Figure 1 The overall architecture of Sharding-JDBC

Sharding rule configuration
Sharding-JDBC's sharding logic is very flexible. It supports custom sharding strategies, composite sharding keys, and sharding on multiple operators.
For example: a combined strategy that shards databases by user ID and tables by order ID, or multi-key sharding of tables by year and month plus user region ID.

In addition to supporting the equal sign operator for sharding, Sharding-JDBC also supports in/between operator sharding, providing more powerful sharding functions.

Sharding-JDBC provides a Spring namespace to simplify configuration and a rule engine to simplify writing strategies. Because only the core sharding logic has just been open-sourced, these two modules are not open source yet; they and other modules will be released once the core is stable.

JDBC specification rewrite
Sharding-JDBC rewrites the JDBC specification by wrapping five core interfaces, DataSource, Connection, Statement, PreparedStatement, and ResultSet, and by bringing the collection of real JDBC implementation classes (such as the MySQL JDBC implementation or the DBCP JDBC implementation) under the management of the Sharding-JDBC implementation classes.

Sharding-JDBC implements as much of the JDBC protocol as possible, including batch update features such as addBatch that JPA uses. Sharded JDBC is still different from native JDBC, however, so some interfaces remain unimplemented, including Connection cursor, stored procedure, and savepoint features, ResultSet forward traversal and modification, and other rarely used functions. In addition, to preserve compatibility, interfaces released in JDBC 4.1 and later are not implemented (for example, DBCP 1.x does not support JDBC 4.1).

SQL parsing
SQL parsing is the core of a sharding product, and performance and compatibility are its most important metrics. Common SQL parsers currently include fdb, jsqlparser, and Druid. Sharding-JDBC uses Druid as its SQL parser; in actual tests, Druid parsed dozens of times faster than the other two.

Sharding-JDBC currently supports parsing complex SQL such as JOIN, aggregation (including AVG), ORDER BY, GROUP BY, LIMIT, and even OR queries. SQL that should not appear in sharding scenarios, such as UNION, some subqueries, and sharding inside functions, is not supported yet.

SQL rewriting
SQL rewriting has two parts. The first replaces the logical table name of a sharded table with the real table name. The second uses the SQL parsing result to replace functions that are not correct in a sharded environment. Here are two examples:

The first example is AVG. In a sharded environment, (avg1 + avg2 + avg3) / 3 is not the correct average; it must be computed as (sum1 + sum2 + sum3) / (count1 + count2 + count3). The SQL containing AVG therefore has to be rewritten to SUM and COUNT, and the average is recomputed when the results are merged.
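A small Java sketch of the merge step for AVG, assuming each shard has already returned its own SUM and COUNT. The class and method names are hypothetical and the numbers are made up for illustration; this is not Sharding-JDBC's actual merger.

```java
// Hypothetical illustration of recomputing AVG from per-shard SUM and COUNT.
class AvgMerge {

    /** Global average = (sum1 + sum2 + ...) / (count1 + count2 + ...). */
    static double mergedAverage(double[] shardSums, long[] shardCounts) {
        if (shardSums.length != shardCounts.length) {
            throw new IllegalArgumentException("need one SUM and one COUNT per shard");
        }
        double totalSum = 0.0;
        long totalCount = 0L;
        for (int i = 0; i < shardSums.length; i++) {
            totalSum += shardSums[i];
            totalCount += shardCounts[i];
        }
        if (totalCount == 0) {
            throw new IllegalArgumentException("no rows matched on any shard");
        }
        return totalSum / totalCount;
    }

    public static void main(String[] args) {
        // Shard 1: one row with value 10 (avg 10). Shard 2: four rows summing to 20 (avg 5).
        double[] sums = {10.0, 20.0};
        long[] counts = {1, 4};
        // Averaging the two shard averages would give 7.5; the merged result is 30 / 5.
        System.out.println(mergedAverage(sums, counts));
    }
}
```

The example data is chosen so that the shards hold different numbers of rows, which is exactly the case where averaging the averages goes wrong.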

The second example is pagination. Suppose each page holds 10 rows and the second page is wanted. Running LIMIT 10, 10 on each shard and then taking the first 10 rows of the merged result in sort order gives an incorrect result. The correct approach is to rewrite the shard-level clause to LIMIT 0, 20, fetch the first two pages from every shard, and then compute the correct rows by applying the sort order to the combined data. It follows that the further back the page, the less efficient LIMIT paging becomes and the more memory is wasted. There are many ways to page without LIMIT, such as building a secondary index that records row counts and row offsets, or using the last ID of the previous page as a condition in the next query.
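The pagination rewrite can be simulated in memory. In this sketch each shard is just a list of integers standing in for sorted rows, and shardPage plays the role of the rewritten LIMIT 0, offset + rowCount query. All names are hypothetical and a small page size is used to keep the data readable.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical in-memory simulation of LIMIT rewriting across shards.
class ShardedPagination {

    /** Rows one shard returns for "ORDER BY value LIMIT 0, offset + rowCount". */
    static List<Integer> shardPage(List<Integer> shardRows, int offset, int rowCount) {
        List<Integer> sorted = new ArrayList<>(shardRows);
        Collections.sort(sorted);
        return sorted.subList(0, Math.min(offset + rowCount, sorted.size()));
    }

    /** Merges the per-shard pages, re-sorts, then applies the original offset and row count. */
    static List<Integer> page(List<List<Integer>> shards, int offset, int rowCount) {
        List<Integer> merged = new ArrayList<>();
        for (List<Integer> shard : shards) {
            merged.addAll(shardPage(shard, offset, rowCount));
        }
        Collections.sort(merged);
        int from = Math.min(offset, merged.size());
        int to = Math.min(offset + rowCount, merged.size());
        return new ArrayList<>(merged.subList(from, to));
    }

    public static void main(String[] args) {
        List<List<Integer>> shards = Arrays.asList(
                Arrays.asList(1, 4, 5, 8),
                Arrays.asList(2, 3, 6, 7));
        // Page 2 with 2 rows per page, i.e. the original query used LIMIT 2, 2.
        System.out.println(page(shards, 2, 2));
    }
}
```

With this data, sending LIMIT 2, 2 unchanged to each shard would return 5, 8 and 6, 7, so rows 3 and 4 could never appear in the merged page. Fetching from offset 0 on every shard avoids that, at the cost of reading offset + rowCount rows per shard.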

SQL routing
SQL routing uses the sharding rule configuration to direct SQL to the real data sources. It is divided into single-table routing, Binding table routing, and Cartesian product routing.

Single-table routing is the simplest, but the result does not necessarily fall into a single database (table): because sharding on operators such as BETWEEN and IN is supported, the final result may still span multiple databases (tables).
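To see why IN and BETWEEN can route to several shards, here is a minimal Java sketch that assumes a plain modulo sharding rule on a numeric key. The class OperatorRouting and its methods are invented for this example and do not reflect Sharding-JDBC's real routing API.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical router, assuming shard = key mod shardCount.
class OperatorRouting {

    static int shardOf(long key, int shardCount) {
        return (int) Math.floorMod(key, (long) shardCount);
    }

    /** key = value: always exactly one shard. */
    static Set<Integer> routeEquals(long value, int shardCount) {
        Set<Integer> shards = new TreeSet<>();
        shards.add(shardOf(value, shardCount));
        return shards;
    }

    /** key IN (...): one shard per distinct remainder among the listed values. */
    static Set<Integer> routeIn(Collection<Long> values, int shardCount) {
        Set<Integer> shards = new TreeSet<>();
        for (long value : values) {
            shards.add(shardOf(value, shardCount));
        }
        return shards;
    }

    /** key BETWEEN low AND high: walk the range until every shard has been hit. */
    static Set<Integer> routeBetween(long low, long high, int shardCount) {
        Set<Integer> shards = new TreeSet<>();
        for (long v = low; v <= high && shards.size() < shardCount; v++) {
            shards.add(shardOf(v, shardCount));
        }
        return shards;
    }

    public static void main(String[] args) {
        System.out.println(routeEquals(7, 3));
        System.out.println(routeIn(Arrays.asList(1L, 4L, 5L), 3));
        System.out.println(routeBetween(10, 11, 3));
    }
}
```

Under a modulo rule, any BETWEEN range at least as wide as the shard count touches every shard, which is one reason range conditions on the sharding key tend to be expensive.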

Binding tables can be understood as a main table and its detail tables that share exactly the same sharding rules. For example, the order table and the order details table both use the order ID as the sharding key, so their sharding logic is always identical. Such a join query is similar to a single-table query in difficulty and performance.

Cartesian product queries are the most complicated. Without a binding relationship, the consistency of the sharding rules cannot be determined, so a join between non-Binding tables has to be broken up and executed as a Cartesian product of combinations. Query performance is low and the number of database connections is high, so use it with caution.

SQL execution
After routing to the real data sources, Sharding-JDBC executes the SQL concurrently on multiple threads and also handles batch methods such as addBatch.

Result merging
Result merging falls into four categories: plain traversal, sorting, aggregation, and grouping. Each category first skips unneeded rows according to the pagination result.

The plain traversal category is the simplest: it just iterates over the collection of ResultSets in order.

The sorting category outputs the results in sorted order. Because each shard's results are already sorted by its own query, a merge sort over the shard results produces the final order.
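One common way to merge several already-sorted result lists is a k-way merge driven by a priority queue. The Java sketch below takes that approach as an assumption; the source only says a merge sort is used, and the class name and the use of integer lists in place of ResultSets are illustrative.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical k-way merge over per-shard results that are each sorted ascending.
class SortedShardMerge {

    static List<Integer> merge(List<List<Integer>> sortedShards) {
        // Heap entries are {value, shardIndex, positionInShard}, ordered by value.
        PriorityQueue<int[]> heap =
                new PriorityQueue<>((a, b) -> Integer.compare(a[0], b[0]));
        for (int s = 0; s < sortedShards.size(); s++) {
            if (!sortedShards.get(s).isEmpty()) {
                heap.add(new int[] {sortedShards.get(s).get(0), s, 0});
            }
        }
        List<Integer> merged = new ArrayList<>();
        while (!heap.isEmpty()) {
            int[] head = heap.poll();
            merged.add(head[0]);
            // Advance the cursor of the shard the smallest value came from.
            List<Integer> shard = sortedShards.get(head[1]);
            int next = head[2] + 1;
            if (next < shard.size()) {
                heap.add(new int[] {shard.get(next), head[1], next});
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        System.out.println(merge(Arrays.asList(
                Arrays.asList(1, 4, 9),
                Arrays.asList(2, 3, 10))));
    }
}
```

Only one cursor per shard is held in the heap at a time, so the merge does not need to re-sort the full combined data.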

The aggregation category has three types: comparison, accumulation, and average. The comparison type covers MAX and MIN and returns only the largest (or smallest) result. The accumulation type covers SUM and COUNT, whose results are added up before being returned. The average is computed from the SUM and COUNT produced by SQL rewriting, as described under SQL rewriting.
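The comparison and accumulation types reduce to very small functions. The following Java sketch assumes each shard has returned a single long value for the aggregate; the class and method names are hypothetical.

```java
// Hypothetical merge functions for per-shard aggregate values.
class AggregateMerge {

    /** Comparison type: MAX keeps the largest per-shard value. */
    static long mergeMax(long[] shardValues) {
        requireNonEmpty(shardValues);
        long result = shardValues[0];
        for (long value : shardValues) {
            result = Math.max(result, value);
        }
        return result;
    }

    /** Comparison type: MIN keeps the smallest per-shard value. */
    static long mergeMin(long[] shardValues) {
        requireNonEmpty(shardValues);
        long result = shardValues[0];
        for (long value : shardValues) {
            result = Math.min(result, value);
        }
        return result;
    }

    /** Accumulation type: SUM and COUNT both add the per-shard values together. */
    static long mergeSum(long[] shardValues) {
        long total = 0L;
        for (long value : shardValues) {
            total += value;
        }
        return total;
    }

    private static void requireNonEmpty(long[] values) {
        if (values.length == 0) {
            throw new IllegalArgumentException("at least one shard result is required");
        }
    }

    public static void main(String[] args) {
        long[] perShard = {3, 9, 5};
        System.out.println(mergeMax(perShard));
        System.out.println(mergeMin(perShard));
        System.out.println(mergeSum(perShard));
    }
}
```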

The grouping category is the most complex. It has to load all ResultSet rows into memory, group them with a map-reduce style algorithm, and then apply the sorting and aggregation conditions. This is the part that consumes the most memory and costs the most performance. Consider using LIMIT to bound the size of the grouped data.
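A minimal Java sketch of in-memory group merging, assuming a query of the form SELECT user_id, SUM(amount) ... GROUP BY user_id ORDER BY user_id and assuming each shard's rows arrive as a map from group key to partial sum. The column names, the class name, and the use of a TreeMap for ordering are illustrative choices, not the library's implementation.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical merge of per-shard GROUP BY results (group key -> partial SUM).
class GroupByMerge {

    static Map<Integer, Long> mergeGroupedSums(List<Map<Integer, Long>> shardResults) {
        // TreeMap keeps the groups ordered by key, standing in for ORDER BY user_id.
        Map<Integer, Long> merged = new TreeMap<>();
        for (Map<Integer, Long> shard : shardResults) {
            for (Map.Entry<Integer, Long> row : shard.entrySet()) {
                // Combine partial sums that belong to the same group.
                merged.merge(row.getKey(), row.getValue(), Long::sum);
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        Map<Integer, Long> shard0 = new HashMap<>();
        shard0.put(1, 10L);
        shard0.put(2, 5L);
        Map<Integer, Long> shard1 = new HashMap<>();
        shard1.put(2, 7L);
        shard1.put(3, 1L);
        System.out.println(mergeGroupedSums(Arrays.asList(shard0, shard1)));
    }
}
```

Every group from every shard has to be held in the merged map at once, which illustrates why this category is the most memory-hungry.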

Result merging does not yet use pipelined processing; more improvements are planned here.

[Performance]
Performance test with results routed to a single database and a single table:
Query: Sharding-JDBC's TPS is 99.8% of JDBC's;
Insert: Sharding-JDBC's TPS is 90.2% of JDBC's;
Update: Sharding-JDBC's TPS is 93.1% of JDBC's.
The performance loss of Sharding-JDBC is therefore very low.

Performance test with results routed to multiple databases and multiple tables:

Query: TPS with two databases is about 94% higher than with one;
Insert: TPS with two databases is about 60% higher than with one;
Update: TPS with two databases is about 89% higher than with one.
The results show that Sharding-JDBC makes effective use of multithreading and distributed resources to improve performance substantially.
For more details, see the Sharding-JDBC performance test report.

Sharding-JDBC currently focuses on the core logic of database and table sharding. Once the features are stable, it will be updated along the following lines:
read-write separation;
flexible distributed transactions;
distributed primary key generation strategy;
SQL rewrite optimization to further improve performance;
SQL Hint, which routes a given SQL statement to a specific database and table based on business rules instead of SQL parsing;
small table broadcasting;
HA-related features;
flow control;
database and table creation tools;
data migration;
support for parsing complex SQL, such as subqueries and stored procedures;
Oracle and SQL Server support;
configuration center.

Open source philosophy
Many open source products in China are first proven over time inside a company, then stripped of business logic and sensitive code, and only then open-sourced and contributed to the community. The advantage is that such products are relatively mature. The drawbacks are also unavoidable, mainly:
(1) Lack of follow-up support. The product already meets the company's business needs, so there is little motivation to keep improving it. Documentation and support tend to be thin, and the documentation may even fall out of sync with the code.
(2) Tight coupling with the company's business scenarios. Most framework products are designed to solve specific problems. For example, some companies may not need sharding, and some only need to support a few sharding strategies.
(3) Incomplete open-sourcing. Parts that are tightly coupled with the company's business are not released.
(4) Lack of stickiness. A mature project has many features and a complex code structure, so community volunteers find it hard to extend or modify the core logic. If test coverage is insufficient, the quality of modified code is hard to guarantee. These problems make the project less sticky for the community, and it becomes difficult to find volunteers who can co-develop it.
(5) Many branches that are hard to maintain. Because the company lacks motivation to keep improving the product after open-sourcing it, features that matter little to the company get little attention, and other companies end up developing their own branches. Such projects inject fresh ideas into the community at first but never absorb the best of the community in return. For example, Dubbo attracted a lot of attention when it appeared and many companies have their own version, such as Dangdang's DubboX, but in the end Dubbo itself did not continue to develop.
We are trying a brand-new open source strategy: as soon as the first version of Sharding-JDBC was finished, it was promoted to the community and inside Dangdang at the same time. The advantages are:

Better follow-up support. Sharding-JDBC is tied to Dangdang and will be supported both inside Dangdang and in the community. Although we cannot promise that community needs will take priority over Dangdang's internal needs, we will weigh both and try to plan the upgrade route from a higher-level view.
Completely open source. Snapshot versions of the code appear on GitHub first.
Joint development. The current Sharding-JDBC code is relatively simple, which makes it easier for community enthusiasts to understand the core of the code and lays the foundation for sustainable development. Sharding-JDBC will also absorb the best of the community and let more enthusiasts contribute code.
Finally, to be clear: Sharding-JDBC has not been tested by time, but that does not make it a bug-ridden, unusable project. Test coverage currently exceeds 90%, and the supported features and unsupported items are listed in detail in the GitHub documentation so that users know what to expect.

 

---------------------------------------------------------------------------------






************************Read and write separation************************
Practice:

1. Create the new databases, grant them to a user, and set up master-slave replication

On the master database (122):
dbtbl_0_master
dbtbl_1_master
Create the databases:
create database dbtbl_0_master;
create database dbtbl_1_master;
Grant the master user xmtest all privileges on the dbtbl_0_master / dbtbl_1_master databases:
grant all privileges on dbtbl_0_master.* to 'xmtest'@'%' identified by '123456' WITH GRANT OPTION;
grant all privileges on dbtbl_1_master.* to 'xmtest'@'%' identified by '123456' WITH GRANT OPTION;
Refresh the system privilege tables:
flush privileges;

On the slave database (123):
dbtbl_0_slave_0
dbtbl_1_slave_0
Create the databases:
create database dbtbl_0_slave_0;
create database dbtbl_1_slave_0;
Grant the slave user xmtest read permission on the dbtbl_0_slave_0 / dbtbl_1_slave_0 databases:
grant select on dbtbl_0_slave_0.* to xmtest@localhost identified by '123456';
grant select on dbtbl_1_slave_0.* to xmtest@localhost identified by '123456';
Refresh the system privilege tables:
flush privileges;

Because master-slave replication is configured for every database except the mysql database, you only need to create the databases and grant the user permissions.
Reference: MYSQL installation configuration.txt


Here the master and slave databases are given the same names, so use the following version:
On the master database (122), create the databases and grant permissions:
dbtbl_0
dbtbl_1
Create the databases:
create database dbtbl_0;
create database dbtbl_1;
Grant the master user xmtest all privileges on the dbtbl_0 / dbtbl_1 databases:
grant all privileges on dbtbl_0.* to 'xmtest'@'%' identified by '123456' WITH GRANT OPTION;
grant all privileges on dbtbl_1.* to 'xmtest'@'%' identified by '
Refresh the system privilege tables:
flush privileges;

Because master-slave replication is configured for every database except the mysql database, you only need to create the databases and grant the user permissions.
Reference: MYSQL installation configuration.txt
















 
