With the system running well, why do we need to split databases and tables at all?
Let's take a concrete business scenario, the t_order table, as the example for optimizing the architecture. Once the data volume reaches hundreds of millions of rows, query performance degrades badly, so we apply database and table sharding to deal with the problem. Specifically, we split the original single database into two databases, DB_1 and DB_2, and then split the table again inside each database, producing the two tables t_order_1 and t_order_2. This completes the database-and-table sharding of the order table.
data sharding
When we talk about splitting databases and tables, we usually mean the horizontal mode (horizontal database sharding plus horizontal table sharding). Data sharding splits a table with a large data volume, such as t_order, into several small tables with exactly the same structure (the split tables): t_order_0, t_order_1, ..., t_order_n. Each split table stores only part of the data of the original large table.
data node
A data node is the smallest indivisible unit in data sharding, composed of a data source name and a table. For example, DB_1.t_order_1 and DB_2.t_order_2 in the earlier example are each a data node.
logical table
A logical table is the logical name for a group of horizontally split tables that share the same structure.
For example, if we split the order table t_order into 10 tables t_order_0 ... t_order_9, the table t_order no longer physically exists in the database; it is replaced by the t_order_n tables.
Database and table sharding is usually non-intrusive to business code: developers focus only on writing business logic SQL. We still write SQL against t_order, and the logical SQL is parsed into the SQL that actually executes on the corresponding databases before execution. Here, t_order is the logical table of the split tables.
Business logic SQL:
select * from t_order where order_no='A11111'
Actually executed SQL:
select * from DB_1.t_order_n where order_no='A11111'
real table
A real table is a physical table that actually exists in the database, such as DB_1.t_order_n.
broadcast table
A broadcast table is a special type of table whose structure and data are identical across all shard data sources. Compared with a split table, a broadcast table has a small data volume and a low update frequency, and is typically used for dictionary or configuration tables. Since every node holds a copy, it can greatly reduce the network overhead of JOIN queries and improve query efficiency.
Note that modifications to a broadcast table must be synchronized so that the data on all nodes stays consistent.
Features of a broadcast table:
- The data of a broadcast table is identical across all shard data sources, so operations on it (insert, update, delete) are executed on every shard data source in real time to keep the data consistent.
- A query against a broadcast table only needs to be executed once, on any single shard data source.
- It can be JOINed with any other table: because its data is consistent on every node, the same rows can be read from whichever node the query lands on.
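The fan-out behavior of broadcast-table writes versus single-source reads can be sketched as follows. This is an illustrative model only, not ShardingSphere's API; the `sources`, `broadcast_write`, and `broadcast_read` names are made up.

```python
# Sketch of broadcast-table semantics: writes fan out to every shard data
# source; reads can hit any single one. Two hypothetical data sources:
sources = {"DB_1": {}, "DB_2": {}}

def broadcast_write(table, row_id, row):
    # A broadcast-table write must run once per shard data source.
    for db in sources.values():
        db.setdefault(table, {})[row_id] = row

def broadcast_read(table, row_id):
    # Any single source holds the full copy, so one lookup suffices.
    any_db = next(iter(sources.values()))
    return any_db.get(table, {}).get(row_id)

broadcast_write("t_city", 1, {"name": "Beijing"})
print(broadcast_read("t_city", 1))  # every source now holds the same row
```

Note how one logical write became two physical writes; this is exactly the write amplification the note below warns about.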
What kind of table is suitable as a broadcast table?
In an order management system, we often need to query and aggregate the order data of a given province or city, which involves a JOIN between the province/city table t_city and the order tables DB_n.t_order_n. In that case, consider designing the province/city table as a broadcast table; the core idea is to avoid cross-database JOIN operations.
Note: as mentioned above, inserts, updates, and deletes on a broadcast table are executed on every shard data source in real time. That is, if you have 1000 shard data sources, one modification of the broadcast table executes 1000 SQL statements. Try to avoid doing this under high concurrency or during business peaks, so as not to hurt system performance.
single table
A single table is a table that exists only once across all shard data sources (an unsharded table). It is suitable for tables with a small data volume that do not need sharding.
If a table's data volume is expected to stay within the tens of millions of rows and it never needs to be joined with other split tables, it is recommended to configure it as a single table and store it in the default shard data source.
shard key
The shard key determines where data lands, that is, which data node a row is allocated to for storage, so the choice of shard key is very important.
For example, after sharding the t_order table, when we execute an INSERT for an order, we must work out which shard the row belongs to by parsing the shard key specified in the SQL statement. Taking the order_no field as an example, we can obtain the shard number with a modulo operation (for example order_no % 2), and then route the row to the corresponding database instance (DB_1 or DB_2) according to that number. The split table is chosen in the same way.
In this process, order_no is the shard key of the t_order table. In other words, the order_no value of each order determines which database instance and table the row is stored in. Choosing a suitable field as the shard key lets you get the most out of the performance gains of horizontal sharding.
This way, all data related to the same order lands in the same database and table. Queries for an order go through the same calculation and can locate the data directly, which greatly improves retrieval performance and avoids scanning every database and table.
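The order_no % 2 routing described above can be sketched in a few lines. This is illustrative only; ShardingSphere performs this internally, and the `route_order` function and the mapping of modulo results to DB_1/DB_2 and t_order_0/t_order_1 names are assumptions for the example.

```python
def route_order(order_no: int) -> str:
    """Return the data node (db.table) for an order via order_no % 2."""
    shard = order_no % 2
    db = f"DB_{shard + 1}"        # DB_1 or DB_2
    table = f"t_order_{shard}"    # t_order_0 or t_order_1
    return f"{db}.{table}"

print(route_order(11111))  # -> DB_2.t_order_1 (11111 % 2 == 1)
print(route_order(4))      # -> DB_1.t_order_0
```

Because the same order_no always produces the same shard number, all rows of one order land on one data node.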
Beyond that, ShardingSphere also supports using multiple fields together as the shard key, which will be covered in detail in later chapters.
Sharding strategy
A sharding strategy specifies which sharding algorithm to use, which field serves as the shard key, and how data is distributed to different nodes.
A sharding strategy is composed of a sharding algorithm and a shard key, and a strategy may combine multiple sharding algorithms and operate on multiple shard keys.
The strategy configurations for database sharding and table sharding are independent of each other and may use different strategies and algorithms. Each strategy can combine multiple sharding algorithms, and each algorithm can apply its logic across multiple shard keys.
Sharding algorithm
A sharding algorithm operates on the shard key and assigns data to specific data nodes.
Commonly used sharding algorithms include:
- Hash sharding: the node a row lands on is determined by the hash value of the shard key. For example, hashing on the user ID puts all of one user's data on the same node, which simplifies later queries.
- Range sharding: shard key values are assigned to nodes by value range, for example sharding by order creation time or by geographic region.
- Modulo sharding: the shard key value modulo the shard count gives the node number the row is assigned to. For example, order_no % 2 splits order data across two nodes.
- ...
Sharding logic in real business development is far more complicated. Different algorithms suit different scenarios and requirements and must be chosen and tuned case by case.
binding table
Binding tables are a group of split tables that share the same sharding rules. Because the rules are identical, matching rows land in the same locations, which effectively avoids cross-database JOIN operations during joint queries.
For example, the order table t_order and the order item table t_order_item both use the order_no field as the shard key and are joined on order_no, so the two tables can be bound to each other.
When querying across bound tables, you must join on the shard key; otherwise a Cartesian product join or a cross-database join occurs, hurting query efficiency.
When running a multi-table joint query over t_order and t_order_item, the logical SQL of the joint query looks like this:
SELECT * FROM t_order o JOIN t_order_item i ON o.order_no=i.order_no
If the binding relationship is not configured, the location of the two tables' data is unknown, so every combination of tables is queried, and the Cartesian product join generates the following four SQL statements:
SELECT * FROM t_order_0 o JOIN t_order_item_0 i ON o.order_no=i.order_no
SELECT * FROM t_order_0 o JOIN t_order_item_1 i ON o.order_no=i.order_no
SELECT * FROM t_order_1 o JOIN t_order_item_0 i ON o.order_no=i.order_no
SELECT * FROM t_order_1 o JOIN t_order_item_1 i ON o.order_no=i.order_no
Once the binding relationship is configured, rows produced under the same sharding rules land in the same database and table, so each t_order_n only needs to be joined with the t_order_item_n in the same database:
SELECT * FROM t_order_0 o JOIN t_order_item_0 i ON o.order_no=i.order_no
SELECT * FROM t_order_1 o JOIN t_order_item_1 i ON o.order_no=i.order_no
Note: in a joint query, t_order acts as the main table of the whole query. All routing calculations use only the main table's strategy, so the sharding calculations for t_order_item also use t_order's conditions. This is why the shard keys of bound tables must be exactly the same.
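The difference in the number of generated statements can be sketched by enumerating the routed combinations. The `routed_joins` helper and the shard count of 2 are assumptions matching the example above.

```python
SHARDS = 2  # two split tables per logical table, as in the example

def routed_joins(bound: bool):
    sql = ("SELECT * FROM t_order_{o} o JOIN t_order_item_{i} i "
           "ON o.order_no=i.order_no")
    if bound:
        # Bound tables: only same-suffix pairs are joined.
        return [sql.format(o=n, i=n) for n in range(SHARDS)]
    # Unbound tables: every t_order_n joins every t_order_item_n.
    return [sql.format(o=o, i=i) for o in range(SHARDS) for i in range(SHARDS)]

print(len(routed_joins(bound=False)))  # 4 Cartesian-product statements
print(len(routed_joins(bound=True)))   # 2 statements
```

With n shards, the unbound case grows as n², while the bound case stays linear in n.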
SQL parsing
After sharding, executing a SQL statement from the application layer usually goes through six steps: SQL parsing -> executor optimization -> SQL routing -> SQL rewriting -> SQL execution -> result merging.
SQL parsing is divided into two steps: lexical parsing and syntax parsing. Take the following query for a user's orders as an example. Lexical parsing first breaks it into indivisible atomic units and, using the dictionary of the particular database dialect, classifies each unit as a keyword, expression, variable, or operator.
SELECT order_no FROM t_order where order_status > 0 and user_id = 10086
Syntax parsing then converts the SQL tokens into an abstract syntax tree. By traversing the tree, the context needed for sharding is extracted: query fields (Field), table information (Table), query conditions (Condition), sorting (Order By), grouping (Group By), and pagination (Limit), along with markers for the positions in the SQL that may need to be rewritten.
(Figure: abstract syntax tree)
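The context-extraction step can be illustrated with a deliberately naive sketch. A real implementation traverses the AST; the regexes and the `extract_context` name here are made up for illustration and only handle this one query shape.

```python
import re

SQL = "SELECT order_no FROM t_order where order_status > 0 and user_id = 10086"

def extract_context(sql: str) -> dict:
    # Table and field list come from the SELECT ... FROM ... skeleton.
    table = re.search(r"FROM\s+(\w+)", sql, re.I).group(1)
    fields = re.search(r"SELECT\s+(.+?)\s+FROM", sql, re.I).group(1)
    # Only equality conditions matter for exact shard-key routing here.
    conditions = dict(re.findall(r"(\w+)\s*=\s*(\w+)", sql))
    return {"table": table, "fields": fields, "conditions": conditions}

ctx = extract_context(SQL)
print(ctx["table"], ctx["conditions"])  # t_order {'user_id': '10086'}
```

The routing step later looks up the configured shard key (here user_id) in this extracted condition map.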
Executor optimization
Executor optimization selects the optimal query plan based on the characteristics of the SQL and execution statistics. For example, if the user_id field is indexed, the two query conditions are reordered, mainly to improve SQL execution efficiency:
SELECT order_no FROM t_order where user_id = 10086 and order_status > 0
SQL routing
The sharding context obtained from SQL parsing is matched against the sharding strategies and algorithms configured by the user to compute a routing path to the corresponding data nodes.
Put simply, routing reads the shard key from the sharding strategy configuration, finds that field's value in the SQL parse result, and computes which database and table the SQL should run on. Based on whether the SQL carries a shard key, routing is divided into shard routing and broadcast routing.
Routing with a shard key is called shard routing, which is subdivided into three types: direct routing, standard routing, and Cartesian product routing.
standard route
Standard routing is the recommended and most commonly used sharding method. It applies to SQL that contains no joins, or only joins between bound tables.
When the operator on the shard key is =, the route lands in a single database (table). When the operator is a range such as BETWEEN or IN, the route does not necessarily land in a single database (table), so one logical SQL may eventually be split into several real SQL statements for execution.
SELECT * FROM t_order where t_order_id in (1,2)
After SQL routing:
SELECT * FROM t_order_0 where t_order_id in (1,2)
SELECT * FROM t_order_1 where t_order_id in (1,2)
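The splitting of an IN condition can be sketched by routing each value independently and emitting one real SQL per target table. The `route_in` helper and shard count are assumptions matching the example above.

```python
def route_in(values, shards=2):
    """Route an IN(...) condition: union of per-value targets."""
    in_list = ",".join(map(str, values))
    # Each value routes independently; duplicates collapse into a set.
    targets = sorted({v % shards for v in values})
    return [f"SELECT * FROM t_order_{t} where t_order_id in ({in_list})"
            for t in targets]

for sql in route_in([1, 2]):
    print(sql)
# 1 % 2 and 2 % 2 cover both shards, so both t_order_0 and t_order_1 are hit
```

If all the IN values happened to route to the same shard (e.g. IN (2, 4)), only one real SQL would be emitted.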
direct routing
Direct routing is a sharding method that routes SQL straight to a specified database and table. It can be used when the shard key does not appear in the SQL, and it can execute arbitrarily complex SQL, including subqueries and custom functions.
Cartesian product routing
Cartesian product routing arises from joins between unbound tables. For example, the order table t_order is sharded by t_order_id, while the user table t_user has no binding relationship with it, so their join degenerates into Cartesian product routing. Its query performance is poor; try to avoid this routing mode.
SELECT * FROM t_order_0 t LEFT JOIN t_user_0 u ON u.user_id = t.user_id WHERE t.user_id = 1
SELECT * FROM t_order_0 t LEFT JOIN t_user_1 u ON u.user_id = t.user_id WHERE t.user_id = 1
SELECT * FROM t_order_1 t LEFT JOIN t_user_0 u ON u.user_id = t.user_id WHERE t.user_id = 1
SELECT * FROM t_order_1 t LEFT JOIN t_user_1 u ON u.user_id = t.user_id WHERE t.user_id = 1
Routing without a shard key is called broadcast routing, which is divided into five types: full database table routing, full database routing, full instance routing, unicast routing, and blocking routing.
Full database table routing
Full database table routing targets DQL, DML, and DDL operations at the table level. When we execute a SQL statement against the logical table t_order, it is executed one by one against the corresponding real tables t_order_0 ... t_order_n in every sharded database.
Full database routing
Full database routing targets database-level operations, such as database management commands of the SET type and transaction control (TCL) statements.
For example, after setting the autocommit attribute on the logical database, the command is executed on all the corresponding real databases:
SET autocommit=0;
Full instance routing
Full instance routing targets DCL operations on database instances (setting or changing database user or role permissions). For example, when creating an order user, the command is executed on every real database instance, so that the order user can access each instance normally:
CREATE USER [email protected] identified BY '程序员小富';
unicast routing
Unicast routing is used to fetch information about a real table, such as the table's description:
DESCRIBE t_order;
The real tables of t_order are t_order_0 ... t_order_n, and their structures are exactly the same, so the command only needs to be executed once on any one real table.
blocking routing
Blocking routing shields certain SQL operations from the databases, for example:
USE order_db;
This command is never executed in a real database, because ShardingSphere uses a logical schema (the organization and structure of the databases), so there is no need to forward a database-switching command to the real databases.
SQL rewriting
After parsing, optimization, and routing have determined exactly where the shards will execute, the SQL written against the logical table must be rewritten into statements that run correctly on the real databases. For example, to query the order table, the SQL we write in development targets the logical table t_order:
SELECT * FROM t_order
The logical table name must then be rewritten to the real table name obtained from routing:
SELECT * FROM t_order_n
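A toy version of the rewrite step replaces the logical table name with the routed real table name. Real rewriting also handles derived columns, pagination offsets, and more; the `rewrite` function here is an illustrative assumption.

```python
import re

def rewrite(logical_sql: str, logical: str, real: str) -> str:
    # \b word boundaries keep t_order from matching inside t_order_item.
    return re.sub(rf"\b{logical}\b", real, logical_sql)

print(rewrite("SELECT * FROM t_order", "t_order", "t_order_1"))
# SELECT * FROM t_order_1
```

The word-boundary anchors matter: a naive string replace of "t_order" would corrupt references to t_order_item in a joint query.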
SQL execution
The routed and rewritten real SQL is then sent to the underlying data sources for execution, safely and efficiently. This process does not simply push the SQL to the data sources over JDBC; it has to weigh the cost of creating data source connections against memory usage, and it automatically balances resource control against execution efficiency.
Merge results
Result merging means correctly combining the result sets returned from each data node into one large result set and returning it to the requesting client. The sorting, grouping, pagination, and aggregation clauses in our SQL all operate on this merged result set.
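For sorted results, one common merge approach is a streaming merge of the per-shard result sets, which preserves an ORDER BY without materializing everything first. This sketch assumes each shard already returned rows sorted by order_no; the variable names are made up.

```python
import heapq

# Per-shard result sets, each already sorted by order_no (ORDER BY pushed down)
shard_0 = [{"order_no": "A1"}, {"order_no": "A3"}]
shard_1 = [{"order_no": "A2"}, {"order_no": "A4"}]

# heapq.merge streams the smallest head row from either shard at each step.
merged = list(heapq.merge(shard_0, shard_1, key=lambda r: r["order_no"]))
print([r["order_no"] for r in merged])  # ['A1', 'A2', 'A3', 'A4']
```

Aggregations such as COUNT or SUM instead combine the partial values from each shard (e.g. summing the per-shard counts).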
distributed primary key
After sharding, one logical table (t_order) corresponds to many real tables (t_order_n). Since the real tables cannot see each other, their auto-increment primary keys all count up from the same initial value, so duplicate primary key IDs are inevitable, and a primary key that is no longer unique is meaningless to the business.
Although ID collisions can be avoided by giving each table's auto-increment key a different initial value and step size, that adds maintenance cost and scales poorly. Instead, we manually assign each record a globally unique ID, called a distributed ID, and the system that produces these IDs is usually called an ID issuer.
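One widely used issuer design composes the ID from a timestamp, a worker ID, and a per-millisecond sequence (the snowflake family). The sketch below is a simplified illustration with assumed bit widths; a production issuer also handles clock rollback and sequence overflow.

```python
import time

class IdIssuer:
    """Compact snowflake-style sketch: timestamp | worker | sequence."""

    def __init__(self, worker_id: int):
        self.worker_id = worker_id & 0x3FF   # 10 bits for the worker
        self.sequence = 0
        self.last_ms = -1

    def next_id(self) -> int:
        now = int(time.time() * 1000)
        if now == self.last_ms:
            # Same millisecond: bump the 12-bit sequence counter.
            self.sequence = (self.sequence + 1) & 0xFFF
        else:
            self.sequence = 0
        self.last_ms = now
        return (now << 22) | (self.worker_id << 12) | self.sequence

issuer = IdIssuer(worker_id=1)
a, b = issuer.next_id(), issuer.next_id()
assert a < b  # IDs from one issuer are unique and time-ordered
```

Because the timestamp occupies the high bits, the IDs are roughly time-sortable, which is convenient for range scans on the primary key.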
data desensitization
Data masking in a sharded setup is an effective data protection measure: it keeps sensitive data confidential and secure and reduces the risk of data leaks.
For example, when configuring sharding we can mark certain table columns as masked columns and attach a masking algorithm to each. While data is being sharded, the columns to be masked are identified in the executed SQL, their values are masked directly, and only then are the rows written to the database tables.
Personal information such as names, addresses, and phone numbers can be masked by encryption, randomization, or substitution with pseudo-random data, so that user privacy is protected.
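Write-time masking can be sketched as a per-column rule table applied to each row before it reaches the database. The column names and the specific rules below are made-up examples, not a real masking configuration.

```python
import hashlib

MASK_RULES = {
    "phone": lambda v: v[:3] + "****" + v[-4:],            # partial replacement
    "name": lambda v: v[0] + "*" * (len(v) - 1),           # keep first char
    "address": lambda v: hashlib.sha256(v.encode()).hexdigest()[:16],  # hash
}

def desensitize(row: dict) -> dict:
    # Apply the configured rule to each masked column; pass others through.
    return {k: MASK_RULES[k](v) if k in MASK_RULES else v
            for k, v in row.items()}

print(desensitize({"phone": "13812345678", "name": "Alice", "user_id": 7}))
# {'phone': '138****5678', 'name': 'A****', 'user_id': 7}
```

Note that hashed columns can still be matched by equality (hash the query value too), while partially replaced columns cannot be queried exactly; the rule per column should follow how the column is used.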
distributed transaction
The core issue of distributed transactions is how to implement atomic operations across multiple data sources.
Since different services often use different data sources to store and manage data, operations across data sources may lead to the risk of data inconsistency or loss. Therefore, it is very important to ensure the consistency of distributed transactions.
Take the order system as an example: placing an order calls multiple systems such as payment, inventory, and points; each maintains its own database instance, and the systems exchange data through API interfaces.
To guarantee that all the systems involved succeed together after an order is placed, you can use a strong-consistency transaction (the XA protocol) or a flexible (BASE-style) transaction, represented by tools such as Seata, to achieve distributed transaction consistency. These tools help developers simplify the implementation of distributed transactions, reduce errors and vulnerabilities, and improve system stability and reliability.
After sharding, the problem gets even harder: the order service itself must also handle cross-data-source operations, so system complexity rises significantly. For this reason, avoid the sharding solution unless it is truly a last resort.
data migration
Another headache after sharding is data migration. To avoid disturbing the running business system, a new database cluster is usually created, and data is migrated from the old cluster's databases and tables into the new cluster's shards. This is a fairly complicated process; factors such as data volume, data consistency, and migration speed all have to be considered.
Migration mainly deals with two kinds of data: stock data and incremental data. Stock data is the existing, valuable historical data in the old data source; incremental data is the business data that is still growing now and will be produced in the future.
Stock data can be migrated in batches on a schedule, and the migration may last several days.
Incremental data can use a double-write mode across the old and new clusters. Once migration completes and the business has verified data consistency, the application switches its data source directly.
Later we will demonstrate the migration process with third-party tools.
shadow library
What is a shadow library (Shadow Table)?
A shadow library is a database instance with the same structure as the production environment. It exists to verify the correctness of database migrations and other database changes, and to support full-link stress testing, without affecting the online system. Its data is copied from production on a schedule, but it has no impact on online business; it is used only for testing, verification, and debugging.
Before operations such as database upgrades, version changes, or parameter tuning, simulating them on the shadow library can surface potential problems, since data in an ordinary test environment is unreliable.
When using a shadow library, the following principles should be followed:
- Its structure must be fully consistent with the production database, including table structures, indexes, and constraints;
- Its data must be kept consistent with production, which can be achieved through regular synchronization;
- Reads and writes on it must not affect production; in general, operations such as updates and deletes on the shadow library should be prohibited;
- Given the sensitivity of its data, access rights must be strictly controlled, and only authorized personnel may access and operate it.