21 concepts to help you master database and table sharding

The system was running fine, so why shard databases and tables at all?

Let's take a concrete business scenario, the t_order table, as an example of architecture optimization. Once the data volume reaches the hundred-million level, query performance degrades severely, so we adopt database and table sharding to deal with the problem. Specifically, we split the original single database into two databases, DB_1 and DB_2, and then split the table again inside each database, producing the two tables t_order_1 and t_order_2, thus sharding the order table across both databases and tables.

data sharding

When we talk about database and table sharding, we usually mean the horizontal splitting mode (horizontal database sharding and table sharding). Data sharding splits a table with a large amount of data, t_order, into several small tables with exactly the same structure (split tables): t_order_0, t_order_1, ..., t_order_n. Each split table stores only a portion of the data in the original large table.

data node

A data node is the smallest indivisible unit in a data shard, consisting of a data source name and a data table. For example, in the figure above, DB_1.t_order_1 and DB_2.t_order_2 each represent a data node.

logical table

Logical table refers to the logical name of a horizontally split table with the same structure.

For example, if we split the order table t_order into 10 tables, t_order_0 ... t_order_9, then t_order itself no longer exists in the database; it is replaced by the several t_order_n tables.

Database and table sharding is usually non-intrusive to business code. Developers only need to focus on the business logic and still write SQL against t_order. Before execution, that logical SQL is parsed into the real SQL actually executed against the corresponding databases. At this point t_order, the table that was split, is called a logical table.

business logic SQL

select * from t_order where order_no='A11111'

Real execution of SQL

select * from DB_1.t_order_n where order_no='A11111'

real table

A real table is a physical table that actually exists in the database, such as DB_1.t_order_n.

broadcast table

A broadcast table is a special type of table whose structure and data are completely consistent across all shard data sources. Compared with split tables, a broadcast table has a smaller data volume and a lower update frequency, and is usually used for dictionary tables or configuration tables. Since it has a copy on every node, it can greatly reduce the network overhead of JOIN queries and improve query efficiency.

It should be noted that modifications to a broadcast table must be synchronized so that the data on all nodes stays consistent.

Features of broadcast tables:

  • In all shard data sources, the data in the broadcast table is completely consistent. Therefore, write operations on the broadcast table (such as insert, update, and delete) are performed on every shard data source in real time to ensure data consistency.

  • A query against a broadcast table only needs to be executed once, on any one shard data source.

  • JOIN operations with any other table are feasible, because the broadcast table's data is consistent on all nodes, so the same data can be accessed locally on any node.
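The first two bullet points can be sketched in a few lines: writes fan out to every shard data source, while a read is served by any single one. This is an illustrative sketch, not ShardingSphere's actual API; the data source names follow the article's DB_1/DB_2 example and the execute stubs stand in for real driver calls.

```python
# Sketch: a broadcast-table write is replayed on every shard data source,
# while a read needs only one. All names here are illustrative.

class BroadcastTable:
    def __init__(self, data_sources):
        self.data_sources = data_sources   # e.g. ["DB_1", "DB_2"]
        self.executed = []                 # record of (source, sql) pairs

    def execute_write(self, sql):
        # DML must run on every data source to keep all copies consistent
        for ds in self.data_sources:
            self.executed.append((ds, sql))

    def execute_read(self, sql):
        # a query can be served by any single data source
        return (self.data_sources[0], sql)

bt = BroadcastTable(["DB_1", "DB_2"])
bt.execute_write("UPDATE t_city SET name='x' WHERE id=1")
print(len(bt.executed))  # 2: the update ran on both data sources
```

This also makes the later warning concrete: with 1000 shard data sources, one write to the broadcast table becomes 1000 executions.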

What kind of table can be used as a broadcast table?

In an order management system, we often need to query and aggregate order data by city or province, which involves JOIN queries between the province/city table t_city and the order tables t_order_n in each database DB_n. In this case it is worth designing the province/city table as a broadcast table; the core idea is to avoid cross-database JOIN operations.

Note: as mentioned above, inserts, updates, and deletes on a broadcast table are executed in real time on every shard data source. In other words, if you have 1000 shard data sources, one modification of the broadcast table executes 1000 SQL statements. So avoid such modifications in highly concurrent environments and during business peaks, to avoid hurting system performance.

single table

A single table is a table that exists only once across all shard data sources (a table that is not sharded). It is suitable for tables with a small data volume that do not need sharding.

If a table's data volume is estimated to stay within the tens-of-millions level and it does not need associated queries with other split tables, it is recommended to configure it as a single table and store it in the default shard data source.

shard key

The shard key determines where the data lands, that is, which data node the data will be allocated for storage. Therefore, the choice of shard key is very important.

For example, after sharding the t_order table, when we insert an order record, the shard key specified in the SQL statement is parsed to calculate which shard the row should land in. Taking the order_no field as an example, we can obtain the shard number with a modulo operation (for example, order_no % 2), and then route the row to the corresponding database instance (DB_1 or DB_2) according to that number. The split table is chosen the same way.

In this process, order_no is the shard key of the t_order table. In other words, the order_no value of each order record determines which database instance and table it is stored in. Choosing a suitable field as the shard key lets us make the most of the performance gains of horizontal sharding.

In this way, all data related to the same order lands in the same database and table. When querying the order, the same calculation directly locates the data, which greatly improves retrieval performance and avoids scanning every database and table.
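The routing calculation above can be sketched in a few lines. This is a minimal illustration, assuming the order number carries a numeric part (as in the article's 'A11111' example); the 0-based table suffixes and the way the numeric key is extracted are assumptions for the sketch, not a fixed convention.

```python
# Minimal sketch of shard-key routing: 2 databases x 2 tables, with the
# shard number obtained by a modulo on the numeric part of order_no.

def route(order_no: str, db_count: int = 2, table_count: int = 2):
    """Return the (database, split table) data node for an order_no shard key."""
    key = int(order_no.lstrip("A"))         # "A11111" -> 11111 (illustrative parsing)
    db = f"DB_{key % db_count + 1}"         # DB_1 or DB_2
    table = f"t_order_{key % table_count}"  # t_order_0 or t_order_1
    return db, table

print(route("A11111"))  # ('DB_2', 't_order_1')
```

Any query that supplies order_no can run the same calculation and go straight to one data node instead of scanning all of them.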

Beyond that, ShardingSphere also supports using multiple fields together as the shard key, which will be described in detail in later chapters.

Sharding strategy

The sharding strategy specifies which sharding algorithm to use, which field serves as the shard key, and how data is distributed to different nodes.

A sharding strategy is composed of a sharding algorithm and a shard key, and a strategy can combine multiple sharding algorithms and operate on multiple shard keys.

The sharding strategy configurations for database sharding and table sharding are independent of each other and can use different strategies and algorithms. Each strategy can combine multiple sharding algorithms, and each algorithm can make its logical decision over multiple shard keys.

Sharding algorithm

The sharding algorithm operates on the shard key to divide data into specific data nodes.

There are many commonly used sharding algorithms:

  • Hash sharding: the hash value of the shard key determines which node the data lands on. For example, hash-sharding on the user ID allocates all data belonging to the same user to the same node, which simplifies subsequent queries.

  • Range sharding: shard key values are allocated to different nodes by range, for example sharding by order creation time or geographic region.

  • Modulo sharding: take the shard key value modulo the number of shards and use the result as the number of the node the data is allocated to. For example, order_no % 2 splits order data across one of two nodes.

  • .....

The sharding logic in real business development is much more complicated. Different algorithms suit different scenarios and needs and must be chosen and tuned for the actual situation.
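The three algorithms listed above can each be sketched in one small function. These are illustrative sketches under assumed node counts and range boundaries, not ShardingSphere algorithm implementations; the year-based range boundary and table names are made up for the example.

```python
# Sketches of hash, range, and modulo sharding over a shard key.
import hashlib

def hash_shard(user_id: int, nodes: int) -> int:
    # hash the shard key, then map the digest onto a node number
    digest = hashlib.md5(str(user_id).encode()).hexdigest()
    return int(digest, 16) % nodes

def range_shard(create_year: int) -> str:
    # route by a value range, e.g. orders before/after an assumed cutoff year
    return "t_order_old" if create_year < 2023 else "t_order_new"

def mod_shard(order_no: int, nodes: int) -> int:
    # shard key modulo the shard count gives the node number directly
    return order_no % nodes

print(mod_shard(11111, 2))  # 1
```

Note that hash and modulo sharding spread data evenly but make range scans expensive, while range sharding keeps adjacent keys together at the cost of possible hot spots.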

binding table

Binding tables are a group of split tables with the same sharding rules. Because the rules are consistent, related data lands in the same location, which effectively avoids cross-database JOIN operations during joint queries.

For example, the order table t_order and the order item table t_order_item both use the order_no field as the shard key and are associated through order_no, so the two tables can be bound to each other.

When using a bound table for multi-table association query, you must use the shard key for association, otherwise Cartesian product association or cross-database association will occur, which will affect query efficiency.

When running a multi-table joint query over t_order and t_order_item, the logical SQL of the joint query is as follows.

SELECT * FROM t_order o JOIN t_order_item i ON o.order_no=i.order_no

If the binding table relationship is not configured, the data location of the two tables is uncertain, so the whole set of tables must be queried, and the Cartesian product association generates the following four SQL statements.

SELECT * FROM t_order_0 o JOIN t_order_item_0 i ON o.order_no=i.order_no 
SELECT * FROM t_order_0 o JOIN t_order_item_1 i ON o.order_no=i.order_no 
SELECT * FROM t_order_1 o JOIN t_order_item_0 i ON o.order_no=i.order_no 
SELECT * FROM t_order_1 o JOIN t_order_item_1 i ON o.order_no=i.order_no 

Once the binding table relationship is configured, data produced under the consistent sharding rules lands in the same database and table, so each t_order_n only needs to be joined with the t_order_item_n in the same database.

SELECT * FROM t_order_0 o JOIN t_order_item_0 i ON o.order_no=i.order_no 
SELECT * FROM t_order_1 o JOIN t_order_item_1 i ON o.order_no=i.order_no 

Note: in a joint query, t_order acts as the main table of the whole query. All routing calculations use only the main table's strategy, so the table-sharding calculations for t_order_item also use t_order's conditions. This is why the shard keys of bound tables must be exactly the same.

SQL parsing

After database and table sharding, executing a SQL statement from the application layer usually goes through six steps: SQL parsing -> executor optimization -> SQL routing -> SQL rewriting -> SQL execution -> result merging.


SQL parsing is divided into two steps: lexical parsing and syntax parsing. For example, the following SQL query for a user's orders is first broken down by lexical parsing into indivisible atomic units; using the dictionary provided by each database dialect, these units are classified as keywords, expressions, literals, or operators.

SELECT order_no FROM t_order where  order_status > 0  and user_id = 10086 

Syntax parsing then converts the split SQL tokens into an abstract syntax tree. By traversing the abstract syntax tree, the context required for sharding is extracted, including the query fields (Field), table information (Table), query conditions (Condition), sorting information (Order By), grouping information (Group By), and paging information (Limit), and the positions in the SQL that may need to be rewritten are marked.

abstract syntax tree
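To make "extracting the sharding context" concrete, here is a grossly simplified stand-in: a real engine walks the abstract syntax tree, but for the single query above a couple of regular expressions are enough to pull out the table, the fields, and the equality conditions that routing will need. This is a teaching sketch only.

```python
# Simplified sharding-context extraction for the sample query.
# A real parser builds an AST; regexes only work for this one query shape.
import re

def extract_context(sql: str) -> dict:
    table = re.search(r"FROM\s+(\w+)", sql, re.I).group(1)
    fields = re.search(r"SELECT\s+(.+?)\s+FROM", sql, re.I).group(1)
    # equality conditions are what shard-key routing looks for
    conditions = dict(re.findall(r"(\w+)\s*=\s*(\w+)", sql))
    return {"table": table, "fields": fields, "conditions": conditions}

ctx = extract_context(
    "SELECT order_no FROM t_order where order_status > 0 and user_id = 10086")
print(ctx["table"], ctx["conditions"]["user_id"])  # t_order 10086
```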

Executor optimization

Executor optimization selects the optimal query plan based on the SQL query's characteristics and execution statistics. For example, since the user_id field has an index, the two query conditions are reordered, mainly to improve SQL execution efficiency.

SELECT order_no FROM t_order where user_id = 10086 and order_status > 0

SQL routing

The sharding context is obtained from the SQL parsing above. After matching it against the sharding strategy and algorithm configured by the user, a routing path can be calculated to route the SQL to the corresponding data nodes.

Put simply, routing takes the shard key configured in the sharding strategy, finds the value of that field in the SQL parsing result, and calculates which database and table the SQL should be executed in. Depending on whether a shard key is present, SQL routing is divided into shard routing and broadcast routing.

A route with a shard key is called a shard route, which is subdivided into three types: direct routing, standard routing, and Cartesian product routing.

standard route

Standard routing is the most recommended and most commonly used sharding method. It applies to SQL that contains no associated queries, or only associated queries between bound tables.

When the operator on the shard key is =, the routing result falls into a single database (table). When the operator is a range such as BETWEEN or IN, the routing result does not necessarily fall into a single database (table), so one logical SQL may be split into multiple real SQL statements for execution.

SELECT * FROM t_order  where t_order_id in (1,2)

After SQL routing processing

SELECT * FROM t_order_0  where t_order_id in (1,2)
SELECT * FROM t_order_1  where t_order_id in (1,2)
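The split shown above can be sketched as follows: each value in the IN list is routed independently (here by modulo 2, matching the article's example), and one real SQL is generated per distinct target table. This is an illustrative sketch; a real engine would also prune the value list per table.

```python
# Sketch of standard routing for an IN condition: route each id, then
# emit one real SQL per distinct target split table.

def route_in_query(ids, table_count=2):
    tables = sorted({f"t_order_{i % table_count}" for i in ids})
    id_list = ",".join(map(str, ids))
    return [f"SELECT * FROM {t} where t_order_id in ({id_list})" for t in tables]

for sql in route_in_query([1, 2]):
    print(sql)
```

With ids 1 and 2 the two shard numbers differ, so the logical SQL expands to both t_order_0 and t_order_1, exactly as in the example above.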

direct routing

Direct routing sends SQL straight to a specified database and table. It can be used in scenarios where the shard key is not present in the SQL, and it can execute arbitrary SQL in complex situations, including subqueries and custom functions.

Cartesian product routing

Cartesian product routing is produced by associated queries between unbound tables. For example, suppose the order table t_order is sharded by t_order_id while the user table t_user is sharded by a different key. Because the two tables are not bound, an associated query between them produces Cartesian product routing. Its query performance is low, so try to avoid this routing mode.

SELECT * FROM t_order_0 t LEFT JOIN t_user_0 u ON u.user_id = t.user_id WHERE t.user_id = 1
SELECT * FROM t_order_0 t LEFT JOIN t_user_1 u ON u.user_id = t.user_id WHERE t.user_id = 1
SELECT * FROM t_order_1 t LEFT JOIN t_user_0 u ON u.user_id = t.user_id WHERE t.user_id = 1
SELECT * FROM t_order_1 t LEFT JOIN t_user_1 u ON u.user_id = t.user_id WHERE t.user_id = 1

A route without a shard key is called broadcast routing, which can be divided into five types: full database-and-table routing, full database routing, full instance routing, unicast routing, and block routing.

Full database-and-table routing

Full database-and-table routing targets database operations such as DQL, DML, and DDL. When we execute SQL against a logical table t_order, it is executed one by one against the corresponding real tables t_order_0 ... t_order_n in all sharded databases.

Full database routing

Full database routing mainly targets database-level operations, such as SET-type database administration commands and transaction control (TCL) statements.

For example, after setting the autocommit attribute on the logical database, the command is executed in all the corresponding real databases.

SET autocommit=0;

Full instance routing

Full-instance routing is for DCL operations on database instances (setting or changing the permissions of database users or roles). For example, to create a user named order, the command is executed on every real database instance, ensuring the order user can access each instance normally.

CREATE USER 'order'@'%' IDENTIFIED BY '程序员小富';

unicast routing

Unicast routing is used to obtain information about a real table, such as the description information of the table:

DESCRIBE t_order; 

The real tables behind t_order are t_order_0 ... t_order_n, and their description structures are exactly the same, so we only need to execute the command once on any one real table.

block routing

Used to shield SQL operations on the database, for example:

USE order_db;

This command is not executed in any real database, because ShardingSphere uses a logical schema (the organization and structure of the database), so there is no need to send the database-switch command to the real databases.

SQL rewriting

After the SQL has been parsed, optimized, and routed, the specific shards to execute on are determined. Next, the SQL written against the logical table must be rewritten into statements that can execute correctly in the real databases. For example, to query the order table, the SQL we actually write in development targets the logical table t_order.

SELECT * FROM t_order

At this point the logical table name must be rewritten, according to the table sharding configuration, into the real table name obtained from routing.

SELECT * FROM t_order_n
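The rewrite step can be sketched as a name substitution. A real engine rewrites via the AST positions marked during parsing rather than string matching; the word-boundary regex below is an illustrative shortcut that at least avoids clobbering longer names like t_order_item.

```python
# Sketch of SQL rewriting: replace the logical table name with the real
# table name chosen by routing.
import re

def rewrite(logical_sql: str, logical: str, real: str) -> str:
    # \b boundaries so rewriting t_order leaves t_order_item untouched
    return re.sub(rf"\b{logical}\b", real, logical_sql)

print(rewrite("SELECT * FROM t_order", "t_order", "t_order_1"))
# SELECT * FROM t_order_1
```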

SQL execution

The routed and rewritten real SQL is sent safely and efficiently to the underlying data sources for execution. However, this process cannot simply send each SQL directly to a data source via JDBC; it has to balance the cost of creating data source connections against memory usage, automatically trading off resource control and execution efficiency.

Merge results

Result merging takes the multiple result sets returned from each data node, combines them into one large result set, and returns it correctly to the requesting client. The sorting, grouping, paging, and aggregation clauses in our SQL all operate on this merged result set.
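The sorting case of result merging can be sketched with a k-way merge: when each real SQL returns rows already ordered by the sort column, the per-shard streams can be merged without re-sorting everything. The row data below is made up for illustration.

```python
# Sketch of ordered result merging: each shard returns rows sorted by
# order_no; heapq.merge produces the globally sorted result set.
import heapq

shard_0_rows = [("A1", 100), ("A3", 300)]   # from t_order_0, already sorted
shard_1_rows = [("A2", 200), ("A4", 400)]   # from t_order_1, already sorted

merged = list(heapq.merge(shard_0_rows, shard_1_rows))
print([r[0] for r in merged])  # ['A1', 'A2', 'A3', 'A4']
```

Grouping, aggregation, and paging are handled similarly on top of the merged stream, which is why a LIMIT with a large offset is expensive after sharding: every shard must still return offset+limit rows.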

distributed primary key

After data sharding, one logical table (t_order) corresponds to many real tables (t_order_n). Since the real tables cannot perceive each other and their primary key IDs all grow from the same initial value, duplicate primary key IDs are inevitable, and a primary key that is no longer unique is meaningless to the business.

Although ID collisions can be avoided by giving each table's auto-increment primary key a different initial value and step size, this increases maintenance costs and scales poorly.

At this point we need to assign each data record a globally unique ID ourselves. Such an ID is called a distributed ID, and the system that produces it is usually called an ID issuer.
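A common way to build such an issuer is a snowflake-style ID: a timestamp, a worker id, and a per-millisecond sequence packed into one integer. The sketch below follows the widely used bit layout (41/10/12) with an assumed custom epoch; a production issuer must also handle clock rollback, which is omitted here.

```python
# Sketch of a snowflake-style distributed ID issuer.
# Layout: | 41-bit timestamp | 10-bit worker id | 12-bit sequence |
import time

class IdIssuer:
    EPOCH = 1672531200000  # custom epoch (2023-01-01 UTC), an assumption

    def __init__(self, worker_id: int):
        assert 0 <= worker_id < 1024      # must fit in 10 bits
        self.worker_id = worker_id
        self.sequence = 0
        self.last_ms = -1

    def next_id(self) -> int:
        now = int(time.time() * 1000)
        if now == self.last_ms:
            # same millisecond: bump the 12-bit sequence
            self.sequence = (self.sequence + 1) & 0xFFF
        else:
            self.sequence = 0
        self.last_ms = now
        return ((now - self.EPOCH) << 22) | (self.worker_id << 12) | self.sequence

issuer = IdIssuer(worker_id=1)
a, b = issuer.next_id(), issuer.next_id()
print(a != b)  # True: IDs are unique and monotonically increasing
```

Because the high bits are a timestamp, IDs issued by one worker are roughly time-ordered, which also makes them reasonable shard keys.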

data desensitization

Data desensitization after database and table sharding is an effective data protection measure that keeps sensitive data confidential and secure and reduces the risk of data leakage.

For example, when configuring sharding we can specify which fields of a table are desensitized columns and set the corresponding desensitization algorithm. When data is sharded, the fields to be desensitized are parsed out of the executed SQL and their values are desensitized directly before being written to the database table.

For the user's personal information, such as name, address, and phone number, etc., it can be desensitized by encrypting, randomizing, or replacing with pseudo-random data to ensure that the user's privacy is protected.
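A write-time masking step for such fields might look like the sketch below. The masking rules (keep the first three and last four digits of a phone number, keep only the first character of a name) are common illustrative choices, not a prescribed algorithm.

```python
# Sketch of write-time desensitization: mask sensitive column values
# before the rewritten SQL is written to the real table.

def mask_phone(phone: str) -> str:
    return phone[:3] + "****" + phone[-4:]

def mask_name(name: str) -> str:
    return name[0] + "*" * (len(name) - 1)

row = {"name": "Alice", "phone": "13812345678"}
masked = {"name": mask_name(row["name"]), "phone": mask_phone(row["phone"])}
print(masked["phone"])  # 138****5678
```

Masking is irreversible; when the original value must be recoverable (e.g. for customer service), reversible encryption of the column is used instead.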

distributed transaction

The core issue of distributed transactions is how to implement atomic operations across multiple data sources.

Since different services often use different data sources to store and manage data, operations across data sources may lead to the risk of data inconsistency or loss. Therefore, it is very important to ensure the consistency of distributed transactions.

Taking the order system as an example, it needs to call multiple systems such as the payment system, inventory system, and point system, and each system maintains its own database instance, and the systems exchange data through API interfaces.

To ensure that all the systems involved succeed together after an order is placed, you can use strongly consistent transactions (the XA protocol) or flexible (BASE) transactions (represented by tools such as Seata) to achieve distributed transaction consistency. These tools help developers simplify the implementation of distributed transactions, reduce errors and vulnerabilities, and improve system stability and reliability.

After database and table sharding the problem gets even harder: the order service itself now also has to handle operations across data sources, so system complexity rises significantly. Therefore, it is best to avoid database and table sharding unless there is no other choice.

data migration

There is still one headache after sharding: data migration. To avoid affecting the existing business system, a new database cluster is usually created for the migration, and data is moved from the databases and tables of the old cluster into the sharded databases and tables of the new cluster. This is a fairly complicated process, and many factors such as data volume, data consistency, and migration speed must be considered.

Migration mainly deals with stock data and incremental data. Stock data is the existing, valuable historical data in the old data source; incremental data is the business data that keeps growing now and will be generated in the future.

Stock data can be migrated in batches at scheduled intervals, and the migration may last several days.

Incremental data can adopt a double-write mode across the old and new database clusters. Once the migration is complete and the business has verified data consistency, the application can switch its data source directly.
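The double-write mode for incremental data can be sketched as follows: during the migration window every write goes to both clusters, so the new cluster stays caught up while stock data is backfilled. The list-backed "databases" and method name are illustrative stand-ins for real data sources.

```python
# Sketch of the double-write pattern used during data migration.

class DualWriter:
    def __init__(self, old_db, new_db):
        self.old_db, self.new_db = old_db, new_db

    def insert_order(self, order):
        self.old_db.append(order)   # old cluster remains the source of truth
        self.new_db.append(order)   # new cluster receives the same write

old, new = [], []
writer = DualWriter(old, new)
writer.insert_order({"order_no": "A11111"})
print(old == new)  # True: both clusters hold the same incremental data
```

In practice the second write is often done asynchronously (e.g. by replaying the binlog) so that a slow new cluster cannot stall the live business path.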

Later, we will demonstrate the migration process with third-party tools.

shadow database

What is a shadow database?

A shadow database is an instance with the same structure as the production database. It exists to verify the correctness of database migrations and other database changes, and to support full-link stress testing, without affecting the online system. The data in the shadow database is copied from production at regular intervals, but it has no impact on online business and is used only for testing, verification, and debugging.

Before operations such as a database upgrade, version change, or parameter tuning, potential problems can be discovered by rehearsing them on the shadow database first, because data in an ordinary test environment is not reliable enough.

When using a shadow database, the following principles should be followed:

  • The structure must be completely consistent with the production database, including table structure, indexes, constraints, etc.;

  • The data must be consistent with the production environment, which can be achieved through regular synchronization;

  • Read and write operations must not affect the production environment; in general, operations such as updates and deletes on the shadow database should be prohibited;

  • Due to the data characteristics of the shadow database, access rights should be strictly controlled, and only authorized personnel are allowed to access and operate;


Origin blog.csdn.net/m0_37723088/article/details/130978625