One article to understand the 21 core concepts of database and table sharding architecture

This article introduces 21 general concepts of the database and table sharding (sub-database, sub-table) architecture. After understanding them, we will move on to more in-depth content, including read-write separation, data desensitization, distributed primary keys, distributed transactions, configuration center, registration center, proxy services, and other hands-on case studies with source code analysis.

(1) If the system works fine, why do we need to shard databases and tables?

Let's first introduce some general concepts that we will encounter while implementing a sharded architecture. Understanding these concepts also helps in understanding the other sharding tools on the market: although their implementations differ, the overall ideas are basically the same. So before getting hands-on, it is worth mastering these general concepts in order to better understand and apply sharding technology.

We combine a concrete business scenario and take the t_order table as the example for architecture optimization. Since its data volume has reached hundreds of millions of rows, query performance has degraded severely, so we adopt sharding to deal with the problem. Specifically, we split the original single database into two databases, DB_1 and DB_2, and within each database split the table again into two tables, t_order_1 and t_order_2, thereby sharding the order table across both databases and tables.

data sharding

When we talk about sharding databases and tables, we usually mean the horizontal splitting mode (horizontal database sharding and table sharding). Data sharding splits the original large table t_order into several split tables with exactly the same structure, t_order_0, t_order_1, ..., t_order_n, where each split table stores only part of the data of the original large table.

data node

A data node is the smallest indivisible unit (a table) in data sharding, and it consists of a data source name plus a data table. For example, DB_1.t_order_1 and DB_2.t_order_2 each represent one data node.

logical table

Logical table refers to the logical name of a group of horizontally split tables with the same structure.

For example, we split the order table t_order into 10 split tables, t_order_0 ... t_order_9. At this point the t_order table no longer exists in the database; it is replaced by the t_order_n tables.

Sharding is usually non-intrusive to business code: developers focus only on writing business SQL. In code we still write SQL against t_order, and before execution the logical SQL is parsed into the real SQL to be executed on the corresponding databases. Here t_order is the logical table of those split tables.

business logic SQL

select * from t_order where order_no='A11111'

Actually executed SQL

select * from DB_1.t_order_n where order_no='A11111'

real table

A real table is a physical table that actually exists in the database, such as DB_1.t_order_n.

broadcast table

A broadcast table is a special type of table whose structure and data are completely consistent across all shard data sources. Compared with split tables, a broadcast table has a small data volume and a low update frequency, and it is typically used for dictionary tables, configuration tables, and similar scenarios. Since it has a copy on every node, it can greatly reduce the network overhead of JOIN queries and improve query efficiency.

Note that modifications to a broadcast table must be synchronized so that the data on all nodes stays consistent.

Features of Broadcast Table:

  • The data of a broadcast table is completely consistent across all shard data sources, so write operations (insert, update, delete) are executed on every shard data source in real time to keep the data consistent.
  • A query against a broadcast table only needs to be executed once, on any one shard data source.
  • It can be JOINed with any other table, because its data is consistent on every node, so the same data can be accessed from any node.

What kind of table can be used as a broadcast table?

In an order management system, we often need to query and aggregate the order data of a certain city or region, which involves JOINs between the province/region table t_city and the order tables DB_n.t_order_n. The province/region table can therefore be designed as a broadcast table; the core idea is to avoid cross-database JOIN operations.

Note: as mentioned above, inserts, updates, and deletes on a broadcast table are executed on every shard data source in real time. That is, if you have 1000 shard data sources, one modification of the broadcast table executes the SQL 1000 times, so try not to do this under heavy concurrency or during business peaks, to avoid hurting system performance.
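To make the write fan-out concrete, here is a minimal sketch of how a broadcast-table write could be replayed on every shard data source. This illustrates the concept only and is not ShardingSphere internals; the DataSource list and the t_city insert are assumptions.

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.List;
import javax.sql.DataSource;

public class BroadcastTableWriter {
    // A broadcast write is replayed once per shard data source, so every copy
    // of the dictionary table (here t_city) stays consistent.
    public static void insertCity(List<DataSource> shardDataSources,
                                  long cityId, String cityName) throws SQLException {
        String sql = "INSERT INTO t_city (city_id, city_name) VALUES (?, ?)";
        for (DataSource ds : shardDataSources) { // N shard data sources => N executions
            try (Connection conn = ds.getConnection();
                 PreparedStatement ps = conn.prepareStatement(sql)) {
                ps.setLong(1, cityId);
                ps.setString(2, cityName);
                ps.executeUpdate();
            }
        }
    }
}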

single table

A single table is a table that exists only once across all shard data sources (a table that is not sharded). It is suitable for tables with a small data volume that do not need sharding.

If a table's data volume is expected to stay within the tens of millions of rows and it does not need join queries with other split tables, it is recommended to configure it as a single table and store it in the default shard data source.

shard key

The shard key determines where data lands, that is, which data node a row will be allocated to for storage. The choice of shard key is therefore very important.

For example, after we shard the t_order table, when an order row is inserted, the shard key specified in the SQL statement is parsed to calculate which shard the row should fall into. Taking the order_no field as an example, we can obtain a shard number with a modulo operation (such as order_no % 2) and then allocate the row to the corresponding database instance (DB_1 or DB_2) according to that number. The split table is calculated the same way.

In this process, order_no is the shard key of the t_order table. In other words, the order_no value of each order row determines the database instance and the table it is stored in. Choosing a suitable field as the shard key lets us fully benefit from the performance gains of horizontal sharding.

This way, all data related to the same order lands in the same database and table. When querying an order, the same calculation locates the data directly, which greatly improves retrieval performance and avoids scanning every database and table.
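A minimal sketch of this routing calculation: since order_no in the example is a string like 'A11111', we assume it is hashed first so a modulo can be taken. The helper below is illustrative, not how any specific middleware implements it.

public class OrderShardRouter {
    private static final int DB_COUNT = 2;    // DB_1 and DB_2
    private static final int TABLE_COUNT = 2; // t_order_1 and t_order_2 in each database

    // Maps an order_no to the data node (database.table) that stores the row.
    public static String route(String orderNo) {
        int hash = orderNo.hashCode(); // order_no is a string, so hash it before the modulo
        int dbIndex = Math.floorMod(hash, DB_COUNT) + 1;       // floorMod avoids negative results
        int tableIndex = Math.floorMod(hash, TABLE_COUNT) + 1;
        return "DB_" + dbIndex + ".t_order_" + tableIndex;
    }

    public static void main(String[] args) {
        // The same order_no always lands on the same node, for writes and reads alike.
        System.out.println(route("A11111"));
    }
}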

Not only that, ShardingSphere also supports sharding based on multiple fields as sharding keys, which will be described in detail in subsequent corresponding chapters.

Sharding strategy

The sharding strategy specifies which sharding algorithm to use, which field to use as the shard key, and how to distribute data across different nodes.

A sharding strategy is composed of a shard key and sharding algorithms; one strategy can use multiple sharding algorithms and operate on multiple shard keys.

The sharding strategies for database sharding and for table sharding are configured independently and can use different strategies and algorithms. Each strategy can combine multiple sharding algorithms, and each sharding algorithm can make logical judgments on multiple shard keys.

Sharding algorithm

The sharding algorithm operates on the shard key and determines which specific data node the data is placed on.

There are many commonly used sharding algorithms:

  • Hash sharding: the hash value of the shard key determines which node the data falls on. For example, hash-sharding by user ID allocates all data belonging to the same user to the same node, which facilitates subsequent queries.
  • Range sharding: shard key values are allocated to different nodes by range. For example, sharding by order creation time or by geographic location.
  • Modulo sharding: take the shard key value modulo the number of shards, and use the result as the number of the node the data is allocated to. For example, order_no % 2 splits order data across one of two nodes.
  • ......

Sharding logic in real business development is far more complicated. Different algorithms suit different scenarios and requirements, and they need to be chosen and adjusted according to the actual situation.
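For example, here is a minimal sketch of the range-sharding idea from the list above, assuming orders are split into one table per year of the creation time; the table naming is hypothetical.

import java.time.LocalDate;

public class RangeShardingExample {
    // Range sharding: the shard key value (the order creation date) decides the node;
    // here every year of orders lands in its own split table.
    public static String tableFor(LocalDate createDate) {
        return "t_order_" + createDate.getYear(); // e.g. t_order_2023
    }

    public static void main(String[] args) {
        System.out.println(tableFor(LocalDate.of(2023, 5, 12))); // t_order_2023
    }
}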

binding table

Binding tables are a group of split tables with identical sharding rules. Because the sharding rules are consistent, related data lands in the same location, which effectively avoids cross-database operations during JOIN queries.

For example, the t_order order table and the t_order_item order item table both use the order_no field as the shard key and are joined on order_no, so the two tables are bound to each other.

When running a multi-table join on bound tables, the join must use the shard key; otherwise a Cartesian-product join or a cross-database join occurs, which hurts query efficiency.

When we run a multi-table join query on the t_order and t_order_item tables, the logical SQL of the join is as follows.

SELECT * FROM t_order o JOIN t_order_item i ON o.order_no=i.order_no

If the binding-table relationship is not configured, the data location of the two tables is uncertain, so the whole range of split tables has to be queried, and the Cartesian-product join generates the following four SQLs.

SELECT * FROM t_order_0 o JOIN t_order_item_0 i ON o.order_no=i.order_no 
SELECT * FROM t_order_0 o JOIN t_order_item_1 i ON o.order_no=i.order_no 
SELECT * FROM t_order_1 o JOIN t_order_item_0 i ON o.order_no=i.order_no 
SELECT * FROM t_order_1 o JOIN t_order_item_1 i ON o.order_no=i.order_no

When the join query runs after the binding-table relationship is configured, data produced under the consistent sharding rules lands in the same database and table, so it is only necessary to join t_order_n with t_order_item_n within the same database.

SELECT * FROM t_order_0 o JOIN t_order_item_0 i ON o.order_no=i.order_no 
SELECT * FROM t_order_1 o JOIN t_order_item_1 i ON o.order_no=i.order_no

Note: in the join query, t_order serves as the main table of the whole query. All routing calculations use only the main table's strategy, and the sharding calculations for t_order_item also use t_order's conditions, so the shard keys of bound tables must be exactly the same.

SQL parsing

After sharding, executing a SQL statement from the application layer usually goes through the following six steps: SQL parsing -> executor optimization -> SQL routing -> SQL rewriting -> SQL execution -> result merging.

SQL parsing is divided into two steps: lexical parsing and syntax parsing. Take the user-order query SQL below as an example. Lexical parsing first breaks the SQL into indivisible atomic units, and, based on the dictionary provided by each database dialect, classifies these units as keywords, expressions, literals, or operators.

SELECT order_no FROM t_order where  order_status > 0  and user_id = 10086

Syntax parsing then converts these units into an abstract syntax tree, and extracts the context required for sharding by traversing the tree. The context includes the query fields (Field), tables (Table), query conditions (Condition), sorting (Order By), grouping (Group By), and pagination (Limit), and the positions in the SQL that may need rewriting are marked.

(Figure: abstract syntax tree of the parsed SQL)

Executor optimization

Executor optimization selects and executes the optimal query plan based on the characteristics of the SQL query and execution statistics. For example, if the user_id field is indexed, the order of the two query conditions is adjusted, mainly to improve SQL execution efficiency.

SELECT order_no FROM t_order where user_id = 10086 and order_status > 0

SQL routing

The sharding context obtained from the SQL parsing above is matched against the sharding strategy and algorithm configured by the user, and a routing path is calculated that routes the SQL to the corresponding data nodes.

Simply put: take the shard key configured in the sharding strategy, find the value of that shard key field in the SQL parsing result, and calculate which database and table the SQL should execute against. Depending on whether a shard key is present, SQL routing is divided into shard routing and broadcast routing.

Routing with a shard key is called shard routing, which is subdivided into three types: direct routing, standard routing, and Cartesian-product routing.

standard routing

Standard routing is the most recommended and most commonly used sharding method. It applies to SQL that contains no joins, or only joins between bound tables.

When the operator on the shard key is =, the routing result falls into a single database (table). When the operator is a range such as BETWEEN or IN, the routing result does not necessarily fall into a single database (table), so one logical SQL may eventually be split into multiple real SQLs for execution.

SELECT * FROM t_order  where t_order_id in (1,2)

After SQL routing processing

SELECT * FROM t_order_0  where t_order_id in (1,2)
SELECT * FROM t_order_1  where t_order_id in (1,2)

direct routing

Direct routing is a sharding method that routes SQL straight to the specified database and table. Direct routing can be used in scenarios where the shard key does not appear in the SQL, and it can also execute arbitrarily complex SQL, including subqueries and custom functions.
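Direct routing is typically driven by a hint supplied by the application rather than by the SQL itself. A minimal sketch using ShardingSphere's HintManager (the package path shown is the 4.x one and may differ across versions; treat the details as illustrative):

import org.apache.shardingsphere.api.hint.HintManager; // 4.x package; 5.x moved it to infra.hint

public class DirectRouteExample {
    public static void queryWithHint() {
        // HintManager carries the sharding values outside of the SQL.
        try (HintManager hintManager = HintManager.getInstance()) {
            hintManager.addDatabaseShardingValue("t_order", 1); // force the database shard
            hintManager.addTableShardingValue("t_order", 1);    // force the table shard
            // Execute SQL with no shard key here, e.g. "SELECT * FROM t_order WHERE order_status > 0";
            // it is routed directly to the hinted database and table.
        }
    }
}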

Cartesian product routing

Cartesian-product routing is generated by join queries between unbound tables. For example, the shard key of the order table t_order is t_order_id, while the shard key of the user table t_user is user_id. The two tables' shard keys are different, so when a join query between them is needed, Cartesian-product routing is executed. Its query performance is low, so try to avoid this routing mode.

SELECT * FROM t_order_0 t LEFT JOIN t_user_0 u ON u.user_id = t.user_id WHERE t.user_id = 1
SELECT * FROM t_order_0 t LEFT JOIN t_user_1 u ON u.user_id = t.user_id WHERE t.user_id = 1
SELECT * FROM t_order_1 t LEFT JOIN t_user_0 u ON u.user_id = t.user_id WHERE t.user_id = 1
SELECT * FROM t_order_1 t LEFT JOIN t_user_1 u ON u.user_id = t.user_id WHERE t.user_id = 1

Routing without a shard key is also called broadcast routing, which is divided into five types: full database-and-table routing, full database routing, full instance routing, unicast routing, and block routing.

Full database-and-table routing

Full database-and-table routing targets DQL, DML, and DDL operations on tables. When we execute SQL against the logical table t_order, it is executed one by one against the corresponding real tables t_order_0 ... t_order_n in every sharded database.

Full database routing

Full database routing mainly targets database-level operations, such as SET-type database management commands and TCL transaction control statements.

After setting the autocommit attribute on the logical database, the command is executed on all corresponding real databases.

SET autocommit=0;

Full instance routing

Full instance routing targets DCL operations on database instances (creating users or changing user and role permissions). For example, a command that creates a user named order is executed on every real database instance, ensuring that the order user can access each instance normally.

CREATE USER 'order' IDENTIFIED BY '程序员小富';

unicast routing

Unicast routing is used to obtain information about a real table, such as the table's description:

DESCRIBE t_order;

The real tables of t_order are t_order_0 ... t_order_n, and their structures are exactly the same, so we only need to execute the statement once on any one of them.

block routing

Block routing is used to shield SQL operations from the real databases, for example:

USE order_db;

This command is not executed in any real database, because ShardingSphere uses a logical schema (the organization and structure of a database), so there is no need to forward the database-switch command to the real databases.

SQL rewriting

After SQL parsing, optimization, and routing, the specific shard locations to execute on have been determined. The SQL written against the logical table must now be rewritten into statements that can execute correctly in the real databases. For example, to query the t_order order table, the SQL we write in actual development targets the logical table t_order.

SELECT * FROM t_order

At this point the logical table name must be rewritten, according to the sharding configuration, into the real table name(s) obtained from routing.

SELECT * FROM t_order_n

SQL execution

The routed and rewritten real SQL is then sent to the underlying data sources for execution safely and efficiently. This process does not simply send the SQL to the data sources via JDBC; it has to weigh the cost of creating data source connections against memory usage, automatically balancing resource control and execution efficiency.

Merge results

Merging the result sets obtained from each data node into one large result set and returning it correctly to the requesting client is called result merging. The sorting, grouping, pagination, and aggregation clauses in our SQL all operate on this merged result set.
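As an illustration of the ORDER BY case, here is a minimal sketch that k-way merges per-shard result lists that are each already sorted, the way a streaming merge avoids re-sorting everything in memory. The data shapes are assumptions, not ShardingSphere internals.

import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

public class SortMerger {
    // Each shard returns rows already sorted by the ORDER BY key; a priority queue
    // repeatedly pulls the globally smallest head element across all shards.
    public static List<Long> mergeSorted(List<List<Long>> perShardResults) {
        // heap entry = {value, shardIndex, offsetInShard}, ordered by value
        PriorityQueue<long[]> heap = new PriorityQueue<>((a, b) -> Long.compare(a[0], b[0]));
        for (int s = 0; s < perShardResults.size(); s++) {
            if (!perShardResults.get(s).isEmpty()) {
                heap.add(new long[]{perShardResults.get(s).get(0), s, 0});
            }
        }
        List<Long> merged = new ArrayList<>();
        while (!heap.isEmpty()) {
            long[] top = heap.poll();
            merged.add(top[0]);
            int shard = (int) top[1];
            int next = (int) top[2] + 1;
            List<Long> rows = perShardResults.get(shard);
            if (next < rows.size()) {
                heap.add(new long[]{rows.get(next), shard, next});
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        // Two shards, each sorted locally; the merge yields the global order.
        System.out.println(mergeSorted(List.of(List.of(1L, 4L, 7L), List.of(2L, 3L, 9L))));
        // prints [1, 2, 3, 4, 7, 9]
    }
}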

distributed primary key

After data sharding, one logical table (t_order) corresponds to many real tables (t_order_n). Since the real tables cannot perceive one another, their auto-increment primary keys all count up from the same initial value, so duplicate primary key IDs inevitably appear, and a primary key that is not unique is meaningless to the business.

Although ID collisions can be avoided by setting a different initial value and step size for each table's auto-increment primary key, this increases maintenance cost and scales poorly.

At this point we need to assign a globally unique ID to each data record ourselves. Such an ID is called a distributed ID, and the system that produces these IDs is usually called an ID issuer.
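A common issuer design is the snowflake scheme: a 64-bit ID built from a timestamp, a worker ID, and a per-millisecond sequence. A minimal single-node sketch follows; the bit layout and custom epoch are illustrative choices, and clock-rollback handling is omitted.

public class SnowflakeIdGenerator {
    // Illustrative 64-bit layout: 41 bits timestamp | 10 bits worker id | 12 bits sequence.
    private static final long EPOCH = 1672531200000L; // custom epoch: 2023-01-01 UTC
    private final long workerId;   // must be unique per issuer node (0..1023)
    private long lastTimestamp = -1L;
    private long sequence = 0L;

    public SnowflakeIdGenerator(long workerId) {
        this.workerId = workerId;
    }

    public synchronized long nextId() {
        long now = System.currentTimeMillis();
        if (now == lastTimestamp) {
            sequence = (sequence + 1) & 0xFFF; // up to 4096 ids per millisecond per worker
            if (sequence == 0) {               // sequence exhausted: spin until the next millisecond
                while ((now = System.currentTimeMillis()) <= lastTimestamp) { }
            }
        } else {
            sequence = 0L;
        }
        lastTimestamp = now;
        // IDs are globally unique and roughly time-ordered across all shards.
        return ((now - EPOCH) << 22) | (workerId << 12) | sequence;
    }
}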

You can refer to my earlier article, 9 Kinds of Distributed ID Generation Schemes.

data desensitization

Data desensitization in a sharded architecture is an effective data protection measure. It ensures the confidentiality and security of sensitive data and reduces the risk of data leakage.

For example, when configuring sharding we can specify which fields of a table are desensitized columns and set the corresponding desensitization algorithm. When data is sharded, the fields to be desensitized are parsed out of the SQL being executed, their values are desensitized on the fly, and the result is then written to the database table.

Personal user information such as names, addresses, and phone numbers can be desensitized by encryption, randomization, or replacement with masked data, ensuring that the user's privacy is protected.
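As a simple illustration of the replacement-style masking mentioned above, here is a sketch of a phone-number masking helper; the rule shown (keep the first 3 and last 4 digits of an 11-digit number) is just one common convention.

public class PhoneMasker {
    // 13812345678 -> 138****5678: the middle digits are replaced before the value is persisted.
    public static String maskPhone(String phone) {
        if (phone == null || phone.length() != 11) {
            return phone; // this sketch leaves unexpected values untouched
        }
        return phone.substring(0, 3) + "****" + phone.substring(7);
    }

    public static void main(String[] args) {
        System.out.println(maskPhone("13812345678")); // 138****5678
    }
}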

You can refer to my earlier article, 6 Kinds of Data Desensitization Schemes That Big Companies Also Use.

distributed transaction

The core issue of distributed transactions is how to implement atomic operations across multiple data sources.

Since different services often use different data sources to store and manage data, operations across data sources may lead to the risk of data inconsistency or loss. Therefore, it is very important to ensure the consistency of distributed transactions.

Take the order system as an example. It needs to call the payment system, the inventory system, the point system, and others; each system maintains its own database instance, and the systems exchange data through API interfaces.

To ensure that all of these systems succeed together after an order is placed, you can use the XA protocol for strongly consistent transactions, or a representative flexible-transaction tool such as Seata, to achieve distributed transaction consistency. These tools help developers simplify the implementation of distributed transactions, reduce errors and vulnerabilities, and improve system stability and reliability.
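For instance, with Seata the business entry point is typically annotated so that the downstream calls join one global transaction. A minimal hedged sketch, where the three client calls are placeholders for the real remote services:

import io.seata.spring.annotation.GlobalTransactional;
import org.springframework.stereotype.Service;

@Service
public class OrderService {
    // Seata opens a global transaction here; the payment, inventory, and point
    // operations join it as branches and commit or roll back together.
    @GlobalTransactional(rollbackFor = Exception.class)
    public void placeOrder(String orderNo) {
        // paymentClient.pay(orderNo);      // placeholder remote call
        // inventoryClient.deduct(orderNo); // placeholder remote call
        // pointClient.award(orderNo);      // placeholder remote call
    }
}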

After sharding, the problem becomes even harder: the order service itself now also has to handle cross-data-source operations, so the complexity of the system rises significantly. Therefore, unless there is no other way, it is best to avoid the sharding solution altogether.

For a detailed introduction to distributed transactions, you can refer to my earlier article, Comparing 5 Distributed Transaction Solutions, I Still Favor Alibaba's Seata (Principles + Practice).

data migration

There is still a headache after sharding: data migration. To avoid affecting the existing business system, a new database cluster is usually created, and data is migrated from the databases and tables of the old cluster into the sharded databases and tables of the new cluster. This is a fairly complicated process; factors such as data volume, data consistency, and migration speed all need to be considered.

Migration mainly deals with stock data and incremental data. Stock data is the existing, valuable historical data in the old data source; incremental data is the business data that keeps growing and will be generated in the future.

Stock data can be migrated periodically in batches, and the migration may last several days.

For incremental data, the old and new database clusters can run in double-write mode, as shown in the sketch below. After the migration is complete and the business has verified data consistency, the application can switch its data source directly.
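A minimal sketch of the double-write idea, assuming the application holds handles to both the old single database and the new sharded cluster; the names and the t_order insert are illustrative.

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import javax.sql.DataSource;

public class DoubleWriteOrderDao {
    private final DataSource oldDb;      // the legacy single database
    private final DataSource newCluster; // the new sharded cluster (e.g. behind a sharding proxy)

    public DoubleWriteOrderDao(DataSource oldDb, DataSource newCluster) {
        this.oldDb = oldDb;
        this.newCluster = newCluster;
    }

    // During migration every write goes to both sides; reads stay on the old database
    // until consistency is verified, after which the data source is switched over.
    public void insertOrder(String orderNo) throws SQLException {
        String sql = "INSERT INTO t_order (order_no) VALUES (?)";
        for (DataSource ds : new DataSource[]{oldDb, newCluster}) {
            try (Connection conn = ds.getConnection();
                 PreparedStatement ps = conn.prepareStatement(sql)) {
                ps.setString(1, orderNo);
                ps.executeUpdate();
            }
        }
    }
}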

Later, we will use third-party tools to demonstrate the migration process in practice.

shadow database

What is a shadow database?

A shadow database is an instance with the same structure as the production database. It exists to verify the correctness of database migrations and other database changes, and to support full-link stress testing, without affecting the online system. The data in the shadow database is periodically copied from the production environment, but it has no impact on online business and is used only for testing, verification, and debugging.

Before operations such as database upgrades, version changes, and parameter tuning, potential problems can be found by first rehearsing them on the shadow database, because data in an ordinary test environment is unreliable.

When using a shadow database, the following principles need to be followed:

  • Its structure must be completely consistent with the production database, including table structures, indexes, constraints, and so on;
  • Its data must be consistent with the production environment, which can be achieved through periodic synchronization;
  • Reads and writes must not affect the production environment; in general, update and delete operations on the shadow database should be prohibited;
  • Given the nature of the shadow database's data, access rights should be strictly controlled, and only authorized personnel should be allowed to access and operate it;

Summary

This article introduced 21 general concepts of the database and table sharding architecture. With these in hand, we will move on to more in-depth content, including read-write separation, data desensitization, distributed primary keys, distributed transactions, configuration center, registration center, proxy services, and other hands-on case studies with source code analysis.

Origin blog.csdn.net/xxxzzzqqq_/article/details/130677763