Comprehensive knowledge of sharding and Sharding-JDBC practice


1. Why should we divide databases and tables?

1.1 What is a sub-database?

A single database is split into multiple databases deployed on different machines.


1.2 What is a sub-table?

A single database table is split into multiple tables.


1.3 Why split the database?

If the business volume increases sharply, the database may experience performance bottlenecks. At this time, we need to consider splitting the database.

1.3.1 Disk storage

When business volume grows dramatically, the disk capacity of a single MySQL machine may become insufficient. Splitting into multiple databases greatly reduces the disk usage on each machine.

1.3.2 Concurrent connection support

In high-concurrency scenarios, a large number of requests hit the database, and a single MySQL instance cannot handle them all. Microservice architecture emerged precisely to cope with high concurrency: it splits modules such as orders, users, and products into separate applications, and likewise splits a single database into multiple function-specific databases (order database, user database, product database) to share the read and write pressure.

1.4 Why split tables?

If the amount of data is too large, SQL queries slow down. If a query misses the index, a table with millions of rows can overwhelm the database.
Even if the SQL hits the index, once a table holds more than tens of millions of rows, queries slow down noticeably.
With a B+ tree structure, at the tens-of-millions scale the height of the tree grows and queries become slower.

Tips:
MySQL's InnoDB storage engine organizes data as a B+ tree, and the number of disk IOs per lookup equals the height of the tree.
The B+ tree has a high fan-out: each non-leaf node stores only primary key values (for the primary key index) and pointers, while the actual rows live in the leaf nodes.
Its smallest storage unit is a page, 16KB by default, so each B+ tree node can hold 16KB of data. Assuming one row takes 1KB, one leaf node can store 16 rows.
Since real data is stored only in leaf nodes, a leaf holds those 16 rows. Assuming a bigint primary key (8 bytes) and a 6-byte pointer, each key-pointer pair takes 14 bytes, so one non-leaf node can hold roughly 16KB / 14B ≈ 1170 pointers to the next level. A 2-level B+ tree can therefore store about 1170 × 16 = 18,720 rows, and a 3-level tree about 1170 × 1170 × 16 = 21,902,400 rows. Beyond that, the tree grows taller.
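The arithmetic above is easy to verify. A minimal sketch, assuming the 16KB page size, 1KB row size, and 14-byte key-plus-pointer pair stated in the tip:

public class BPlusTreeCapacity {
    public static void main(String[] args) {
        long pageSize = 16 * 1024;          // default InnoDB page size: 16KB
        long rowSize = 1024;                // assumed row size: 1KB
        long keyPlusPointer = 8 + 6;        // bigint key (8B) + pointer (6B)

        long rowsPerLeaf = pageSize / rowSize;            // 16 rows per leaf page
        long pointersPerNode = pageSize / keyPlusPointer; // about 1170 pointers per internal node

        System.out.println("2-level tree: " + pointersPerNode * rowsPerLeaf + " rows");                   // 18720
        System.out.println("3-level tree: " + pointersPerNode * pointersPerNode * rowsPerLeaf + " rows"); // 21902400
    }
}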

2. Sub-database and sub-table solution

Common splitting schemes fall into two categories: vertical (longitudinal) segmentation and horizontal (lateral) segmentation.

2.1 Vertical (longitudinal) segmentation

Vertical segmentation comes in two common forms: vertical database splitting and vertical table splitting.
Vertical database splitting stores tables with low correlation in different databases according to business coupling. The approach resembles breaking a large system into several small systems divided independently by business, much like "microservice governance", where each microservice uses its own database.
Vertical table splitting works on the "columns" of a table. If a table has many fields, you can create an extension table and move the fields that are rarely used or very long into it. When a table has a great many fields (say more than 100), splitting the large table into smaller ones is easier to develop and maintain and also avoids the cross-page problem: MySQL stores records in data pages at the bottom layer, and a record that takes too much space spans pages, which causes extra performance overhead. In addition, the database loads data into memory in units of rows; with shorter rows and higher access frequency, memory holds more rows, the hit rate rises, disk IO drops, and database performance improves.

2.1.1 Advantages of vertical segmentation

Solves coupling at the business-system level and makes the business clearer.
Similar to microservice governance, it allows data of different businesses to be managed, maintained, monitored, and scaled separately.
In high-concurrency scenarios, vertical segmentation raises the ceilings on IO, database connections, and single-machine hardware resources to a certain extent.

2.1.2 Disadvantages of vertical segmentation

Some tables cannot be joined and can only be solved through interface aggregation, which increases the complexity of development.
Distributed transaction processing is complex.
There is still the problem of excessive data volume in a single table (needs horizontal segmentation).

2.2 Horizontal (lateral) segmentation

Horizontal segmentation comes in two forms: table splitting within one database, and splitting across databases. It applies when a table is hard to split further by fine-grained vertical segmentation, or when the data volume after splitting is still so large that a single database hits read, write, or storage performance bottlenecks; at that point horizontal segmentation is needed.
Logically it is still one table, but its rows are dispersed into multiple databases or multiple tables according to some condition. Each physical table holds only part of the data, which keeps single tables small and achieves a distributed effect.
Table splitting within one database only solves the problem of a single table being too large; since it does not distribute the tables across machines, it does little to relieve the pressure on MySQL. The tables still compete for the same machine's CPU, memory, and network IO, so this is best addressed by splitting across databases as well.

2.2.1 Advantages of horizontal segmentation

There is no performance bottleneck caused by excessive data volume or high concurrency in a single database, which improves system stability and load capacity.
Application-side modifications are small and there is no need to split business modules.

2.2.2 Disadvantages of horizontal segmentation

Transaction consistency across shards is difficult to guarantee.
Cross-database join query performance is poor.
Expanding capacity repeatedly is difficult, and the maintenance effort is very large.
After horizontal splitting, the same logical table appears in multiple databases/tables, and each database/table holds different content.

2.3 Two questions

Now that we know why databases and tables need to be split, two questions remain:
When should we split? And how?
Let's continue.

2.4 When to consider segmentation

2.4.1 Estimated data volume

Alibaba's guideline recommends considering sharding when a single table is expected to exceed 5 million rows within three years, or when a single table's data file exceeds 2GB.

2.4.2 Estimated data trends

Look at the data growth trend: is growth expected to move from slow to fast, or from fast to slow?

2.4.3 Estimated application scenarios

Sharding suits business scenarios with many more reads than writes.

2.4.4 Estimating business complexity

2.4.5 Five suggestions

2.4.5.1 Try not to segment if possible

Not every table needs to be split; it mainly depends on how fast its data grows. Splitting increases business complexity to some degree, and beyond storing and querying data, helping the business meet its needs well is also one of the database's important jobs.
Don't reach for the heavy weapon of sharding unless absolutely necessary, to avoid "over-design" and "premature optimization". Before splitting databases and tables, don't split just for the sake of splitting; first do everything else you can, such as upgrading hardware, upgrading the network, separating reads and writes, and optimizing indexes. Only when the data volume reaches the bottleneck of a single table should sharding be considered.

2.4.5.2 Split when the data volume is so large that routine operations affect business access

The operations referred to here include:
1) Database backup. If a single table is too large, the backup needs a lot of disk IO and network IO. For example, transferring 1TB of data over a network at 50MB/s takes 20,000 seconds, and the risk across that whole window is relatively high.
2) DDL on a large table. MySQL locks the whole table, the lock can last a long time, and the business cannot access the table during that period, which has a great impact. Even with pt-online-schema-change, triggers and a shadow table are created during the operation, which also takes a long time and counts as risk time. Splitting the table and reducing the total volume helps lower this risk.
3) Large tables that are frequently accessed and updated are more prone to lock waits. Splitting the data trades space for time and indirectly reduces the access pressure.

2.4.5.3 As business develops, some fields need to be split vertically

In the initial stage of a project, a single-table design meets simple business requirements and supports rapid iteration. Then the business takes off, the data volume surges, and the user base jumps from 100,000 to 1 billion. Users are very active, and the last_login_time field is updated on every login, so the user table is written constantly and under heavy pressure, while the other fields (id, name, personal_info) are unchanged or rarely updated. From a business perspective it then makes sense to split last_login_time out vertically into a new user_time table.

2.4.5.4 Consider segmentation if data volume grows rapidly

With the rapid development of business, the amount of data in a single table will continue to grow. When performance approaches the bottleneck, it is necessary to consider horizontal sharding and create separate databases and tables. At this time, you must choose appropriate segmentation rules and estimate the data capacity in advance.

2.4.5.5 Security and usability aspects

Vertical segmentation at the business level separates the databases of unrelated businesses, since each business's data volume and access volume differ. With horizontal segmentation, when one database has a problem it no longer affects 100% of users; each database carries only part of the business data, so overall availability improves.

2.5 How to segment? Several typical data sharding rules:

2.5.1 According to numerical range

Split by time interval or by ID interval. For example: distribute data of different months, or even days, into different databases by date; assign records with userId 1–9999 to the first database, records with userId 10000–20000 to the second, and so on. In a sense, the "hot/cold data separation" used in some systems, which migrates rarely used historical data to another database and serves only hot data in business queries, is a similar practice. A small routing sketch follows the list below.
Advantages:
The size of a single table is controllable.
Horizontal scaling is natural and easy: to expand the sharded cluster later, just add nodes; data on the other shards does not need to migrate.
Range searches on the shard field under continuous sharding can quickly locate the target shards, effectively avoiding cross-shard queries.
Disadvantages:
Hot data can become a performance bottleneck. Continuous sharding may create data hotspots: when sharding by a time field, for example, the shards holding the most recent period may be read and written frequently while shards holding history are rarely queried.
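A minimal sketch of range-based routing as described above. The interval size of 10,000 userIds per database and the db_0, db_1, ... names are illustrative assumptions:

public class RangeShardingRouter {
    private static final long RANGE_SIZE = 10_000;

    // Maps a userId to a database name: 1-9999 -> db_0, 10000-19999 -> db_1, and so on
    public static String route(long userId) {
        return "db_" + (userId / RANGE_SIZE);
    }

    public static void main(String[] args) {
        System.out.println(route(9_999));  // db_0
        System.out.println(route(10_000)); // db_1
    }
}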

2.5.2 Taking modulo based on numerical value

Generally, hash-modulo splitting is used. For example, split the Customer table into 4 databases by the cusno field: rows with remainder 0 go to the first database, rows with remainder 1 go to the second, and so on. This way, the data of the same customer lands in the same database, and if the query condition contains cusno, the target database can be located directly.
Advantages:
Data sharding is relatively even, and hot spots and concurrent access bottlenecks are less likely to occur.
Disadvantages:
When the sharded cluster is expanded later, old data must be migrated (a consistent hashing algorithm can largely avoid this problem; see the sketch below).
It is easy to run into the complex problem of cross-shard queries. In the example above, if cusno is not among the frequently used query conditions, the target database cannot be determined, so the query must be sent to all four databases at once, the results merged in memory, and the minimal set returned to the application; the sharding then becomes a drag instead.
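Since consistent hashing is mentioned as a way to soften migration during expansion, here is a minimal consistent-hash ring sketch. The node names and the MD5-based hash are illustrative assumptions; when a node is added, only the keys between its predecessor and itself move, instead of everything being rehashed:

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.SortedMap;
import java.util.TreeMap;

public class ConsistentHashRing {
    private final TreeMap<Long, String> ring = new TreeMap<>();

    public void addNode(String node) {
        ring.put(hash(node), node);
    }

    // Walk clockwise from the key's hash to the first node on the ring
    public String route(String key) {
        SortedMap<Long, String> tail = ring.tailMap(hash(key));
        return tail.isEmpty() ? ring.firstEntry().getValue() : tail.get(tail.firstKey());
    }

    private long hash(String s) {
        try {
            byte[] d = MessageDigest.getInstance("MD5").digest(s.getBytes(StandardCharsets.UTF_8));
            return ((long) (d[3] & 0xFF) << 24) | ((d[2] & 0xFF) << 16)
                    | ((d[1] & 0xFF) << 8) | (d[0] & 0xFF);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        ConsistentHashRing ring = new ConsistentHashRing();
        ring.addNode("db_0");
        ring.addNode("db_1");
        ring.addNode("db_2");
        System.out.println(ring.route("cusno:10086")); // routes to one of db_0..db_2
    }
}

A production version would also place several virtual nodes per physical node on the ring to even out the distribution.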

3. Problems caused by sub-database and sub-table

Sub-database and sub-table can effectively alleviate the performance bottlenecks and pressures caused by single machines and single databases, and break through the bottlenecks of network IO, hardware resources, and number of connections. It also brings some problems. We mainly look at it from five perspectives.

3.1 Transaction consistency issues

3.1.1 Distributed transactions

When updated content is distributed in different libraries at the same time, cross-database transaction problems will inevitably occur. Cross-shard transactions are also distributed transactions, and there is no simple solution. Generally, the "XA protocol" and "two-phase commit" can be used to handle them.
Distributed transactions can ensure the atomicity of database operations to the greatest extent. However, when submitting a transaction, multiple nodes need to be coordinated, which delays the time point of submitting the transaction and prolongs the execution time of the transaction. This leads to an increased probability of conflicts or deadlocks when transactions access shared resources. As the number of database nodes increases, this trend will become more and more serious, thus becoming a shackle for the horizontal expansion of the system at the database level.

3.1.2 Eventual consistency

For systems with high performance requirements but low consistency requirements, real-time consistency is often unnecessary: as long as eventual consistency is reached within the allowed time window, transaction compensation can be used. Unlike rolling a transaction back immediately when an error occurs during execution, transaction compensation is an after-the-fact check and remedy. Common implementations include data reconciliation checks, log-based comparison, and periodic comparison and synchronization against a standard data source. Transaction compensation must also be designed together with the business system.

3.2 Cross-node association query join problem

Before segmentation, the data required for many lists and detail pages in the system can be completed through SQL join. After segmentation, the data may be distributed on different nodes. At this time, the problems caused by join will be more troublesome. Considering the performance, try to avoid using join query.
Some ways to solve this problem:

3.2.1 Global table

Global tables can also be regarded as "data dictionary tables", which are tables that all modules in the system may depend on. In order to avoid cross-database join queries, a copy of such tables can be saved in each database. These data are usually rarely modified, so there is no need to worry about consistency issues.

3.2.2 Field redundancy

A typical anti-paradigm design uses space for time and avoids join queries for performance. For example: when the order table saves the userId, it also saves a redundant copy of the userName, so that when querying the order details, there is no need to query the "buyer user table".
However, the applicable scenarios of this method are also limited, and it is more suitable for situations where there are few dependent fields. The data consistency of redundant fields is also difficult to ensure. Just like the example of the order table above, after the buyer modifies the userName, does it need to be updated synchronously in the historical orders? This should also be considered in conjunction with actual business scenarios.

3.2.3 Data assembly

At the system level, the query is split in two: the first query finds the IDs of the associated data, and a second request is then issued with those IDs to fetch the associated records. Finally, the fetched data is assembled at the application layer, as sketched below.
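A minimal sketch of the two-step assembly. The Order/User types and the OrderDao/UserDao interfaces are hypothetical, invented for illustration only:

import java.util.*;
import java.util.stream.Collectors;

class Order { long id; long userId; }
class User { long id; String name; }
class OrderView { Order order; User buyer; OrderView(Order o, User u) { order = o; buyer = u; } }

interface OrderDao { List<Order> findByBuyerId(long buyerId); }
interface UserDao { List<User> findByIds(Collection<Long> ids); }

class OrderAssembler {
    private final OrderDao orderDao;
    private final UserDao userDao;

    OrderAssembler(OrderDao orderDao, UserDao userDao) {
        this.orderDao = orderDao;
        this.userDao = userDao;
    }

    List<OrderView> listOrdersWithBuyer(long buyerId) {
        // Step 1: query the order shard(s) and collect the associated user IDs
        List<Order> orders = orderDao.findByBuyerId(buyerId);
        Set<Long> userIds = orders.stream().map(o -> o.userId).collect(Collectors.toSet());

        // Step 2: fetch the associated users in one batch, then assemble in memory
        Map<Long, User> users = userDao.findByIds(userIds).stream()
                .collect(Collectors.toMap(u -> u.id, u -> u));

        return orders.stream()
                .map(o -> new OrderView(o, users.get(o.userId)))
                .collect(Collectors.toList());
    }
}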

3.2.4 ER sharding

In a relational database, if you can first determine the association between tables and store those associated table records on the same shard, you can better avoid cross-shard join problems. In the case of 1:1 or 1:n, it is usually split according to the ID primary key of the main table.

3.3 Cross-node paging, sorting, and function issues

When querying across multiple databases on different nodes, problems arise with limit paging and order by sorting. Paging must sort by the specified field: when the sort field is the sharding field, the sharding rules make it easy to locate the target shard; when it is not, things get more complicated. Each shard node must first sort and return its own data, and the result sets returned by the different shards are then merged and sorted again before being returned to the user.
If the requested page number is large, the situation gets much worse, because the data on each shard node may be essentially random. For the sort to be accurate, the first N pages of data from all nodes must be merged and sorted as a whole. Such an operation consumes a lot of CPU and memory resources, so the larger the page number, the worse the system performs.
When using functions such as MAX, MIN, SUM, and COUNT, you likewise need to run the function on each shard first, then merge the per-shard result sets, compute again, and return the final result.
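A minimal sketch of merging per-shard aggregates; the per-shard numbers are hard-coded assumptions for illustration. Note the classic trap: a global AVG cannot be obtained by averaging the per-shard averages, it must be rebuilt from the merged SUM and COUNT:

import java.util.Arrays;
import java.util.List;

public class AggregateMerger {
    // One entry per shard, as if each shard ran: SELECT SUM(amount), COUNT(*) FROM t_order_x
    static class ShardResult {
        final long sum;
        final long count;
        ShardResult(long sum, long count) { this.sum = sum; this.count = count; }
    }

    public static void main(String[] args) {
        List<ShardResult> shards = Arrays.asList(
                new ShardResult(1_000, 10),  // shard 0: avg 100
                new ShardResult(5_000, 20)); // shard 1: avg 250

        long totalSum = shards.stream().mapToLong(s -> s.sum).sum();
        long totalCount = shards.stream().mapToLong(s -> s.count).sum();

        // Correct global AVG: merged SUM / merged COUNT = 6000 / 30 = 200
        System.out.println("AVG = " + (double) totalSum / totalCount);
        // Averaging the shard averages would give (100 + 250) / 2 = 175, which is wrong
    }
}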

3.4 Global primary key uniqueness problem

In a sharded environment, since a table's data exists in several databases at once, the usual auto-increment primary key is no longer usable: an ID generated by one partitioned database alone cannot be guaranteed globally unique. The global primary key therefore needs to be designed separately to avoid duplicate keys across databases. Some common primary key generation strategies:

3.4.1 UUID (not recommended)

The standard form of a UUID contains 32 hexadecimal digits divided into 5 segments, 36 characters in the 8-4-4-4-12 layout, for example: 550e8400-e29b-41d4-a716-446655440000.
Using a UUID as the primary key is the simplest solution: it is generated locally, performs well, and costs no network traffic. But the drawbacks are also obvious. A UUID is long, so it takes a lot of storage space; moreover, being unordered, it causes performance problems when used as an indexed primary key. In MySQL's InnoDB, inserting such values into the index is relatively inefficient and causes frequent movement of data pages.

3.4.2 Snowflake distributed self-increasing ID algorithm (recommended)

Twitter's snowflake algorithm answers the need for distributed systems to generate global IDs by producing a 64-bit Long number composed as follows:
The first bit is unused.
The next 41 bits are a millisecond timestamp; 41 bits can represent about 69 years.
Then come a 5-bit datacenterId and a 5-bit workerId; these 10 bits support deployments of up to 1024 nodes.
The last 12 bits are a counter within the millisecond; 12 bits let each node generate 4096 ID sequence numbers per millisecond.
Because the millisecond timestamp occupies the high bits, the generated IDs increase with time overall. The algorithm relies on no third-party system, is stable and efficient, theoretically about 4.096 million IDs per second (1000 × 2^12), and produces no ID collisions across the distributed system; the bits can also be allocated flexibly to fit the business. Unlike UUID, the snowflake algorithm guarantees that primary keys generated by the same process are ordered.
The disadvantage is the heavy dependence on the machine clock: if the clock is set back, duplicate IDs may be generated.
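A minimal single-node sketch of the bit layout described above. The custom epoch and the simplistic clock-rollback handling are assumptions; note that Sharding-JDBC ships its own SNOWFLAKE key generator, which section 6 uses via configuration:

public class SnowflakeIdSketch {
    // Bit layout: 1 unused | 41 timestamp | 5 datacenterId | 5 workerId | 12 sequence
    private static final long EPOCH = 1640995200000L; // assumed custom epoch: 2022-01-01
    private static final long SEQUENCE_BITS = 12L;
    private static final long WORKER_BITS = 5L;
    private static final long DATACENTER_BITS = 5L;
    private static final long MAX_SEQUENCE = (1L << SEQUENCE_BITS) - 1; // 4095

    private final long datacenterId; // 0..31
    private final long workerId;     // 0..31
    private long lastTimestamp = -1L;
    private long sequence = 0L;

    public SnowflakeIdSketch(long datacenterId, long workerId) {
        this.datacenterId = datacenterId;
        this.workerId = workerId;
    }

    public synchronized long nextId() {
        long now = System.currentTimeMillis();
        if (now < lastTimestamp) {
            // Clock moved backwards; a real implementation needs a careful strategy here
            throw new IllegalStateException("clock moved backwards");
        }
        if (now == lastTimestamp) {
            sequence = (sequence + 1) & MAX_SEQUENCE;
            if (sequence == 0) {
                // 4096 IDs exhausted within this millisecond: spin until the next one
                while ((now = System.currentTimeMillis()) <= lastTimestamp) { }
            }
        } else {
            sequence = 0L;
        }
        lastTimestamp = now;
        return ((now - EPOCH) << (DATACENTER_BITS + WORKER_BITS + SEQUENCE_BITS))
                | (datacenterId << (WORKER_BITS + SEQUENCE_BITS))
                | (workerId << SEQUENCE_BITS)
                | sequence;
    }

    public static void main(String[] args) {
        SnowflakeIdSketch generator = new SnowflakeIdSketch(1, 1);
        System.out.println(generator.nextId());
    }
}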

3.5 Data migration and expansion issues

When a business grows fast and faces performance and storage bottlenecks, sharding comes onto the agenda, and migrating historical data becomes unavoidable. The general approach is to read the historical data and write it to the shard nodes according to the chosen sharding rules. In addition, capacity planning is needed based on the current data volume and QPS and the pace of business growth, to estimate roughly how many shards are required (it is generally recommended that a single table on a single shard not exceed 10 million rows).
If you use numerical range sharding, you only need to add nodes to expand the capacity, and there is no need to migrate the sharded data. If numerical modulo sharding is used, it will be relatively troublesome to consider later expansion issues.

4. Java frameworks/components/middleware that support database and table sharding

Let’s list some of the more common ones and briefly introduce them:

sharding-jdbc (Dangdang)
TSharding (Mogujie)
Atlas (Qihoo 360)
Cobar (Alibaba)
MyCAT (based on Cobar)
TDDL (Taobao)
Vitess (Google)

4.1 sharding-jdbc

First up, and probably the most common and widely used, is Sharding-JDBC. That is its earliest name; it has since grown into the ShardingSphere ecosystem. Its usage is introduced in detail later in this article.

4.2 TSharding

Mogujie's sharding component for its trading platform, built with very little resource investment. It supports the sharding requirements of the trade order tables across databases and tables, data source routing, transactions, result-set merging, and read-write separation.

4.3 Atlas

A secondary development based on MySQL-Proxy, it mainly supports two features: table sharding and read-write separation. Table sharding, however, only supports multiple tables within a single database; distributed sharding is not really supported, as all shard tables live in the same database.

4.4 Cobar

Alibaba's distributed processing system for relational data. It sits as a proxy between the front-end application and the actual databases and exposes the MySQL communication protocol to the front end. It parses the incoming SQL, forwards it to the appropriate back-end sub-database according to the data distribution rules, and then merges and returns the results, simulating the behavior of a single database.

4.5 MyCAT

A server that implements the MySQL protocol. Front-end users can treat it as a database proxy and access it with MySQL client tools and the command line, while its back end communicates with multiple MySQL servers over the MySQL native protocol, or with most mainstream database servers over JDBC. Its core function is database and table sharding; combined with the database's master-slave mode, it can also provide read-write separation.

4.6 TDDL

It mainly solves application transparency for sharding and data replication between heterogeneous databases. It is a JDBC datasource implementation based on centralized configuration, with features such as primary/backup switching, read-write separation, and dynamic database configuration.

4.7 Vitess

YouTube's open-source database scaling and high-availability solution. It has been used in production environments and is powerful, but its architecture is complex and its deployment and operation costs are high.

5. ShardingSphere

This is the upgraded form of the Sharding-JDBC just introduced. Upgraded in what sense? "Sphere" means ecosystem: it has gradually grown from a single product into a whole ecosystem, introduced in detail below. In actual practice, Sharding-JDBC is still the part we use.

5.1 Introduction to ShardingSphere

It started inside Dangdang as Sharding-JDBC, a framework for sharding databases and tables. It is positioned as a lightweight Java framework that provides extra services at the JDBC layer: the client connects directly to the database, and the framework ships as a jar package with no extra deployment or dependencies. Sharding-JDBC directly wraps the JDBC API and can be understood as an enhanced JDBC driver, so the cost of migrating old code is almost zero. It works with any JDBC-based ORM framework, supports any third-party database connection pool, and supports any database that implements the JDBC specification.
It began to be open sourced to the outside world in 2017. With the continuous iteration of a large number of community contributors, its functions have gradually improved, and it has now been renamed ShardingSphere.
In 2020, it officially became a top-level project of the Apache Software Foundation. It consists of three independent products: Sharding-JDBC, Sharding-Proxy and Sharding-Sidecar (planned).

5.2 Sharding-JDBC Advantages

Sharding-JDBC is positioned as a lightweight Java framework that provides additional services at the JDBC layer of Java.
The client directly connects to the database and provides services in the form of jar packages without additional deployment and dependencies.
Sharding-JDBC directly encapsulates the JDBC API.
The cost of migrating old code is almost zero.
Applicable to any JDBC-based ORM framework.
Supports any third-party database connection pool.
Supports any database that implements the JDBC specification.

5.3 Core concepts

5.3.1 LogicTable (logical table)

The logical table for data sharding: the collective name for a group of horizontally split tables of the same kind. Example: order data is split into 10 tables by the last digit of the primary key, t_order_0 through t_order_9, whose logical table name is t_order.

5.3.2 ActualTable (real table)

Physical tables that actually exist in a sharded database. That is, t_order_0 to t_order_9 in the previous example.

5.3.3 DataNode (data node)

The smallest unit of data sharding, consisting of a data source name and a table name, for example ds_1.t_order_0. By default the configuration assumes every sharded database has the same table structure, so you can directly configure the mapping between logical tables and real tables. If the table structures differ between databases, use the ds.actual_table form of configuration.

5.3.4 BindingTable (binding table)

Refers to the main table and sub-tables whose sharding rules are consistent in any scenario. For example: the order table and the order item table are both sharded according to the order ID, so the two tables have a BindingTable relationship with each other. There will be no Cartesian product association in the multi-table association query of the BindingTable relationship, and the efficiency of the association query will be greatly improved.

5.3.5 BroadcastTable (broadcast table)

Refers to the table that exists in all sharded data sources. The table structure and data in the table are completely consistent in each database. It is suitable for scenarios where the amount of data is not large and requires associated queries with tables with massive data, such as dictionary tables.

5.4 Sharding

5.4.1 Sharding key

The database field used for sharding: the key field for splitting a database (or table) horizontally. Example: if the order table is sharded by taking the last digit of the order ID modulo some number, the order ID is the sharding field. If the SQL contains no sharding field, full routing is performed, with poor performance. Multiple sharding fields are supported.

5.4.2 Sharding algorithm

5.4.2.1 Exact Sharding Algorithm

The precise sharding algorithm (PreciseShardingAlgorithm) handles sharding for = and IN statements with a single column as the sharding key. It must be used together with StandardShardingStrategy.

5.4.2.2 Range Sharding Algorithm

The range sharding algorithm (RangeShardingAlgorithm) is used to handle sharding scenarios using BETWEEN AND with a single key as the sharding key. Need to be used with StandardShardingStrategy.

5.4.2.3 Composite Sharding Algorithm

The compound sharding algorithm (ComplexKeysShardingAlgorithm) is used for sharding operations in which multiple fields are used as sharding keys. The values ​​of multiple sharding keys are obtained at the same time, and business logic is processed based on multiple fields. Need to be used under ComplexShardingStrategy.

5.4.2.4 Hint sharding algorithm

The Hint sharding algorithm (HintShardingAlgorithm) is a little different. The algorithms above extract the sharding key by parsing the statement and shard according to the configured strategy. Sometimes, however, no sharding key or sharding strategy is used at all, yet we still want to route the SQL to a target database and table; then the target must be specified through manual intervention, which is also called forced routing.

5.4.3 Sharding strategy

Contains the sharding key and sharding algorithm. Due to the independence of the sharding algorithm, it is extracted independently. What can really be used for sharding operations is the sharding key + sharding algorithm, which is the sharding strategy. Currently, 5 sharding strategies are provided.
A good sharding strategy = good sharding key + good sharding algorithm

5.4.3.1 Standard Sharding Strategy

Corresponds to StandardShardingStrategy. Provides support for sharding operations of =, IN and BETWEEN AND in SQL statements. StandardShardingStrategy only supports single sharding keys and provides two sharding algorithms: PreciseShardingAlgorithm and RangeShardingAlgorithm. PreciseShardingAlgorithm is required and is used to handle = and IN sharding. RangeShardingAlgorithm is optional and is used to process BETWEEN AND sharding. If RangeShardingAlgorithm is not configured, BETWEEN AND in SQL will be processed according to the full library route.

5.4.3.2 Composite sharding strategy

Corresponds to ComplexShardingStrategy, the composite sharding strategy. It supports =, IN, and BETWEEN AND sharding operations in SQL statements and supports multiple sharding keys. Because the relationships between multiple sharding keys are complex, it does not over-encapsulate; instead it passes the sharding key-value combinations and the sharding operator straight through to the sharding algorithm, which the application developer implements entirely, giving maximum flexibility.

5.4.3.3 Row expression sharding strategy

Corresponds to InlineShardingStrategy. Using Groovy expressions, it supports = and IN sharding operations in SQL statements and only supports single sharding keys. For simple sharding algorithms it avoids tedious Java coding through simple configuration, for example: t_user_$->{u_id % 8} means the t_user table is split into 8 tables by u_id modulo 8, named t_user_0 through t_user_7.
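As a sketch, the row expression above would be wired up in Sharding-JDBC 4.x properties roughly as follows; the t_user table and u_id column come from the example, and the exact keys should be checked against your version:

# Inline (row expression) table sharding for the example above
spring.shardingsphere.sharding.tables.t_user.table-strategy.inline.sharding-column=u_id
spring.shardingsphere.sharding.tables.t_user.table-strategy.inline.algorithm-expression=t_user_$->{u_id % 8}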

5.4.3.4 Hint sharding strategy

Corresponds to HintShardingStrategy. A strategy for sharding through Hint rather than SQL parsing.

5.4.3.5 None sharding strategy

Corresponds to NoneShardingStrategy: no sharding is performed.

5.5 Distributed primary key

See section 3.4 of this article.

5.6 Distributed transactions

Database transactions need to meet the four characteristics of ACID (atomicity, consistency, isolation, and durability).
Atomicity means that the transaction is executed as a whole, either entirely or not at all.
Consistency means that transactions should ensure that data changes from one consistent state to another consistent state.
Isolation means that when multiple transactions are executed concurrently, the execution of one transaction should not affect the execution of other transactions.
Durability means that the data modifications of a committed transaction are persisted.
In a single data node, transactions are limited to access control of a single database resource, called a local transaction. Almost all mature relational databases provide native support for local transactions. However, in a distributed application environment based on microservices, more and more application scenarios require that access to multiple services and their corresponding multiple database resources can be included in the same transaction, and distributed transactions emerged as the times require.

5.6.1 Local transactions

Without opening any distributed transaction manager, let each data node manage its own transactions. They have no coordination and communication capabilities, and they do not know the success or failure of other data node transactions. Local transactions do not suffer any loss in performance, but they are insufficient in terms of strong consistency and eventual consistency.

5.6.2 XA strongly consistent transactions

The earliest distributed transaction model is the XA protocol, proposed by the X/OPEN consortium.
Distributed transactions implemented based on the XA protocol have little intrusion into the business. Its biggest advantage is that it is transparent to users. Users can use distributed transactions based on the XA protocol just like local transactions. The XA protocol can strictly guarantee the ACID characteristics of transactions.
Strictly ensuring transaction ACID properties is a double-edged sword. During transaction execution, all required resources need to be locked, and it is more suitable for short transactions with a determined execution time. For long transactions, the exclusive use of data during the entire transaction will lead to a significant decline in the concurrent performance of business systems that rely on hot data. Therefore, in high-concurrency performance-first scenarios, distributed transactions based on the XA protocol are not the best choice.

5.6.3 Flexible transactions

If transactions that implement the ACID elements are called rigid transactions, then transactions based on the BASE elements are called flexible transactions. BASE stands for Basically Available, Soft state, and Eventually consistent.
Basically Available: distributed transaction participants do not all have to be online at the same time.
Soft state: system state updates are allowed a certain delay, which may not be noticeable to customers.
Eventually consistent: the system's eventual consistency is usually ensured through message reachability.

5.7 Separation of reading and writing


5.7.1 Core concept of separation of reading and writing

5.7.1.1 Master database

The database used for insert, update, and delete operations. Currently only a single master database is supported.

5.7.1.2 Slave database

The database used for query operations; multiple slave databases are supported.

5.7.1.3 Master-slave synchronization

The operation of asynchronously synchronizing data from the master database to the slave database. Due to the asynchronous nature of master-slave synchronization, the data in the slave database and the master database will be inconsistent for a short period of time.

5.7.1.4 Load balancing strategy

Direct query requests to different slave libraries through load balancing strategies.

5.7.1.5 Config Map

Configures metadata for the read-write-separated data source. The masterSlaveConfig data in the ConfigMap can be obtained by calling ConfigMapContext.getInstance(). Example: if machines have different weights, their traffic may differ; machine-weight metadata can be configured through the ConfigMap.

5.7.2 Core functions

Provides a read-write separation configuration of one master and multiple slaves, which can be used on its own or together with sharding.
Within the same thread and the same database connection, once a write operation occurs, subsequent reads go to the master database, ensuring data consistency.
Spring namespace support.
Hint-based forced routing to the master database.

5.7.3 What it does not cover

Data synchronization between the master database and slave database.
Data inconsistency caused by delay in data synchronization between the master database and the slave database.
The main library is double-written or multi-written.

6. Sharding-JDBC practical operation

Core dependencies:

<dependency>
    <groupId>org.apache.shardingsphere</groupId>
    <artifactId>sharding-jdbc-spring-boot-starter</artifactId>
    <version>4.1.1</version>
</dependency>

6.1 Preparatory operations before configuring the sharding algorithm

Before configuring the Sharding-JDBC sharding algorithm, you must do some basic configurations, such as configuring data sources, data nodes, setting primary keys and generating algorithms. Only with these configurations can we continue to configure the sharding algorithm.

6.1.1 Configure data source

Two databases are set up here, so two data sources are configured.
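A minimal sketch of such a configuration, assuming the data source names m0 and m1 used by the node expression in 6.1.2 (the URLs, credentials, and Druid connection pool are assumptions):

# Data source names (m0/m1 match the node expression in 6.1.2)
spring.shardingsphere.datasource.names=m0,m1

spring.shardingsphere.datasource.m0.type=com.alibaba.druid.pool.DruidDataSource
spring.shardingsphere.datasource.m0.driver-class-name=com.mysql.cj.jdbc.Driver
spring.shardingsphere.datasource.m0.url=jdbc:mysql://localhost:3306/db0?serverTimezone=UTC
spring.shardingsphere.datasource.m0.username=root
spring.shardingsphere.datasource.m0.password=root

spring.shardingsphere.datasource.m1.type=com.alibaba.druid.pool.DruidDataSource
spring.shardingsphere.datasource.m1.driver-class-name=com.mysql.cj.jdbc.Driver
spring.shardingsphere.datasource.m1.url=jdbc:mysql://localhost:3306/db1?serverTimezone=UTC
spring.shardingsphere.datasource.m1.username=root
spring.shardingsphere.datasource.m1.password=root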

6.1.2 Configure data nodes

$-> is the row expression syntax provided by Sharding. The expression below means the configured data nodes are the course0 and course1 tables in the m0 database plus the course0 and course1 tables in the m1 database, four nodes in total. Configured this way it is very simple. With only a few databases and tables you could also just enumerate them directly, but with many, listing them one by one becomes very tedious.

# Data nodes
spring.shardingsphere.sharding.tables.course.actual-data-nodes=m$->{0..1}.course$->{0..1}

6.1.3 Configure global primary key and generation strategy

The primary key set for the course table below is cid, and the snowflake algorithm is used to ensure that the global primary key is unique.

# Primary key column
spring.shardingsphere.sharding.tables.course.key-generator.column=cid
# Primary key generation strategy: snowflake algorithm
spring.shardingsphere.sharding.tables.course.key-generator.type=SNOWFLAKE

6.2 Precise sharding algorithm practice

6.2.1 Configuration class

First, set the sharding key and the sharding strategy for the logical table course; a custom algorithm class is needed here. The database sharding strategy is configured similarly: change table-strategy to database-strategy and point the strategy at a custom algorithm class.

# Sharding column
spring.shardingsphere.sharding.tables.course.table-strategy.standard.sharding-column=user_id
# Table sharding strategy: precise
spring.shardingsphere.sharding.tables.course.table-strategy.standard.precise-algorithm-class-name=com.mine.sharding.algorithm.MyPreciseTableShardingAlgorithm

6.2.2 Custom algorithm class

Implement the precise sharding interface provided by Sharding and override the doSharding method; the sharding algorithms that follow are all similar operations.

public class MyPreciseTableShardingAlgorithm implements PreciseShardingAlgorithm<Long> {

    // Handles queries such as: select * from course where cid = ? or cid in (?, ?)
    @Override
    public String doSharding(Collection<String> collection, PreciseShardingValue<Long> preciseShardingValue) {
        String logicTableName = preciseShardingValue.getLogicTableName();
        String columnName = preciseShardingValue.getColumnName();
        Long cidValue = preciseShardingValue.getValue();
        // Implements course$->{cid % 2}: even values route to course0, odd values to course1
        BigInteger shardingValueB = BigInteger.valueOf(cidValue);
        BigInteger resB = shardingValueB.mod(new BigInteger("2"));
        String key = logicTableName + resB;
        if (collection.contains(key)) {
            return key;
        }
        throw new UnsupportedOperationException("route:" + key + " is not supported, please check your config");
    }
}

6.3 Range Sharding Algorithm Practice

6.3.1 Configuration class


# Sharding column
spring.shardingsphere.sharding.tables.course.table-strategy.standard.sharding-column=user_id
# Table sharding strategy: range
spring.shardingsphere.sharding.tables.course.table-strategy.standard.range-algorithm-class-name=com.mine.sharding.algorithm.MyRangeTableShardingAlgorithm

6.3.2 Custom algorithm class


public class MyRangeTableShardingAlgorithm implements RangeShardingAlgorithm<Long> {

    @Override
    public Collection<String> doSharding(Collection<String> collection, RangeShardingValue<Long> rangeShardingValue) {
        // Upper and lower bounds of the BETWEEN ... AND ... condition
        Long upperValue = rangeShardingValue.getValueRange().upperEndpoint();
        Long lowerValue = rangeShardingValue.getValueRange().lowerEndpoint();

        String logicTableName = rangeShardingValue.getLogicTableName();
        // With the course$->{cid % 2} layout, any value range can span both tables,
        // so route to both course0 and course1
        return Arrays.asList(logicTableName + "0", logicTableName + "1");
    }
}

6.4 Composite Sharding Algorithm Practice

What does composite, or complex, mean? The condition is no longer single. Note that in the configuration sharding-column becomes sharding-columns, plural.
The composite sharding strategy supports multiple sharding keys and combines them into a more complex sharding rule.

6.4.1 Configuration class


# Sharding columns (multiple allowed)
spring.shardingsphere.sharding.tables.course.table-strategy.complex.sharding-columns=cid,user_id
# Table sharding strategy: complex
spring.shardingsphere.sharding.tables.course.table-strategy.complex.algorithm-class-name=com.mine.sharding.algorithm.MyComplexTableShardingAlgorithm

6.4.2 Custom algorithm class


public class MyComplexTableShardingAlgorithm implements ComplexKeysShardingAlgorithm<Long> {

    @Override
    public Collection<String> doSharding(Collection<String> collection, ComplexKeysShardingValue<Long> complexKeysShardingValue) {
        // cid arrives as a range condition, user_id as exact value(s)
        Range<Long> cidRange = complexKeysShardingValue.getColumnNameAndRangeValuesMap().get("cid");
        Collection<Long> userIdCol = complexKeysShardingValue.getColumnNameAndShardingValuesMap().get("user_id");

        Long cidUpper = cidRange.upperEndpoint();
        Long cidLower = cidRange.lowerEndpoint();

        // Route by user_id % 2; the cid range could be used to narrow the result further
        List<String> result = new ArrayList<>();
        for (Long userId : userIdCol) {
            BigInteger userIdB = BigInteger.valueOf(userId);
            BigInteger target = userIdB.mod(new BigInteger("2"));
            result.add(complexKeysShardingValue.getLogicTableName() + target);
        }
        return result;
    }
}

6.5 Hint sharding algorithm (forced routing) practice

6.5.1 Configuration class

The configuration file specifies no sharding key. Why? The name says it: forced routing. Since routing is forced, there is no need to set a sharding key in advance; we intervene from the program and specify the target database and table of the SQL directly.


# Hint: no sharding column is specified in the configuration
# Table sharding strategy: hint
spring.shardingsphere.sharding.tables.course.table-strategy.hint.algorithm-class-name=com.mine.sharding.algorithm.MyHintTableShardingAlgorithm

6.5.2 Custom algorithm class


public class MyHintTableShardingAlgorithm implements HintShardingAlgorithm<Integer> {

    /**
     * Hint limitations:
     * - UNION queries are not supported
     * - multi-level subqueries are not supported
     * - function calculations are not supported
     */
    @Override
    public Collection<String> doSharding(Collection<String> collection, HintShardingValue<Integer> hintShardingValue) {
        // Append the value supplied through HintManager to the logical table name
        String key = hintShardingValue.getLogicTableName() + hintShardingValue.getValues().toArray()[0];
        if (collection.contains(key)) {
            return Arrays.asList(key);
        }
        throw new UnsupportedOperationException("route:" + key + " is not supported, please check your config");
    }
}

6.5.3 Test class

How does the Hint algorithm intervene in the program? Let's write a simple test method.
The routing value is specified through HintManager, which is very convenient.
One thing to note: HintManager is bound to the current thread, so it must be closed after use and must not be carried over into another thread.

    /**
     * Hint: forced routing
     */
    @Test
    public void queryCourseLHint() {
        HintManager hintManager = HintManager.getInstance();
        // Force queries on the logical table "course" to the table course0
        hintManager.addTableShardingValue("course", 0);
        List<Course> courses = courseMapper.selectList(null);
        courses.forEach(System.out::println);
        // HintManager is thread-bound: close it after use
        hintManager.close();
    }

6.6 Broadcast table practice

The same operation is performed in every data source, meaning each data source keeps an identical, full copy of the data.


spring.shardingsphere.sharding.broadcast-tables=dict

6.7 Binding table practice

When tables that share the same sharding key are joined in a query, the most important thing is to avoid a Cartesian product.


spring.shardingsphere.sharding.binding-tables[0]=user,dict

6.8 Practice of separation of reading and writing

The prerequisite is to configure the master-slave database at the database level.

6.8.1 Configure the master-slave database data source

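A minimal sketch of the master/slave data sources, assuming the names m and s referenced by the rule in 6.8.2 (the URLs and credentials are assumptions; the replication itself is configured at the MySQL level):

# Master (m) and slave (s) data sources
spring.shardingsphere.datasource.names=m,s

spring.shardingsphere.datasource.m.type=com.alibaba.druid.pool.DruidDataSource
spring.shardingsphere.datasource.m.driver-class-name=com.mysql.cj.jdbc.Driver
spring.shardingsphere.datasource.m.url=jdbc:mysql://localhost:3306/master_db?serverTimezone=UTC
spring.shardingsphere.datasource.m.username=root
spring.shardingsphere.datasource.m.password=root

spring.shardingsphere.datasource.s.type=com.alibaba.druid.pool.DruidDataSource
spring.shardingsphere.datasource.s.driver-class-name=com.mysql.cj.jdbc.Driver
spring.shardingsphere.datasource.s.url=jdbc:mysql://localhost:3307/slave_db?serverTimezone=UTC
spring.shardingsphere.datasource.s.username=root
spring.shardingsphere.datasource.s.password=root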

6.8.2 Configure read and write separation between master and slave nodes

Of course, you also need to specify the primary key and generation algorithm here.

# Master-slave configuration for read-write separation
spring.shardingsphere.sharding.master-slave-rules.ds0.master-data-source-name=m
spring.shardingsphere.sharding.master-slave-rules.ds0.slave-data-source-names=s

spring.shardingsphere.sharding.tables.student.actual-data-nodes=ds0.student
spring.shardingsphere.sharding.tables.student.key-generator.column=sid
spring.shardingsphere.sharding.tables.student.key-generator.type=SNOWFLAKE


That's all for this article. The author's level is limited; please point out any shortcomings.

Best Regards

Source: blog.csdn.net/m0_68681879/article/details/132534385