Spring Boot integrates ShardingSphere to implement data fragmentation (1) | Spring Cloud 40

1. Background

The traditional approach of storing all data centrally on a single node struggles to handle massive data volumes in terms of performance, availability, and operation and maintenance costs.

In terms of performance, most relational databases use B+tree indexes. Once the amount of data exceeds a threshold, the index becomes deeper, which increases the number of disk I/O operations and degrades query performance. At the same time, highly concurrent access requests make the centralized database the biggest bottleneck of the system.
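To make the link between data volume and index depth concrete, here is a rough back-of-the-envelope sketch. It assumes a uniform fanout of 500 keys per index page, which is a hypothetical figure chosen only for illustration; real fanout depends on the database engine, page size, and key size. The class and method names are likewise made up for this example.

```java
// Rough estimate of how many B+tree levels are needed to index a given
// number of rows. The fanout (keys per index page) is an assumed figure.
final class BPlusTreeDepthEstimate {

    /** Returns the smallest number of levels such that fanout^levels >= rowCount. */
    static int estimateLevels(long rowCount, int fanout) {
        if (rowCount < 1 || fanout < 2) {
            throw new IllegalArgumentException("rowCount must be >= 1 and fanout >= 2");
        }
        int levels = 1;
        long capacity = fanout;
        while (capacity < rowCount) {
            // multiplyExact throws instead of silently overflowing for huge inputs
            capacity = Math.multiplyExact(capacity, (long) fanout);
            levels++;
        }
        return levels;
    }

    public static void main(String[] args) {
        int assumedFanout = 500; // hypothetical keys per page
        for (long rows : new long[] {1_000_000L, 100_000_000L, 1_000_000_000L}) {
            System.out.println(rows + " rows -> about "
                    + estimateLevels(rows, assumedFanout) + " levels");
        }
    }
}
```

Under this assumption each extra level can mean one more page read per lookup, which is why keeping the row count of a single table bounded helps query performance.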

In terms of availability, stateless services can be scaled out at will at low cost, which inevitably shifts the system's ultimate pressure onto the database. A single data node, or a simple master-slave architecture, finds it increasingly difficult to bear that pressure. Database availability has become the key to the whole system.

In terms of operation and maintenance costs, once the data in a database instance exceeds a threshold, the operational pressure on the DBA increases. The time cost of data backup and recovery becomes increasingly uncontrollable as the data grows. Generally speaking, keeping the data in a single database instance within 1TB is a reasonable range.

As traditional relational databases fell short of the needs of Internet scenarios, there were more and more attempts to use NoSQL. However, NoSQL's incompatibility with SQL and its immature ecosystem have kept it from delivering a decisive blow in its contest with relational databases, and the position of the relational database remains unshaken.

2. Data Sharding

Data sharding means distributing the data stored in a single database across multiple databases or tables along some dimension, in order to relieve performance bottlenecks and improve availability.

An effective way to shard data is to split the databases and tables of a relational database. Both database sharding and table sharding effectively avoid the query bottleneck caused by data volume exceeding the tolerable threshold.

In addition, database sharding spreads out the traffic that would otherwise hit a single database node. Table sharding cannot relieve pressure on the database, but it makes it possible to convert distributed transactions into local transactions wherever possible. Once an update operation spans databases, distributed transactions tend to complicate the problem.

Sharding with multiple masters and multiple slaves effectively avoids a single point of failure for the data, which improves the availability of the data architecture.

Splitting data across databases and tables keeps the data volume of each table below the threshold and spreads traffic to cope with high access volume, which makes it an effective way to handle high concurrency and massive data.

Data sharding can be divided into vertical sharding and horizontal sharding.

2.1 Vertical Sharding

Sharding by business is called vertical sharding, also known as vertical splitting. Its core idea is a dedicated database for each purpose. Before splitting, one database holds multiple tables, each corresponding to a different business. After splitting, the tables are grouped by business and distributed to different databases, which spreads the pressure across those databases.

The following figure shows the scheme of vertically sharding the user table and order table into different databases according to business needs.

[Figure: the user table and the order table vertically sharded into different databases]

Vertical sharding often requires adjustments to the architecture and design. It usually cannot keep pace with the rapidly changing requirements of Internet businesses, and it does not truly solve the single-point bottleneck. Vertical splitting can ease the problems caused by data volume and access volume, but it cannot cure them. If the amount of data in a table still exceeds what a single node can carry after vertical splitting, horizontal sharding is needed for further processing.
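The table-to-database mapping behind vertical sharding can be sketched as a simple lookup. This is only an illustration of the idea, not how ShardingSphere implements it; the class name, the table names `t_user` and `t_order`, and the data source names `ds_user` and `ds_order` are all hypothetical.

```java
import java.util.Map;

// Vertical sharding sketch: each logical table belongs to exactly one
// business-specific database. All names here are hypothetical.
final class VerticalShardingRouter {

    private final Map<String, String> tableToDataSource;

    VerticalShardingRouter(Map<String, String> tableToDataSource) {
        this.tableToDataSource = Map.copyOf(tableToDataSource);
    }

    /** Returns the name of the database that owns the given table. */
    String dataSourceFor(String tableName) {
        String dataSource = tableToDataSource.get(tableName);
        if (dataSource == null) {
            throw new IllegalArgumentException("No data source configured for table: " + tableName);
        }
        return dataSource;
    }

    public static void main(String[] args) {
        VerticalShardingRouter router = new VerticalShardingRouter(Map.of(
                "t_user", "ds_user",    // user business -> user database
                "t_order", "ds_order"   // order business -> order database
        ));
        System.out.println(router.dataSourceFor("t_user"));
        System.out.println(router.dataSourceFor("t_order"));
    }
}
```

Note that every row of a given table still lands in one database, which is why this approach alone cannot help once a single table outgrows a single node.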

2.2 Horizontal Sharding

Horizontal sharding is also known as horizontal splitting. Unlike vertical sharding, it does not classify data by business logic. Instead, it spreads data across multiple databases or tables according to a rule applied to one field (or several fields), so that each shard contains only part of the data. For example, when sharding by primary key, records with an even primary key go into database (or table) 0 and records with an odd primary key go into database (or table) 1, as shown in the following figure.

[Figure: records horizontally sharded by primary key parity into database (or table) 0 and database (or table) 1]

In theory, horizontal sharding breaks through the limits of single-machine data processing and can be expanded relatively freely. It is a standard solution for data sharding.
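The even/odd primary key rule described in this section can be written as a modulo operation. The following is a minimal sketch under that assumption; the class name, the logical table name `t_order`, and the `_0`/`_1` suffix convention are hypothetical choices for illustration.

```java
// Horizontal sharding sketch: route a record by the parity of its primary key.
// Even keys go to shard 0, odd keys go to shard 1.
final class ParityShardingRouter {

    static final int SHARD_COUNT = 2;

    /** Returns 0 for even keys and 1 for odd keys (also for negative keys). */
    static int shardIndex(long primaryKey) {
        return (int) Math.floorMod(primaryKey, (long) SHARD_COUNT);
    }

    /** Builds the physical table name from a logical name, e.g. t_order -> t_order_1. */
    static String actualTable(String logicTable, long primaryKey) {
        return logicTable + "_" + shardIndex(primaryKey);
    }

    public static void main(String[] args) {
        for (long id : new long[] {10L, 11L, 12L, 13L}) {
            System.out.println("id " + id + " -> " + actualTable("t_order", id));
        }
    }
}
```

`Math.floorMod` is used instead of `%` so that a negative key still maps to shard 0 or 1 rather than to a negative index.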

3. Challenges

Although data sharding addresses problems of performance, availability, and single-node backup and recovery, the distributed architecture brings new problems along with its benefits.

With data scattered across shards, one major challenge is that working with the database becomes extremely burdensome for application developers and database administrators. They need to know exactly which table in which database holds the data they want.
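To illustrate the bookkeeping this places on developers, here is a sketch of locating a record when data is split across several databases and several tables per database. The layout rule (database index from the key modulo the database count, table index from the quotient) is one common convention assumed for this example, not something the text prescribes; the names `ShardLocator`, `ds_` and `t_order` are hypothetical.

```java
// Sketch of locating the physical database and table for a sharding key.
// The routing rule and all names are assumptions made for illustration.
final class ShardLocator {

    /** Returns a location such as "ds_1.t_order_0" for the given key. */
    static String locate(String logicTable, long shardingKey,
                         int databaseCount, int tablesPerDatabase) {
        if (databaseCount < 1 || tablesPerDatabase < 1) {
            throw new IllegalArgumentException("counts must be positive");
        }
        // Database chosen by key modulo the number of databases.
        long dbIndex = Math.floorMod(shardingKey, (long) databaseCount);
        // Table chosen by the quotient, so keys in one database spread over its tables.
        long tableIndex = Math.floorMod(
                Math.floorDiv(shardingKey, (long) databaseCount), (long) tablesPerDatabase);
        return "ds_" + dbIndex + "." + logicTable + "_" + tableIndex;
    }

    public static void main(String[] args) {
        for (long id = 0; id < 8; id++) {
            System.out.println("id " + id + " -> " + locate("t_order", id, 2, 2));
        }
    }
}
```

Without middleware, every query in the application would have to repeat this kind of calculation by hand before it could even pick a connection.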

Another challenge is that SQL that runs correctly on a single-node database may not run correctly on a sharded one. For example, table sharding changes the actual table names, and operations such as pagination, sorting, and aggregation with grouping can be handled incorrectly.
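Pagination is a good example of the problem. The sketch below simulates `ORDER BY value LIMIT offset, count` over two in-memory "shards" held as plain lists; it is an illustration of the merge logic only and does not use any ShardingSphere API. Class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Sketch of why pagination SQL cannot be sent to each shard unchanged.
final class ShardedPagination {

    /** Simulates "ORDER BY value ASC LIMIT 0, n" executed on a single shard. */
    static List<Integer> firstN(List<Integer> shardRows, int n) {
        List<Integer> sorted = new ArrayList<>(shardRows);
        Collections.sort(sorted);
        return new ArrayList<>(sorted.subList(0, Math.min(n, sorted.size())));
    }

    /** Global "ORDER BY value ASC LIMIT offset, count" across all shards. */
    static List<Integer> page(List<List<Integer>> shards, int offset, int count) {
        List<Integer> merged = new ArrayList<>();
        for (List<Integer> shard : shards) {
            // Each shard must return its first offset + count rows, because
            // any of them may belong on the requested global page.
            merged.addAll(firstN(shard, offset + count));
        }
        Collections.sort(merged);
        int from = Math.min(offset, merged.size());
        int to = Math.min(offset + count, merged.size());
        return new ArrayList<>(merged.subList(from, to));
    }

    public static void main(String[] args) {
        List<List<Integer>> shards = List.of(
                List.of(1, 3, 5, 7, 9),    // odd values on one shard
                List.of(2, 4, 6, 8, 10));  // even values on the other
        // Correct global page for LIMIT 2, 2 is [3, 4].
        System.out.println(page(shards, 2, 2));
    }
}
```

If each shard were instead asked for `LIMIT 2, 2` directly, the shards would return 5, 7 and 6, 8, and no merge of those rows could produce the correct page of 3, 4. The query has to be rewritten per shard and the results merged afterwards.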

Cross-database transactions are another thorny issue that a distributed database cluster has to face. Sensible use of table sharding reduces the data volume of a single table while still allowing local transactions as much as possible. Making good use of different tables within the same database effectively avoids the trouble caused by distributed transactions.

Where cross-database transactions cannot be avoided, some businesses still need to keep transactions consistent. However, XA-based distributed transactions do not perform well enough under high concurrency and have not been adopted on a large scale by Internet giants. Most of them use flexible transactions with eventual consistency instead of strongly consistent transactions.

Origin blog.csdn.net/ctwy291314/article/details/130378909