Sub-database and sub-table: solving slow reads and writes on large data volumes

1. Introduction

In one system, the order table has already reached hundreds of millions of rows and is growing by millions of rows every day; in the future the daily growth may even reach tens of millions.

Faced with such a huge amount of data, once the volume keeps growing wildly, slow reads and writes are inevitable.

So, what are the options for making the system withstand the pressure of this data volume?


2. Sub-table and sub-database

When reads and writes on database tables become slow, the first things to consider are optimizing the application's read/write modules and adjusting the software architecture. But when the load can no longer be borne, optimization on the software side alone has only a limited effect.

The solution introduced here is sub-table and sub-database: first split the table, then store the pieces in a distributed fashion.


3. Technology selection for split storage

There are four commonly used approaches to split storage: MySQL partitioning, NoSQL, NewSQL, and sub-table and sub-database built on MySQL.

3.1 MySQL partition technology

Let's first look at the MySQL architecture diagram in the official MySQL documentation. From it, it is easy to see that MySQL's partitioning lives mainly at the file-storage layer: different rows of one table can be stored in different data files. In practice, MySQL partitioning is not recommended, for three main reasons:

  • There is still only one MySQL instance: partitioning distributes only the storage, not the request load.
  • MySQL's partitioning is transparent to users, so they tend to ignore it in practice, and cross-partition operations can seriously hurt system performance.
  • Partitioned tables carry other limitations, such as no support for the query cache, bit-operation expressions, and so on.

3.2 NoSQL

A typical NoSQL database is MongoDB.
MongoDB's sharding can already meet the general needs of large data volumes, both in terms of concurrency and of data size.

However, you still need to pay attention to the following 3 main points:

  • Constraint considerations: MongoDB is not a relational database but a document database; each record is a JSON document with a flexible schema. For data such as critical orders, MongoDB is unsuitable, because order data must live in a strongly constrained relational database.
  • Business-function considerations: transactions, locks, SQL, expressions, and similar operations are battle-tested in MySQL, which can meet all of the business requirements; MongoDB cannot.
  • Stability considerations: MySQL has been proven in practice, while NoSQL has yet to be.

3.3 NewSQL

NewSQL technology is still relatively new; after weighing stability and functional maturity, it was not adopted in the end, for reasons similar to MongoDB's.

3.4 Sub-table and sub-database based on MySQL

What are sub-table and sub-database?
Table splitting stores the data of one large table across multiple smaller tables that share the same structure;
database splitting divides one large database into multiple smaller databases that share the same structure.

Sub-database and sub-table depend little on third parties, and the business logic remains flexible and controllable. The approach needs no complicated low-level machinery and no database rewrite; it simply uses different SQL statements and data sources according to different routing logic.


4. General requirements for sub-database and sub-table technology

If sub-database and sub-table are adopted, three general technical requirements must be implemented:
1) SQL assembly: the table names involved are dynamic, so SQL must be assembled dynamically according to the routing logic;
2) Database routing: the database name is also dynamic, so different logic must route to different databases;
3) Merging of execution results: some queries must run against multiple sub-databases, and the partial results must then be merged.
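The three requirements above can be sketched in a few lines of Python. The database/table names and shard counts here are hypothetical, chosen only for illustration; a real system derives them from its sharding strategy:

```python
# Hypothetical sharding layout: 2 databases, 4 tables per naming scheme.
DB_COUNT, TABLE_COUNT = 2, 4

def route(user_id):
    """Database routing: derive the database and table name from the shard key."""
    return f"order_db_{user_id % DB_COUNT}", f"order_{user_id % TABLE_COUNT}"

def build_sql(user_id):
    """SQL assembly: the table name is dynamic, so the SQL text is built at run time."""
    db, table = route(user_id)
    return db, f"SELECT * FROM {table} WHERE user_id = {user_id}"

def merge_results(partials):
    """Result merging: combine rows from several sub-databases, re-sorting by order_time."""
    return sorted((row for part in partials for row in part),
                  key=lambda row: row["order_time"])
```

For example, `build_sql(5)` routes to database `order_db_1` and builds a query against table `order_1`, and `merge_results` re-sorts the rows gathered from every sub-database into one result set.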

At present, the middleware on the market that solves these problems falls into two categories: Proxy mode and Client mode.

4.1 Proxy mode

Borrowing a diagram from the ShardingSphere official documentation (focus on the Sharding-Proxy layer): this mode puts all the functions, such as SQL assembly, database routing, and result merging, into a standalone proxy service, so that all sharding-related logic lives outside the application. Its advantage is that it does not intrude on the business code; the business side only needs to focus on its own logic.

4.2 Client mode

Again borrowing a diagram from the ShardingSphere official documentation: this mode puts the sharding logic in the client. The client application references a jar, and that jar handles SQL assembly, database routing, result merging, and related functions.



Both modes are represented by middleware products on the market, each with its own pros and cons. In practice, choose the mode that suits your needs.


5. Ideas for implementing sub-database and sub-table

5.1 Which field to use as the shard key

Let's take the following order table, and Client mode, as an example.
We split the data of this order table into sub-tables; its fields include user_id and order_time, among others.
When selecting a field as the shard key, three requirements must be considered:
1) the data should be distributed evenly across the tables and databases;
2) cross-database queries should be minimized;
3) the value of the field must never change.


For the table above, we use user_id as the shard key. Why split it this way? Mainly because of the business requirements.
Some common business requirements, for example:

  • Users need to query all of their orders, which span different order_time values;
  • The back office needs to query local orders by city;
  • The back office needs to report order trends per time period.

Prioritize these requirements: the user-facing operation is the first one that must be satisfied.
If user_id is the shard field of the order table, every user query is guaranteed to hit a single sub-table in a single sub-database.
With user_id as the shard key, every sub-table/sub-database query passes the user_id as a parameter first.

5.2 What sharding strategy to use

Common sharding strategies: shard by range, shard by hash value, and shard by hash value combined with range.

1) Sharding by range
If the user id is an auto-incrementing number, put every 1,000,000 user ids into one database and, within each database, every 100,000 ids into one table.
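Under these assumptions (1,000,000 ids per database, 100,000 per table, so 10 tables per database), the routing is a pair of integer divisions. A minimal sketch:

```python
IDS_PER_DB = 1_000_000     # every 1,000,000 user ids share one database
IDS_PER_TABLE = 100_000    # every 100,000 user ids share one table (10 tables per db)

def range_shard(user_id):
    """Return (database index, table index within that database) for a user id."""
    db_index = user_id // IDS_PER_DB
    table_index = (user_id % IDS_PER_DB) // IDS_PER_TABLE
    return db_index, table_index
```

So user id 1,234,567, for example, lands in database 1, table 2 of that database.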


2) Sharding by hash value
This means sharding by the hash of the user id modulo some fixed number (for ease of later expansion, usually a power of 2).


3) Sharding by hash value and range
First segment by range, then shard by hash modulo.
For example, with table name = order_#user_id%10#_#hash(user_id)%8, the data is split across 10*8 = 80 tables.
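A sketch of that naming rule. Python's built-in `hash()` is not stable across processes, so a CRC32 of the id stands in for the hash function here; that choice is an assumption for illustration, not something the article specifies:

```python
import zlib

def shard_table_name(user_id):
    """Map a user id to one of 10 * 8 = 80 tables: order_<user_id%10>_<hash%8>."""
    range_part = user_id % 10
    hash_part = zlib.crc32(str(user_id).encode()) % 8  # stand-in hash, see note above
    return f"order_{range_part}_{hash_part}"
```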

How do we choose among these three sharding strategies?
Consider just one point: when the data volume grows and the tables must be split more finely, the amount of data that has to be migrated should be as small as possible.

Therefore, when sharding by hash value, it is generally recommended to split into 2^n tables, for example 8 tables. During a later expansion, each original table is simply split in half to form a new table, so the volume of migrated data is minimal.
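To see why powers of two keep migration small, consider doubling from 8 tables to 16: a row moves only when its new bucket differs from its old one, which happens for exactly half the rows of each old table (a sketch under a uniform-id assumption):

```python
def needs_move(user_id, old_tables=8, new_tables=16):
    """True if a row must migrate when the table count doubles from 8 to 16."""
    return user_id % new_tables != user_id % old_tables

# Exactly half of the ids in any aligned block of 16 move, and each old
# table's moving half lands in a single new table (old index + 8).
```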

A value from project experience: take hash(user_id) modulo 32 to spread the data across 32 databases, and split each database further into 16 tables (512 tables in total).

A quick calculation:
assuming 10 million orders per day, each database grows by 10,000,000/32 = 312,500 rows per day, and each table by 10,000,000/32/16 ≈ 19,500 rows per day.
At that rate, after 3 years each table holds about 19,500 × 365 × 3 ≈ 21.35 million rows, which is still within a controllable range.
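The arithmetic above, spelled out:

```python
DAILY_ORDERS = 10_000_000
DB_COUNT, TABLES_PER_DB = 32, 16          # 512 tables in total

per_db_daily = DAILY_ORDERS // DB_COUNT                       # 312,500 rows/db/day
per_table_daily = DAILY_ORDERS // (DB_COUNT * TABLES_PER_DB)  # ~19,500 rows/table/day
per_table_3_years = per_table_daily * 365 * 3                 # ~21.4 million rows/table
```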

If the business is growing very fast and operations can still cope, it is advisable to allocate more databases up front so as to avoid having to expand later.

5.3 How to modify the business code

The business-code changes are strongly tied to the specific business, so a detailed walkthrough would not be very instructive. Still, pay attention to the following points:

  • In a microservice architecture, sharding a specific table affects only the service that owns that table; in a monolithic application the change is more troublesome;
  • In Internet-scale architectures, foreign-key constraints are basically not used;
  • As query separation becomes popular, many back-office operations require cross-database queries, which hurts system performance. Sub-database and sub-table is therefore usually combined with query separation: index all the data in ES, then serve back-office queries directly from ES. If the order volume is very large, another common practice is to store only the index fields (those used as query conditions) in ES and keep the detail data in HBase.

5.4 Historical data migration

The basic idea of data migration:
migrate the existing (stock) data directly; capture incremental changes from the binlog via canal, which notifies the migration program to copy them over. Once the new databases hold the full data set and verification passes, switch traffic over gradually.

Detailed steps of data migration solution:

  • Launch canal so that incremental data starts being captured and migrated;
  • After the migration script passes testing, migrate the old (stock) data into the new sub-databases and sub-tables;
  • Mind the time overlap between the incremental capture and the stock-data migration, so that no data is missed;
  • Once steps 2 and 3 are done, the new sub-databases and sub-tables hold the full data set; run a data-verification program to confirm that everything landed in the new databases;
  • At this point the migration is complete, and the new version of the code can go live. Whether to do a gray release or a direct release depends on the actual situation, and the same goes for the rollback plan.
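The verification step in the list above can be as simple as comparing row counts plus an order-independent checksum between the old table and the union of the new sub-tables. A minimal sketch (the row format and checksum choice are assumptions for illustration):

```python
import hashlib

def table_checksum(rows):
    """Order-independent checksum over a table's rows."""
    digests = sorted(hashlib.md5(repr(row).encode()).hexdigest() for row in rows)
    return hashlib.md5("".join(digests).encode()).hexdigest()

def verify_migration(old_rows, new_shards):
    """Check the old table against the union of all new sub-tables."""
    migrated = [row for shard in new_shards for row in shard]
    return (len(migrated) == len(old_rows)
            and table_checksum(old_rows) == table_checksum(migrated))
```

In production this comparison would run per shard against the real databases, but the shape of the check is the same.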

5.5 Future expansion plan

As the business develops, the original sharding design may no longer satisfy the growing data demand, and expansion must be considered. It depends on the following two points:

  • Whether the sharding strategy lets each new table be filled from a single old table rather than several, which is exactly why 2^n tables are recommended;
  • Data migration: the data on the old shards must be moved to the new shards, using the same approach as the historical-data migration above.

Origin blog.csdn.net/locahuang/article/details/123497108