After the database is sharded into tables, how do we handle expansion?

In real-world development, database expansion is closely tied to the chosen sharding rules. In this article we approach the problem from a system design perspective: we abstract a business scenario of the kind that comes up during project development, and discuss it from the angles of database design, routing rules, and data migration plans.

Starting from a business scenario

Suppose we need to design the order module of an e-commerce website's database. After estimating business growth, we project that within three years the data volume may reach 60 million rows, with daily orders exceeding 100,000.

First, choose the storage implementation. As the core data of an e-commerce business, orders must avoid data loss as much as possible and require strong consistency, so the clear choice is a relational database with transaction support, such as MySQL with the InnoDB storage engine.

Next is high availability. Order data is typically read-heavy and write-light: it is read not only from the consumer side but also by many upstream and downstream business modules internally, so order queries will be very frequent. Based on this, we configure read-write splitting on top of master-slave replication, and set up multiple slave databases to improve read throughput and data safety.
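To make the read-write split concrete, here is a minimal sketch of routing reads to replicas and writes to the master. It is a hypothetical illustration of the setup described above, not a specific framework's API; all class and method names are made up.

```java
import javax.sql.DataSource;

// Hypothetical sketch: writes go to the master, reads round-robin across replicas.
public class ReadWriteRoutingDataSource {
    private final DataSource master;
    private final DataSource[] replicas;
    private int next = 0;

    public ReadWriteRoutingDataSource(DataSource master, DataSource... replicas) {
        this.master = master;
        this.replicas = replicas;
    }

    public DataSource forWrite() {
        return master; // all writes and transactions hit the master
    }

    public synchronized DataSource forRead() {
        DataSource ds = replicas[next];       // spread heavy order queries
        next = (next + 1) % replicas.length;  // across the slave databases
        return ds;
    }
}
```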

Finally, data scale. 60 million rows clearly exceeds what a single table can bear; the "Alibaba Java Development Manual" recommends table sharding once a single table exceeds 5 million rows. So database and table sharding must be considered. How should the routing rules and split plan be designed? That is what we discuss next.

Routing rules and expansion plans

We now consider three types of routing rules: hash modulo on the primary key, range-based routing, and rules that combine hash modulo with data ranges.

1. Hash modulo routing

Hash modulo is the most common scheme in database and table sharding: based on the business primary key, a modulo is computed to determine where the data should be inserted.

Based on a single-table capacity on the order of a few million rows, we split the 60 million rows across 64 tables, and further distribute those 64 tables over two databases, 32 tables each. When a new order is created, we first generate its order ID, take it modulo the number of databases to determine which database to access, then take it modulo the number of tables per database to determine which table to route to. Queries apply the same rules, so an order ID always locates the exact data table.

[Figure: rule diagram for hash-modulo routing]
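A minimal sketch of this routing rule, assuming the 2-database, 32-tables-per-database layout above; the class name and table-naming convention are illustrative, not from the source:

```java
// Hash-modulo routing: orderId % dbCount picks the database,
// orderId % tablesPerDb picks the table within it.
public final class OrderShardRouter {
    private static final int DB_COUNT = 2;
    private static final int TABLES_PER_DB = 32;

    public static String routeFor(long orderId) {
        int dbIndex = (int) (orderId % DB_COUNT);
        int tableIndex = (int) (orderId % TABLES_PER_DB);
        return String.format("order_db_%d.order_%02d", dbIndex, tableIndex);
    }

    public static void main(String[] args) {
        System.out.println(routeFor(123456789L)); // prints order_db_1.order_21
    }
}
```

Reads and writes share the same routeFor call, which is what lets a query locate the table from the order ID alone.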

The advantage of hash-modulo routing is that data is split evenly; the disadvantage is that it is unfriendly to later expansion. Suppose our orders grow faster than expected and the data volume quickly reaches the hundreds of millions. The original tables no longer meet performance requirements, and the database has to be split further.

After such a split, the number of order databases and tables changes, and the routing rules must change with them. To adapt to the new sharding rules and keep reads and writes working correctly, data migration is unavoidable. In practice, it can be carried out in two ways: downtime migration and zero-downtime migration.

  • Downtime migration

Downtime migration is relatively simple. When using certain websites or apps, we often receive notices that service will be suspended during a given window; typically, data migration is completed during that window, historical data is redistributed to the new storage according to the new rules, and then the service is switched over.

  • Zero-downtime migration

Zero-downtime migration, also called dynamic expansion, relies on double-writing in the business layer. It has to handle both existing and incremental data, and perform various data verifications.

Generally speaking, a database can be expanded either by adding nodes to the existing storage or by deploying a brand-new database. Different expansion methods call for different migration approaches and double-write strategies.

If we deploy brand-new database storage, the process can be roughly divided into the following steps:

  • Create the new order database;

  • At a chosen point in time, redistribute historical data to the new database according to the new routing rules;

  • Enable double-writing in the old database's write path, writing to both databases simultaneously (see the sketch after this list);

  • Gradually switch reads and writes from the old service to the new one, verifying data consistency along the way, until the full traffic cutover is complete.
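As a concrete illustration of the double-write step, here is a minimal, hypothetical sketch. The OrderDao interface, Order record, and class names are made up for this example; the old database stays the source of truth, and a failed write to the new database is logged for later backfill instead of failing the request.

```java
import java.util.logging.Logger;

// Hypothetical DAO abstraction over the old and new shard sets.
interface OrderDao {
    void insert(Order order);
}

record Order(long orderId, long userId, long amountCents) {}

public class DoubleWriteOrderService {
    private static final Logger LOG =
            Logger.getLogger(DoubleWriteOrderService.class.getName());
    private final OrderDao oldDb; // existing shards: source of truth until cutover
    private final OrderDao newDb; // new shards: populated in parallel

    public DoubleWriteOrderService(OrderDao oldDb, OrderDao newDb) {
        this.oldDb = oldDb;
        this.newDb = newDb;
    }

    public void createOrder(Order order) {
        oldDb.insert(order); // must succeed; the old path still serves reads
        try {
            newDb.insert(order); // best effort during migration
        } catch (RuntimeException e) {
            // a verification/backfill job reconciles the gap later
            LOG.warning("double-write failed for order " + order.orderId()
                    + ": " + e.getMessage());
        }
    }
}
```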

This is a highly simplified process; many details need to be handled in actual development. Interested readers can look into standardized data migration processes such as ETL.

2. Range-based routing

Range-based routing usually divides data into intervals on a specific field. When splitting the order table this way, we can partition by ranges of the order ID.

Again splitting into 64 data tables: orders with an ID below 30 million go to the first order database, and orders with an ID of 30 million or above go to the second. Within each database, each table then covers a range of 1 million IDs.

[Figure: rule diagram for range-based routing]
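A sketch of this rule in code, assuming the 30-million split point and 1-million-per-table ranges described above (which works out to 30 tables per database for IDs below 60 million); the names are illustrative:

```java
// Range-based routing: the ID interval alone decides database and table.
public final class RangeShardRouter {
    private static final long DB_SPLIT_POINT = 30_000_000L;  // db 0 below, db 1 at or above
    private static final long RANGE_PER_TABLE = 1_000_000L;  // each table covers 1M IDs

    public static String routeFor(long orderId) {
        int dbIndex = orderId < DB_SPLIT_POINT ? 0 : 1;
        long tableIndex = (orderId - dbIndex * DB_SPLIT_POINT) / RANGE_PER_TABLE;
        return String.format("order_db_%d.order_%02d", dbIndex, tableIndex);
    }
}
```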

As you can see, with range-based routing, expansion simply means adding new storage and mapping newly generated ID intervals onto the new nodes. No rebalancing between existing nodes is required, and no historical data needs to be migrated.

The disadvantage of this method, however, is uneven data access. Under this rule, the second database sits unused for a long time, so load across the data nodes is unbalanced. In the extreme case, the current hotspot database hits a performance bottleneck, and the performance benefits of sharding are lost.

3. Combining range and hash modulo

Now consider: if we combine the two methods above, range partitioning and hash modulo, can we achieve both an even data distribution and easier expansion?

We design the routing rule as follows: first apply a hash modulo to the order ID to choose a database, then partition the data within that database by range.

[Figure: the order database further split by hash plus range]

As the diagram shows, combining hash modulo with range intervals balances the strengths and weaknesses of the two routing schemes. On a write, a modulo computation first selects the database, then a second computation on the order ID's range spreads the data across different tables.

This method avoids the hotspot storage that a pure range-based rule can create, and during later expansion the corresponding new tables can be added directly, avoiding complex data migration work.
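A sketch of the combined rule, assuming the database is still chosen by orderId % 2 while tables within each database cover ranges of 1 million IDs; the constants and names are illustrative:

```java
// Hash-plus-range routing: modulo picks the database for even distribution,
// the ID range picks the table so growth only ever appends new tables.
public final class HashRangeShardRouter {
    private static final int DB_COUNT = 2;
    private static final long RANGE_PER_TABLE = 1_000_000L;

    public static String routeFor(long orderId) {
        int dbIndex = (int) (orderId % DB_COUNT);
        long tableIndex = orderId / RANGE_PER_TABLE; // new ranges -> new tables, no migration
        return String.format("order_db_%d.order_%03d", dbIndex, tableIndex);
    }
}
```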

Above, through the design of one business scenario, we considered several routing rules and expansion plans for sharded databases and tables. This is an open-ended question: the way of thinking matters more than any particular plan, and real business is far more complicated. You can combine this with your own project practice and think about how you would design routing rules and expand data capacity in the modules you are responsible for.

Summary

This article starts from the design of a real business scenario and walks through different routing rules for sharding databases and tables, their respective advantages and disadvantages, and how they affect expansion.

If today's question came up in an interview, it would be a typical system design question. So what should you pay attention to when answering system design questions?

First, when a system design question appears in an interview, communication is crucial. Confirm the overall data scale and the inputs and outputs with the interviewer, and clarify the boundaries of the design. For example, different data scales directly affect how the database tables should be designed.

Second, identify the main problems and understand the system's bottlenecks. From there, you can apply various system design techniques to design each layer of the business.

Origin blog.csdn.net/caryxp/article/details/135023490