E-commerce system architecture design series (8): What should you do when order data keeps growing and the database keeps getting slower?

In the last article, I left you a question to think about: what should you do when order data keeps growing and the database keeps getting slower?

In today's article, let's talk about how to handle continuously growing data, especially data like orders that accumulates over time.

Introduction

Why does a database get slower as the amount of data grows? You need to understand the root cause of this.

We know that every operation, whether create, read, update, or delete, is at heart a search problem: you have to find the data before you can operate on it. The performance of a storage system is therefore largely a question of how fast it can search.

No matter what kind of storage system it is, the time a query takes depends on two factors:

  1. The time complexity of the lookup
  2. The total amount of data

This is why big tech companies love to ask questions about "time complexity" in interviews.

The time complexity of the lookup in turn depends on two factors:

  1. The search algorithm
  2. The data structure used to store the data

You see, these two topics are also regulars in interview questions. The interviewer is not asking useless questions just to embarrass you; this knowledge really is useful. The trick is knowing how to apply it.

Most business systems use an off-the-shelf database. The storage structure and search algorithms are implemented by the database itself, and the business system generally cannot change them. For example, MySQL's InnoDB storage engine stores data in a B+ tree and searches it by tree traversal, so the lookup time complexity is fixed at O(log n). The only variable left for us to change is the total amount of data.

Therefore, the idea for fixing a storage system slowed down by massive data is very simple: split. Break one large data set into N smaller pieces, a technique called "sharding (Shard)". After the split, each shard holds far less data; if we can then make each search land on a single shard, search performance improves.
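As a rough illustration of why shrinking the data set helps even an O(log n) structure, here is a plain-Python sketch with made-up numbers (real B+ tree fan-outs vary with row and page size):

```python
def btree_height(rows, fanout=100):
    """Approximate B+ tree height: how many levels must be descended
    before a single node's subtree can cover all rows."""
    height, capacity = 1, fanout
    while capacity < rows:
        capacity *= fanout
        height += 1
    return height

# 100 million orders vs. 1 million orders left after splitting/archiving:
print(btree_height(100_000_000))  # 4 levels
print(btree_height(1_000_000))    # 3 levels
```

Each level saved is one less page read per lookup; sharding pushes the same effect further, because each query only has to search one shard's smaller tree.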

All distributed storage systems follow this idea to solve the problem of searching massive data sets. But the idea alone is not enough; it has to be implemented. Let's talk about how to split the data.

Archive historical order data to improve query performance

When we build business systems, a lot of data has a time attribute and accumulates as the system runs; once the volume reaches a certain level, the database slows down. Order data in e-commerce is a typical example. By the reasoning above, this is the point where the data needs to be split.

Our order data generally lives in an orders table in MySQL. When splitting a MySQL table comes up, most people's first reaction is to shard across databases and tables. Don't rush: at our current data volume we have not yet reached the point where sharding is necessary. If archiving can solve the problem, don't shard. (We will cover sharding databases and tables in the next article; you can also read my other article, a summary of sharding schemes.)

When a single orders table holds so much data that performance suffers, the preferred solution is to archive historical orders.

Archiving is itself a data-splitting strategy. Simply put, it moves the bulk of historical orders into a separate historical order table. Why does this help? Because data with a time attribute, such as orders, is access-skewed: most of the time only recent data is read, while the majority of rows in the orders table are old records that are rarely touched.

Because recent data is only a small fraction of the total, once new and old data are separated, the live table shrinks dramatically and queries against it get much faster. The historical table is not much smaller than the original, so its queries do not speed up noticeably, but since old data is rarely accessed, being a bit slower there is not a big problem.

Another advantage of this split is that very little code needs to change. Most operations on the orders table happen before an order is completed, and that business logic does not need to be modified at all. Even post-completion operations such as returns and refunds have a time limit, so they stay within the live table's window and also need no changes; the orders table is operated on exactly as before.

Essentially, only the query and reporting functions that need to reach historical orders require small adjustments: based on the requested time range, query either the orders table or the historical order table. Many large e-commerce companies used this order-splitting scheme for years as they grew. You may remember that a few years ago, when you checked your orders on JD.com or Taobao, there was an option to view "orders from more than three months ago"; that option queried the historical order table.
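That small adjustment can be sketched as a routing helper. The table names, the function name, and the 90-day approximation of "three months" below are all assumptions for illustration, not code from a real system:

```python
from datetime import date, timedelta

def table_for_order_query(query_from, today=None):
    """Route an order query to the live table or the history table,
    based on whether the requested range starts before the archive
    boundary (roughly three months ago)."""
    today = today or date.today()
    boundary = today - timedelta(days=90)
    return "orders" if query_from >= boundary else "orders_history"

# A recent range goes to the live table; an older range to history.
print(table_for_order_query(date(2023, 8, 20), today=date(2023, 9, 1)))
print(table_for_order_query(date(2023, 1, 1), today=date(2023, 9, 1)))
```

A query whose range straddles the boundary would need to read both tables and merge the results (for example with a UNION), which is the one place the code change is slightly more involved.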

The general process of archiving historical orders is as follows:

  1. First, create a historical order table with the same structure as the orders table;
  2. Then move historical order data out of the orders table in batches and insert it into the historical order table. You can implement this however you like: a stored procedure, a script, or a small program that imports the data; use whatever you are most familiar with. If your database has a primary/replica setup, it is best to read the orders from a replica and write them into the historical order table on the primary, which reduces the load on the primary.
  3. Now both the orders table and the historical order table contain the historical orders. Don't delete anything from the orders table yet. First test and deploy the new version of the code that supports the historical order table. Because both tables hold the historical orders, the database can serve both the old and the new version of the code; if the new version has a bug, you can roll back to the old version immediately without affecting the live business.
  4. After the new code is live and verified correct, the historical order data in the orders table can be deleted.
  5. Finally, deploy a program or script that migrates data periodically, moving expired orders from the orders table into the historical order table.
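Steps 1, 2 and 5 above can be sketched as follows. This uses an in-memory SQLite database as a stand-in for MySQL, and the table and column names (`orders`, `ts`, `amount`) are assumptions for illustration only:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, ts TEXT, amount REAL)")
# Seven old orders and three recent ones (synthetic data).
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(i, "2023-01-01" if i <= 7 else "2023-08-01", 9.9) for i in range(1, 11)])

# Step 1: a history table with the same structure as the orders table.
conn.execute("CREATE TABLE orders_history AS SELECT * FROM orders WHERE 0")

# Steps 2/5: copy expired orders over in batches, then delete each batch.
CUTOFF, BATCH = "2023-06-01", 3
while True:
    rows = conn.execute(
        "SELECT * FROM orders WHERE ts < ? ORDER BY id LIMIT ?",
        (CUTOFF, BATCH)).fetchall()
    if not rows:
        break
    conn.executemany("INSERT INTO orders_history VALUES (?, ?, ?)", rows)
    conn.execute("DELETE FROM orders WHERE id IN (%s)" % ",".join("?" * len(rows)),
                 [r[0] for r in rows])
    conn.commit()

print(conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0])          # recent orders kept
print(conn.execute("SELECT COUNT(*) FROM orders_history").fetchone()[0])  # old orders archived
```

In production you would, as the text says, run the copy phase first, verify the new code against both tables, and only then run the delete phase.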

Child tables associated with orders, such as the order items table, need to be archived into their own history tables in the same way. Because they all reference the main orders table through the order ID as a foreign key, simply archive their rows together with the corresponding orders.

Throughout this process, the key concern is minimizing the impact on the live business. Migrating such a large amount of data will inevitably affect database performance, so migrate during off-peak hours, and be sure to back up the data first; if you make a mistake, you can restore from the backup.

How to delete a large amount of data in batches?

There is another very important detail here: how do we delete the historical order data that has already been copied out of the orders table? Can we just run a single SQL statement to delete all orders older than three months, like this?

delete from orders
where timestamp < SUBDATE(CURDATE(), INTERVAL 3 MONTH);

Most likely this statement will fail or run far too long, because the amount of data to delete in one go is too large: a single huge DELETE holds many row locks and generates a large amount of undo log. So it needs to be deleted in batches instead, for example 1,000 rows at a time. The batch-delete SQL can be written like this:

delete from orders
where timestamp < SUBDATE(CURDATE(), INTERVAL 3 MONTH)
order by id limit 1000;

When executing the delete statements, it is best to pause briefly between batches to avoid putting too much pressure on the database. The statement above is already usable: run it repeatedly until all historical orders are deleted.

However, this SQL still has room for optimization. Every time it runs, it must first find the qualifying records via the index on timestamp, then sort those records by order ID, and only then delete the first 1,000.

In fact, there is no need to compare against timestamp on every run. We can first query once for the largest order ID among the qualifying historical orders, and then rewrite the delete condition to filter by primary key:

select max(id) from orders
where timestamp < SUBDATE(CURDATE(),INTERVAL 3 month);


delete from orders
where id <= ?
order by id limit 1000;

Now every delete compares only the primary key. In MySQL's InnoDB storage engine, the table data is a B+ tree organized by primary key, and a B+ tree is inherently ordered, so the search is very fast and no extra sorting is needed. The prerequisite, of course, is that order IDs must be positively correlated with order time. Most order-ID generation schemes satisfy this, so it is usually not a problem.
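The two statements above combine into the following loop. Again SQLite stands in for MySQL (so the batched DELETE is phrased with a subquery rather than MySQL's `ORDER BY ... LIMIT`), and the data is synthetic, with IDs deliberately increasing with order time as the text requires:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, ts INTEGER)")
# IDs grow with order time, the precondition named in the text.
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [(i, i) for i in range(1, 101)])

CUTOFF = 60  # stand-in for "three months ago"

# Step 1: look up the boundary primary key once.
max_id = conn.execute("SELECT MAX(id) FROM orders WHERE ts < ?",
                      (CUTOFF,)).fetchone()[0]

# Step 2: every subsequent batch compares only against the primary key,
# so InnoDB would walk the already-ordered clustered index.
while True:
    cur = conn.execute(
        "DELETE FROM orders WHERE id IN "
        "(SELECT id FROM orders WHERE id <= ? ORDER BY id LIMIT 10)",
        (max_id,))
    conn.commit()
    if cur.rowcount == 0:
        break

print(conn.execute("SELECT MIN(id), COUNT(*) FROM orders").fetchone())
```

Only the surviving recent orders remain, and no batch after the first query ever touched the timestamp index.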

One more question: why keep the ORDER BY in the delete statement at all? Because after sorting by ID, each batch deletes a run of records with consecutive IDs. Thanks to the ordering of the B+ tree, records with adjacent IDs sit roughly together in the physical file on disk, which makes the deletion more efficient and also makes it easier for MySQL to reclaim pages.

After deleting a large amount of historical order data, if you check the disk space MySQL occupies, you will find it has not shrunk. Why? This, too, comes down to InnoDB's physical storage structure.

Logically each table is a B+ tree, but physically the records are stored in a disk file and linked into a B+ tree through position pointers. When MySQL deletes a record, it can only locate the record's position in the file, mark that region of the file as free, and adjust the related pointers in the B+ tree. The deleted record is still lying in the same place in the file, so the disk space is not released.

There is no way around this: a file is a contiguous sequence of bytes, like an array, and does not support removing a chunk from the middle. To truly remove it, all the data after that position would have to be shifted forward, which means moving a huge amount of data and is extremely slow. So deletion can only mark the space as free, and the space is reused when new data is written later.

Once you understand this principle, it is easy to see that many other databases, not just MySQL, behave the same way, and there is no particularly good general fix. If you have enough disk space, just leave it: the data is gone and queries are faster, which was the goal.

If the database's disk space is tight and we must reclaim it, we can run OPTIMIZE TABLE. For InnoDB, OPTIMIZE TABLE actually rebuilds the table. In older MySQL versions this locks the table for the entire rebuild, meaning order processing would stall; newer versions rebuild online with only brief locks, but it is still an expensive operation, so take care either way. There is also a prerequisite: MySQL must be configured with one tablespace per table (innodb_file_per_table = ON). If all tables share a single tablespace, OPTIMIZE TABLE will not release disk space.

Rebuilding the table also rebuilds its indexes, making both table data and index data more compact; this saves disk space and improves query efficiency. For a table like this, which frequently inserts and deletes large amounts of data, it is well worth running OPTIMIZE TABLE periodically if the locking is acceptable.

If the system can tolerate a brief service pause, the fastest approach is this: create a temporary order table, copy the current (recent) orders into it, rename the old orders table out of the way, and finally rename the temporary table to the official orders table name. This amounts to rebuilding the orders table by hand, without the long process of deleting historical orders. The SQL for this procedure is below for reference:

-- Create a temporary order table
create table orders_temp like orders;


-- Copy the current orders into the temporary table
insert into orders_temp
  select * from orders
  where timestamp >= SUBDATE(CURDATE(), INTERVAL 3 MONTH);


-- Swap the table names
rename table orders to orders_to_be_dropped, orders_temp to orders;


-- Drop the old order table
drop table orders_to_be_dropped;

Summary

For data with a time attribute, such as orders, volume accumulates over time. To maintain query performance, the data needs to be split, and the preferred split is archiving old data into a history table. This approach works very well, and, just as importantly, it requires only small changes to the system, so the upgrade cost is low.

When migrating the historical data, if the service can be paused, the fastest way is to build a new orders table, copy the last three months of orders into it, and rename it into place. If you can only migrate online, delete historical orders iteratively in batches, and control the deletion pace so you don't put too much pressure on the live database.

Finally, a reminder: operating on live data is very dangerous. Always back up the data before you operate.

Thanks for reading. If this article inspired you, please share it with your friends.

Thinking question

When sharding databases and tables really is necessary, how should it be designed? What issues should be considered?

I look forward to your thoughts; feel free to leave a message or contact me to discuss. Learn together, grow together.

Previous article

E-commerce system architecture design series (7): How to build an e-commerce product search system?


Origin blog.csdn.net/hemin1003/article/details/132236331