MySQL million-row query optimization in practice: from getting started to getting fired

In one of my projects, the order table generates 50,000 to 100,000 records per day.
I didn't have much experience with large-data queries before. For statistics that were hard to query directly, my usual approach was to build a summary table and populate it periodically, or to write stored procedures to optimize the query process.
So, following the usual routine, I built the table and added indexes.
But this time, with only a bit over 100,000 rows in the table, the daily statistics query was already failing: a single query took four or five seconds.
With millions of rows I could at least blame server performance. But a hundred thousand rows? If I reported that and asked the boss to upgrade the server, he'd probably just send me to HR to settle my final paycheck.
I could picture how that would go:
Me: "It's a server performance problem; the query can't keep up." Boss: "The current server is 100 a month. 100,000 rows, 100; a million rows, 1,000. Tell you what: go settle your salary with HR, and I'll use it to buy a high-end server."
So I hit Baidu for ways to optimize the query.
The table structure is below (some fields are omitted; both the number of fields and the amount of data affect query speed):

CREATE TABLE `myorder`  (
  `ID` int(11) NOT NULL AUTO_INCREMENT COMMENT 'row id',
  `CREATE_TIME` datetime(0) NOT NULL DEFAULT CURRENT_TIMESTAMP COMMENT 'creation time',
  `UPDATE_TIME` datetime(0) NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP(0) COMMENT 'last modified time',
  `sn` varchar(50) NOT NULL COMMENT 'order number',
  `client_id` int(11) NOT NULL COMMENT 'supplier id',
  `money` decimal(19,2) DEFAULT NULL COMMENT 'order amount',
  PRIMARY KEY (`ID`) USING BTREE
) ENGINE = InnoDB CHARACTER SET = utf8 COLLATE = utf8_general_ci ROW_FORMAT = Dynamic;
ALTER TABLE myorder ADD INDEX index_mo_sn ( `sn` ) ;
ALTER TABLE myorder ADD INDEX index_mo_client_id ( `client_id` ) ;

The SQL is as follows:

select client_id,sum(money)
from myorder
where CREATE_TIME>=CURRENT_DATE
group by client_id

Remove the GROUP BY and the query was fast.
After some trial and error, the first optimization was to add an index on the time field.
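For the record, it was a plain secondary index, in the same style as the others (the index name here is my own choice):

ALTER TABLE myorder ADD INDEX index_mo_create_time ( `CREATE_TIME` ) ;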
I thought the problem was solved. Then one night a phone call jolted me out of bed: the boss had noticed that every statistics query was taking half a minute.
I couldn't believe it. It was fine yesterday, so how did it die today? Is there really that much difference between 400,000 and 500,000 rows?
I went over the code myself and genuinely couldn't find anything wrong, so I told the boss: "The server network is fluctuating. I'll fix it."
After some testing, I found that the data volume that day was a bit larger than usual: filtering by time alone took more than ten seconds, and with the grouping on top it took even longer.
Ten-odd minutes later the boss asked: "How is it going?" Me: "The hosting provider says their network was attacked and is being repaired." Then I logged into the server and closed port 80 in the firewall.
Checking the execution plan, the query wasn't using the time index at all. I kept writing test SQL and concluded that filtering on the time column simply wasn't going to work.
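For anyone following along: the plan comes from prefixing the query with EXPLAIN. When the optimizer abandons the time index, the type column typically degrades to ALL, a full table scan, presumably because the range matches such a large share of the table that a scan is judged cheaper than millions of secondary-index row lookups:

EXPLAIN
select client_id,sum(money)
from myorder
where CREATE_TIME>=CURRENT_DATE
group by client_id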
After muting my phone, I thought it over properly. At one point the idea of a summary table filled by a scheduled job came up again, but each statistics run was already taking ten-plus seconds, and it would be embarrassing if the table got locked mid-run.
After some more thought, I landed on querying by ID instead, using the ID of the previous day's last row as the boundary condition.
The SQL is as follows:

select client_id,sum(money)
from myorder
where id>(select max(id) from myorder where CREATE_TIME<CURRENT_DATE)
group by client_id

After testing, the query time on millions of rows stayed under 1 second.
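One tweak worth considering on top of this (my own sketch, not part of the original deployment): resolve the boundary id once, say into a session variable, so the subquery doesn't rerun on every call:

SET @boundary = (select max(id) from myorder where CREATE_TIME<CURRENT_DATE);
select client_id,sum(money)
from myorder
where id>@boundary
group by client_id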
Then a smaller problem turned up. A paged query runs in two steps: first count the total number of records, then fetch the page of data. Counting the total with no conditions at all was a bit slow, and the bigger the table, the slower it got.
The SQL is as follows:

select count(1)
from myorder

Counting 1,000,000 rows took 2 seconds. The execution plan showed the count wasn't using an index, so I nudged it onto one.
The SQL is as follows:

select count(1)
from myorder
where id is not null

After adding that, the count took only 0.2 seconds.
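As an aside: when the total only feeds a pager, an approximation can be enough. InnoDB keeps an estimated row count that can be read without scanning anything (an alternative I didn't use at the time; the estimate can be off by a fair margin):

SHOW TABLE STATUS LIKE 'myorder';
-- or
SELECT table_rows FROM information_schema.tables
WHERE table_schema = DATABASE() AND table_name = 'myorder';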
I redeployed the project and called the boss back. Boss: "Why didn't you answer my call?" Me: "I stepped out for a bit and didn't take my phone." Boss: "The hosting provider told me their network was never attacked." I argued shamelessly: "Impossible. I didn't touch anything, so if it wasn't attacked, how did it fix itself?" The boss had no comeback.
One day the boss called me over for a chat. Boss: "I've noticed the queries are slow now, especially the last page." I explained: "The data volume is large, so queries are slow." Boss: "A developer at my friend's company earns 4K a month and gets roughly the same query speed on millions of rows as you do. You have to be worth your 8K. Is asking you to be twice as fast really too demanding?" I was stunned; I never imagined it could be compared like that: "That... almost sounds reasonable. I still think you could upgrade the server." The boss looked displeased: "Should I get you Tianhe-1 while I'm at it?" Me: "..."
First, the slow query on the last page.
Following a method found on Baidu, I rewrote the query.
Original SQL:

select *
from myorder
limit 1000000,100

After rewriting:

select *
from myorder
where id >= (SELECT id FROM myorder LIMIT 1000000, 1)
limit 100

It looked fine, so I deployed it. Two days later the boss came to me again. Boss: "Why does the data look the same on every page when I sort in descending order?" I took a look, and sure enough it did. Quick on my feet as ever: "There's a problem with the default configuration of the provider's database. Let me deal with it."
The trick silently assumes the rows come back in ascending id order; as soon as the user sorts by another column, or descending, the id >= boundary from the subquery no longer lines up with the requested page. In real use our WHERE clause carries many conditions, and users sort by all sorts of field combinations, ascending and descending. One "good" fix would be to ban sorting (keep ORDER BY to a minimum) and ban searching, but that was not on the table.
After many more tests, I found that if the select list contains only the id column, the query is very fast, presumably because it can be answered from an index alone. Check the execution plan, add even one more field to the select list, and the index is no longer used. (Whatever fields end up in WHERE and ORDER BY still have to be tested against the execution plan case by case.)
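A minimal illustration of what I was seeing (client_id = 42 is just an example value):

-- covering: answered from the index on client_id alone, Extra shows "Using index"
EXPLAIN select id from myorder where client_id=42 order by id;
-- one extra column forces a row lookup for every index entry
EXPLAIN select id,money from myorder where client_id=42 order by id;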
So the fix was to split the pagination query in two:

SELECT id FROM myorder where ... order by ... LIMIT 1000000, 100
select * from myorder where id in (1,2,3,4...)

First fetch just the ids, join them into a comma-separated string, then query with IN. Query speed was unaffected and the data came back correct.
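Concretely, the two steps look like this (the WHERE condition and the id values are placeholders):

-- step 1: page over ids only; with the index on client_id this stays a covering scan
SELECT id FROM myorder where client_id=42 order by id desc LIMIT 1000000, 100;
-- step 2: fetch full rows for just those 100 ids
select * from myorder where id in (1000101,1000102,1000103);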

One day, the boss again: "Why can't the server be reached?" I checked the logs: someone had tried to export hundreds of thousands of rows, memory overflowed, and the application crashed. Me: "The server network is fluctuating." Not long after, the boss: "The provider says there's nothing wrong with the network, but they did give me a suggestion." Curious, I asked: "What suggestion?" Boss: "They suggested I fire you."

Source: blog.csdn.net/xymmwap/article/details/108902725