Big-tech interviewer: how do you quickly query a table with tens of millions of rows?

Preface

  • Interviewer: Let's talk about tens of millions of rows of data. How would you query it?
  • Brother B: Just page through it directly with LIMIT.
  • Interviewer: Have you actually tried that in practice?
  • Brother B: Of course I have.

Cue the song "Liang Liang" ("gone cold") at this moment.

Some people may never have dealt with a table holding tens of millions of rows and have no idea what actually happens when you query one.

Today I'll walk you through a hands-on exercise. The tests below were run on MySQL 5.7.26.

Prepare data

What if you don't have 10 million rows of data?

Create

Insert 10 million rows one at a time from application code? Not a chance; it is far too slow and could genuinely take a whole day. A database script runs much faster.

Create table
CREATE TABLE `user_operation_log` (
  `id` int(11) NOT NULL AUTO_INCREMENT, 
  `user_id` varchar(64) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci NULL DEFAULT NULL, 
  `ip` varchar(20) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci NULL DEFAULT NULL, 
  `op_data` varchar(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci NULL DEFAULT NULL, 
  `attr1` varchar(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci NULL DEFAULT NULL, 
  `attr2` varchar(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci NULL DEFAULT NULL, 
  `attr3` varchar(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci NULL DEFAULT NULL, 
  `attr4` varchar(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci NULL DEFAULT NULL, 
  `attr5` varchar(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci NULL DEFAULT NULL, 
  `attr6` varchar(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci NULL DEFAULT NULL, 
  `attr7` varchar(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci NULL DEFAULT NULL, 
  `attr8` varchar(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci NULL DEFAULT NULL, 
  `attr9` varchar(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci NULL DEFAULT NULL, 
  `attr10` varchar(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci NULL DEFAULT NULL, 
  `attr11` varchar(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci NULL DEFAULT NULL, 
  `attr12` varchar(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci NULL DEFAULT NULL, 
  PRIMARY KEY (`id`) USING BTREE
) ENGINE = InnoDB AUTO_INCREMENT = 1 CHARACTER SET = utf8mb4 COLLATE = utf8mb4_general_ci ROW_FORMAT = Dynamic; 
Create data script

Batch inserts are much more efficient, and we commit every 1,000 rows; if a single batch is too large, the batch insert itself slows down.

DELIMITER ;;
CREATE PROCEDURE batch_insert_log()
BEGIN
  DECLARE i INT DEFAULT 1;
  DECLARE userId INT DEFAULT 10000000;
  SET @execSql = 'INSERT INTO `test`.`user_operation_log`(`user_id`, `ip`, `op_data`, `attr1`, `attr2`, `attr3`, `attr4`, `attr5`, `attr6`, `attr7`, `attr8`, `attr9`, `attr10`, `attr11`, `attr12`) VALUES';
  SET @execData = '';
  WHILE i <= 10000000 DO
    -- a deliberately long attribute string so every row takes up real space
    SET @attr = "'测试很长很长很长很长很长很长很长很长很长很长很长很长很长很长很长很长很长的属性'";
    SET @execData = concat(@execData, "(", userId + i, ", '10.0.69.175', '用户登录操作'", ",", @attr, ",", @attr, ",", @attr, ",", @attr, ",", @attr, ",", @attr, ",", @attr, ",", @attr, ",", @attr, ",", @attr, ",", @attr, ",", @attr, ")");
    IF i % 1000 = 0 THEN
      -- every 1,000 rows: assemble the multi-row INSERT, execute it and commit
      SET @stmtSql = concat(@execSql, @execData, ";");
      PREPARE stmt FROM @stmtSql;
      EXECUTE stmt;
      DEALLOCATE PREPARE stmt;
      COMMIT;
      SET @execData = "";
    ELSE
      SET @execData = concat(@execData, ",");
    END IF;
    SET i = i + 1;
  END WHILE;
END;;
DELIMITER ;
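
Once the procedure exists, it still has to be invoked; a single CALL kicks off the whole load (this assumes the `test` schema used in the INSERT string above):

CALL batch_insert_log();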

Start testing

My machine's specs are on the low side: Windows 10, a bottom-tier standard-voltage i5, and an SSD that reads and writes at roughly 500 MB/s.

Because of the weak hardware, only 3,148,000 rows were prepared for this test. They took up 5 GB of disk space (without any indexes), and the insert ran for 38 minutes. Readers with better machines can load more data for their own tests.

SELECT count(1) FROM `user_operation_log`

Result: 3148000

Running this count three times took:

  • 14060 ms
  • 13755 ms
  • 13447 ms

Normal paging query

MySQL uses the LIMIT clause to select a specified number of rows; Oracle would use ROWNUM instead.

The MySQL paging query syntax is as follows:

SELECT * FROM table LIMIT [offset,] rows | rows OFFSET offset
  • The first parameter specifies the offset of the first row returned
  • The second parameter specifies the maximum number of rows returned
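
For example, the two statements below are equivalent; both skip the first 10,000 rows and return the next 10 (the second is the OFFSET spelling MySQL also accepts):

SELECT * FROM `user_operation_log` LIMIT 10000, 10
SELECT * FROM `user_operation_log` LIMIT 10 OFFSET 10000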

Now let's start testing:

SELECT * FROM `user_operation_log` LIMIT 10000, 10

The three runs took:

  • 59 ms
  • 49 ms
  • 50 ms

That looks fast enough, but this is a local database, so it is naturally quick.

Test from another angle

Same offset, different data volume
SELECT * FROM `user_operation_log` LIMIT 10000, 10
SELECT * FROM `user_operation_log` LIMIT 10000, 100
SELECT * FROM `user_operation_log` LIMIT 10000, 1000
SELECT * FROM `user_operation_log` LIMIT 10000, 10000
SELECT * FROM `user_operation_log` LIMIT 10000, 100000
SELECT * FROM `user_operation_log` LIMIT 10000, 1000000

The query time is as follows:

Rows fetched    1st run     2nd run     3rd run
10              53 ms       52 ms       47 ms
100             50 ms       60 ms       55 ms
1000            61 ms       74 ms       60 ms
10000           164 ms      180 ms      217 ms
100000          1609 ms     1741 ms     1764 ms
1000000         16219 ms    16889 ms    17081 ms

Conclusion from the results above: the more rows you fetch, the longer the query takes.

Same amount of data, different offset
SELECT * FROM `user_operation_log` LIMIT 100, 100
SELECT * FROM `user_operation_log` LIMIT 1000, 100
SELECT * FROM `user_operation_log` LIMIT 10000, 100
SELECT * FROM `user_operation_log` LIMIT 100000, 100
SELECT * FROM `user_operation_log` LIMIT 1000000, 100
Offset      1st run     2nd run     3rd run
100         36 ms       40 ms       36 ms
1000        31 ms       38 ms       32 ms
10000       53 ms       48 ms       51 ms
100000      622 ms      576 ms      627 ms
1000000     4891 ms     5076 ms     4856 ms

Conclusion from the results above: the larger the offset, the longer the query takes.
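
The reason is that MySQL cannot jump straight to the offset: it has to generate and then throw away every skipped row. You can see this with EXPLAIN on such a query, where the plan (with no WHERE clause) is a full table scan; the exact output varies by version and data:

EXPLAIN SELECT * FROM `user_operation_log` LIMIT 1000000, 100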

For comparison, selecting fewer columns at the same small offset:

SELECT * FROM `user_operation_log` LIMIT 100, 100
SELECT id, attr1 FROM `user_operation_log` LIMIT 100, 100

How to optimize

After all the tinkering above, we have pinned down two problems: large offsets and large result sets. Now let's optimize for each.

Optimizing the large-offset problem

Use a subquery

We can first locate the id at the offset position, then query the data from that id onward:

SELECT * FROM `user_operation_log` LIMIT 1000000, 10

SELECT id FROM `user_operation_log` LIMIT 1000000, 1

SELECT * FROM `user_operation_log` WHERE id >= (
  SELECT id FROM `user_operation_log` LIMIT 1000000, 1
) LIMIT 10

The query results are as follows:

SQL                            Time
Statement 1                    4818 ms
Statement 2 (without index)    4329 ms
Statement 2 (with index)       199 ms
Statement 3 (without index)    4319 ms
Statement 3 (with index)       201 ms

Conclusions from the results above:

  • The first statement takes the longest; without an index, the third is only slightly better than the first
  • When the subquery can use an index, it is dramatically faster

Drawback: this only works when id is monotonically increasing.

If id is not strictly increasing, you can use the form below instead; its drawback is that the paging LIMIT has to live inside the subquery.

Note: some MySQL versions do not allow LIMIT directly inside an IN subquery, hence the extra nested SELECT.

SELECT * FROM `user_operation_log` WHERE id IN (
  SELECT t.id FROM (
    SELECT id FROM `user_operation_log` LIMIT 1000000, 10
  ) AS t
)
Use the id restriction method

This method is more demanding: id must be continuously increasing, and you have to calculate the id range yourself and then query with BETWEEN. The SQL looks like this:

SELECT * FROM `user_operation_log` WHERE id BETWEEN 1000000 AND 1000100 LIMIT 100

SELECT * FROM `user_operation_log` WHERE id >= 1000000 LIMIT 100

The query results are as follows:

SQL            Time
Statement 1    22 ms
Statement 2    21 ms

The results show that this approach is very fast.

Note: the LIMIT here only caps the number of rows returned; it is not used with an offset.
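
In practice this idea often turns into "keyset" paging: the application remembers the last id of the previous page and asks only for the rows after it. A minimal sketch, assuming the previous page ended at id 1000100:

-- next page of 100 rows, continuing after the last id already shown
SELECT * FROM `user_operation_log` WHERE id > 1000100 ORDER BY id LIMIT 100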

Optimizing the large-result-set problem

The amount of data returned also directly affects query speed:

SELECT * FROM `user_operation_log` LIMIT 1, 1000000

SELECT id FROM `user_operation_log` LIMIT 1, 1000000

SELECT id, user_id, ip, op_data, attr1, attr2, attr3, attr4, attr5, attr6, attr7, attr8, attr9, attr10, attr11, attr12 FROM `user_operation_log` LIMIT 1, 1000000

The query results are as follows:

SQL            Time
Statement 1    15676 ms
Statement 2    7298 ms
Statement 3    15960 ms

The results show that dropping unneeded columns noticeably improves query efficiency.

The first and third queries take almost the same time. At this point you might grumble: why bother typing out all those fields? Just write SELECT * and be done with it.

Note that my MySQL server and client are on the same machine, so the numbers are close. Readers who can should test with the client and MySQL on separate machines.

Isn't SELECT * great, though?

While we're at it, let me add why SELECT * should be banned. Isn't it simple and mindless and oh so convenient?

Two main points:

  1. With SELECT *, the database has to resolve more objects: fields, permissions, attributes and other metadata. For complex SQL with lots of hard parsing, that puts a real burden on the database.
  2. It increases network overhead. SELECT * can accidentally drag along large, useless text fields such as log or IconMD5 columns, and the amount of data transferred grows dramatically. The overhead is especially obvious when MySQL and the application are not on the same machine.
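
As a rule of thumb, list only the columns the caller actually uses. A minimal sketch (the column choice is just for illustration, not part of the original test):

-- drags every attr column across the wire for each page
SELECT * FROM `user_operation_log` LIMIT 1000000, 10
-- only what the page actually needs
SELECT id, user_id, ip, op_data FROM `user_operation_log` LIMIT 1000000, 10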

End

Finally, I hope everyone tries this hands-on; you will definitely get more out of it than just reading. Feel free to leave a comment!

The data script is already written for you above, so what are you waiting for!

Source: blog.csdn.net/weixin_45784983/article/details/108295166