Interviewer: with 10 million rows of data, how would you query it?

1 Conclusion first

For queries over 10 million rows, the main concern is the performance of paginated queries.

For queries that are slow because of a large offset:

  1. First create a unique index on the field used for paging.
  2. Based on the business requirements, locate the query range first (that is, the corresponding range of primary key ids: greater than some value, less than some value, or an IN list).
  3. Run the actual query using the range determined in step 2 as the condition (a minimal sketch follows below).

For queries that are slow because they return a large amount of data:

  • Select only the columns you actually need; dropping unnecessary columns alone improves query efficiency noticeably.
  • Where the business allows, query fewer rows per request.
  • Use a NoSQL cache in front of MySQL to reduce the pressure on the database.
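A minimal sketch of the "locate the range first, then query" idea, assuming the user_operation_log table from section 2 below (auto-increment primary key id) and assuming the first statement returns the id 1000001:

-- Locate the id at the target offset (only the index needs to be walked)
SELECT id FROM `user_operation_log` LIMIT 1000000, 1;
-- Then page from that id, using it as the lower bound of the range
SELECT * FROM `user_operation_log` WHERE id >= 1000001 LIMIT 10;

Section 3 below measures how much difference this makes.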

2 Prepare data

2.1 Create table

CREATE TABLE `user_operation_log`  (
   `id` int(11) NOT NULL AUTO_INCREMENT,
   `user_id` varchar(64) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci NULL DEFAULT NULL,
   `ip` varchar(20) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci NULL DEFAULT NULL,
   `op_data` varchar(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci NULL DEFAULT NULL,
   `attr1` varchar(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci NULL DEFAULT NULL,
   `attr2` varchar(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci NULL DEFAULT NULL,
   `attr3` varchar(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci NULL DEFAULT NULL,
   `attr4` varchar(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci NULL DEFAULT NULL,
   `attr5` varchar(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci NULL DEFAULT NULL,
   `attr6` varchar(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci NULL DEFAULT NULL,
   `attr7` varchar(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci NULL DEFAULT NULL,
   `attr8` varchar(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci NULL DEFAULT NULL,
   `attr9` varchar(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci NULL DEFAULT NULL,
   `attr10` varchar(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci NULL DEFAULT NULL,
   `attr11` varchar(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci NULL DEFAULT NULL,
   `attr12` varchar(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci NULL DEFAULT NULL,
   PRIMARY KEY (`id`) USING BTREE
 ) ENGINE = InnoDB AUTO_INCREMENT = 1 CHARACTER SET = utf8mb4 COLLATE = utf8mb4_general_ci ROW_FORMAT = Dynamic;

2.2 Create data script

Batch inserts are much faster than row-by-row inserts, and committing every 1,000 rows keeps each batch reasonably small; if a single batch grows too large, the insert itself slows down as well.

DELIMITER ;;
CREATE DEFINER=`root`@`%` PROCEDURE `batch_insert_log`()
BEGIN
  DECLARE i INT DEFAULT 1;
  DECLARE userId INT DEFAULT 10000000;
  -- Build the INSERT statement dynamically and append value tuples to it
  SET @execSql = 'INSERT INTO `big_data`.`user_operation_log`(`user_id`, `ip`, `op_data`, `attr1`, `attr2`, `attr3`, `attr4`, `attr5`, `attr6`, `attr7`, `attr8`, `attr9`, `attr10`, `attr11`, `attr12`) VALUES';
  SET @execData = '';
  WHILE i <= 10000000 DO
    -- rand_string(50) is appended as text, so it is evaluated when the statement is executed
    SET @attr = "rand_string(50)";
    SET @execData = concat(@execData, "(", userId + i, ", '110.20.169.111', '用户登录操作'", ",", @attr, ",", @attr, ",", @attr, ",", @attr, ",", @attr, ",", @attr, ",", @attr, ",", @attr, ",", @attr, ",", @attr, ",", @attr, ",", @attr, ")");
    IF i % 1000 = 0 THEN
      -- Every 1,000 rows: execute the accumulated batch insert and commit
      SET @stmtSql = concat(@execSql, @execData, ";");
      PREPARE stmt FROM @stmtSql;
      EXECUTE stmt;
      DEALLOCATE PREPARE stmt;
      COMMIT;
      SET @execData = "";
    ELSE
      SET @execData = concat(@execData, ",");
    END IF;
    SET i = i + 1;
  END WHILE;
END ;;
DELIMITER ;
DELIMITER $$
CREATE FUNCTION rand_string(n INT)
RETURNS varchar(255) -- returns a random string of length n
BEGIN
  -- chars_str is the pool of characters the random string is drawn from
  DECLARE chars_str varchar(100) DEFAULT
    'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ';
  DECLARE return_str varchar(255) DEFAULT '';
  DECLARE i int DEFAULT 0;
  WHILE i < n DO
    SET return_str = concat(return_str, substring(chars_str, floor(1 + rand() * 52), 1));
    SET i = i + 1;
  END WHILE;
  RETURN return_str;
END $$
DELIMITER ;

2.3 Execute the stored procedure

Since we are generating 10 million rows of simulated data and my machine is not very powerful, this took quite a while, roughly an hour.
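Assuming the procedure and the rand_string function above were created successfully in the big_data schema, kick off the data generation and then verify the row count:

 CALL batch_insert_log();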

 SELECT count(1) FROM `user_operation_log`;


2.4 Common pagination query

MySQL uses the LIMIT clause to return a specified number of rows, while Oracle uses ROWNUM for the same purpose.

The MySQL pagination query syntax is as follows:

 SELECT * FROM table LIMIT [offset,] rows | rows OFFSET offset
  • The first parameter specifies the offset of the first returned record row
  • The second parameter specifies the maximum number of rows to return
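To make the syntax concrete, the two forms below are equivalent and both return rows 10,001 through 10,010:

 SELECT * FROM `user_operation_log` LIMIT 10000, 10;
 SELECT * FROM `user_operation_log` LIMIT 10 OFFSET 10000;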

Let's start testing the query results:

SELECT * FROM `user_operation_log` LIMIT 10000, 10;

The query was run three times; the timing screenshots are omitted here.
The speed does not look bad, but this is a local database, so it is naturally fast.

Test from another angle

Same offset, different amount of data

 SELECT * FROM `user_operation_log` LIMIT 10000, 10;
 SELECT * FROM `user_operation_log` LIMIT 10000, 100;
 SELECT * FROM `user_operation_log` LIMIT 10000, 1000;
 SELECT * FROM `user_operation_log` LIMIT 10000, 10000;
 SELECT * FROM `user_operation_log` LIMIT 10000, 100000;
 SELECT * FROM `user_operation_log` LIMIT 10000, 1000000;

From the results (timing screenshots omitted), the conclusion is that the more rows are returned, the longer the query takes (hardly surprising).

Same amount of data, different offsets

 SELECT * FROM `user_operation_log` LIMIT 100, 100;
 SELECT * FROM `user_operation_log` LIMIT 1000, 100;
 SELECT * FROM `user_operation_log` LIMIT 10000, 100;
 SELECT * FROM `user_operation_log` LIMIT 100000, 100;
 SELECT * FROM `user_operation_log` LIMIT 1000000, 100;

From the results (timing screenshots omitted), the conclusion is that the larger the offset, the longer the query takes.

3 How to optimize

Having gone through the tests above, we have our conclusions, and also two problems to solve: large offsets and large result sets. Let's optimize each of them in turn.

3.1 Optimizing queries that return a large amount of data

SELECT * FROM `user_operation_log` LIMIT 1, 1000000
SELECT id FROM `user_operation_log` LIMIT 1, 1000000
SELECT id, user_id, ip, op_data, attr1, attr2, attr3, attr4, attr5, attr6, attr7, attr8, attr9, attr10, attr11, attr12 FROM `user_operation_log` LIMIT 1, 1000000

The query result screenshots are omitted here.

The simulation above queries 1,000,000 rows at a time from a table of 10 million rows. The performance does not look great, but in regular business scenarios it is rare to pull that many rows from MySQL in a single query, so this is close to a worst case. Combining MySQL with a NoSQL cache can also reduce the pressure on the database.

Therefore, for the problem of querying a large amount of data:

  • Select only the columns you need; dropping unnecessary columns already improves query efficiency noticeably.
  • Query as little data as possible per request.
  • Use a NoSQL cache to reduce the pressure on the MySQL database.

The first and the third query take roughly the same time. At this point you may well object: why bother writing out all those fields when SELECT * does the job?

Note that my MySQL server and client are on the same machine, so the transfer cost is small and the timings come out similar; if you can, test with the client and the MySQL server on separate machines.

Isn't SELECT * good enough?

As an aside, here is why SELECT * should generally be banned, even though it is simple and convenient.

Two main points:

  1. With SELECT *, the database has to resolve more objects, fields, permissions, and attributes during parsing; for complex SQL with many hard parses this puts a heavier burden on the database.
  2. It increases network overhead: useless, large text fields such as log or IconMD5-style columns can be pulled in by mistake, and the size of the transferred data grows sharply. The overhead is especially noticeable when MySQL and the application are not on the same machine.

3.2 Optimizing queries with a large offset

3.2.1 Using subqueries

We can first locate the id at the offset position, and then query the data starting from that id:

SELECT id FROM `user_operation_log` LIMIT 1000000, 1;
SELECT * FROM `user_operation_log` WHERE id >= (SELECT id FROM `user_operation_log` LIMIT 1000000, 1) LIMIT 10;

The query results (screenshots omitted) are not ideal at all. Strange: id is the primary key, so a lookup through the primary key index should not be this slow, right?

First, analyze the SQL statements with EXPLAIN:

EXPLAIN SELECT id FROM `user_operation_log` LIMIT 1000000, 1;
EXPLAIN SELECT * FROM `user_operation_log` WHERE id >= (SELECT id FROM `user_operation_log` LIMIT 1000000, 1) LIMIT 10;

Strangely, the only index involved is the primary key index, and the plan is still slow (EXPLAIN screenshots omitted).
Puzzled, and not willing to give up, I tried adding a unique index on the primary key column:

ALTER TABLE `big_data`.`user_operation_log` 
ADD UNIQUE INDEX `idx_id`(`id`) USING BTREE;

Since the table holds 10 million rows, adding the index takes a while; building an index over 10 million rows is not exactly fast on an ordinary machine.

Then execute the above queries again. The result (screenshot omitted): the difference in query time is more than tenfold!

Analyze with EXPLAIN again (screenshots omitted): the two runs hit different indexes, and the query that hits the new unique index is more than ten times faster.

In conclusion:

For queries on large tables, do not put too much faith in how much the primary key index alone will help; add the appropriate index for the fields you actually query on (a hypothetical example follows below).
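As a purely hypothetical illustration (the index name is an assumption, not something from the test above): if a listing pages or filters by user_id, give user_id its own index so the query can walk that index instead of scanning the table.

-- Hypothetical secondary index for queries that filter or page by user_id
ALTER TABLE `big_data`.`user_operation_log`
  ADD INDEX `idx_user_id`(`user_id`) USING BTREE;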

However, the method above only works when id is increasing. If id is not increasing, for example ids generated by the snowflake algorithm, the following approach is needed:

Note:

  1. Some MySQL versions do not support LIMIT inside an IN clause, which is why multiple nested SELECTs are used here.
  2. The drawback is that the pagination (the LIMIT) can only live inside the subquery.

SELECT * FROM `user_operation_log` WHERE id IN (SELECT t.id FROM (SELECT id FROM `user_operation_log` LIMIT 1000000, 10) AS t);

The query timing screenshot is omitted here.
EXPLAIN

EXPLAIN SELECT * FROM `user_operation_log` WHERE id IN (SELECT t.id FROM (SELECT id FROM `user_operation_log` LIMIT 1000000, 10) AS t);

(EXPLAIN screenshot omitted)
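A common equivalent formulation, not from the original article but a widely used variant (sometimes called a deferred join), pages over the ids first and then joins back for the full rows:

-- Deferred join: page over the ids, then join back to fetch the full rows
SELECT a.*
FROM `user_operation_log` a
INNER JOIN (SELECT id FROM `user_operation_log` LIMIT 1000000, 10) b ON a.id = b.id;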

3.2.2 Using an id range restriction

This method is more demanding: the id must be strictly consecutive (continuously increasing with no gaps, not merely increasing), and you have to be able to calculate the id range in advance, then query with BETWEEN. The SQL is as follows:

SELECT * FROM `user_operation_log` WHERE id between 1000000 AND 1000100 LIMIT 100;
SELECT * FROM `user_operation_log` WHERE id >= 1000000 LIMIT 100;

The timing screenshot is omitted here, but the query efficiency of both statements is quite good.

Note: the LIMIT here only limits the number of rows returned; there is no offset.

Again, analyze with EXPLAIN:

EXPLAIN SELECT * FROM `user_operation_log` WHERE id between 1000000 AND 1000100 LIMIT 100;
EXPLAIN SELECT * FROM `user_operation_log` WHERE id >= 1000000 LIMIT 100;

(EXPLAIN screenshots omitted)
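As a sketch of how this id-based style usually continues to the next page (assuming the application remembers the last id returned by the previous page, shown here as 1000100 purely for illustration):

-- Next page: continue from the last id seen on the previous page;
-- ORDER BY id keeps the page boundaries deterministic
SELECT * FROM `user_operation_log` WHERE id > 1000100 ORDER BY id LIMIT 100;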
Therefore, for paginated queries where a large offset makes the query slow:

  1. First create a unique index on the field you page by.
  2. Based on the business requirements, locate the query range first (the corresponding range of primary key ids, e.g. greater than, less than, or an IN list).
  3. Run the query using the range determined in step 2 as the condition.

Source: blog.csdn.net/CXikun/article/details/130242734