SQL Optimization on Millions of Rows: A Real-World Case Tutorial from Zero (A Must-Read for Beginners)

Prerequisite preparation: this case uses 1 million rows of data for SQL performance testing. The database is MySQL.

A total of 14 common SQL optimization methods are introduced, and every one of them has been tested against real data.

Everything is explained line by line and is easy to follow!

1. Prerequisite preparation

Prepare a student table and a special-student table in advance for the tests that follow.

1.1 Create table structure

Create a student table:

CREATE TABLE student (
  id int(11) unsigned NOT NULL AUTO_INCREMENT,
  name varchar(50) DEFAULT NULL,
  age tinyint(4) DEFAULT NULL,
  id_card varchar(20) DEFAULT NULL,
  sex tinyint(1) DEFAULT '0', 
  address varchar(100) DEFAULT NULL,
  phone varchar(20) DEFAULT NULL, 
  create_time timestamp NULL DEFAULT CURRENT_TIMESTAMP,
  remark varchar(200) DEFAULT NULL,
  PRIMARY KEY (id)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

Create another special student table:

CREATE TABLE special_student (
  id int(11) unsigned NOT NULL AUTO_INCREMENT,
  stu_id int(11) DEFAULT NULL,
  PRIMARY KEY (id)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

1.2 Create stored procedures 

Insert 1 million rows into the student table, managing transactions manually: COMMIT after every 10,000 inserts, open a new transaction, and COMMIT the remaining rows at the end. This makes the insert much faster because we no longer commit every single row, which reduces the number of I/O operations.

DELIMITER $$
CREATE PROCEDURE insert_student_data()
BEGIN
  DECLARE i INT DEFAULT 0;
  START TRANSACTION;
  WHILE i < 1000000 DO
    INSERT INTO student(name, age, id_card, sex, address, phone, remark)
    VALUES(CONCAT('姓名_', i), FLOOR(RAND()*100),
        FLOOR(RAND()*10000000000), FLOOR(RAND()*2),
        CONCAT('地址_', i), CONCAT('12937742', i),
        CONCAT('备注_', i));
    SET i = i + 1;
    -- Commit every 10,000 rows, then open a new transaction
    IF MOD(i, 10000) = 0 THEN
      COMMIT;
      START TRANSACTION;
    END IF;
  END WHILE;
  -- Commit the final partial batch
  COMMIT;
END $$
DELIMITER ;

Execute the stored procedure to populate the student table:

CALL insert_student_data();
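
A quick sanity check (not in the original) to confirm the load completed:

SELECT COUNT(*) FROM student;  -- expect 1000000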

Insert 100 random student IDs into the special student table:

DELIMITER $$
CREATE PROCEDURE insert_special_student()
BEGIN
  DECLARE i INT DEFAULT 0;
  WHILE i < 100 DO
    -- Random value in [0, 999999]; duplicates are possible
    INSERT INTO special_student (stu_id) VALUES (FLOOR(RAND()*1000000));
    SET i = i + 1;
  END WHILE;
END $$
DELIMITER ;

Execute the stored procedure for the special student table: 

CALL insert_special_student();

2. SQL optimization cases in detail

2.1 Return necessary rows

If the result set could be large, use a LIMIT clause to cap the number of rows returned:

SELECT id, name FROM student LIMIT 10

2.2 LIMIT optimization

In daily development work, our paging processing is generally as follows:

SELECT * FROM student LIMIT 900000,10

The execution result is shown in the figure: 

It takes 0.56s. When the ID is auto-incremented and contiguous, we can replace the expensive offset with a direct seek on the primary key. The optimized SQL is as follows:

SELECT * FROM student WHERE id >= 900000 LIMIT 10

The execution results after optimization are shown in the figure:

It takes 0.02s and the speed is greatly improved!
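
Note that WHERE id >= 900000 assumes the IDs are contiguous and start at 1. When rows have been deleted, a common alternative (not benchmarked in the original) is a deferred join: page through the slim primary-key index first, then fetch the full rows:

SELECT s.*
FROM student s
INNER JOIN (
  SELECT id FROM student ORDER BY id LIMIT 900000, 10
) t ON s.id = t.id;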

2.3 Return only the necessary columns: avoid SELECT * and the back-to-table lookups it causes

Sometimes, for convenience, we use SELECT * to fetch every column in the table at once:

SELECT * FROM student

The execution result is shown in the figure:

As you can see, the execution time took about 2 seconds, which is a long time!

In actual development, a page may only display two or three fields. Fetching every column wastes bandwidth and performance: SELECT * cannot be satisfied by a covering index, so a large number of back-to-table (row lookup) operations occur, significantly reducing SQL performance.
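
The article relies on two indexes it never shows being created: the composite index on (name, address, phone) discussed in 2.13 and the id_card index referenced in 2.8. The creation statements were presumably something like the following (the index names are my assumption):

CREATE INDEX idx_name_address_phone ON student (name, address, phone);
CREATE INDEX idx_id_card ON student (id_card);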

We established a composite index above covering name, address, and phone, so if we select only those indexed columns the index alone can satisfy the query, which greatly improves efficiency. The optimized SQL is as follows:

SELECT name,address,phone FROM student

The execution results after optimization are shown in the figure:

It takes 0.780s, and the speed is greatly improved!

2.4 Conditions joined with OR (note)

When multiple conditions are combined with the OR operator, if any one condition's column lacks an index, the indexes on the other columns will not be used either.

To solve this problem, you can consider the following options:

  • Make sure all involved criteria columns have appropriate indexes to improve query performance.
  • For large tables, consider refactoring the query to split the OR operator into multiple independent queries and use UNION or UNION ALL to merge the results. This ensures that each subquery uses the appropriate index and avoids index failure caused by the OR operator.

2.5 Avoid OR conditions and use UNION or UNION ALL instead (disputed)

If we want to query students with a specified gender or specified ID number, execute the SQL as follows:

SELECT * FROM student WHERE sex = 0 OR id_card = '7121877527789'

The execution result is shown in the figure:

Nearly 500,000 rows were returned, taking about 1.4 seconds. Let's switch to a UNION ALL query:

SELECT * FROM student WHERE sex = 0 
UNION ALL 
SELECT * FROM student WHERE id_card = '7121877527789'

The execution result after switching is as shown in the figure:

The query did not get faster; it actually got slower, hence the dispute.

Analyze SQL: 

Use the EXPLAIN keyword to analyze this SQL using the OR keyword:

EXPLAIN SELECT * FROM student WHERE sex = 0 OR id_card = '7121877527789'

The execution result is shown in the figure: 

Even though id_card has an index, the sex field does not, so a full table scan is performed anyway.

Use the EXPLAIN keyword to execute this SQL:

EXPLAIN
SELECT * FROM student WHERE sex = 0 
UNION ALL 
SELECT * FROM student WHERE id_card = '7121877527789'

The execution result is shown in the figure:

The sex branch scans the whole table while the id_card branch uses its index, so overall a full table scan still happens. The advice widely repeated online to replace OR with UNION ALL therefore remains debatable; my measurements here do not support it!

2.6 Unless deduplication is needed, use the UNION keyword with caution and prefer UNION ALL

For example, suppose we query all students by gender. (The query itself is redundant, since a plain SELECT * would do, but it demonstrates the difference between the two keywords.) The SQL using UNION is as follows:

SELECT * FROM student WHERE sex = 0
UNION 
SELECT * FROM student WHERE sex = 1

The execution result is shown in the figure: 

Fetching 1 million rows took about 32 seconds. At that speed in a real system, the dishes would be stone cold by the time the data came back!

This is because UNION fetches all the rows and then removes duplicates, and that deduplication step is where the performance goes. UNION ALL, by contrast, fetches all the rows but keeps duplicates.
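
A tiny self-contained illustration of the difference (not part of the original benchmark):

SELECT 1 AS x UNION SELECT 1;     -- returns one row: duplicates removed
SELECT 1 AS x UNION ALL SELECT 1; -- returns two rows: duplicates kept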

We use the UNION ALL keyword instead, and the optimized SQL is as follows:

SELECT * FROM student WHERE sex = 0
UNION ALL
SELECT * FROM student WHERE sex = 1

The execution result after replacement is shown in the figure:

Querying the same 1 million rows now takes only about 3 seconds. The speed is improved a lot!

2.7 LIKE statement optimization

We use the LIKE keyword for fuzzy matching all the time in daily development, but in some cases it defeats the index and slows the query. For example:

Find every row whose ID number contains 50. Execute the SQL as follows:

SELECT * FROM student WHERE id_card like '%50%'

The execution result is shown in the figure:

It took about 0.8s.

Find every row whose ID number ends with 50. Execute the SQL as follows:

SELECT * FROM student WHERE id_card like '%50'

The execution result is shown in the figure:

It took about 0.4s.

Find every row whose ID number starts with 50. Execute the SQL as follows:

SELECT * FROM student WHERE id_card like '50%'

The execution result is shown in the figure:

This execution is very fast, about 0.08s.

Analyze SQL:

Use the EXPLAIN keyword to execute this SQL:

EXPLAIN SELECT * FROM student WHERE id_card like '%50%'

The execution result is shown in the figure:

Obviously a full table scan was performed!

Use the EXPLAIN keyword to execute this SQL:

EXPLAIN SELECT * FROM student WHERE id_card like '%50'

The execution result is shown in the figure:

It still performed a full table scan!

Use the EXPLAIN keyword to execute this SQL:

EXPLAIN SELECT * FROM student WHERE id_card like '50%'

The execution result is shown in the figure:

This time the index is used, and the query is much faster!

2.8 Try to avoid using !=, which will cause index failure

Try to avoid the != and <> operators. Let's analyze the SQL directly:

SQL analysis:

Use the EXPLAIN keyword to execute this SQL:

EXPLAIN SELECT * FROM student WHERE id_card != '5031520645'

The execution result is shown in the figure:

Although we created an index for the id_card field, we still performed a full table scan!

2.9 Try to avoid using NULL values. IS NOT NULL will cause index failure, but IS NULL will not.

To keep NULL values out of a column, give it a default value (a sketch follows at the end of this section). Let's analyze the SQL directly:

SQL analysis:

Use the EXPLAIN keyword to execute this SQL:

EXPLAIN SELECT * FROM student WHERE id_card IS NOT NULL

 The execution result is shown in the figure:

It still performed a full table scan.

Use the EXPLAIN keyword to execute this SQL:

EXPLAIN SELECT * FROM student WHERE id_card IS NULL

The execution result is shown in the figure:

This time the index is used!
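
As suggested above, the way to keep NULLs out is a NOT NULL column with a default. A hypothetical migration (the empty-string default is an assumption for illustration):

-- Backfill existing NULLs first (required under strict SQL mode)
UPDATE student SET id_card = '' WHERE id_card IS NULL;
ALTER TABLE student MODIFY id_card varchar(20) NOT NULL DEFAULT '';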

2.10 Use small tables to drive large tables and avoid large tables driving small tables.

Put simply, use the rows from the small table to look up rows in the large table. For example, to query the details of special students from the student table, let the small special_student table drive the large student table. The SQL is as follows:

SELECT * FROM student WHERE id IN (SELECT stu_id FROM special_student)

The execution result is shown in the figure:

It took only 0.02s, which is very impressive! Because the subquery inside IN returns only a small data set, the query is very fast.
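
A companion rule of thumb often quoted alongside this principle (not benchmarked in the original): IN tends to fit when the subquery's table is the small one, while EXISTS tends to fit when the outer table is the small one. For comparison, the same query in EXISTS form might look like:

SELECT * FROM student s
WHERE EXISTS (SELECT 1 FROM special_student sp WHERE sp.stu_id = s.id);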

2.11 Always quote string values; unquoted strings cause index failure

If a string column is compared against an unquoted value, MySQL has to implicitly cast the column for the comparison, and the index on that column becomes unusable.

To query a student by ID number, suppose we carelessly omit the single quotes around the value and execute the SQL as follows:

SELECT * FROM student WHERE id_card = 5040198345

The execution result is shown in the figure:

It takes about 0.4s.

Add single quotes to the ID number, and the optimized SQL is as follows:

SELECT * FROM student WHERE id_card = '5040198345'

The execution result is shown in the figure:

It takes about 0.02s, which is obviously much faster this time!

Analyze SQL:

Use the EXPLAIN keyword to execute this SQL:

EXPLAIN SELECT * FROM student WHERE id_card = 5040198345

The execution result is shown in the figure:

Even though id_card has an index, the implicit type conversion forces a full table scan!

Use the EXPLAIN keyword to execute this SQL:

EXPLAIN SELECT * FROM student WHERE id_card = '5040198345'

The execution result is shown in the figure: 

With the quotes added, the query uses the index and is much faster!

2.12 Avoid applying operations to indexed columns, which may cause index failure

To keep indexes usable, avoid applying functions or arithmetic to indexed columns in query conditions. If you really need a computation, consider the following solutions:

  • Reversing operations on index columns: if the operation is reversible, keep the index usable by applying the operation to the query parameter instead of the indexed column (see the sketch after this list).
  • Using functional indexes: some database systems support indexes built on the result of a function, which can serve queries that filter on that function.
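
A minimal sketch of the first option, assuming a hypothetical index on create_time (the article does not create one) and made-up date values:

-- Index unusable: the function wraps the indexed column
SELECT * FROM student WHERE DATE(create_time) = '2023-09-01';

-- Index usable: the computation moves to the parameter side
SELECT * FROM student
WHERE create_time >= '2023-09-01 00:00:00'
  AND create_time < '2023-09-02 00:00:00';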

2.13 Follow the leftmost matching principle (important)

Above we established a composite index on (name, address, phone), which effectively provides three indexes: (name), (name, address), and (name, address, phone). What happens if a query's WHERE clause lists the conditions in a different order? Let's analyze the SQL directly:

Analyze SQL:

Use the EXPLAIN keyword to execute this SQL:

EXPLAIN SELECT * FROM student WHERE name = '姓名_4' and phone = '7121877527' and address = '地址_4'

The execution result is as shown in the figure:

Why is the composite index still used even though the WHERE clause lists the columns in a different order? Because for pure equality conditions, the MySQL optimizer treats the WHERE clause as a set of predicates and reorders them freely:

1. name, address, and phone are all present, so the query matches the full (name, address, phone) index prefix no matter how the conditions are written.
2. The leftmost matching principle is really about which columns are present, not their textual order: a query that filters only on address and phone, without name, cannot use this index.

So in general the optimizer makes these trade-offs for you, and a reordered equality condition may still use the index. Even so, it is clearest and safest to write conditions in index order and to make sure the leading column appears.

The following is SQL that strictly adheres to the leftmost matching principle:

SELECT * FROM student WHERE name = '姓名_4' 
SELECT * FROM student WHERE name = '姓名_4' and address = '地址_4' 
SELECT * FROM student WHERE name = '姓名_4' and address = '地址_4' and phone = '7121877527' 
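
Conversely, a sketch of a query that cannot use the composite index, because the leading column name is absent:

SELECT * FROM student WHERE address = '地址_4' AND phone = '7121877527'
-- name is missing, so no prefix of (name, address, phone) matches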

2.14 Improve the efficiency of GROUP BY

We all use the GROUP BY keyword when writing SQL. Its main job is to group (and thereby deduplicate) rows, and it is usually paired with HAVING, which filters the data after grouping. The usual query looks like this:

SELECT age,COUNT(1) FROM student GROUP BY age HAVING age > 18

The execution result is shown in the figure:

The total time is about 0.53s, but it can still be optimized: narrow the data set with WHERE before grouping. The optimized SQL is as follows:

SELECT age, COUNT(1) FROM student WHERE age > 18 GROUP BY age

The execution result is shown in the figure:

It takes about 0.51s. The gain is small here, but filtering early is a good habit. Note that moving the condition from HAVING to WHERE is only valid because it references the grouping column age; conditions on aggregates such as COUNT must stay in HAVING.


Reprinted from: blog.csdn.net/weixin_55772633/article/details/132627766