[MySQL]-[Index Optimization and Query Optimization]



Data preparation

We insert 500,000 rows into the student table and 10,000 rows into the class table.
Step 1: Create the tables

CREATE TABLE `class` (
`id` INT(11) NOT NULL AUTO_INCREMENT,
`className` VARCHAR(30) DEFAULT NULL,
`address` VARCHAR(40) DEFAULT NULL,
`monitor` INT NULL ,
PRIMARY KEY (`id`)
) ENGINE=INNODB AUTO_INCREMENT=1 DEFAULT CHARSET=utf8;
CREATE TABLE `student` (
`id` INT(11) NOT NULL AUTO_INCREMENT,
`stuno` INT NOT NULL ,
`name` VARCHAR(20) DEFAULT NULL,
`age` INT(3) DEFAULT NULL,
`classId` INT(11) DEFAULT NULL,
PRIMARY KEY (`id`)
#CONSTRAINT `fk_class_id` FOREIGN KEY (`classId`) REFERENCES `class` (`id`)
) ENGINE=INNODB AUTO_INCREMENT=1 DEFAULT CHARSET=utf8;

Step 2: Set parameters.
Allow the creation of functions: set global log_bin_trust_function_creators=1; (without global, the setting applies only to the current session).
Step 3: Create functions to ensure that each generated row is different.

# Randomly generate a string
DELIMITER //
CREATE FUNCTION rand_string(n INT) RETURNS VARCHAR(255)
BEGIN
DECLARE chars_str VARCHAR(100) DEFAULT
'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ';
DECLARE return_str VARCHAR(255) DEFAULT '';
DECLARE i INT DEFAULT 0;
WHILE i < n DO
SET return_str = CONCAT(return_str, SUBSTRING(chars_str, FLOOR(1 + RAND() * 52), 1));
SET i = i + 1;
END WHILE;
RETURN return_str;
END //
DELIMITER ;
# To delete it:
# drop function rand_string;

Randomly generate class numbers

# Randomly generate a number between from_num and to_num
DELIMITER //
CREATE FUNCTION rand_num (from_num INT, to_num INT) RETURNS INT(11)
BEGIN
DECLARE i INT DEFAULT 0;
SET i = FLOOR(from_num + RAND() * (to_num - from_num + 1));
RETURN i;
END //
DELIMITER ;
# To delete it:
# drop function rand_num;

Step 4: Create the stored procedures

# Stored procedure that inserts data into the student table
DELIMITER //
CREATE PROCEDURE insert_stu( START INT , max_num INT )
BEGIN
DECLARE i INT DEFAULT 0;
SET autocommit = 0; # commit the transaction manually
REPEAT # loop
SET i = i + 1;
INSERT INTO student (stuno, name, age, classId) VALUES
((START + i), rand_string(6), rand_num(1,50), rand_num(1,1000));
UNTIL i = max_num
END REPEAT;
COMMIT; # commit the transaction
END //
DELIMITER ;
# To delete it:
# drop PROCEDURE insert_stu;

Create a stored procedure to insert data into the class table

# Stored procedure that inserts random data into the class table
DELIMITER //
CREATE PROCEDURE `insert_class`( max_num INT )
BEGIN
DECLARE i INT DEFAULT 0;
SET autocommit = 0;
REPEAT
SET i = i + 1;
INSERT INTO class ( className, address, monitor ) VALUES
(rand_string(8), rand_string(10), rand_num(1,100000));
UNTIL i = max_num
END REPEAT;
COMMIT;
END //
DELIMITER ;
# To delete it:
# drop PROCEDURE insert_class;

Step 5: Call the stored procedures
class:

# Execute the stored procedure to add 10,000 rows to the class table
CALL insert_class(10000);

student:

# Execute the stored procedure to add 500,000 rows to the student table, with stuno starting at 100000
CALL insert_stu(100000,500000);

Step 6: Create a stored procedure that deletes the indexes on a table

DELIMITER //
CREATE PROCEDURE `proc_drop_index`(dbname VARCHAR(200), tablename VARCHAR(200))
BEGIN
DECLARE done INT DEFAULT 0;
DECLARE ct INT DEFAULT 0;
DECLARE _index VARCHAR(200) DEFAULT '';
DECLARE _cur CURSOR FOR SELECT index_name FROM
information_schema.STATISTICS WHERE table_schema=dbname AND table_name=tablename AND
seq_in_index=1 AND index_name <> 'PRIMARY';
# Each cursor must declare its own CONTINUE HANDLER FOR NOT FOUND to detect the end of the result set
DECLARE CONTINUE HANDLER FOR NOT FOUND SET done=2;
# When no more rows are returned, execution continues and done is set to 2
OPEN _cur;
FETCH _cur INTO _index;
WHILE _index <> '' DO
SET @str = CONCAT("drop index ", _index, " on ", tablename);
PREPARE sql_str FROM @str;
EXECUTE sql_str;
DEALLOCATE PREPARE sql_str;
SET _index = '';
FETCH _cur INTO _index;
END WHILE;
CLOSE _cur;
END //
DELIMITER ;

Execute the stored procedure:

CALL proc_drop_index("dbname","tablename");

Index failure cases


Full-value matching is my favorite

SQL statements that often appear in the system are as follows:

EXPLAIN SELECT SQL_NO_CACHE * FROM student WHERE age=30;
EXPLAIN SELECT SQL_NO_CACHE * FROM student WHERE age=30 AND classId=4;
EXPLAIN SELECT SQL_NO_CACHE * FROM student WHERE age=30 AND classId=4 AND NAME = 'abcd';

Execute before indexing: (pay attention to execution time)

SELECT SQL_NO_CACHE * FROM student WHERE age=30 AND classId=4 AND NAME = 'abcd';#0.149s

Create index:

CREATE INDEX idx_age ON student(age);
CREATE INDEX idx_age_classid ON student(age,classId);
CREATE INDEX idx_age_classid_name ON student(age,classId,NAME);

Running the SELECT again shows that the idx_age_classid_name index is used: the first two indexes would still require a table return for the remaining columns, so they are less efficient.

Conclusion: when possible, create a joint index covering all the fields in the WHERE condition, keeping the order of the index columns consistent.

Best left prefix (leftmost prefix): if you create a joint index on fields a, b, and c, the index can only be used in the following situations (see the sketch after the list):

  1. Filter on a, then b, then c
  2. Filter on a, then b
  3. Filter on a alone

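A minimal sketch of the leftmost-prefix rule, using the idx_age_classid_name index on (age, classId, name) created above:

EXPLAIN SELECT SQL_NO_CACHE * FROM student WHERE age=30; # uses the index (prefix: age)
EXPLAIN SELECT SQL_NO_CACHE * FROM student WHERE age=30 AND classId=4; # uses the index (prefix: age, classId)
EXPLAIN SELECT SQL_NO_CACHE * FROM student WHERE classId=4 AND NAME='abcd'; # cannot use the index: the leading column age is missing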

Calculations, functions, type conversions (automatic or manual) cause index failure

# Create an index
CREATE INDEX idx_name ON student(NAME);
EXPLAIN SELECT SQL_NO_CACHE * FROM student WHERE student.name LIKE 'abc%'; # uses idx_name
EXPLAIN SELECT SQL_NO_CACHE * FROM student WHERE LEFT(student.name,3) = 'abc'; # the LEFT() function defeats the index
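A computation on an indexed column defeats the index in the same way. A hedged sketch, assuming an index exists on stuno:

EXPLAIN SELECT SQL_NO_CACHE * FROM student WHERE stuno+1 = 900001; # computation on the column: full table scan
EXPLAIN SELECT SQL_NO_CACHE * FROM student WHERE stuno = 900000; # rewrite so the column stands alone: index can be used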


Type conversion causes index failure


Conclusion: when designing entity class attributes, make sure their types match the corresponding database column types; otherwise implicit type conversion will occur and the index will fail.
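A minimal sketch of implicit conversion, using the idx_name index created above on the VARCHAR column name:

EXPLAIN SELECT SQL_NO_CACHE * FROM student WHERE name = 123; # the string column is cast to a number: index fails
EXPLAIN SELECT SQL_NO_CACHE * FROM student WHERE name = '123'; # types match: index used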

Index columns to the right of a range condition are not used

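A hedged sketch, assuming the joint index idx_age_classid_name on (age, classId, name):

EXPLAIN SELECT SQL_NO_CACHE * FROM student WHERE age=30 AND classId>20 AND NAME='abc'; # only the (age, classId) part is used; name cannot be used after the range
# If range queries on classId are common, an index ordered (age, name, classId) lets all three conditions use the index
CREATE INDEX idx_age_name_classid ON student(age,NAME,classId); # hypothetical index name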

Not equal (!= or <>) makes the index fail

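A hedged sketch, assuming idx_name on name and a query that needs a table return:

EXPLAIN SELECT SQL_NO_CACHE * FROM student WHERE name <> 'abc'; # not-equal comparison: the index is skipped
# (As the covering-index section below shows, selecting only indexed columns can make <> use the index again.)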

IS NULL can use the index; IS NOT NULL cannot

IS NULL behaves like comparing against a specific value, while IS NOT NULL behaves like "not equal to a value". With IS NULL the database can look up the NULL entries directly; with IS NOT NULL it has to examine the rows one by one.
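A minimal sketch, using the idx_age index on age (which allows NULL):

EXPLAIN SELECT SQL_NO_CACHE * FROM student WHERE age IS NULL; # can use the index
EXPLAIN SELECT SQL_NO_CACHE * FROM student WHERE age IS NOT NULL; # falls back to a full table scan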

LIKE starting with the wildcard % makes the index fail

If the leading characters are unknown, there is no way to navigate down the B+ tree, so the only option is a full table scan.
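A minimal sketch, using idx_name on name:

EXPLAIN SELECT SQL_NO_CACHE * FROM student WHERE name LIKE 'ab%'; # known prefix: index used
EXPLAIN SELECT SQL_NO_CACHE * FROM student WHERE name LIKE '%ab%'; # leading %: full table scan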

If either side of an OR condition references a non-indexed column, no index is used.

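A hedged sketch, assuming idx_age exists but classId is not yet indexed:

EXPLAIN SELECT SQL_NO_CACHE * FROM student WHERE age=10 OR classId=100; # classId has no index, so the whole query falls back to a table scan
CREATE INDEX idx_classid ON student(classId); # hypothetical index name
EXPLAIN SELECT SQL_NO_CACHE * FROM student WHERE age=10 OR classId=100; # with both columns indexed, MySQL can use an index merge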

The character sets of databases and tables use utf8mb4 uniformly.

Using utf8mb4 throughout (supported since version 5.5.3) gives better compatibility, and a unified character set avoids garbled text caused by character set conversion. Comparing columns with different character sets requires a conversion function first, which makes the index fail.
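A hedged sketch of converting an existing table to utf8mb4:

ALTER TABLE student CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci;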

Exercises and general advice

General advice:

  1. For single-column indexes, try to choose an index with good filtering power for the current query. If the WHERE condition uses many fields, it is usually best to create a joint index on them.
  2. When choosing a joint index, put the field with the best filtering power for the current query earlier in the index column order. For example, if one condition is gender = female, placing it first filters out only about 50% of the rows; if another condition filters out 90% of the rows, put that condition before gender = female.
  3. When choosing a joint index, prefer an index that covers as many of the fields in the current query's WHERE clause as possible.
  4. When choosing a joint index, if a field may appear in a range query, try to put it at the end of the index column order.

Join query optimization

Data preparation

Use left outer join

Right outer join works the same as left outer join, so it is not covered separately.
Since no index has been added yet, the join is a full table scan. Suppose type has 20 rows and book has 30 rows. For each row fetched from type, the book table is scanned for rows matching the join condition: 30 lookups per type row, 20 rows fetched from type, about 600 comparisons in total, like a nested loop. Because book is traversed over and over, the optimizer uses the join buffer to cache data and speed up retrieval.

Because it is a left outer join, every row of the left table must be kept; the filtering happens on the right table, so the join field of the right table should be indexed.
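A minimal sketch, assuming type and book tables joined on a card column:

CREATE INDEX idx_book_card ON book(card); # hypothetical index name; index the driven (right) table's join column
EXPLAIN SELECT SQL_NO_CACHE * FROM type LEFT JOIN book ON type.card = book.card; # book is now probed through the index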

Use inner join

Conclusions for SELECT * FROM type INNER JOIN book ON type.card=book.card; (see the sketch after the list):

  1. If neither type.card nor book.card has an index, the small table drives the large table.
  2. If type.card has no index and book.card has one, book becomes the driven table.
  3. If type.card has an index and book.card does not, type becomes the driven table.
  4. If both type.card and book.card have indexes, the small table drives the large table.
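A minimal sketch of letting the optimizer choose:

EXPLAIN SELECT * FROM type INNER JOIN book ON type.card = book.card;
# With an inner join the optimizer picks the driving table itself: the table whose
# join column is indexed is preferred as the driven table, because each probe from
# the driving table then becomes an index lookup.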

How the join statement works

From the number of records read: the cost is roughly a + a*b = a(b+1), where a is the driving table's row count and b the driven table's, so a has the bigger impact and the smaller a is, the better. This is why the small table drives the large table in a join.
Only the columns needed by the WHERE condition and the SELECT list are placed in the join buffer.
straight_join: tells the optimizer not to reorder the driving and driven tables. The table on the left of STRAIGHT_JOIN is the driving table and the table on the right is the driven table. The first form is recommended because only one column of t1 needs to go into the join buffer, while t2 would need all of its columns buffered. Even if t1 actually has many more columns than t2, for this SQL statement t1 is still the better driving table.
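A hedged sketch of forcing the join order, using the hypothetical tables t1 and t2 from the screenshots:

EXPLAIN SELECT t1.b, t2.* FROM t1 STRAIGHT_JOIN t2 ON t1.b = t2.b WHERE t2.id <= 100; # t1 is forced to drive; only t1.b goes into the join buffer
EXPLAIN SELECT t1.b, t2.* FROM t2 STRAIGHT_JOIN t1 ON t1.b = t2.b WHERE t2.id <= 100; # t2 drives; all of t2's columns must be buffered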

Subquery optimization

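The usual guidance here, as a hedged sketch using the tables created earlier: MySQL may materialize a subquery result in a temporary table without indexes, so rewriting a subquery as a JOIN is generally faster.

# subquery version: find the students who are class monitors
EXPLAIN SELECT SQL_NO_CACHE * FROM student WHERE stuno IN (SELECT monitor FROM class WHERE monitor IS NOT NULL);
# equivalent JOIN version, usually more efficient (add DISTINCT if monitor values can repeat)
EXPLAIN SELECT SQL_NO_CACHE s.* FROM student s JOIN class c ON s.stuno = c.monitor;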

Sorting optimization

Test

In case one, no index has been created, so no index can be used.
In case two, without LIMIT the index is not used, because the query selects all columns: if MySQL used the index, it would first sort by age and classId in the index and then return to the table for every row to fetch the remaining columns, which is less efficient than simply scanning the whole table once.

EXPLAIN SELECT SQL_NO_CACHE age, classid FROM student ORDER BY age, classid;
# The index is used here as well: the selected columns are exactly the index columns, so no table return is needed

With LIMIT added, the index is used: MySQL sorts via the (age, classId) index, takes the first ten entries, and returns to the table only for those ten rows, which is faster than a full table scan. By the leftmost-prefix rule, of the ORDER BY queries shown, the first two cannot use the index and the last three can.
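A minimal sketch of the LIMIT case, using the idx_age_classid index on (age, classId):

EXPLAIN SELECT SQL_NO_CACHE * FROM student ORDER BY age, classid LIMIT 10; # the index supplies the order; only 10 rows need a table return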
Except for the last query, which uses the index with a backward scan, the rest do not use the index.
For the first and second queries, WHERE uses the index but ORDER BY does not: WHERE runs first, and once it has filtered out most of the rows, so little data remains that using the index for ORDER BY is not worthwhile.
The third query uses no index because classId alone does not satisfy the leftmost-prefix rule.
The tenth query uses the joint index on age and classId: it first sorts via the joint index, then filters by the WHERE condition, and finally takes the top 10 rows.

Case practice

Plan 1 is built this way because stuno is a range-condition field, so it is excluded from the index. Execution result of plan 1: only the WHERE part uses the index.
Execution result of plan 2: filesort appears, and the index is used only for age and stuno. This is because after the WHERE condition filters, few rows remain, and using the index for ORDER BY is unnecessary.
All three plans are in fact usable.

filesort algorithms: two-pass sorting and single-pass sorting

Two-pass sorting: if a SQL statement sorts by age (ORDER BY age), the first disk scan reads the age column (with the row position) into memory and sorts it there; after sorting, a second scan fetches the complete rows from the table in the sorted order and loads them into memory.
Single-pass sorting: read all the required columns into memory in one pass and sort there, so no second table access is needed.
Index-based sorting is recommended, but if filesort is unavoidable, it can be optimized in the following ways:
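A hedged sketch of the usual filesort tuning knobs (both are real MySQL system variables; suitable values depend on the workload):

SHOW VARIABLES LIKE 'sort_buffer_size'; # raise it so more rows fit into the in-memory sort
SHOW VARIABLES LIKE 'max_length_for_sort_data'; # rows wider than this fall back to the two-pass sort
# Also avoid SELECT * when sorting: fewer, narrower columns make the single-pass sort more likely.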

GROUP BY optimization


Optimize paging queries

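A common deep-pagination optimization, as a hedged sketch on the student table: page through the primary key with a covering index first, then fetch only the required rows.

# slow: MySQL reads and throws away 400,000 rows
EXPLAIN SELECT * FROM student LIMIT 400000, 10;
# faster: the subquery pages through the primary key index only, then 10 table lookups
EXPLAIN SELECT t.* FROM student t JOIN (SELECT id FROM student ORDER BY id LIMIT 400000, 10) a ON t.id = a.id;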

Covering index

1. What is a covering index: recall the earlier discussion of non-clustered (secondary) indexes (a secondary index was built on column c2 at that time). A lookup first finds the primary key in the secondary index, then uses that primary key to fetch the rest of the row from the clustered index, touching two B+ trees. That second step is called a table return (回表): going back to the clustered index. If every field the query needs is already in the secondary index, there is no need to return to the table. That is a covering index: an index that by itself contains all the data the query result requires.

Simply put, the index columns plus the primary key contain all the columns listed between SELECT and FROM.
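A minimal sketch, assuming a hypothetical joint index idx_age_name on (age, name):

CREATE INDEX idx_age_name ON student(age, NAME);
EXPLAIN SELECT id, age, NAME FROM student WHERE age <> 20; # even <> can use the index here: every selected column lives in the index, so no table return is needed (Extra: Using index)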

2. Case:

  1. The query on line 530 does not use the index, as discussed before, but the query on line 533 does: with a covering index there is no table return, so execution is faster. The rules discussed earlier are therefore not absolute.
  2. The query on line 537 does not use the index, but the query on line 539 does, again because the covering index removes the table return and executes faster.
  3. Pros and cons of covering indexes. Benefits:
    (1) Avoids the secondary lookup (table return) on an InnoDB index.
    (2) Turns random I/O into sequential I/O: suppose a secondary index exists on column c2 and we query the range 2 < c2 < 20. The matching entries sit contiguously in the leaf nodes of the secondary index, so reading them is sequential I/O. If a table return were needed, the primary keys of those rows, which are probably not contiguous, would have to be looked up one by one in the clustered index, which is random I/O. Cutting out the table return therefore also cuts out the random I/O.

Index pushdown (ICP)

1. Case:

  1. Case 1: key1 is a non-primary-key column.
    The underlying behavior: suppose the s1 table has 10,000 rows and a secondary index exists on key1. MySQL first probes the secondary index with key1 > 'z'; say 1,000 entries match. It then evaluates key1 LIKE '%a' directly against those 1,000 index entries; say 100 survive. Only those 100 rows are then returned to the table.
  2. Case 2: create a table and insert data, then execute the query (a sketch follows this list).
    Index pushdown usually involves joint indexes. In this example the underlying behavior is: suppose the people table has 10,000 rows. MySQL first probes the secondary index with zipcode='000001'; say 1,000 entries match. It then evaluates lastname LIKE '%张%' directly against those 1,000 index entries; say 100 survive. Because address has no index, those 100 rows are returned to the table to check the remaining condition. This is index condition pushdown.
    Clearly, index pushdown reduces the number of table returns and with it the number of random I/Os.
  3. Enabling and disabling index pushdown: the optimizer switch name ends with the word "condition".
  4. Performance comparison with pushdown on and off: create a stored procedure, add 1,000,000 rows to the people table, and time the query with ICP on and off.
  5. Conditions for using ICP: ICP is only meaningful when a table return would otherwise be needed.
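A hedged sketch of the ICP case, with assumed table and index definitions modeled on the description above:

CREATE TABLE people (
id INT NOT NULL AUTO_INCREMENT,
zipcode VARCHAR(20),
firstname VARCHAR(20),
lastname VARCHAR(20),
address VARCHAR(50),
PRIMARY KEY (id),
KEY zip_last_first (zipcode, lastname, firstname)
) ENGINE=INNODB;

EXPLAIN SELECT * FROM people WHERE zipcode='000001' AND lastname LIKE '%张%' AND address LIKE '%北京%';
# With ICP on, Extra shows "Using index condition": the lastname filter runs inside the index
SET optimizer_switch = 'index_condition_pushdown=off'; # disable ICP to compare timings
SET optimizer_switch = 'index_condition_pushdown=on'; # re-enable (the default)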

Other query optimization strategies

The difference between EXISTS and IN

With IN, each value produced inside the brackets is fed to the outer query; with EXISTS, each outer row is fed into the bracketed subquery for evaluation. Therefore, if the outer table is small and the inner table is large, use EXISTS; if the outer table is large and the inner table is small, use IN.
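A minimal sketch with hypothetical tables A (outer) and B (inner), each with a cc column:

SELECT * FROM A WHERE cc IN (SELECT cc FROM B); # prefer when A is large and B is small
SELECT * FROM A WHERE EXISTS (SELECT 1 FROM B WHERE B.cc = A.cc); # prefer when A is small and B is large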

COUNT(*) vs COUNT(specific field) efficiency

This question only concerns counting how many rows a table contains in MySQL.

About SELECT *


Impact of LIMIT 1 on optimization


Use COMMIT more


How the primary keys of Taobao's database are designed

Problems with auto-increment IDs

Using an auto-increment ID as the primary key is simple and easy to understand, and almost all databases support auto-increment types, though the implementations differ. But beyond being simple, auto-increment IDs have the following drawbacks:

  1. Low reliability: auto-increment IDs suffer from the rollback (backtracking) problem, which was not fixed until MySQL 8.0.
  2. Low security: externally exposed interfaces make it easy to guess the underlying values. For example, with a URL like /User/1/ it is trivial to guess user IDs and the total number of users, and very easy to crawl data through the interface.
  3. Poor performance: auto-increment IDs perform poorly because they must be generated on the database server side.
  4. Extra round trips: the application has to run something like last_insert_id() to learn the value just inserted, costing one more network interaction. In a high-concurrency system, one extra SQL statement is a real overhead.
  5. Only locally unique: most importantly, an auto-increment ID is unique only within the current database instance, not globally unique across arbitrary servers. For today's distributed systems, this is a nightmare.

Business fields as primary keys (background)

In order to uniquely identify a member's information, a primary key needs to be set for the member information table. So how should the primary key be chosen to achieve our goal? Here we consider using a business field as the primary key.
Which field in the member information table would be appropriate?

  1. Choose the card number (cardno): the membership card number seems suitable, because it cannot be empty and is unique, so it can identify a membership record. Different card numbers correspond to different members, and cardno uniquely identifies a member. If card numbers and members really corresponded one to one, the system would work fine. In practice, however, membership card numbers may be reused. For example, Zhang San moves away because of a job change, stops shopping at the merchant's store, and returns his membership card, so he is no longer a member. Not wanting the card number to go unused, the merchant issues the card with number "10000001" to Wang Wu. From a system-design perspective, this change only modifies the member information for card number "10000001" in the member information table and does not break data consistency: after the modification, every module of the system sees the updated member information; there is no situation where some modules see the old information and others the new. So at the information-system level there is no problem. But from the business perspective of using the system there is a big problem, and it affects the merchant. For example, we have a sales transaction table (trans) recording all sales details. On December 1, 2020, Zhang San bought a book in the store for 89 yuan, so the system holds a transaction record of Zhang San buying a book.
    If membership card "10000001" is then reissued to Wang Wu and the member information table is updated, the query result becomes: Wang Wu bought a book on December 1, 2020, for 89 yuan. Clearly wrong.
    Conclusion: never use the membership card number as the primary key.
  2. Choose the member's phone number or ID number:
    (1) In practice, mobile phone numbers are sometimes reclaimed by carriers and reissued to other people.
    (2) The ID number is personal private data, and customers may be unwilling to provide it; making it mandatory would drive many customers away. Customer phone numbers have the same problem, which is why the member information table allows both the ID number and the phone number to be empty.

So the recommendation is: try not to use business-related fields as primary keys. After all, as project designers, none of us can predict which business fields will be duplicated or reused over the project's entire life cycle because of changing business requirements.

Taobao’s primary key design

In Taobao's e-commerce business, the order service is a core business. So how is the primary key of Taobao's order table designed? Is it an auto-increment ID? Open Taobao and look at the order information:

Experience: when first using MySQL, many people make the mistake of using business fields as primary keys, assuming they understand the business requirements. The reality often turns out otherwise, and the cost of changing the primary key design is very high.
The order numbers are clearly not auto-increment IDs. Look closely at these 4 order numbers:

1550672064762308113
1481195847180308113
1431156171142308113
1431146631521308113

The order numbers are 19 digits long, the last 5 digits are all the same (08113), and the first 14 digits are monotonically increasing. A bold guess: Taobao's order ID is designed as order ID = time + deduplication field + last 6 digits of the user ID. Such a design is globally unique and extremely friendly to distributed system queries.

Recommended primary key design

Non-core business: use an auto-increment ID as the primary key of the corresponding tables, e.g., alarm, log, and monitoring tables.
Core business: the primary key design should at least be globally unique and monotonically increasing: globally unique so it is unique across systems, monotonically increasing so inserts do not hurt database performance.
The simplest primary key design considered here is the UUID. Characteristics of UUIDs: globally unique, occupying 36 bytes as a string, unordered, with poor insertion performance.

Getting to know UUID

The UUID of the MySQL database is composed as: UUID = time + UUID version (16 bytes) - clock sequence (4 bytes) - MAC address (12 bytes). Take the UUID value e0ea12d4-6473-11eb-943c-00155dbaa39d as an example:
In MySQL, a UUID is stored as a string.

  1. Why is a UUID globally unique?
    The time part of a UUID occupies 60 bits and stores a TIMESTAMP-like value, but it counts 100ns intervals since 1582-10-15 00:00:00.00. UUIDs therefore store time at a higher precision than TIMESTAMP, and duplicates in the time dimension are possible only within the same 100ns interval.
    The clock sequence guards against duplicates caused by clocks being set back, and the MAC address makes the value globally unique.
  2. Why does a UUID occupy 36 bytes?
    UUIDs are stored as strings and include the useless "-" characters, so they require 36 bytes in total.
  3. Why are UUIDs unordered?
    Because the UUID design puts the low bits of the time first, and that part of the data changes constantly, the values are unordered.

Modified UUID

If the high and low time bits are swapped, the value becomes monotonically increasing over time. MySQL 8.0 can store the low and high time bits in swapped order, turning the UUID into an ordered UUID.
MySQL 8.0 also solves the UUID space problem: it removes the meaningless "-" characters and stores the value in a binary type, reducing storage to 16 bytes.
Both behaviors are available through the uuid_to_bin function provided by MySQL 8.0; the matching bin_to_uuid function converts back:
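A minimal sketch of ordered, compact UUIDs in MySQL 8.0:

SET @u = UUID();
SELECT @u, uuid_to_bin(@u), uuid_to_bin(@u, 1); # second argument = 1 swaps the time-low and time-high parts, making the binary value time-ordered
SELECT bin_to_uuid(uuid_to_bin(@u, 1), 1); # convert back with the same swap flag
# Since 8.0.13, a column can default to this: `id` BINARY(16) DEFAULT (uuid_to_bin(uuid(), 1))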

Nowadays the Snowflake algorithm is also commonly used (a tip from the livestream comments).


Origin blog.csdn.net/CaraYQ/article/details/129168300