MySQL Performance Optimization (5): MySQL Index Optimization in Practice (2)

1. Paging query optimization

Example table:
CREATE TABLE `employees` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `name` varchar(24) NOT NULL DEFAULT '' COMMENT '姓名',
  `age` int(11) NOT NULL DEFAULT '0' COMMENT '年龄',
  `position` varchar(20) NOT NULL DEFAULT '' COMMENT '职位',
  `hire_time` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP COMMENT '入职时间',
  PRIMARY KEY (`id`),
  KEY `idx_name_age_position` (`name`,`age`,`position`) USING BTREE
) ENGINE=InnoDB AUTO_INCREMENT=1 DEFAULT CHARSET=utf8 COMMENT='员工记录表';

In many cases, the paging feature of a business system is implemented with SQL like the following:

select * from employees limit 10000,10;

It fetches 10 rows starting from row 10001 of the employees table. Although only 10 records are returned, this SQL actually reads 10010 records first, discards the first 10000, and then returns the 10 rows that are wanted. So querying pages deep in a large table is very inefficient.
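The discard-the-offset behavior can be sketched in Python (a toy model of the row flow, not actual MySQL internals):

```python
# Sketch: why LIMIT offset,n gets slower as the offset grows.
# The server must walk offset+n rows in order, then discard the first offset.
def limit_query(rows, offset, n):
    scanned = 0
    result = []
    for row in rows:          # rows come back in primary-key order
        scanned += 1
        if scanned > offset:  # the first `offset` rows are read and thrown away
            result.append(row)
            if len(result) == n:
                break
    return result, scanned

rows = list(range(1, 20001))             # pretend primary keys 1..20000
page, scanned = limit_query(rows, 10000, 10)
print(page)     # rows 10001..10010
print(scanned)  # 10010 rows touched to return just 10
```

Even though only 10 rows come back, 10010 rows were walked; the deeper the page, the more wasted work.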

Common Paging Scenario Optimization Techniques

1.1 Paging query sorted by an auto-incrementing, continuous primary key

Let's first look at an example of a paging query sorted by an auto-incrementing, continuous primary key:

select * from employees limit 90000,5;

This SQL queries five rows starting from row 90001. There is no separate ORDER BY, so the rows come back in primary-key order. Because the primary key of employees is auto-incrementing and continuous, the query can be rewritten to fetch the five rows starting after id 90000 via the primary key, as follows:

select * from employees where id > 90000 limit 5;

The query results are identical. Now compare the execution plans:

EXPLAIN select * from employees limit 90000,5;


EXPLAIN select * from employees where id > 90000 limit 5;

Obviously, the rewritten SQL uses the primary key index, the number of scanned rows drops sharply, and execution is much faster.
However, this rewrite is impractical in many scenarios, because rows may have been deleted from the table, leaving gaps in the primary key, which makes the results inconsistent — see the experiment below (first delete an earlier record, then run the original SQL and the optimized SQL):
The two SQLs return different results, so if the primary key is not continuous, this optimization cannot be used.
Also, if the original SQL orders by a non-primary-key field, rewriting it this way also produces different results. So this rewrite requires both of the following conditions:

  • The primary key is self-incrementing and continuous
  • The results are sorted by primary key
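A small illustration of why the "continuous" condition matters (plain Python lists as a stand-in for the table, scaled down to 100 rows):

```python
# Delete one earlier row and the two queries no longer return the same page.
ids = list(range(1, 101))           # stand-in for employees, ids 1..100
ids.remove(50)                      # delete an earlier record -> gap in ids

limit_page = ids[90:95]                        # LIMIT 90,5 (skip 90 rows, take 5)
where_page = [i for i in ids if i > 90][:5]    # WHERE id > 90 LIMIT 5

print(limit_page)  # [92, 93, 94, 95, 96]
print(where_page)  # [91, 92, 93, 94, 95]
```

With the gap at id 50, `LIMIT 90,5` skips 90 physical rows while `id > 90` skips by value, so the pages diverge by one row.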

1.2 Paging query sorted by non-primary key fields

Now look at a paging query sorted by a non-primary-key field. The SQL is as follows:

 select * from employees ORDER BY name limit 90000,5;


EXPLAIN select * from employees ORDER BY name limit 90000,5;

The index on the name field is not used (the key column in the plan is NULL). The reason was covered in the previous article: scanning the whole secondary index and then looking up the full rows (which may require walking back into the primary key tree) costs more than simply scanning the whole table, so the optimizer abandons the index.

Knowing why the index is skipped, how do we optimize it?
The key is to return as few columns as possible while sorting: first sort and page through just the primary keys, then join back to fetch the full rows by primary key. The SQL is rewritten as follows:

select * from employees e inner join (select id from employees order by name limit 90000,5) ed on e.id = ed.id;

The result is identical to the original SQL, and the execution time drops by more than half. Comparing the execution plans before and after the optimization: the original SQL uses filesort, while the optimized SQL sorts using the index.
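The "late row lookup" idea behind this rewrite can be sketched in Python (dictionaries standing in for the table and the (name, id) index entries; `employees`, `name_index` and the row counts are illustrative, not from the article):

```python
# Page on the slim index entries first, then fetch full rows by primary key.
employees = {i: {"id": i, "name": f"user{i:05d}", "age": 20 + i % 40}
             for i in range(1, 1001)}
name_index = sorted((e["name"], e["id"]) for e in employees.values())

offset, n = 100, 5
page_ids = [pk for _, pk in name_index[offset:offset + n]]  # cheap: index only
page_rows = [employees[pk] for pk in page_ids]              # 5 primary-key lookups
print([r["name"] for r in page_rows])
```

Sorting and paging touch only the small (name, id) entries; the wide rows are fetched just five times, which mirrors the inner-join rewrite above.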

2. Join query optimization

-- Example tables:
CREATE TABLE `t1` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `a` int(11) DEFAULT NULL,
  `b` int(11) DEFAULT NULL,
  PRIMARY KEY (`id`),
  KEY `idx_a` (`a`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

create table t2 like t1;

-- Insert some sample data
-- Insert 10,000 rows into t1
drop procedure if exists insert_t1; 
delimiter ;;
create procedure insert_t1()        
begin
  declare i int;                    
  set i=1;                          
  while(i<=10000)do                 
    insert into t1(a,b) values(i,i);  
    set i=i+1;                       
  end while;
end;;
delimiter ;
call insert_t1();

-- Insert 100 rows into t2
drop procedure if exists insert_t2; 
delimiter ;;
create procedure insert_t2()        
begin
  declare i int;                    
  set i=1;                          
  while(i<=100)do                 
    insert into t2(a,b) values(i,i);  
    set i=i+1;                       
  end while;
end;;
delimiter ;
call insert_t2();

2.1 Two common table join algorithms in MySQL

  • Nested-Loop Join Algorithm
  • Block Nested-Loop Join Algorithm

The Nested-Loop Join (NLJ) algorithm
reads rows from the first table (the driving table) one at a time in a loop, takes the join column from each row, looks up matching rows in the other table (the driven table) by that column, and merges the matches into the result set.

EXPLAIN select * from t1 inner join t2 on t1.a= t2.a;

You can see this information from the execution plan:

  • The driving table is t2 and the driven table is t1; the driving table is processed first (when the ids in the EXPLAIN output are equal, the rows execute top to bottom). The optimizer normally picks the smaller table as the driving table, applies the WHERE conditions to it, and then probes the driven table. So with inner join, the table written first is not necessarily the driving table.
  • With left join, the left table drives and the right table is driven; with right join, the right table drives and the left table is driven. With a plain (inner) join, MySQL picks the table with the smaller data volume as the driving table and the larger one as the driven table.
  • This query uses the NLJ algorithm. In general, if Extra in the execution plan does not show Using join buffer, the join uses NLJ.

The general process of the above sql is as follows:

  1. Read a row of data from table t2 (if the t2 table has query filter conditions, use the condition to filter first, and then take out a row of data from the filter result);
  2. From the data in step 1, take out the associated field a and search it in table t1;
  3. Take out the rows that meet the conditions in table t1, merge them with the results obtained in t2, and return them to the client as the result;
  4. Repeat the above 3 steps.

The whole process reads all rows of t2 (100 rows scanned), and for each row uses the value of column a to look up the matching row in t1 through the index on a (100 index lookups into t1; each lookup ends up reading one full t1 row, so 100 t1 rows are scanned as well). So the whole process scans about 200 rows.
If the join column of the driven table has no index, NLJ performs poorly (explained in detail below), and MySQL chooses the Block Nested-Loop Join algorithm instead.
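The 200-row figure can be checked with a toy simulation (a Python dict standing in for the idx_a index on t1 — a sketch, not InnoDB internals):

```python
# Toy NLJ: t2 (100 rows) drives, t1 (10000 rows) is probed through an index on a.
t1 = [{"id": i, "a": i, "b": i} for i in range(1, 10001)]
t2 = [{"id": i, "a": i, "b": i} for i in range(1, 101)]
t1_index_a = {row["a"]: row for row in t1}   # stand-in for idx_a on t1

scanned = 0
result = []
for outer in t2:                             # one pass over the driving table
    scanned += 1
    match = t1_index_a.get(outer["a"])       # index lookup into the driven table
    if match is not None:
        scanned += 1                         # one matching t1 row read per lookup
        result.append((outer, match))

print(len(result))  # 100 joined rows
print(scanned)      # 200 rows scanned in total, as described above
```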

2.2 The Block Nested-Loop Join (BNL) algorithm

Read the data of the driving table into the join_buffer, then scan the driven table , and compare each row of the driven table with the data in the join_buffer.

EXPLAIN select * from t1 inner join t2 on t1.b= t2.b;

Using join buffer (Block Nested Loop) in Extra indicates that the join uses the BNL algorithm.

The general process of the above sql is as follows:

  1. Put all the data of t2 into join_buffer
  2. Take out each row in table t1 and compare it with the data in join_buffer
  3. Return the data that satisfies the join condition

The whole process does a full scan of both t1 and t2, so the total rows scanned is 10000 (all of t1) + 100 (all of t2) = 10100. Since the data in join_buffer is unordered, every row of t1 must be checked against all 100 buffered rows, so the number of in-memory comparisons is 100 * 10000 = 1 million.
In this example t2 has only 100 rows. What if t2 were a large table that did not fit in join_buffer?
The size of join_buffer is set by the parameter join_buffer_size, with a default of 256KB. If all of t2 doesn't fit, the strategy is very simple: load it in segments.
For example, if t2 had 1000 rows and join_buffer held only 800 at a time, the process would be: put the first 800 t2 rows in join_buffer, scan t1 comparing against the buffered data, clear the buffer, load the remaining 200 t2 rows, and compare t1 against the buffer again. So t1 is scanned one extra time.
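The segmented-buffer behavior can be sketched as a toy simulation (a chunked Python list as the join buffer; the counts below are for the t1/t2 example above):

```python
# Toy BNL: t2 is loaded into a join buffer in chunks; t1 is scanned once per chunk.
def bnl_join(t1, t2, buffer_rows):
    t1_scans, comparisons, result = 0, 0, []
    for start in range(0, len(t2), buffer_rows):
        join_buffer = t2[start:start + buffer_rows]   # fill (part of) the buffer
        t1_scans += 1                                 # one full scan of t1 per chunk
        for row1 in t1:
            for row2 in join_buffer:
                comparisons += 1                      # in-memory check, no disk seek
                if row1["b"] == row2["b"]:
                    result.append((row1, row2))
    return result, t1_scans, comparisons

t1 = [{"b": i} for i in range(1, 10001)]
t2 = [{"b": i} for i in range(1, 101)]
_, scans, cmps = bnl_join(t1, t2, buffer_rows=100)   # buffer fits all of t2
print(scans, cmps)   # 1 scan of t1, 100 * 10000 = 1,000,000 comparisons

_, scans, cmps = bnl_join(t1, t2, buffer_rows=80)    # buffer too small: 2 chunks
print(scans, cmps)   # 2 scans of t1, still 1,000,000 comparisons
```

Shrinking the buffer adds extra scans of the driven table, but the total number of in-memory comparisons stays the same.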

Why does MySQL use the BNL algorithm rather than Nested-Loop Join when the driven table's join column has no index?
If the second SQL above used Nested-Loop Join, the number of rows scanned would be 100 * 10000 = 1 million, and every one of them a disk scan.
With BNL the number of disk scans is far smaller, and compared with disk access, BNL's in-memory comparisons are much faster.
Therefore, for joins whose driven-table join column has no index, MySQL generally uses BNL; if there is an index, it generally picks NLJ, which outperforms BNL when an index is available.

2.3 Optimization for associated sql

  • Index the join columns, so that MySQL can choose the NLJ algorithm for the join. Since the driving table must be queried and filtered first, its filter conditions should use indexes too, to avoid full table scans. In short, make filter conditions hit indexes wherever possible.
  • Let the small table drive the large table. If you already know which table is the small one when writing multi-table join SQL, you can use straight_join to fix the join order, saving the optimizer the decision.

straight_join explained: straight_join behaves like join, but it forces the left table to drive the right one, overriding the optimizer's choice of join order.
For example: select * from t2 straight_join t1 on t2.a = t1.a; forces MySQL to use t2 as the driving table.

  • straight_join is only applicable to inner join, not to left join, right join. (Because left join and right join already represent the execution order of the specified table)
  • Let the optimizer decide whenever possible, since in most cases it is smarter than we are. Use straight_join cautiously, because a hand-picked join order is not always more reliable than the optimizer's.

How to determine the "small table":
when deciding which table should be the driving table, filter both tables with their respective conditions first; after filtering, compare the total data volume of the columns participating in the join. The table with the smaller volume is the "small table" and should be used as the driving table.

In and exists optimization
Principle: small table drives large table, i.e. a small data set drives a large data set.

in: When the data set of table B is smaller than the data set of table A, in is better than exists

select * from A where id in (select id from B)  
# equivalent to:
  for(select id from B){
      select * from A where A.id = B.id
    }

exists: When the data set of table A is smaller than the data set of table B, exists is better than in.
  Each row of the outer query A is passed into subquery B for a condition check, and the TRUE/FALSE result decides whether that row of A is kept.

select * from A where exists (select 1 from B where B.id = A.id)
# equivalent to:
    for(select * from A){
      select * from B where B.id = A.id
    }
    
# The id columns of tables A and B should both be indexed

1. EXISTS (subquery) returns only TRUE or FALSE, so SELECT * in the subquery can be replaced with SELECT 1; officially, the select list is ignored during actual execution, so there is no difference.
2. The actual execution of an EXISTS subquery may be optimized, rather than being the item-by-item comparison we imagine.
3. An EXISTS subquery can often be replaced by a JOIN; which is best depends on the specific case.
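The pseudo-loops above can be written out in Python; which table the outer loop runs over is exactly what "small drives large" is about (toy lists `A` and `B`, sizes chosen for illustration):

```python
A = [{"id": i} for i in range(1, 1001)]      # large table
B = [{"id": i} for i in range(1, 11)]        # small table

# IN: the outer loop effectively runs over subquery B (small), probing A.
b_ids = {row["id"] for row in B}
in_result = [row for row in A if row["id"] in b_ids]

# EXISTS: the outer loop runs over A (here the large table), checking B per row.
exists_result = [ra for ra in A if any(rb["id"] == ra["id"] for rb in B)]

print(len(in_result), len(exists_result))  # same rows either way: 10 10
```

Both forms return the same rows; the difference is only which data set drives the loop, which is why the smaller set should be on the driving side.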

3. count(*) query optimization

-- Temporarily disable the MySQL query cache to observe the real execution time across runs
mysql> set global query_cache_size=0;
mysql> set global query_cache_type=0;

mysql> EXPLAIN select count(1) from employees;
mysql> EXPLAIN select count(id) from employees;
mysql> EXPLAIN select count(name) from employees;
mysql> EXPLAIN select count(*) from employees;

Note: of the four SQLs above, only count(field) skips rows where that field is NULL; the other variants count every row.
The execution plans of the four SQLs are identical, which suggests their efficiency should be similar.
When the field has an index:

# With an index on the field, count(field) uses the secondary index; a secondary index stores less data than the primary key index, so count(field) > count(primary key id)
count(*) ≈ count(1) > count(field) > count(primary key id)

When the field has no index:

# Without an index, count(field) cannot use an index, while count(primary key id) can still use the primary key index, so count(primary key id) > count(field)
count(*) ≈ count(1) > count(primary key id) > count(field)

count(1) executes much like count(field), except that count(1) does not need to extract the column value; it counts with the constant 1, so in theory count(1) is slightly faster than count(field).
count(*) is the exception: MySQL does not extract all the columns but applies a dedicated optimization, accumulating row counts without reading values. It is very efficient, so there is no need to replace count(*) with count(column) or count(constant).

Why does MySQL choose a secondary index rather than the primary key clustered index for count(id)? Because a secondary index stores less data than the primary key index, retrieval should be faster; MySQL optimizes this internally (apparently since version 5.7).
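The NULL-skipping semantics of count(field) from the note above can be demonstrated directly (toy rows in Python, `None` standing in for SQL NULL):

```python
# count(*) / count(1) count every row; count(col) skips rows where col is NULL.
rows = [{"id": 1, "name": "a"},
        {"id": 2, "name": None},   # NULL name: invisible to count(name)
        {"id": 3, "name": "b"}]

count_star = len(rows)                                  # like count(*)
count_name = sum(1 for r in rows if r["name"] is not None)  # like count(name)
print(count_star, count_name)  # 3 2
```

This is why count(field) and count(*) are interchangeable only when the field is declared NOT NULL.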

3.1 Common optimization methods

1. Query the row total maintained by MySQL itself.
For MyISAM tables, count queries without a WHERE clause are extremely fast, because MyISAM stores the table's total row count on disk and no computation is needed.
For InnoDB tables, MySQL does not store the total row count (because of the MVCC mechanism, discussed later), so the count must be computed in real time.

2. show table status.
If an estimate of the table's total row count is enough, this query is very fast.
3. Maintain the total in Redis.
When inserting or deleting rows, also maintain a row-count key in Redis (with the incr/decr commands). This may be inaccurate, because transactional consistency between the table operation and the Redis operation is hard to guarantee.

4. Add a database counter table.
When inserting or deleting rows, maintain the counter table in the same transaction, so both stay consistent.

4. MySQL data type selection

In MySQL, choosing the correct data type is critical to performance. Generally, the following two steps should be followed:
(1) Determine the appropriate large type: number, string, time, binary;
(2) Determine the specific type: signed or not, value range, variable length and fixed length, etc.
In terms of MySQL data type settings, try to use smaller data types, because they usually have better performance and consume less hardware resources. Also, try to define the field as NOT NULL and avoid using NULL.

4.1 Numeric types

| Type | Size | Range (signed) | Range (unsigned) | Use |
|---|---|---|---|---|
| TINYINT | 1 byte | (-128, 127) | (0, 255) | small integer values |
| SMALLINT | 2 bytes | (-32768, 32767) | (0, 65535) | large integer values |
| MEDIUMINT | 3 bytes | (-8388608, 8388607) | (0, 16777215) | large integer values |
| INT / INTEGER | 4 bytes | (-2147483648, 2147483647) | (0, 4294967295) | large integer values |
| BIGINT | 8 bytes | (-9223372036854775808, 9223372036854775807) | (0, 18446744073709551615) | extremely large integer values |
| FLOAT | 4 bytes | (-3.402823466E+38, -1.175494351E-38), 0, (1.175494351E-38, 3.402823466E+38) | 0, (1.175494351E-38, 3.402823466E+38) | single-precision floating-point values |
| DOUBLE | 8 bytes | (-1.7976931348623157E+308, -2.2250738585072014E-308), 0, (2.2250738585072014E-308, 1.7976931348623157E+308) | 0, (2.2250738585072014E-308, 1.7976931348623157E+308) | double-precision floating-point values |
| DECIMAL | for DECIMAL(M,D): M+2 bytes if M>D, else D+2 | depends on M and D | depends on M and D | exact decimal values |

Optimization suggestions

  1. If integer data has no negative values, such as an ID, specify it as UNSIGNED, which doubles the positive range.
  2. Prefer TINYINT over ENUM, BIT, and SET.
  3. Avoid using the display width of an integer (see the end of the document), that is, do not use INT(10) to specify the display width of the field, use INT directly.
  4. DECIMAL is most suitable for storing data that requires high accuracy and is used for calculations, such as prices. But when using the DECIMAL type, pay attention to the length setting.
  5. Prefer integers for storing and computing with real numbers: multiply the real numbers by an appropriate factor and operate on the integers.
  6. Integer is usually the best data type because it is fast and can use AUTO_INCREMENT.

4.2 Date and time

| Type | Size (bytes) | Range | Format | Use |
|---|---|---|---|---|
| DATE | 3 | 1000-01-01 to 9999-12-31 | YYYY-MM-DD | date values |
| TIME | 3 | '-838:59:59' to '838:59:59' | HH:MM:SS | time values or durations |
| YEAR | 1 | 1901 to 2155 | YYYY | year values |
| DATETIME | 8 | 1000-01-01 00:00:00 to 9999-12-31 23:59:59 | YYYY-MM-DD HH:MM:SS | combined date and time values |
| TIMESTAMP | 4 | 1970-01-01 00:00:00 to 2038-01-19 03:14:07 | YYYY-MM-DD HH:MM:SS | combined date and time values, timestamps |

Optimization suggestions

  1. The smallest time granularity that MySQL can store is seconds.
  2. It is recommended to use the DATE data type to save the date. The default date format in MySQL is yyyy-mm-dd.
  3. Use MySQL's built-in types DATE, TIME, and DATETIME to store time instead of strings.
  4. For TIMESTAMP and DATETIME columns, you can use CURRENT_TIMESTAMP as the default (MySQL 5.6 and later), and MySQL will automatically record the exact insertion time.
  5. TIMESTAMP stores a UTC instant, so the displayed value depends on the time zone.
  6. DATETIME is stored as the digits YYYYMMDD HH:MM:SS, independent of the time zone: what you save is what you read back.
  7. Unless there are special needs, TIMESTAMP is generally recommended since it uses less space than DATETIME; but companies like Alibaba often use DATETIME, to avoid TIMESTAMP's 2038 limit.
  8. People sometimes store Unix timestamps as integer values, but this usually brings no benefit: the format is inconvenient to work with, and we don't recommend it.
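Suggestions 5 and 6 illustrated with Python's datetime (the epoch value is an arbitrary example; '+8:00' stands in for a MySQL session time_zone setting):

```python
# A TIMESTAMP is an absolute instant (seconds since the UTC epoch); the
# rendered wall-clock digits depend on the time zone it is viewed in.
from datetime import datetime, timezone, timedelta

epoch_seconds = 1_700_000_000          # what a TIMESTAMP column conceptually holds
utc = datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)
cst = utc.astimezone(timezone(timedelta(hours=8)))   # session time_zone = '+8:00'
print(utc.strftime("%Y-%m-%d %H:%M:%S"))   # 2023-11-14 22:13:20
print(cst.strftime("%Y-%m-%d %H:%M:%S"))   # 2023-11-15 06:13:20 -- same instant

wall_clock = "2023-11-15 06:13:20"     # a DATETIME is just these digits,
print(wall_clock)                       # read back unchanged in any time zone
```

The same stored instant renders as two different wall-clock strings; a DATETIME, by contrast, never shifts with the time zone.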

4.3 Strings

| Type | Size | Use |
|---|---|---|
| CHAR | 0-255 bytes | Fixed-length string; char(n) pads with spaces when fewer than n characters are inserted (n is a character count), and trailing spaces are stripped on read |
| VARCHAR | 0-65535 bytes | Variable-length string; n in varchar(n) is the maximum character count, and no padding is added |
| TINYBLOB | 0-255 bytes | Short binary string |
| TINYTEXT | 0-255 bytes | Short text string |
| BLOB | 0-65535 bytes | Long data in binary form |
| TEXT | 0-65535 bytes | Long text data |
| MEDIUMBLOB | 0-16777215 bytes | Medium-length data in binary form |
| MEDIUMTEXT | 0-16777215 bytes | Medium-length text data |
| LONGBLOB | 0-4294967295 bytes | Very large data in binary form |
| LONGTEXT | 0-4294967295 bytes | Very large text data |

Optimization suggestions

  1. Use VARCHAR if the length of the string is quite different; use CHAR if the string is short and all values ​​are close to the same length.
  2. CHAR and VARCHAR suit any mix of letters and digits, including names, zip codes, and phone numbers, up to 255 characters. Don't use VARCHAR to store numbers that will be used in calculations, as this can cause calculation-related problems affecting accuracy and completeness.
  3. Try to use BLOB and TEXT as little as possible. If you really want to use them, you can consider storing BLOB and TEXT fields in a separate table and linking them with id.
  4. The BLOB family stores binary strings, independent of the character set. The TEXT series stores non-binary character strings, which are related to character sets.
  5. Neither BLOB nor TEXT can have a default value.

PS: INT display width
We often specify a length when creating a table, as below. However, this length is not the maximum the TINYINT type can store, but the maximum display width.

CREATE TABLE `user`(
    `id` TINYINT(2) UNSIGNED
);

Here the id field of the user table is of type TINYINT UNSIGNED, whose maximum storable value is 255. So when storing data, a value of 255 or less, such as 200, saves normally even though it has more than 2 digits; a value greater than 255, such as 500, is clamped to the TINYINT maximum of 255 (in non-strict SQL mode; strict mode raises an error instead).
When querying, the result is output according to the actual stored value, regardless of the display width. The 2 in TINYINT(2) only takes effect when the result needs zero-padding, enabled by adding ZEROFILL, such as:

`id` TINYINT(2) UNSIGNED ZEROFILL

This way, if the query result is 5, it is output as 05; with TINYINT(5), it is output as 00005. The actual stored value is still 5, and stored data still cannot exceed 255; MySQL merely pads the output with zeros.
In other words, the display width in TINYINT(2) or INT(11) does not affect what can be inserted; it only matters together with ZEROFILL, which zero-pads query results.

Origin: blog.csdn.net/qq_33417321/article/details/121215041