MySQL index optimization and query optimization

1. Index failure cases

1 Full-value matching

2 Best Left Prefix Rule

3 Primary key insertion order

If a data page is already full, inserting into it forces the page to split in two, with some of its records moved to the newly created page. Page splits and record shifts mean performance loss! To avoid this unnecessary cost as much as possible, it is best to insert records whose primary key values increase sequentially, so that splits never happen.
So we suggest: declare the primary key AUTO_INCREMENT and let the storage engine generate primary keys for the table itself rather than inserting them manually, as in the sketch below.
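
A minimal sketch (hypothetical table and column names): declaring the primary key AUTO_INCREMENT lets InnoDB always append new rows at the tail of the clustered index:

CREATE TABLE user_demo(
id INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,   -- the storage engine generates sequential keys
name VARCHAR(32)
) ENGINE=InnoDB;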

4 Calculations, functions, and type conversions (automatic or manual) cause index failure

5 Type conversion causes index failure
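
To illustrate cases 4 and 5, assume a hypothetical student table with an index on the VARCHAR column name: applying a function to the column, or comparing it against another type, blocks the index, while the equivalent direct form does not:

SELECT * FROM student WHERE LEFT(name,3) = 'abc';  -- function on the indexed column: index not used
SELECT * FROM student WHERE name LIKE 'abc%';      -- equivalent range condition: index can be used
SELECT * FROM student WHERE name = 123;            -- implicit cast of the column: index not used
SELECT * FROM student WHERE name = '123';          -- compared as a string: index can be used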

6 Index columns to the right of a range condition (in a composite index) are not used

7 Not equal to (!= or <>) makes the index unusable

8 IS NULL can use the index; IS NOT NULL cannot

9 A LIKE pattern starting with the wildcard % cannot use the index

10 A non-indexed column on either side of OR makes the index unusable

11 Use utf8mb4 uniformly as the character set for databases and tables

Using utf8mb4 uniformly (supported since version 5.5.3) gives better compatibility, and a unified character set avoids garbled text caused by character set conversion. Columns with different character sets must be converted before they can be compared, and that conversion makes the index unusable.
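
An existing table can be converted with a statement along these lines (a sketch; pick the collation appropriate to your version and needs):

ALTER TABLE student CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci;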

2. Principle of the join statement

EXPLAIN SELECT * FROM t1 STRAIGHT_JOIN t2 ON (t1.a=t2.a);

If you write the join directly, the MySQL optimizer may choose either t1 or t2 as the driving table, which would interfere with our analysis of the statement's execution. To make the execution easy to analyze, we use STRAIGHT_JOIN instead, which makes MySQL join in a fixed order: the optimizer joins only in the way we specify. In this statement, t1 is the driving table and t2 is the driven table.
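
The scan-count analysis below assumes two hypothetical tables built roughly as follows (a sketch, not taken from the original post): both are indexed on a, t1 holds 100 rows, and every a value in t1 matches exactly one row in t2:

CREATE TABLE t2(
id INT PRIMARY KEY,
a INT,
b INT,
INDEX(a)
) ENGINE=InnoDB;
CREATE TABLE t1 LIKE t2;
-- assume t1 is loaded with 100 rows whose a values each match one row in t2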

It can be seen that there is an index on field a of the driven table t2, and the join uses this index, so the statement executes as follows:
1. Read a row of data R from table t1;
2. Take field a from row R and look it up in table t2;
3. Take the matching rows from table t2 and join each with R to form rows of the result set;
4. Repeat steps 1 to 3 until the scan reaches the end of table t1.
This process first traverses table t1, then goes to table t2 to find the records matching the a value taken from each row of t1. In form it is much like the nested loops we write in programs, and it can use the driven table's index, so it is called "Index Nested-Loop Join", NLJ for short.

In this process:
1. A full table scan is performed on the driving table t1, scanning 100 rows;
2. For each row R, table t2 is searched on field a using a tree search. Since the data we constructed corresponds one to one, each search scans exactly one row, 100 rows in total;
3. The whole execution therefore scans 200 rows in total.

Conclusions:
Using a join statement performs better than forcibly splitting it into multiple single-table SQL statements;
when using a join statement, make the small table the driving table.

Ensure that the JOIN field of the driven table has an index on it.

Fields used for a JOIN must have exactly the same data type.
For a LEFT JOIN, choose the small table as the driving table and the large table as the driven table, to reduce the number of outer-loop iterations.
For an INNER JOIN, MySQL automatically chooses the table with the smaller result set as the driving table.
If multiple tables can be joined directly, join them directly rather than through subqueries (this reduces the number of queries).
Subqueries are not recommended: either split the subquery into multiple queries combined in application code, or replace the subquery with a JOIN.
Indexes cannot be created on derived tables.

3. Subquery optimization

MySQL has supported subqueries since version 4.1. A subquery nests one SELECT inside another, so that the result of the inner SELECT serves as a condition for the outer one. Subqueries can complete in one statement what would logically require multiple SQL steps.
Subqueries are an important MySQL feature that let a single SQL statement express fairly complex queries. However, subqueries do not execute efficiently, for these reasons:
① To execute a subquery, MySQL must create a temporary table for the inner query's result; the outer query then reads records from that temporary table, which is dropped when the query finishes. This consumes considerable CPU and IO resources and produces many slow queries.
② The temporary table holding the subquery's result set has no indexes, whether it is an in-memory or an on-disk temporary table, so query performance suffers to some extent.
③ The larger the result set a subquery returns, the greater the impact on query performance.
In MySQL, you can replace subqueries with JOIN queries. A join needs no temporary table and is faster than a subquery, and performance is even better if the join can use indexes.

Conclusion: try not to use NOT IN or NOT EXISTS; rewrite with LEFT JOIN xxx ON xx WHERE xx IS NULL instead, as in the sketch below.
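
For example (hypothetical tables student and score; the rewrite assumes score.student_id is NOT NULL, since NOT IN behaves differently when the subquery returns NULLs):

-- find students with no score record, subquery form
SELECT * FROM student s WHERE s.id NOT IN (SELECT student_id FROM score);
-- equivalent LEFT JOIN ... IS NULL form
SELECT s.* FROM student s LEFT JOIN score sc ON s.id = sc.student_id WHERE sc.student_id IS NULL;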

4. Sorting optimization

1. In SQL, indexes can be used in both the WHERE clause and the ORDER BY clause, to avoid a full table scan for the WHERE clause and a FileSort sort for the ORDER BY clause. Admittedly, in some cases a full table scan or a FileSort is not necessarily slower than using an index, but in general we should avoid both to improve query efficiency.
2. Try to let the index complete the ORDER BY sort. When WHERE and ORDER BY use the same column, a single-column index is enough; when they use different columns, use a composite index covering both (see the sketch after this list).
3. When no index can be used, the FileSort path needs to be tuned.
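
A sketch of point 2 with hypothetical names: a composite index on (user_id, create_time) serves both the WHERE filter and the ORDER BY sort, so EXPLAIN should not show 'Using filesort' in the Extra column:

CREATE INDEX idx_user_time ON orders(user_id, create_time);
EXPLAIN SELECT id, user_id, create_time FROM orders WHERE user_id = 1001 ORDER BY create_time;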

5. The filesort algorithms: two-pass sorting and single-pass sorting

Two-pass sorting (slow)
Before MySQL 4.1, two-pass sorting was used. Literally, the disk is scanned twice to finally get the data: first read the row pointers and the ORDER BY columns and sort them; then scan the sorted list and, following it, re-read the corresponding rows from disk to produce the output.

That is, the sort fields are fetched from disk and sorted in the buffer, and then the other fields are fetched from disk, one batch of data at a time. The disk must be scanned twice, and since IO is very time-consuming, an improved algorithm appeared after MySQL 4.1: single-pass sorting.

Single-pass sorting (fast)
Read all the columns the query needs from disk, sort them in the buffer by the ORDER BY column, and then scan the sorted list to produce the output. This is more efficient: it avoids reading the data a second time and turns random IO into sequential IO, but it uses more space, because it keeps each whole row in memory.

Conclusions and follow-on issues
Since single-pass sorting came later, it is generally better than two-pass sorting, but single-pass sorting has problems of its own, because it holds whole rows in memory.
Optimization strategies (see the sketch after this list):
1. Try increasing sort_buffer_size.
2. Try increasing max_length_for_sort_data.
3. SELECT * is taboo with ORDER BY; query only the fields you actually need.
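
Both parameters can be inspected and raised per session; a sketch with illustrative values, not recommendations:

SHOW VARIABLES LIKE 'sort_buffer_size';
SHOW VARIABLES LIKE 'max_length_for_sort_data';
SET SESSION sort_buffer_size = 1048576;        -- 1 MB for this session only
SET SESSION max_length_for_sort_data = 2048;   -- rows wider than this fall back to two-pass sorting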

6. GROUP BY optimization

The rules for using indexes with GROUP BY are almost the same as for ORDER BY. GROUP BY can use an index directly even when no filter condition uses the index.
GROUP BY sorts first and then groups, following the best left prefix rule for index construction.
When index columns cannot be used, increase the max_length_for_sort_data and sort_buffer_size parameters.
WHERE is more efficient than HAVING; if a condition can be written in WHERE, do not write it in HAVING (see the sketch after this list).
Reduce the use of ORDER BY; do without sorting where you can, or sort on the client side instead. ORDER BY, GROUP BY and DISTINCT statements are CPU-intensive, and the database's CPU resources are extremely precious.
For statements containing ORDER BY, GROUP BY or DISTINCT, keep the result set filtered by the WHERE condition within 1,000 rows; otherwise the SQL will be very slow.
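
A sketch of the WHERE-versus-HAVING point with a hypothetical table: filtering before grouping means fewer rows get grouped at all:

-- preferred: rows are filtered before grouping
SELECT class_id, AVG(score) FROM student WHERE class_id = 10 GROUP BY class_id;
-- slower: every group is built first, then filtered
SELECT class_id, AVG(score) FROM student GROUP BY class_id HAVING class_id = 10;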

7. Optimize paging queries

Complete the sorting and paging on the index first, and then join back to the original table by primary key to fetch the other columns the query needs, as in the sketch below.
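
A sketch of this "deferred join" pattern with a hypothetical table: the subquery pages through the index alone, and the outer query fetches full rows only for the ten surviving ids:

EXPLAIN SELECT t.* FROM student t
JOIN (SELECT id FROM student ORDER BY id LIMIT 2000000, 10) a ON t.id = a.id;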

8. Prioritize covering indexes

An index is an efficient way to find rows, but a database can also use an index to fetch a column's data, so it does not have to read the whole row. After all, an index's leaf nodes store the data they index; when the desired data can be obtained by reading the index itself, there is no need to read the row. An index that contains all the data needed to satisfy the query is called a covering index.

It is a form of non-clustered composite index that includes all the columns used in the query's SELECT, JOIN and WHERE clauses (that is, the indexed fields happen to cover the fields the query involves).

Put simply, the index columns plus the primary key contain all the columns the query reads between SELECT and FROM.
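
A sketch with hypothetical names: the secondary index on (age, name) implicitly carries the primary key id, so this query is answered from the index alone and EXPLAIN shows 'Using index' in Extra:

CREATE INDEX idx_age_name ON student(age, name);
EXPLAIN SELECT id, age, name FROM student WHERE age = 20;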

Pros and cons of covering indexes

Benefits:
1. Avoids a second lookup against the InnoDB table (the "table return" to the clustered index).
2. Can turn random IO into sequential IO, speeding up queries.
Disadvantages:
Maintaining indexed fields always has a cost, so there are trade-offs to weigh when building redundant indexes to support covering indexes. Making that call is the job of the business DBA, or the business data architect.

9. Prefix index

The impact of prefix indexes on covering indexes

With a prefix index, the covering-index optimization for query performance is lost: because only a prefix of the value is stored in the index, the server must always return to the row to get and verify the full value. This is another factor to consider when choosing whether to use a prefix index.
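
A sketch with hypothetical names and lengths: indexing only the first 6 characters of email keeps the index small, but even a query reading just id and email must go back to the table to verify the full value:

ALTER TABLE user ADD INDEX idx_email6 (email(6));
SELECT id, email FROM user WHERE email = 'zhangsan@example.com';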

10. Index pushdown

Index Condition Pushdown (ICP) is a new feature in MySQL 5.6: an optimization that filters data at the storage engine layer using the index. ICP reduces both the number of times the storage engine accesses the base table and the number of times the MySQL server accesses the storage engine.

10.1 Scan process without ICP:

Storage engine layer: fetches only the whole rows corresponding to index records that satisfy the index key condition and returns them to the server layer.
Server layer: filters the returned rows against the remaining WHERE conditions, until the last row has been returned.

10.2 Scan process with ICP:

Storage engine layer:
First determine the index record range that satisfies the index key condition, then apply the index filter (the pushed-down conditions) on the index itself. Only index records that satisfy the index filter trigger a table lookup and have their whole row returned to the server layer; index records that fail the index filter are discarded, with no table lookup and no trip to the server layer.
Server layer:
Applies the table filter conditions to the returned rows for the final filtering.

 

Cost difference before and after ICP:
Without ICP, the storage layer returns many rows that the index filter would have eliminated, and the server layer has to filter them out.
With ICP, records that fail the index filter conditions are discarded directly, saving the cost of looking them up in the table and passing them to the server layer.
How much ICP speeds things up depends on what proportion of the data the ICP conditions filter out inside the storage engine.
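
A sketch adapted from the classic ICP illustration (hypothetical people table with a composite index on (zipcode, lastname)): zipcode positions the index range, and ICP lets the engine check the lastname LIKE condition on index entries before fetching whole rows; when ICP applies, EXPLAIN shows 'Using index condition' in Extra:

CREATE INDEX idx_zip_last ON people(zipcode, lastname);
EXPLAIN SELECT * FROM people WHERE zipcode = '95054' AND lastname LIKE '%etrunia%';
SET optimizer_switch = 'index_condition_pushdown=off';   -- toggle ICP off to compare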

10.3 Conditions for using ICP

Conditions for using ICP:
① It can only be used on secondary indexes.
② The type value (join type) in the plan shown by EXPLAIN is range, ref, eq_ref or ref_or_null.
③ Not every WHERE condition can be filtered by ICP: if a WHERE condition's field is not among the index columns, the whole row must still be read to the server for WHERE filtering.
④ ICP can be used with both the MyISAM and InnoDB storage engines.
⑤ MySQL 5.6 does not support ICP on partitioned tables; support begins in 5.7.
⑥ When the SQL uses a covering index, the ICP optimization is not applied.

On secondary (non-primary-key) indexes, index condition pushdown effectively reduces the number of table lookups and greatly improves query efficiency. In daily work, tuning indexes to the business's conditions so that ICP applies can raise business throughput.

11. Ordinary index vs unique index

From a performance perspective, should you choose a unique index or an ordinary index? What should the choice be based on?
Suppose we have a table whose primary key column is id. The table has a field k with an index on it, and assume the values in field k never repeat.

The table creation statement for this table is:


mysql> create table test(
id int primary key,
k int not null,
name varchar(16),
index (k)
)engine=InnoDB;

The (id, k) values of rows R1 to R5 in the table are (100,1), (200,2), (300,3), (500,5) and (600,6) respectively.

11.1 Query process

Assume the query to execute is:

select id from test where k=5;

For an ordinary index, after the first record satisfying the condition, (5,500), is found, the next record must also be examined, until the first record that does not satisfy k=5 is encountered.
For a unique index, since the index is defined as unique, the search stops as soon as the first matching record is found.
So how big is the performance gap caused by this difference? The answer: minimal.

11.2 Update process

To illustrate the impact of ordinary and unique indexes on UPDATE statement performance, we first introduce the change buffer.
When a data page needs to be updated, if the page is in memory it is updated directly. If the page is not yet in memory, InnoDB caches the update operations in the change buffer, without compromising data consistency, so the data page does not have to be read from disk. When a later query needs to access the page, the page is read into memory and the operations recorded for it in the change buffer are applied; this keeps the data logically correct.
Applying the change buffer's operations to the original data page to obtain the up-to-date result is called merge. Besides being triggered when the data page is accessed, merges are also performed periodically by background threads, and they are performed during a normal shutdown of the database.
If an update can first be recorded in the change buffer, disk reads are reduced and the statement executes noticeably faster. Moreover, since reading a page into memory occupies the buffer pool, this approach also avoids tying up memory and improves memory utilization.
Updates to unique indexes cannot use the change buffer; in fact, only ordinary indexes can use it.
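
The change buffer is configurable; a sketch of inspecting its settings (these are the standard InnoDB variable names; the comments describe their defaults):

SHOW VARIABLES LIKE 'innodb_change_buffering';        -- which operations are buffered (default: all)
SHOW VARIABLES LIKE 'innodb_change_buffer_max_size';  -- maximum size as a percentage of the buffer pool (default: 25)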

11.3 Usage scenarios of change buffer

1. How should you choose between an ordinary index and a unique index? In fact, the two kinds of index are no different in query capability; the main consideration is the impact on update performance. So it is recommended to choose ordinary indexes wherever possible.
2. In practice, the combination of ordinary indexes and the change buffer clearly helps optimize updates on tables with large amounts of data.
3. If every update is immediately followed by a query of the same record, you should disable the change buffer. In other cases, the change buffer improves update performance.
4. Since unique indexes cannot use the change buffer optimization, if the business can tolerate it, it is recommended from a performance standpoint to prefer non-unique indexes. But what if the business "may not be able to guarantee it"?
First, business correctness comes first. Our discussion of performance assumed that "the business code already guarantees that no duplicate data is written". If the business cannot guarantee this, or requires the database itself to enforce the constraint, then you have no choice but to create a unique index.
Second, in some "archive database" scenarios, you can consider using ordinary indexes. For example, if online data only needs to be kept for half a year and historical data is moved into an archive database, then the archived data is already guaranteed to be free of unique key conflicts, and to improve archiving efficiency you can consider changing the table's unique indexes into ordinary indexes.

12. Other query optimization strategies

12.1 The difference between EXISTS and IN:

IN: performs a hash join between the outer table and the inner table. EXISTS: loops over the outer table, querying the inner table once per iteration.

When the two tables being queried are of similar size, there is little difference between IN and EXISTS.

If one of the two tables is smaller and the other larger: use EXISTS when the subquery table is the larger one, and IN when the subquery table is the smaller one; that is more efficient.

In other words, IN suits a large outer table with a small inner table; EXISTS suits a small outer table with a large inner table.
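
A sketch with hypothetical tables A and B showing the two forms being compared:

SELECT * FROM A WHERE id IN (SELECT id FROM B);                     -- favors a small inner table B
SELECT * FROM A WHERE EXISTS (SELECT 1 FROM B WHERE B.id = A.id);   -- favors a small outer table A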



12.2 Efficiency of COUNT(*) vs COUNT(specific field)
count(*) covers all columns and equals the number of rows; rows are not skipped when a column value is NULL.
count(1) likewise counts every row, using the constant 1 to stand for each row; rows are not skipped when a column value is NULL.
count(field) counts only the rows where that field is not NULL; NULL values are not counted.

In terms of efficiency, COUNT(*) ≈ COUNT(1) > COUNT(field), and since COUNT(*) is the standard counting syntax defined by SQL92, COUNT(*) is recommended.

12.3 About SELECT *
In table queries, specify the fields explicitly; do not use * as the query's field list. Use SELECT <field list> instead. Reasons:
① During parsing, MySQL looks up the data dictionary to expand "*" into all the column names in order, which costs considerable resources and time.
② A covering index cannot be used.


12.4 The impact of LIMIT 1 on optimization
This applies to SQL statements that scan the whole table. If you are sure there is only one row in the result set, then adding LIMIT 1 stops the scan as soon as a match is found, which speeds up the query.
If the table already has a unique index on the field, the query can go through the index without scanning the whole table, and there is no need to add LIMIT 1.
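
A sketch with a hypothetical user table: name has no index, so LIMIT 1 lets the full scan stop at the first hit; phone has a unique index, so LIMIT 1 adds nothing:

SELECT * FROM user WHERE name = 'zhangsan' LIMIT 1;   -- full scan stops at the first match
SELECT * FROM user WHERE phone = '13800000000';       -- unique index: at most one row anyway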


12.5 Use COMMIT more often
Whenever possible, use COMMIT as much as possible in your programs. This improves the program's performance, because COMMIT releases resources and so reduces the demand for them.
Resources released by COMMIT:
information in the rollback segments used to recover data;
locks acquired by the program's statements;
space in the redo/undo log buffer;
the internal overhead of managing the three resources above.
