Summary of MySQL index knowledge points

Author: fanili, backend development engineer at Tencent WXG

Know what it is, and know why it works! This article introduces index data structures, search algorithms, common index concepts, and the scenarios in which an index fails.

What is an index?

In a relational database, an index is a separate physical storage structure that sorts the values of one or more columns of a table. It consists of the values of one or more columns together with a list of logical pointers to the data pages that physically hold those values. An index works like the table of contents of a book: you look up the page number in the contents and jump straight to what you need. (Baidu Encyclopedia)

The purpose of an index is to make lookups more efficient: the values in the table are sorted and stored in a particular data structure.

This article starts with a case and then summarizes index knowledge in terms of data structures, index types, key concepts, and how to use indexes to improve lookup efficiency.

Start with a case

Symptoms

A legacy SQL statement in the business overloaded the DB server whenever it ran, which blocked related services and kept them from completing on time. The CPU monitoring curve is as follows:

Figure 1-CPU usage before optimization

The DB's CPU usage curve shows that the business had been running in a "sub-healthy" state (1), where problems could occur at any time as the business grew. Such a problem (2) did appear in the early morning of November 11: the DB CPU stayed at 100% load and there were a large number of slow queries. The business was finally restored by killing processes to reduce the DB load and by reducing the number of business processes (3).

On the afternoon of November 11, the SQL statement was optimized, with the following effect. The peak CPU usage during business operation dropped sharply (compare 1, 2, and 3 in Figure 2), and slow queries almost disappeared from the monitoring curve (compare 1, 2, and 3 in Figure 3).

Figure 2-CPU usage before and after optimization
Figure 3-The number of slow queries before and after optimization

Analysis

Table structure

CREATE TABLE T_Mch******Stat (
`FStatDate` int unsigned NOT NULL DEFAULT 19700101 COMMENT 'statistics date',
`FMerchantId` bigint unsigned NOT NULL DEFAULT 0 COMMENT 'merchant ID',
`FVersion` int unsigned NOT NULL DEFAULT 0 COMMENT 'data version number',
`FBatch` bigint unsigned NOT NULL DEFAULT 0 COMMENT 'statistics batch',
`FTradeAmount` bigint NOT NULL DEFAULT 0 COMMENT 'transaction amount',
PRIMARY KEY (`FStatDate`,`FMerchantId`,`FVersion`),
INDEX i_FStatDate_FVersion (`FStatDate`,`FVersion`))
DEFAULT CHARSET = utf8 ENGINE = InnoDB;

From the CREATE TABLE statement we can see that the table has two indexes:

  1. The primary key index is a composite index on the columns FStatDate, FMerchantId and FVersion;

  2. The ordinary index is also a composite index, on the columns FStatDate and FVersion.

SQL statement A, before optimization (partially trimmed):

SELECT SQL_CALC_FOUND_ROWS FStatDate,
    FMerchantId,
    FVersion,
    FBatch,
    FTradeAmount,
    FTradeCount
FROM T_Mch******Stat_1020
WHERE FStatDate = 20201020
    AND FVersion = 0
    AND FMerchantId > 0
ORDER BY FMerchantId ASC LIMIT 0, 8000

Running EXPLAIN on this statement gives the following result. The Extra field shows only Using where (without Using index), which means the query cannot be answered from the index alone.

SQL statement B, after optimization (partially trimmed):

SELECT SQL_CALC_FOUND_ROWS a1.FStatDate,
    a1.FMerchantId,
    a1.FVersion,
    FBatch,
    FTradeAmount,
    FTradeCount
FROM T_Mch******Stat_1020 a1, (
    SELECT FStatDate, FMerchantId, FVersion
    FROM T_Mch******Stat_1020
    WHERE FStatDate = 20201020
        AND FVersion = 0
        AND FMerchantId > 0
        ORDER BY FMerchantId ASC LIMIT 0, 8000 ) a2
where a1.FStatDate = a2.FStatDate
    and a1.FVersion = a2.FVersion
    and a1.FMerchantId = a2.FMerchantId;

The key step of the optimization is:

  • Add a subquery whose select list contains only the primary key columns.

The EXPLAIN result of this statement is as follows. The subquery uses an index, and the results in production also confirm that the optimization is significant.

Questions

The optimized statement B is much more complex than the original statement A (subquery, join with a temporary table, and so on). How can it be more efficient? That seems counterintuitive. Three questions arise:

  1. The columns in the WHERE clause of statement A are all part of the primary key; is the primary key index used?

  2. Why can the subquery of statement B use an index?

  3. How do the execution flows of the two statements differ?

Index data structure

In MySQL, indexes are implemented at the storage engine layer, and different storage engines implement them differently according to the characteristics of their target scenarios. Here we first introduce the familiar ordered array, hash table and search tree, and finally look at the B+ tree supported by the InnoDB engine.

Ordered array

The array is a fundamental data structure covered in any book on data structures and algorithms. An ordered array is just what its name says: the data is stored in the array in ascending order of key. It is well suited to both equality queries and range queries.

ID:1 ID:2 ...... ID:N
name1 name2 ...... nameN

Provided the ID values are unique, the array above is stored in ascending order of ID. To look up the name for a particular ID, binary search finds it quickly, with time complexity O(log n).

// Recursive implementation of binary search
int binary_search(const int arr[], int start, int end, int key)
{
    if (start > end)
        return -1;

    int mid = start + (end - start) / 2;
    if (arr[mid] > key)
        return binary_search(arr, start, mid - 1, key);
    else if (arr[mid] < key)
        return binary_search(arr, mid + 1, end, key);
    else
        return mid;
}

The advantages of ordered arrays are obvious, and so are their disadvantages. They are only suitable for static data: inserting a new element requires moving data (allocating new space, copying data, releasing the old space), which consumes resources.
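To make the insertion cost concrete, here is a minimal sketch in C of inserting into a sorted array. The function name sorted_insert and the fixed-capacity buffer are assumptions made for illustration and do not come from the original article; the point is only that every element after the insertion position has to be shifted.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: insert key into arr[0..*len) keeping ascending order.
 * Returns the number of elements that had to be shifted, or -1 if the
 * buffer is full (a real store would have to reallocate and copy). */
long sorted_insert(int arr[], size_t *len, size_t cap, int key)
{
    if (*len >= cap)
        return -1;

    /* Binary search for the first position whose value is > key. */
    size_t lo = 0, hi = *len;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (arr[mid] <= key)
            lo = mid + 1;
        else
            hi = mid;
    }

    /* Shift the tail one slot to the right: this is the O(n) part. */
    size_t moved = *len - lo;
    memmove(&arr[lo + 1], &arr[lo], moved * sizeof arr[0]);
    arr[lo] = key;
    (*len)++;
    return (long)moved;
}
```

Finding the position costs O(log n), but the shift costs O(n), so the closer the new key is to the front of the array, the more data has to move.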

Hash

A hash table is a structure that stores data as key-value (KV) pairs: given the key K, we can find the corresponding value V. The idea of hashing is to use a hash function to convert K into a position in an array and to store V at that position. If different keys map to the same position, a linked list is attached to that position to hold them. Hash tables suit equality queries but cannot serve range queries.
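The description above can be sketched as a small chained hash table in C. Everything here (the names ht_put and ht_get, the bucket count, and the hash function) is an assumption chosen for illustration, not the structure any MySQL engine actually uses.

```c
#include <stdlib.h>
#include <string.h>

#define NBUCKETS 8   /* assumed small table size, for illustration only */

struct entry {
    char *key;
    int value;
    struct entry *next;   /* chain of entries that hash to the same bucket */
};

struct hash_table {
    struct entry *buckets[NBUCKETS];
};

/* Simple multiplicative string hash (illustrative only). */
static unsigned hash_key(const char *key)
{
    unsigned h = 0;
    while (*key)
        h = h * 31u + (unsigned char)*key++;
    return h % NBUCKETS;
}

/* Insert or update; returns 0 on success, -1 on allocation failure. */
int ht_put(struct hash_table *t, const char *key, int value)
{
    unsigned b = hash_key(key);
    for (struct entry *e = t->buckets[b]; e; e = e->next) {
        if (strcmp(e->key, key) == 0) {
            e->value = value;          /* key already present: overwrite */
            return 0;
        }
    }
    struct entry *e = malloc(sizeof *e);
    if (!e)
        return -1;
    e->key = malloc(strlen(key) + 1);
    if (!e->key) {
        free(e);
        return -1;
    }
    strcpy(e->key, key);
    e->value = value;
    e->next = t->buckets[b];           /* push onto the front of the chain */
    t->buckets[b] = e;
    return 0;
}

/* Equality lookup; returns 1 and stores the value if found, else 0. */
int ht_get(const struct hash_table *t, const char *key, int *value)
{
    for (const struct entry *e = t->buckets[hash_key(key)]; e; e = e->next) {
        if (strcmp(e->key, key) == 0) {
            *value = e->value;
            return 1;
        }
    }
    return 0;
}
```

Because the hash function scatters neighbouring keys across unrelated buckets, a range query has no starting point to seek to and would have to visit every bucket.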

Binary search tree

A binary search tree, also called an ordered binary tree or sorted binary tree, is either an empty tree or a binary tree with the following properties:

  1. If the left subtree of a node is not empty, the values of all nodes in the left subtree are less than the value of that node;

  2. If the right subtree of a node is not empty, the values of all nodes in the right subtree are greater than or equal to the value of that node;

  3. The left and right subtrees of every node are themselves binary search trees.

Compared with other data structures, the advantage of a binary search tree is that both search and insertion have low time complexity, O(log n). To maintain O(log n) query complexity, the tree has to be kept balanced.

Search algorithm for a value x in a binary search tree b:

  1. If b is an empty tree, the search fails; otherwise:

  2. If x equals the value of the root node of b, the search succeeds; otherwise:

  3. If x is less than the value of the root node of b, search the left subtree; otherwise:

  4. Search the right subtree.
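The four search steps can be written as a short loop in C. The struct layout and the name bst_search are assumptions made for this sketch; the logic follows the steps listed above one by one.

```c
#include <stddef.h>

struct bst_node {
    int key;
    struct bst_node *left;
    struct bst_node *right;
};

/* Iterative form of the four search steps; returns NULL when x is absent. */
const struct bst_node *bst_search(const struct bst_node *b, int x)
{
    while (b != NULL) {                  /* step 1: empty (sub)tree, fail */
        if (x == b->key)
            return b;                    /* step 2: found */
        b = (x < b->key) ? b->left       /* step 3: search the left subtree */
                         : b->right;     /* step 4: search the right subtree */
    }
    return NULL;
}
```

Each iteration descends one level, so the cost is proportional to the height of the tree, which is O(log n) only while the tree stays balanced.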

Compared with ordered arrays and hash tables, binary search trees perform well on both search and insertion. Continued optimization of this idea led to the N-ary tree.

B+ tree

The InnoDB storage engine supports B+ tree indexes, full-text indexes and hash indexes. The hash index in InnoDB is adaptive: the engine builds it automatically according to how the table is used, without human intervention. The B+ tree index is the most common index in relational databases, and it is the protagonist of this article.

Data structure

The previous sections briefly introduced ordered arrays and binary search trees, so binary search and binary trees should now be familiar. The formal definition of a B+ tree is fairly complicated, but understanding how an index works does not require that depth; it is enough to understand how the data is organized and how it is searched. We can simply think of a B+ tree as a combination of an N-ary tree and an ordered array.

E.g:

Three advantages of the B+ tree:

  1. The tree has fewer levels, so fewer I/O operations are needed;

  2. Every query has to reach a leaf node, so query performance is stable;

  3. The leaf nodes form an ordered linked list, which makes range queries convenient.
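The first advantage can be checked with a back-of-the-envelope calculation. The sketch below computes how many levels a tree needs for a given number of keys and a given fan-out; the function name tree_levels and the fan-out values used in the note are assumptions for illustration, not measurements of InnoDB.

```c
/* Smallest number of levels h such that fanout^h >= rows.
 * fanout must be at least 2; intended for modest inputs (no overflow check). */
unsigned tree_levels(unsigned long long rows, unsigned long long fanout)
{
    unsigned levels = 1;
    unsigned long long capacity = fanout;   /* keys reachable with one level */
    while (capacity < rows) {
        capacity *= fanout;
        levels++;
    }
    return levels;
}
```

With an assumed fan-out of 1000 per node, tree_levels(1000000000ULL, 1000) gives 3 levels for a billion keys, while a balanced binary tree (fan-out 2) needs 30. If every level costs one page read, that is the difference between 3 and 30 I/O operations.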

Operations

  • Find

Traverse the tree from the root downwards. In each node, use binary search on the separator values to decide which child pointer to follow, and repeat until the leaf node is reached; binary search then determines the position within the leaf.

  • Insert

  • Delete

Note: The tables for the insert and delete operations are taken from "MySQL Technology Insider: InnoDB Storage Engine".
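The Find operation can be sketched in C as follows. This is a simplified in-memory model: the node layout, the tiny MAX_KEYS value, and the names bpt_node and bpt_find are all assumptions for illustration, and the convention that child[i + 1] holds the keys greater than or equal to keys[i] is one possible choice. Insert and delete (with page splits and merges) are not covered.

```c
#include <stddef.h>

#define MAX_KEYS 4   /* assumed tiny node size so the example stays readable */

struct bpt_node {
    int is_leaf;
    int nkeys;
    int keys[MAX_KEYS];                    /* sorted ascending */
    struct bpt_node *child[MAX_KEYS + 1];  /* used by internal nodes only */
    int values[MAX_KEYS];                  /* used by leaf nodes only */
};

/* Index of the first key in the node that is > key (binary search). */
static int upper_bound(const struct bpt_node *n, int key)
{
    int lo = 0, hi = n->nkeys;
    while (lo < hi) {
        int mid = lo + (hi - lo) / 2;
        if (n->keys[mid] <= key)
            lo = mid + 1;
        else
            hi = mid;
    }
    return lo;
}

/* Returns 1 and stores the value if key exists, else 0. */
int bpt_find(const struct bpt_node *root, int key, int *value)
{
    const struct bpt_node *n = root;
    if (n == NULL)
        return 0;
    while (!n->is_leaf)
        n = n->child[upper_bound(n, key)];  /* follow the separator values */
    int i = upper_bound(n, key) - 1;        /* last key <= search key */
    if (i >= 0 && n->keys[i] == key) {
        *value = n->values[i];
        return 1;
    }
    return 0;
}
```

Every lookup walks from the root to a leaf, so the number of nodes visited is the same for every key, which is the "stable query performance" mentioned earlier.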

Fill factor (innodb_fill_factor): the percentage of space on each B-tree page that is filled while an index is being built, with the remaining space reserved for future index growth. As the insert and delete operations show, the fill factor affects how often data pages split and merge. A smaller value reduces the frequency of splits and merges but makes the index take more disk space; a larger value does the opposite. By default InnoDB leaves 1/16 of the space in clustered index pages free for subsequent inserts and updates.

InnoDB B+ tree index

The previous section introduced the basic data structures behind indexes. Now, from InnoDB's point of view, we look at how a B+ tree is used to build an index, how the index works, and how to use it to improve lookup efficiency.

Clustered index and non-clustered index

B+ tree indexes in a database can be divided into clustered and non-clustered indexes. The difference between them is whether the leaf nodes hold the complete row.

An InnoDB table is an index-organized table, that is, its data is stored in primary key order. The clustered index is a B+ tree built on the table's primary key, and its leaf nodes store the complete row records. The leaf nodes of a non-clustered index do not contain the full row; in InnoDB they hold the primary key value.

How is the clustered index built if the table has no primary key? When there is no primary key (and no unique index on NOT NULL columns that could stand in for one), InnoDB generates a 6-byte RowId for each record and builds the clustered index on it.

How a SELECT statement finds records

The following example shows how index data is organized and how a SELECT statement looks up data.

  • Table building statement:

create table T (
    ID int primary key,
    k int NOT NULL DEFAULT 0,
    s varchar(16) NOT NULL DEFAULT '',
    index k(k)
) engine=InnoDB DEFAULT CHARSET=utf8;

insert into T values(100, 1, 'aa'),(200, 2, 'bb'),(300, 3, 'cc'),(500, 5, 'ee'),(600,6,'ff'),(700,7,'gg');
  • Index structure diagram

On the left is the clustered index built on the primary key ID, whose leaf nodes store the complete row; on the right is the ordinary index built on column k, whose leaf nodes store the primary key ID.

  • Select statement execution process

select * from T where k between 3 and 5;

The execution process is as follows:

  1. Find the record with k=3 in the k index tree and obtain ID=300;

  2. Go to the ID index tree and find R3, the row for ID=300;

  3. Take the next value, k=5, in the k index tree and obtain ID=500;

  4. Go back to the ID index tree and find R4, the row for ID=500;

  5. Take the next value, k=6, in the k index tree. It does not satisfy the condition, so the loop ends.

This process introduces an important concept: going back to the table, that is, returning to the primary key index tree to fetch the row. Avoiding back-to-table lookups is a standard and important way to make SQL queries more efficient. So how can they be avoided?

Note: This example comes from "45 Lectures on MySQL Actual Combat".
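The lookup steps above can be mimicked with two sorted arrays standing in for the two index trees. This is only a sketch under that simplification: the names primary_idx, k_idx and select_k_between are made up for illustration, and a real B+ tree would descend from the root instead of scanning to the start of the range.

```c
#include <stddef.h>

struct row { int id; int k; const char *s; };   /* clustered index leaf: full row */
struct k_entry { int k; int id; };              /* secondary index leaf: k -> ID */

/* Sample data from table T, each array sorted by its own key. */
static const struct row primary_idx[] = {
    {100, 1, "aa"}, {200, 2, "bb"}, {300, 3, "cc"},
    {500, 5, "ee"}, {600, 6, "ff"}, {700, 7, "gg"},
};
static const struct k_entry k_idx[] = {
    {1, 100}, {2, 200}, {3, 300}, {5, 500}, {6, 600}, {7, 700},
};
#define NROWS (sizeof primary_idx / sizeof primary_idx[0])

/* Binary search on the primary key: one "back to the table" lookup. */
static const struct row *lookup_by_id(int id)
{
    size_t lo = 0, hi = NROWS;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (primary_idx[mid].id < id)
            lo = mid + 1;
        else
            hi = mid;
    }
    return (lo < NROWS && primary_idx[lo].id == id) ? &primary_idx[lo] : NULL;
}

/* select * from T where k between lo_k and hi_k.
 * Stores up to cap rows in out, returns the row count, and reports through
 * *back_to_table how many primary key lookups were needed. */
size_t select_k_between(int lo_k, int hi_k, const struct row *out[], size_t cap,
                        size_t *back_to_table)
{
    size_t n = 0, i = 0;
    *back_to_table = 0;
    while (i < NROWS && k_idx[i].k < lo_k)   /* locate the start of the range */
        i++;
    for (; i < NROWS && k_idx[i].k <= hi_k && n < cap; i++) {
        const struct row *r = lookup_by_id(k_idx[i].id);
        (*back_to_table)++;
        if (r)
            out[n++] = r;
    }
    return n;
}
```

For k between 3 and 5 this returns the rows with ID 300 and 500 and counts two back-to-table lookups, one per matching entry in the k index.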

Covering index

MySQL 5.7, table building statement:

CREATE TABLE `employees` (
  `emp_no` int(11) NOT NULL,
  `birth_date` date NOT NULL,
  `first_name` varchar(14) NOT NULL,
  `last_name` varchar(16) NOT NULL,
  `gender` enum('M','F') NOT NULL,
  `hire_date` date NOT NULL,
  PRIMARY KEY (`emp_no`),
  KEY `i_first_name` (`first_name`),
  KEY `i_hire_date` (`hire_date`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
  • SQL statement A

explain select * from employees where hire_date > '1990-01-14';

explain result:

  • SQL statement B

explain select emp_no from employees where hire_date > '1990-01-14';

explain result:

  • analysis

From the two EXPLAIN results we can see that the Extra field of statement A is Using where, while that of statement B is Using where; Using index. In other words, B is answered from the index alone, while A is not.

Just as index k in the earlier table T already contains the ID value, the index i_hire_date here already contains the primary key emp_no that statement B asks for, so there is no need to go back to the primary key index tree. The index "covers" the query, which is why it is called a covering index. Covering indexes reduce the number of tree searches and significantly improve query performance.

Leftmost match

  • SQL statement A

explain select * from employees where hire_date > '1990-01-14' and first_name like '%Hi%';

  • SQL statement B

explain select * from employees where hire_date > '1990-01-14' and first_name like 'Hi%';
  • analysis

Statement A uses an extreme pattern, first_name like '%Hi%': with wildcards on both sides, the statement cannot use the index. Once the leading '%' is removed, statement B can use the index. The leftmost match can be the leftmost N characters of an indexed string, or the leftmost M columns of a composite index. Planning indexes around leftmost matching lets you get by with fewer indexes and so saves disk space.
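Why a leading wildcard defeats the index can be seen on a plain sorted array of strings, used here as a stand-in for the index. The function names are assumptions for this sketch, and comparison is byte-wise and case-sensitive, which is simpler than MySQL's collation rules.

```c
#include <stddef.h>
#include <string.h>

/* First index in the sorted array whose string is >= prefix (binary search). */
static size_t lower_bound_str(const char *const names[], size_t n,
                              const char *prefix)
{
    size_t lo = 0, hi = n;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (strcmp(names[mid], prefix) < 0)
            lo = mid + 1;
        else
            hi = mid;
    }
    return lo;
}

/* LIKE 'prefix%': seek to the first candidate, then read until the prefix
 * stops matching. *examined counts the entries actually compared. */
size_t count_prefix(const char *const names[], size_t n, const char *prefix,
                    size_t *examined)
{
    size_t plen = strlen(prefix), hits = 0;
    *examined = 0;
    for (size_t i = lower_bound_str(names, n, prefix); i < n; i++) {
        (*examined)++;
        if (strncmp(names[i], prefix, plen) != 0)
            break;                 /* sorted order: no later entry can match */
        hits++;
    }
    return hits;
}

/* LIKE '%needle%': the sort order does not help, every entry is checked. */
size_t count_contains(const char *const names[], size_t n, const char *needle,
                      size_t *examined)
{
    size_t hits = 0;
    *examined = n;
    for (size_t i = 0; i < n; i++)
        if (strstr(names[i], needle) != NULL)
            hits++;
    return hits;
}
```

A prefix pattern turns into a contiguous range of the sorted data, so only the matches plus one extra entry are examined; a pattern with a leading wildcard can match anywhere, so all n entries have to be read.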

Index pushdown

What is index pushdown? Let's start with the following comparison, which runs the same SQL statement on MySQL 5.5 and MySQL 5.7:

select * from employees where hire_date > '1990-01-14' and first_name like 'Hi%';
  • Run EXPLAIN on MySQL 5.5: the Extra field shows that no index is used

The query takes 0.12s

  • Run EXPLAIN on MySQL 5.7: the Extra field shows that index pushdown is used

The query takes 0.02s

  • Index pushdown

The Extra field in the EXPLAIN result contains Using index condition, which indicates that index pushdown is used. Index pushdown has been supported since MySQL 5.6. Before 5.6, the i_first_name index was not used for filtering, and every candidate record had to be fetched from the primary key index so that the complete row could be compared. From 5.6 on, because the index i_first_name exists, the first_name value stored in the index can be used for filtering directly, so records that do not satisfy first_name like 'Hi%' never need a back-to-table lookup.
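The effect of pushing a condition down to the index can be illustrated by counting row fetches. The sketch below assumes a hypothetical composite index on (hire_date, first_name), which is not one of the indexes defined in this article, and stores dates as yyyymmdd integers; it only models where the filter runs, not how MySQL implements it.

```c
#include <stddef.h>
#include <string.h>

/* One entry of a hypothetical composite index (hire_date, first_name);
 * emp_no is the primary key stored in the leaf. */
struct idx_entry {
    int hire_date;              /* yyyymmdd, for simplicity */
    const char *first_name;
    int emp_no;
};

static int starts_with(const char *s, const char *prefix)
{
    return strncmp(s, prefix, strlen(prefix)) == 0;
}

/* Counts how many full rows must be fetched from the clustered index for
 *   hire_date > min_date AND first_name LIKE 'prefix%'
 * With pushdown = 0 every entry in the date range triggers a row fetch and
 * the name filter would run afterwards on the full row; with pushdown = 1
 * the name filter runs on the index entry first. */
size_t rows_fetched(const struct idx_entry idx[], size_t n, int min_date,
                    const char *prefix, int pushdown)
{
    size_t fetched = 0;
    for (size_t i = 0; i < n; i++) {
        if (idx[i].hire_date <= min_date)
            continue;           /* outside the range part of the condition */
        if (pushdown && !starts_with(idx[i].first_name, prefix))
            continue;           /* rejected using index columns only */
        fetched++;              /* back to the table for the full row */
    }
    return fetched;
}
```

Both modes return the same final result set; the difference is only in how many entries cause a back-to-table lookup before the name filter has had a chance to reject them.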

MRR optimization

MySQL has supported the Multi-Range Read (MRR) optimization since version 5.6. The purpose of MRR is to reduce random disk access by converting it into more sequential access, which can greatly improve the performance of I/O-bound queries. Let's look at a comparison first. The following statements are run on the same MySQL instance, and the mysql service is restarted before execution so that the cache is not warmed up.

  • Turn off MRR

SET @@optimizer_switch='mrr=off';
select * from employees where hire_date > '1990-01-14' and first_name like 'Hi%';

The execution takes 0.90s

  • Turn on MRR

SET @@optimizer_switch='mrr=on,mrr_cost_based=off';
select * from employees where hire_date > '1990-01-14' and first_name like 'Hi%';
  • analysis

The test results show that switching MRR from off to on cuts the time from 0.90s to 0.03s, making the query about 30 times faster.
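The idea behind MRR can be sketched by counting how often a reader has to move to a different data page. The page capacity ROWS_PER_PAGE, the mapping of primary key to page by integer division, and the function names are all assumptions for illustration; the real optimization works on a buffer of row IDs inside the server and applies to non-negative keys here only for simplicity.

```c
#include <stddef.h>
#include <stdlib.h>

#define ROWS_PER_PAGE 100   /* assumed page capacity, for illustration only */

static int cmp_int(const void *a, const void *b)
{
    int x = *(const int *)a, y = *(const int *)b;
    return (x > y) - (x < y);
}

/* Number of times the reader moves to a different page when rows are
 * fetched in the given primary key order (keys must be non-negative). */
size_t page_switches(const int ids[], size_t n)
{
    size_t switches = 0;
    int current = -1;
    for (size_t i = 0; i < n; i++) {
        int page = ids[i] / ROWS_PER_PAGE;
        if (page != current) {
            switches++;
            current = page;
        }
    }
    return switches;
}

/* MRR idea: sort the primary keys collected from the secondary index before
 * going back to the table, so rows on the same page are read together. */
void mrr_sort(int ids[], size_t n)
{
    qsort(ids, n, sizeof ids[0], cmp_int);
}
```

For the keys {510, 20, 530, 40, 250, 10, 260, 520}, fetching in secondary index order touches a different page on every one of the 8 reads, while fetching after mrr_sort needs only 3 page switches.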

Common index failure scenarios

Once an index has been created on a MySQL table, will a query necessarily use it? Not necessarily: there are scenarios in which the index fails. We add a composite index to the employees table; the examples that follow are based on this table.

alter table employees add index i_b_f_l(birth_date, first_name, last_name);
alter table employees add index i_h(hire_date);

Failure scenarios

  • Range query (>, <, <>)

explain select * from employees where hire_date > '1989-06-02';
  • The type of the query condition does not match the column type

alter table employees add index i_first_name (first_name);
explain select * from employees where first_name = 1;
  • The query condition applies a function to the column

explain select * from employees where CHAR_LENGTH(hire_date) = 10;
  • Fuzzy query

explain select * from employees where hire_date like '%1995';
  • The first column of the composite index is not used as a condition

explain select * from employees where last_name = 'Kalloufi' and first_name = 'Saniya';

Why does it fail?

  • Sequential reads perform better than scattered reads

    Does a range query always make the index fail?

    No! Change the query condition slightly and compare the EXPLAIN results: the new statement uses index pushdown, which shows that the index has not failed. Why?

    When a covering index cannot be used, the optimizer chooses a non-clustered index only when the amount of data to be accessed is small. On traditional mechanical disks, reading rows sequentially through the clustered index performs better than reading them in scattered order through a non-clustered index. Therefore, even when a non-clustered index exists, the optimizer may choose the clustered index once the data to be accessed reaches around 20% of the records in the table. Of course, FORCE INDEX can also be used to force a particular index.

explain select * from employees where hire_date > '1999-06-02';
  • The B+ tree index cannot be used for a fast lookup

    A B+ tree index supports fast lookups because its key values are stored in order, ascending from left to right, so a search can quickly pick its way through each level of nodes and finally find the value in a leaf node.

    Applying a function to the column prevents MySQL from using the index for a fast lookup, because the function may destroy the ordering of the indexed values, so the optimizer chooses not to use the index. A mismatched query condition type is really the same case, because it triggers an implicit type conversion.

    Fuzzy matching and omitting the first column of a composite index both fail because the starting position in the index cannot be located quickly. For fuzzy matching, when the condition has the form where A like 'a%', with a being the leftmost prefix of A, the index can be used (leftmost match). Whether it actually is used depends on the optimizer's estimate of the amount of data involved.
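The claim that a conversion destroys the index order can be checked with a few string keys. In this sketch the array stands for the key order of an index on a string column, and strtol stands in for the implicit conversion applied when such a column is compared with a number; the function names are assumptions, and MySQL's actual conversion rules are more involved.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* 1 if the strings are in ascending byte order, as a string index keeps them. */
int sorted_as_strings(const char *const v[], size_t n)
{
    for (size_t i = 1; i < n; i++)
        if (strcmp(v[i - 1], v[i]) > 0)
            return 0;
    return 1;
}

/* 1 if the same sequence is still ascending after each string has been
 * converted to a number. */
int sorted_as_numbers(const char *const v[], size_t n)
{
    for (size_t i = 1; i < n; i++)
        if (strtol(v[i - 1], NULL, 10) > strtol(v[i], NULL, 10))
            return 0;
    return 1;
}
```

The keys "1", "10", "100", "2", "20", "9" are in order as strings but not as numbers (100 comes before 2), so a binary search over the index by numeric value would go wrong. The same reasoning applies to any function wrapped around the indexed column.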

Back to the original case

Let us go back to the case at the beginning of the article and try to answer the 3 questions raised at that time.

-- Statement A
SELECT FStatDate, FMerchantId, FVersion, FBatch, FTradeAmount, FTradeCount
FROM T_Mch******Stat_1020
WHERE FStatDate = 20201020
    AND FVersion = 0
    AND FMerchantId > 0
ORDER BY FMerchantId ASC LIMIT 0, 8000;

-- Statement B
SELECT SQL_CALC_FOUND_ROWS a1.FStatDate,
    a1.FMerchantId,
    a1.FVersion,
    FBatch,
    FTradeAmount,
    FTradeCount
FROM T_Mch******Stat_1020 a1, (
    SELECT FStatDate, FMerchantId, FVersion
    FROM T_Mch******Stat_1020
    WHERE FStatDate = 20201020
        AND FVersion = 0
        AND FMerchantId > 0
        ORDER BY FMerchantId ASC LIMIT 0, 8000 ) a2
where a1.FStatDate = a2.FStatDate
    and a1.FVersion = a2.FVersion
    and a1.FMerchantId = a2.FMerchantId;

The query condition fields of SQL statement A are all in the primary key; is the primary key index used?

The primary key index is in fact used, but as a range scan: every record in the range has to be read and parsed one by one, and that is what makes the query slow.

Why can the subquery of SQL statement B use an index?

  1. As introduced earlier, in a clustered index the index key is the primary key.

  2. The difference between the two statements is that the columns selected by the subquery of statement B are all part of the primary key, whereas statement A selects other columns as well (FBatch, FTradeAmount and so on). The key values of the primary key index alone are therefore enough for the subquery of statement B, while statement A has to fetch and parse each entire row.

What is the difference in the execution flow of the two statements before and after?

  • Execution flow of SQL statement A:

  1. Scan the index entries one by one and check them against the query conditions

  2. If an entry meets the conditions, read the entire row and return it

  3. Go back to step 1 until all index records have been checked

  4. Sort all returned records that meet the conditions (complete rows)

  5. Take the first 8000 records and return them

  • Execution flow of SQL statement B:

  1. Scan the index entries one by one and check them against the query conditions

  2. If an entry meets the conditions, take the required column values from the index key and return them

  3. Go back to step 1 until all index records have been checked

  4. Sort all returned records that meet the conditions (each record holds only the 3 primary key columns)

  5. Take the first 8000 records to form a temporary table

  6. Join the temporary table with the main table, using primary key equality to fetch the 8000 rows

  • Comparing the two flows, the differences are concentrated in steps 2 and 4. In step 2, statement A has to read entire rows at random and parse them, which is very resource intensive; step 4 involves MySQL's sorting algorithm, which also affects execution efficiency. In terms of sorting, too, statement B does better than statement A.

Glossary

  • Primary key index

As the name suggests, this index is built on the table's primary key, with values sorted in ascending order from left to right. An InnoDB table has exactly one primary key index (the clustered index).

  • Normal index

The most common kind of index, with no special restrictions.

  • Unique index

The indexed columns cannot contain duplicate values, but NULL values are allowed.

  • Composite index

An index made up of multiple columns, often created to improve query efficiency.

Summary

The article began by introducing several common index data structures: ordered arrays, suited to static data; hash indexes, suited to KV lookups; and binary search trees, which balance query and insertion performance. It then introduced the B+ tree, InnoDB's usual index implementation, and the process by which a SELECT statement uses a B+ tree index to find records. Along the way we met several key concepts: back-to-table lookups, covering indexes, leftmost matching, index pushdown and MRR. After that we summarized the scenarios in which an index fails and the reasons behind them. Finally, we returned to the original case and analyzed how the statements before and after optimization differ in their use of indexes, and hence in execution efficiency.

This article covers only the basics of indexes, in the hope that it helps readers a little. As a summary of one stage of learning, its treatment of MySQL indexes is still shallow, and the topic deserves deeper use and study in the future.

What relieves worry? Only learning.

Origin blog.csdn.net/Tencent_TEG/article/details/111188936