Understanding the principles and characteristics of MySQL indexes | JD Logistics Technical Team

As developers, when we run into SQL that takes a long time to execute, almost everyone says "add an index." But what is an index, and what are its characteristics? Let's go through it briefly.

1 How indexes work and how to speed up queries

An index is like the table of contents of a book: a database object that speeds up access to the data in a table. When a request comes in and there is a table of contents, we can quickly locate the chapter and then find the data within it. Without one, the search is like looking for a needle in a haystack. This is the culprit we so often encounter: the full table scan.

An index record contains a key value (the values of all fields specified when the index was defined) plus a logical pointer (to a data page or to another index page). Because index records contain only the index field values and a pointer of 4-9 bytes, they are usually much smaller than the actual data rows, so index pages are much denser than data pages. One index page can hold many index records, which gives a big I/O advantage when searching the index. Understanding this is key to understanding why indexes help, and it is also the starting point for most performance optimization.

1) Access data without index:

2) Access data using a balanced binary tree structure index:

The first picture does not use an index: we search sequentially, matching row by row in storage order, and it takes 5 lookups to find the required data. The second picture uses a simple balanced binary tree index, and only 3 lookups are needed. This is with a small amount of data; the effect is even more obvious when the amount of data is large. In summary, the purpose of creating an index is to speed up data lookup.
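The difference between the two access paths can be sketched in a few lines of Python. This is only an illustration: a binary search over a sorted list stands in for the balanced binary tree, and the function names and the one-million-row data set are assumptions, not part of the original test.

```python
def linear_search(rows, target):
    """Scan rows in order; return (position, number of comparisons)."""
    for comparisons, value in enumerate(rows, start=1):
        if value == target:
            return comparisons - 1, comparisons
    return -1, len(rows)


def binary_search(sorted_rows, target):
    """Halve the search range each step; return (position, comparisons)."""
    lo, hi = 0, len(sorted_rows) - 1
    comparisons = 0
    while lo <= hi:
        mid = (lo + hi) // 2
        comparisons += 1
        if sorted_rows[mid] == target:
            return mid, comparisons
        if sorted_rows[mid] < target:
            lo = mid + 1
        else:
            hi = mid - 1
    return -1, comparisons


# Hypothetical data: one million sorted keys, searching for the last one.
rows = list(range(1, 1_000_001))
linear_result = linear_search(rows, 1_000_000)   # one comparison per row
binary_result = binary_search(rows, 1_000_000)   # at most 20 comparisons
print(linear_result, binary_result)
```

The gap between the two comparison counts grows with the size of the data, which is the point made above.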

2 Components and types of indexes

There are several common ways to implement an index, such as hash tables, ordered arrays, and trees. The following sections describe how these models differ in use.

2.1 Hash

The idea of hashing is simple: a hash function computes a value from the key being inserted (traditionally by taking a remainder, or by shift and XOR as in HashMap), and that value determines a position called a hash slot. The hash slot holds a pointer to the corresponding location on disk. In one sentence: a hash index stores the hash value of the index field together with a pointer to the disk location of the data.

However, no matter which algorithm is used, once the amount of data is large, different keys will inevitably land in the same hash slot. For example, "Wu" and "武" have the same pronunciation in a dictionary, so when you look one up you can only scan down the entries in order. Index processing works the same way: a linked list hangs off the slot and is traversed sequentially when needed.

  • Disadvantages: The index is unordered, so range queries perform poorly; a range query causes many disk accesses, and the time cost of that much I/O is hard to accept.
  • Advantages: Inserts are fast, because a new entry only needs to be appended.
  • Scenario: Equality queries, as in memcached. Not suitable for columns with many repeated values, which cause hash conflicts.
  • Summary: Think of it as a Java HashMap.
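A minimal sketch of such a hash index, assuming integer keys, eight slots, and made-up "row pointers" (all hypothetical). It shows a collision chain and why a range query has to visit every slot.

```python
class HashIndex:
    """Toy hash index: each slot holds a chain of (key, row_pointer) pairs."""

    def __init__(self, slot_count=8):
        self.slots = [[] for _ in range(slot_count)]

    def _slot(self, key):
        # Remainder-based hashing, as described above.
        return hash(key) % len(self.slots)

    def insert(self, key, row_pointer):
        # Insert is cheap: append to the chain of the target slot.
        self.slots[self._slot(key)].append((key, row_pointer))

    def lookup(self, key):
        # Equality lookup: one slot, then a sequential walk of its chain.
        return [ptr for k, ptr in self.slots[self._slot(key)] if k == key]

    def range_lookup(self, low, high):
        # Keys are not stored in order, so every slot must be inspected.
        return sorted(
            ptr for chain in self.slots for k, ptr in chain if low <= k <= high
        )


index = HashIndex()
for key in (1, 9, 17, 4):          # 1, 9 and 17 collide in slot 1
    index.insert(key, key * 100)   # hypothetical row pointer

print(index.lookup(9))             # [900]
print(index.range_lookup(1, 10))   # [100, 400, 900]
```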

2.2 Ordered array

If we need range queries, the performance of a hash index is unsatisfactory. This is where the advantages of an ordered array show.

To get the values between A and B from an ordered array, we only need to locate A with a binary search, which costs O(log(N)), and then traverse from A to B. In terms of speed this is about as fast as it gets. Updates, however, require a lot of work: to insert one item, all the data after it has to be moved, which wastes performance. So only data that rarely changes is suitable for indexing with an ordered array.

  • Disadvantages: Inserting new data requires moving all subsequent data, which is expensive.
  • Advantages: Queries are very fast, close to the theoretical optimum.
  • Scenario: Archive queries, log queries, and other data that rarely changes.
  • Summary: It is an array kept in sorted order.
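The trade-off can be sketched with Python's standard `bisect` module. The class name and sample keys are assumptions made for illustration; `insert` returns how many existing entries had to shift, to make the update cost visible.

```python
import bisect


class OrderedArrayIndex:
    """Toy ordered-array index over a sorted Python list."""

    def __init__(self, keys):
        self.keys = sorted(keys)

    def range_query(self, low, high):
        # Binary search for both bounds, then take the slice in between.
        start = bisect.bisect_left(self.keys, low)
        end = bisect.bisect_right(self.keys, high)
        return self.keys[start:end]

    def insert(self, key):
        # Every entry after the insertion point has to move one position.
        position = bisect.bisect_left(self.keys, key)
        moved = len(self.keys) - position
        self.keys.insert(position, key)
        return moved


array_index = OrderedArrayIndex([50, 10, 40, 20, 30])
in_range = array_index.range_query(15, 40)   # [20, 30, 40]
moved = array_index.insert(5)                # 5 entries shift to make room
print(in_range, moved)
```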

2.3 Binary search tree

The basic principle is that every node in the left subtree is smaller than the parent node, and every node in the right subtree is larger than the parent node.

A binary search tree has a query complexity of O(log(N)) in principle. To keep the tree balanced, updates also cost O(log(N)). However, with a lot of data the tree becomes very tall, and each level can mean another disk access, which is undesirable. In the extreme case where the tree is not kept balanced, it degenerates into a linked list and the query complexity degrades to O(N).

Evolving it into a multiway tree, where each node has multiple children, greatly reduces the height of the tree and therefore the number of disk accesses.

  • Disadvantages: When the amount of data is large, the tree becomes too tall, resulting in many disk accesses.
  • Advantages: Evolving it into a multiway tree reduces the tree height and the number of disk accesses.
  • Scenario: Applicable to many scenarios.
  • Summary: Smaller on the left, larger on the right.

2.4 B-tree

The idea is to store multiple elements in each node, packing as much data into each node as possible. Each node can store about 1,000 index entries (16 KB / 16 B ≈ 1,000), which turns the binary tree into a multiway tree. Increasing the fan-out changes the tree from tall and thin to short and fat. To index 1 million rows, the tree needs only 2 levels (1,000 × 1,000 = 1 million), so only 2 disk I/Os are needed to find a row. Fewer disk I/Os means more efficient queries.

This data structure is called a B-tree. The B-tree is a multi-branch balanced search tree.
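The arithmetic behind "tall and thin" versus "short and fat" can be checked with a small helper. The function name is hypothetical, and the 16 KB page and 16-byte entry are the figures quoted above; the exact quotient is 1,024, which the text rounds to 1,000.

```python
def tree_height(record_count, fanout):
    """Smallest number of levels h such that fanout ** h >= record_count."""
    height, capacity = 0, 1
    while capacity < record_count:
        capacity *= fanout
        height += 1
    return height


entries_per_node = 16 * 1024 // 16                  # 1024 entries per 16 KB node
binary_height = tree_height(1_000_000, 2)           # 20 levels
multiway_height = tree_height(1_000_000, 1000)      # 2 levels
print(entries_per_node, binary_height, multiway_height)
```

Integer multiplication is used instead of a floating-point logarithm so that exact powers such as 1,000 × 1,000 are not affected by rounding.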

2.5 B+ tree

The main difference between a B+ tree and a B-tree is whether non-leaf nodes store data.

  • B-tree: Both non-leaf nodes and leaf nodes store data.
  • B+ tree: Only leaf nodes store data; non-leaf nodes store only key values. Leaf nodes are linked with bidirectional pointers, so the bottom level forms a bidirectional ordered linked list.

Precisely because the leaf nodes of a B+ tree are linked together, a range query can proceed quickly along the leaves once the lower bound has been found, which is faster than an ordinary in-order traversal.
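A rough sketch of that leaf-level range scan. It models only the forward pointers of the leaf chain, and a binary search over the leaves stands in for the descent through the non-leaf nodes; the class and function names are assumptions.

```python
import bisect


class Leaf:
    """One leaf node: a small sorted run of keys plus a pointer to the next leaf."""

    def __init__(self, keys):
        self.keys = keys
        self.next = None


def build_leaf_chain(sorted_keys, leaf_size):
    leaves = [
        Leaf(sorted_keys[i:i + leaf_size])
        for i in range(0, len(sorted_keys), leaf_size)
    ]
    for left, right in zip(leaves, leaves[1:]):
        left.next = right
    return leaves


def find_start_leaf(leaves, low):
    # Stand-in for walking down the non-leaf levels of the tree.
    last_keys = [leaf.keys[-1] for leaf in leaves]
    position = bisect.bisect_left(last_keys, low)
    return leaves[position] if position < len(leaves) else None


def range_scan(start_leaf, low, high):
    # Follow the leaf chain from the lower bound until a key exceeds the upper bound.
    result = []
    leaf = start_leaf
    while leaf is not None:
        for key in leaf.keys:
            if key > high:
                return result
            if key >= low:
                result.append(key)
        leaf = leaf.next
    return result


leaves = build_leaf_chain(list(range(1, 13)), leaf_size=4)
scanned = range_scan(find_start_leaf(leaves, 6), 6, 10)   # [6, 7, 8, 9, 10]
print(scanned)
```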

3 Index maintenance

When a row is inserted, the index must do some work to keep the data in order. Auto-increment values can generally be appended directly at the end. If instead the data has to be inserted in the middle, all subsequent data must be moved, which hurts efficiency.

In the worst case, if the current data page (a page is the smallest unit of MySQL storage) is full, a new data page has to be allocated and some of the rows moved to it. This process is called page splitting, and it hurts performance when it happens. But MySQL does not split pages blindly. If every page were split down the middle, then for an auto-increment primary key roughly half of each page would be wasted. MySQL decides how to split based on the type of index and on its tracking of recent inserts; this tracking information is generally stored in the header of the data page. For scattered inserts, the page is split in the middle. For sequential inserts, the split usually starts at the insertion point or a few rows after it. In this way MySQL decides whether to split in the middle or at the end.

If irregular data is inserted, with no guarantee that each value is larger than the previous one, the splitting logic described above is triggered, eventually producing the following effect.

So in most cases we should use an auto-increment primary key, unless the business requires a custom primary key. It is best to have only one index, and for that index to be unique. This avoids table returns, which force a query to search two trees, and it keeps the data pages ordered so that the index is used more effectively.
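The effect of the split strategy on sequential inserts can be illustrated with a toy model. This is not how InnoDB is implemented: the page capacity of 4, the `insert` function, and the `split_at_end` flag are all assumptions chosen to keep the simulation small.

```python
import bisect

PAGE_CAPACITY = 4   # hypothetical: real pages hold far more rows


def insert(pages, key, split_at_end):
    """Insert key into a list of sorted pages, splitting a page when it is full."""
    if not pages:
        pages.append([key])
        return
    # Target page: the last page whose first key is <= key (or the first page).
    first_keys = [page[0] for page in pages]
    i = max(bisect.bisect_right(first_keys, key) - 1, 0)
    page = pages[i]
    if len(page) < PAGE_CAPACITY:
        bisect.insort(page, key)
        return
    if split_at_end and key > page[-1]:
        # Sequential insert: leave the full page alone and start a new one.
        pages.insert(i + 1, [key])
        return
    # Middle split: each half keeps about half of the rows.
    middle = len(page) // 2
    left, right = page[:middle], page[middle:]
    pages[i:i + 1] = [left, right]
    bisect.insort(right if key >= right[0] else left, key)


middle_split_pages, end_split_pages = [], []
for key in range(1, 17):   # auto-increment style keys 1..16
    insert(middle_split_pages, key, split_at_end=False)
    insert(end_split_pages, key, split_at_end=True)

print(len(middle_split_pages), len(end_split_pages))   # 7 pages vs 4 pages
```

With middle splits only, most pages end up half full; splitting at the insertion point keeps every page full for sequential keys.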

4 Table return

In plain terms: if the columns required by the select are all contained in the index (MySQL indexes are sorted by the index column values, so those values are present in the index nodes), or the record can be obtained from a single index lookup, there is no need to return to the table. If the select requires many non-index columns, the index is first used to find the primary key, and then the corresponding columns are looked up in the table. This is called a table return.

To explain table returns, we first have to introduce clustered and non-clustered indexes.
The leaf nodes of an InnoDB clustered index store the row records, so an InnoDB table has one and only one clustered index:

  • If the table defines a primary key, the primary key is the clustered index;
  • If the table does not define a primary key, the first NOT NULL unique index is the clustered index;
  • Otherwise, InnoDB creates a hidden row ID as the clustered index.

When we query through an ordinary index, we first search the ordinary index tree to get the primary key ID, and then search again in the primary key index tree, because the leaf nodes of a non-primary-key index store only the primary key ID. Although this process uses an index, underneath it performs two index lookups. This is the table return. In other words, a query on a non-primary-key index has to search one more index tree. We should therefore prefer primary key queries in our applications, or, for high-frequency requests, build a suitable joint index to avoid table returns.
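The two lookups can be mimicked with two dictionaries, one per index tree. The table contents and function names below are made up for illustration; the counter shows one search in the secondary index plus one search in the clustered index for each matching row.

```python
# Clustered index: primary key -> full row (hypothetical sample data).
clustered_index = {
    1: {"id": 1, "name": "alice", "age": 30},
    2: {"id": 2, "name": "bob", "age": 25},
    3: {"id": 3, "name": "carol", "age": 41},
}

# Secondary index on name: key value -> primary key only, not the row.
name_index = {"alice": 1, "bob": 2, "carol": 3}


def query_by_id(pk):
    """Primary key query: a single search in the clustered index."""
    return clustered_index.get(pk), 1


def query_by_name(name):
    """Secondary index query: find the primary key, then return to the table."""
    tree_searches = 1                      # search the name index
    pk = name_index.get(name)
    if pk is None:
        return None, tree_searches
    tree_searches += 1                     # table return: search the clustered index
    return clustered_index[pk], tree_searches


by_id = query_by_id(2)
by_name = query_by_name("bob")
print(by_id[1], by_name[1])   # 1 search vs 2 searches
```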

5 Covering index

In one sentence: all the column data required by the SQL can be obtained from a single index tree, with no table return, so the query is faster. In practice, a covering index is in effect whenever the Extra field of the execution plan shows Using index.

The common optimization is the one mentioned above: build all the queried fields into the index. As for whether the DBA is willing to let you build it, that is a battle you have to fight yourself.

Typical scenarios for covering indexes include optimizing full-table count queries, avoiding table returns for column queries, and avoiding table returns for paging queries. MySQL also handles the case where a query hits one field of a joint index and the only other field required is id: no table return is needed. Because the primary key is stored in the leaves of the secondary index, this also counts as a covering index and incurs no additional cost.
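The rule for when a table return can be skipped comes down to a set comparison. The helper below is a hypothetical sketch of that rule, not something MySQL exposes: a query is covered when every selected column is either in the index or part of the primary key stored in its leaves.

```python
def needs_table_return(selected_columns, index_columns, primary_key_columns):
    """True if some selected column is stored only in the clustered index."""
    available = set(index_columns) | set(primary_key_columns)
    return not set(selected_columns) <= available


# Hypothetical joint index on (name, age) with primary key id.
covered = needs_table_return(["name", "id"], ["name", "age"], ["id"])
not_covered = needs_table_return(["name", "address"], ["name", "age"], ["id"])
print(covered, not_covered)   # False True
```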

6 Leftmost matching principle

To put it simply, a condition such as like 'xx%' can also use the index, because it matches the leftmost characters of the indexed value.
For a joint index, here is an example: create a joint index on (a, b).

You can see that the values of a are in order (1, 1, 2, 2, 3, 3), while the values of b (1, 2, 1, 4, 1, 2) are not. However, among rows with equal a, the b values are in order, so the ordering of b is only relative. This is because MySQL builds a joint index by sorting on the leftmost field first and then, within equal values of that field, sorting on the second field. Therefore a query condition such as b=2 alone cannot use the index. For example, I create the index
KEY idx_time_zone (time_zone, time_string) USING BTREE
and execute the first SQL, which results in a full table scan.

Executing the second SQL, you can see that the index is used.

Now look at two more SQLs against the same index, KEY idx_time_zone (time_zone, time_string) USING BTREE.

According to normal logic, the second sql does not conform to the order of the index fields, and the index should not be used. However, the actual situation is different from what we expected. Why is this?

Since MySQL was acquired by Oracle, it has incorporated many of Oracle's existing techniques. Higher versions of MySQL automatically optimize the order of where conditions. Put simply, this is the job of the query optimizer: the SQL is preprocessed, and whichever rewrite gives the better query is used.

By the way, here are some of the things MySQL's query optimizer does for you.
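The ordering of the joint index (a, b) described earlier in this section can be reproduced with a sorted list of tuples. This is only an analogy for the B+ tree leaf order; the function names are assumptions, and the data is the six (a, b) pairs from the example.

```python
import bisect

# Entries of the joint index (a, b), sorted by a first and then by b.
index_entries = sorted([(1, 1), (1, 2), (2, 1), (2, 4), (3, 1), (3, 2)])


def find_by_a(a):
    """Leftmost column given: binary search locates a contiguous run."""
    start = bisect.bisect_left(index_entries, (a,))
    end = bisect.bisect_left(index_entries, (a + 1,))
    return index_entries[start:end]


def find_by_a_and_b(a, b):
    """Both columns given: binary search on the full (a, b) key."""
    position = bisect.bisect_left(index_entries, (a, b))
    return position < len(index_entries) and index_entries[position] == (a, b)


def find_by_b(b):
    """Only b given: b is not globally ordered, so every entry is examined."""
    matches = [entry for entry in index_entries if entry[1] == b]
    return matches, len(index_entries)


print(find_by_a(2))            # [(2, 1), (2, 4)]
print(find_by_a_and_b(2, 4))   # True
print(find_by_b(2))            # ([(1, 2), (3, 2)], 6)
```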

6.1 Conditional transformation

For example, from where a=b and b=2 the optimizer can derive a=2 by propagating the condition, so the final SQL is a=2 and b=2. Conditions using >, <, =, and like can be propagated in this way.

6.2 Elimination of invalid conditions

For example, in where 1=1 and a=2, the condition 1=1 is always true, so it is optimized away and only a=2 remains.
A condition such as where 1=0 is always false. It is also eliminated, and the entire SQL becomes invalid. A condition where a is null on a field that cannot be NULL is eliminated as well.

6.3 Calculate in advance

Parts that contain mathematical operations on constants are computed for you: where a = 1+2 becomes where a = 3.

6.4 Access types

When evaluating a conditional expression, MySQL determines the access type of the expression. Here are some access types, ordered from best to worst:

  • system: a system table, and a constant table
  • const: a constant table
  • eq_ref: a unique or primary index, accessed with '='
  • ref: an index accessed with '='
  • ref_or_null: an index accessed with '=', where the value may be NULL
  • range: an index accessed with BETWEEN, IN, >=, LIKE, etc.
  • index: a full index scan
  • ALL: a full table scan

If you read execution plans often, you will recognize these at a glance. Here is an example.

Take where index_col=2 and normal_col=3: index_col=2 will be selected as the driving expression. When MySQL chooses an execution plan for a SQL statement, there may be several execution paths. One is a full table scan, filtering each row on both the indexed and the non-indexed field. The other is to use the index tree of the indexed field to find the entries with key value 2, and then check whether those results also match the non-indexed field. Under normal circumstances an index lookup requires fewer disk reads than a full table scan, so it is considered the better execution path; that is, the indexed field is used as the driving expression.

6.5 Range access

Simply put, a in (1,2,3) is the same as a=1 or a=2 or a=3, and a between 1 and 2 is the same as a>=1 and a<=2. No further optimization is needed.

6.6 Index access types

Avoid indexes that share the same prefix; that is, a field should not be the leading column of more than one index. For example, if a field already has a unique index and you then create a joint index starting with the same field, the optimizer cannot tell which index you want to use. Likewise, if you build a single-column index and a joint index with the same prefix, the joint index may not be used even if you write all of its conditions. Of course, you can force an index, but that's another story.

6.7 Conversion

Simple expressions can be converted: where -2 = a automatically becomes where a = -2. But if a mathematical operation is applied to the column, no conversion happens: where 2 = -a is not automatically converted to where a = -2.

The second SQL can use the index.

Therefore, during development we need to pay attention to how we write SQL, and deliberately write where a = -2.

6.8 and, union, order by, group by, etc.

1) and

For conditions joined by and: if none of them has an index, the whole table is scanned. If indexes are available, the one with the better access type (see 6.4) is used. If the access types are the same, the index that was created first is used.

2) union

Each statement in a union is optimized individually.

Here the two SQLs are executed separately, each using its index, and then the result sets are merged.

3) order by

order by filters out unnecessary sorting, such as sorting on a field that is already ordered by its index.

The query effect of the second SQL is the same as the first.

Therefore, when writing SQL, do not write useless sorting such as order by 'xxx', which is meaningless.

4) group by

Put simply, if the group by field has an index, the index will be used. For group by a order by a, the order by has the same effect as not being written at all, because the result set is already sorted; see 6.8 3) order by.
select distinct col_a from a is equivalent to select col_a from a group by col_a.

7 Index pushdown

The core idea is to move part of the data filtering down to the storage engine layer, instead of doing it all in the server layer as before.

Suppose a table has a joint index on name and age, and the query condition is where name like 'xx%' and age=11. In lower versions of MySQL (below 5.6), under the leftmost matching principle age is ignored at the index level and data is filtered by name only. After all the matching IDs are obtained by name, each one requires a table return.

In higher versions of MySQL, the age attribute is not ignored: entries are filtered on age within the index, so only entries with age 11 remain. Suppose 10 entries match on name before filtering by age and only 3 remain afterward; that means 7 fewer table returns. Reducing I/O greatly reduces the performance cost.
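The 10-versus-3 example can be replayed in a small simulation. The table contents, the `query` function, and the `pushdown` flag are assumptions for illustration; the joint index is modeled as a sorted list of (name, age, id) entries and the table as a dictionary keyed by id.

```python
# Hypothetical table: 10 rows whose name starts with "xx", 3 of them with age 11.
table = {
    pk: {"id": pk, "name": f"xx{pk}", "age": 11 if pk < 3 else 20 + pk}
    for pk in range(10)
}
table[10] = {"id": 10, "name": "yy0", "age": 11}

# Joint index on (name, age); each entry also carries the primary key.
index_entries = sorted((row["name"], row["age"], row["id"]) for row in table.values())


def query(prefix, age, pushdown):
    """Return (matching rows, number of table returns)."""
    table_returns = 0
    rows = []
    for name, entry_age, pk in index_entries:
        if not name.startswith(prefix):
            continue
        if pushdown and entry_age != age:
            continue                 # filtered inside the index, no table return
        table_returns += 1
        row = table[pk]              # table return: fetch the full row
        if row["age"] == age:        # server-layer check (redundant with pushdown)
            rows.append(row)
    return rows, table_returns


rows_without, returns_without = query("xx", 11, pushdown=False)
rows_with, returns_with = query("xx", 11, pushdown=True)
print(returns_without, returns_with)   # 10 table returns vs 3
```

Both calls return the same three rows; only the number of table returns differs.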

8 Small tables drive large tables

We often hear that a small table should drive a large table. This means that the smaller data set drives the larger one, thereby reducing the number of join lookups. For example:

Table A has 10,000 rows and Table B has 1,000,000 rows. If table A is the driving table, in the outer layer of the loop, only 10,000 lookups are needed. If table B is in the outer layer, the loop runs 1 million times.
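A nested-loop join with an index lookup on the driven table can be sketched as follows. The table sizes are scaled down to 100 and 10,000 rows for illustration, and dictionaries stand in for the indexes; the names are assumptions.

```python
def nested_loop_join(driving_rows, driven_index):
    """Loop over the driving table and probe the driven table's index once per row."""
    outer_iterations = 0
    matches = []
    for key in driving_rows:
        outer_iterations += 1
        if key in driven_index:              # index lookup on the driven table
            matches.append((key, driven_index[key]))
    return matches, outer_iterations


# Hypothetical, scaled-down tables sharing a join key.
small_table = list(range(100))
large_table = list(range(10_000))
small_index = {key: key for key in small_table}
large_index = {key: key for key in large_table}

small_drives = nested_loop_join(small_table, large_index)
large_drives = nested_loop_join(large_table, small_index)
print(small_drives[1], large_drives[1])   # 100 iterations vs 10000
```

Both orders produce the same matches; only the number of outer-loop iterations changes.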

Let's run an actual test. The environment is MySQL 5.7+.

Prepare two tables: ib_asn_d with 9,175 rows and bs_itembase_ext_attr with 1,584,115 rows. Both have an index on the product code field.

First, the small table drives the large table.

Across repeated tests, the execution time is about 7 seconds.
Next, let's look at the large table driving the small table.

Nearly 300 seconds: not the same order of magnitude at all.
Next, analyze the two execution plans. The first row in an execution plan is the driving table.

When the small table drives the large table, the large table uses its index and the small table is scanned in full, which is only a little over 8,000 rows.

When the large table drives the small table, the full scan of the large table covers 1,470,000 rows.
After many tests, we came to the following conclusions:

  1. When using left join, the left table is the driving table and the right table is the driven table;
  2. When using right join, the right table is the driving table and the left table is the driven table;
  3. When using inner join, MySQL will select a table with a relatively small amount of data as the driving table, and a large table as the driven table;
  4. The index on the driving table does not take effect, while the index on the driven table does.

So it is important to make sure that the small table is the driving table.

9 Summary

  1. Covering index: if the query condition uses an ordinary index (or the leftmost field of a joint index) and the queried columns are fields of that index or the primary key, the result is returned directly with no table return, which saves the disk I/O of reading entire rows. It is therefore worth building joint indexes for frequently queried fields.
  2. Leftmost prefix: the leftmost N fields of a joint index, or the leftmost M characters of a string index. When building indexes, be careful not to repeat the same left prefix, so that the query optimizer is not left unable to decide which index to use.
  3. Index pushdown: for a search with name like 'hello%' and age > 10, before MySQL 5.6 every row matching name required a table return. From version 5.6 on, entries with age <= 10 are filtered out first and only the rest are looked up in the table, reducing the number of table returns and improving retrieval speed.

Author: JD Logistics Wu Siwei 
Source: JD Cloud Developer Community. Please indicate the source when reprinting.

Origin my.oschina.net/u/4090830/blog/10320986