An in-depth explanation of MySQL indexes (Part 1)

1. Index overview

Database indexes are nothing new to us; we run into them often in our daily work. For example, when a SQL query is slow, after analyzing the cause you might say "add an index on such-and-such a field" or propose some similar fix. But what exactly is an index, and how does it work?

Simply put, an index exists to make data queries more efficient, just like the table of contents of a book. In a 500-page book, finding a particular topic without the table of contents may take a while. Similarly, for a database table, the index is effectively its table of contents.

2. Common models of indexing

Indexes exist to improve query efficiency, but there are many ways to implement one, which is why the concept of an index model is introduced here. Many data structures can be used to improve read and write efficiency. Here are three common and relatively simple ones: hash tables, ordered arrays, and search trees.

2.1 Hash table

A hash table is a structure that stores data as key-value pairs. We only need to supply the key to be searched to find its corresponding value. The idea of hashing is very simple: values are stored in an array, a hash function converts the key into a position, and the value is placed at that position in the array. Inevitably, different keys will sometimes be converted by the hash function to the same position. One way to handle this is to hang a linked list off that position (separate chaining). Suppose you are maintaining a table of ID card numbers and names, and you need to find the name for a given ID number. The corresponding hash index looks like this:
[Figure: schematic of the hash index]

Advantages and disadvantages of a hash table as an index

In the figure, the hash values calculated from the ID numbers of User2 and User4 are both N, but that does not matter, because a linked list hangs off that position. Suppose you now want to find the name corresponding to ID_card_n2. The steps are: first, apply the hash function to ID_card_n2 to get N; then walk the linked list in order until you find User2. Note that the four ID_card_n values in the figure are not in increasing order. The advantage of this is that adding a new User is fast, because it only needs to be appended.

The disadvantage is that, because the entries are not ordered, range queries on a hash index are very slow. If you want to find all users whose ID numbers are in the range [ID_card_X, ID_card_Y], you have to scan every entry.

So the hash table structure is suitable for scenarios with only equality queries, such as Memcached, Redis, and some other NoSQL engines.
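
The behaviour described above can be sketched in a few lines of Python. This is only a toy illustration, not how any real engine implements it: the class name `HashIndex`, the bucket count, and the sample IDs are all made up for the example, and keys are assumed to be unique.

```python
class HashIndex:
    """Toy hash index: an array of buckets, each bucket a chain of (key, value) pairs."""

    def __init__(self, num_buckets=4):
        self.buckets = [[] for _ in range(num_buckets)]

    def _slot(self, key):
        # The hash function maps a key to a position in the array.
        return hash(key) % len(self.buckets)

    def put(self, key, value):
        # New entries are simply appended to the chain, so inserts are cheap.
        self.buckets[self._slot(key)].append((key, value))

    def get(self, key):
        # Equality lookup: hash once, then walk a single chain.
        for k, v in self.buckets[self._slot(key)]:
            if k == key:
                return v
        return None

    def range(self, low, high):
        # Range lookup: keys are unordered, so every bucket has to be scanned.
        return sorted((k, v) for bucket in self.buckets
                      for k, v in bucket if low <= k <= high)


index = HashIndex()
for id_card, name in [("id_3", "User3"), ("id_1", "User1"),
                      ("id_4", "User4"), ("id_2", "User2")]:
    index.put(id_card, name)

print(index.get("id_2"))            # User2
print(index.range("id_2", "id_3"))  # [('id_2', 'User2'), ('id_3', 'User3')]
```

Note that `get` touches one bucket, while `range` touches all of them regardless of how narrow the interval is.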

2.2 Ordered arrays

An ordered array, by contrast, performs very well for both equality queries and range queries. Taking the same example of looking up a name by ID number, an ordered array implementation looks like this:
[Figure: schematic of the ordered array index]

Here we assume that ID numbers are never duplicated and that the array is stored in increasing order of ID number. To find the name corresponding to ID_card_n2, you can use binary search and get it quickly, with a time complexity of O(log(N)).

At the same time, this index structure clearly supports range queries. If you want to find the Users whose ID numbers are in the interval [ID_card_X, ID_card_Y], you can first use binary search to find ID_card_X (if ID_card_X does not exist, find the first ID number greater than ID_card_X), then traverse to the right until you reach the first ID number greater than ID_card_Y, and exit the loop.

If you only look at query efficiency, an ordered array is the best data structure.

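However, updates are troublesome. If you insert a record in the middle, you have to move all the records after it, which is too expensive. So ordered array indexes are only suitable for static storage engines, for example when you are storing all the population data of a city for 2017, the kind of data that will never be modified again.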
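
As a rough sketch of the three operations discussed above, the following uses Python's standard `bisect` module. The integer IDs, the sample rows, and the function names `find`, `find_range` and `insert` are placeholders invented for the illustration.

```python
from bisect import bisect_left

# Rows sorted by ID number; IDs are assumed unique.
rows = [(10, "User1"), (20, "User2"), (40, "User4"), (70, "User7")]
keys = [id_card for id_card, _ in rows]


def find(id_card):
    """Equality query by binary search, O(log N)."""
    i = bisect_left(keys, id_card)
    if i < len(keys) and keys[i] == id_card:
        return rows[i][1]
    return None


def find_range(low, high):
    """Range query: binary-search the start, then walk right until past `high`."""
    i = bisect_left(keys, low)  # first ID >= low, even if `low` itself is absent
    result = []
    while i < len(keys) and keys[i] <= high:
        result.append(rows[i][1])
        i += 1
    return result


def insert(id_card, name):
    """Insert in the middle: every later element shifts one slot, O(N)."""
    i = bisect_left(keys, id_card)
    keys.insert(i, id_card)
    rows.insert(i, (id_card, name))


print(find(20))            # User2
print(find_range(15, 40))  # ['User2', 'User4']
insert(30, "User3")
print(find_range(15, 40))  # ['User2', 'User3', 'User4']
```

The lookups are cheap; it is the `list.insert` calls, which shift every later element, that make this structure a poor fit for frequently updated data.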

2.3 Tree structure

The binary search tree (BST) is also a classic textbook data structure. Taking the same example of looking up a name by ID number, a binary search tree implementation looks like this:

[Figure: schematic of the binary search tree index]

The defining property of a binary search tree is that, for every node, all values in its left subtree are smaller than the node's value, and all values in its right subtree are larger. So to look up ID_card_n2, you follow the search path shown in the figure: UserA -> UserC -> UserF -> User2. The time complexity of this is O(log(N)). Of course, to maintain O(log(N)) query complexity, you need to keep the tree balanced, and to make that guarantee, the time complexity of an update is also O(log(N)). A tree can be binary or multi-way. A multi-way tree is one where each node has multiple children, with the children ordered so that their values increase from left to right. In terms of the number of comparisons, a binary tree is the most efficient for searching.

In practice, however, most database storage does not use binary trees. The reason is that an index does not only live in memory; it also has to be written to disk. Imagine a balanced binary tree with 1 million nodes: its height is 20, so a single query may need to access 20 data blocks (searching the index level by level, with one disk block read per level). In the era of mechanical hard disks, randomly reading a data block from disk took about 10 ms of seek time. In other words, for a table with 1 million rows stored in a binary tree, accessing a single row may take 20 reads of 10 ms each, about 200 ms, which is very slow for one query.
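
The arithmetic behind those numbers can be checked with a short calculation. The function name is made up for the example, and the model is deliberately simple: a perfectly balanced tree, one disk block per level, and the 10 ms figure quoted above.

```python
def balanced_binary_height(num_nodes):
    """Levels in the shortest binary tree that can hold `num_nodes` nodes.

    A binary tree with h levels holds at most 2**h - 1 nodes, so the
    smallest sufficient h is the bit length of num_nodes.
    """
    return num_nodes.bit_length()


SEEK_MS = 10  # random read cost per data block, as quoted in the text

height = balanced_binary_height(1_000_000)
worst_case_ms = height * SEEK_MS

print(height)         # 20
print(worst_case_ms)  # 200
```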

For a query to read from disk as little as possible, the query process must access as few data blocks as possible. So instead of a binary tree, we should use an "N-ary" tree. Here, the "N" in "N-ary" depends on the size of the data block (an int is 4 bytes, a bigint is 8 bytes). Taking an InnoDB index on an integer field as an example, N is about 1200. When the height of the tree is 4, it can store 1200^3 values, which is already about 1.7 billion. Since the data block holding the root of the tree is always in memory, looking up a value in an index on an integer field of a table with 1 billion rows requires at most 3 disk accesses. In fact, the second level of the tree is also very likely to be in memory, so the average number of disk accesses is even lower.
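
The capacity figures above follow from simple exponentiation, as this sketch shows. `N = 1200` comes from the text; the 16 KiB page is InnoDB's default page size; `ENTRY_BYTES = 14` is purely an illustrative guess at the per-entry cost (key plus child pointer), not an InnoDB constant, and is only there to show why N lands in the neighbourhood of 1200.

```python
PAGE_BYTES = 16 * 1024  # InnoDB's default page size
ENTRY_BYTES = 14        # assumed size of one key + child pointer (illustrative only)

fanout_estimate = PAGE_BYTES // ENTRY_BYTES
print(fanout_estimate)  # 1170

N = 1200  # fan-out used in the text


def capacity(height, fanout=N):
    """Values reachable in a tree of `height` levels, counted as in the text."""
    return fanout ** (height - 1)


def height_for(rows, fanout=N):
    """Smallest height whose capacity covers `rows`."""
    height = 1
    while capacity(height, fanout) < rows:
        height += 1
    return height


print(capacity(4))                 # 1728000000
h = height_for(1_000_000_000)
print(h, h - 1)                    # 4 3  (height, disk reads with the root cached)
```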

Because of its read and write performance and its good fit with disk access patterns, the N-ary tree is widely used in database engines.

Hash tables, ordered arrays, and N-ary trees are all the product of continuous iteration and optimization. As database technology has developed, data structures such as skip lists and LSM trees have also come into use in engine design. The underlying storage of a database is built on these data models at its core. Whenever you encounter a new database, look at its data model first, so that you can reason about the scenarios it is suited to.

In MySQL, indexes are implemented at the storage engine layer, so there is no uniform index standard; that is, indexes in different storage engines work in different ways. Even when multiple storage engines support the same type of index, their underlying implementations may differ. Since InnoDB is the most widely used storage engine in MySQL, the rest of this article uses InnoDB as the example and analyzes its index model.

3. InnoDB index model

3.1 A brief introduction to B+Tree

In InnoDB, tables are stored in the form of an index, ordered by primary key. A table stored this way is called an index-organized table. And because, as mentioned earlier, InnoDB uses the B+Tree index model, all data is stored in B+ trees.

Each index corresponds to one B+ tree in InnoDB. Suppose we have a table whose primary key is ID, with a field k in the table and an index on k. The table creation statement is:

create table T(
  id int primary key,
  k int not null,
  name varchar(16),
  index (k)
) engine=InnoDB;

The (ID, k) values of rows R1~R5 in the table are (100, 1), (200, 2), (300, 3), (500, 5) and (600, 6) respectively. The two index trees look like this:

[Figure: the primary key index tree on ID and the secondary index tree on k]

As the figure shows, indexes are divided by the content of their leaf nodes into primary key indexes and non-primary-key indexes.

The leaf nodes of the primary key index store the entire row. In InnoDB, the primary key index is also called the clustered index.

The leaf nodes of a non-primary-key index store the value of the primary key. In InnoDB, a non-primary-key index is also called a secondary index.

Given the index structure described above, let's consider a question: what is the difference between a query that uses the primary key index and one that uses an ordinary index?

  • If the statement is select * from T where ID = 500, that is, a primary key query, you only need to search the B+ tree of ID;
  • If the statement is select * from T where k = 5, that is, an ordinary index query, you first search the k index tree to get the ID value 500, and then search the ID (primary key) index tree once more. This process is called going back to the table. (The root cause of going back to the table is that the complete row exists only in the B+Tree of the primary key index.)

In other words, a query based on a non-primary-key index has to search one more index tree. Therefore, applications should use primary key queries whenever possible.
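
The difference between the two query paths can be mimicked with a small Python model. Plain dicts stand in for the two B+ trees, so this says nothing about InnoDB's real data layout; the row names "R1" to "R5" and all function names are placeholders, since the source only gives the (ID, k) pairs.

```python
# Clustered (primary key) index: id -> full row.
primary_index = {
    100: {"id": 100, "k": 1, "name": "R1"},
    200: {"id": 200, "k": 2, "name": "R2"},
    300: {"id": 300, "k": 3, "name": "R3"},
    500: {"id": 500, "k": 5, "name": "R4"},
    600: {"id": 600, "k": 6, "name": "R5"},
}

# Secondary index on k: its entries hold only the primary key value.
secondary_index_k = {row["k"]: pk for pk, row in primary_index.items()}


def search(tree, key, stats):
    """One index-tree search; `stats` counts how many trees were searched."""
    stats["searches"] += 1
    return tree.get(key)


def select_by_id(pk, stats):
    # select * from T where ID = ?  -> one search of the primary key tree
    return search(primary_index, pk, stats)


def select_by_k(k, stats):
    # select * from T where k = ?  -> search the k tree, then go back to the table
    pk = search(secondary_index_k, k, stats)
    if pk is None:
        return None
    return search(primary_index, pk, stats)


by_id_stats = {"searches": 0}
print(select_by_id(500, by_id_stats)["name"], by_id_stats["searches"])  # R4 1

by_k_stats = {"searches": 0}
print(select_by_k(5, by_k_stats)["name"], by_k_stats["searches"])       # R4 2
```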

3.2 Index maintenance (page split)

To keep the index ordered, a B+Tree has to do some maintenance whenever a new value is inserted. Taking the figure above as an example, if you insert a new row with an ID value of 700, you only need to add a new record after R5. If the newly inserted ID value is 400, it is more troublesome: the records after it have to be logically moved to make room.

Worse still, if the data page where R5 is located is already full, then according to the B+Tree algorithm a new data page must be requested and part of the data moved over to it. This process is called a page split, and performance naturally suffers in this case.

In addition to performance, the page splitting operation also affects the utilization of data pages. The data that was originally placed on one page is now divided into two pages, and the overall space utilization is reduced by about 50%.

Of course, if there is a page split, there will be a merge. When two adjacent pages have deleted data and the utilization rate is very low, the data pages will be merged. The process of merging can be regarded as the reverse process of the split process.
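
To make the two insert cases concrete, here is a minimal sketch in which each page is a small sorted Python list. `PAGE_CAPACITY = 5` is a made-up value so that the five sample rows fill exactly one page, and the half-and-half split policy is an assumption of this toy model rather than a description of InnoDB's actual algorithm.

```python
from bisect import insort

PAGE_CAPACITY = 5  # hypothetical; real pages hold far more records


def insert(pages, key):
    """Insert `key` into a list of sorted pages; return True if a page was split."""
    # Target page: the first one whose largest key is >= key, else the last page.
    idx = next((i for i, p in enumerate(pages) if p[-1] >= key), len(pages) - 1)
    page = pages[idx]
    if len(page) < PAGE_CAPACITY:
        insort(page, key)
        return False
    if idx == len(pages) - 1 and key > page[-1]:
        pages.append([key])  # append at the tail: new page, nothing is moved
        return False
    mid = len(page) // 2     # page split: the upper half moves to a new page
    left, right = page[:mid], page[mid:]
    insort(left if key < right[0] else right, key)
    pages[idx:idx + 1] = [left, right]
    return True


pages_append = [[100, 200, 300, 500, 600]]  # one full page holding R1..R5
print(insert(pages_append, 700), pages_append)
# False [[100, 200, 300, 500, 600], [700]]

pages_split = [[100, 200, 300, 500, 600]]
print(insert(pages_split, 400), pages_split)
# True [[100, 200], [300, 400, 500, 600]]

utilization = sum(len(p) for p in pages_split) / (len(pages_split) * PAGE_CAPACITY)
print(utilization)  # 0.6
```

After the split, the same records occupy two pages instead of one, which is where the drop in space utilization comes from.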

3.3 About the use of auto-increment primary key

Based on the index maintenance process above, let's discuss a case.
An auto-increment primary key is a primary key defined on an auto-increment column, which is generally declared in the table creation statement as: NOT NULL PRIMARY KEY AUTO_INCREMENT

When inserting a new record, you do not need to specify the ID value; the system takes the current maximum ID plus 1 as the ID value of the next record.
In other words, the insert pattern of an auto-increment primary key matches exactly the increasing-insert scenario mentioned earlier. Every new record is an append operation, which does not involve moving other records and does not trigger the splitting of leaf nodes.

However, if a field with business logic is used as the primary key, it is often not easy to ensure orderly insertion, so the cost of writing data is relatively high.
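
A rough way to see the gap between the two insert patterns is to count, for each insert, how many already-stored records sort after the new key and therefore have to make room for it. The sketch below uses a single flat sorted list instead of pages, so the absolute numbers overstate what a real B+ tree would move (only records within one page shift there); the function name, the 1000-key workload, and the shuffled list standing in for business keys are all assumptions of the example.

```python
import random
from bisect import bisect_right, insort


def records_shifted(keys_in_insert_order):
    """Total count of existing records that sort after each new key when it arrives."""
    stored, shifted = [], 0
    for key in keys_in_insert_order:
        shifted += len(stored) - bisect_right(stored, key)
        insort(stored, key)
    return shifted


auto_increment = list(range(1, 1001))     # keys arrive in increasing order
business_keys = auto_increment[:]
random.Random(42).shuffle(business_keys)  # stand-in for unordered business keys

sequential_cost = records_shifted(auto_increment)
random_cost = records_shifted(business_keys)

print(sequential_cost)                # 0
print(random_cost > sequential_cost)  # True
```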
In addition to considering performance, we can also look at it from the perspective of storage space. Assuming that your table does have a unique field, such as an ID number of string type, should the ID number be used as the primary key, or should the auto-increment field be used as the primary key?

Each leaf node of a non-primary-key index stores the value of the primary key. If the ID number is used as the primary key, each leaf entry of every secondary index takes about 20 bytes for it; with an integer (int) primary key it takes only 4 bytes, and with a long integer (bigint), 8 bytes.

Clearly, the shorter the primary key, the smaller the leaf nodes of ordinary indexes, and the less space those indexes occupy. Therefore, in terms of both performance and storage space, an auto-increment primary key is often the more reasonable choice.
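
For a sense of scale, the following back-of-the-envelope calculation multiplies out the per-entry sizes mentioned above. The 10-million-row table is a hypothetical figure, and the calculation counts only the bytes spent on primary key values in secondary index leaves, ignoring all other per-record and per-page overhead.

```python
def secondary_leaf_bytes(rows, pk_bytes, num_secondary_indexes=1):
    """Bytes spent storing primary key values in secondary index leaves."""
    return rows * pk_bytes * num_secondary_indexes


ROWS = 10_000_000  # hypothetical table size

for label, pk_bytes in [("ID number string", 20), ("int", 4), ("bigint", 8)]:
    mb = secondary_leaf_bytes(ROWS, pk_bytes) / 1_000_000
    print(f"{label}: {mb:.0f} MB per secondary index")
# ID number string: 200 MB per secondary index
# int: 40 MB per secondary index
# bigint: 80 MB per secondary index
```

The cost is paid once per secondary index, so a table with several secondary indexes multiplies the difference accordingly.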

Are there scenarios where it makes sense to use a business field directly as the primary key? Yes, there are. For example, some business scenarios have the following requirements:

  • There is only one index;
  • The index must be unique.

As you have probably noticed, this is a typical KV scenario.
Since there are no other indexes, there is no need to worry about the size of the leaf nodes of other indexes.
In this case, we should give priority to the principle mentioned earlier, "use primary key queries whenever possible", and make this index the primary key directly, which avoids having to search two trees for every query.

Origin blog.csdn.net/qq_46312987/article/details/124975757