[MySQL articles] Understand the principles of MySQL indexes in one article


MySQL is one of the most commonly used databases and can be found in companies large and small. But how well do you really know it? To use it well, we first need to understand it. As the saying goes, to do a good job, you must first sharpen your tools.

This article takes a close look at MySQL indexes. We start with what an index is, then work through the deduction of the index storage model to answer one question: why was the B+ tree chosen as the underlying data structure?

What is an index?

Suppose a table has 5 million rows. Run a WHERE query on the name field without an index:

select * from user_innodb where name ='小马';

What if there is an index on the name field? Create an index on the name field and execute the same query again.

ALTER TABLE user_innodb DROP INDEX idx_name; 
ALTER TABLE user_innodb ADD INDEX idx_name (name);

Compared with the query without an index, the query with an index is dozens of times faster.

This case shows very intuitively how much an index improves data retrieval performance.

So what exactly is an index? Why can it have such a big impact on our queries? What happened when the index was created?

Index definition

A database index is a sorted data structure in a database management system (DBMS) to help quickly query and update data in database tables.

Data is stored on disk in files, and each row has its own disk address. Without an index, finding one row among 5 million means scanning the rows of the table one by one until we reach it.

With an index, we only need to look the value up in the index, a data structure designed specifically for fast retrieval. Once we have the disk address where the row is stored, we can fetch the row directly.
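To make the difference concrete, here is a minimal Python sketch of the two approaches. It is only an illustration, not how InnoDB is implemented: the tiny `rows` table, the names in it, and the helper functions `full_scan` and `index_lookup` are all made up, and a list position stands in for a disk address.

```python
import bisect

# Hypothetical table: the position in the list stands in for a row's disk address.
rows = [
    {"id": 3, "name": "Tom"},
    {"id": 1, "name": "Alice"},
    {"id": 2, "name": "Bob"},
]


def full_scan(rows, name):
    """Without an index: examine rows one by one until a match is found."""
    for address, row in enumerate(rows):
        if row["name"] == name:
            return address
    return None


# The "index": (key, address) pairs kept sorted by key.
name_index = sorted((row["name"], address) for address, row in enumerate(rows))


def index_lookup(index, name):
    """With an index: binary-search the sorted keys, then return the stored address."""
    i = bisect.bisect_left(index, (name,))
    if i < len(index) and index[i][0] == name:
        return index[i][1]
    return None
```

Both functions return the same address for the same name. The scan touches rows in storage order, while the lookup only touches the sorted keys.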

Index types

In InnoDB, there are three types of indexes: ordinary index, unique index (the primary key index is a special unique index), and full-text index.

Normal: also called a non-unique index. It is the most common index type and has no restrictions.

Unique: a unique index requires that key values cannot repeat. Note that the primary key index is a special unique index with one additional restriction: the key value cannot be NULL. A primary key index is created with PRIMARY KEY.

Fulltext: for relatively large values, such as message content of several KB, a full-text index can address the low efficiency of LIKE queries. Full-text indexes can only be created on text-type fields such as char, varchar, and text.

An index is a data structure, so what kind of data structure should it choose to achieve efficient data retrieval?

Index storage model deduction

Binary search

After Double Eleven, your girlfriend plays a number guessing game with you: guess how much I spent yesterday. You get five chances.

10000? Too low. 30000? Too high. What will you guess next? 20000. Why not 11000 or 29000?

This is the idea behind binary search, also called half-interval search: every guess cuts the candidate range in half. It only works when the data is already sorted.

So first, we can consider an ordered array as the data structure for an index.
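As a rough sketch of the guessing game on an ordered array, the Python function below halves the candidate range on every comparison. The function name `binary_search` and the sample values are only for illustration.

```python
def binary_search(sorted_values, target):
    """Return (index, comparisons); index is -1 if target is absent."""
    low, high = 0, len(sorted_values) - 1
    comparisons = 0
    while low <= high:
        mid = (low + high) // 2          # guess the middle of the remaining range
        comparisons += 1
        if sorted_values[mid] == target:
            return mid, comparisons
        if sorted_values[mid] < target:
            low = mid + 1                # too low: discard the left half
        else:
            high = mid - 1               # too high: discard the right half
    return -1, comparisons


values = [2, 10, 12, 15, 21, 28]
index, comparisons = binary_search(values, 21)
```

For six values, no search needs more than three comparisons, because each step discards half of what is left.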

Equality and comparison queries on an ordered array are very efficient, but updates are a problem: inserting or deleting a value may require moving a large amount of data (and changing array indexes), so an ordered array is only suitable for static data.

To support frequent modifications such as inserts, we need a linked list. But a singly linked list must be searched node by node, so its search efficiency is still not high enough.

So, is there a linked list that can use binary search?

To solve this problem, the BST (Binary Search Tree) was born.

Binary Search Tree

Every node in the left subtree is smaller than the parent node, and every node in the right subtree is larger than the parent node. Projected onto a horizontal line, the nodes form an ordered linear list.

A binary search tree supports both fast search and fast insertion.
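A minimal Python sketch of a binary search tree follows. The names `Node`, `insert`, `contains`, and `in_order` are illustrative, and duplicate keys are simply ignored. The in-order traversal shows the "projection" onto an ordered list.

```python
class Node:
    def __init__(self, key):
        self.key = key
        self.left = None
        self.right = None


def insert(root, key):
    """Insert key and return the (possibly new) root of the subtree."""
    if root is None:
        return Node(key)
    if key < root.key:
        root.left = insert(root.left, key)
    elif key > root.key:
        root.right = insert(root.right, key)
    return root  # duplicates are ignored in this sketch


def contains(root, key):
    """Walk down from the root, going left or right at each node."""
    while root is not None:
        if key == root.key:
            return True
        root = root.left if key < root.key else root.right
    return False


def in_order(root):
    """Left subtree, node, right subtree: yields the keys in sorted order."""
    if root is None:
        return []
    return in_order(root.left) + [root.key] + in_order(root.right)


root = None
for key in [15, 10, 21, 2, 12, 28]:
    root = insert(root, key)
```

Inserting and searching both follow a single path from the root, so their cost depends on the depth of the tree rather than on the total number of keys.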

But there is a problem with the binary search tree: the search time is related to the depth of the tree, and in the worst case, the time complexity will degenerate to O(n).

What is the worst case scenario?

Take the same batch of numbers as before, but suppose we happen to insert them in order: 2, 10, 12, 15, 21, 28.

At this point the BST degenerates into a linked list (a "skewed tree"). It no longer speeds up retrieval, and its efficiency is no different from a sequential search.
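The effect of insertion order can be checked with a small Python sketch. The helpers `build` and `height` are illustrative names; `height` counts the number of levels in the tree.

```python
class Node:
    def __init__(self, key):
        self.key = key
        self.left = None
        self.right = None


def insert(root, key):
    if root is None:
        return Node(key)
    if key < root.key:
        root.left = insert(root.left, key)
    elif key > root.key:
        root.right = insert(root.right, key)
    return root


def build(keys):
    """Insert keys one by one, in the order given, into an initially empty BST."""
    root = None
    for key in keys:
        root = insert(root, key)
    return root


def height(root):
    """Number of levels in the tree; an empty tree has height 0."""
    if root is None:
        return 0
    return 1 + max(height(root.left), height(root.right))


sorted_order = build([2, 10, 12, 15, 21, 28])   # every node only has a right child
mixed_order = build([15, 10, 21, 2, 12, 28])    # same keys, different insertion order
```

With the sorted insertion order the tree has as many levels as it has keys, which is exactly the linked-list case. The same six keys inserted in a mixed order fit into three levels.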

What caused it to become skewed?

The depth difference between the left and right subtrees is too large: the left subtree of this tree has no nodes at all. In other words, the tree is not balanced enough.

So, do we have a more balanced tree where the depth difference between the left and right subtrees is not so large?

That is the balanced binary search tree, also known as the AVL tree.

Balanced Binary Tree (AVL Tree)

The definition of a balanced binary tree: the absolute value of the depth difference between the left and right subtrees cannot exceed 1.

What does this mean? For example, if the depth of the left subtree is 2, the depth of the right subtree can only be 1, 2, or 3.

Now if we insert 1, 2, 3, 4, 5, and 6 in order, the tree stays balanced and does not become a "skewed tree".

How does an AVL tree stay balanced? How does it guarantee that the depth difference between the left and right subtrees never exceeds 1? For example, insert 1, 2, and 3.

After inserting 1 and 2, we insert 3. By the definition of a binary search tree, 3 must go to the right of 2. The right subtree of the root node 1 now has depth 2, while the left subtree has depth 0 because the root has no left child. This violates the definition of a balanced binary tree.

What should we do? A right child hangs under a right child (the right-right case), so we lift 2 up to become the root. This operation is called a left rotation.

Similarly, if we insert 7, 6, and 5, we get the left-left case. A right rotation is performed, and 6 is lifted up.

So to maintain balance, the AVL tree performs a series of calculations and adjustments when data is inserted or updated.
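The two rotations can be sketched in a few lines of Python. This shows only the pointer changes for the right-right and left-left cases described above, not a full AVL implementation (there is no height bookkeeping and no handling of the left-right or right-left cases). The names `rotate_left` and `rotate_right` are illustrative.

```python
class Node:
    def __init__(self, key, left=None, right=None):
        self.key = key
        self.left = left
        self.right = right


def rotate_left(node):
    """Right-right case: lift node.right up; node becomes its left child."""
    pivot = node.right
    node.right = pivot.left   # pivot's left subtree moves under the old root
    pivot.left = node
    return pivot              # pivot is the new root of this subtree


def rotate_right(node):
    """Left-left case: lift node.left up; node becomes its right child."""
    pivot = node.left
    node.left = pivot.right   # pivot's right subtree moves under the old root
    pivot.right = node
    return pivot


# Inserting 1, 2, 3 into a plain BST gives the chain 1 -> 2 -> 3 (right-right).
balanced_123 = rotate_left(Node(1, right=Node(2, right=Node(3))))

# Inserting 7, 6, 5 gives the chain 7 -> 6 -> 5 (left-left).
balanced_765 = rotate_right(Node(7, left=Node(6, left=Node(5))))
```

After the left rotation, 2 is the root with 1 and 3 as children. After the right rotation, 6 is the root with 5 and 7 as children.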

We have solved the balance problem. So how would a balanced binary tree be used as an index to query data? If each node of the tree is a fixed-size unit, what should it store?

First: the key value of the index. For example, if we create an index on id, a query with the condition where id = 1 looks up the key value 1 in the index.

Second: the disk address of the data, because the purpose of the index is to find where the data is stored.

Third: because it is a binary tree, references to the left and right child nodes, so that we can move to the next node. For example, if the value we want is greater than 26, we go right to the next tree node and continue comparing.
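The three things a node stores can be written down as a small Python sketch. The six keys, the addresses, and the tree shape below are made up for illustration, and `lookup` counts how many nodes are visited. If each node lived on disk, each visit would be one read.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class IndexNode:
    key: int                              # 1. the key value of the index
    data_address: int                     # 2. the disk address of the row
    left: Optional["IndexNode"] = None    # 3. references to the child nodes
    right: Optional["IndexNode"] = None


def lookup(root, key):
    """Return (data_address, nodes_visited); address is None if key is absent."""
    visited = 0
    node = root
    while node is not None:
        visited += 1
        if key == node.key:
            return node.data_address, visited
        node = node.left if key < node.key else node.right
    return None, visited


# Hypothetical balanced tree over six ids.
root = IndexNode(
    26, 0x70,
    left=IndexNode(12, 0x30, left=IndexNode(5, 0x10)),
    right=IndexNode(35, 0x90, left=IndexNode(30, 0x80), right=IndexNode(37, 0xA0)),
)

address, visited = lookup(root, 37)
```

In this made-up tree, finding id 37 visits three nodes: 26, then 35, then 37.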

If the data is stored in this way, let's see what problems there will be.

First of all, index data is stored on disk. To view the size of the data and the indexes:

select CONCAT(ROUND(SUM(DATA_LENGTH/1024/1024),2),'MB') AS data_len, 
CONCAT(ROUND(SUM(INDEX_LENGTH/1024/1024),2),'MB') as index_len 
from information_schema.TABLES 
where table_schema='gupao' and table_name='user_innodb';

When the index is stored as a tree, each piece of data we read has to be compared at the server layer to see whether it is the data we need; if not, we have to read from disk again. Visiting one node costs one disk I/O. The smallest unit in which InnoDB operates on disk is a page (also called a disk block), which is 16 KB (16384 bytes) in size.

So a tree node is 16 KB in size. If a node stores only one key value + data + references, for example for an integer field, it may use only a dozen or a few dozen bytes, far below the 16 KB capacity. Visiting one tree node, that is, performing one I/O, therefore wastes a lot of space.

So if each node stores too little data, we need to visit more nodes to find the data we need from the index, which means that there will be too many interactions with the disk.

In the era of mechanical hard disks, each disk seek takes about 10 ms. The more interactions, the more time it takes.

For example, suppose the table has 6 rows stored in such a tree. To query id=37 we have to pass through two levels of child nodes, which means 3 interactions with the disk. What if we have millions of rows? The time becomes even harder to estimate.

So what is our solution?

The first one is to allow each node to store more data.

Second, the more keywords on the node, the more pointers we have, which means that there can be more forks.

Because the more forks there are, the smaller the depth of the tree becomes (counting the root node as depth 0). In this way, our tree changes from tall and thin to short and fat.
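A rough back-of-the-envelope calculation in Python shows how much the number of forks matters. This is a simplified estimate, not a measurement: it treats the number of levels as the smallest h with fanout ** h >= row count, reuses the 5 million rows and the 10 ms per I/O from the text, and the fan-out of 1000 is an assumed round number.

```python
def levels_needed(row_count, fanout):
    """Smallest number of levels h such that fanout ** h >= row_count."""
    levels, reachable = 0, 1
    while reachable < row_count:
        reachable *= fanout
        levels += 1
    return levels


ROWS = 5_000_000   # table size used earlier in the article
SEEK_MS = 10       # approximate cost of one disk I/O on a mechanical disk

binary_levels = levels_needed(ROWS, 2)        # two forks per node
multiway_levels = levels_needed(ROWS, 1000)   # assumed fan-out of 1000

binary_cost_ms = binary_levels * SEEK_MS
multiway_cost_ms = multiway_levels * SEEK_MS
```

Under these assumptions the binary tree needs 23 levels (about 230 ms of seeks per lookup), while the multi-way tree needs only 3 levels (about 30 ms).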

At this time, our tree is no longer binary, but multi-fork, or multi-way.

Multi-way balanced search tree (B Tree)

Like the AVL tree, the B-tree stores key values, data addresses, and node references at branch nodes and leaf nodes.

It has a feature: the number of forks (the number of paths) is always 1 more than the number of keywords. For example, in the tree we drew, each node stores two keywords, then there will be three pointers pointing to three child nodes.

What is the search rule of B Tree?

For example, we want to find 15 in this table. Since 15 is less than 17, go left. Since 15 is greater than 12, go right. 15 is found in disk block 7, and only 3 IOs are used.
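The search rule can be sketched in Python as follows. The tree is a made-up example in which every node holds two keywords and three pointers; only the path to 15 (less than 17, greater than 12) follows the text, and all other keys are invented. `BTreeNode` and `search` are illustrative names.

```python
import bisect


class BTreeNode:
    def __init__(self, keys, children=None):
        self.keys = keys                  # sorted keywords in this node
        self.children = children or []    # len(keys) + 1 children, or none for a leaf


def search(root, key):
    """Return (found, nodes_visited); each visited node stands for one I/O."""
    node, visited = root, 0
    while node is not None:
        visited += 1
        i = bisect.bisect_left(node.keys, key)   # first keyword >= key
        if i < len(node.keys) and node.keys[i] == key:
            return True, visited
        # Not in this node: follow the pointer between the neighbouring keywords.
        node = node.children[i] if node.children else None
    return False, visited


# Hypothetical three-level tree.
root = BTreeNode([17, 35], [
    BTreeNode([8, 12], [BTreeNode([3, 5]), BTreeNode([9, 10]), BTreeNode([13, 15])]),
    BTreeNode([26, 30], [BTreeNode([19, 22]), BTreeNode([28, 29]), BTreeNode([31, 33])]),
    BTreeNode([65, 87], [BTreeNode([36, 60]), BTreeNode([75, 79]), BTreeNode([90, 99])]),
])

found, visited = search(root, 15)
```

Searching for 15 visits three nodes, matching the three I/Os described above. A keyword stored in a branch node, such as 12, is found earlier.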

Is this more efficient than the AVL tree? Then how does B Tree realize that one node stores multiple keywords and maintains a balance? What is the difference with AVL tree?

For example, when the Max Degree (number of paths) is 3, we insert 1, 2, and 3. When 3 is inserted, it should go into the first disk block. But a node with three keywords would have 4 pointers, meaning 4 child paths, so the node must split at this point (actually B+Tree). The middle value 2 is lifted up, and 1 and 3 become child nodes of 2.

If a node is deleted, there will be an opposite merge operation.

Note that splitting and merging here are different from the left and right rotations of the AVL tree.

We continue to insert 4 and 5, and the B Tree will split again.
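The split can be sketched in Python as below. This is a simplified textbook-style B-tree insert with a Max Degree of 3, written for illustration only: it does not implement deletion or merging, it does not handle duplicate keys, and it is not InnoDB's page-split code. `MAX_DEGREE`, `Node`, and `insert` are illustrative names.

```python
import bisect

MAX_DEGREE = 3  # at most 3 child pointers, so at most 2 keywords per node


class Node:
    def __init__(self, keys=None, children=None):
        self.keys = keys or []
        self.children = children or []   # empty for a leaf


def _insert(node, key):
    """Insert key below node. Return (middle_key, right_node) if node split, else None."""
    i = bisect.bisect_left(node.keys, key)
    if node.children:
        split = _insert(node.children[i], key)
        if split is None:
            return None
        middle_key, right = split
        node.keys.insert(i, middle_key)        # the lifted keyword joins this node
        node.children.insert(i + 1, right)
    else:
        node.keys.insert(i, key)
    if len(node.keys) < MAX_DEGREE:
        return None                            # still fits, no split needed
    # Too many keywords: lift the middle one and split the rest into two nodes.
    mid = len(node.keys) // 2
    middle_key = node.keys[mid]
    right = Node(node.keys[mid + 1:], node.children[mid + 1:])
    node.keys = node.keys[:mid]
    node.children = node.children[:mid + 1]
    return middle_key, right


def insert(root, key):
    """Insert key and return the root, which changes when the old root splits."""
    split = _insert(root, key)
    if split is None:
        return root
    middle_key, right = split
    return Node([middle_key], [root, right])


root = Node()
for key in [1, 2, 3]:
    root = insert(root, key)
after_three = (list(root.keys), [list(child.keys) for child in root.children])

for key in [4, 5]:
    root = insert(root, key)
after_five = (list(root.keys), [list(child.keys) for child in root.children])
```

After inserting 1, 2, and 3, the root holds 2 with children 1 and 3. In this sketch, inserting 4 and 5 splits the right leaf and lifts 4 into the root, which then holds 2 and 4 with children 1, 3, and 5.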

From this we can also see that updating an index can trigger a large number of structural adjustments. This explains why we should not build indexes on frequently updated columns, and why we should not update the primary key.

The splitting and merging of nodes is actually the splitting and merging of InnoDB pages.

B+ tree (enhanced version of B Tree)

The efficiency of B Tree is already very high, why does MySQL need to improve B Tree and finally use B+Tree?

Generally speaking, this improved version of the B-tree solves problems more comprehensively than the B-tree does.

Let's take a look at the storage structure of the B+ tree in InnoDB:

B+Tree in MySQL has several characteristics:

  1. The number of its keywords is equal to the number of paths;

  2. The root node and branch nodes of a B+Tree do not store data; only the leaf nodes do. A matching keyword is not returned immediately; the search always continues to the leaf nodes in the last layer. For example, if we search for id=28, the key is hit in the first layer, but because all the data is in the leaf nodes, the search continues down to a leaf node.

  3. Each leaf node of B+Tree adds a pointer to the adjacent leaf node, and its last data points to the first data of the next leaf node, forming an ordered linked list structure.

  4. It retrieves data by left-closed, right-open intervals [ ).

The data search process of B+Tree:

  1. For example, to search for 28, we find the key value at the root node, but because the root is not a leaf node, the search continues. 28 is the lower bound of the left-closed, right-open interval [28,66), so the search goes to the middle child node. There, 28 is the lower bound of the interval [28,34), so the search goes to the left child node, and finally finds the required data on the leaf node.

  2. For a range query, for example for data from 22 to 60, once 22 is found we only need to follow the leaf nodes and their pointers in order to reach all the matching data in one pass. This greatly improves range query efficiency (there is no need to go back to the parent nodes and search again).
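Both search processes can be sketched in Python. The two-level tree below is hypothetical: it follows the layout described above (as many keywords as paths, each keyword being the lower bound of a left-closed, right-open interval), but the keys in the leaves are made up and the row data is left out. `Leaf`, `Branch`, `find_leaf`, and `range_scan` are illustrative names.

```python
import bisect


class Leaf:
    def __init__(self, keys):
        self.keys = keys     # sorted keys; the row data itself is omitted here
        self.next = None     # pointer to the adjacent leaf


class Branch:
    def __init__(self, keys, children):
        self.keys = keys             # keys[i] is the lower bound of children[i]
        self.children = children     # same number of children as keys


def find_leaf(node, key):
    """Descend to the leaf whose interval [lower, next lower) contains key."""
    while isinstance(node, Branch):
        # Last keyword <= key; a key equal to a keyword goes to that keyword's child.
        i = max(bisect.bisect_right(node.keys, key) - 1, 0)
        node = node.children[i]
    return node


def range_scan(root, low, high):
    """Find the leaf for low once, then follow the leaf pointers up to high."""
    leaf = find_leaf(root, low)
    result = []
    while leaf is not None:
        for key in leaf.keys:
            if key > high:
                return result
            if key >= low:
                result.append(key)
        leaf = leaf.next
    return result


leaf_a, leaf_b, leaf_c = Leaf([1, 10, 22]), Leaf([28, 34, 50]), Leaf([66, 70, 90])
leaf_a.next, leaf_b.next = leaf_b, leaf_c
root = Branch([1, 28, 66], [leaf_a, leaf_b, leaf_c])

hit_leaf = find_leaf(root, 28)
matches = range_scan(root, 22, 60)
```

The lookup for 28 does not stop at the root even though 28 appears there; it continues to the middle leaf. The range scan from 22 to 60 descends the tree only once and then walks the linked leaves.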

Features of B+Tree in InnoDB:

  1. It is a variant of B Tree, and it can solve the problems that B Tree can solve. What are the two major problems that B Tree solves? (each node stores more keywords; more paths);

  2. Stronger database and table scanning ability (for a full table scan we only need to traverse the leaf nodes, rather than the entire B+Tree, to get all the data);

  3. The disk read and write capabilities of B+Tree are stronger than that of B Tree (the root node and branch nodes do not save the data area, so one node can save more keywords, and more keywords can be loaded from disk at one time);

  4. Stronger sorting ability (because there is a pointer to the next data area on the leaf node, the data forms a linked list);

  5. The efficiency is more stable (B+Tree always gets data at the leaf nodes, so the number of IOs is stable).
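To get a feel for how many keywords fit in a node that stores no data, here is a rough estimate in Python. Only the 16 KB page size comes from the text. The 8-byte key, the 6-byte child pointer, and the 1 KB row size are assumptions, and page headers and other overhead are ignored, so the result is an order of magnitude, not an exact figure.

```python
PAGE_SIZE = 16 * 1024   # InnoDB page size from the text, in bytes
KEY_BYTES = 8           # assumption: a BIGINT key
POINTER_BYTES = 6       # assumption: size of a pointer to a child page
ROW_BYTES = 1024        # assumption: about 1 KB per full row in a leaf

keys_per_branch_node = PAGE_SIZE // (KEY_BYTES + POINTER_BYTES)
rows_per_leaf = PAGE_SIZE // ROW_BYTES


def rows_reachable(levels):
    """Rows addressable by a tree with `levels` levels, the last level being leaves."""
    return keys_per_branch_node ** (levels - 1) * rows_per_leaf


two_level_rows = rows_reachable(2)
three_level_rows = rows_reachable(3)
```

Under these assumptions a branch node holds about 1170 keywords, and a three-level tree can address roughly 21.9 million rows, which is why the number of I/Os per lookup stays small and stable.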

Summary

By now you should know why MySQL uses the B+ tree as the data structure for its indexes. In the next article, we will continue with the rules for using indexes and how to create and use them. If this article helped you, remember to like, follow, and bookmark.

Origin blog.csdn.net/jiang_wang01/article/details/131293406