MySQL index: B+ tree (you will understand after reading it)

A brief description of indexes:
An index is a data structure that helps us quickly locate the data we want within a large amount of data. The most vivid metaphor for an index is the table of contents of a book. Note the words "large amount of data": an index only pays off when there is a lot of data. If I want to find 4 in [1,2,3,4], scanning all the data directly is already very fast, and there is no need to go to the trouble of building an index.

Indexes in the MySQL database fall into three categories:
B+ tree index, hash index and full-text index.

What we introduce today is the B+ tree index of the InnoDB storage engine, the one most frequently encountered in day-to-day development. To introduce the B+ tree index, we have to mention three other data structures: the binary search tree, the balanced binary tree and the B-tree. The B+ tree evolved from these three.

Binary search tree
First, let's look at a picture:
[Figure: a binary search tree index built on the user table]
As you can see from the figure, we have built a binary search tree index for the user table (user information table). Each circle in the figure is a node of the binary search tree, and each node stores a key and data. The key corresponds to the id in the user table, and the data corresponds to the row data in the user table.

The characteristic of the binary search tree is that, for any node, every key in its left subtree is less than the key of the node, and every key in its right subtree is greater than the key of the node. The top node is called the root node, and a node without child nodes is called a leaf node.

If we need to find the user information with id=12 using the binary search tree index we created, the search process is as follows:
1: Take the root node as the current node and compare 12 with the key value of the current node, 10. Since 12 is greater than 10, we next take the right child node of the current node as the current node.

2: Continue to compare 12 with the key value 13 of the current node. Since 12 is less than 13, we take the left child node of the current node as the current node.

3: Compare 12 with the key value 12 of the current node. 12 is equal to 12, so the condition is met and we take the data from the current node, that is, id=12, name=xm.

Using the binary search tree, we need only 3 comparisons to find the matching data. If we searched the table row by row, we would need 6.
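
The three comparison steps above can be written down in a few lines of Python. This is only a sketch: the keys 10, 13 and 12 and the row (12, xm) come from the walkthrough, while the other ids, the names and the insertion order are made up for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    key: int                          # the id column of the user table
    data: str                         # stands in for the whole row
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def insert(root, key, data):
    """Insert a key into the binary search tree and return the root."""
    if root is None:
        return Node(key, data)
    if key < root.key:
        root.left = insert(root.left, key, data)
    else:
        root.right = insert(root.right, key, data)
    return root


def search(root, key):
    """Return (data, comparisons); data is None when the key is absent."""
    comparisons = 0
    node = root
    while node is not None:
        comparisons += 1
        if key == node.key:
            return node.data, comparisons
        node = node.left if key < node.key else node.right
    return None, comparisons


# Only 10, 13, 12 and the name "xm" are taken from the text; the rest is assumed.
rows = [(10, "zs"), (7, "ls"), (13, "ww"), (5, "zl"), (8, "xh"), (12, "xm"), (17, "dm")]
root = None
for key, name in rows:
    root = insert(root, key, name)

found, comparisons = search(root, 12)
print(found, comparisons)   # xm 3
```

The search visits 10, then 13, then 12, which is exactly the path described in the three steps.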

Balanced binary tree
Above we explained how a binary search tree lets us find data quickly. However, suppose the binary search tree had been constructed like this:
[Figure: a binary search tree that has degenerated into a linked list]
This time we can see that our binary search tree has degenerated into a linked list. If we need to find the user information with id=17, we need 7 comparisons, which is equivalent to a full table scan.
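
To see how the same keys can produce either shape, here is a small Python sketch. The seven ids are assumed (only 17 is named in the text); what matters is the insertion order, since inserting keys in ascending order gives every node only a right child.

```python
class Node:
    def __init__(self, key):
        self.key = key
        self.left = None
        self.right = None


def insert(root, key):
    """Plain binary search tree insert, with no rebalancing."""
    if root is None:
        return Node(key)
    if key < root.key:
        root.left = insert(root.left, key)
    else:
        root.right = insert(root.right, key)
    return root


def build(keys):
    root = None
    for key in keys:
        root = insert(root, key)
    return root


def height(node):
    return 0 if node is None else 1 + max(height(node.left), height(node.right))


def comparisons_to_find(root, key):
    count, node = 0, root
    while node is not None:
        count += 1
        if key == node.key:
            return count
        node = node.left if key < node.key else node.right
    return count


chain = build([5, 7, 8, 10, 12, 13, 17])   # ascending order: degenerates into a list
bushy = build([10, 7, 13, 5, 8, 12, 17])   # same keys, friendlier order

print(height(chain), comparisons_to_find(chain, 17))   # 7 7
print(height(bushy), comparisons_to_find(bushy, 17))   # 3 3
```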

The reason for this phenomenon is that the binary search tree has become unbalanced, that is, it has grown too tall, which makes search efficiency unstable. To solve this problem, we need to ensure that the binary search tree always stays balanced, so we use a balanced binary tree. The balanced binary tree meant here is the AVL tree: on top of satisfying the properties of the binary search tree, the heights of the left and right subtrees of every node must differ by no more than 1. The following is a comparison between a balanced binary tree and an unbalanced binary tree:
[Figure: a balanced binary tree next to an unbalanced binary tree]
From this definition, we can see that the binary tree in the first figure is in fact a balanced binary tree.

The balanced binary tree keeps the structure of the tree balanced. When inserting or deleting data makes the tree unbalanced, the balanced binary tree adjusts its nodes to restore balance. The specific adjustment method will not be introduced here. Compared with the plain binary search tree, the balanced binary tree has more stable search efficiency and a faster overall search speed.
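
The balance rule itself (subtree heights differ by at most 1 at every node) is easy to check in code. The sketch below only tests the condition; it does not perform the rebalancing rotations, which this article skips. Trees are written as nested tuples (key, left, right), and both example trees are made up.

```python
def check(node):
    """Return (height, balanced) for a tree given as (key, left, right) or None."""
    if node is None:
        return 0, True
    _, left, right = node
    left_height, left_ok = check(left)
    right_height, right_ok = check(right)
    balanced = left_ok and right_ok and abs(left_height - right_height) <= 1
    return 1 + max(left_height, right_height), balanced


def leaf(key):
    return (key, None, None)


balanced_tree = (10, (7, leaf(5), leaf(8)), (13, leaf(12), leaf(17)))
right_chain = (5, None, (7, None, leaf(8)))   # every node has only a right child

print(check(balanced_tree))   # (3, True)
print(check(right_chain))     # (3, False)
```

In the chain, the root has an empty left subtree and a right subtree of height 2, so the difference is 2 and the rule is violated.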

B-tree
Memory is volatile, so under normal circumstances we store the data and indexes of the user table on a peripheral device such as a disk. But reading data from disk is hundreds, thousands or even tens of thousands of times slower than reading from memory, so we should reduce the number of disk reads as much as possible.

In addition, data is read from disk one disk block at a time, not one item at a time. If we can put as much data as possible into one disk block, one disk read will bring in more data, and the time needed to find data will drop sharply.

If we use a tree as the index data structure, every step of a lookup has to read one node from disk, and that node is the disk block we just mentioned. We know that each node of a balanced binary tree stores only one key and its data. What does that mean? It means each disk block stores only one key and its data! What if we want to store massive amounts of data? The binary tree will then have a huge number of nodes and will be extremely tall. Looking up data will take many disk IOs, and lookup efficiency will be extremely low!
[Figure: a balanced binary tree in which each node occupies one disk block]
To overcome this disadvantage of the balanced binary tree, we should look for a balanced tree in which a single node can store multiple keys and their data. This is the B-tree we talk about next. A B-tree is a balanced multiway search tree. The figure below is a B-tree:
[Figure: a 3-order B-tree of pages]
The p in each node in the figure is a pointer to a child node. The binary search tree and the balanced binary tree have such pointers too; they were omitted from the earlier figures to keep them tidy. Each node in the figure is called a page, and a page is the disk block we mentioned above. The basic unit of data reading in MySQL is the page, so calling it a page here is closer to the underlying data structure of the index in MySQL.

As can be seen from the above figure, compared with the balanced binary tree, each node of the B-tree stores more keys and data and has more child nodes. The number of child nodes is generally called the order. The B-tree in the above figure is a 3-order B-tree, and its height is much lower.

Because of this, a B-tree reads the disk far fewer times when searching for data, and its search efficiency is much higher than that of the balanced binary tree.
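
To put rough numbers on "the height will be very low", the sketch below computes the smallest number of levels needed to hold a given number of keys when every node has up to `fanout` children and stores `fanout - 1` keys. The function name and the 10 million figure are made up for illustration; real trees are rarely packed this perfectly.

```python
def min_height(num_keys, fanout):
    """Fewest levels needed to hold num_keys when each node stores fanout - 1 keys."""
    levels, capacity, nodes_on_level = 0, 0, 1
    while capacity < num_keys:
        capacity += nodes_on_level * (fanout - 1)   # keys added by one more full level
        nodes_on_level *= fanout
        levels += 1
    return levels


for fanout in (2, 3, 1000):
    print(fanout, min_height(10_000_000, fanout))
# 2 24
# 3 15
# 1000 3
```

If every level costs one disk read, a binary tree needs about 24 reads to reach a key among 10 million, while a tree with 1000 children per node needs 3.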

If we want to find the user information with id=28, the search process in the B-tree in the above figure is as follows:

First read the root node, which is page 1. Since 28 lies between the key values 17 and 35, we follow the pointer p2 in page 1 to page 3.

Compare 28 with the key values in page 3. Since 28 lies between 26 and 30, we follow the pointer p2 in page 3 to page 8.

Compare 28 with the key values in page 8. There is a matching key value 28, and the user information corresponding to it is (28, bv).
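
The three page reads above can be sketched in Python as follows. Pages 1, 3 and 8, their keys and the row (28, bv) follow the walkthrough; every other page number, key and name is made up so that the example forms a valid small B-tree. A dictionary lookup stands in for one disk read.

```python
from bisect import bisect_left

# page number -> (keys, row data for each key, child page numbers); leaves have no children
PAGES = {
    1: ([17, 35], ["ab", "cd"], [2, 3, 4]),
    2: ([8], ["ef"], [5, 6]),
    3: ([26, 30], ["gh", "ij"], [7, 8, 9]),
    4: ([65], ["kl"], [10, 11]),
    5: ([3, 5], ["mn", "op"], []),
    6: ([9, 12], ["qr", "st"], []),
    7: ([20, 22], ["uv", "wx"], []),
    8: ([28, 29], ["bv", "yz"], []),
    9: ([31, 33], ["ba", "dc"], []),
    10: ([36, 60], ["fe", "hg"], []),
    11: ([75, 90], ["ji", "lk"], []),
}


def search(key, root_page=1):
    """Return (row, pages read); row is None when the key is not in the tree."""
    path = []
    page_no = root_page
    while True:
        keys, rows, children = PAGES[page_no]    # one simulated disk read
        path.append(page_no)
        i = bisect_left(keys, key)               # position of key among this page's keys
        if i < len(keys) and keys[i] == key:
            return rows[i], path                 # in a B-tree, data can sit in any node
        if not children:
            return None, path
        page_no = children[i]                    # follow pointer p(i+1)


print(search(28))   # ('bv', [1, 3, 8])
print(search(17))   # ('ab', [1])
```

Note that looking up 17 stops at the root, because a B-tree keeps data in non-leaf nodes as well.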

However, the above B-tree has a problem as an index: it is inflexible and inefficient for range searches. Because the data is spread over all levels of the tree and the pages are not linked to each other, a range search keeps having to go back through the upper levels, in the worst case starting from the root node again for each page it needs. The B+ tree is the improvement that solves this shortcoming.

B+ tree
The B+ tree is a further optimization of the B-tree. Let us first look at the structure diagram of the B+ tree:
[Figure: structure of a B+ tree index]
Based on the above figure, let's see how the B+ tree differs from the B-tree:
① The non-leaf nodes of the B+ tree do not store data, only key values, whereas the nodes of the B-tree store both key values and data. The reason is that the page size in the database is fixed; the default page size in InnoDB is 16KB. If a page stores no data, it can hold more key values, so the order of the tree (the number of child nodes per node) is larger and the tree becomes shorter and fatter. As a result, finding data takes even fewer disk IOs and queries become faster.

In addition, in the B+ tree the order is equal to the number of key values in a node. If one node of our B+ tree can store 1000 key values, then a 3-level B+ tree can store 1000×1000×1000 = 1 billion rows.

Generally, the root node is resident in memory, so we usually need only 2 disk IOs to find one row among 1 billion.
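
The arithmetic above can be reproduced with a few lines of Python. The 1000 keys per page and the 16KB page size come from the text; the 16 bytes per index entry is purely an assumption used to show where a fan-out of roughly 1000 could come from, not InnoDB's actual record layout.

```python
PAGE_SIZE = 16 * 1024            # InnoDB default page size, as stated above


def capacity(keys_per_page, levels):
    """Rows reachable through a tree in which every page holds keys_per_page entries."""
    return keys_per_page ** levels


def disk_reads(levels, root_in_memory=True):
    """One read per level, minus the root if it is already cached."""
    return levels - 1 if root_in_memory else levels


print(capacity(1000, 3))         # 1000000000
print(disk_reads(3))             # 2

ASSUMED_ENTRY_BYTES = 16         # hypothetical size of one key plus one child pointer
fanout = PAGE_SIZE // ASSUMED_ENTRY_BYTES
print(fanout, capacity(fanout, 3))   # 1024 1073741824
```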

② All the data of a B+ tree index is stored in the leaf nodes, and the data is arranged in order. This makes range searches, sorted searches, grouped searches and de-duplicating searches extremely simple in a B+ tree. They are hard to do in a B-tree because its data is scattered across all the nodes.

Observant readers may also notice that the pages of the B+ tree in the above figure are connected through a doubly linked list, and the data inside the leaf nodes is connected through a singly linked list. In fact, we could add linked lists to the nodes of the B-tree above as well, so this is not a difference between the two structures. The figure is drawn this way because this is how the index is stored in MySQL's InnoDB storage engine.

That is to say, the B+ tree index in the figure above is how the B+ tree index is really implemented in InnoDB. To be precise, it is a clustered index (clustered and non-clustered indexes are discussed below). As the figure shows, in InnoDB we can reach all the data in the table through the doubly linked list between the data pages and the singly linked list between the records in the leaf nodes. The implementation of the B+ tree index in MyISAM is slightly different from that in InnoDB.

In MyISAM, the leaf nodes of the B+ tree index do not store the data itself, but the file address where the data is stored.

Clustered index VS non-clustered index

When we introduced the B+ tree index in the previous section, we mentioned that the index in the figure is actually a clustered index.

So what is a clustered index? In MySQL, B+ tree indexes are divided into clustered indexes and non-clustered indexes according to how they store data.
Here we focus on the clustered index and the non-clustered index in InnoDB:

**① Clustered index:** For a table that uses InnoDB as its storage engine, every row has a primary key. Even if you do not define one, InnoDB falls back to the first unique index whose columns are all NOT NULL, and if there is none it generates a hidden row id to serve as an implicit primary key. This is because InnoDB stores the table data in a B+ tree whose key value is the primary key, and all the data in the table is stored in the leaf nodes of that B+ tree. A B+ tree index built with the primary key as its key value is called a clustered index.

**② Non-clustered index:** A B+ tree index built with column values other than the primary key as its key value is called a non-clustered index (also called a secondary index). The difference from the clustered index is that the leaf nodes of a non-clustered index do not store the row data, but the primary key of the row that the column value belongs to. To get the data, we have to look it up again in the clustered index using that primary key. This second lookup in the clustered index is called going back to the table. Once we understand the definitions of clustered and non-clustered indexes, this sentence should make sense: the data is the index, and the index is the data.

Use clustered index and non-clustered index to find data
When we explained the B+ tree index earlier, we did not describe how to search for data in the B+ tree, mainly because the concepts of clustered index and non-clustered index had not been introduced yet.

Next, we introduce how to find data in a table through a clustered index and through a non-clustered index.

Use clustered index to find data
[Figure: the B+ tree index from the previous section]
This is the same B+ tree index figure as before. By now we know that it is a clustered index and that the data of the table is stored in it. Now suppose we want to find the user data with id>=18 and id<40. The corresponding SQL statement is:

SELECT * FROM user WHERE id >= 18 AND id < 40;

Here id is the primary key, and the search process is as follows:
① Generally, the root node is resident in memory, which means page 1 is already in memory. There is no need to read it from disk; we read page 1 directly from memory. To find the range id>=18 and id<40, we first need to find the key value id=18. In page 1 we find the key value 18, and from there we locate page 3 through the pointer p2.

② To search page 3, we take the pointer p2 and read page 3 from disk into memory. Searching it, we again find the key value 18, take the pointer p1 in page 3 and locate page 8.

③ Page 8 is not in memory either, so we read page 8 from disk into memory. Because the records in a page are connected by a linked list and the key values are stored in order, the key value 18 can be located with a binary search. Since we have now reached a data page, we have found the first row that meets the condition: the data corresponding to the key value 18. Because this is a range search, and all the data is in the leaf nodes and arranged in order, we can traverse the key values in page 8 sequentially and collect the matching rows. We continue until the key value 22, after which there is no more data in page 8. At that point we follow the p pointer in page 8 to read the data in page 9.

④ Page 9 is not in memory, so it is loaded into memory as well and searched in the same way as page 8. This continues until page 12 is loaded into memory and we find that 41 is greater than 40, which no longer satisfies the condition. The search ends here.

In the end we have found all the data that meets the condition, a total of 12 records:
(18,kl), (19,kl), (22,hj), (24,io), (25,vg), (29,jk), (31,jk), (33,rt), (34,ty), (35,yu), (37,rt), (39,rt).

Let's take a look at the search flowchart:
[Figure: flowchart of the range search in the clustered index]

Use non-clustered index to find data
[Figure: B+ tree of a non-clustered index]
Readers may be confused when they see this picture: what is this, it is all numbers? If you feel this way, look carefully at the explanation in red letters in the picture.
Still can't understand it? Let me explain again. This non-clustered index is an index on the user's lucky number (why a lucky number? It just came to me on a whim :-)). The table structure now looks like this:
[Figure: the user table with a lucky number column]
The leaf nodes no longer store all the data; they store the key value and the primary key. Take an entry of the form x-y in a leaf node, such as 1-1: the 1 on the left is the key value of the index, and the 1 on the right is the primary key value.

If we want to find the user information whose lucky number is 33, the corresponding SQL statement is:

SELECT * FROM user WHERE luckNum = 33;

The search process in the index is the same as for the clustered index, so I won't describe it in detail here. We eventually find the primary key value 47. With the primary key in hand, we still need to fetch the corresponding row from the clustered index, so at this point we go back to the clustered index search process.
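
The two-step lookup can be sketched in Python. A sorted list of (luckNum, id) pairs stands in for the leaf level of the non-clustered index and a dictionary stands in for the clustered index; neither is a real B+ tree. Only the pair luckNum=33, id=47 and the leaf entry 1-1 come from the text; the names and the other row are made up.

```python
from bisect import bisect_left

# "Clustered index": primary key -> full row
CLUSTERED = {
    1: {"id": 1, "name": "ab", "luckNum": 1},
    12: {"id": 12, "name": "xm", "luckNum": 20},
    47: {"id": 47, "name": "lm", "luckNum": 33},
}

# "Non-clustered index": (key value, primary key) pairs in key order
SECONDARY = sorted((row["luckNum"], pk) for pk, row in CLUSTERED.items())


def find_by_luck_num(value):
    """Look up primary keys in the secondary index, then fetch rows from the clustered index."""
    rows = []
    i = bisect_left(SECONDARY, (value,))     # first entry whose key value is >= value
    while i < len(SECONDARY) and SECONDARY[i][0] == value:
        primary_key = SECONDARY[i][1]
        rows.append(CLUSTERED[primary_key])  # going back to the table
        i += 1
    return rows


print(SECONDARY)               # [(1, 1), (20, 12), (33, 47)]
print(find_by_luck_num(33))    # [{'id': 47, 'name': 'lm', 'luckNum': 33}]
```

If a query only needed the id, the answer would already be in the secondary index entry and the second lookup could be skipped.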

Let's take a look at the search flowchart:
[Figure: flowchart of a lookup through the non-clustered index and back to the clustered index]
MyISAM has no clustered index: the leaf nodes of both its primary key index and its other indexes store the file address of the data.

Summary
Starting from the binary search tree, this article explained in detail why MySQL uses the B+ tree as its index data structure, and how InnoDB stores and finds data through the B+ tree index. We must remember this sentence: the data is the index, and the index is the data.

When reposting, please credit: Liu Zhaokao's blog » MySQL index-B+ tree (you will understand after reading)

Origin blog.csdn.net/wangrenhaioylj/article/details/108949468