Why is the MySQL index implemented with a B+ tree?

First, you need to understand what a B-tree is and what a B+ tree is.

What is a B-tree?

A self-balancing binary tree keeps the time complexity of a query at O(log n), but because it is still a binary tree, each node can have at most 2 child nodes. As the number of nodes grows, the height of the tree grows accordingly, which increases the number of disk I/Os and hurts query efficiency.

To reduce the height of the tree, the B-tree was introduced. It no longer limits a node to 2 child nodes but allows up to M child nodes (M > 2), which reduces the height of the tree.

Each node of a B-tree can have at most M child nodes. M is called the order of the B-tree, so the B-tree is a multi-way tree.
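To get a feel for how much the order M matters, here is a minimal sketch that counts the levels a completely full tree needs to hold a given number of keys. The function name `min_levels` and the key counts are illustrative assumptions, and real trees are rarely completely full, so treat the results as lower bounds.

```python
def min_levels(n_keys: int, fanout: int) -> int:
    """Smallest number of levels a tree needs to hold n_keys.

    Assumes every node is full: fanout - 1 keys and fanout children,
    so a tree with h levels holds fanout**h - 1 keys.
    """
    levels = 0
    capacity = 0  # keys held by a full tree with `levels` levels
    while capacity < n_keys:
        levels += 1
        capacity = fanout ** levels - 1
    return levels


# 10 million keys: a binary tree (fanout 2) vs. a 100-way tree
print(min_levels(10_000_000, 2))    # 24
print(min_levels(10_000_000, 100))  # 4
```

If each level costs one disk read, the same lookup needs roughly 24 reads in the binary tree and 4 in the 100-way tree.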

Assuming M = 3, we have a 3-order B-tree. Each node holds at most 2 (M-1) data items and has at most 3 (M) child nodes. If these limits are exceeded, the node is split, as in the following animation:
[Animation: node splitting in a 3-order B-tree]
Let's take a look at the query process of a 3-order B-tree:
[Figure: querying a 3-order B-tree]

Assume that we are looking for the record with index value 9 in the 3-order B-tree in the figure above. The query proceeds as follows:

  1. Compare 9 with the indexes of the root node (4, 8). Since 9 is greater than 8, go to the child node on the right;
  2. The indexes of that child node are (10, 12). Since 9 is less than 10, go to its left child node;
  3. That node holds index 9, so we have found the record with index value 9.
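The three steps above can be sketched in a few lines of Python. This is only an illustration: the `BTreeNode` class and `search` function are hypothetical names, records are omitted, and only the nodes (4, 8), (10, 12) and (9) come from the example. The other nodes are invented so that the tree is complete.

```python
from bisect import bisect_left
from dataclasses import dataclass, field


@dataclass
class BTreeNode:
    keys: list                                     # sorted index values in this node
    children: list = field(default_factory=list)   # empty for a leaf


def search(node, key, visited=None):
    """Return (found, visited), where visited lists the keys of every node read."""
    visited = [] if visited is None else visited
    visited.append(node.keys)                      # one node read ~ one disk I/O
    i = bisect_left(node.keys, key)                # first position with keys[i] >= key
    if i < len(node.keys) and node.keys[i] == key:
        return True, visited
    if not node.children:                          # reached a leaf without a match
        return False, visited
    return search(node.children[i], key, visited)


# Nodes (4, 8), (10, 12) and (9) come from the example; the rest are invented.
root = BTreeNode([4, 8], [
    BTreeNode([2], [BTreeNode([1]), BTreeNode([3])]),
    BTreeNode([6], [BTreeNode([5]), BTreeNode([7])]),
    BTreeNode([10, 12], [BTreeNode([9]), BTreeNode([11]), BTreeNode([13])]),
])

print(search(root, 9))  # (True, [[4, 8], [10, 12], [9]])
print(search(root, 8))  # (True, [[4, 8]])
```

Looking up 9 reads three nodes, one per level, while looking up 8 stops at the root because a B-tree keeps data in every node.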

It can be seen that when a 3-order B-tree queries data in a leaf node, three disk I/O operations occur during the query, because the height of the tree is 3.

If the same data were stored in a balanced binary tree, the tree would be much higher, which means more disk I/O operations. Therefore, a B-tree is more efficient than a balanced binary tree for data queries.

However, each node of the B-tree contains data (index + record), and the size of a user record is likely to far exceed the size of the index, so more disk I/O operations are needed to read the "useful index data".

Moreover, when we query a node at the bottom (say, record A), the record data in the "non-A record nodes" along the way is also loaded from disk into memory. That record data is useless to us: we only want the index data of those nodes for comparison. This not only increases the number of disk I/O operations but also takes up memory.

In addition, a range query on a B-tree requires an in-order traversal, which involves disk I/O on multiple nodes and slows the query down.

What is a B+ tree?

The B+ tree is an upgrade of the B-tree, and it is the data structure MySQL uses for its indexes. The B+ tree structure is as follows:
[Figure: B+ tree structure]
The differences between the B+ tree and the B-tree are mainly the following:

  • Only leaf nodes (the bottom nodes) store actual data (index + record); non-leaf nodes store only indexes;
  • All indexes appear in the leaf nodes, and the leaf nodes form an ordered linked list;
  • Each index in a non-leaf node also exists in its child nodes and is the maximum (or minimum) of all indexes in the corresponding child node;
  • A non-leaf node has as many indexes as it has child nodes.

Below, we compare the performance of the B+ tree and the B-tree in three aspects.

1. Single point query

When a B-tree performs a single index query, the best case costs O(1), namely when the index is found in the root node. On average, it is slightly faster than a B+ tree.

However, B-tree query times fluctuate more. Because every node stores both index and record, sometimes the index is found in a non-leaf node, and sometimes the search has to go all the way down to a leaf node.

The non-leaf nodes of a B+ tree store only indexes, not actual record data. Therefore, for the same amount of data, the non-leaf nodes of a B+ tree can hold more indexes than those of a B-tree, which stores both indexes and records. The B+ tree can thus be "chunkier" (shorter and wider) than the B-tree, and fewer disk I/Os are needed to reach the bottom nodes.
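A back-of-the-envelope calculation makes the "chunkier" claim concrete. The sketch below assumes a 16 KB node (InnoDB's default page size) and made-up sizes for index values, pointers and records. It also ignores page headers and other overhead, so the figures are rough estimates, not measurements.

```python
PAGE_SIZE = 16 * 1024   # bytes per node (InnoDB's default page size)
KEY_BYTES = 8           # assumed: a BIGINT-sized index value
POINTER_BYTES = 6       # assumed: size of a child pointer
RECORD_BYTES = 1024     # assumed: size of one full row

# B+ tree: non-leaf nodes hold only (index, pointer) pairs
bplus_fanout = PAGE_SIZE // (KEY_BYTES + POINTER_BYTES)
rows_per_leaf = PAGE_SIZE // RECORD_BYTES

# B-tree: every node holds (index, record, pointer) triples
btree_keys_per_node = PAGE_SIZE // (KEY_BYTES + RECORD_BYTES + POINTER_BYTES)


def bplus_capacity(levels: int) -> int:
    """Rows reachable through a full B+ tree with the given number of levels."""
    return bplus_fanout ** (levels - 1) * rows_per_leaf


def btree_capacity(levels: int) -> int:
    """Rows stored in a full B-tree: k keys and k + 1 children per node."""
    return (btree_keys_per_node + 1) ** levels - 1


print(bplus_fanout, btree_keys_per_node)  # 1170 15
print(bplus_capacity(3))                  # 21902400
print(btree_capacity(3))                  # 4095
```

Under these assumptions, three levels of a B+ tree reach about 21.9 million rows, while three levels of a B-tree with the same node size hold only 4095.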

2. Insertion and deletion efficiency

The B+ tree has a large number of redundant nodes (the indexes in non-leaf nodes are copies of indexes stored in the leaf nodes), so when a record is deleted, it can be removed directly from the leaf node, and the non-leaf nodes can often be left untouched. Deletion is therefore very fast.

Note: B+ trees may be defined differently with respect to the number of child nodes and indexes of non-leaf nodes. Some definitions say that a non-leaf node has M child nodes and M-1 indexes (this is the definition in Wikipedia), so the animations about B+ trees in this article are all based on that. But when I introduced the differences between B+ trees and B-trees earlier, I said that "there are as many indexes as there are child nodes in non-leaf nodes." This is mainly because the B+ tree used by MySQL has this feature.

Even when an index that appears in the root node of a B+ tree is deleted, no complex tree deformation occurs, thanks to the redundant nodes.
A B-tree is different: it has no redundant nodes, so deletion is very complicated. For example, deleting data in the root node may involve complex tree deformation.

The same is true for insertion into a B+ tree. Thanks to the redundant nodes, an insertion may involve node splitting (if a node is full), but it affects at most one path of the tree. Moreover, the B+ tree rebalances itself automatically and does not need more complex algorithms such as the rotation operations of red-black trees.

Therefore, insertion and deletion of B+ trees are more efficient.
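As a rough illustration of the leaf split mentioned above, the sketch below inserts a key into one sorted leaf and splits it when it overflows. The function name `insert_into_leaf` and the `max_keys` parameter are hypothetical, and propagating the separator into the parent node is left out.

```python
from bisect import insort


def insert_into_leaf(keys, key, max_keys):
    """Insert key into a sorted leaf and split the leaf if it overflows.

    Returns (left, right, separator). right and separator are None
    when the leaf still fits and no split is needed.
    """
    keys = list(keys)          # work on a copy, keep the caller's list intact
    insort(keys, key)          # keep the leaf sorted
    if len(keys) <= max_keys:
        return keys, None, None
    mid = len(keys) // 2
    left, right = keys[:mid], keys[mid:]
    # The separator is a copy: the key also stays in the right leaf.
    # This copy is the "redundant" index the text refers to.
    return left, right, right[0]


print(insert_into_leaf([1], 3, max_keys=2))     # ([1, 3], None, None)
print(insert_into_leaf([1, 3], 2, max_keys=2))  # ([1], [2, 3], 2)
```

Only the parent on the same root-to-leaf path has to receive the separator, which is why a split stays confined to one path of the tree.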

3. Range query

The principles of equality queries on B-trees and B+ trees are basically the same: start from the root node, compare the target against the ranges in the node, and recurse into the matching child node.

Because all leaf nodes of the B+ tree are connected in a linked list, this design is very helpful for range searches. For example, if we want to find the orders between December 1st and December 12th, we can first locate the leaf node that contains December 1st and then traverse to the right along the linked list until we reach the node that contains December 12th. This eliminates the need to search from the root node again for every value, further saving query time.
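The December example can be sketched as follows. This is a simplified two-level B+ tree: `separators` plays the role of a single non-leaf node, the `Leaf` class and `range_scan` function are hypothetical names, and the order dates (including the year) are invented.

```python
from bisect import bisect_right
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class Leaf:
    keys: list
    next: Optional["Leaf"] = None   # link to the right sibling


def range_scan(separators, leaves, low, high):
    """Descend once to the leaf that may hold `low`, then follow the links."""
    # separators[i] is the smallest key of leaves[i + 1]
    leaf = leaves[bisect_right(separators, low)]
    result = []
    while leaf is not None:
        for key in leaf.keys:
            if key > high:          # past the end of the range: stop
                return result
            if key >= low:
                result.append(key)
        leaf = leaf.next            # move right without going back to the root
    return result


leaves = [
    Leaf([date(2023, 11, 28), date(2023, 11, 30)]),
    Leaf([date(2023, 12, 1), date(2023, 12, 5)]),
    Leaf([date(2023, 12, 9), date(2023, 12, 12)]),
    Leaf([date(2023, 12, 15), date(2023, 12, 20)]),
]
for left, right in zip(leaves, leaves[1:]):
    left.next = right
separators = [leaf.keys[0] for leaf in leaves[1:]]

orders = range_scan(separators, leaves, date(2023, 12, 1), date(2023, 12, 12))
print([d.day for d in orders])  # [1, 5, 9, 12]
```

The scan touches the non-leaf level once and then reads only the leaves that overlap the range, stopping at the first key beyond December 12th.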

The B-tree has no linked list connecting its leaf nodes, so a range query can only be completed by traversing the tree, which involves disk I/O on multiple nodes. Its range query efficiency is therefore not as good as that of the B+ tree.

Therefore, B+ trees suit scenarios with a large number of range retrievals, such as databases. For scenarios dominated by single index queries, a B-tree can be considered, as in MongoDB among NoSQL databases.

MySQL's storage method depends on the storage engine. The most commonly used one is the InnoDB storage engine, which uses the B+ tree as its index data structure.

B+ tree in MySQL
[Figure: B+ tree in MySQL]

But the B+ tree used by InnoDB has some special features, such as:

The leaf nodes of the B+ tree are connected by a doubly linked list. The advantage is that the leaves can be traversed both to the right and to the left.

The content of the B+ tree node is a data page, which stores user records and various information. The default size of each data page is 16 KB.

InnoDB indexes are divided into clustered indexes and secondary indexes. The difference is that the leaf nodes of the clustered index store the actual data (all complete user records are stored there), while the leaf nodes of a secondary index store the primary key value, not the actual data.

Because the table's data is stored in the leaf nodes of the clustered index, the InnoDB storage engine always creates a clustered index for a table. Since only one copy of the data is physically stored, there can be only one clustered index, while multiple secondary indexes can be created.
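The relationship between the two kinds of index can be mimicked with two lookups. In this sketch, plain Python dicts stand in for the two B+ trees, and the table, its columns and the function name `find_by_name` are all hypothetical.

```python
# Clustered index: primary key -> complete row (stand-in for its leaf nodes)
clustered_index = {
    1: {"id": 1, "name": "alice", "age": 30},
    2: {"id": 2, "name": "bob", "age": 25},
}

# Secondary index on `name`: indexed value -> primary key only
secondary_index_on_name = {"alice": 1, "bob": 2}


def find_by_name(name):
    """Look up a row through the secondary index, then the clustered index."""
    pk = secondary_index_on_name.get(name)   # step 1: secondary index yields the primary key
    if pk is None:
        return None
    return clustered_index[pk]               # step 2: clustered index yields the full row


print(find_by_name("bob"))    # {'id': 2, 'name': 'bob', 'age': 25}
print(find_by_name("carol"))  # None
```

A query through a secondary index therefore pays for two tree searches, unless the primary key value itself is all the query needs.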

After understanding the B-tree and the B+ tree, let's think more thoroughly about why MySQL indexes are implemented with B+ trees.

Why do you need an index?

In a database, an index is a special data structure used to speed up data retrieval. With an index, the database can quickly locate records, which improves query efficiency. Without an index, the records in a table have to be scanned one by one, and query efficiency drops sharply.
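The difference between scanning and using an index can be illustrated with a toy table. In this sketch, a sorted Python list searched with `bisect` stands in for the index (not a real B+ tree), and the table contents and function names are invented.

```python
from bisect import bisect_left


def full_scan(rows, wanted_id):
    """Check rows one by one; return (row, number of rows examined)."""
    for examined, row in enumerate(rows, start=1):
        if row[0] == wanted_id:
            return row, examined
    return None, len(rows)


def indexed_lookup(rows, index, wanted_id):
    """index is a sorted list of (id, position of the row in rows)."""
    i = bisect_left(index, (wanted_id,))     # binary search, about log2(n) comparisons
    if i < len(index) and index[i][0] == wanted_id:
        return rows[index[i][1]]
    return None


# A hypothetical table of 1000 rows stored in descending id order
rows = [(i, f"user{i}") for i in range(1000, 0, -1)]
index = sorted((row[0], pos) for pos, row in enumerate(rows))

print(full_scan(rows, 1))              # ((1, 'user1'), 1000)
print(indexed_lookup(rows, index, 1))  # (1, 'user1')
```

The scan has to examine all 1000 rows to find id 1, while the binary search over the sorted index needs only about 10 comparisons.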

Summary: Why choose B+ tree as the index structure?

When choosing an index structure, you need to consider the following factors:

  • The query operations that must be supported
  • Data size
  • Frequency of data insertion, update, and deletion operations

B+ trees just meet these requirements and have the following characteristics:

  • The leaf nodes of the B+ tree contain all keywords and pointers to data, so the B+ tree supports very efficient range query operations, such as "find all people aged between 20 and 30 years old".
  • B+ trees are very suitable for storing large amounts of data. Each node can store multiple keywords and pointers, which keeps the tree short and greatly reduces the number of disk I/Os.
  • Compared with other data structures, insertion and deletion in a B+ tree are very efficient. Most insertions and deletions touch only the leaf nodes, and when a node does have to be split, the change stays on a single path of the tree.

Therefore, MySQL chooses to use B+ tree as the index structure to achieve efficient retrieval, especially for query and update operations of large amounts of data.

Having read this far, I believe everyone not only understands indexes but also has a clearer understanding of how the B+ tree is implemented. The following is a diagram generated by the thinking monkey based on ChatGPT.
[Figure: summary diagram]

Origin blog.csdn.net/qq_45442178/article/details/130866057