An in-depth look at the data structures underlying MySQL indexes

1 Introduction

In our daily work we run into slow SQL statements. When analyzing them, we usually look at the execution plan to check whether an index is used. Often, after adjusting some query conditions and adding the necessary indexes, execution speeds up by several orders of magnitude. But have we ever asked why adding an index improves query efficiency, and why adding an index sometimes makes no difference at all? This article analyzes the data structures and algorithms underlying MySQL indexes in detail.

2 Index data structure comparison

Definition: an index is a sorted data structure that helps MySQL retrieve data efficiently.

Common data structures in indexes are as follows:

  • Hash table
  • Binary tree
  • Red-black tree
  • B-Tree
  • B+Tree

Hash table
A hash index computes a hash of the index key to locate the disk file pointer directly. Looking up a single key is very fast, but because hashing discards the order of the keys, range queries are not supported, and hash collisions can occur and have to be resolved.
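The trade-off can be illustrated with a minimal Python sketch, in which a plain dict stands in for the hash index. The keys and the "row@..." strings are made-up placeholders for disk pointers, not anything MySQL exposes.

```python
# Hypothetical data: index key -> row location (a stand-in for a disk pointer).
rows = {22: "row@0x07", 34: "row@0x56", 65: "row@0x6A", 89: "row@0x77"}


def hash_lookup(index, key):
    """Equality lookup: one hash computation, independent of key order."""
    return index.get(key)


def hash_range(index, low, high):
    """Range lookup: the table has no key order, so every entry is checked."""
    return sorted(k for k in index if low <= k <= high)


print(hash_lookup(rows, 65))     # row@0x6A
print(hash_range(rows, 30, 70))  # [34, 65]
```

The equality lookup touches one bucket, while the range query has to visit every entry, which is why a hash structure suits point lookups only.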

Binary tree
A binary search tree has the following property: every key in a node's left subtree is smaller than the node's key, and every key in its right subtree is larger. As shown in the figure below, if col2 is indexed, finding the row whose index value is 65 takes only two comparisons to reach the disk pointer address of that row.

But if the values increase monotonically, as when building an index on col1, a plain binary tree is no longer suitable: every new key becomes the right child of the previous one, so the tree degenerates into a linked list. The resulting structure is shown in the figure below. Finding the node with value 6 then requires visiting 6 nodes.
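The degeneration is easy to reproduce. The following is a minimal sketch of an unbalanced binary search tree in Python; the key values 1 to 7 and the helper names are chosen for illustration only.

```python
class Node:
    def __init__(self, key):
        self.key = key
        self.left = None
        self.right = None


def insert(root, key):
    """Insert key into an unbalanced binary search tree; return the root."""
    if root is None:
        return Node(key)
    node = root
    while True:
        if key < node.key:
            if node.left is None:
                node.left = Node(key)
                return root
            node = node.left
        else:
            if node.right is None:
                node.right = Node(key)
                return root
            node = node.right


def build(keys):
    root = None
    for key in keys:
        root = insert(root, key)
    return root


def nodes_visited(root, key):
    """Count the nodes visited while searching for key (None if absent)."""
    visited, node = 0, root
    while node is not None:
        visited += 1
        if key == node.key:
            return visited
        node = node.left if key < node.key else node.right
    return None


chain = build([1, 2, 3, 4, 5, 6, 7])     # ascending keys: a right-leaning chain
balanced = build([4, 2, 6, 1, 3, 5, 7])  # same keys in an order that balances

print(nodes_visited(chain, 6))     # 6
print(nodes_visited(balanced, 6))  # 2
```

The same seven keys cost 6 node visits or 2, depending only on insertion order, which is the weakness a self-balancing tree removes.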

Red-black tree
A red-black tree is a self-balancing binary search tree, which keeps lookups efficient. With the same data, finding the node with value 6 takes only 3 node visits. But the red-black tree also has a drawback: each node still has at most two children, so the height grows with the base-2 logarithm of the number of keys. With a large amount of data the tree becomes tall, and since each level may cost a disk IO, query efficiency drops considerably.

B-Tree
A B-Tree is a balanced multi-way search tree (not a binary tree) with the following characteristics:

  1. All leaf nodes have the same depth, and the child pointers of leaf nodes are null;
  2. Index elements are not duplicated;
  3. The index elements within a node are arranged in increasing order from left to right.

B+Tree
A B+Tree is a variant of the B-Tree with the following characteristics:

  1. Non-leaf nodes do not store data, only (redundant) index keys, so each node can hold more keys;
  2. Leaf nodes contain all index fields;
  3. Leaf nodes are linked by pointers, which improves the performance of range access.

Compared with a red-black tree, both the B-Tree and the B+Tree are shorter and wider: for the same order of magnitude of index data, they have far fewer levels.
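To get a feel for the difference in height, here is a rough Python estimate. It assumes a simplified model in which a tree with fanout f and h levels offers f^h positions at its bottom level; the figure of 20 million keys and the fanout of 1170 are borrowed from the estimate made later in this article, and a real red-black tree may be taller than this lower bound.

```python
def levels_needed(num_keys, fanout):
    """Smallest number of levels such that fanout ** levels >= num_keys."""
    levels, capacity = 1, fanout
    while capacity < num_keys:
        capacity *= fanout
        levels += 1
    return levels


NUM_KEYS = 20_000_000  # assumed data volume

print(levels_needed(NUM_KEYS, 2))     # 25 (binary fanout)
print(levels_needed(NUM_KEYS, 1170))  # 3  (wide nodes)
```

If every level costs one disk IO in the worst case, the wide tree needs about 3 reads where the binary tree needs about 25.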

A major difference between the B-Tree and the B+Tree is that the non-leaf nodes of a B+Tree store only keys, not values, while the leaf nodes store the complete set of key-value pairs in sorted order. The advantage is that each disk IO brings in more keys, so the degree of the tree (max. degree) can be larger, because the number of disk pages read per disk IO is fixed. For example, if each disk IO reads one page of 4 KB, then omitting the values lets the same page hold more keys, which lowers the tree and greatly reduces the number of disk IOs.

In addition, because a B+Tree keeps its keys sorted, range predicates (> and <) and ORDER BY in the database can rely on this order directly.
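A range scan over linked leaf nodes can be sketched as follows. This is a simplified Python model: the Leaf class and the key values are invented for illustration, and the scan starts at the first leaf, whereas a real index would first descend from the root to the leaf containing the lower bound.

```python
from bisect import bisect_left


class Leaf:
    """A leaf page: sorted keys plus a pointer to the next leaf."""

    def __init__(self, keys):
        self.keys = keys
        self.next = None


def link(leaves):
    """Chain the leaves left to right and return the first one."""
    for left, right in zip(leaves, leaves[1:]):
        left.next = right
    return leaves[0]


def range_scan(start_leaf, low, high):
    """Collect the keys in [low, high] by following next-pointers."""
    result = []
    leaf = start_leaf
    while leaf is not None:
        start = bisect_left(leaf.keys, low)  # skip keys below the range
        for key in leaf.keys[start:]:
            if key > high:
                return result                # keys are sorted: stop early
            result.append(key)
        leaf = leaf.next
    return result


first = link([Leaf([15, 18]), Leaf([20, 30, 49]), Leaf([50, 65, 80])])
print(range_scan(first, 18, 50))  # [18, 20, 30, 49, 50]
```

Because the leaves are already in key order, the scan never has to go back up the tree, and the result comes out sorted without an extra sort step.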

The main data structure MySQL uses for indexes is the B+Tree, chosen to reduce disk IO when reading data.

3 How a B+Tree index quickly searches tens of millions of rows

InnoDB limits the size of every B+Tree node, non-leaf nodes included (for example the root node at level h = 1 of a tree of height 3): a node occupies one page, which is 16 KB by default. The value can be queried with the following SQL statement:

SHOW GLOBAL STATUS LIKE 'Innodb_page_size';

The page size can be changed (through innodb_page_size, when the instance is initialized), but the fact that this default was chosen suggests that a larger page would hurt disk IO efficiency.

The execution result shows a value of 16384 bytes, that is, 16 KB.

Assume that every node of the B+Tree is full. The primary key is a BIGINT, which takes 8 B, and each pointer, which stores the file address of a child node, takes 6 B. In the last (leaf) level, assume each row of data takes 1 KB. Then:

  1. Each node in the first level holds at most 16 KB / (8 B + 6 B) ≈ 1170 key-pointer entries;
  2. Each node in the second level likewise holds at most 1170 entries;
  3. Each leaf node in the third level holds at most 16 KB / 1 KB = 16 rows.

A B+Tree of height 3 can therefore store up to 1170 × 1170 × 16 = 21,902,400 rows, roughly 20 million.
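The arithmetic can be checked with a few lines of Python. The sizes are the assumptions stated above (16 KB page, 8 B key, 6 B pointer, 1 KB row); the constant names are ours, and page headers and other per-page overhead are ignored.

```python
PAGE_SIZE = 16 * 1024  # 16 KB page, in bytes
KEY_SIZE = 8           # BIGINT primary key
POINTER_SIZE = 6       # child-page pointer, as assumed in the text
ROW_SIZE = 1024        # 1 KB per row, as assumed in the text

entries_per_internal_page = PAGE_SIZE // (KEY_SIZE + POINTER_SIZE)
rows_per_leaf_page = PAGE_SIZE // ROW_SIZE

# Two levels of internal pages fan out to the leaf pages of a 3-level tree.
total_rows = entries_per_internal_page ** 2 * rows_per_leaf_page

print(entries_per_internal_page)  # 1170
print(rows_per_leaf_page)         # 16
print(total_rows)                 # 21902400
```

Shrinking ROW_SIZE or KEY_SIZE raises the capacity, which is one reason narrow rows and compact primary keys keep the tree shallow.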

This analysis shows that a table backed by a B+Tree only three levels high can serve queries over tens of millions of rows. Moreover, MySQL generally keeps the B+Tree root node in memory, so a lookup then needs at most two disk IOs.

4 Storage engine index implementation

Where are indexes stored in MySQL? Like data, indexes are stored as files on disk.
In the MyISAM storage engine, data and indexes are stored in separate files: data goes in a file ending in .MYD, and indexes go in a file ending in .MYI.

In InnoDB, data and indexes are stored together. Note that in the figure below there is no file ending in .MYI, only a single file ending in .ibd.

In MyISAM, index files and data files are separate (non-clustered), and the primary key index and secondary (auxiliary) indexes are stored in the same way.

In InnoDB, the index and the data live in the same file (clustered), and the primary key index and secondary indexes are stored differently. As shown in the figure, the leaf nodes of the primary key index hold the full rows, whereas the leaf nodes of a secondary index hold only the indexed columns and the primary key value.
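The two-step lookup through a secondary index can be sketched in Python. Plain dicts stand in for the two B+Trees to keep the example short, and the table contents are invented.

```python
# Clustered (primary key) index: primary key -> full row.
clustered = {
    1: {"id": 1, "name": "Alice", "age": 30},
    2: {"id": 2, "name": "Bob", "age": 25},
    3: {"id": 3, "name": "Carol", "age": 35},
}

# Secondary index on name: indexed value -> primary key only, no row data.
secondary_name = {"Alice": 1, "Bob": 2, "Carol": 3}


def find_by_name(name):
    """Look up the primary key in the secondary index, then fetch the row."""
    pk = secondary_name.get(name)  # step 1: secondary index gives the key
    if pk is None:
        return None
    return clustered[pk]           # step 2: clustered index gives the row


print(find_by_name("Bob"))  # {'id': 2, 'name': 'Bob', 'age': 25}
```

The row exists in one place only, so moving or changing it never requires touching the secondary index unless the indexed column or the primary key itself changes.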

Here are a few questions to think about:

  • Why is it recommended that every InnoDB table have a primary key, preferably an auto-increment integer?
  • Why do the leaf nodes of a non-primary-key index store the primary key value?

If we do not define a primary key when creating a table, InnoDB uses the first unique index whose columns are all NOT NULL as the clustered index. If no such index exists, InnoDB creates a hidden row ID column to serve as the primary key. This adds extra work for MySQL, so it is recommended to always define a primary key when creating an InnoDB table.

Using an integer column as the primary key means, on the one hand, that no conversion is needed when comparing keys and, on the other hand, that storage is more compact. So why insist on auto-increment? If primary key values arrive in no particular order, a new value will likely land in the middle of an existing page and may force that page to split. MySQL then has to move data to make room for the new record, and the target page may already have been written back to disk and evicted from the cache, in which case it must first be read back from disk, adding considerable overhead. Frequent moves and page splits also cause fragmentation, leaving the index structure less compact, until the table has to be rebuilt with OPTIMIZE TABLE to repack the pages. By contrast, if keys are inserted in increasing order, each record is appended after the previous one in the current page, and a new page is allocated only when the current page is full. Pages are filled sequentially, so this is the most efficient case.
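The effect of insertion order can be simulated with a toy model in Python. It is only a sketch under simplifying assumptions: pages hold 4 records (PAGE_CAPACITY is a made-up value), a full page is split in half when a key lands in its middle, and appending past the last page simply starts a new page. InnoDB's actual split heuristics are more elaborate.

```python
import random
from bisect import bisect_right, insort

PAGE_CAPACITY = 4  # hypothetical; real pages hold many more records


def insert_keys(keys, capacity=PAGE_CAPACITY):
    """Insert keys into a chain of sorted pages.

    Returns (pages, records_moved); records_moved counts the records
    copied to a new page because a full page had to be split.
    """
    pages = [[]]
    moved = 0
    for key in keys:
        # Locate the page whose key range covers `key`.
        first_keys = [page[0] for page in pages[1:]]
        idx = bisect_right(first_keys, key)
        page = pages[idx]
        if len(page) < capacity:
            insort(page, key)
        elif idx == len(pages) - 1 and key > page[-1]:
            # Appending past the end of the last page: start a fresh page.
            pages.append([key])
        else:
            # Key falls inside a full page: split the page in half.
            mid = len(page) // 2
            right = page[mid:]
            del page[mid:]
            moved += len(right)
            pages.insert(idx + 1, right)
            insort(right if key >= right[0] else page, key)
    return pages, moved


sequential = list(range(1, 101))
shuffled = sequential[:]
random.Random(42).shuffle(shuffled)

seq_pages, seq_moved = insert_keys(sequential)
rnd_pages, rnd_moved = insert_keys(shuffled)

print(len(seq_pages), seq_moved)  # ascending keys: 25 full pages, 0 moved
print(len(rnd_pages), rnd_moved)  # shuffled keys: compare with the line above
```

With ascending keys every page ends up full and no record is ever moved; with shuffled keys, splits copy records around and leave pages partly empty.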

The leaf nodes of a non-primary-key index store the primary key value rather than the full row, mainly for consistency and to save space. If secondary indexes also stored the row data, every copy would have to be updated whenever a row is inserted or modified, which increases the cost of writes and makes it harder to keep the copies consistent. It would also waste space, because the same data would be stored redundantly in every index.

5 What is the underlying data structure of the joint index

A joint index, also called a composite index, is an index over several columns. Consider the following table:

CREATE TABLE `test` (
`id` bigint NOT NULL AUTO_INCREMENT,
`name` varchar(24) NOT NULL,
`age` int NOT NULL,
`position` varchar(32) NOT NULL,
`address` varchar(128) NOT NULL,
`birthday` date NOT NULL,
PRIMARY KEY (`id`),
UNIQUE KEY `idx_name_age_position` (`name`,`age`,`position`) USING BTREE
) ENGINE=InnoDB AUTO_INCREMENT=1 DEFAULT CHARSET=utf8;

The following index is a joint index.

`idx_name_age_position` (`name`,`age`,`position`) USING BTREE

What does the underlying data structure of the joint index look like?

Entries in a joint index are ordered by comparing the first column first; if those values are equal, the second column is compared, and so on.
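This column-by-column comparison can be written out explicitly. The Python sketch below uses invented (name, age, position) entries; note that Python compares strings by code point, whereas MySQL compares them according to the column's collation.

```python
from functools import cmp_to_key


def compare_entries(a, b):
    """Compare two index entries column by column, left to right."""
    for x, y in zip(a, b):
        if x < y:
            return -1
        if x > y:
            return 1
    return 0  # all columns equal


entries = [
    ("Bill", 31, "dev"),
    ("Alice", 30, "qa"),
    ("Bill", 30, "manager"),
    ("Alice", 30, "dev"),
    ("Alice", 25, "dev"),
]

for entry in sorted(entries, key=cmp_to_key(compare_entries)):
    print(entry)
# ('Alice', 25, 'dev')
# ('Alice', 30, 'dev')
# ('Alice', 30, 'qa')
# ('Bill', 30, 'manager')
# ('Bill', 31, 'dev')
```

Within one name the entries are ordered by age, and within one (name, age) pair by position; across different names, the ages are not in order.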

Once we understand how a joint index is stored, the leftmost prefix principle follows naturally. When a joint index is used, the order in which its columns are defined determines how a query can use the index. For example, with the joint index (name, age, position), MySQL matches from the leftmost column first. If the query does not filter on the leading column name, the index cannot be used to narrow the search; unless the query can be answered from the index alone (a covering index), MySQL can only scan the entire table.

The reason lies in the underlying data structure: MySQL matches the first column of the joint index first, and then the next column. If no value is given for the first column, there is no way to tell which node to descend into next.
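A sorted list of tuples is enough to see why. In the Python sketch below (invented data; a flat sorted list stands in for the leaf level of the index), a lookup by name can binary-search to its starting point, while a lookup by age alone has nowhere to start and must examine every entry.

```python
from bisect import bisect_left

# Entries of a hypothetical (name, age, position) index, kept in sorted order.
index = sorted([
    ("Alice", 25, "dev"),
    ("Alice", 30, "dev"),
    ("Alice", 30, "qa"),
    ("Bill", 25, "dev"),
    ("Bill", 30, "manager"),
    ("Bill", 31, "dev"),
])


def find_by_name(entries, name):
    """Leading column given: binary search to the first match, read forward."""
    i = bisect_left(entries, (name,))  # (name,) sorts before any (name, ...)
    result = []
    while i < len(entries) and entries[i][0] == name:
        result.append(entries[i])
        i += 1
    return result


def find_by_age(entries, age):
    """Leading column missing: matches are scattered, so scan every entry."""
    return [entry for entry in entries if entry[1] == age]


print(find_by_name(index, "Bill"))
# [('Bill', 25, 'dev'), ('Bill', 30, 'manager'), ('Bill', 31, 'dev')]
print(find_by_age(index, 30))
# [('Alice', 30, 'dev'), ('Alice', 30, 'qa'), ('Bill', 30, 'manager')]
```

The entries with age 30 sit at positions 1, 2 and 4 of the sorted list, not next to each other, so the ordering of the index gives no help in finding them.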

6 Summary

An index is essentially a sorted data structure. Understanding the underlying data structures and storage principles of MySQL indexes helps us optimize SQL. Index tuning is a practical skill that cannot rely on theory alone: real workloads vary endlessly, and MySQL itself has complex mechanisms, such as the query optimizer's strategies and the implementation differences between storage engines, which complicate matters further. At the same time, theory is the foundation of index tuning. Only by understanding the theory can we reason about tuning strategies and the mechanisms behind them, and then, through continued experiment and exploration in practice, achieve the goal of using MySQL indexes efficiently.

Origin blog.csdn.net/crg18438610577/article/details/130019132