Detailed explanation of MySQL's InnoDB index principle

Summary

This article introduces InnoDB indexes in MySQL, from the various tree structures to index principles to storage details.

InnoDB is the default storage engine for MySQL (before MySQL 5.5.5 the default was MyISAM). To keep the learning focused, this article mainly covers InnoDB and touches on MyISAM only for comparison.

This article is a summary I wrote while learning. The content comes mainly from books and blogs (references will be given), with some of my own understanding added along the way. Please point out anything that is inaccurate.

1. Various tree structures

I did not originally plan to start with the binary search tree, since the Internet already has plenty of articles on it. But clear diagrams help a great deal in understanding the problem, and for the sake of completeness I added this part after all.

Let's take a look at several tree structures:

1. Binary search tree: each node has at most two child nodes, so the height of the tree grows quickly as the data volume increases. This makes it unsuitable as the underlying structure for storing large amounts of data.

2. B-tree: A B-tree of order m is a balanced m-way search tree. Its most important property is that the number of keys j in each non-root node satisfies ⌈m/2⌉ - 1 <= j <= m - 1. A non-leaf node has one more child than it has keys, so each key acts as the dividing point between two adjacent children. Illustrations usually draw each key between its two child nodes, which is intuitive and makes a B-tree easy to tell apart from the B+ tree below. Because data is stored in both leaf and non-leaf nodes, the keys of a B-tree cannot be walked in order by a simple scan; an in-order traversal of the tree is required.

3. B+ tree: A B+ tree of order m is also a balanced m-way search tree. In the variant described here, the number of keys j in each non-root node satisfies ⌈m/2⌉ - 1 <= j <= m, and a node has at most as many subtrees as keys. Each key in a non-leaf node is the smallest key of the corresponding subtree. Data is stored only in leaf nodes, and horizontal pointers link the leaf nodes, so traversing all the data in order is very easy.

4. B* tree: A B* tree of order m is also a balanced m-way search tree. Its two most important properties are: (1) the number of keys j in each non-root node satisfies ⌈2m/3⌉ - 1 <= j <= m; (2) horizontal pointers are also added between non-leaf nodes.
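To make the three bounds concrete, here is a minimal Python sketch that evaluates them for a given order m. The names `ceil_div` and `key_bounds` are made up for this illustration, and the bounds are simply the ones stated in the list above, not taken from any particular implementation.

```python
def ceil_div(a, b):
    """Integer ceiling of a / b, avoiding floating point."""
    return -(-a // b)

def key_bounds(kind, m):
    """Return (min_keys, max_keys) for a non-root node of an order-m tree,
    using the bounds stated in this article."""
    if kind == "B":
        return ceil_div(m, 2) - 1, m - 1
    if kind == "B+":
        return ceil_div(m, 2) - 1, m
    if kind == "B*":
        return ceil_div(2 * m, 3) - 1, m
    raise ValueError(f"unknown tree kind: {kind}")

for kind in ("B", "B+", "B*"):
    print(kind, key_bounds(kind, 5))
```

For m = 5 the B* tree has the highest minimum of the three, which is the "at least 2/3 full" requirement at work.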

The B, B+, and B* trees support similar operations, such as retrieving, inserting, and deleting nodes. Here we focus only on insertion, and only on the case where the current node is already full, because this case is slightly complicated and fully reflects the differences between the trees. By comparison, retrieval is easier to implement, and deletion is roughly the reverse of insertion (in practice, deletion is not the exact reverse: it often just removes the data and keeps the space for later reuse).

First look at splitting in the B-tree. In the figure below, the red value is the node newly inserted at each step. Whenever a node is full, a split must occur (splitting is recursive: see the insertion of 7 below, which causes a two-level split). Because the non-leaf nodes of a B-tree also hold key values, the values of a full node end up in three places after the split: (1) the original node, (2) the parent of the original node, and (3) a new sibling of the original node (see the insertions of 5 and 7). A split may increase the height of the tree (see the insertions of 3 and 7) or leave it unchanged (see the insertions of 5 and 6).

Splitting in the B+ tree: when a node is full, a new node is allocated, half of the data in the original node is copied to the new node, and finally a pointer to the new node is added to the parent. A B+ tree split affects only the original node and its parent, not the sibling nodes, so the split itself does not need a pointer to a sibling.
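The leaf split just described can be sketched in a few lines of Python. This is a simplified illustration, not InnoDB's code: a leaf is a plain sorted list, `insert_into_leaf` and `capacity` are hypothetical names, and the separator handed to the parent is the smallest key of the new node, following the B+ tree variant described earlier.

```python
import bisect

def insert_into_leaf(leaf, key, capacity):
    """Insert key into a sorted leaf (modified in place).

    Returns (left, right, separator). When the leaf does not overflow,
    right and separator are None. On overflow the upper half of the keys
    goes to a new right node, and that node's smallest key is the
    separator to add to the parent.
    """
    bisect.insort(leaf, key)
    if len(leaf) <= capacity:
        return leaf, None, None
    mid = len(leaf) // 2
    left, right = leaf[:mid], leaf[mid:]
    return left, right, right[0]

left, right, separator = insert_into_leaf([1, 2, 3, 4], 5, capacity=4)
print(left, right, separator)
```

Only the original node and the parent are touched, which matches the point above that sibling nodes play no part in the split.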

Splitting in the B* tree: when a node is full and its next sibling is not, part of the data is moved to the sibling, the key is inserted into the original node, and the sibling's key in the parent is updated (because the sibling's key range has changed). If the sibling is also full, a new node is added between the original node and the sibling, 1/3 of the data from each is copied to the new node, and finally a pointer to the new node is added to the parent. The B* tree split is clever because the B* tree must keep every node at least 2/3 full after a split. Splitting a full node in two, as the B+ tree does, would leave each node only 1/2 full, which violates that requirement. So the B* tree's strategy is to keep pushing data into the sibling once the current node is full (this is why the B* tree needs sibling links between non-leaf nodes) until the sibling is full too. Then the node and its sibling each contribute 1/3 of their data to set up a new node, and the resulting 3 nodes are each about 2/3 full, meeting the requirement. Everyone is happy.
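The two cases can be sketched as follows. This is a deliberately simplified model: nodes are sorted Python lists, the function name `insert_when_full` is hypothetical, and the first case rebalances the keys evenly across the two nodes rather than moving the minimum number of keys, which is an assumption made to keep the sketch short.

```python
def insert_when_full(node, sibling, key, capacity):
    """Insert key when `node` is full. `sibling` is its right neighbour
    and holds larger keys. Returns the resulting nodes, left to right."""
    merged = sorted(node + sibling + [key])
    if len(sibling) < capacity:
        # Sibling has room: spread the keys over the two existing nodes.
        half = (len(merged) + 1) // 2
        return [merged[:half], merged[half:]]
    # Both nodes are full: two nodes become three, each about 2/3 full.
    base, extra = divmod(len(merged), 3)
    sizes = [base + (1 if i < extra else 0) for i in range(3)]
    result, start = [], 0
    for size in sizes:
        result.append(merged[start:start + size])
        start += size
    return result

full = [1, 2, 3, 4, 5, 6]
print(insert_when_full(full, [10, 11], 7, capacity=6))
print(insert_when_full(full, [10, 11, 12, 13, 14, 15], 7, capacity=6))
```

In the second call, 13 keys are spread over three nodes of capacity 6, so every node holds at least 4 keys, which is the 2/3 fill the B* tree asks for.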

The B+ tree suits databases as a basic structure precisely because of the computer's two-layer storage hierarchy of memory and mechanical hard disk. Memory supports fast random access (given an arbitrary address, return the data stored there) but has small capacity. Random access on a hard disk requires mechanical movement ((1) the head seeks and (2) the platter rotates), so it is several orders of magnitude slower than memory, but the disk has much larger capacity. A typical database is far larger than the available memory, which means that retrieving a piece of data from the B+ tree is likely to take several disk I/O operations. As shown in the following figure, each step down to a child node may be a disk I/O, although non-leaf nodes are usually loaded into memory early on to speed up access. In addition, to speed up the search within a node, the CPU computation / memory read shown in blue in the figure may be optimized into a binary search in a real database (the page directory mechanism in InnoDB).
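A small Python sketch may help connect the search path to the cost. It counts how many nodes are read on the way down (each read is a potential disk I/O) and uses binary search inside each node. The `Node` class and `search` function are hypothetical, and the convention that `keys[i]` is the smallest key under `children[i]` follows the B+ tree variant described earlier.

```python
import bisect
from dataclasses import dataclass, field

@dataclass
class Node:
    keys: list
    children: list = field(default_factory=list)  # empty for leaf nodes

def search(root, key):
    """Return (found, nodes_read) for a lookup of key."""
    node, nodes_read = root, 1
    while node.children:
        # Binary search for the last separator that is <= key.
        i = max(bisect.bisect_right(node.keys, key) - 1, 0)
        node = node.children[i]
        nodes_read += 1
    i = bisect.bisect_left(node.keys, key)
    found = i < len(node.keys) and node.keys[i] == key
    return found, nodes_read

leaves = [Node([1, 3]), Node([5, 7]), Node([9, 11])]
root = Node([1, 5, 9], leaves)
print(search(root, 7))
print(search(root, 4))
```

The number of nodes read equals the height of the tree, which is why a flat tree with cached non-leaf nodes needs so few disk reads.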

The B+ tree in a real database is very flat. You can verify how flat InnoDB's B+ tree is by sequentially inserting enough data into a table. We create a test table with only simple fields using the CREATE statement shown below, and then keep adding rows to fill the table. From the statistics in the figure below (see reference 1 for the source), several intuitive conclusions follow, which show the scale of a B+ tree in a database.

1. Each leaf node stores 468 rows of data, and each non-leaf node stores about 1200 key values, which makes this a balanced search tree of roughly 1200 ways!

2. A table of 22.1G fits in a B+ tree of height only 3, and this capacity can probably meet the needs of many applications. If the height grows to 4, the capacity of the B+ tree jumps to a huge 25.9T!

3. For a table of 22.1G, the height of the B+ tree is 3, and loading all of its non-leaf nodes into memory takes only about 18.8M. How do we reach this conclusion? A tree of height 2 has 1203 leaf nodes, which need only 18.8M of space, while the 22.1G table of height 3 has 1204 non-leaf nodes. We also assume that a leaf node is at least as large as a non-leaf node, because a leaf node stores row data while a non-leaf node holds only keys and a small amount of other data. So the 1204 non-leaf nodes need about the same space as those 1203 leaf nodes. With this small amount of memory, retrieving the required data needs only one disk I/O operation, which is very efficient.
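The figures in the three points above can be reproduced with a little arithmetic. The sketch below assumes a fan-out of exactly 1203 children per non-leaf node (inferred from the "1203 leaf nodes" at height 2), that every node is one full 16 KiB page, and that the G/T/M figures are binary units; these are assumptions for the calculation, not measured values.

```python
PAGE_SIZE = 16 * 1024   # bytes per node, assuming one 16 KiB page each
FANOUT = 1203           # children per non-leaf node (assumed from the text)
ROWS_PER_LEAF = 468     # rows per leaf node, as stated above

def tree_stats(height):
    """Sizes of a completely full B+ tree of the given height."""
    leaf_pages = FANOUT ** (height - 1)
    non_leaf_pages = sum(FANOUT ** level for level in range(height - 1))
    return {
        "leaf_pages": leaf_pages,
        "non_leaf_pages": non_leaf_pages,
        "rows": leaf_pages * ROWS_PER_LEAF,
        "leaf_bytes": leaf_pages * PAGE_SIZE,
        "non_leaf_bytes": non_leaf_pages * PAGE_SIZE,
    }

for height in (2, 3, 4):
    stats = tree_stats(height)
    print(height, stats["leaf_pages"], stats["non_leaf_pages"],
          round(stats["leaf_bytes"] / 2 ** 30, 2), "GiB of leaf data")
```

Under these assumptions the height-3 tree comes out at about 22.1 GiB of leaf data with 1204 non-leaf pages, and the height-4 tree at about 25.9 TiB, in line with the numbers quoted above.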

2. MySQL storage engines and indexes

A database effectively must have indexes. Without one, retrieval degenerates into a sequential scan, and O(n) time complexity is almost unbearable. It is easy to imagine how a table consisting of a single key can be indexed with a B+ tree: just store the key in the nodes of the tree. When a record contains multiple fields, however, one B+ tree can be ordered only by the primary key. A search on a non-primary-key field cannot use the primary key index and falls back to a sequential scan. In that case a second index should be built on the column being searched, organized as a separate B+ tree.

There are two common ways to let multiple B+ trees access the same table data: one is called a clustered index, and the other a non-clustered index. Although both names contain the word index, neither is a separate index type; each is a way of storing data. With clustered index storage, the row data is stored together with the primary key B+ tree, and a secondary key B+ tree stores only the secondary key and the primary key, so the primary key tree and the secondary key trees are practically two different kinds of tree. With non-clustered index storage, the primary key B+ tree stores pointers to the real data rows at its leaf nodes instead of the rows themselves, and a secondary key B+ tree stores such pointers as well, not the primary key.

InnoDB uses a clustered index: the primary key is organized into a B+ tree, and the row data is stored in the leaf nodes. If a query searches the primary key with a condition such as "where id = 14", the B+ tree search algorithm finds the corresponding leaf node and the row data is obtained directly. A search on the Name column takes two steps. The first step searches for the Name in the secondary index B+ tree and reaches a leaf node that holds the corresponding primary key. The second step uses that primary key to run another B+ tree search in the primary index B+ tree, finally reaching the leaf node that holds the whole row.

MyISAM uses a non-clustered index. Its two B+ trees look no different from each other: the node structure is exactly the same and only the stored content differs. The nodes of the primary key index B+ tree store the primary key, and the nodes of the secondary key index B+ tree store the secondary key. The table data is stored in a separate place, and the leaf nodes of both B+ trees use an address to point to the real table data, so from the table data's point of view the two keys are no different. Because the index trees are independent, a search by the secondary key does not need to visit the primary key index tree.
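The two lookup paths can be contrasted with a toy model in Python. Plain dicts stand in for the B+ trees, so only the number and order of index searches is modeled; all table contents and function names here are hypothetical sample data, not anything from MySQL itself.

```python
# InnoDB style (clustered): the rows live in the primary key tree.
innodb_primary = {
    14: {"id": 14, "name": "Alice"},
    18: {"id": 18, "name": "Bob"},
}
innodb_secondary = {"Alice": 14, "Bob": 18}   # name -> primary key

def innodb_find_by_name(name):
    pk = innodb_secondary[name]    # search 1: secondary index tree
    return innodb_primary[pk]      # search 2: primary index tree

# MyISAM style (non-clustered): the rows live in a separate data area.
myisam_rows = [{"id": 14, "name": "Alice"}, {"id": 18, "name": "Bob"}]
myisam_primary = {14: 0, 18: 1}               # id -> row address
myisam_secondary = {"Alice": 0, "Bob": 1}     # name -> row address

def myisam_find_by_name(name):
    address = myisam_secondary[name]   # single index search
    return myisam_rows[address]        # follow the address to the row

print(innodb_find_by_name("Bob"))
print(myisam_find_by_name("Bob"))
```

Both functions return the same row; the difference is that the clustered version goes through two index searches while the non-clustered version goes through one search plus an address lookup.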

In order to illustrate the difference between these two indexes more vividly, we imagine a table that stores 4 rows of data as shown in the figure below. Among them, Id is the primary index and Name is the secondary index. The diagram clearly shows the difference between a clustered index and a non-clustered index.

We focus on the clustered index. It seems that the efficiency of the clustered index is obviously lower than that of the non-clustered index, because every time the auxiliary index is used for retrieval, it has to go through two B+ tree searches. Isn't this unnecessary? What are the advantages of clustered indexes?

1. Because the row data is stored in the leaf nodes, the primary key and the row data are loaded into memory together, and the row can be returned as soon as the leaf node is found. Queries that access data by the primary key Id therefore get their data faster.

2. Secondary indexes use the primary key as the "pointer" rather than an address value. The benefit is less maintenance work on the secondary indexes when a row moves or a data page splits. Using the primary key value as the pointer makes a secondary index take more space, but in exchange InnoDB does not have to update this "pointer" in the secondary index when it moves a row. In other words, the position of a row (located in the implementation by a 16K Page, which is covered later) changes as the data in the database is modified (through the B+ tree node splits described earlier and through Page splits), and the clustered index guarantees that the secondary index trees are unaffected no matter how the nodes of the primary key B+ tree change.
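The second advantage can be demonstrated with a toy page split in Python. Everything below is a hypothetical model: `pages` maps a page number to the rows it holds, one secondary index stores primary keys (the InnoDB approach described above), and the other stores a page number as its address for comparison.

```python
pages = {1: {10: "row-10", 14: "row-14", 20: "row-20", 30: "row-30"}}
primary_index = {pk: 1 for pk in pages[1]}   # primary key -> page number
secondary_by_pk = {"name-20": 20}            # secondary key -> primary key
secondary_by_address = {"name-20": 1}        # secondary key -> page number

def split_page(old, new):
    """Move the upper half of the rows on page `old` to page `new`."""
    keys = sorted(pages[old])
    moved = keys[len(keys) // 2:]
    pages[new] = {pk: pages[old].pop(pk) for pk in moved}
    for pk in moved:
        primary_index[pk] = new   # only the primary index is maintained
    return moved

def lookup_via_pk(name):
    pk = secondary_by_pk[name]
    return pages[primary_index[pk]].get(pk)

def lookup_via_address(name, pk):
    return pages[secondary_by_address[name]].get(pk)

split_page(1, 2)
print(lookup_via_pk("name-20"))           # still finds the row
print(lookup_via_address("name-20", 20))  # stale address: row has moved
```

After the split, the primary-key-based secondary index keeps working without any update, while the address-based one would have to be rewritten for every moved row.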

3. Page structure

If the previous content leaned toward explaining principles, what follows gets into the concrete implementation.

To understand the implementation of InnoDB, we have to talk about the Page structure. The Page is the most basic building block of InnoDB storage and the smallest unit of InnoDB disk management; everything related to the database is stored in Pages. There are several types of Page. Common ones are the data page (B-tree Node), the undo page (Undo Log Page), the system page (System Page), and the transaction system page (Transaction System Page). A single Page is 16K in size (controlled by the compile-time macro UNIV_PAGE_SIZE), and each Page is uniquely identified by a 32-bit integer, which corresponds to InnoDB's maximum storage capacity of 64 TiB (16 KiB * 2^32 = 64 TiB). The basic structure of a Page is shown in the following figure:

Each Page has a common header and footer, but the content in the middle varies depending on the type of Page. The header of the Page has some data we care about. The following figure shows the details of the header of the Page:

We focus on the fields related to the data organization structure: the header of the Page stores two pointers, pointing to the previous Page and the next Page respectively, and the header also includes the type information of the Page and the number used to uniquely identify the Page. Based on these two pointers, we can easily imagine that the Pages are linked together to form a doubly linked list structure.
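As a rough illustration of these header fields, the Python sketch below models a header with a page number, a type, and the two pointers, links a few headers into a doubly linked list, and checks the 64 TiB limit mentioned earlier. The class and field names are hypothetical and do not mirror InnoDB's actual on-disk layout.

```python
from dataclasses import dataclass
from typing import Optional

PAGE_SIZE = 16 * 1024   # 16 KiB per Page
MAX_PAGES = 2 ** 32     # Page numbers are 32-bit values

@dataclass
class PageHeader:
    page_number: int
    page_type: str
    prev_page: Optional[int] = None   # number of the previous Page
    next_page: Optional[int] = None   # number of the next Page

def link(headers):
    """Chain the headers into a doubly linked list in the given order."""
    for left, right in zip(headers, headers[1:]):
        left.next_page = right.page_number
        right.prev_page = left.page_number
    return headers

chain = link([PageHeader(3, "index"), PageHeader(7, "index"), PageHeader(9, "index")])
max_bytes = PAGE_SIZE * MAX_PAGES
print(chain[1])
print(max_bytes // 2 ** 40, "TiB")
```

Note that the pointers are page numbers rather than memory addresses, so neighbouring Pages in the list do not have to be adjacent on disk.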

Turning to the body of the Page, we mainly care about how row data and indexes are stored. Both live in the User Records part of the Page, which takes up most of the Page's space. User Records consists of individual Records, and each Record represents a node of the index tree (non-leaf or leaf). Within a Page, the head and tail of the singly linked list of Records are represented by two Records with fixed content: the string "Infimum" marks the beginning and "Supremum" marks the end. These two Records are stored in the System Records segment, which sits alongside User Records as a parallel segment. InnoDB has 4 different kinds of Record: (1) primary key index tree non-leaf node, (2) primary key index tree leaf node, (3) secondary key index tree non-leaf node, and (4) secondary key index tree leaf node. Their formats differ somewhat, but each stores a Next pointer to the next Record. We will describe the four kinds in detail later; for now, just think of a Record as a singly linked list node that stores data and has a Next pointer.

User Records exist in the Page as a singly linked list. Initially the records are laid out in insertion order, but as new data is inserted and old data is deleted the physical order becomes scrambled. The linked list still maintains their logical order.

Combining the organizational form of User Record with several Pages, you can see a slightly more complete form.

Now let's see how to locate a Record:

1. Starting from the root node, traverse the B+ tree of an index down through the non-leaf nodes of each level until you reach a Page whose records are all leaf nodes.

2. Traverse the singly linked list in that Page starting from the "Infimum" record (this traversal is usually optimized). If the key is found, return successfully. If the traversal reaches "Supremum", the current Page contains no matching key; use the Page's Next Page pointer to jump to the next Page and continue searching record by record from its "Infimum".
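Step 2 can be sketched in Python as a walk over linked lists. This is an unoptimized illustration with hypothetical class and function names: each Page holds a chain of Records from "infimum" to "supremum", and Pages are chained through a next-page reference. Step 1 (descending the tree) is assumed to have already produced the starting Page.

```python
from dataclasses import dataclass
from typing import Optional

INFIMUM, SUPREMUM = "infimum", "supremum"

@dataclass
class Record:
    key: object
    next: Optional["Record"] = None

@dataclass
class Page:
    number: int
    infimum: Record
    next_page: Optional["Page"] = None

def make_page(number, keys):
    """Build a Page whose records run infimum -> keys in order -> supremum."""
    head = Record(SUPREMUM)
    for key in sorted(keys, reverse=True):
        head = Record(key, head)
    return Page(number, Record(INFIMUM, head))

def find(page, key):
    """Return the number of the Page holding key, or None if absent."""
    while page is not None:
        record = page.infimum.next
        while record.key != SUPREMUM:
            if record.key == key:
                return page.number
            record = record.next
        page = page.next_page   # reached supremum: try the next Page
    return None

first = make_page(47, [1, 3, 5])
first.next_page = make_page(48, [7, 9])
print(find(first, 9), find(first, 4))
```

A real engine would not scan every record like this; as noted earlier, the search within a Page is usually narrowed down with a binary search first.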

Now take a detailed look at what data each type of Record stores. Depending on the B+ tree node it represents, a User Record takes one of four formats, distinguished by color below.

1. Primary index tree non-leaf node (green)

a. The smallest primary key stored in the child node (Min Cluster Key on Child). The B+ tree requires it; it is used to locate a specific record within a Page.

b. The number of the Page that holds that smallest value (Child Page Number), used to locate the Record.

2. Primary index tree leaf node (yellow)

a. The primary key (Cluster Key Fields), which the B+ tree requires and which is also part of the data row.

b. All columns other than the primary key (Non-Key Fields), that is, the rest of the data row.

The two parts a and b here add up to a complete data row.

3. Secondary index tree non-leaf node (blue)

a. The smallest secondary key stored in the child node (Min Secondary-Key on Child). The B+ tree requires it; it is used to locate a specific record within a Page.

b. The primary key value (Cluster Key Fields). Why should a non-leaf node store the primary key? A secondary index is not necessarily unique, but the B+ tree requires keys to be unique, so the secondary key value and the primary key value are combined to form the real key in the B+ tree. Together with the Child Page Number below, this makes a non-leaf node in the secondary index B+ tree 4 bytes larger than a leaf node (that is, the blue node in the figure below is 4 bytes larger than the red node).

c. The number of the Page that holds that smallest value (Child Page Number), used to locate the Record.

4. Secondary index tree leaf node (red)

a. The secondary key (Secondary Key Fields), which the B+ tree requires.

b. The primary key value (Cluster Key Fields), used to run a second B+ tree search in the primary index tree to find the whole record.
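The four formats can be summarized as four small record types. The Python sketch below uses hypothetical class and field names loosely based on the labels in the list above, and it also shows why combining the secondary key with the primary key gives a unique, sortable key even when the secondary key repeats.

```python
from typing import NamedTuple

class PrimaryNonLeaf(NamedTuple):
    min_cluster_key: int
    child_page_number: int

class PrimaryLeaf(NamedTuple):
    cluster_key: int
    non_key_fields: tuple      # the rest of the row

class SecondaryNonLeaf(NamedTuple):
    min_secondary_key: str
    cluster_key: int           # appended to make the key unique
    child_page_number: int

class SecondaryLeaf(NamedTuple):
    secondary_key: str
    cluster_key: int           # used for the second search

# Two rows share the secondary key "Alice"; the primary key breaks the tie.
entries = sorted([
    SecondaryLeaf("Bob", 18),
    SecondaryLeaf("Alice", 20),
    SecondaryLeaf("Alice", 14),
])
print(entries)
```

Because named tuples compare field by field, the entries sort by secondary key first and primary key second, which is the ordering a combined key implies.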

What follows is the most important part of this article. Combining the structure of the B+ tree with the four types of Records introduced earlier, we can finally draw the full picture. Because the B+ tree of a secondary index is structured like that of the primary key index, only the primary key index tree is drawn here. It contains just two kinds of nodes, "primary key non-leaf node" and "primary key leaf node", which are the green and yellow parts in the figure above.

The picture above can be reduced to the more concise tree diagram below, which is part of a B+ tree. Note that Pages and B+ tree nodes do not correspond one to one. A Page only serves as a storage container for Records, which makes it convenient to manage disk space in batches. Page number 47 in the figure above is split into two independent nodes in the tree structure.

That concludes this article. It only summarizes the data structures and implementation behind InnoDB indexes and does not cover practical experience with MySQL, mainly for the following reasons:

1. The principle is the cornerstone. Only by fully understanding how the InnoDB index works can we use it efficiently.

2. Principle knowledge is especially suitable for using diagrams, and I personally like this expression very much.

3. InnoDB optimization is covered more comprehensively in "High Performance MySQL". Readers interested in optimizing MySQL can find the relevant knowledge there. My own experience has not reached the point where I can share this material.

One more thing: readers who want to know more about the implementation of InnoDB can read Jeremy Cole's blog (the source of three of the articles in the references). He has done database-related work at MySQL, Yahoo, Twitter, and Google, and his articles are great!

Article source: http://www.admin10000.com/document/5372.html
