[B+ tree index] How the structure of the index page makes fast queries possible

1. Records are stored in pages

The previous article, "The logical structure of the table under the InnoDB storage engine", explained the Compact row format. Besides the variable-length field list, the NULL value list, the hidden fields and the column data, the record header information deserves special attention, as follows:

[Figure: record header information of the Compact row format]

  • deleted_flag: Deletion mark (0: not deleted, 1: deleted). Why do deleted records stay on the page, and therefore on disk?

Answer: Deleted records are not physically removed right away, because removing them would require rearranging the remaining records on disk, which costs performance. Setting a deletion mark avoids that rearrangement (a soft delete). All deleted records then form a garbage linked list, and the space they occupy is called reusable space. When new records are inserted into the table later, they can overwrite the storage occupied by the deleted records.
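
The soft-delete idea can be sketched in a few lines of Python. This is only a toy model under simplifying assumptions: the class `ToyPage`, its list of cells and its `garbage` list are hypothetical names, every record is treated as the same size, and real InnoDB only reuses space when the new record actually fits.

```python
class ToyPage:
    """Toy page: deleting a record only sets a flag; the cell stays in place."""

    def __init__(self):
        self.cells = []    # each cell: {"key": ..., "deleted_flag": 0 or 1}
        self.garbage = []  # positions of deleted cells (stands in for the garbage linked list)

    def insert(self, key):
        """Store a record, reusing the space of a deleted record when possible."""
        if self.garbage:
            pos = self.garbage.pop()  # overwrite reusable space
            self.cells[pos] = {"key": key, "deleted_flag": 0}
        else:
            self.cells.append({"key": key, "deleted_flag": 0})
            pos = len(self.cells) - 1
        return pos

    def delete(self, key):
        """Soft delete: mark the record and remember its space as reusable."""
        for pos, cell in enumerate(self.cells):
            if cell["key"] == key and cell["deleted_flag"] == 0:
                cell["deleted_flag"] = 1
                self.garbage.append(pos)
                return True
        return False
```

Deleting a key leaves the number of cells unchanged, and the next insert lands in the freed cell instead of growing the page.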

Records are stored row by row, but the database does not read in units of rows. Otherwise each read (one I/O operation) would fetch only a single row, which would be extremely inefficient.

In fact, InnoDB reads and writes data in units of data pages. When you need to read one record, InnoDB does not read just that record from disk; it reads the whole page containing the record into memory.

The page is the smallest unit of database I/O. The default size of an InnoDB data page is 16KB, so at least 16KB is read from disk into memory at a time, and at least 16KB is flushed from memory to disk at a time.

The structure of a data page (index page) is as follows:
[Figure: structure of an InnoDB data page]

The following explains the role of each part:

  • File Header: holds general information about the page, including the page type (index page, overflow page, etc.), the previous page address and next page address, a checksum, and so on. The previous and next addresses link pages into a doubly linked list (note that linked pages need not be physically contiguous):
    [Figure: data pages linked into a doubly linked list]

  • Page Header: holds information specific to the data page, such as the number of records, the number of slots in the page directory, the address offset of Free Space in the page, the level of the current page in the B+ tree, and so on.

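  • Minimum and maximum records (Infimum + Supremum): two virtual records that anchor the record linked list, with Infimum at the head and Supremum at the tail. next_record is a signed relative offset, so it can be negative; for example, the last user record points back to Supremum, which sits earlier in the page.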

  • User Records: stores the records we actually insert. The next_record attribute links them into a singly linked list. Note: they are ordered by primary key from smallest to largest. (This is why using MySQL's auto-incrementing primary key is recommended. If keys do not increase and a record whose primary key falls between existing ones has to go into a full page, the page must be split before the record can be placed in the right position. Page splits hurt performance, so monotonically increasing primary keys are preferred.) See the following figure:
    [Figure: user records linked by next_record in primary key order]

  • Free Space: the unused portion of the page. Once Free Space has been entirely consumed by User Records, the page is full.

  • Page Directory: the records in the page's singly linked list are divided into groups. For each group, the page directory stores the address offset of the group's last record (the distance from byte 0 of the page to that record's actual data). Each such entry is called a slot and occupies 2 bytes; the page directory is made up of multiple slots. (The n_owned attribute of the last record in a group says how many records the group contains. The last record is the one with the largest primary key value in the group.)
    [Figure: page directory slots pointing to the last record of each group]

  • File Trailer: used to verify that the page is complete; its checksum must match the checksum in the File Header.
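
To make the page-split remark under User Records concrete, here is a toy Python comparison of increasing versus out-of-order inserts. It is a sketch under stated assumptions, not InnoDB's actual split algorithm: the function `insert_all`, the capacity of 4 records per page and the "move the upper half" split rule are all hypothetical simplifications.

```python
from bisect import bisect_right, insort


def insert_all(keys, capacity=4):
    """Insert keys into toy pages; return (page count, records moved by splits)."""
    pages = [[]]  # each page is a sorted list of keys
    moved = 0
    for key in keys:
        # Target page: the last page whose first key is <= key.
        firsts = [p[0] for p in pages[1:]]
        i = bisect_right(firsts, key)
        page = pages[i]
        if len(page) < capacity:
            insort(page, key)
            continue
        if i == len(pages) - 1 and key > page[-1]:
            pages.append([key])  # key goes at the very end: new page, nothing moves
            continue
        # Full page and the key belongs in the middle: split and move the upper half.
        half = capacity // 2
        right = page[half:]
        del page[half:]
        moved += len(right)
        pages.insert(i + 1, right)
        insort(page if key < right[0] else right, key)
    return len(pages), moved
```

With increasing keys, `insert_all(range(1, 9))` relocates no records, while `insert_all([1, 3, 5, 7, 2, 4, 6, 8])` has to move two records when key 2 arrives at a full page.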

2. What can we learn from the page structure?

  • User records live in the User Records area as a singly linked list, ordered by primary key from smallest to largest.

  • The process of creating a page directory is as follows:

    • Divide all records into groups, including the Infimum and Supremum records but excluding records marked as "deleted";
    • The last record of each group is the one with the largest primary key value in the group. Its header stores the total number of records in the group in the n_owned field; for the other members of the group, n_owned is 0;
    • The page directory stores the address offset of the last record of each group, in order. Each of these offsets is called a slot, and each slot acts as a pointer to the last record of a different group.
  • Lookup process: use the primary key value to find, in the page directory, which slot (that is, which record group) the record belongs to. After locating the slot, traverse only the records in that group. There is no need to walk the whole page's record list starting from the smallest record.

  • The number of records per slot is bounded, so scanning inside a group never degrades to O(n): the first group contains exactly 1 record, Infimum; the last group, which contains Supremum, holds 1 to 8 records; every other group holds 4 to 8 records.
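
The lookup process above can be sketched in Python as a binary search over the slots followed by a short scan inside one group. This is an illustrative model only: `build_page` and `find_in_page` are hypothetical names, records are held in a plain list instead of being linked by next_record, and groups are cut at a fixed size of 4 (which stays within the 4 to 8 and 1 to 8 limits described above).

```python
import math


def build_page(user_keys, group_size=4):
    """Return (records, slots); each slot holds the position of a group's last record."""
    records = [-math.inf] + sorted(user_keys) + [math.inf]  # Infimum ... Supremum
    slots = [0]  # the first group contains only Infimum
    i = 1
    while i < len(records):
        last = min(i + group_size, len(records)) - 1
        slots.append(last)
        i = last + 1
    return records, slots


def find_in_page(records, slots, key):
    """Binary-search the slots, then scan only the records of the matching group."""
    lo, hi = 0, len(slots) - 1
    while lo < hi:
        mid = (lo + hi) // 2
        if records[slots[mid]] >= key:  # the group's largest key covers the search key
            hi = mid
        else:
            lo = mid + 1
    start = slots[lo - 1] + 1 if lo > 0 else 0
    for pos in range(start, slots[lo] + 1):
        if records[pos] == key:
            return pos
    return None
```

For keys 1 to 10 the slots come out as `[0, 4, 8, 11]`, and searching for key 6 inspects only the group that holds keys 5 to 8.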

3. How to query B+ tree?

The record lookup described above works within a single data page. Because the records in a page are ordered by primary key, grouping them and storing each group's slot in the page directory gives the page a small index of its own: binary search over the slots quickly finds the group a record belongs to, which reduces lookup time.

However, a large number of records needs multiple data pages, so we also need an index that quickly locates the page a record is on.

To solve this problem, InnoDB uses a B+ tree as the index. Every node of the B+ tree in InnoDB is a data page. The difference is that leaf nodes store the complete records, while non-leaf nodes store only directory entries that act as an index. A directory entry holds a page number (page address), so the right page can be located quickly, and the record is then found inside that page with the in-page lookup process described above.

Here we need to explain one attribute of the record header, record_type: 0 means an ordinary user record, 1 means a directory entry record in a non-leaf node of the B+ tree, 2 means the Infimum record, and 3 means the Supremum record.

The structural diagram of the B+ tree in InnoDB is as follows (each node is an index page, which means it has header information, hidden fields...):
[Figure: structure of the B+ tree in InnoDB]
From the figure above, we can see the characteristics of the B+ tree:

  • The records inside each index page are still arranged in a singly linked list in increasing primary key order, and the same holds for non-leaf pages. In a non-leaf page, each directory entry's key is the smallest primary key value in the page it points to. (A clustered index is used as the example here, since it is the easiest to explain.)
  • Only leaf nodes (the lowest level) store data. Non-leaf nodes (all other levels) store only directory entries, which act as an index that leads to a page in the next level down.
  • Non-leaf nodes are organized in levels, and each level narrows the search range. In practice the tree rarely needs more than four levels.
  • Pages at the same level are ordered by index key and linked into a doubly linked list through the previous and next page addresses in the File Header, which makes range queries easy.
  • Records in leaf nodes have record_type 0, and directory entry records in non-leaf nodes have record_type 1.
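
A rough calculation shows why so few levels are enough. The numbers here are assumptions chosen only for illustration, not measured values: about 100 user records per leaf page and about 1000 directory entries per non-leaf page. The function name `max_rows` is hypothetical.

```python
def max_rows(levels, rows_per_leaf=100, entries_per_node=1000):
    """Upper bound on rows in a B+ tree with the given number of levels.

    Both per-page capacities are illustrative assumptions; the real values
    depend on the row size and the key size.
    """
    return rows_per_leaf * entries_per_node ** (levels - 1)


for levels in range(1, 5):
    print(levels, max_rows(levels))
```

Under these assumptions a four-level tree already covers 100,000,000,000 rows, so a lookup touches only a handful of pages.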

Here is how the B+ tree in the figure above quickly finds the record with primary key 6:

  • Starting from the root node, use binary search to find the directory entry whose key range contains the query value. The primary key being queried is 6, which falls in [1, 7), so go to page 30.
  • The records on page 30 have record_type 1, so they are directory entry records. Continue with binary search to find the page that covers the query value. The primary key value is greater than 5, so go to page 16.
  • The records on page 16 have record_type 0, so they are user records. Binary search over the page directory locates the slot, then the records in that slot are traversed until the record with primary key 6 is found.
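
The walk above can be mirrored with a small Python model. Only page 30, page 16 and the key ranges mentioned in the steps come from the text; the root page number 33, the other page numbers, the row values and the names `PAGES` and `find` are hypothetical, and the in-page slot search is replaced by a simple scan to keep the sketch short.

```python
from bisect import bisect_right

# record_type 1 = directory entries (smallest key in child page, child page number)
# record_type 0 = user records (primary key, row)
PAGES = {
    33: {"record_type": 1, "entries": [(1, 30), (7, 31)]},  # root (hypothetical page number)
    30: {"record_type": 1, "entries": [(1, 10), (5, 16)]},
    31: {"record_type": 1, "entries": [(7, 17), (10, 18)]},
    10: {"record_type": 0, "entries": [(1, "row 1"), (2, "row 2"), (3, "row 3"), (4, "row 4")]},
    16: {"record_type": 0, "entries": [(5, "row 5"), (6, "row 6")]},
    17: {"record_type": 0, "entries": [(7, "row 7"), (8, "row 8"), (9, "row 9")]},
    18: {"record_type": 0, "entries": [(10, "row 10"), (11, "row 11")]},
}


def find(key, root=33):
    """Descend from the root to a leaf; return (row or None, pages visited)."""
    page_no, path = root, [root]
    while PAGES[page_no]["record_type"] == 1:
        entries = PAGES[page_no]["entries"]
        min_keys = [k for k, _ in entries]
        # Last directory entry whose smallest key is <= key (binary search).
        i = max(bisect_right(min_keys, key) - 1, 0)
        page_no = entries[i][1]
        path.append(page_no)
    for k, row in PAGES[page_no]["entries"]:
        if k == key:
            return row, path
    return None, path
```

Calling `find(6)` visits the root, then page 30, then page 16, matching the three steps listed above.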

4. Clustered index and secondary index

The B+ tree mentioned above is actually the index we mentioned. The leaf nodes of the tree store complete user records, while the non-leaf nodes record the location of the underlying pages.

But in fact, in InnoDB, indexes are divided into clustered indexes and non-clustered indexes (secondary indexes). The difference between them lies in what data is stored in the leaf nodes:

  • The leaf nodes of the clustered index store actual data, and all complete user records are stored in the leaf nodes of the clustered index;
  • The leaf nodes of a secondary index store the primary key value, not the full row. Fetching the full row therefore requires going back to the table (a second lookup in the clustered index). If the query only needs the primary key, the secondary index covers it (a covering index) and no lookup back to the table is needed.

Because the table's data is stored in the leaf nodes of the clustered index, the InnoDB storage engine must create a clustered index for every table, and since only one copy of the data is physically stored, a table can have only one clustered index. (The tree starts out as a single root node and grows page by page as data is added, until it becomes the B+ tree index we see. A later post on index precautions will explain this.)

When InnoDB creates a clustered index, it will select different columns as indexes according to different scenarios:

  • If there is a primary key, the primary key will be used as the index key of the clustered index by default;
  • If there is no primary key, a unique column that does not contain NULL values is chosen as the index key of the clustered index;
  • If neither of the above is satisfied, InnoDB will automatically generate an implicit auto-incrementing id column (6 bytes) as the index key of the clustered index.
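
The three rules can be written as a small decision function. This is a sketch of the selection order only, not InnoDB code: the function `choose_clustered_key` and the dictionary-based column description are hypothetical, and `DB_ROW_ID` is used as the name of the implicit 6-byte id column.

```python
def choose_clustered_key(columns):
    """Pick the clustered index key following the three rules above.

    columns: list of dicts such as
    {"name": "id", "primary": True, "unique": True, "nullable": False}
    """
    # Rule 1: a primary key wins.
    for col in columns:
        if col.get("primary"):
            return col["name"]
    # Rule 2: otherwise the first unique column that cannot contain NULL.
    for col in columns:
        if col.get("unique") and not col.get("nullable", True):
            return col["name"]
    # Rule 3: otherwise the implicit auto-incrementing row id.
    return "DB_ROW_ID"
```

A table with only a nullable unique column falls through to the third rule, because a column that may hold NULL does not qualify under the second.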

A table can have only one clustered index. To make lookups on non-primary-key columns fast, secondary indexes (also called non-clustered or auxiliary indexes) are introduced. A secondary index is also a B+ tree, but its leaf nodes store the primary key value rather than the full row, so fetching the full row requires going back to the table.

The B+ tree of the secondary index is as shown below, and the data part is the primary key value:
[Figure: B+ tree of a secondary index]

If a query uses a secondary index but needs columns other than the primary key, then after finding the primary key value in the secondary index it must fetch the full row from the clustered index. This is called "going back to the table", and it means two B+ trees are searched. When the query only needs the primary key value, the secondary index alone can answer it and the clustered index is not consulted. This is called a "covering index": only one B+ tree is searched.
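
The difference between the two cases can be sketched by counting how many trees a query touches. Everything in this model is hypothetical: the table contents, the dictionaries standing in for the two B+ trees, the function `select_by_name`, and the assumption that `name` values are unique.

```python
# Clustered index: primary key -> full row (dicts stand in for B+ trees).
clustered = {
    1: {"id": 1, "name": "alice", "age": 30},
    2: {"id": 2, "name": "bob", "age": 25},
}
# Secondary index on name: indexed value -> primary key.
secondary_name = {"alice": 1, "bob": 2}


def select_by_name(name, columns):
    """Return (selected columns or None, number of trees searched)."""
    trees = 1  # the secondary index is always searched first
    pk = secondary_name.get(name)
    if pk is None:
        return None, trees
    if set(columns) <= {"id", "name"}:
        row = {"id": pk, "name": name}  # covering index: the leaf already has the answer
    else:
        trees += 1  # go back to the table through the clustered index
        row = clustered[pk]
    return {c: row[c] for c in columns}, trees
```

Selecting only `id` is answered from one tree, while selecting `age` needs the second lookup in the clustered index.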

Regarding the precautions for indexing, the use of indexes, index failure, index optimization, etc., the blog will be updated in the future~~~

Origin blog.csdn.net/qq_63691275/article/details/132926918