B+ tree index

1. B+ tree

       MySQL mainly supports B+ tree indexes, full-text indexes, and hash indexes.

      The B+ tree index is the most common and most frequently used index type in databases, so this article focuses on it. Before that, we introduce some closely related algorithms and data structures that help explain how B+ tree indexes work.

      1. Binary search method

            Binary search is also called half-interval search. Given a set of sorted data, we compare the search key with the element at the midpoint. If the two are equal, the search ends. If the key is smaller than the midpoint element, the search continues in the left half; otherwise it continues in the right half. Each comparison therefore halves the search interval.
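The halving step described above can be sketched in a few lines of Python. This is a minimal illustration, not code from any database; the function name `binary_search` and the sample data are assumptions made for the example.

```python
def binary_search(sorted_items, target):
    """Return the index of target in sorted_items, or -1 if it is absent."""
    low, high = 0, len(sorted_items) - 1
    while low <= high:
        mid = (low + high) // 2          # midpoint of the current interval
        if sorted_items[mid] == target:
            return mid
        if target < sorted_items[mid]:
            high = mid - 1               # continue in the left half
        else:
            low = mid + 1                # continue in the right half
    return -1


data = [5, 10, 19, 21, 31, 37, 42, 48, 50, 52]   # made-up sorted sample
print(binary_search(data, 48))   # prints 7
print(binary_search(data, 4))    # prints -1
```

Each loop iteration discards half of the remaining interval, so a list of n elements needs roughly log2(n) comparisons in the worst case.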

     2. Binary search tree and balanced binary tree

             Before introducing the B+ tree, we need to understand the binary search tree. The B+ tree evolved from the binary search tree, the balanced binary tree, and the B tree. The binary search tree is a classic data structure; Figure 1 shows one. In a binary search tree, every key in the left subtree is less than the key of the root, and every key in the right subtree is greater than the key of the root, so an in-order traversal outputs the keys in sorted order. A binary search tree can, however, take many shapes. Figure 2 is also a binary search tree, but while the average number of comparisons needed to find a node in Figure 1 is (3+3+3+2+2+1)/6 ≈ 2.33, the average for the tree in Figure 2 is (1+2+3+4+5+5)/6 ≈ 3.33, so its query efficiency is lower. Therefore, to get the best performance from a binary search tree, the tree needs to be balanced.

                   

                      Figure 1 Figure 2
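To make the comparison between the two tree shapes concrete, here is a small Python sketch that builds two binary search trees from the same keys and computes the average number of comparisons per lookup. The keys and insertion orders are hypothetical, chosen only so that the node depths match the two sums quoted in the text; the figures themselves may use different values.

```python
class Node:
    def __init__(self, key):
        self.key = key
        self.left = None
        self.right = None


def insert(root, key):
    """Insert key into the binary search tree rooted at root; return the root."""
    if root is None:
        return Node(key)
    if key < root.key:
        root.left = insert(root.left, key)
    else:
        root.right = insert(root.right, key)
    return root


def build(keys):
    root = None
    for key in keys:
        root = insert(root, key)
    return root


def in_order(root):
    """Yield keys in ascending order (left subtree, node, right subtree)."""
    if root is not None:
        yield from in_order(root.left)
        yield root.key
        yield from in_order(root.right)


def depths(root, depth=1):
    """Yield the number of comparisons needed to reach each node."""
    if root is not None:
        yield depth
        yield from depths(root.left, depth + 1)
        yield from depths(root.right, depth + 1)


def average_search_cost(root):
    d = list(depths(root))
    return sum(d) / len(d)


balanced = build([6, 3, 7, 2, 5, 8])   # hypothetical keys, bushy shape
skewed = build([2, 3, 5, 7, 6, 8])     # same keys, nearly a linked list

print(list(in_order(balanced)))                 # [2, 3, 5, 6, 7, 8]
print(round(average_search_cost(balanced), 2))  # 2.33
print(round(average_search_cost(skewed), 2))    # 3.33
```

Both trees return the same sorted sequence from an in-order traversal; only the insertion order, and therefore the shape, differs.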

       

 

       3. Balanced binary tree (AVL tree)

          Definition of a balanced binary tree: first, it must satisfy the definition of a binary search tree; second, the heights of the two subtrees of any node must differ by at most 1.

The search performance of a balanced binary tree is high, though not the highest possible. The best performance requires an optimal binary search tree, but building and maintaining one takes a lot of work, so in practice a balanced binary tree is enough. Queries on a balanced binary tree are fast, but keeping the tree balanced is costly: in general, one or more left or right rotations are needed to restore balance after an insertion or update.
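The balance condition and a single rotation can be sketched as follows. This is a simplified illustration under the usual AVL convention that subtree heights may differ by at most 1. The helper names are assumptions, and a real AVL tree would cache heights and pick among four rotation cases rather than recompute heights recursively.

```python
class Node:
    def __init__(self, key, left=None, right=None):
        self.key = key
        self.left = left
        self.right = right


def height(node):
    """Height of a subtree; an empty subtree has height 0."""
    if node is None:
        return 0
    return 1 + max(height(node.left), height(node.right))


def is_balanced(node):
    """True if every node's two subtrees differ in height by at most 1."""
    if node is None:
        return True
    if abs(height(node.left) - height(node.right)) > 1:
        return False
    return is_balanced(node.left) and is_balanced(node.right)


def rotate_left(node):
    """Left rotation: the right child becomes the new root of this subtree."""
    new_root = node.right
    node.right = new_root.left
    new_root.left = node
    return new_root


# Inserting 1, 2, 3 in ascending order yields a right-leaning chain.
chain = Node(1, right=Node(2, right=Node(3)))
print(is_balanced(chain))              # False
root = rotate_left(chain)
print(root.key, is_balanced(root))     # 2 True
```

One rotation is enough here, but an insertion deeper in a larger tree can require a double rotation, which is part of the maintenance cost mentioned above.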

  

       4. B+ tree

           (1) Like the binary tree and the balanced binary tree, the B+ tree is a classic data structure. The B+ tree evolved from the B tree and the indexed sequential access method (ISAM), and the original B tree is almost never used in practice. A B+ tree is a balanced search tree designed for disks and other direct-access secondary storage devices. In a B+ tree, all records are stored in key order on leaf nodes at the same level, and the leaf nodes are linked to one another by pointers. Let's first look at a B+ tree whose height is 2, where each page can store 4 records and the fan-out is 5, as shown in Figure 3.

                                                                     Figure 3
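The two properties just described, records kept only in sorted leaf pages and leaf pages chained by pointers, can be sketched in Python. The keys, the page layout, and the names `Leaf`, `find_leaf`, and `range_scan` are all hypothetical; the tree is hard-coded with height 2 rather than built by insertion.

```python
from bisect import bisect_left, bisect_right


class Leaf:
    def __init__(self, keys):
        self.keys = keys          # sorted keys (records) stored in this page
        self.next = None          # pointer to the next leaf in key order


# Hypothetical two-level tree: one root page above four linked leaf pages.
leaves = [Leaf([5, 10, 15, 20]), Leaf([25, 30]),
          Leaf([50, 55, 60, 65]), Leaf([75, 80, 85, 90])]
for left, right in zip(leaves, leaves[1:]):
    left.next = right
root_keys = [25, 50, 75]          # separators: child i+1 holds keys >= root_keys[i]


def find_leaf(key):
    """Descend from the root to the leaf page that may contain key."""
    return leaves[bisect_right(root_keys, key)]


def contains(key):
    leaf = find_leaf(key)
    i = bisect_left(leaf.keys, key)
    return i < len(leaf.keys) and leaf.keys[i] == key


def range_scan(low, high):
    """Collect keys in [low, high] by following the leaf-level pointers."""
    result = []
    leaf = find_leaf(low)
    while leaf is not None:
        for key in leaf.keys:
            if key > high:
                return result
            if key >= low:
                result.append(key)
        leaf = leaf.next
    return result


print(contains(55), contains(56))   # True False
print(range_scan(20, 55))           # [20, 25, 30, 50, 55]
```

A point lookup touches one root page and one leaf page, while a range scan descends once and then walks sideways through the leaf chain without returning to the root.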

 

       (2) Insertion operation of B+ tree

           The insertion of the B+ tree must ensure that the records in the leaf nodes are still sorted after insertion. At the same time, three cases of inserting into the B+ tree need to be considered, each of which may lead to different insertion algorithms (Figure 4).


                                                               Figure 4

        (3) Delete operation of B+ tree

           The B+ tree uses a fill factor to control how deletions change the tree; 50% is the minimum value the fill factor can take. A deletion must also leave the records in the affected leaf node sorted. As with insertion, there are three cases to consider for deletion. Unlike insertion, the cases are distinguished by the change in fill factor.

                                                          (Figure 5)

 

2. B+ tree index

        The data structure and operations of the B+ tree were discussed above. A B+ tree index is, in essence, the implementation of the B+ tree in a database. One characteristic of B+ tree indexes in a database is high fan-out, so the height of the tree is generally only 2 to 4 levels, which means that finding the row record for a given key value takes at most 2 to 4 I/Os. B+ tree indexes are divided into clustered indexes and secondary indexes (also called auxiliary or non-clustered indexes). Either way, the index is internally a B+ tree, that is, height-balanced with all the data stored in the leaf nodes. The difference between a clustered index and an auxiliary index is whether the leaf nodes store the whole row.
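As a rough illustration of why high fan-out keeps the tree shallow, the sketch below computes how many levels a tree needs for a given number of rows. The fan-out of 1000 and the 100 rows per leaf page are made-up round numbers for the example, not InnoDB measurements, and the function `btree_height` is hypothetical.

```python
def btree_height(total_rows, rows_per_leaf, fan_out):
    """Smallest number of levels whose capacity covers total_rows."""
    height = 1
    capacity = rows_per_leaf              # capacity of a single leaf page
    while capacity < total_rows:
        capacity *= fan_out               # one more non-leaf level on top
        height += 1
    return height


# Assumed figures: 100 rows per leaf page, 1000 child pointers per non-leaf page.
for rows in (100, 1_000_000, 1_000_000_000):
    print(rows, btree_height(rows, rows_per_leaf=100, fan_out=1000))
# prints:
# 100 1
# 1000000 3
# 1000000000 4
```

Under these assumptions each extra level multiplies capacity by the fan-out, so even a billion rows fit in four levels.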

        2.1. Clustered index

        An InnoDB storage engine table is an index-organized table: the data in the table is stored in primary key order. The clustered index builds a B+ tree on the primary key of each table, and the leaf nodes store the row records of the entire table; the leaf nodes of the clustered index are therefore also called data pages. Because of this, the data in an index-organized table is itself part of the index. As in the B+ tree data structure, the data pages are linked to one another through a doubly linked list.

       Since actual data pages can only be sorted according to one B+ tree, each table can only have one clustered index. In most cases, the query optimizer prefers to use a clustered index, because the clustered index can directly find data on the leaf nodes of the B+ tree index.

       2.2. Auxiliary index (non-clustered index)

       The leaf nodes of an auxiliary index do not contain all the data of the row record. Besides the key value, each index entry in a leaf node contains a bookmark, which tells the InnoDB storage engine where to find the row data that corresponds to the entry. Because an InnoDB table is an index-organized table, the bookmark of an InnoDB auxiliary index is the clustered index key of the corresponding row. The existence of auxiliary indexes does not affect how data is organized in the clustered index. To find a row through an auxiliary index, InnoDB traverses the auxiliary index, obtains the primary key from the leaf-level entry, and then uses the primary key (clustered) index to find the complete row record.
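The two-step lookup can be sketched with plain Python dictionaries standing in for the two B+ trees. The table contents, column names, and the function `find_by_name` are invented for illustration; a real engine searches pages on disk rather than in-memory dicts.

```python
# Clustered index: primary key -> full row (the leaf pages hold the rows).
clustered = {
    1: {"id": 1, "name": "alice", "age": 30},
    2: {"id": 2, "name": "bob", "age": 25},
    3: {"id": 3, "name": "carol", "age": 35},
}

# Auxiliary index on name: key -> bookmark, which is the primary key.
auxiliary_by_name = {"alice": 1, "bob": 2, "carol": 3}


def find_by_name(name):
    """Two-step lookup: auxiliary index first, then the clustered index."""
    primary_key = auxiliary_by_name.get(name)   # step 1: get the bookmark
    if primary_key is None:
        return None
    return clustered[primary_key]               # step 2: fetch the full row


print(find_by_name("bob"))     # {'id': 2, 'name': 'bob', 'age': 25}
print(find_by_name("dave"))    # None
```

The auxiliary structure stores only the indexed column and the primary key, which is why it is smaller than the clustered index and why a full-row lookup costs a second traversal.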

       2.3. Splitting of B+ tree index

       The B+ tree splitting introduced earlier is the simplest case and differs slightly from what happens in a database B+ tree index. The earlier discussion also did not involve concurrency, which is the most difficult part of implementing a B+ tree index.

        Splitting a B+ tree index page does not always start from the middle record of the page, because doing so can waste page space. For example, suppose a page holds the records 1, 2, 3, 4, 5, 6, 7, 8, 9 and insertion follows auto-increment order. If 10 is inserted now, then according to the B+ tree insertion operation, when the leaf node is full, the records smaller than the middle record go to the left and those greater than or equal to it go to the right. With 5 as the split point record, we get the following two pages:

 P1: 1, 2, 3, 4         P2: 5, 6, 7, 8, 9, 10

However, because the insertion is sequential, no more records will ever be inserted into page P1, so its free space is wasted, and P2 will split again.
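The waste can be reproduced with a small Python simulation of leaf pages under ascending inserts. It assumes a page capacity of 9 records, matching the example above. The function names are hypothetical, and because the keys only ever increase, the sketch always inserts into the last page.

```python
def insert_sequential(keys, capacity, split):
    """Insert ascending keys into leaf pages; split(page) returns (left, right)."""
    pages = [[]]
    for key in keys:
        last = pages[-1]
        if len(last) < capacity:
            last.append(key)
        else:
            left, right = split(last)
            right.append(key)             # the new key is the largest so far
            pages[-1:] = [left, right]    # replace the full page by the two halves
    return pages


def split_middle(page):
    """Split at the middle record: smaller keys left, the rest right."""
    mid = len(page) // 2
    return page[:mid], page[mid:]


def split_at_insert(page):
    """Split at the record being inserted: the old page stays full."""
    return page, []


print(insert_sequential(range(1, 11), 9, split_middle))
# [[1, 2, 3, 4], [5, 6, 7, 8, 9, 10]]
print(insert_sequential(range(1, 11), 9, split_at_insert))
# [[1, 2, 3, 4, 5, 6, 7, 8, 9], [10]]
```

With the middle split, the left page is left less than half full and never receives another record; splitting at the insert point keeps every completed page full.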

There are several parts in the Page Header of the InnoDB storage engine that are used to save the insertion order information:

       1. page_last_insert         2. page_direction    3. page_n_direction

       With this information, the InnoDB storage engine can decide whether to split to the left or to the right, and which record to use as the split point. If the insertions are random, the middle record of the page is taken as the split point record, the same as described earlier. If the number of records inserted in the same direction is 5, and there are 3 records after the currently located (cursor) record, the split point record is the third record after the located record; otherwise, the split point record is the record to be inserted. (When the InnoDB storage engine inserts a record, it first locates a position, and the located record is the record immediately before the one to be inserted.)
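The decision rule in the paragraph above can be written as a small Python function. This is a simplified reading of the text, not InnoDB's actual code. It assumes that "random" means fewer than 5 consecutive inserts in the same direction, it treats the thresholds 5 and 3 as "at least", and it only models splits to the right. The function and parameter names are hypothetical.

```python
def choose_split_record(page, cursor_index, new_key, n_direction):
    """Return the record at which a rightward split would begin.

    page          sorted keys currently in the full page
    cursor_index  position of the located record (the one just before new_key)
    n_direction   consecutive inserts seen in the same direction
    """
    records_after_cursor = len(page) - 1 - cursor_index
    if n_direction < 5:                       # assumption: treated as random
        return page[len(page) // 2]           # middle record of the page
    if records_after_cursor >= 3:
        return page[cursor_index + 3]         # third record after the cursor
    return new_key                            # split at the inserted record


full_page = [10, 20, 30, 40, 50, 60, 70, 80, 90]   # made-up keys

# Random inserts: split in the middle.
print(choose_split_record(full_page, 2, 35, n_direction=0))     # 50
# Same-direction inserts with at least 3 records after the cursor (30).
print(choose_split_record(full_page, 2, 35, n_direction=5))     # 60
# Auto-increment style insert at the end of the page.
print(choose_split_record(full_page, 8, 100, n_direction=9))    # 100
```

The last case is the one that matters for auto-increment keys: the split point is the new record itself, so the existing page is left untouched.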

        Figure 6 is an example of splitting to the right in which there are 3 records after the located record; the split point record is as shown in the figure. The result of the split to the right is shown in Figure 7.

                                                                            Figure 6

 

                                                                          Figure 7

      When the split point is the inserted record itself, the split to the right moves only that record to the new page. This is the common case for auto-increment insertion (Figure 8).

                                                                            Figure 8

 

 

3. Use of B+ tree indexes

       3.1. Unique index: CREATE UNIQUE INDEX index_name ON table_name (column_name);

       3.2. Simple index: CREATE INDEX index_name ON table_name (column_name);

       3.3. Joint index:

               A joint index is an index on multiple columns of a table. It is created in the same way as a single-column index, except that several index columns are listed: CREATE INDEX index_name ON table_name (column_1, column_2);

               Essentially, a joint index is also a B+ tree, except that its key consists of two or more columns. Suppose the two key columns are named a and b, as shown in Figure 9. From the figure, we can see that a B+ tree with multi-column keys is no different from one with single-column keys: the key values are sorted, and all the data can be read in logical order through the leaf nodes, i.e. (1,1)(1,2)(2,1)(2,4)(3,1)(3,2). The data is stored in (a, b) order.

                                                                       Figure 9 Joint index
        Therefore, the query select * from table where a=xxx and b=xx can obviously use this joint index, and so can the single-column query select * from table where a=xxx on column a. The query select * from table where b=xxx on column b, however, cannot use this B+ tree index: the values of b on the leaf nodes are 1, 2, 1, 4, 1, 2, which are obviously not sorted, so the (a, b) index is not used for queries on column b alone. This is the leftmost-prefix matching principle that queries on a joint index must follow.
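The leftmost-prefix behaviour can be checked directly on the six leaf entries listed above. In this Python sketch a sorted list of tuples stands in for the leaf level of the (a, b) index; the helper names are made up for the example.

```python
from bisect import bisect_left, bisect_right

# Leaf-level entries of a joint index on (a, b), in index order.
entries = [(1, 1), (1, 2), (2, 1), (2, 4), (3, 1), (3, 2)]


def is_sorted(values):
    return all(x <= y for x, y in zip(values, values[1:]))


a_values = [a for a, _ in entries]
b_values = [b for _, b in entries]
print(is_sorted(a_values))   # True: the leading column is ordered
print(is_sorted(b_values))   # False: 1, 2, 1, 4, 1, 2


def find_by_a(a):
    """Binary search works on a because a is the leading column."""
    lo = bisect_left(a_values, a)
    hi = bisect_right(a_values, a)
    return entries[lo:hi]


def find_by_b(b):
    """Without the leading column, every entry has to be examined."""
    return [entry for entry in entries if entry[1] == b]


print(find_by_a(2))   # [(2, 1), (2, 4)]
print(find_by_b(1))   # [(1, 1), (2, 1), (3, 1)]
```

Within a fixed value of a, the b values are sorted, which is why a condition on both a and b can still narrow the search with the same index.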

         3.4. Covering Index

         The InnoDB storage engine supports covering indexes: the queried records can be obtained from the auxiliary index alone, without reading the records in the clustered index. One advantage of a covering index is that the auxiliary index does not contain all the information of the entire row, so it is much smaller than the clustered index and using it can save a lot of I/O.

     For example, consider select count(*) from tableA where date>='2018-4-20' and date<='2018-4-22', where table tableA has a joint index on (userId, date). A query that filters only on the column date cannot normally use this joint index. This SQL statement, however, is a counting operation that can be answered from the index alone, so the optimizer will choose the joint index as a covering index.
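The idea behind that choice can be sketched in Python: the count is computed by scanning only the entries of the (userId, date) index, never the full rows. The index entries below are invented sample data, and `count_in_range` is a hypothetical helper, not how the optimizer is implemented.

```python
from datetime import date

# Hypothetical leaf entries of a joint index (userId, date); each entry also
# carries the primary key as its bookmark. No full rows are stored here.
index_entries = [
    (1, date(2018, 4, 19), 101),
    (1, date(2018, 4, 21), 102),
    (2, date(2018, 4, 20), 103),
    (2, date(2018, 4, 23), 104),
    (3, date(2018, 4, 22), 105),
]


def count_in_range(start, end):
    """Count matching entries by scanning the index only (no row lookups)."""
    return sum(1 for _user_id, day, _pk in index_entries if start <= day <= end)


print(count_in_range(date(2018, 4, 20), date(2018, 4, 22)))   # 3
```

The scan still visits every index entry because date is not the leading column, but the entries are much smaller than full rows, so fewer pages need to be read than in a scan of the clustered index.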

      
