"Broken MySQL" - MySQL Index

This series is updated every Saturday; search for "Desolate Ancient Legend" on WeChat to read new installments first.

In the previous article, "Broken MySQL" - MySQL Overview, we briefly introduced the MySQL architecture and its storage engines. This article focuses on how the InnoDB storage engine implements indexes. After each topic, common interview questions on that topic are given together with their answers, so that theory and practice go hand in hand.

Interview question 1: Tell me about your understanding of MySQL indexes. Do you understand the indexing strategy of the InnoDB engine?

What is an index

An index is a data structure that helps MySQL retrieve data efficiently, so in essence an index is a data structure.

The purpose of an index is to improve query efficiency. It can be compared to a dictionary, a train timetable at a railway station, or the table of contents of a book.

Principle of B+ Tree

Data structure

A B Tree (B-tree) is a balanced multi-way search tree: it is a search tree in which all leaf nodes are on the same level.

A B+ Tree is a variant of the B Tree. It is built from a B Tree plus sequential access pointers between the leaf nodes, and is commonly used in databases and in operating-system file systems.

A B+ tree has two types of nodes: internal nodes (also called index nodes) and leaf nodes.

Internal nodes are the non-leaf nodes. They do not store data, only index entries (keys and child pointers); all data is stored in the leaf nodes.

The keys in an internal node are arranged in ascending order. For any key in an internal node, all keys in its left subtree are smaller than it, and all keys in its right subtree are greater than or equal to it. The records in the leaf nodes are also arranged in ascending order.

Each leaf node stores pointers to adjacent leaf nodes.

Operations

Lookup

A lookup works much like in a binary search tree. Starting at the root node, traverse the tree top-down, at each node following the child pointer whose neighboring separator keys bracket the value being looked for. Within a node, binary search is typically used to find this position.
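The lookup described above can be sketched in a few lines of Python. This is a minimal illustration, not InnoDB's implementation: the `Node` class, its fields, and the tiny two-level tree are hypothetical, and the sketch assumes the convention stated earlier that keys equal to a separator live in the right subtree.

```python
from bisect import bisect_left, bisect_right


class Node:
    """Hypothetical B+ tree node used only for this sketch."""

    def __init__(self, keys, children=None, values=None):
        self.keys = keys            # sorted keys (separators in internal nodes)
        self.children = children    # child nodes; None for a leaf
        self.values = values        # records, parallel to keys; None for internal nodes

    def is_leaf(self):
        return self.children is None


def search(root, key):
    """Walk from the root to a leaf, then binary-search the leaf for key."""
    node = root
    while not node.is_leaf():
        # Keys smaller than a separator go left; keys >= the separator go right.
        node = node.children[bisect_right(node.keys, key)]
    i = bisect_left(node.keys, key)
    if i < len(node.keys) and node.keys[i] == key:
        return node.values[i]
    return None


# A two-level tree: a root with separators 10 and 20 over three leaves.
leaves = [
    Node([1, 5], values=["a", "b"]),
    Node([10, 15], values=["c", "d"]),
    Node([20, 30], values=["e", "f"]),
]
root = Node([10, 20], children=leaves)

print(search(root, 15))  # found in the middle leaf
print(search(root, 7))   # not present in any leaf
```

Both the descent and the final probe use binary search, so the work per node is logarithmic in the number of keys the node holds.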

Insert

  • Perform a search to determine which bucket (leaf node) the new record should go into.

  • If the bucket is not full (at most b - 1 entries after the insertion, where b is the order of the tree, i.e. the maximum number of children per node, usually chosen so that a node fills a whole number of pages), add the record.

  • Otherwise, before inserting the new record:

    • Split the bucket. With L = b - 1 the number of entries in a full bucket, the L + 1 entries are divided so that
      • the original node has ⌈(L+1)/2⌉ items
      • the new node has ⌊(L+1)/2⌋ items
    • Copy the smallest key of the new node to the parent as the separator, and insert the new node into the parent.
    • Repeat until a parent is found that need not split.
  • If the root splits, treat it as if it had an empty parent and split as outlined above.

B-trees grow at the root and not at the leaves.
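The leaf split in the steps above can be sketched as follows. This is a simplified illustration that handles a single leaf only. It assumes the original node keeps the ceiling half of the entries and the new node receives the floor half, and that the smallest key of the new node is copied to the parent, which matches the rule that right-subtree keys are greater than or equal to the separator. The function name and the `capacity` parameter are hypothetical.

```python
from bisect import insort


def insert_into_leaf(keys, key, capacity):
    """Insert `key` into a sorted leaf that may hold at most `capacity` keys.

    Returns (left, right, separator). When no split is needed, `right` and
    `separator` are None and `left` is the updated leaf.
    """
    keys = list(keys)          # work on a copy; the caller's list is untouched
    insort(keys, key)          # keep the leaf sorted
    if len(keys) <= capacity:
        return keys, None, None
    mid = (len(keys) + 1) // 2             # ceiling half stays in the original node
    left, right = keys[:mid], keys[mid:]   # floor half moves to the new node
    return left, right, right[0]           # new node's smallest key goes to the parent


print(insert_into_leaf([10, 20], 15, capacity=4))          # fits, no split
print(insert_into_leaf([10, 20, 30, 40], 25, capacity=4))  # overflows, splits
```

A full tree would then insert the returned separator into the parent and repeat the same check there, which is how a split can propagate up to the root.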

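Delete

Deletion mirrors insertion, but works as a bottom-up merge: after a record is removed from a leaf, a node that has become too empty borrows entries from a sibling or merges with it, and the merge can propagate upward toward the root.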

Introduction to Common Trees

AVL tree

An AVL tree is a balanced binary search tree. Balance is tracked with a balance factor (the height difference between a node's left and right subtrees) and restored by rotations. The heights of the left and right subtrees of any node differ by at most 1, so compared with a red-black tree it is strictly balanced. Whenever an insertion or deletion violates this condition, the tree must be rotated to restore balance. Because this strict condition triggers rebalancing often, updates are relatively expensive, so the AVL tree suits workloads with few insertions/deletions and many lookups.

Red-black tree

A red-black tree constrains the color of the nodes on every path from the root to a leaf, which guarantees that no such path is more than twice as long as any other, so the tree is approximately balanced. Compared with the strictly balanced AVL tree, it needs fewer rotations to maintain balance, which suits workloads with fewer lookups and more insertions/deletions. (In some scenarios, skip lists are now used instead of red-black trees; you can search for "Why does Redis use skip lists instead of red-black trees?")

B/B+ tree

A multi-way search tree with a high out-degree (fan-out), so the tree is shallow and a lookup needs few disk I/Os. It is generally used in database systems.

Interview question 2: What is the difference between B+ tree and B tree?

Comparison of B+ tree and B tree

  • The internal nodes of the B+ tree do not store pointers to the records that their keys identify
  • The leaf nodes of the B+ tree have pointers to the neighboring leaf nodes on the left and right

B+ tree disk I/O is lower

The internal nodes of a B+ tree do not store pointers to the records for their keys, so each entry in an internal node is smaller than in a B-tree.

If all the keys of one internal node are stored in the same disk block, that block can hold more keys. More of the keys to be searched are then read into memory in a single read, so relatively fewer I/O reads are needed.

Range queries and traversal are efficient in a B+ tree

The leaf nodes of the B+ tree have pointers to the neighboring leaf nodes on the left and right. A range query or a full traversal can therefore walk along the leaves without returning to the internal nodes.
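A range scan over linked leaves can be sketched like this. The `Leaf` class and the helper names are hypothetical, and for brevity the sketch keeps only a forward (`next`) pointer and starts from the first leaf, whereas a real lookup would first descend from the root to the leaf that contains the lower bound.

```python
class Leaf:
    """Hypothetical leaf node holding sorted keys and a pointer to its right neighbor."""

    def __init__(self, keys):
        self.keys = keys
        self.next = None


def link(leaves):
    """Chain the leaves left to right and return the first one."""
    for left, right in zip(leaves, leaves[1:]):
        left.next = right
    return leaves[0]


def range_scan(start_leaf, low, high):
    """Collect all keys in [low, high] by walking the leaf chain."""
    result = []
    leaf = start_leaf
    while leaf is not None:
        for k in leaf.keys:
            if k > high:
                return result      # keys are sorted, so nothing further can match
            if k >= low:
                result.append(k)
        leaf = leaf.next           # follow the sibling pointer, no trip to the root
    return result


first = link([Leaf([1, 5]), Leaf([10, 15]), Leaf([20, 30])])
print(range_scan(first, 5, 20))
```

In a plain B-tree the same query would need an in-order traversal that moves up and down between levels, because records also live in internal nodes.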

The query efficiency of the B+ tree is more stable

Non-leaf nodes never hold the record itself; they are only an index over the keys in the leaf nodes. Every key lookup must therefore follow a path from the root node to a leaf node. All lookups traverse paths of the same length, so every record is found at the same cost.

Comparison of B+ tree and red-black tree

Balanced trees such as red-black trees can also be used to implement indexes, but file systems and database systems generally use B+ Tree as the index structure, mainly for the following two reasons:

(1) Number of disk I/Os

A B+ tree node can store many keys, so for the same number of records the tree is much shallower than a red-black tree and a lookup needs fewer disk I/Os.

(2) Disk read-ahead feature

To reduce the number of disk I/O operations, the disk is often not read strictly on demand. Instead, each read also fetches data ahead of the requested position. Read-ahead is a sequential read, which needs no additional disk seek, and the amount read is an integer multiple of the page size.

The operating system generally divides memory and disk into fixed-size blocks called pages, and memory and disk exchange data in units of pages. The database system sets the size of an index node to the size of a page, so that a node can be loaded completely in a single I/O.
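A rough back-of-the-envelope calculation shows why one node per page keeps the tree shallow. The 16 KiB page is InnoDB's default page size; the entry size (an 8-byte key plus a 6-byte child pointer) and the 1 KiB row size are assumptions chosen only for illustration, and page headers and fill factor are ignored.

```python
import math

PAGE_SIZE = 16 * 1024      # InnoDB's default page size in bytes
ENTRY_SIZE = 8 + 6         # assumed: 8-byte key plus 6-byte child pointer
ROW_SIZE = 1024            # assumed: about 1 KiB per row stored in a leaf

fanout = PAGE_SIZE // ENTRY_SIZE        # children per internal node
rows_per_leaf = PAGE_SIZE // ROW_SIZE   # rows per leaf page


def bplus_capacity(height, fanout, rows_per_leaf):
    """Rows reachable by a B+ tree with `height` levels (leaf level included)."""
    return fanout ** (height - 1) * rows_per_leaf


def binary_tree_height(rows):
    """Minimum height of a perfectly balanced binary tree holding `rows` keys."""
    return math.ceil(math.log2(rows + 1))


rows = bplus_capacity(3, fanout, rows_per_leaf)
print(fanout, rows_per_leaf, rows)
print(binary_tree_height(rows))
```

Under these assumptions a three-level B+ tree addresses roughly twenty million rows, while a balanced binary tree over the same rows is about 25 levels deep. If each level costs one page read, that is the difference between about 3 and about 25 disk I/Os per lookup.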

MySQL indexes

Indexes are implemented at the storage engine layer, not at the server layer, so different storage engines have different index types and implementations.

B+ Tree index

The B+ Tree index is the default index type for most MySQL storage engines.

  • Because only the tree needs to be searched rather than the full table scanned, lookups are much faster.

  • Because a B+ Tree is ordered, it can be used for sorting and grouping in addition to searching.

  • You can specify multiple columns as index columns; together they form one key.

  • It is applicable to full key value, key value range, and key prefix lookups, where key prefix lookup only works for the leftmost prefix. The index cannot be used if the lookup does not follow the order of the indexed columns.
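The leftmost-prefix rule follows directly from how a multi-column key is ordered. The sketch below models a composite index as a sorted list of tuples; the column names (`last_name`, `first_name`, `id`) and the sample data are hypothetical.

```python
from bisect import bisect_left

# Composite index on (last_name, first_name), with the row id appended.
index = sorted([
    ("Li", "Hua", 3),
    ("Wang", "Fang", 1),
    ("Li", "Bo", 2),
    ("Zhang", "Wei", 4),
])


def find_by_last_name(index, last_name):
    """Leftmost-prefix lookup: binary search to the first match, then read neighbors."""
    i = bisect_left(index, (last_name,))
    matches = []
    while i < len(index) and index[i][0] == last_name:
        matches.append(index[i])
        i += 1
    return matches


def find_by_first_name(index, first_name):
    """Not a leftmost prefix: matches are scattered, so every entry must be checked."""
    return [entry for entry in index if entry[1] == first_name]


print(find_by_last_name(index, "Li"))
print(find_by_first_name(index, "Wei"))
```

Entries that share a `last_name` sit next to each other, so the first function can stop early. Entries that share only a `first_name` can be anywhere in the ordering, which is why the ordered structure does not help the second function.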

InnoDB's B+Tree indexes are divided into the primary index and auxiliary (secondary) indexes. The data field of a leaf node of the primary index holds the complete data record; this kind of index is called a clustered index. Because the data rows cannot be stored in two different places, a table can have only one clustered index.

The data field of a leaf node of an auxiliary index holds the primary key value. A lookup through an auxiliary index therefore first finds the primary key value and then searches the primary index with it. This second step is called going back to the table.
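The two-step lookup can be modeled in memory as follows. This is only a sketch: a Python dict stands in for the clustered index and a sorted list of `(type, primary key)` pairs stands in for an auxiliary index on a hypothetical `type` column.

```python
from bisect import bisect_left

# Clustered (primary) index: primary key -> complete row.
clustered = {
    1: {"id": 1, "type": 2, "name": "a"},
    2: {"id": 2, "type": 1, "name": "b"},
    3: {"id": 3, "type": 1, "name": "c"},
}

# Auxiliary index on `type`: sorted (type, primary key) entries.
secondary = sorted((row["type"], pk) for pk, row in clustered.items())


def find_by_type(type_value):
    """Step 1: find primary keys in the auxiliary index. Step 2: fetch each row."""
    rows = []
    i = bisect_left(secondary, (type_value,))
    while i < len(secondary) and secondary[i][0] == type_value:
        pk = secondary[i][1]
        rows.append(clustered[pk])   # going back to the table
        i += 1
    return rows


print(find_by_type(1))
```

Every matching entry costs one extra lookup in the clustered index, which is why queries that touch many rows through an auxiliary index can become expensive.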

Interview question 3: What is the reason for the slow query in the late stage of paging SQL? How to deal with it?

For example:

select * from table where type = 1 limit 0,10

This query is fast, but problems appear in the later pages, for example with the following SQL:

select * from table where type = 1 limit 100000,10 

To see why, recall how the B+ tree is traversed. When paging deep into a large result set, MySQL:

  1. uses the index to find the starting position (a leaf node);
  2. traverses forward along the leaf nodes, discarding the first 100000 matching rows, which fall outside the requested page;
  3. returns the 10 target rows once it reaches them.

So the main cost in the later pages is the traversal along the leaf nodes. Is there a way to avoid it?

Yes. Record the maximum id of the current page after each query and add it as a condition to the WHERE clause of the next query. Each query then seeks directly to the start of its page, so every page is about as fast as the first one. This assumes that id is the primary key, that pages are read in id order, and that max_id stands for the largest id of the previous page. Note that the offset is no longer needed:

select * from table where type = 1 and id > max_id order by id limit 10
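The difference between the two paging styles can be simulated in memory. In this sketch a sorted list of ids stands in for the index entries that match `type = 1`; the function names and the "entries touched" counter are hypothetical and only approximate the work the server does. The id-based variant reads just one page of rows after seeking, without any offset.

```python
from bisect import bisect_right


def page_by_offset(ids, offset, limit):
    """Offset paging: walk past `offset` entries, then read `limit` entries."""
    touched = min(len(ids), offset + limit)
    return ids[offset:offset + limit], touched


def page_by_last_id(ids, last_id, limit):
    """Id-based paging: seek directly past `last_id`, then read `limit` entries."""
    start = bisect_right(ids, last_id)      # one binary search instead of a long walk
    page = ids[start:start + limit]
    return page, len(page)


ids = list(range(1, 200001))                # ids of the rows with type = 1

offset_page, offset_touched = page_by_offset(ids, 100000, 10)
seek_page, seek_touched = page_by_last_id(ids, 100000, 10)

print(offset_page == seek_page)             # both styles return the same rows
print(offset_touched, seek_touched)         # but touch very different numbers of entries
```

The trade-off is that id-based paging only supports moving from one page to the next in key order; jumping straight to an arbitrary page number still requires an offset.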

Interview question 4: What is a covering index?

A covering index is an index that contains all the columns a query needs, so the result can be produced from the index alone, without going back to the table to search the primary index tree.

To answer this question well, you need to understand the B+ tree implementation introduced above; once you understand the principle, the answer follows easily.
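Whether a query is covered can be expressed as a simple set check. The sketch below relies on the fact described earlier that the leaf nodes of an InnoDB auxiliary index also hold the primary key value; the function name and the column names are hypothetical.

```python
def is_covered(index_columns, primary_key_columns, needed_columns):
    """Return True if every column the query needs is available in the index leaf."""
    available = set(index_columns) | set(primary_key_columns)
    return set(needed_columns) <= available


# Auxiliary index on (type); primary key is (id).
print(is_covered(["type"], ["id"], ["id", "type"]))   # answered from the index alone
print(is_covered(["type"], ["id"], ["id", "name"]))   # `name` forces a trip back to the table
```

Here `needed_columns` should include every column the query references, such as those in the select list and in the filter conditions.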

Hash index

A hash index supports lookups in O(1) time but loses ordering:

  • It cannot be used for sorting or grouping;
  • It only supports exact-match lookups and cannot be used for partial-key or range lookups.
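The trade-off can be illustrated with Python's built-in structures: a dict plays the role of a hash index and a sorted list plays the role of an ordered index. The sample keys and function names are hypothetical.

```python
from bisect import bisect_left, bisect_right

rows = {17: "row-17", 3: "row-3", 42: "row-42", 8: "row-8"}

hash_index = dict(rows)        # exact-match lookups in O(1) on average
ordered_keys = sorted(rows)    # ordered keys, as in a B+ tree leaf level


def hash_range(low, high):
    """A hash index has no order, so a range query must inspect every key."""
    return sorted(k for k in hash_index if low <= k <= high)


def ordered_range(low, high):
    """An ordered index finds both ends by binary search and reads the slice between."""
    return ordered_keys[bisect_left(ordered_keys, low):bisect_right(ordered_keys, high)]


print(hash_index.get(42))      # exact match: a single probe
print(hash_range(5, 20))       # full scan plus an explicit sort
print(ordered_range(5, 20))    # two binary searches, result already sorted
```

Both range functions return the same keys, but only the ordered version avoids visiting every entry and gets sorted output for free.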

The InnoDB storage engine has a special feature called the "adaptive hash index". When some index values are accessed very frequently, InnoDB builds a hash index on top of the B+Tree index, which gives the B+Tree index some of the advantages of a hash index, such as fast hash lookups.

Full-text index

The MyISAM storage engine supports full-text indexes, which are used to find keywords within text rather than to compare whole values for equality.

Lookup conditions use MATCH AGAINST instead of ordinary comparison operators in the WHERE clause.

The full-text index is implemented as an inverted index, which records the mapping from each keyword to the documents that contain it.
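The idea of an inverted index can be sketched in a few lines. This is a toy model that splits text on whitespace and requires every query word to be present; it does not reproduce MySQL's tokenization, stopword handling, or relevance ranking. The function names and sample documents are hypothetical.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each lowercase word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index


def match_all(index, query):
    """Return the ids of documents that contain every word of the query."""
    words = query.lower().split()
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


docs = {
    1: "MySQL index basics",
    2: "InnoDB clustered index",
    3: "MySQL storage engines",
}
inverted = build_inverted_index(docs)

print(match_all(inverted, "mysql index"))
print(match_all(inverted, "index"))
```

Because the lookup starts from the keyword rather than from the document, the cost depends on how many documents contain the word, not on the total amount of text stored.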

The InnoDB storage engine has also supported full-text indexes since MySQL 5.6.4.

Spatial Data Index

The MyISAM storage engine supports spatial data indexes (R-Tree), which can be used for geographic data storage. The spatial data index will index data from all dimensions, and can effectively use any dimension to perform combined queries.

Data must be maintained using GIS-related functions.

Advantages of Indexing

  • Significantly reduces the number of data rows that the server needs to scan.

  • Helps the server avoid sorting and grouping, and avoid creating temporary tables. (B+Tree indexes are ordered and can be used for ORDER BY and GROUP BY operations. Temporary tables are mainly created during sorting and grouping, so when no sorting or grouping is needed, no temporary table is needed either.)

  • Change random I/O into sequential I/O (B+Tree indexes are ordered and store adjacent data together).

Index usage conditions

  • For very small tables, a simple full table scan is more efficient than using an index in most cases;

  • For medium to large tables, indexes are very effective;

  • For very large tables, however, the cost of creating and maintaining indexes grows accordingly. In that case you need a technique that narrows the query down to a group of data directly instead of matching records one by one; partitioning is one example.

Why is a simple full table scan more efficient than indexing in most cases for very small tables?

If a table is small, scanning the whole table directly is faster than going through an index, because every match found in the index requires going back to the table.

Note: This answer rests on two implicit conditions. First, the queried columns are not all part of the index, so going back to the table is required. Second, the query condition is not on the primary key; otherwise the data could be read directly from the clustered index.

Summary

This is the second article in the "Broken MySQL" series and mainly covers indexes.

It first introduces the basics of the B+ tree from three angles: data structure, operations, and comparison with other common trees.

It then focuses on how InnoDB implements its B+ Tree index.


As usual, the interview questions that appear in this article are:

  • Interview question 1: Tell me about your understanding of MySQL indexes. Do you understand the indexing strategy of the InnoDB engine?
  • Interview question 2: What is the difference between B+ tree and B tree?
  • Interview question 3: What is the reason for the slow query in the late stage of paging SQL? How to deal with it?
  • Interview question 4: What is a covering index?

All of the interview questions above are covered in this article. If you cannot answer one, go back to the relevant section and review the answer and the corresponding knowledge points.

Origin blog.csdn.net/finish_dream/article/details/113446647