MySQL indexes: a detailed explanation of InnoDB indexes

Table of contents

1. Why use indexes

2. Index overview

Advantages

Disadvantages

3. Deduction of indexes in InnoDB

3.1. Data page

3.2. Search in a page

3.2. Search in many pages

3.3. Design an index

4. Common index concepts

4.1. Clustered index

Features

Advantages

Disadvantages

Limitations

4.2. Secondary index (auxiliary index, non-clustered index)

Back to the table

Why is going back to the table needed?

Summary

4.3. Joint index

5. Precautions for InnoDB's B+ tree index

5.1. The location of the root page never changes

5.2. Uniqueness of directory entry records in internal nodes

5.3. A page stores at least 2 records


1. Why use indexes

An index is a data structure that the storage engine uses to find data records quickly, much like the table of contents of a textbook: by looking up the page number of an article in the table of contents, you can jump straight to that article. MySQL works the same way. When it searches for data, it first checks whether the query condition can use an index. If it can, the index is used to locate the relevant records. If not, the entire table has to be scanned, that is, records are examined one by one until all records that meet the condition have been found.

The purpose of an index is to reduce the number of disk I/O operations and speed up queries.


2. Index overview

MySQL's official definition of an index is: an index is a data structure that helps MySQL obtain data efficiently.

The nature of an index: an index is a data structure. It can be understood as a "sorted data structure for fast lookup" that supports a specific search algorithm. These data structures point to the data in some way, so that efficient lookup algorithms can be implemented on top of them.

Indexes are implemented in the storage engine, so the indexes of different storage engines are not necessarily identical, and not every storage engine supports every index type. The storage engine also defines the maximum number of indexes and the maximum index length for each table. All storage engines support at least 16 indexes per table and a total index length of at least 256 bytes. Some storage engines support more indexes and larger index lengths.

Advantages

  • Like the catalog index of a university library, an index improves the efficiency of data retrieval and reduces the I/O cost of the database. This is the main reason for creating an index.
  • A unique index guarantees the uniqueness of each row of data in the table.
  • When enforcing referential integrity, an index speeds up joins between tables. In other words, queries that join a dependent child table with its parent table run faster.
  • For queries that use grouping and sorting clauses, an index can significantly reduce the time spent on grouping and sorting and lower CPU consumption.

Disadvantages

Adding indexes also has a cost. The main disadvantages are:

  • Creating and maintaining indexes takes time, and the time grows as the amount of data grows.
  • Indexes occupy disk space. In addition to the space taken by the table data, each index takes up physical space of its own on disk. With a large number of indexes, the index file may become larger than the data file and may reach the maximum file size.
  • Although an index greatly improves query speed, it slows down updates to the table. Whenever rows are inserted, deleted or modified, the indexes have to be maintained as well, which reduces the speed of data maintenance.

Therefore, the advantages and disadvantages have to be weighed together when deciding whether to use an index.

3. Deduction of indexes in InnoDB

3.1. Data page

  • In InnoDB, the page is the basic unit of exchange between disk and memory, and also the basic unit in which storage space is managed.
  • All tablespaces of the same database instance use the same page size. The default page size is 16KB; it can be changed with the innodb_page_size option (which has to be set when the instance is initialized). Note that a different page size can also result in a different extent size.
  • At least one page (16KB by default) is read from disk into memory at a time, and at least one page is flushed from memory to disk at a time. Because reading a single page is relatively expensive, read-ahead is generally performed.

3.2. Search in a page

        Assume that the table currently holds so few records that all of them fit in one page. A search for records then falls into one of two cases, depending on the search condition:

  • The primary key is the search condition

        Binary search in the page directory quickly locates the corresponding slot; the records in the group that belongs to that slot are then traversed to find the specified record.

  • Another column is the search condition

        The data page has no page directory for non-primary-key columns, so binary search cannot be used to locate a slot. The only option is to start from the smallest record, walk the singly linked list record by record, and check each record against the search condition. This kind of search is clearly very inefficient.
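
        The two cases above can be sketched in a few lines of Python. This is only a simplified in-memory model, not InnoDB's actual page layout: the dictionary-based page, the group size of 4 and the function names are all assumptions made for illustration.

```python
from bisect import bisect_left

# Hypothetical in-memory model of one data page. Records are kept sorted by
# primary key; each "slot" of the page directory stores the largest key of
# one small group of records (a group size of 4 is an arbitrary choice).
GROUP_SIZE = 4

def build_page(records):
    """records: list of (primary_key, c2) tuples."""
    rows = sorted(records)
    groups = [rows[i:i + GROUP_SIZE] for i in range(0, len(rows), GROUP_SIZE)]
    slots = [group[-1][0] for group in groups]  # largest key per group
    return {"groups": groups, "slots": slots}

def find_by_primary_key(page, key):
    # Binary search over the slots finds the only group that can hold the key.
    i = bisect_left(page["slots"], key)
    if i == len(page["slots"]):
        return None
    # Then walk the few records of that group.
    for row in page["groups"][i]:
        if row[0] == key:
            return row
    return None

def find_by_other_column(page, c2):
    # No directory exists for c2, so every record is visited in order.
    return [row for group in page["groups"] for row in group if row[1] == c2]

page = build_page([(k, k % 5) for k in range(1, 21)])
print(find_by_primary_key(page, 7))
print(find_by_other_column(page, 0))
```

        The primary key lookup touches one slot array and at most one group, while the lookup on c2 has to visit all 20 records.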

3.2. Search in many pages

        In most cases a table stores a large number of records, and many data pages are needed to hold them. Searching for a record across many pages takes two steps:

  1. Locate the page that contains the record.
  2. Find the record within that page.

        Without an index, whether the search is by primary key or by another column, there is no way to locate the page quickly, so the only option is to follow the doubly linked list of pages starting from the first page and, in each page, look for the record using the in-page search described above. Because every data page has to be traversed, this approach is extremely time-consuming. What if the table has 100 million records? This is the problem that indexes were created to solve.
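
        To make the cost difference concrete, here is a minimal Python sketch that counts how many pages are read with and without a directory. It assumes a flat directory of (smallest key, page number) entries that fits in a single page, which is the idea the next section builds on; the page capacity of 3 and all function names are hypothetical.

```python
from bisect import bisect_right

PAGE_CAPACITY = 3  # assumed tiny capacity so the example stays readable

def build_pages(keys):
    """Split sorted primary keys into pages; list order stands in for the page chain."""
    keys = sorted(keys)
    return [keys[i:i + PAGE_CAPACITY] for i in range(0, len(keys), PAGE_CAPACITY)]

def scan_pages(pages, key):
    """No index: walk the page chain from the first page. Returns (found, pages read)."""
    pages_read = 0
    for page in pages:
        pages_read += 1
        if key in page:
            return True, pages_read
    return False, pages_read

def build_directory(pages):
    """One directory entry per page: (smallest key in the page, page number)."""
    return [(page[0], page_no) for page_no, page in enumerate(pages)]

def lookup_with_directory(pages, directory, key):
    """Binary search the directory, then read a single data page."""
    smallest_keys = [entry[0] for entry in directory]
    i = bisect_right(smallest_keys, key) - 1
    if i < 0:
        return False, 1  # key is below every page; only the directory was read
    page_no = directory[i][1]
    return key in pages[page_no], 2  # the directory plus one data page

pages = build_pages(range(1, 31))
directory = build_directory(pages)
print(scan_pages(pages, 29))
print(lookup_with_directory(pages, directory, 29))
```

        With 30 keys in 10 pages, the scan reads all 10 pages to reach key 29, while the directory lookup reads 2.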

3.3. Design an index

 Question: As the number of data pages grows, so does the number of directory entries. When a new data page is inserted between existing data pages, a matching directory entry also has to be inserted among the existing directory entries. If the directory entries were stored contiguously, this would not be feasible. The solution is to store directory entries the same way as user records, in pages of their own (a directory page has the same structure as a data page), and to build these pages up layer by layer. The two kinds of records are mainly told apart by the record_type field in the row format.

  • Each blue frame in the diagram is a page
  • The nodes at the bottom level are called leaf nodes; they store the real data
  • The records inside a leaf node are linked in a singly linked list, and the leaf nodes themselves are linked in a doubly linked list (logically contiguous)
  • Non-leaf nodes are called directory pages

 Note:

  • The more levels the tree has, the more I/O operations a lookup needs
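
 A rough calculation shows why a few levels are usually enough. The sketch below assumes, purely for illustration, that a leaf page holds 100 user records and a directory page holds 1000 directory entries; real capacities depend on the row size and the page size.

```python
def max_records(levels, records_per_leaf=100, entries_per_directory_page=1000):
    """Upper bound on the records reachable through a B+ tree with this many levels.

    Both capacities are assumed round numbers, not values taken from InnoDB.
    """
    if levels < 1:
        raise ValueError("a tree needs at least one level")
    # One leaf level, plus (levels - 1) directory levels that each multiply
    # the number of reachable pages by the directory fan-out.
    return records_per_leaf * entries_per_directory_page ** (levels - 1)

for levels in range(1, 5):
    print(levels, max_records(levels))
```

 Under these assumptions a 3-level tree already covers 100 million records, and a lookup reads one page per level.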

4. Common index concepts

By physical implementation, indexes fall into two types: clustered indexes and non-clustered indexes. A non-clustered index is also called a secondary index or auxiliary index.

4.1. Clustered index

A clustered index is not a separate index type but a way of storing data (all user records are stored in the leaf nodes). In other words, the index is the data and the data is the index.

  • The term "clustered" means that data rows are stored together, grouped by adjacent key values

Features

1. Records and pages are sorted by primary key value, which covers three aspects

  • The records within a page (a page being a node of the tree) are linked in a singly linked list in ascending primary key order
  • The leaf pages are linked in a doubly linked list in primary key order
  • The pages that store directory entry records (non-leaf nodes, which hold no real data) are divided into levels, and the pages within one level are also linked in a doubly linked list, ordered by the primary key values of their directory entry records

2. The leaf nodes of the B+ tree store complete user records (the data)

  • A complete user record is one that stores the values of all columns (including hidden columns)

A B+ tree with these two characteristics is called a clustered index, and all complete user records are stored in its leaf nodes. A clustered index does not have to be created explicitly with an INDEX clause in a MySQL statement; the InnoDB storage engine creates it automatically.

Advantages
  • Data access is faster: the clustered index stores the index and the data in the same B+ tree, so fetching data through the clustered index is faster than through a non-clustered index
  • Sorted lookups and range lookups on the primary key are fast (the data is already sorted; to find rows with an id greater than 12, for example, locate 12 and read everything after it)
  • When a query returns a range of data in clustered index order, the rows are stored close together, so the database does not have to fetch them from many separate data blocks, which saves a lot of I/O operations
Disadvantages
  • Insert speed depends heavily on insert order. Inserting in primary key order is fastest; otherwise page splits occur, which seriously hurts performance. For this reason, InnoDB tables usually define an auto-increment ID column as the primary key
  • Updating the primary key is expensive, because it forces the updated row to move. For this reason, the primary key of an InnoDB table is usually treated as non-updatable
  • Access through a secondary index requires two index lookups: the first finds the primary key value, and the second finds the row data by that primary key value
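
The effect of insert order on page splits can be illustrated with a toy model. The sketch below is not InnoDB's split algorithm: it assumes a page capacity of 4, a simple split-in-half rule, and a hypothetical helper insert_keys that counts how many existing records are moved to a new page.

```python
import random
from bisect import bisect_right, insort

PAGE_CAPACITY = 4  # assumed tiny capacity; real pages hold far more records

def insert_keys(keys):
    """Insert keys one by one; return (pages, records moved by page splits)."""
    pages = [[]]
    moved = 0
    for key in keys:
        # Pick the last page whose first key is <= key (or the first page).
        first_keys = [page[0] for page in pages if page]
        i = max(0, bisect_right(first_keys, key) - 1)
        page = pages[i]
        if len(page) < PAGE_CAPACITY:
            insort(page, key)
        elif i == len(pages) - 1 and key > page[-1]:
            # The key is larger than everything stored so far: start a fresh
            # page at the end, no existing record has to move.
            pages.append([key])
        else:
            # The page is full and the key belongs inside the existing range:
            # split the page and move the upper half to a new page.
            mid = len(page) // 2
            right = page[mid:]
            del page[mid:]
            moved += len(right)
            pages.insert(i + 1, right)
            insort(page if key < right[0] else right, key)
    return pages, moved

ascending = list(range(1, 201))
shuffled = ascending[:]
random.Random(0).shuffle(shuffled)

print("ascending keys, records moved:", insert_keys(ascending)[1])
print("shuffled keys, records moved:", insert_keys(shuffled)[1])
```

In this model, keys that arrive in ascending order only ever append to the last page, so no record is moved, whereas out-of-order keys keep landing in full pages and force splits.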
Limitations
  • Among the MySQL storage engines discussed here, InnoDB supports clustered indexes while MyISAM does not
  • Because data can be physically stored in only one sort order, each MySQL table can have only one clustered index, which is usually the primary key of the table
  • If no primary key is defined, InnoDB chooses a unique index on NOT NULL columns instead. If no such index exists, InnoDB implicitly generates a hidden primary key and uses it as the clustered index.
  • To take full advantage of clustering, the primary key of an InnoDB table should be an ordered sequential ID wherever possible. Unordered values such as UUIDs, MD5 values, hashes or arbitrary strings are not recommended as primary keys, because they cannot guarantee that the data grows in order.

4.2. Secondary index (auxiliary index, non-clustered index)

The clustered index described above only helps when the search condition is the primary key (the data in the B+ tree is sorted by primary key). What if we want to search by another column? (Scanning every record from the beginning is certainly not what we want.)

Answer:

  • Build additional B+ trees, each with its own sort rule. For example, use the value of column c2 as the sort rule for records and pages, and build another B+ tree on that basis.
  • This tree contains only the value of the c2 column and the primary key, and no other columns
  • For select * from xxx where c2 = 4, the entries with c2 = 4 are found through this B+ tree
  • The primary key values of those entries are then used to search the clustered index (this step is called going back to the table), which finally yields the complete rows (the clustered index holds all the columns)
  • In total, 2 B+ trees have to be searched
Back to the table

        The B+ tree sorted by the c2 column can only tell us the primary key value of the record we are looking for. To obtain the complete user record for a given c2 value, we still have to look the primary key up again in the clustered index. This process is called going back to the table.
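
        The two lookups can be sketched as follows. This is a minimal model in which a dict stands in for the clustered index and a sorted list of (c2, c1) pairs stands in for the secondary index; the sample rows, the column c1 as primary key and the function name select_where_c2 are assumptions for illustration.

```python
from bisect import bisect_left

# Hypothetical table with columns (c1, c2, c3); c1 is the primary key.
rows = [(1, 4, "u"), (3, 9, "d"), (5, 3, "y"), (9, 4, "a"), (10, 4, "o")]

# Clustered index: primary key -> complete row (a dict stands in for the B+ tree).
clustered = {row[0]: row for row in rows}

# Secondary index on c2: sorted (c2, c1) pairs only, no other columns.
secondary_c2 = sorted((row[1], row[0]) for row in rows)

def select_where_c2(value):
    """Model of: select * from xxx where c2 = value."""
    result = []
    # Lookup 1: find the first entry with c2 == value in the secondary index.
    i = bisect_left(secondary_c2, (value,))
    while i < len(secondary_c2) and secondary_c2[i][0] == value:
        primary_key = secondary_c2[i][1]
        # Lookup 2: go back to the table, i.e. fetch the complete row
        # from the clustered index by primary key.
        result.append(clustered[primary_key])
        i += 1
    return result

print(select_where_c2(4))
```

        Each matching entry in the secondary index costs one extra lookup in the clustered index, which is why a query that matches many rows through a secondary index can become expensive.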

Why is going back to the table needed?
  • If the complete rows were stored in the leaf nodes of every B+ tree, going back to the table would not be necessary, but this would take up too much space. It would amount to copying all the data every time a B+ tree is built, which wastes storage.

Because a B+ tree built on a non-primary-key column needs to go back to the table to obtain the complete user record, this kind of B+ tree is called a secondary index, or auxiliary index. Since the value of the c2 column is its sort rule, we also say that this B+ tree is the index built on the c2 column.

Summary
  • The leaf nodes of the clustered index store the data records themselves, while the leaf nodes of a non-clustered index store the location of the data (here, "location" means the primary key value of the row). Non-clustered indexes do not affect the physical storage order of the table.
  • A table can have only one clustered index, because the data can be sorted and stored in only one way, but it can have multiple non-clustered indexes, that is, several index directories that all support data retrieval.
  • Queries through the clustered index are efficient, but inserts, deletes and updates cost more in the clustered index than in a non-clustered index. (Reason: in the example above, the clustered index stores all the data, with the three columns c1/c2/c3, while the non-clustered index on c2 stores only c1/c2, so changing it on an update is cheaper.)

4.3. Joint index

We can also use the values of several columns together as the sort rule, that is, build one index on multiple columns. For example, suppose we want the B+ tree to be sorted by the c2 and c3 columns. This has two implications:

  • Records and pages are first sorted by column c2.
  • Records with the same c2 value are then sorted by column c3.

The schematic diagram of the index established for the c2 and c3 columns is as follows:

This ordering is the basis of the leftmost matching principle: the index can be searched by c2, or by c2 and c3 together, but not efficiently by c3 alone.

  • A joint index is essentially a secondary index
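
The sort rule of a joint index and the leftmost matching principle can be illustrated with sorted tuples. This is a simplified sketch: the sample rows, the integer type of c2 and the function names are assumptions, and a sorted Python list stands in for the B+ tree.

```python
from bisect import bisect_left

# Hypothetical table with columns (c1, c2, c3); c1 is the primary key.
rows = [(1, 4, "u"), (3, 9, "d"), (5, 3, "y"), (9, 4, "a"), (10, 4, "o")]

# Joint index on (c2, c3): entries are (c2, c3, c1), sorted by c2 first and
# by c3 only among entries that share the same c2.
joint = sorted((c2, c3, c1) for c1, c2, c3 in rows)

def range_for_c2(value):
    """c2 is the leftmost column, so matching entries form one contiguous run."""
    lo = bisect_left(joint, (value,))
    hi = bisect_left(joint, (value + 1,))  # assumes c2 is an integer column
    return joint[lo:hi]

def scan_for_c3(value):
    """c3 alone is not globally sorted in the index, so every entry is examined."""
    return [entry for entry in joint if entry[1] == value]

print(range_for_c2(4))
print([entry[1] for entry in joint])  # c3 values in index order
print(scan_for_c3("a"))
```

Printing the c3 values in index order shows that they are sorted only within each c2 group, which is why a condition on c3 alone cannot use binary search on this index.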

5. Precautions for InnoDB's B+ tree index

5.1. The location of the root page never changes

5.2. Uniqueness of directory entry records in internal nodes

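We know that a directory entry record in an internal node of a B+ tree index consists of an index column value plus a page number. For a secondary index, however, this combination is not rigorous enough: several records can share the same index column value, so such directory entries are not guaranteed to be unique.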

Suppose the data in the table is as follows:

If the content of the directory entry record in the secondary index is only the combination of index column + page number, then the B+ tree after indexing c2 should be:

5.3. A page stores at least 2 records

A tree has to branch. The page here refers to a non-leaf node: if it held fewer than 2 records, it would not branch at all, and the structure would no longer be a meaningful tree.

Origin blog.csdn.net/weixin_42675423/article/details/131056714