In-depth understanding of MySQL: a detailed explanation of indexing principles

1. What is a database index

A database index is a sorted data structure maintained by the database management system (DBMS) that helps it quickly query and update the data in database tables.

2. Deduction of the index storage model

1. Ordered array

Process: Because the elements are kept in order, a value can be found by scanning in ascending or descending order, or more efficiently by binary search.

Disadvantage: When a value is inserted at the beginning or in the middle, every element after it has to be shifted. Inserts, deletes, and updates are therefore inefficient.
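The trade-off above can be sketched with Python's standard `bisect` module. This is only an illustration of the ordered-array model, not how MySQL stores indexes; the helper name `find` and the sample keys are made up for the example.

```python
import bisect

def find(sorted_keys, key):
    """Binary search: return the position of key, or -1 if it is absent."""
    i = bisect.bisect_left(sorted_keys, key)
    if i < len(sorted_keys) and sorted_keys[i] == key:
        return i
    return -1

keys = [3, 8, 15, 21, 42]          # hypothetical index values, already sorted
position = find(keys, 15)          # lookup needs O(log n) comparisons

# Inserting near the front keeps the order, but every later element
# has to move one slot to the right, which costs O(n).
bisect.insort(keys, 5)
```

The lookup is cheap, while `insort` hides the shifting cost that makes ordered arrays a poor fit for frequently modified data.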

2. Binary search tree (BST)

Features: All nodes in the left subtree are smaller than the parent node, and all nodes in the right subtree are larger than the parent node. Projected onto a line (that is, read by an in-order traversal), the nodes form an ordered linear list.

Process: The value being searched for is compared with the root node. If it is greater, the search continues in the right subtree; if it is smaller, the search continues in the left subtree; if it is equal, the search ends and the disk address stored in the node is used to read the data.

Compared with ordered arrays: lookups stay fast as long as the tree is reasonably balanced, and inserts and updates are cheaper because no elements have to be shifted.

Disadvantage: Lookup cost depends on the depth of the tree. In the worst case, when every node has only one child and the tree degenerates into a linked list, the time complexity degrades to O(n).
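The worst case described above can be reproduced with a minimal, unbalanced BST. This is a sketch for illustration only; the class and function names are hypothetical, and the tree stores bare keys instead of keys plus disk addresses.

```python
class Node:
    def __init__(self, key):
        self.key = key
        self.left = None
        self.right = None

def insert(root, key):
    """Insert key without any rebalancing and return the root."""
    if root is None:
        return Node(key)
    node = root
    while True:
        if key < node.key:
            if node.left is None:
                node.left = Node(key)
                break
            node = node.left
        elif key > node.key:
            if node.right is None:
                node.right = Node(key)
                break
            node = node.right
        else:
            break  # key already present
    return root

def search_steps(root, key):
    """Return how many nodes are visited to find key, or None if absent."""
    steps = 0
    node = root
    while node is not None:
        steps += 1
        if key == node.key:
            return steps
        node = node.left if key < node.key else node.right
    return None

def build(keys):
    root = None
    for k in keys:
        root = insert(root, k)
    return root

balanced = build([4, 2, 6, 1, 3, 5, 7])   # insertion order gives a balanced tree
skewed = build([1, 2, 3, 4, 5, 6, 7])     # sorted input: every node has one child

balanced_steps = search_steps(balanced, 7)  # visits 4, 6, 7
skewed_steps = search_steps(skewed, 7)      # visits all seven nodes
```

The same seven keys cost 3 comparisons in one tree and 7 in the other, which is the motivation for the self-balancing trees that follow.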

3. Balanced binary tree (AVL tree) (left rotation, right rotation)

Features: A binary search tree in which the depths of the left and right subtrees of any node differ by at most 1. Balance is restored by rotations: the left-left case needs a right rotation, and the right-right case needs a left rotation.

Process: The same as for a binary search tree.

Compared with the binary search tree: the depth difference between the left and right subtrees of the tree is small, and the query is relatively stable.

Disadvantages: Suppose a binary tree is used as the index and each node is stored in its own InnoDB page of 16 KB. A node holds only one index column value, the disk address of the record, and the disk addresses of the left and right subtrees, which is far less than 16 KB, so most of every page read is wasted. Each comparison during a lookup may need a separate disk I/O, and because a binary tree is deep, a lookup needs many of them. When the amount of data is large, the I/O cost is high.
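A rough calculation shows the scale of the problem. The figures below are assumptions for illustration: a table of 1,000,000 rows, a perfectly balanced binary tree, and one 16 KB page read per level visited.

```python
def min_binary_tree_levels(n):
    """Fewest levels a binary tree needs to hold n keys (one key per node)."""
    # For n >= 1 this equals floor(log2(n)) + 1.
    return n.bit_length()

rows = 1_000_000                      # hypothetical table size
levels = min_binary_tree_levels(rows)

# Assumption: every node lives on its own 16 KB page,
# so each level visited costs one page read.
bytes_read_per_lookup = levels * 16 * 1024
```

Even in the best case a lookup touches 20 levels, and each page read delivers only one useful key.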

4. Multi-way balanced search tree (B Tree, Balanced Tree) (split, merge)

Features: Like the AVL tree, the B Tree stores key values, data addresses, and node references in both branch nodes and leaf nodes. The number of paths (N=3 in the figure) is always one more than the number of keywords (field values) in a node, much as cutting a rope k times yields k+1 pieces. Because one node holds several keywords, a single page stores much more index information. This addresses the weakness of the AVL tree, where each page holds very little and the same amount of data takes many more disk reads.

Process: For example, to find 15 in this tree: 15 is less than 17, so go left; 15 is greater than 12, so go right; 15 is then found in disk block 7. Only 3 I/Os are needed.

Disadvantages: When data is inserted or deleted, or when an indexed column is updated, pages may be split or merged, and each split or merge requires several additional disk I/Os.

5. B+ tree (enhanced multi-way balanced search tree)

Features: 1. The number of keywords in a node equals the number of paths. 2. The root node and branch nodes of a B+Tree store no data; only the leaf nodes store data. 3. Each leaf node holds a pointer to the adjacent leaf node, so the last entry of one leaf links to the first entry of the next, and the leaves form an ordered linked list.

Process (see the figure): The search starts at the root node. A value smaller than the first keyword of the root does not exist in the tree. If the value is greater than or equal to the first keyword in a node and smaller than the second, follow the first keyword's pointer. If it is greater than or equal to the second keyword and smaller than the third, follow the second pointer. If it is greater than or equal to the third keyword, follow the third pointer. This repeats level by level until a leaf node is reached, where the disk address of the record is obtained.
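The descent described above can be sketched as follows. This is a simplified two-level model, not InnoDB's page format: the class names, the sample keys, and the string "disk addresses" are all made up for the example, and each keyword in a branch node is assumed to be the smallest key under the matching child.

```python
import bisect

class Leaf:
    def __init__(self, keys, addrs):
        self.keys = keys
        self.addrs = addrs   # hypothetical disk addresses of the records
        self.next = None     # pointer to the adjacent leaf

class Branch:
    def __init__(self, keys, children):
        self.keys = keys             # keys[i] is the smallest key under children[i]
        self.children = children     # same length as keys

def search(node, value):
    """Follow the last keyword <= value at each level, then probe the leaf."""
    while isinstance(node, Branch):
        i = bisect.bisect_right(node.keys, value) - 1
        if i < 0:
            return None              # smaller than the first keyword: not in the tree
        node = node.children[i]
    i = bisect.bisect_left(node.keys, value)
    if i < len(node.keys) and node.keys[i] == value:
        return node.addrs[i]
    return None

def range_scan(node, low, high):
    """Find the starting leaf, then walk the leaf linked list for keys in [low, high]."""
    while isinstance(node, Branch):
        i = max(bisect.bisect_right(node.keys, low) - 1, 0)
        node = node.children[i]
    found = []
    while node is not None:
        for k in node.keys:
            if k > high:
                return found
            if k >= low:
                found.append(k)
        node = node.next
    return found

leaf1 = Leaf([1, 5, 9], ["addr1", "addr5", "addr9"])
leaf2 = Leaf([12, 15, 17], ["addr12", "addr15", "addr17"])
leaf3 = Leaf([20, 26, 30], ["addr20", "addr26", "addr30"])
leaf1.next, leaf2.next = leaf2, leaf3
root = Branch([1, 12, 20], [leaf1, leaf2, leaf3])

address = search(root, 15)
in_range = range_scan(root, 9, 17)
```

`range_scan` only descends the tree once; after that it relies on the leaf pointers, which is why range and ordered reads are cheap in this structure.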

Advantages:

  1. The root and branch nodes do not store record data or its disk address, so each page holds more keywords. The more paths per node, the fewer levels the tree has.
  2. Table scans are more efficient: a full table scan only needs to traverse the leaf nodes, not the entire B+Tree.
  3. Disk reads and writes are more efficient than with a B Tree: because the root and branch nodes carry no data area, one node holds more keywords and more keywords are loaded from disk in a single read.
  4. Sorting is more efficient, because each leaf node has a pointer to the next one and the data forms a linked list.
  5. Performance is more stable: a B+Tree always fetches data at the leaf nodes, so the number of I/Os per lookup is constant.
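The effect of a high number of paths on tree height can be estimated with some simple arithmetic. Only the 16 KB page size comes from the text above; the 8-byte key (a BIGINT primary key), the 6-byte child pointer, and the 1 KB average row size are assumptions chosen for illustration, and page headers and fill factor are ignored.

```python
PAGE_SIZE = 16 * 1024     # InnoDB page size from the text
KEY_BYTES = 8             # assumption: BIGINT primary key
POINTER_BYTES = 6         # assumption: size of a child-page pointer
ROW_BYTES = 1024          # assumption: average row size in a leaf page

# Keywords (and therefore paths) that fit in one root or branch page.
fanout = PAGE_SIZE // (KEY_BYTES + POINTER_BYTES)

# Rows that fit in one leaf page.
rows_per_leaf = PAGE_SIZE // ROW_BYTES

def max_rows(levels):
    """Rows reachable by a B+Tree with the given number of levels."""
    return fanout ** (levels - 1) * rows_per_leaf

two_levels = max_rows(2)
three_levels = max_rows(3)
```

Under these assumptions a three-level tree already addresses about 21.9 million rows, so a lookup needs at most three page reads.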

3. Index differences between MySQL storage engines

InnoDB

Under the InnoDB storage engine, in addition to the tableName.frm table structure file, a table also has a tableName.ibd file. InnoDB uses the primary key as the index that organizes data storage, so the index and the data live in the same file, the .ibd file. The leaf nodes of the primary key index store the row data directly.

Clustered index: The logical order of the index key values is consistent with the physical storage order of the table's data rows.

Non-clustered index (auxiliary index, secondary index): If there is a primary key index, then the primary key index is a clustered index. The other indexes are collectively called "secondary indexes" or auxiliary indexes.

Generation of a clustered index: If we define a primary key (PRIMARY KEY), InnoDB selects the primary key as the clustered index. If no primary key is explicitly defined, InnoDB selects the first unique index that contains no NULL values as the clustered index. If there is no such unique index, InnoDB uses a built-in 6-byte ROWID as a hidden clustered index; its value increases as row records are written.

Under the InnoDB storage engine, if a table creates a primary key index, then the primary key index is a clustered index, which determines the physical storage order of data rows.

A secondary index stores the key values of the indexed column; for example, if an index is built on name, the nodes store values of name. The leaf nodes of the secondary index store the primary key value of the corresponding record. Retrieving data through a secondary index therefore works as follows: when we query a record through the name index, the engine finds name=Q in a leaf node of the secondary index, obtains the primary key value (id=1), and then goes to the leaf node of the primary key index to get the data.

MyISAM

In MyISAM, besides the tableName.frm table structure file, a table has two other files: one is the .MYD file, and D stands for Data, which is the data file of MyISAM and stores data records. One is the .MYI file, I stands for Index, which is the index file of MyISAM and stores the index.

In the B+Tree of MyISAM, the leaf nodes store the disk addresses corresponding to the data files. So after finding the key value from the index file .MYI, it will get the corresponding data record from the data file .MYD.

In MyISAM, the auxiliary index is also in this .MYI file. There is no difference between the auxiliary index and the primary key index in the way of storing and retrieving data. The disk address is found in the index file, and then the data is obtained in the data file.
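The MyISAM layout can be modelled with plain Python containers. This is a conceptual sketch only: the list stands in for the .MYD data file, the two dictionaries stand in for the B+Trees in the .MYI file, and a list position plays the role of a disk address. The sample rows are invented.

```python
# Stand-in for the .MYD data file: rows stored in insertion order.
myd_rows = [
    {"id": 3, "name": "Tom"},
    {"id": 1, "name": "Q"},
    {"id": 2, "name": "Mic"},
]

# Stand-ins for the B+Trees in the .MYI file. Both the primary key index
# and the auxiliary index map a key to a row position ("disk address").
primary_index = {row["id"]: pos for pos, row in enumerate(myd_rows)}
name_index = {row["name"]: pos for pos, row in enumerate(myd_rows)}

def fetch(index, key):
    """Look up the address in the index, then read the row from the data file."""
    pos = index.get(key)
    return None if pos is None else myd_rows[pos]

by_id = fetch(primary_index, 1)
by_name = fetch(name_index, "Q")
```

Both lookups follow the same two steps and land on the same stored row, which is the sense in which MyISAM's primary and auxiliary indexes do not differ.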

4. Creation and use of indexes

Dispersion

Dispersion can be measured as the number of distinct values in a column divided by the total number of rows. The more repeated values a column has, the lower its dispersion; the fewer repeated values, the higher its dispersion. Building indexes on fields with low dispersion is not recommended.
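As a small worked example, dispersion can be computed as the ratio of distinct values to total values. The function name and the sample columns are hypothetical.

```python
def dispersion(values):
    """Ratio of distinct values to total values; 1.0 means every value is unique."""
    return len(set(values)) / len(values)

# Hypothetical column contents for eight rows.
gender = ["M", "F", "M", "F", "M", "F", "M", "F"]
phone = ["1801", "1802", "1803", "1804", "1805", "1806", "1807", "1808"]

gender_dispersion = dispersion(gender)   # only two distinct values
phone_dispersion = dispersion(phone)     # all values distinct
```

An index on the gender column would still leave about half the table to scan for any given value, while an index on phone narrows the search to a single row.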

Joint index leftmost match

A joint index is a composite data structure in a B+Tree: the search tree is built by comparing columns from left to right (name on the left, phone on the right). As the figure shows, name is ordered across the whole index, while phone on its own is not; phone is ordered only among entries with the same name. When we query with where name = 'Mic' and phone = '180xx', the B+Tree compares name first to decide whether to go left or right, and compares phone only when the names are equal. If the query condition does not include name, there is no way to decide which node to visit at the first step, because name is the first comparison factor used when the tree was built, so the index is not used.

Therefore, when we build a joint index, we should put the most commonly used column on the far left.

Example: Creating a joint index (a,b,c) is equivalent to creating three indexes: index(a), index(a,b), and index(a,b,c).

A query with where a=xx and b=xx and c=xx can use the whole index (a,b,c); a query with where a=xx and c=xx can use only the a part of the index; a query with where b=xx and c=xx cannot use the index.
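The leftmost-match rule for equality conditions can be expressed as a short function. This is a simplified model, not the MySQL optimizer: it only considers equality predicates and ignores range conditions and any other optimizer behaviour. The function name is hypothetical.

```python
def usable_prefix(index_columns, equality_columns):
    """Return the leading index columns that the given equality conditions can use."""
    used = []
    for column in index_columns:
        if column not in equality_columns:
            break                      # the first gap ends the usable prefix
        used.append(column)
    return used

index_abc = ["a", "b", "c"]

all_three = usable_prefix(index_abc, {"a", "b", "c"})   # where a=.. and b=.. and c=..
a_and_c = usable_prefix(index_abc, {"a", "c"})          # where a=.. and c=..
b_and_c = usable_prefix(index_abc, {"b", "c"})          # where b=.. and c=..
```

The three results correspond to the three cases in the example: the full index, the a part only, and no usable part at all.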

Back to the table: With a non-primary-key index, we first find the primary key value through that index, and then use the primary key value to look up, in the primary key index, the data that is not stored in the secondary index. This scans one more index tree than a query on the primary key index. The process is called going back to the table.

Covering index: In an auxiliary index, whether single-column or joint, if all the selected columns can be obtained from the index alone without reading the data area, the index is called a covering index for that query, and going back to the table is avoided.
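The difference between going back to the table and using a covering index can be sketched by counting how many index trees a query has to search. The dictionaries below are stand-ins for InnoDB's clustered index and a secondary index on name; the rows, the `select` helper, and its return shape are invented for the example.

```python
# Stand-in for the clustered (primary key) index: primary key -> full row.
clustered_index = {
    1: {"id": 1, "name": "Q", "phone": "180xx"},
    2: {"id": 2, "name": "Mic", "phone": "139xx"},
}

# Stand-in for a secondary index on name: name -> primary key value.
name_index = {"Q": 1, "Mic": 2}

def select(columns, name):
    """Model of SELECT columns ... WHERE name = ?; returns (row, trees_searched)."""
    pk = name_index.get(name)
    if pk is None:
        return None, 1
    # A secondary index entry contains the indexed column and the primary key.
    entry = {"name": name, "id": pk}
    if all(column in entry for column in columns):
        # Covering index: everything requested is already in the entry.
        return {column: entry[column] for column in columns}, 1
    # Back to the table: a second search, this time in the clustered index.
    row = clustered_index[pk]
    return {column: row[column] for column in columns}, 2

covered_row, covered_trees = select(["id"], "Q")
full_row, full_trees = select(["phone"], "Q")
```

Selecting only id is answered from the name index alone, while selecting phone needs the second search in the clustered index.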

Principles for creating indexes:

  1. Create indexes on the fields used in where conditions, order by sorting, and join (on) conditions
  2. Do not create too many indexes (they waste space and slow down updates)
  3. Do not build an index on fields with low discrimination, such as gender (the dispersion is too low, so too many rows are scanned)
  4. Frequently updated values should not be used as primary keys or indexes (they cause page splitting)
  5. Random, unordered values such as ID card numbers or UUIDs are not recommended as the primary key index (unordered inserts cause page splitting)
  6. Create composite indexes instead of modifying single-column indexes

When the index fails:

  1. A function (replace, SUBSTR, CONCAT, sum, count, avg) or an expression calculation (+ - * /) is applied to the index column: SELECT * FROM `t2` where id+1 = 4;
  2. A string value is not quoted, so an implicit type conversion occurs: SELECT * FROM dept WHERE dname = 101;
  3. The like pattern begins with %
  4. Negative queries: NOT LIKE cannot use the index; != (<>) and NOT IN can use it in some cases.
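Case 3 can be illustrated with a sorted list standing in for an index. This is an analogy rather than MySQL's implementation: a pattern with a fixed prefix maps to one contiguous range of the sorted keys, while a pattern with a leading % does not. The names are sample data, and the `"\uffff"` upper bound assumes no key contains that character.

```python
import bisect

# Stand-in for the ordered keys of an index on a name column.
names = sorted(["Alice", "Mia", "Mic", "Mick", "Tom"])

def like_prefix(prefix):
    """Model of LIKE 'prefix%': matches form one contiguous range of sorted keys."""
    lo = bisect.bisect_left(names, prefix)
    hi = bisect.bisect_left(names, prefix + "\uffff")   # assumed upper bound
    return names[lo:hi]

def like_suffix(suffix):
    """Model of LIKE '%suffix': matches can be anywhere, so every key is checked."""
    return [name for name in names if name.endswith(suffix)]

prefix_matches = like_prefix("Mic")   # two binary searches, no full scan
suffix_matches = like_suffix("ck")    # has to examine all keys
```

The sort order helps only when the leading characters are known, which is why a pattern that starts with % cannot narrow the search.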

 


Origin blog.csdn.net/liuhenghui5201/article/details/115553680