Principles and Use of MySQL Indexes

1. What is an index?

1.1. What is an index

Suppose a table has 5 million records and we run a query on the name field, which has no index:
select * from user_innodb where name ='jim';

What if there is an index on the name field?

ALTER TABLE user_innodb DROP INDEX idx_name;
ALTER TABLE user_innodb ADD INDEX idx_name(name);

Creating the index takes a while.
But once it exists, the same query runs dozens of times faster than it did without the index.

What exactly is an index? Why does it have such a big impact on our queries? And what happens when an index is created?

1.1.1. Index Diagram

Definition: A database index is a sorted data structure in a database management system (DBMS) to assist in quickly querying and updating data in database tables.

The data is stored on the disk in the form of a file, and each row of data has its disk address. If there is no index, we need to retrieve a piece of data from the 5 million rows of data, and we can only traverse all the data in this table in turn (calling the interface of the storage engine to read the next row of data) until we find this piece of data.

But once we have an index, we only need to look the record up in the index, which is a data structure designed specifically for fast retrieval. After we find the disk address where the data is stored, we can fetch the data directly.

This is easy to understand with an analogy: when we look for a specific entry in a 500-page dictionary, we certainly do not start reading from the first page.
The dictionary has a dedicated lookup table of only a few pages. It can be searched by pinyin or by radical and gives the page number of each entry, so once we know the page number we can quickly find the content we want.

1.1.2. Index type

How do we create an index on a table? We can define it when creating the table, add it later with ALTER TABLE, or use a GUI tool.

In such a tool's index dialog, the first field is the name of the index and the second is the indexed column, for example whether the index is built on id or on name. The last two fields are very important; one of them is the index type.

In InnoDB there are three index types: normal index, unique index (the primary key index is a special unique index), and full-text index.

Normal: Also called a non-unique index. It is the most common kind of index and has no restrictions.
Unique: A unique index requires that key values not repeat. Note that the primary key index is a special unique index with one additional restriction: the key value cannot be NULL. The primary key index is created with PRIMARY KEY.
Fulltext: Intended for relatively large text data. For example, if we store message content of several KB per row and want to avoid the poor efficiency of LIKE queries, we can create a full-text index. Only text-type columns such as CHAR, VARCHAR, and TEXT can have full-text indexes.

create table m3(
name varchar(50),
fulltext index(name));
select * from fulltext_test where match(content) against('text content' IN NATURAL LANGUAGE MODE);

After version 5.6, both MyISAM and InnoDB support full-text indexing. However, the built-in full-text indexing function of MySQL still has many limitations, and it is recommended to use other search engine solutions.

We said that an index is a data structure. So what kind of data structure should it use to achieve efficient data retrieval?

2. Index storage model deduction

2.1. Binary search

There is a number guessing game that is very popular on Douyin.
One player picks a number within 100, and the other locks onto it by continuously narrowing the range.

This is the idea behind binary search, also called half-interval search: each step cuts the candidate data in half. The method only works if the data is already sorted.

So first, you can consider a data structure that uses an ordered array as an index.

Equality queries and range queries on an ordered array are very efficient, but updates are a problem: inserting or deleting a value may require moving a large amount of data (and changing the positions of the elements after it), so an ordered array is only suitable for static data.
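The guessing game above can be sketched in a few lines of Python. This is only an illustration of binary search on an ordered array, not how MySQL implements anything; the function name `binary_search` and the step counter are assumptions made for the example.

```python
def binary_search(sorted_keys, target):
    """Return (position, steps); position is -1 if target is absent."""
    low, high = 0, len(sorted_keys) - 1
    steps = 0
    while low <= high:
        steps += 1
        mid = (low + high) // 2          # look at the middle candidate
        if sorted_keys[mid] == target:
            return mid, steps
        if sorted_keys[mid] < target:
            low = mid + 1                # discard the lower half
        else:
            high = mid - 1               # discard the upper half
    return -1, steps


numbers = list(range(1, 101))            # the numbers 1..100 from the game
position, steps = binary_search(numbers, 73)
print(position, steps)                   # never needs more than 7 steps for 100 numbers
```

Each step halves the remaining range, which is why 100 candidates need at most 7 comparisons, while a sequential scan could need 100.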

To support frequent modifications such as inserts, we could use a linked list. But a singly linked list has to be walked node by node, so its search efficiency is still not good enough.

So, is there a linked structure that supports binary search?
The binary search tree (BST) was created to solve exactly this problem.

2.2. Binary Search Tree (BST)

What are the characteristics of a binary search tree?
All nodes in the left subtree are smaller than the parent node, and all nodes in the right subtree are larger than the parent node. After projecting onto the plane, it is an ordered linear table.

A binary search tree supports both fast lookups and fast insertions.
But it has a problem:
its search time depends on the depth of the tree, and in the worst case the time complexity degenerates to O(n).
What is the worst case?
https://www.cs.usfca.edu/~galles/visualization/Algorithms.html

Take the same batch of numbers, but suppose the data we insert happens to be in order: 5, 7, 12, 14, 17, 25.
What does the binary search tree look like now?
It becomes a linked list (we call this kind of tree a "skewed tree"). In this case the tree no longer speeds up the search, and its efficiency is no better than a sequential scan.
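The degeneration described above can be reproduced with a minimal, unbalanced binary search tree. The sketch below is illustrative only; `Node`, `insert`, `depth`, and `build` are names invented for this example.

```python
class Node:
    def __init__(self, key):
        self.key = key
        self.left = None
        self.right = None


def insert(root, key):
    """Insert key into a plain (unbalanced) binary search tree; return the root."""
    if root is None:
        return Node(key)
    node = root
    while True:
        if key < node.key:
            if node.left is None:
                node.left = Node(key)
                return root
            node = node.left
        else:
            if node.right is None:
                node.right = Node(key)
                return root
            node = node.right


def depth(node):
    """Number of levels in the tree rooted at node."""
    if node is None:
        return 0
    return 1 + max(depth(node.left), depth(node.right))


def build(keys):
    root = None
    for key in keys:
        root = insert(root, key)
    return root


sorted_tree = build([5, 7, 12, 14, 17, 25])   # keys arrive in ascending order
mixed_tree = build([12, 7, 17, 5, 14, 25])    # same keys, different arrival order
print(depth(sorted_tree), depth(mixed_tree))
```

With the ascending input every key goes to the right of the previous one, so the tree is as deep as the number of keys; the same keys in a mixed order give a much shallower tree.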

What caused the skew?
The depth difference between the left and right subtrees is too large: the left subtree of this tree has no nodes at all. In other words, the tree is not balanced.
So, is there a more balanced tree with a smaller depth difference between the left and right subtrees?
Yes: the balanced binary search tree, also called the AVL tree (named after its inventors, Adelson-Velsky and Landis).

2.3. Balanced binary tree (AVL Tree) (left rotation, right rotation)

The definition of a balanced binary tree: the absolute value of the depth difference between the left and right subtrees of any node cannot exceed 1.
For example, if the depth of the left subtree is 2, the depth of the right subtree can only be 1, 2, or 3.
If we now insert 1, 2, 3, 4, 5, and 6 in order, the tree stays balanced and does not become a skewed tree.

How does it stay balanced? How does it ensure that the depth difference between the left and right subtrees never exceeds 1?
https://www.cs.usfca.edu/~galles/visualization/AVLtree.html
Insert 5, 7, 14.
Note that after we insert 5 and 7, the definition of a binary search tree forces 14 to go to the right of 7. At this point the depth of the right subtree of the root node 5 becomes 2, while the depth of its left subtree is 0 because the root has no left child, so the tree violates the definition of a balanced binary tree.

So what should we do? A right child hangs under a right child (the right-right case), so we lift 7 up to become the root, with 5 as its left child. This operation is called a left rotation.

Similarly, if we insert 14, 7, and 5, we get the left-left case. A right rotation occurs, and 7 is again lifted up.
So to maintain balance, the AVL tree performs a series of calculations and adjustments whenever data is inserted or updated.
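The two rotations can be sketched as follows. This shows only the pointer changes for the right-right and left-left cases above; a full AVL tree also tracks heights and handles the left-right and right-left cases, which are omitted here. All names are invented for the example.

```python
class Node:
    def __init__(self, key, left=None, right=None):
        self.key = key
        self.left = left
        self.right = right


def rotate_left(node):
    """Right-right case: lift node.right up; node becomes its left child."""
    new_root = node.right
    node.right = new_root.left   # the lifted node's old left subtree moves over
    new_root.left = node
    return new_root


def rotate_right(node):
    """Left-left case: lift node.left up; node becomes its right child."""
    new_root = node.left
    node.left = new_root.right   # the lifted node's old right subtree moves over
    new_root.right = node
    return new_root


# Inserting 5, 7, 14 into a plain BST gives 5 -> 7 -> 14 down the right side.
right_right = Node(5, right=Node(7, right=Node(14)))
balanced_a = rotate_left(right_right)

# Inserting 14, 7, 5 gives 14 -> 7 -> 5 down the left side.
left_left = Node(14, left=Node(7, left=Node(5)))
balanced_b = rotate_right(left_left)

print(balanced_a.key, balanced_a.left.key, balanced_a.right.key)
print(balanced_b.key, balanced_b.left.key, balanced_b.right.key)
```

In both cases 7 ends up as the root with 5 on its left and 14 on its right.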

We have solved the balance problem. So how would we query data using a balanced binary tree as the index?
In a balanced binary tree, each node is a fixed-size unit. What should a node store to serve as an index?

It should store three things:
The first is the key value of the index. For example, if we create an index on id, a query with the condition where id = 1 looks up the key value 1 in the index.
The second is the disk address of the data, because the purpose of the index is to find the address where the data is stored.
The third is the references to the left and right child nodes, since this is a binary tree and we need them to find the next node. For example, when the search key is greater than the key in the current node (say 26), we go right, move to the next tree node, and continue comparing.

If the data is stored in this way, let's see what problems there will be.
First of all, in InnoDB the index data is stored on disk. We can check the size of the data and the indexes:

select CONCAT(ROUND(SUM(DATA_LENGTH/1024/1024),2),'MB') AS data_len,CONCAT(ROUND(SUM(INDEX_LENGTH/1024/1024),2),'MB') as index_len from information_schema.TABLES
where table_schema='yteaher' and table_name='user_innodb';

When the index is stored as a tree, each time we read a node we have to compare it in the server layer to see whether it holds the data we need; if not, we have to go back to the disk and read another node.

Accessing a node requires one disk I/O. The smallest unit InnoDB reads from or writes to disk is a page (also called a disk block), whose default size is 16 KB (16384 bytes).
So a tree node is 16 KB in size.

If a node stores only one key value + data address + child references, for example for an integer column, it may use only a dozen or a few dozen bytes. That is far below the 16384-byte capacity, so a lot of space is wasted on every I/O that visits a tree node.

And if each node stores so little, we need to visit more nodes to find the data we need in the index, which means more interactions with the disk and more time consumed.

For example, suppose a table has 6 rows stored in such a balanced binary tree. To query id=66 we have to visit the root and then two levels of child nodes, which means 3 disk I/Os. What if we have millions of rows? The cost becomes even harder to estimate.

So what is the solution?
Let each node store more data.
The depth of the tree is then greatly reduced: our tree changes from tall and thin to short and fat.
At this point the tree is no longer binary, but multi-way.
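A rough calculation shows how much the depth shrinks. The sketch below is back-of-the-envelope arithmetic under stated assumptions, not InnoDB's real page layout: it assumes a 16384-byte page and a hypothetical 14 bytes per index entry (for example an 8-byte key plus a 6-byte pointer), and it treats each level as multiplying the number of reachable keys by the fan-out.

```python
PAGE_SIZE = 16384      # bytes in one InnoDB page (default size)
ENTRY_SIZE = 14        # assumed bytes per key + pointer; hypothetical figure


def levels_needed(total_keys, fanout):
    """Smallest number of levels such that fanout ** levels >= total_keys."""
    levels, reachable = 0, 1
    while reachable < total_keys:
        reachable *= fanout
        levels += 1
    return levels


fanout = PAGE_SIZE // ENTRY_SIZE                 # entries that fit in one page
binary_levels = levels_needed(5_000_000, 2)      # two-way tree
multiway_levels = levels_needed(5_000_000, fanout)
print(fanout, binary_levels, multiway_levels)
```

Under these assumptions a page holds over a thousand entries, so 5 million keys fit in about 3 levels instead of more than 20, and each level costs roughly one disk I/O.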

2.4. Multi-way balanced search tree (B Tree) (split, merge)

This is the multi-way balanced search tree, called the B Tree (the B is usually read as "balanced").
Like the AVL tree, the B Tree stores key values, data addresses, and node references in both branch nodes and leaf nodes.

It has one notable feature: the number of ways (child pointers) is always 1 more than the number of keywords. For example, in a tree where each node stores two keywords, each node has three pointers to three child nodes.

What is the search rule of a B Tree?
For example, suppose we want to find 20 in this tree.
• Search key = 20
• 20 > 15, so exclude 0X01
• 20 < 35, so exclude 0X03
• So it lies between 15 and 35
• Hit 0X02
• Go to disk block 3
• 20 = 20
• Hit

Only 3 I/Os are needed. Isn't this more efficient than the AVL tree?
Then how does a B Tree keep multiple keywords in one node and still stay balanced? How does this differ from the AVL tree?
https://www.cs.usfca.edu/~galles/visualization/Algorithms.html

For example, when the Max Degree (number of ways) is 3, we insert 1, 2, and 3. When 3 is inserted it would go into the first disk block, but a node with three keywords would need 4 pointers and its children would become 4-way, so the node must be split at this point (actually B+Tree). The middle value 2 is lifted up, and 1 and 3 become child nodes of 2.

If a node is deleted, the opposite operation, a merge, takes place.
Note that splitting and merging here are different from the left and right rotations of the AVL tree.
If we continue to insert 4 and 5, the B Tree splits and merges again.

This also shows that updating an index can trigger a large number of index structure adjustments, which explains why we should not build indexes on frequently updated columns and why we should not update the primary key.
The splitting and merging of nodes is in fact the splitting and merging of InnoDB pages.

2.5. B+ tree (enhanced multi-way balanced search tree)

The B Tree is already very efficient, so why did MySQL improve on it and end up using the B+Tree?
In short, this improved version solves the same problems as the B Tree more thoroughly.
Let's look at how the B+Tree is stored in InnoDB.

The B+Tree in MySQL has three characteristics:
1. The number of keywords equals the number of ways.
2. The root node and branch nodes of the B+Tree do not store data; only the leaf nodes do.
Open question for now: what is the data stored here? Is it the address of the real row?
A matching keyword is therefore not returned as soon as it is found; the search always continues to the leaf level. For example, if we search for id=28, the key is hit in the first level, but the data address is on a leaf node, so the search continues down until it reaches the leaf.
3. Each leaf node of the B+Tree has a pointer to the adjacent leaf node: its last entry points to the first entry of the next leaf node, forming an ordered linked list.

The advantages these characteristics bring to the B+Tree in InnoDB:
1) It is a variant of the B Tree, so it solves everything the B Tree solves. What are the two major problems the B Tree solves? (Each node stores more keywords, and there are more ways.)
2) Stronger table-scan ability (for a full table scan we only need to traverse the leaf nodes to get all the data, instead of traversing the entire B+Tree).
3) Stronger disk read/write ability than the B Tree (the root node and branch nodes hold no data area, so one node holds more keywords and one disk read loads more keywords).
4) Stronger sorting ability (the leaf nodes hold a pointer to the next data area, so the data forms a linked list).
5) More stable efficiency (the B+Tree always fetches data at the leaf nodes, so the number of I/Os is stable).
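The benefit of the linked leaf nodes can be sketched with a simplified model. The code below keeps only the leaf level of a B+Tree as a chain of small sorted pages; it starts from the first leaf, whereas a real B+Tree would first descend from the root to the leaf containing the lower bound. `Leaf`, `link`, and `range_scan` are names invented for this example.

```python
class Leaf:
    def __init__(self, keys):
        self.keys = keys     # sorted keys stored in this leaf page
        self.next = None     # pointer to the adjacent leaf page


def link(leaves):
    """Chain the leaves left to right and return the first one."""
    for left, right in zip(leaves, leaves[1:]):
        left.next = right
    return leaves[0]


def range_scan(first_leaf, low, high):
    """Collect keys in [low, high] by walking the leaf chain only."""
    result = []
    leaf = first_leaf
    while leaf is not None:
        for key in leaf.keys:
            if key > high:
                return result        # keys are ordered, so we can stop early
            if key >= low:
                result.append(key)
        leaf = leaf.next             # follow the pointer to the next page
    return result


head = link([Leaf([1, 8]), Leaf([15, 20]), Leaf([28, 35]), Leaf([40, 66])])
print(range_scan(head, 10, 30))
```

Because the leaves are ordered and linked, a range query or a sort never has to climb back up to the branch nodes.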

2.6. Index method: Is B+Tree really used?

Navicat offers two index methods when creating an index: HASH and B Tree.
HASH: Retrieves data in key-value form. A hash code and a pointer are generated from the index field, and the pointer points to the data.

Features of a hash index:
First, its lookup time complexity is O(1), so queries are fast. However, the data in a hash index is not stored in order, so it cannot be used for sorting.
Second, a lookup has to compute the hash code from the key value, so a hash index supports only equality queries (=, IN), not range queries (>, <, >=, <=, BETWEEN ... AND).
Third, if the field contains many repeated values, there will be many hash collisions (resolved by chaining), and efficiency drops.
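These features can be illustrated with an ordinary Python dictionary standing in for a hash index. This is an analogy only, not the MEMORY engine's implementation; the sample rows and function names are made up for the example.

```python
rows = {1: "bobo", 2: "jim", 3: "tom", 4: "jim"}   # row id -> name (sample data)

# Build the "hash index": key value -> list of row ids.
# Rows with the same name share one bucket, similar to chaining.
hash_index = {}
for row_id, name in rows.items():
    hash_index.setdefault(name, []).append(row_id)


def equal_lookup(name):
    """Equality query: one hash computation, then a direct bucket access."""
    return hash_index.get(name, [])


def range_lookup(low, high):
    """Range query: the buckets have no order, so every key must be checked."""
    matches = []
    for name, row_ids in hash_index.items():
        if low <= name <= high:
            matches.extend(row_ids)
    return sorted(matches)


print(equal_lookup("jim"))
print(range_lookup("a", "k"))
```

The equality lookup touches one bucket, while the range lookup has to visit all of them, which is why a hash index is not used for range conditions or sorting.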

Note that in InnoDB a hash index cannot be created explicitly (the hash index that InnoDB is said to support is the Adaptive Hash Index).
https://dev.mysql.com/doc/refman/5.7/en/create-index.html

The MEMORY storage engine can use hash indexes.

CREATE TABLE `user_memory` (
`id` INT ( 11 ) NOT NULL AUTO_INCREMENT,
`name` VARCHAR ( 255 ) DEFAULT NULL,
`gender` TINYINT ( 1 ) DEFAULT NULL,
`phone` VARCHAR ( 11 ) DEFAULT NULL,
PRIMARY KEY ( `id` ),
KEY `idx_name` ( `name` ) USING HASH
) ENGINE = MEMORY AUTO_INCREMENT = 1 DEFAULT CHARSET = utf8mb4;

If you are asked in an interview why a red-black tree is not used:

What do the constraints of a red-black tree guarantee? That the longest path is no more than twice the length of the shortest path. It is still a binary tree, so it is not well suited to database indexes; it suits in-memory data structures, for example implementing consistent hashing.
Because of their characteristics, the B Tree and B+Tree are widely used in file systems and databases, such as the HPFS file system and the Oracle, MySQL, and SQL Server databases.

3. How B+Tree is implemented in practice

3.1. MySQL data storage files

As discussed for the MySQL architecture and internal modules, different storage engines use different files.
Each InnoDB table has two files (.frm and .ibd), and each MyISAM table has three files (.frm, .MYD, .MYI).

Both have a .frm file. The .frm file defines the table structure in MySQL and is generated no matter which storage engine is chosen when the table is created.

3.2.1. MyISAM

In MyISAM there are two other files.
One is the .MYD file. D stands for Data; it is MyISAM's data file and stores the data records, for example all the rows of our user_myisam table.
The other is the .MYI file. I stands for Index; it is MyISAM's index file and stores the indexes. For example, if we create a primary key index on the id field, that primary key index lives in this file.
In other words, in MyISAM the index and the data are two separate files.
So how do we find data through the index?
In MyISAM's B+Tree, the leaf nodes store the disk addresses of the rows in the data file. So after the key value is found in the index file .MYI, the corresponding record is fetched from the data file .MYD.

What about an auxiliary index? Is there any difference?

ALTER TABLE user_innodb DROP INDEX index_user_name;
ALTER TABLE user_innodb ADD INDEX index_user_name (name);

In MyISAM, auxiliary indexes are also stored in the .MYI file.
An auxiliary index stores and retrieves data in the same way as the primary key index: the disk address is found in the index file, and then the data is fetched from the data file.

This is how indexes are implemented in MyISAM. InnoDB is different; let's take a look.

3.2.2. InnoDB

Apart from .frm, InnoDB has only one file (the .ibd file), so where is the index?
InnoDB uses the primary key as the index that organizes data storage, so the index and the data live in the same .ibd file.

The leaf nodes of InnoDB's primary key index store the row data directly.
This is why it is said that in InnoDB the index is the data and the data is the index.

But this raises a question. An InnoDB table may have many indexes, yet there can be only one copy of the data. So which index's leaf nodes hold the data?

Here we need the concept of a clustered index.
In a clustered index, the logical order of the index key values is consistent with the physical storage order of the table's rows. (For example, if a dictionary's directory is sorted by pinyin and its content is also sorted by pinyin, that pinyin directory is a clustered index.)

InnoDB organizes its data as a clustered-index-organized table. If a table has a primary key index, that primary key index is the clustered index, and it determines the physical storage order of the rows.

The question is: what do indexes other than the primary key index store, and how do we retrieve the complete row if their leaf nodes hold no data? Take an ordinary index built on the name field as an example.

In InnoDB, indexes are divided into primary and secondary. As just said, if there is a primary key index, it is the clustered index. All other indexes are collectively called "secondary indexes" or auxiliary indexes.

A secondary index stores the key values of the indexed column. For example, an index on name stores name values such as bobo and bibi in its nodes.

The leaf nodes of a secondary index also store the primary key value of each record, for example bobo with id=1, jim with id=4, and so on.

So retrieving data through a secondary index works as follows:
when we query a record through the name index, the search finds name=bobo in a leaf node of the secondary index, gets the primary key value id=1, and then goes to the leaf nodes of the primary key index to get the row.

Seen this way, a lookup through the primary key index scans one fewer B+Tree than a lookup through a secondary index, so it is faster.
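The two-step lookup can be modeled with two dictionaries, one per index. This is a simplified analogy in which dictionaries stand in for the two B+Trees; the table contents and the name `find_by_name` are invented for the example.

```python
# Clustered (primary key) index: primary key -> complete row.
clustered_index = {
    1: {"id": 1, "name": "bobo", "gender": 0},
    4: {"id": 4, "name": "jim", "gender": 1},
}

# Secondary index on name: name -> primary key values, no row data.
secondary_index_name = {"bobo": [1], "jim": [4]}


def find_by_name(name):
    """Look up rows by name: secondary index first, then the clustered index."""
    rows = []
    for primary_key in secondary_index_name.get(name, []):   # tree 1: name -> id
        rows.append(clustered_index[primary_key])             # tree 2: id -> row
    return rows


print(find_by_name("bobo"))
print(clustered_index[4])      # a primary key lookup needs only one step
```

A query by name touches both structures, while a query by id touches only the clustered index.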

But what if a table has no primary key? Which index's leaf nodes hold the complete record? And what if the table has no indexes at all? Where is the data stored?

https://dev.mysql.com/doc/refman/5.7/en/innodb-index-types.html

1. If we define a primary key (PRIMARY KEY), InnoDB chooses it as the clustered index.
2. If no primary key is explicitly defined, InnoDB chooses the first unique index whose columns are all NOT NULL as the clustered index.
3. If there is no such unique index, InnoDB uses a built-in 6-byte ROWID as a hidden clustered index; its value increases as rows are written.

4. Index Usage Principles

It is a common misunderstanding that every frequently used query condition should get an index, and that the more indexes the better. Is this true?

4.1. Column dispersion

The first principle concerns column dispersion. The formula is:
count(distinct(column_name)) : count(*), that is, the ratio of the number of distinct values in a column to the total number of rows. For the same number of rows, the larger the numerator, the higher the dispersion of the column.

In simple terms, if the column has more repeated values, the dispersion will be lower, and if there are fewer repeated values, the dispersion will be higher.
We do not recommend that you build indexes on fields with low dispersion.
First run the query without an index:

SELECT * FROM `user_innodb` WHERE gender = 0;

Check again after building the index:

ALTER TABLE user_innodb DROP INDEX idx_user_gender;
ALTER TABLE user_innodb ADD INDEX idx_user_gender (gender); -- takes quite a long time
SELECT * FROM `user_innodb` WHERE gender = 0;

We find that the query now takes longer.
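The dispersion formula from this section can be tried on a small sample. The sketch below simply computes count(distinct) / count(*) in Python; the sample values are invented and stand in for a gender column and a name column.

```python
def dispersion(values):
    """Ratio of distinct values to total rows, i.e. count(distinct) / count(*)."""
    return len(set(values)) / len(values)


# Made-up sample columns with 8 rows each.
gender_column = [0, 1, 0, 1, 0, 1, 0, 1]
name_column = ["bobo", "jim", "tom", "lily", "jim", "kate", "amy", "leo"]

print(dispersion(gender_column))   # only 2 distinct values among 8 rows
print(dispersion(name_column))     # 7 distinct values among 8 rows
```

A column like gender has very low dispersion, so an equality condition on it still matches a large share of the table, which is why an index on it helps little.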

4.2. Joint index leftmost match

So far we have discussed indexes on a single column, but when we query with multiple conditions we may also create a joint index.
For example, to look up exam results you must enter both the ID card number and the exam number.

A single-column index can be seen as a special joint index.
For example, we create a joint index on name and phone in the user table.

ALTER TABLE user_innodb DROP INDEX comidx_name_phone;
ALTER TABLE user_innodb add INDEX comidx_name_phone (name,phone);

A joint index is a composite structure within a B+Tree: the search tree is built by comparing columns from left to right (name on the left, phone on the right).
Across the whole index, name is ordered and phone is not; phone is ordered only among entries with equal names.

So when we query with where name= 'jim' and phone = '136xx ', the B+Tree compares name first to decide whether the search goes left or right, and compares phone only when the names are equal. If the query condition has no name, the search does not know which node to visit first, because name is the first comparison factor when the tree is built, so the index is not used.

4.2.1. When to use a joint index

Therefore, when we build a joint index, we should put the most frequently used column on the far left.
Consider the following three kinds of statements. Which of them use the joint index?
1) A query on both fields (name and phone) uses the joint index.

2) A query on the left field name alone uses the joint index.

3) A query on the right field phone alone cannot use the index and does a full table scan.

4.2.2. How to create a joint index

One day our DBA came to me and said that two queries in our project were very slow. Following the idea of one index per query, we created two indexes for these two SQL statements.

CREATE INDEX idx_name on user_innodb(name);
CREATE INDEX idx_name_phone on user_innodb(name,phone);

According to the leftmost matching principle, a joint index can also be used when querying by its left field name alone, so the first index is completely unnecessary.
Creating the joint index on (name, phone) is equivalent to having two indexes: (name) and (name, phone).

If we create an index index(a,b,c) of three fields, it is equivalent to creating three indexes:
index(a)
index(a,b)
index(a,b,c)
Queries with where b=? or where b=? and c=? cannot use the index.
This is the leftmost matching principle of the joint index in MySQL.
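The leftmost matching rule can be expressed as a small helper. This is a deliberately simplified model that considers only equality conditions and ignores everything else the MySQL optimizer takes into account; `usable_prefix` is a name invented for the example.

```python
def usable_prefix(index_columns, equality_columns):
    """Return the leading index columns that the equality conditions can use."""
    used = []
    for column in index_columns:
        if column not in equality_columns:
            break                  # the chain stops at the first missing column
        used.append(column)
    return used


index_abc = ["a", "b", "c"]        # the joint index index(a,b,c)

print(usable_prefix(index_abc, {"a"}))            # where a=?
print(usable_prefix(index_abc, {"a", "b"}))       # where a=? and b=?
print(usable_prefix(index_abc, {"b", "c"}))       # where b=? and c=?
print(usable_prefix(index_abc, {"a", "c"}))       # where a=? and c=?
```

Conditions that skip the leftmost column get an empty prefix, and a condition on a and c can use only the a part of the index for the search.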

4.3. Covering indexes

What is going back to the table?
With a non-primary-key index, we first find the primary key value through the index, and then use that primary key value to fetch the data that is not in the index. This scans one more index tree than a query on the primary key index. The process is called going back to the table.

For example:

select * from user_innodb where name = 'bobo';

In an auxiliary index, whether single-column or joint, if all the selected columns can be obtained from the index alone and nothing needs to be read from the data area, the index is called a covering index for that query, and going back to the table is avoided.

An Extra value of "Using index" in the EXPLAIN output means a covering index is used.

Let's first create a joint index:

-- Create a joint index
ALTER TABLE user_innodb DROP INDEX comixd_name_phone;
ALTER TABLE user_innodb add INDEX `comixd_name_phone` (`name`,`phone`);

All three queries use covering indexes:

EXPLAIN SELECT name,phone FROM user_innodb WHERE name= 'jim' AND phone = '13666666666';
EXPLAIN SELECT name FROM user_innodb WHERE name= 'jim' AND phone = '13666666666';
EXPLAIN SELECT phone FROM user_innodb WHERE name= 'jim' AND phone = '13666666666';

With select *, the covering index is not used.
What if the condition is changed to only where phone = ...? According to our previous analysis, the leftmost matching principle is not satisfied, so the index cannot be used to locate rows.
Yet the covering index can still be used! Whether an index covers a query is not directly related to whether the index can be used for the search itself.
Because a covering index reduces the number of I/Os and the amount of data accessed, it can greatly improve query efficiency.
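Whether an index covers a query can be checked with simple set logic. The helper below is a simplified model, relying on the fact stated earlier that the leaf nodes of an InnoDB secondary index also hold the primary key value; `is_covering` and the column lists are assumptions made for the example.

```python
def is_covering(index_columns, primary_key_columns, selected_columns):
    """True if every selected column is available in the secondary index itself."""
    available = set(index_columns) | set(primary_key_columns)
    return set(selected_columns) <= available


joint_index = ["name", "phone"]    # columns of the joint index on name and phone
primary_key = ["id"]

print(is_covering(joint_index, primary_key, ["name", "phone"]))
print(is_covering(joint_index, primary_key, ["phone"]))
print(is_covering(joint_index, primary_key, ["id", "name"]))
print(is_covering(joint_index, primary_key, ["name", "gender"]))   # gender is not in the index
```

As soon as a selected column such as gender is missing from the index, the query has to go back to the table, which is also why select * rarely benefits from a covering index.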

5. Index creation and use

Because indexes play a huge role in improving query performance, our goal is to use indexes as much as possible.

5.1. On what field to index?

1. Create indexes on fields used in WHERE conditions, ORDER BY sorting, and JOIN ON conditions.
2. Do not create too many indexes.
- They waste space and slow down updates.
3. Do not build indexes on fields with low dispersion, such as gender.
- The dispersion is too low, so too many rows are scanned.
4. Do not use frequently updated values as primary keys or index columns.
- They cause page splits.
5. Random, unordered values such as ID card numbers and UUIDs are not recommended as primary keys.
- Unordered inserts cause page splits.
6. Create a composite index rather than modifying a single-column index.
- Create a joint index.

5.2. When do indexes fail to be used?

1. Using functions (REPLACE, SUBSTR, CONCAT, SUM, COUNT, AVG) or expression calculations (+ - * /) on the indexed column: https://www.runoob.com/mysql/mysql-functions.html

explain SELECT * FROM `t2` where id+1 = 4;

2. If a string value is not quoted, an implicit conversion occurs:

ALTER TABLE user_innodb DROP INDEX comidx_name_phone;
ALTER TABLE user_innodb add INDEX comidx_name_phone (name,phone);

explain SELECT * FROM `user_innodb` where name = 136;
explain SELECT * FROM `user_innodb` where name = '136';

3. LIKE conditions with a leading %
Of like 'abc%', like '%2673%', and like '%888' in a where condition, the ones with % in front cannot use the index. Why?

explain select * from user_innodb where name like 'wang%';
explain select * from user_innodb where name like '%wang';

Filtering is too expensive. In this case a full-text index can be used instead.
4. Negative queries
NOT LIKE cannot use the index:

explain select * from employees where last_name not like 'wang';

!= (<>) and NOT IN can use the index in some cases:

explain select * from employees where emp_no not in (1);
explain select * from employees where emp_no <> 1;

Note that this depends on the database version, the data volume, and the values selected.
In fact, the optimizer has the final say on whether an index is used.
What does the optimizer base its decision on?
It is a cost-based optimizer (CBO), not a rule-based optimizer (RBO), and it is not based on semantics either. It chooses whichever plan has the lowest cost.
https://docs.oracle.com/cd/B10501_01/server.920/a96533/rbo.htm#38960

https://dev.mysql.com/doc/refman/5.7/en/cost-model.html
There are basic principles for using indexes, but no hard rules. No rule says that an index must be used in a given situation, or that it must not be.

Origin blog.csdn.net/lx9876lx/article/details/129129755