What is a MySQL index? Index data structures, index types, and how to use indexes correctly

1. What is an index?

MySQL's official definition of an index is: an index is a data structure that helps MySQL obtain data efficiently.

In other words, the essence of an index is that it is a data structure.

Let's take an example. When you open a book, the first thing you see is the table of contents. Looking up the contents of the book through the table of contents is very fast, as follows:

Insert image description here

A book's table of contents is laid out in order (Chapter 1, Chapter 2, ...), so it is itself a data structure stored in sequence: a sequential structure.

But what if we go to a library to find a book?

The best way is to follow a guide like this:

Insert image description here

As can be seen above, the entire index structure is an upside-down tree, which is itself a data structure. It speeds up lookups more than the linear table of contents mentioned earlier; without it, you might have to search floor by floor.

Indexes in MySQL work the same way. We can create a series of indexes in the database. For example, when a primary key ID is defined, a primary key index is created by default. When we query by ID, the index is searched first; once the entry is found, it quickly leads to the exact location of the data.

So indexing is an important aspect of application design and development. If there are too many indexes, application performance may be affected, and if there are too few indexes, query performance may be affected. Finding the right balance is critical to the performance of your application.

Note: The following introduction to indexes mainly refers to indexes in InnoDB unless otherwise specified.

2. Classification of indexes

To learn indexing well, we first need to understand the types of indexes. Indexes can be classified from four perspectives:

1. Data structure perspective

  • B+Tree index
  • Hash index
  • Full-text index (FULLTEXT)

2. Physical storage perspective

  • Clustered index
  • Auxiliary index (secondary index, non-clustered index)

3. Logical perspective

  • Primary key index
  • Unique index
  • Ordinary index
  • Prefix index

4. Practical usage perspective

  • Single-column index
  • Joint index (composite index)

3. Index data structure

With the classification clear, let's look at the data structures behind indexes.

The InnoDB storage engine supports the following common indexes:

  1. B+Tree index
  2. Full text index
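  3. Hash index (in InnoDB this is the internal adaptive hash index described in section 7.1; it cannot be created manually)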

The more critical one is the B+Tree index:

  • The B+Tree index is the index in the traditional sense; it is currently the most commonly used and most effective index for lookups in relational database systems.
  • The structure of a B+Tree index is similar to a binary tree: data can be found quickly based on the key value.
  • Note that the B in B+Tree stands for balance, not binary. B+Tree evolved from the balanced binary tree, but it is not itself a binary tree.

3.1. Binary tree

A binary tree is an ordered tree in which no node has more than 2 child nodes. Put simply, for the trees discussed here, smaller values go to the left of a node and larger values go to the right.

Insert image description here

3.2. Binary search tree

A binary search tree (BST) is also known as a binary sort tree. A binary search tree satisfies the following conditions:

  • First, it is a binary tree
  • All values in the left subtree are less than the value of the root node
  • All values in the right subtree are greater than the value of the root node
  • The left and right subtrees also satisfy the above two points.
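The conditions above can be sketched in a few lines of Python. This is only a minimal illustration, not how MySQL stores anything; the `Node` class, the `insert` and `search` functions, and the sample keys are all hypothetical names chosen for this example.

```python
class Node:
    def __init__(self, key):
        self.key = key
        self.left = None
        self.right = None


def insert(root, key):
    """Insert key into the BST rooted at root and return the root."""
    if root is None:
        return Node(key)
    if key < root.key:
        root.left = insert(root.left, key)    # smaller keys go left
    elif key > root.key:
        root.right = insert(root.right, key)  # larger keys go right
    return root  # duplicate keys are ignored


def search(root, key):
    """Return how many nodes were visited to find key, or None if absent."""
    visited = 0
    node = root
    while node is not None:
        visited += 1
        if key == node.key:
            return visited
        node = node.left if key < node.key else node.right
    return None


root = None
for k in [30, 20, 40, 10, 25, 35, 50]:  # hypothetical sample keys
    root = insert(root, k)

print(search(root, 25))  # path: 30, then 20, then 25
print(search(root, 99))  # key is not in the tree
```

Each comparison discards one whole subtree, which is why the number of visited nodes is bounded by the height of the tree.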

Insert image description here

3.3. Balanced binary tree

The search operation on a binary search tree shows that search efficiency depends on the height of the tree: the lower the tree, the faster the search.

For example, the following binary tree is composed entirely of right subtrees:

Insert image description here

You can see that this binary search tree has actually degenerated into a structure similar to a linked list. To find 60, we have to traverse all the way to the last node. The search time complexity is O(n), so the efficiency is low.
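To make the degeneration concrete, the sketch below builds two binary search trees from the same six keys, once inserted in ascending order and once in a mixed order, and compares their heights. The keys are hypothetical (chosen so that 60 is the last one, as in the example), and `Node`, `insert`, `build`, and `height` are illustrative names.

```python
class Node:
    def __init__(self, key):
        self.key = key
        self.left = None
        self.right = None


def insert(root, key):
    """Iterative BST insert (avoids deep recursion on a degenerate tree)."""
    if root is None:
        return Node(key)
    node = root
    while True:
        if key < node.key:
            if node.left is None:
                node.left = Node(key)
                return root
            node = node.left
        else:
            if node.right is None:
                node.right = Node(key)
                return root
            node = node.right


def build(keys):
    root = None
    for key in keys:
        root = insert(root, key)
    return root


def height(node):
    """Number of levels in the tree (0 for an empty tree)."""
    if node is None:
        return 0
    return 1 + max(height(node.left), height(node.right))


sorted_keys = [10, 20, 30, 40, 50, 60]  # inserted in ascending order
mixed_keys = [30, 20, 50, 10, 40, 60]   # same keys, different insert order

print(height(build(sorted_keys)))  # every node only has a right child
print(height(build(mixed_keys)))
```

With ascending inserts the height equals the number of keys, so a lookup is a linear scan; the mixed order gives a tree of height 3 for the same data.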

So what is a balanced binary tree (AVL)?

A balanced binary tree is a binary search tree in which the height difference between the left and right subtrees of every node (the balance factor, BF) does not exceed 1, and the left and right subtrees are themselves balanced. A lookup in a balanced binary search tree performs close to binary search, with time complexity O(log n), so in the worst case it is faster than an ordinary binary search tree.

Insert image description here

So the question is: since a balanced binary tree is efficient to search, why not use it instead of B+Tree?

  • The search performance of a balanced binary tree is high, but not the highest possible. The best performance requires an optimal binary tree, but building and maintaining one takes many operations, so in practice a balanced binary tree is usually good enough.
  • Lookups in a balanced binary tree are fast, but keeping the tree balanced is expensive: after an insert, update, or delete, one or more left and right rotations are usually needed to restore balance.

Interview question: what problems does a balanced binary tree have for a database?

Because each node of a binary tree has at most two child nodes, the height of the tree grows quickly as the number of nodes increases. For example, with 1,000 nodes the tree is about 9 to 10 levels high. A database is persistent, so the data must be read from disk. A typical mechanical disk performs roughly 100 IOs per second, so one IO takes about 0.01 seconds, and a lookup among 1,000 nodes takes about 0.1 seconds. What about 10,000 nodes, or 100,000? Therefore, to reduce the height of the tree, databases use the B+Tree data structure.

Before learning B+Tree, let's first understand what a B-Tree is.
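The arithmetic above can be checked with a small calculation. The sketch below assumes a perfectly balanced tree, one IO per level, and the 0.01 seconds per IO quoted in the text; the fan-out of 100 used for comparison is a hypothetical value, and `min_height` is an illustrative helper.

```python
IO_SECONDS = 0.01  # from the text: roughly 100 IOs per second


def min_height(n_keys, fanout):
    """Smallest number of levels needed to hold n_keys keys when every
    node has `fanout` children and stores fanout - 1 keys."""
    levels, capacity = 0, 0
    while capacity < n_keys:
        levels += 1
        capacity = fanout ** levels - 1  # keys in a full tree of that height
    return levels


for n in (1_000, 10_000, 100_000):
    binary = min_height(n, 2)
    wide = min_height(n, 100)  # fan-out of 100 is a hypothetical value
    print(n, binary, round(binary * IO_SECONDS, 2), wide, round(wide * IO_SECONDS, 2))
```

Under these assumptions a binary tree needs 10 levels for 1,000 keys and 17 levels for 100,000 keys, while a tree with 100 children per node needs only 3 levels for 100,000 keys.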

3.4. B-Tree

In a binary tree, each node has one data item (the node's value, or key) and at most 2 child nodes. If each node is allowed to hold more data items and more child nodes, the result is a multiway tree. If that tree is also balanced, it is called a B-Tree (read "B tree", not "B minus tree"). The main characteristics of a B-Tree are as follows:

  1. Each node of a B-Tree stores multiple elements, and each internal node has multiple branches;
  2. The elements in a node contain both key values and data, and the key values within a node are arranged from small to large. In other words, every node stores data;
  3. Elements in a parent node do not appear again in its child nodes;
  4. All leaf nodes are on the same level (they have the same depth), and there are no pointers connecting leaf nodes to each other.

The figure below shows a B-Tree:

Insert image description here

Clearly, when each node has more children, the tree is shorter. For example, for the numbers 1-100, a binary tree can split them into only two ranges at a time, 1-50 and 51-100, while a B-Tree with four branches splits them into 4 ranges, 1-25, 26-50, 51-75, and 76-100, and can filter out three-quarters of the data in one step. So why choose B+Tree instead of B-Tree?

Let’s first look at this B-Tree:

Insert image description here

We know that a database index is stored on disk. When the amount of data is large, the entire index cannot be loaded into memory; instead, disk pages (each corresponding to a node of the index tree) are loaded one at a time. So we need to reduce the number of IOs. For a tree, the number of IOs for a lookup is at most the height of the tree.

For example, suppose we want to look up the target element 8 in the picture above:

First IO: load the node containing 5 and 20 into memory, compare the target element with 5 and 20, and follow the middle pointer p2 to the corresponding child node;
Second IO: load the node containing 7 and 10 into memory, compare the target element with 7 and 10, and again follow the middle pointer p2 to the corresponding child node;
Third IO: load the node containing 8 and 9 into memory, compare the target element with 8 and 9, and find that the target value exists, so the target element is found.

It can be seen that a B-Tree query performs no fewer comparisons than a binary tree, especially when there are many nodes. However, in-memory comparisons are very fast and their cost can be ignored, so as long as the tree is low and the number of IOs is small, query performance improves. This is one of the advantages of the B-Tree.
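The three steps can be replayed with a small model in which each node visit stands for one disk IO. The nodes holding 5 and 20, 7 and 10, and 8 and 9 come from the walkthrough; every other node in the tree below is a hypothetical filler, and `BTreeNode` and `btree_search` are illustrative names rather than anything from MySQL.

```python
from bisect import bisect_left


class BTreeNode:
    def __init__(self, keys, children=None):
        self.keys = keys                # sorted keys stored in this node
        self.children = children or []  # len(keys) + 1 children, [] for a leaf


def btree_search(root, target):
    """Return (found, nodes_loaded); each node load models one disk IO."""
    node, loads = root, 0
    while node is not None:
        loads += 1
        i = bisect_left(node.keys, target)  # in-memory comparison within the node
        if i < len(node.keys) and node.keys[i] == target:
            return True, loads
        node = node.children[i] if node.children else None
    return False, loads


# Only the nodes [5, 20], [7, 10] and [8, 9] are taken from the walkthrough;
# the remaining nodes are made up so that the tree is complete.
root = BTreeNode([5, 20], [
    BTreeNode([3], [BTreeNode([1, 2]), BTreeNode([4])]),
    BTreeNode([7, 10], [BTreeNode([6]), BTreeNode([8, 9]), BTreeNode([12, 15])]),
    BTreeNode([30], [BTreeNode([25]), BTreeNode([40])]),
])

print(btree_search(root, 8))   # found after loading three nodes
print(btree_search(root, 20))  # found in the root itself
print(btree_search(root, 11))  # reaches a leaf without finding the key
```

Because a B-Tree keeps data in every node, a key that sits in the root is found after a single load, while a key in a leaf needs one load per level.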

Let’s take a look at B+Tree

3.5. B+Tree

B+ trees, like binary trees and balanced binary trees, are classic data structures.

The B+ tree evolved from the B-Tree and the indexed sequential access method (ISAM); in practice, the plain B-Tree is now almost never used.

The definition of B+Tree can be found in many data structure books and is very complicated. Let’s briefly outline its definition:

B+Tree is a modified form of B-Tree. The leaf nodes on B+Tree store keywords and the addresses of corresponding records, and the layers above the leaf nodes are used as indexes.

A B+Tree of order m is defined as follows:

  • Each node can have up to m elements;
  • Except for the root node, each node has at least (m/2) elements;
  • If the root node is not a leaf node, then it has at least 2 child nodes;
  • All leaf nodes are on the same level;
  • A non-leaf node with k children has (k-1) elements, in ascending order;
  • The elements in the left subtree of an element are all smaller than it, and the elements in the right subtree are all greater than or equal to it;
  • Non-leaf nodes only store keywords and indexes pointing to the next child node, and records are only stored in leaf nodes;
  • Adjacent leaf nodes are connected with pointers.

The concept is complicated, but we can summarize the differences between B+Tree and B-Tree:

  1. B+Tree non-leaf nodes do not store data;
  2. B+Tree leaf nodes are connected with bidirectional pointers, so the leaf level forms an ordered doubly linked list.

Look at this B+Tree:

Insert image description here

If you want to find the target element 8, you again need 3 IO operations (the same as with the B-Tree). That is the case for an equality query. What about a range query? For example, suppose we want the data between 8 and 25.

  • The first step is to query the element with the minimum value of 8. At this time, 3 IOs are performed;
  • Because each leaf node is connected through a pointer and is an ordered linked list, you only need to traverse the linked list composed of leaf nodes until you find the maximum value of 25.
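The two steps above can be sketched as follows. This is a simplified model: the root-to-leaf descent is replaced by a binary search over the first key of each leaf, only the forward pointer of the leaf list is modeled, and the leaf contents are hypothetical. `Leaf`, `build_leaves`, and `range_scan` are illustrative names.

```python
from bisect import bisect_right


class Leaf:
    def __init__(self, keys):
        self.keys = keys  # sorted keys (row data omitted)
        self.next = None  # pointer to the right neighbour leaf


def build_leaves(groups):
    """Create leaves and link each one to its right neighbour."""
    leaves = [Leaf(g) for g in groups]
    for left, right in zip(leaves, leaves[1:]):
        left.next = right
    return leaves


def range_scan(leaves, low, high):
    """Return all keys k with low <= k <= high."""
    # Stand-in for the descent through non-leaf nodes:
    # pick the last leaf whose first key is <= low.
    first_keys = [leaf.keys[0] for leaf in leaves]
    start = max(bisect_right(first_keys, low) - 1, 0)
    leaf, result = leaves[start], []
    while leaf is not None:
        for k in leaf.keys:
            if k > high:
                return result  # passed the upper bound, stop scanning
            if k >= low:
                result.append(k)
        leaf = leaf.next       # follow the leaf-level pointer
    return result


leaves = build_leaves([[2, 5], [7, 8], [9, 10], [15, 20], [25, 30]])  # hypothetical
print(range_scan(leaves, 8, 25))
```

After the starting leaf is located, the scan never returns to the upper levels of the tree; it only walks sideways along the linked leaves.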

How does a range search in a B-Tree compare?

Because every node of a B-Tree holds data, once 8 has been found in a leaf node and the maximum value is not in that node, the search has to go back up through the parent nodes and continue; that is, an in-order traversal is required.

There are three ways to traverse a tree: preorder, inorder, and postorder.

Insert image description here

So, why does MySQL use B+Tree instead of B-Tree for indexes?

In general, the main factor affecting MySQL lookup performance is the number of disk IOs. A B-Tree stores data in every node, leaf or not, so each non-leaf node has room for fewer pointers. To hold a large amount of data, the tree must grow taller, which means more IO operations and lower query performance.
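A rough capacity estimate shows the size of the effect. The numbers below are assumptions for illustration only: a 16 KB page (the InnoDB default), an 8-byte key, a 6-byte child pointer, rows of about 1 KB, and no page overhead. `bplus_rows` and `btree_rows` are hypothetical helpers.

```python
PAGE = 16 * 1024  # page size in bytes (InnoDB default)
KEY = 8           # assumed key size, e.g. a BIGINT primary key
PTR = 6           # assumed size of a child-page pointer
ROW = 1024        # assumed size of one full row


def bplus_rows(height):
    """Rows a B+Tree of the given height can hold: non-leaf pages store
    only key + pointer pairs, leaf pages store full rows."""
    fanout = PAGE // (KEY + PTR)
    rows_per_leaf = PAGE // ROW
    return fanout ** (height - 1) * rows_per_leaf


def btree_rows(height):
    """Rows a B-Tree of the given height can hold when every node
    stores full rows next to its keys and pointers."""
    entries = PAGE // (KEY + PTR + ROW)
    fanout = entries + 1
    nodes = sum(fanout ** level for level in range(height))
    return nodes * entries


print(bplus_rows(3))  # three levels, data only in the leaves
print(btree_rows(3))  # three levels, data in every node
```

Under these assumptions a three-level B+Tree addresses about 21.9 million rows, while a three-level B-Tree with rows in every node holds only a few thousand, so the B-Tree would need many more levels (and IOs) for the same table.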

3.6. Hash index

The main differences between Hash index and B+Tree are as follows:

  • When querying a single piece of data, the time complexity of Hash index query is O(1), and the time complexity of B+Tree index is O(logN);
  • A Hash index is only suitable for equality queries; it cannot be used for range queries because the hashed entries are not stored in key order;
  • A Hash index cannot be used to complete sorting, again because the entries are not stored in key order;
  • Hash indexes do not support the leftmost matching rule of joint indexes;
  • If there are a large number of duplicate key values, the efficiency of the hash index will be very low because there are hash collisions. The larger the amount of data, the higher the probability of hash collisions.
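The equality versus range difference can be illustrated with Python's built-in hash table. In this sketch a `dict` stands in for a hash index and a sorted list stands in for the ordered keys at the B+Tree leaf level; the sample keys and the function names are hypothetical.

```python
from bisect import bisect_left, bisect_right

rows = {17: "q", 3: "c", 42: "z", 8: "h", 25: "y"}  # hypothetical key -> row

hash_index = dict(rows)      # hash table: direct lookup by exact key
ordered_keys = sorted(rows)  # stand-in for keys in B+Tree leaf order


def hash_equal(key):
    """Equality lookup: a single hash probe."""
    return hash_index.get(key)


def hash_range(low, high):
    """Range lookup on a hash table: there is no key order to exploit,
    so every entry has to be checked."""
    return sorted(k for k in hash_index if low <= k <= high)


def ordered_range(low, high):
    """Range lookup on ordered keys: two binary searches and a slice."""
    return ordered_keys[bisect_left(ordered_keys, low):bisect_right(ordered_keys, high)]


print(hash_equal(8))
print(hash_range(8, 25))
print(ordered_range(8, 25))
```

Both range functions return the same keys, but the hash version has to inspect every entry, whereas the ordered version touches only the matching slice.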

3.7. Full text index

  • It is mainly used to find keywords in text rather than to compare values directly with the values in the index.
  • The fulltext index is very different from other indexes. It works more like a search engine than like simple parameter matching in a where clause.
  • The fulltext index is used with the match against operation instead of the usual where clause with like.
  • It can be defined in create table, alter table, and create index, but full-text indexes can currently be created only on char, varchar, and text columns.
  • It is worth mentioning that when the amount of data is large, loading the data into a table without a full-text index and then using CREATE INDEX to create the fulltext index is much faster than creating the fulltext index first and then writing the data into the table.

4. Physical storage perspective

4.1. Clustered index

Indexes in InnoDB are organized as B+ trees. Earlier we said that the leaf nodes of a B+ tree are used to store data, but what data is stored there?

The index keys obviously have to be stored, because the B+ tree exists precisely to retrieve data quickly; without the keys there would be nothing to search on. But the rows in the table are the data we actually need. Indexes are only auxiliary data, and a table does not even need a user-defined index. So how does InnoDB organize the data?

InnoDB uses a clustered index: it builds a B+ tree on the table's primary key and stores the row data of the entire table in the leaf nodes of that B+Tree. That is, the leaf nodes store "the primary key and the rest of the row". In other words, the index is the data and the data is the index. Because the clustered index is built on the table's primary key, each table can have only one clustered index.

Example: Take a table like the following:

Insert image description here

The B+Tree structure is like this:

Insert image description here

The leaf node of the clustered index is the data page. In other words, the data page stores a complete record of each row. Therefore, the advantages of clustered index are:

  1. The complete row of data can be obtained through the clustered index;
  2. Sort searches and range searches for primary keys are very fast.

What if we don't define a primary key?

InnoDB will use the first unique index whose columns are all NOT NULL. If there is no such index, InnoDB generates a hidden row ID column and builds the clustered index on it.

4.2. Auxiliary index (secondary index)

The clustered index introduced above can only work when the search condition is the primary key value, because the data in B+Tree is sorted according to the primary key. So what if we want to use other columns as search conditions?

We generally create additional indexes, which are called auxiliary indexes (secondary indexes, or non-clustered indexes). Their leaf nodes store "the primary key and the value of the indexed columns".

  • For secondary indexes, leaf nodes do not contain all data for a row record.
  • In addition to the key value, the index row in each leaf node also contains a bookmark.
  • This bookmark is used to tell the InnoDB storage engine where to find the row data corresponding to the index.
  • Therefore, the bookmark of the auxiliary index of the InnoDB storage engine is the clustered index key of the corresponding row of data.

Create an index on name, and the leaf node stores data as follows:

Insert image description here

4.3. Back to table

Since the nodes of an auxiliary index do not store all the data, what happens when we want to query a complete row through name?

  • The existence of auxiliary indexes does not affect how data is organized in the clustered index, so a table can have multiple auxiliary indexes.
  • When looking up data through a secondary index, the InnoDB storage engine traverses the secondary index to the leaf level, obtains the primary key stored there, and then uses the primary key index (clustered index) to find the complete row record. This process is called going back to the table.
  • That is to say, querying a complete user record by the value of an auxiliary index requires searching 2 B+Trees, one auxiliary index and one clustered index, as follows:

Insert image description here
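The two-step lookup can be modeled with two dictionaries, each standing in for one B+Tree. The sample rows, the `balance` column, and the function name are hypothetical, and the sketch assumes that names are unique.

```python
# Clustered index: primary key -> full row (hypothetical sample rows).
clustered = {
    1: {"id": 1, "name": "A", "balance": 100},
    2: {"id": 2, "name": "H", "balance": 250},
    3: {"id": 3, "name": "M", "balance": 75},
}

# Secondary index on name: index key -> primary key only.
idx_name = {row["name"]: pk for pk, row in clustered.items()}


def find_by_name(name):
    """Return (row, number_of_index_searches)."""
    pk = idx_name.get(name)  # search 1: the secondary index yields the primary key
    if pk is None:
        return None, 1
    return clustered[pk], 2  # search 2: the clustered index yields the full row


print(find_by_name("H"))
print(find_by_name("X"))
```

The second search is the "back to table" step: the secondary index alone knows only the name and the primary key, not the other columns.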

4.4. Covering index

Obviously, going back to the table requires an additional B+Tree search, which increases the query time.

So under what circumstances is it unnecessary to go back to the table?

The leaf nodes of an auxiliary index store the primary key and the value of the indexed columns. If the query only needs values that are already in the leaf node, there is no need to go back to the table. This situation is called a covering index, or index coverage.

select id,name from account where name='H';

As in the SQL above, only the id and name fields are queried, so there is no need to go back to the table.

Strictly speaking, a covering index is not an index structure. It is better understood as an optimization technique: for example, a joint index can be created so that it contains every column a query needs.
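Whether a query is covered can be expressed as a simple set check: the selected columns must all be available in the secondary index, which also carries the primary key. The helper below is a hypothetical illustration of that rule, with `id` assumed to be the primary key and `balance` an invented extra column.

```python
def needs_table_lookup(selected_columns, index_columns, primary_key=("id",)):
    """True if the query must go back to the clustered index because the
    secondary index does not contain every selected column."""
    available = set(index_columns) | set(primary_key)
    return not set(selected_columns) <= available


# select id, name ... using index(name): covered
print(needs_table_lookup(["id", "name"], ["name"]))

# select id, name, balance ... using index(name): not covered
print(needs_table_lookup(["id", "name", "balance"], ["name"]))

# same query with a joint index(name, balance): covered again
print(needs_table_lookup(["id", "name", "balance"], ["name", "balance"]))
```

The last call shows why a joint index is a common way to obtain index coverage.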

5. Logical perspective classification

5.1. Primary key index

For an index built on the primary key field, the values in the index column must be unique, and NULL values are not allowed.

For example: order table (order number, name, price, etc.), the order number is uniquely identified and can be used as the primary key.

5.2. Unique index

An index built on a UNIQUE field is a unique index. A table can have multiple unique indexes, and NULL is allowed in the indexed column. A unique index prevents duplicate values in a column of the same table.

For example: an ID card number.

5.3. Ordinary index

Primary key indexes and unique indexes place a requirement on the field: it must be a primary key or a UNIQUE field.

Indexes built on ordinary fields are called ordinary indexes. They require the field to be neither the primary key nor UNIQUE.

Both the INDEX and KEY keywords can be used to define an ordinary index.

5.4. Prefix index

A prefix index refers to an index established on the first few characters of a character type field or the first few bytes of a binary type field, rather than building an index on the entire field.

For example, you can index the first 5 characters of the name field in the table above:

Insert image description here

A prefix index can be built on columns of type char, varchar, text, binary, and varbinary. It can greatly reduce the storage space occupied by the index, which can in turn improve the index's query efficiency.
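The idea can be sketched with a dictionary keyed on the first 5 characters of name, as in the example above. The sample rows and names are hypothetical. Because different names can share a prefix, each candidate is checked against the full value stored in the row.

```python
from collections import defaultdict

PREFIX_LEN = 5  # from the example: index the first 5 characters of name

rows = {1: "zhangsan", 2: "zhangwei", 3: "wangwu", 4: "lisi"}  # hypothetical id -> name

# Prefix index: first PREFIX_LEN characters -> list of primary keys.
prefix_index = defaultdict(list)
for pk, name in rows.items():
    prefix_index[name[:PREFIX_LEN]].append(pk)


def find_ids(name):
    """Look up rows by name through the prefix index."""
    candidates = prefix_index.get(name[:PREFIX_LEN], [])
    # The index only knows the prefix, so the full value is verified per row.
    return [pk for pk in candidates if rows[pk] == name]


print(dict(prefix_index))
print(find_ids("zhangwei"))
```

The index entries are shorter than the full names, which is where the space saving comes from; the price is the extra verification step when several rows share a prefix.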

6. Practical usage perspective

The above indexes are all built on one column, so they can be called single-column indexes. In actual use, we often build indexes on two or even more columns.

6.1. Joint index

Combining multiple columns on a table for indexing is called a joint index or composite index.

For example: index(a,b) combines the two columns a and b to form an index.

A joint index creates only one B+Tree. Creating separate indexes on multiple columns creates one B+Tree per index: for example, index(a) and index(b) build one index on each of the two columns a and b, giving 2 B+Trees.

index(a,b) implies two things about how the index is built:

  1. First sort the records according to column a;
  2. When the records in column a are the same, column b is used for sorting.

Therefore, for a jointly indexed B+Tree, the nodes are as follows:

Insert image description here

In the figure above, you can see that the values of a are in order, that is, 1,1,2,2,3,3, while the values of b are 1,2,1,4,1,2, which is not in order.

We can also see that when the a values are equal, the b values are in order, but this order is only relative. This is why the leftmost matching principle stops at a range query and the remaining fields cannot use the index. For example, with a = 1 and b = 2, both a and b can use the index, because once the value of a is fixed, b is ordered within it. With a > 1 and b = 2, a can use the index but b cannot, because a covers a range of values and b is unordered within that range.
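Python tuples compare column by column, so sorting (a, b) pairs reproduces the ordering described above. The pairs are the values quoted in the text; the variable names are illustrative.

```python
# (a, b) pairs from the example, in arbitrary insert order.
entries = [(2, 4), (1, 2), (3, 1), (1, 1), (3, 2), (2, 1)]

# Tuple comparison sorts by a first, then by b when a is equal,
# which is the order index(a, b) keeps its entries in.
index = sorted(entries)

a_values = [a for a, _ in index]
b_values = [b for _, b in index]
print(a_values)  # globally ordered
print(b_values)  # not ordered as a whole

# With a fixed (a = 1), the matching b values are ordered.
b_when_a_is_1 = [b for a, b in index if a == 1]
print(b_when_a_is_1)

# With a range on a (a > 1), the matching b values are not ordered.
b_when_a_gt_1 = [b for a, b in index if a > 1]
print(b_when_a_gt_1)
```

The last list mixes b values from several groups of a, which is why a condition on b cannot narrow the search once a is a range.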

6.2. Leftmost matching principle

The matching behavior described above is called the leftmost matching principle:

Leftmost first: any run of consecutive index columns that starts from the leftmost column can be matched. Matching stops at the first column that is used in a range query (>, <, between, like).

Suppose we create a joint index index(a,b,c). In terms of which queries can use it, this is equivalent to having three indexes: (a), (a,b), and (a,b,c).

1. Full value matching query

SELECT * FROM users WHERE a=1 AND b=3 AND c=1;

SELECT * FROM users WHERE b=3 AND a=1 AND c=1;

SELECT * FROM users WHERE c=1 AND a=1 AND b=3;

Use the execution plan to check whether the index is used:

EXPLAIN SELECT * FROM users WHERE c=1 AND a=1 AND b=3;
...

After testing, it was found that indexes are used:

Insert image description here

Doesn't the leftmost principle apply here? Why can the index still be used when c is written first?

This is because MySQL has a query optimizer, which reorders the conditions automatically. It determines the most efficient order in which to evaluate the SQL statement and then generates the actual execution plan.

2. Match the column on the left

The case of equivalent query:

① Follow the leftmost principle and use the index.

select * from users where a = '1';

select * from users where a = '1' and b = '2';

select * from users where a = '1' and b = '2' and c = '3';

② Does not follow the leftmost principle, does not use indexes, and scans the entire table.

select * from users where  b = '2'; 

select * from users where  c = '3';

select * from users where  b = '1' and c = '3'; 

③ If the matched columns are not contiguous, only the a part of the index is used.

select * from users where a = '1' and c = '3';

3. Matching range value

In the case of range query:

① Perform a range query on the leftmost column, using an index.

select * from users where  a > 1 and a < 3;

② When several columns are range-searched at the same time, the B+Tree index is used only for the range on the leftmost index column; that is, only a uses the index. Within the range 1 < a < 3, the rows can only be filtered one by one against the condition b > 1.

select * from users where  a > 1 and a < 3 and b > 1;

③ The column on the left matches equal values, and the range matches another column, using an index.

select * from users where  a = 1 and b > 3;
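The cases in this section can be summarized by a small function that walks the index columns from left to right. It is a simplified model of the rule as stated here, not of the real MySQL optimizer, and the function name and parameters are hypothetical.

```python
def usable_index_columns(index_columns, equality=(), ranges=()):
    """Return the leading index columns that can narrow the B+Tree search.

    equality: columns compared with =
    ranges:   columns compared with a range condition such as > or <
    """
    used = []
    for col in index_columns:
        if col in equality:
            used.append(col)  # equality: keep matching the next column
        elif col in ranges:
            used.append(col)  # a range column is used, then matching stops
            break
        else:
            break             # gap: a column without a condition ends matching
    return used


INDEX = ("a", "b", "c")  # the joint index index(a,b,c)

print(usable_index_columns(INDEX, equality={"c", "a", "b"}))     # order in WHERE is irrelevant
print(usable_index_columns(INDEX, equality={"b", "c"}))          # leftmost column missing
print(usable_index_columns(INDEX, equality={"a", "c"}))          # b is skipped
print(usable_index_columns(INDEX, ranges={"a", "b"}))            # range on a stops matching
print(usable_index_columns(INDEX, equality={"a"}, ranges={"b"})) # a = 1 and b > 3
```

Conditions on columns that the function leaves out can still be applied as filters; they just cannot be used to position the search inside the index.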

7. Other indexes

7.1. Adaptive hash index

In addition to the indexes described earlier, the InnoDB storage engine also has an adaptive hash index. We know that the number of lookups in a B+ tree depends on its height. In a production environment, the height of a B+ tree is generally 3 to 4 levels, so a query needs 3 to 4 IOs.

Therefore, the InnoDB storage engine monitors lookups on the indexes of its tables. If it observes that an index is accessed frequently, it treats the data as hot and builds a hash index for it internally, called the Adaptive Hash Index (AHI). Once it exists, the next lookup on that index computes the record's address directly with the hash function and finds the data in one step, which is considerably more efficient than walking three or four nodes of the B+Tree index each time.

The hash function used by the InnoDB storage engine is division hashing, and collisions are resolved with a linked list. Note that the adaptive hash index is created and used only by the database itself; we cannot build one manually, although it can be switched on or off with the innodb_adaptive_hash_index option.

You can see the current usage of the adaptive hash index through the command show engine innodb status\G.

8. Summary

Although indexes can help us improve query efficiency, that does not mean they are always beneficial or always needed, so we need to understand their advantages and disadvantages.

  • Advantages: they can greatly speed up data retrieval
  • Disadvantages:
    • Time: creating and maintaining indexes takes time
    • Space: indexes occupy physical storage space

Under what circumstances should you create an index?

  1. The primary key automatically creates a unique index
  2. Fields that are frequently used as query conditions
  3. Related fields in multi-table related queries
  4. Fields to sort on
  5. For frequently searched fields, joint indexes can be created so that they act as covering indexes.
  6. Statistics or grouping fields in queries

When is it unnecessary to create an index?

  1. Too few table records
  2. Fields that are frequently inserted, updated, or deleted
  3. Fields that are not used frequently in where conditions

Origin blog.csdn.net/Libigtong/article/details/133858477