A large collection of MySQL index knowledge

Introduction

This article introduces MySQL index knowledge from the following aspects:

1. MySQL engine comparison: the advantages and disadvantages of the MyISAM, InnoDB, and Memory engines;

2. InnoDB index principle: why the B+ tree is used, and how much data a B+ tree can store;

3. Index classification: clustered and non-clustered indexes, composite indexes, unique indexes, etc., and how these indexes are stored and queried in the B+ tree;

4. Index use: how to use indexes and the common scenarios where an index fails, so you can avoid the pitfalls.

MySQL engine comparison

MySQL currently supports more than a dozen engines. You can run the command show engines; to see which engines your own database supports. The default engine in MySQL is InnoDB. Next, we compare the different engines.

Let's first look at the engines supported by our database:
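If you want to reproduce this check on your own instance, the statements below list the available engines and the current default (the exact output depends on your MySQL version and build):

-- list every engine compiled into this build, with its support status and a short comment
show engines;

-- the default engine used for new tables (InnoDB on modern versions)
select @@default_storage_engine;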

In practice, the engines you encounter most often are MyISAM, InnoDB, and Memory. Next, we summarize the advantages and disadvantages of these three engines.

InnoDB index principle

The B+ tree used by MySQL indexes evolved from the binary search tree, the balanced binary tree, and the B-tree. There are many articles online covering this evolution, which you can search for yourself. To put it simply: the B+ tree stores as much data as possible while staying balanced. Compared with the B-tree, the B+ tree stores data only in leaf nodes, so for the same tree order, a B+ tree can hold more data.

First, the structure of the B+ tree is as follows. Non-leaf nodes store key values, and only leaf nodes store data.

The amount of data that a B+ tree can store

Let's look at how much data a B+ tree with a height of 3 can store. The operating system reads from disk in units of pages, 4 KB by default (a historical legacy; on modern hardware it can be raised to 16 KB or higher), and MySQL also reads data page by page, with a default page size of 16 KB. Take the bigint type as an example: a bigint key is 8 bytes, and besides the key value, each entry in a non-leaf node also holds a pointer, which takes about 6 bytes. The number of key/pointer pairs one page can hold is therefore 16 KB / (8 bytes + 6 bytes) = 16 * 1024 / 14 ≈ 1170. In a leaf node, a single data row is usually assumed to be about 1 KB, so one leaf page holds about 16 rows. The resulting capacities are worked out below: with a tree height of 2 the capacity is about 1170 * 16 rows, and with a height of 3 it reaches tens of millions of rows.
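Putting the numbers together under these assumptions (16 KB pages, 8-byte bigint keys, 6-byte pointers, roughly 1 KB per row):

fan-out of a non-leaf page: 16 * 1024 / (8 + 6) ≈ 1170 entries
rows per leaf page:         16 KB / 1 KB = 16 rows
tree height 2:              1170 * 16 ≈ 18,720 rows
tree height 3:              1170 * 1170 * 16 ≈ 21,902,400 rows (tens of millions)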

B+ tree query and insertion strategy

Index classification

In MySQL, indexes can be divided into different types along different dimensions; we introduce them in detail next.

Clustered and nonclustered indexes

In the section on how much data a B+ tree can store, we mentioned that each leaf page can hold about 16 rows, with each row capped at roughly 1 KB. That applies to the data rows stored in the leaf nodes. Depending on the index, what the leaf nodes store falls into two cases: 1. the index is the primary key, and the leaf nodes store complete data rows; 2. the index is a non-primary-key (secondary) index, and the leaf nodes store the primary key value. When the complete row is needed, the query has to go back to the table, i.e. look the primary key up in the clustered index to fetch the row. The official MySQL explanation of the clustered index:

The InnoDB term for a primary key index. InnoDB table storage is organized based on the values of the primary key columns, to speed up queries and sorts involving the primary key columns. For best performance, choose the primary key columns carefully based on the most performance-critical queries. Because modifying the columns of the clustered index is an expensive operation, choose primary columns that are rarely or never updated.

eg: Suppose we build a user table with an auto-increment id as the primary key and an age field. In the figure below, the left side is the clustered index, whose leaf nodes store entire rows; the right side is the non-clustered index built on age, whose leaf nodes store the auto-increment id. So when fetching the user's name and other information by age, MySQL must use the id to look up the full record in the clustered index (back to the table: this lookup brings extra disk overhead, so it should be avoided where possible).
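A minimal sketch of the table described above (column names other than id and age are assumptions for illustration):

create table user (
    id   bigint not null auto_increment,  -- primary key: the clustered index, leaf nodes hold the full row
    name varchar(64),
    age  int,
    primary key (id),
    key idx_age (age)                     -- secondary index: leaf nodes hold only (age, id)
) engine = InnoDB;

-- the result needs name, so after locating entries in idx_age MySQL must go back
-- to the clustered index by id to fetch the full row (back to the table)
select id, name from user where age = 10;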

Unique index and ordinary index

Values in an ordinary index can repeat; a unique index, like the primary key, must be unique within the table and cannot repeat.

Let's first walk through the query process for a unique index and an ordinary index:

  • Unique index, eg: select * from user where id = 110; binary search quickly locates the record with id = 110, the search stops, and the data is returned directly;

  • Ordinary index, eg: select * from user where age = 50; binary search quickly locates the first record with age = 50, but the scan must continue through the following records until it hits one where age is no longer 50. The MySQL page size is 16 KB, so when there are many matching rows the scan has to keep reading pages from disk one by one. This is also why it is not recommended to create an index on a column with poorly differentiated (low-selectivity) data. The two cases are shown side by side below.
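Side by side, using the user table sketched earlier:

-- unique lookup on the primary key: binary search inside the page, stop at the first hit
select * from user where id = 110;

-- ordinary index on age: locate the first age = 50, then keep scanning until age is no longer 50
select * from user where age = 50;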

Now that we understand how queries use the index, we need to look at how the index is updated when data is inserted.

Let's first understand two concepts: the change buffer and merge. The change buffer caches pending update operations; the next time the affected data page is loaded into memory, a merge operation applies the cached updates to the page, ensuring the data is correct.

  • Unique index: because a unique index has to check for duplicates, the relevant data page must be loaded into memory anyway, so the insert can be applied directly in memory;

  • Ordinary index: an ordinary index does not need to check for duplicates, so MySQL first checks whether the data page is already in memory; if it is, the insert is applied directly, otherwise the operation is recorded in the change buffer and waits to be merged.

The timing of the merge

  1. When the data page is accessed (read into memory)

  2. A background thread merges periodically

  3. When the database shuts down gracefully

Therefore, whether the change buffer is worth using depends on whether the data will be read immediately after being inserted. If it is read right away, maintaining the change buffer costs more than it saves and becomes a side effect rather than an optimization.
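The change buffer can be inspected and tuned through two InnoDB variables; a quick sketch (defaults vary with the MySQL version):

-- which operations are buffered: all / none / inserts / deletes / changes / purges
show variables like 'innodb_change_buffering';

-- upper limit of the change buffer, as a percentage of the buffer pool (25 by default)
show variables like 'innodb_change_buffer_max_size';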

Index use

Avoid returning to the table

Look at the figure below. When querying users by age, the leaf nodes of the non-clustered index already store the primary key (the user id), so if the query only needs the user id, the secondary index alone can answer it. For a query like select id, name from user where age = 10, however, the result also needs the name field, so the query must go back to the table. A sketch of both cases follows.
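A sketch of the two cases, still using the user table and the idx_age index assumed earlier:

-- covered by idx_age: the secondary index already contains (age, id), no back-to-table lookup,
-- and EXPLAIN shows "Using index" in the Extra column
select id from user where age = 10;

-- also needs name, so every matching index entry triggers a lookup in the clustered index
select id, name from user where age = 10;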

Index invalidation scenarios

1. A like query where the wildcard is at the front

2. A function is applied to the index column (eg: an index is created on the create_time field, but the query uses where DATE_FORMAT(create_time, '%Y-%m-%d') = '2022-02-12')

3. Arithmetic on the index column causes the index to fail (eg: where record_id + 1 = 2)

4. Type conversion causes the index to fail (eg: where name = 888, while name is a varchar column in the database, so an implicit conversion is performed)

5. The leftmost prefix rule is not followed (eg: a composite index on (a, b, c) is set, but the query uses where b = 1 and c = 2)

6. Index columns after a range condition do not take effect (eg: with a composite index on (a, b, c), the query where a = 1 and b > 3 and c = 3 cannot use the index columns after b)

7. Not-equal comparisons (<>, !=) cause the index to fail

8. is not null is used on the index column

9. or is used (especially when a column in the or condition has no index)
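A few of these scenarios written out, assuming a hypothetical table t with a composite index on (a, b, c), an index on create_time, and an indexed varchar column name:

-- 1. leading wildcard: the index on name cannot be used
select * from t where name like '%abc';

-- 2. function on the indexed column: the index on create_time is not used
select * from t where date_format(create_time, '%Y-%m-%d') = '2022-02-12';

-- 4. implicit type conversion: name is varchar, comparing it to a number converts every row
select * from t where name = 888;

-- 5. leftmost prefix violated: the composite index (a, b, c) is skipped entirely
select * from t where b = 1 and c = 2;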

Index pushdown

We can first look at the query process from the client to the server:

  1. The client sends a query to the server;

  2. The server first checks the query cache (from 8.0 onward the query cache is no longer supported) and returns directly on a hit, otherwise it continues to the next step;

  3. The server parses the SQL statement, then the preprocessor and optimizer generate the corresponding execution plan;

  4. According to the execution plan, the storage engine API is called to perform the query;

  5. The data is returned to the client.

The "pushdown" in index pushdown means that some of the filtering work the upper layer (the service layer) would normally be responsible for is pushed down to the lower layer (the engine layer) to be handled there.

Let's look at an example. Suppose there is a composite index on (r1, r2, r3).

Query SQL: select * from user where r1 = 1 and r2 >= 50 and r3 = 1 and r4 = 11;

When ICP (index condition pushdown) is not used, the query process is as follows:

1. The MySQL service layer parses, preprocesses, and optimizes the statement, generates a query plan, and calls the storage engine API;

2. The storage engine reads the index and filters on r1 = 1. Since r2 >= 50 is a range condition, the index columns after it cannot be used for filtering, so the engine returns all rows that satisfy r1 = 1 to the service layer;

3. The service layer filters on r2 >= 50, r3 = 1, and r4 = 11, and returns the result to the client.

When using ICP, the query process is as follows:

1. The MySQL service layer parses, preprocesses, and optimizes the statement, generates a query plan, and calls the storage engine API;

2. The storage engine reads the index, filters on r1 = 1, and additionally uses the index columns to filter out entries that do not satisfy r2 >= 50 and r3 = 1, then returns the matching rows to the service layer;

3. The service layer filters on r4 = 11 again and returns the result to the client.

We can see that with index pushdown, the storage engine hands far fewer rows to the service layer and performs fewer back-to-table lookups, reducing the overhead between the two layers. However, index pushdown only applies to secondary indexes: the leaf nodes of the clustered index already store the entire row, so there is no back-to-table work to save and no reduction in disk IO.
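Whether ICP is applied can be seen in the execution plan, and it can be switched off for comparison (a sketch, assuming the composite index on (r1, r2, r3) exists on user):

-- with ICP, EXPLAIN shows "Using index condition" in the Extra column
explain select * from user where r1 = 1 and r2 >= 50 and r3 = 1 and r4 = 11;

-- ICP is enabled by default; disabling it lets you compare the two plans
set optimizer_switch = 'index_condition_pushdown=off';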

Origin blog.csdn.net/m0_69804655/article/details/130249262