[B+ tree index] Index usage and precautions

1. Things to note about indexing

The previous article, "[B+ tree index] The structure of the index page contains the secret of fast queries", explained MySQL's goal from the perspective of index pages: to speed up queries, InnoDB organizes its index pages into a B+ tree, where non-leaf nodes hold directory entry records and leaf nodes hold user records. This article discusses the considerations for B+ tree indexes in InnoDB. (Summarized from "How MySQL Works")

The root node will not change!

First of all, be clear that the clustered index is not created by us; InnoDB creates it by default. At the beginning there is only one node, the root node, which is an empty page.

Then, when a user record is inserted, it is placed in the User Records area of this root page.

As user records continue to be inserted, the Free Space gradually shrinks until it reaches 0.

If you then insert another record, the root node is already full, so a page split occurs. A new page a is allocated and all the user records of the root node are copied into it, and another page b is allocated to hold the newly inserted record. The root node then becomes a directory entry node: its records store primary key values that map to page a and page b. And so on...

This process illustrates that the root node of the B+ tree is always the same page: its page number never changes.
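This split behavior can be sketched with a toy model (a deliberately simplified, hypothetical structure — real InnoDB pages hold far more than three records and deeper splits work differently):

```python
# Toy model of an InnoDB-style root page split: the root keeps its page
# number forever; its user records move out to newly allocated pages.
class Page:
    def __init__(self, page_no, capacity=3):
        self.page_no = page_no
        self.capacity = capacity
        self.records = []      # user records (leaf) or directory keys
        self.is_leaf = True
        self.children = {}     # min key -> child Page (directory mapping)

next_page_no = 1

def alloc_page():
    global next_page_no
    p = Page(next_page_no)
    next_page_no += 1
    return p

def insert(root, key):
    if root.is_leaf:
        if len(root.records) < root.capacity:
            root.records.append(key)
            root.records.sort()
            return
        # Root is full: copy its records to page a, put the new key in
        # page b, and turn the root into a directory entry node.
        a, b = alloc_page(), alloc_page()
        a.records = sorted(root.records)
        b.records = [key]
        root.records = []
        root.is_leaf = False
        root.children = {a.records[0]: a, b.records[0]: b}
    # (Splits below the root are omitted in this sketch.)

root = alloc_page()          # the root is allocated page number 1
for k in [10, 20, 30, 40]:   # the fourth insert triggers the split
    insert(root, k)

print(root.page_no)          # 1 -- still the same page after the split
print(root.is_leaf)          # False -- it is now a directory node
```

After the split the root is a directory node mapping to pages 2 and 3, but its own page number is unchanged.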

Uniqueness of directory entry records in internal nodes

That is to say, a secondary index column does not necessarily carry unique or NOT NULL constraints. Secondary index records are sorted first by the value of the secondary index column; when those values are equal, they are sorted by primary key value. When inserting a record into the B+ tree, if several records share the same secondary index value, where should the new record go? This is why the primary key is included in the directory entry records: when secondary index values are equal, the insert position is found by ascending primary key order.
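This ordering can be illustrated in a few lines: a secondary index entry behaves like a (column value, primary key) pair, so even duplicate column values end up at unique, deterministic positions. The values below are made up for illustration.

```python
# Secondary index entries modeled as (key1 value, primary key) tuples.
# Python's tuple sort mirrors InnoDB's ordering: key1 first, then pk.
entries = [("b", 3), ("a", 7), ("b", 1), ("a", 2)]
entries.sort()
print(entries)  # [('a', 2), ('a', 7), ('b', 1), ('b', 3)]
```

Duplicates in key1 ('a' and 'b' each appear twice) are tie-broken by the primary key, so every entry has exactly one valid position.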

A page must accommodate at least two records

This is actually the easiest point to understand. If a page could hold only one record, what would be the point of introducing a B+ tree at all? You might as well scan the leaf nodes directly; traversing the directory entry pages would just be wasted effort.

This is also why variable-length fields limit the maximum number of bytes a column can store in the record itself, with the excess stored on overflow pages. It is also why many tables use the Dynamic row format: when a column overflows, its data is placed entirely on another page, and the record stores only the page number (address) of that page. This greatly shrinks the record, so each page can hold more records. MySQL's designers really do economize on space down to the last byte.

2. The cost of going back to the table

The concept of going back to the table: when a query uses a secondary index but not all the requested columns are present in that index, the primary key value must be taken from the secondary index record and used to fetch the full row from the clustered index. This process of returning to the clustered index is called going back to the table (a table lookup).
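The two steps can be sketched with plain dictionaries standing in for the two B+ trees (key1, the sample keys, and the row contents are made up for illustration):

```python
# Step 1: the secondary index maps key1 values to primary keys.
# Step 2: the clustered index maps primary keys to full rows.
secondary_idx = {"a": [1, 3], "b": [2]}
clustered_idx = {1: ("a", "x1"), 2: ("b", "x2"), 3: ("a", "x3")}

def query_by_key1(k):
    rows = []
    for pk in secondary_idx.get(k, []):   # scan the secondary index
        rows.append(clustered_idx[pk])    # go back to the table per pk
    return rows

print(query_by_key1("a"))  # [('a', 'x1'), ('a', 'x3')]
```

Each matching secondary index entry costs one extra lookup in the clustered index, which is exactly the cost the next paragraphs discuss.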

The more records that require table lookups, the less efficient a secondary index query becomes; for some queries a full table scan beats the secondary index. For example, suppose the records whose key1 values fall between 'a' and 'c' make up the majority of all records. Using the secondary index on key1 would produce a large number of id values that each require a table lookup, so it is better to perform a full table scan directly.

In other words, the more table lookups a query would need, the more likely the optimizer is to use a full table scan; conversely, the fewer lookups, the more likely it is to use the secondary index + table lookup approach. LIMIT caps the number of records returned, which can tip the optimizer toward the secondary index + table lookup plan.

For queries whose results must be sorted, if executing via a secondary index would require a particularly large number of table lookups, the optimizer also prefers full table scan + filesort. For example, the following query:
select * from single_table order by key1
Since the select list is *, sorting via the secondary index + table lookup approach would require a lookup for every secondary index record. That costs more than traversing the clustered index directly and doing a filesort, so the query optimizer tends to execute the query with a full table scan and then sort.
But if the limit keyword is added, as follows:
select * from single_table order by key1 limit 10
the query needs table lookups for only a handful of records, so the optimizer tends to use the secondary index + table lookup plan.
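The trade-off can be sketched with a toy cost model (the constants are made up for illustration and are not InnoDB's real optimizer costs): a table lookup behaves like a random page read, while a full scan reads rows sequentially and cheaply.

```python
# Toy cost model: random reads (table lookups) vs. sequential scan.
RANDOM_READ_COST = 1.0     # assumed cost per row fetched via the clustered index
SEQ_READ_COST = 0.01       # assumed cost per row during a full table scan

def index_plan_cost(rows_to_look_up):
    return rows_to_look_up * RANDOM_READ_COST

def scan_plan_cost(total_rows):
    return total_rows * SEQ_READ_COST

total = 100_000
# Without LIMIT every matching row needs a lookup -> the full scan wins.
print(index_plan_cost(total) > scan_plan_cost(total))   # True
# With LIMIT 10 only ten lookups are needed -> the index plan wins.
print(index_plan_cost(10) < scan_plan_cost(total))      # True
```

The exact numbers don't matter; the point is that per-row lookup cost scales with the number of rows returned, while scan cost is fixed, so LIMIT flips which plan is cheaper.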

3. Better use of indexes

  1. Create indexes only for columns used for searching, sorting, or grouping;
  2. Consider the number of distinct values in the index column, i.e. the cost of table lookups. If an index column has too many duplicate values, queries on it trigger a large number of table lookups and the index's efficiency drops sharply;
  3. Keep the index column's type as small as possible, so that each directory entry page can hold more records; this also helps keep the index tree from growing taller;
  4. Use index column prefixes. For long strings, index only a prefix to reduce the space the index occupies, but note that sorting by the column cannot use such a prefix index. Create one like this:
    alter table single_table add index idx_key1 (key1(10));
  5. Use covering indexes: when the secondary index alone supplies every column the query needs, no table lookup is required, which reduces the query's cost. The most typical case is selecting only the primary key (and indexed columns) through the secondary index, so no table lookup is needed;
  6. Let the index column appear alone, as a bare column name, in search conditions; otherwise the index cannot be used;
  7. It is best to insert primary key values in increasing order, to avoid page splits that reduce efficiency.
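Point 3 can be made concrete with a rough fan-out calculation (the per-entry sizes are assumptions for illustration; real InnoDB pages also spend bytes on headers, record overhead, and the page directory):

```python
# Rough fan-out estimate: a smaller key type lets one 16 KB directory
# page hold more entries, keeping the B+ tree shorter.
PAGE_SIZE = 16 * 1024   # InnoDB's default page size
POINTER = 6             # assumed bytes per child-page pointer

def fanout(key_bytes):
    """Approximate directory entries per page: entry = key + pointer."""
    return PAGE_SIZE // (key_bytes + POINTER)

print(fanout(4))   # 4-byte INT key
print(fanout(8))   # 8-byte BIGINT key
```

Halving the key size here raises the fan-out from roughly 1170 to roughly 1638 entries per directory page, and higher fan-out compounds at every level of the tree.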

4. The cost of indexing

B+ tree indexes are costly in space and time, so don't create indexes blindly.

Space: each B+ tree index occupies storage space (and buffer pool memory when its pages are cached).
Time: every insert, delete, or update must also maintain all the related secondary index B+ trees, and more indexes give the optimizer more execution plans to evaluate; both cost time.

Reference:
"How MySQL works"

A digression, answering a common question: since MySQL already has the binlog to record operations, why does InnoDB still need to introduce the redo log?

This comes down to the two logs' roles. The binlog works at the server layer and records logical operations; it is used for replication and point-in-time recovery. The redo log works inside the InnoDB storage engine and exists to guarantee the durability (crash safety) of committed transactions. It records physical changes to pages and is flushed to disk when a transaction commits, so even if the database crashes, InnoDB can replay the redo log to recover the data.


Origin blog.csdn.net/qq_63691275/article/details/133107237