Understanding how to use MySQL's B+ tree index

Foreword

In our previous article, we detailed the B+ tree index of the InnoDB storage engine. We must know the following conclusions:

  • Each index corresponds to a B+ tree. The B+ tree is divided into several layers. The bottom layer is the leaf node, and the rest are internal nodes (non-leaf nodes). All user records are stored in the leaf nodes of the B+ tree, and all directory entry records are stored in the inner nodes.

  • The InnoDB storage engine automatically creates a clustered index on the primary key (if the table has no primary key, InnoDB adds one for us), and the leaf nodes of the clustered index contain the complete user records.

  • We can create a secondary index for the columns we are interested in. The user records in the leaf nodes of a secondary index consist of the index columns plus the primary key, so to find a complete user record through a secondary index we need to go back to the table: after the primary key value is found through the secondary index, the complete user record is looked up in the clustered index.

  • The nodes in each layer of the B+ tree are sorted by index column value from small to large and form a doubly linked list, and the records in each page (whether user records or directory entry records) are sorted by index column value from small to large and form a singly linked list. For a joint index, pages and records are sorted by the first column of the joint index; where that column's values are equal, they are sorted by the next column.

  • Searching for records through the index starts from the root node of the B+ tree and proceeds downward layer by layer. Since each page has a Page Directory built on the index column values, searching within a page is very fast.

If you have any doubts about the above conclusions, it is recommended to go back and read the previous content first. After getting familiar with the principle of B+ tree, this chapter will show you how to make better use of indexes.


1. The cost of indexing

Although indexes are a good thing, they should not be created carelessly. Before learning how to use indexes better, let's understand their cost: an index is a burden in both space and time.

  • space cost

    This is obvious. Every index requires its own B+ tree, and each node of a B+ tree is a data page that occupies 16KB of storage by default. A large B+ tree consists of many data pages, which adds up to a lot of storage space.

  • time cost

    Every time you add, delete, or modify data in the table, every B+ tree index has to be modified as well. As we have said, the nodes at each level of a B+ tree are sorted by index column value from small to large and form a doubly linked list, and the records within a node, whether user records in leaf nodes or directory entry records in internal nodes, form a singly linked list in the same order. Insert, delete, and update operations can disturb this ordering, so the storage engine needs extra time for record shifting, page splitting, page recycling, and similar work to restore it. If we build many indexes, the B+ tree of every one of them needs this maintenance, which inevitably hurts performance. Therefore, the more indexes a table has, the more storage space they occupy and the slower inserts, deletes, and updates become. To build few but good indexes, we must first learn the conditions under which an index is actually used.

2. Applicable conditions of B+ tree

First of all, the B+ tree index is not a panacea: not every query can use the indexes we build. The following sections introduce the situations in which a query may use a B+ tree index. First, we create a demo7 table to store some basic information:

mysql> drop table if exists demo7;
Query OK, 0 rows affected (0.01 sec)

mysql> create table demo7(
	c1 int not null auto_increment,
	c2 varchar(11) not null,
	c3 varchar(11) not null,
	c4 char(11) not null,
	c5 varchar(11) not null,
	primary key(c1), key idx_c2_c3_c4(c2,c3,c4) 
);
Query OK, 0 rows affected (0.03 sec)

insert into demo7(c2,c3,c4,c5) values('a','a','a','d');
insert into demo7(c2,c3,c4,c5) values('a','ab','a','d');
insert into demo7(c2,c3,c4,c5) values('a','a','ab','d');				
insert into demo7(c2,c3,c4,c5) values('ab','ab','ab','d');
insert into demo7(c2,c3,c4,c5) values('ab','abc','ab','d');
insert into demo7(c2,c3,c4,c5) values('ab','ab','abc','d');			
insert into demo7(c2,c3,c4,c5) values('abc','abc','abc','d');
insert into demo7(c2,c3,c4,c5) values('abc','abcd','abc','d');
insert into demo7(c2,c3,c4,c5) values('abc','abc','abcd','d');		 
insert into demo7(c2,c3,c4,c5) values('abcd','abcd','abcd','d');
insert into demo7(c2,c3,c4,c5) values('abcd','abcde','abcd','d');
insert into demo7(c2,c3,c4,c5) values('abcd','abcd','abcde','d');

What we need to know about this table is:

mysql> show index from demo7;
+-------+------------+--------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+---------+------------+
| Table | Non_unique | Key_name     | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment | Index_comment | Visible | Expression |
+-------+------------+--------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+---------+------------+
| demo7 |          0 | PRIMARY      |            1 | c1          | A         |          12 |     NULL |   NULL |      | BTREE      |         |               | YES     | NULL       |
| demo7 |          1 | idx_c2_c3_c4 |            1 | c2          | A         |           4 |     NULL |   NULL |      | BTREE      |         |               | YES     | NULL       |
| demo7 |          1 | idx_c2_c3_c4 |            2 | c3          | A         |           8 |     NULL |   NULL |      | BTREE      |         |               | YES     | NULL       |
| demo7 |          1 | idx_c2_c3_c4 |            3 | c4          | A         |          12 |     NULL |   NULL |      | BTREE      |         |               | YES     | NULL       |
+-------+------------+--------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+---------+------------+
4 rows in set (0.01 sec)
  • The primary key of the table is the c1 column, which stores an auto-incrementing integer, so the InnoDB storage engine automatically creates a clustered index on the c1 column.
  • We additionally define a secondary index idx_c2_c3_c4, which is a joint index consisting of three columns. Therefore, the user records stored in the leaf nodes of the B+ tree for this index hold only the values of c2, c3, and c4 plus the value of the primary key c1; they do not hold the value of the c5 column.

From these two points, we can see again that a table has as many B+ trees as it has indexes: for the demo7 table, two B+ trees are built, one for the clustered index and one for the idx_c2_c3_c4 index. Below we draw a schematic diagram of the idx_c2_c3_c4 index. Since we have already mastered the principle of InnoDB's B+ tree index, we omit some unnecessary parts to make the picture clearer, such as the extra information of each record and the page numbers of the pages. We use arrows in place of the page number information in the directory entry records of the internal nodes. In the record structure, only the real data values of the four columns c2, c3, c4, and c1 are kept, so the schematic diagram looks like this:

[Figure: schematic diagram of the B+ tree of the idx_c2_c3_c4 index]

We know that internal nodes store directory entry records and leaf nodes store user records (because this is not a clustered index, the user records are incomplete and lack the c5 column value). We can see from the figure that the pages and records in the B+ tree of the idx_c2_c3_c4 index are sorted as follows:

  • Sort by the value of the c2 column first.
  • If the c2 values are the same, sort by the value of the c3 column.
  • If the c3 values are also the same, sort by the value of the c4 column.

This ordering is very important: because the pages and records are sorted, we can locate records quickly with binary search. Keep this picture in mind while reading the content below.
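As a rough illustration of this ordering, the following Python sketch models the leaf records of idx_c2_c3_c4 as tuples (c2, c3, c4, c1) and sorts them. It assumes that plain Python string comparison behaves like the column's comparison rules for these lowercase ASCII values, which is only an approximation of what MySQL does. The names rows and index are made up for the sketch.

```python
# (c1, c2, c3, c4) for the 12 demo7 rows, copied from the insert statements above
rows = [
    (1, "a", "a", "a"), (2, "a", "ab", "a"), (3, "a", "a", "ab"),
    (4, "ab", "ab", "ab"), (5, "ab", "abc", "ab"), (6, "ab", "ab", "abc"),
    (7, "abc", "abc", "abc"), (8, "abc", "abcd", "abc"), (9, "abc", "abc", "abcd"),
    (10, "abcd", "abcd", "abcd"), (11, "abcd", "abcde", "abcd"),
    (12, "abcd", "abcd", "abcde"),
]

# A secondary index leaf record holds the index columns plus the primary key.
# Sorting the tuples compares c2 first, then c3, then c4 (and finally c1).
index = sorted((c2, c3, c4, c1) for c1, c2, c3, c4 in rows)

for record in index:
    print(record)
```

Reading the last element of each printed tuple shows that the primary key values are no longer in insertion order once the records are arranged by (c2, c3, c4).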

2.1 Full value matching

If the columns in our search criteria are consistent with the index columns, this is a full-value match, as follows:

select * from demo7 where c2='a' and c3='a' and c4= 'ab';

All three columns of the idx_c2_c3_c4 index we built appear in this query statement. We can imagine the process:

  • Because the data pages and records of the B+ tree are sorted by the value of the c2 column first, the records whose c2 value is 'a' can be located quickly.
  • Records with the same c2 value are sorted by the value of the c3 column, so among the records whose c2 value is 'a', those whose c3 value is 'a' can be located quickly.
  • If the values of the c2 and c3 columns are both the same, the records are sorted by the value of the c4 column, so all three columns of the joint index can be used.

We may still have a question: does the order of the search conditions in the where clause affect the query? That is, does swapping the positions of c2, c3, and c4 in the conditions change the execution process? For example, writing it like this:

select * from demo7 where  c4='ab' and c3= 'a' and c2='a' ;

The answer is no. MySQL has a component called the query optimizer, which analyzes these search conditions and applies them in the order of the usable index columns, regardless of the order in which they are written. We will learn more about it later.
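The full-value match, and the fact that the written order of the conditions does not matter, can be sketched in Python. This is a simplified model, not MySQL's implementation: the index is a sorted list of tuples, binary search comes from the standard bisect module, and full_match is a hypothetical helper that rearranges the conditions into index column order.

```python
from bisect import bisect_left, bisect_right

# (c1, c2, c3, c4) for the 12 demo7 rows, copied from the insert statements above
rows = [
    (1, "a", "a", "a"), (2, "a", "ab", "a"), (3, "a", "a", "ab"),
    (4, "ab", "ab", "ab"), (5, "ab", "abc", "ab"), (6, "ab", "ab", "abc"),
    (7, "abc", "abc", "abc"), (8, "abc", "abcd", "abc"), (9, "abc", "abc", "abcd"),
    (10, "abcd", "abcd", "abcd"), (11, "abcd", "abcde", "abcd"),
    (12, "abcd", "abcd", "abcde"),
]
index = sorted((c2, c3, c4, c1) for c1, c2, c3, c4 in rows)
keys = [rec[:3] for rec in index]  # the (c2, c3, c4) part of each leaf record


def full_match(conditions):
    """conditions: dict of column name -> value, written in any order."""
    # Rearrange the conditions into index column order (hypothetical stand-in
    # for what the query optimizer does).
    target = tuple(conditions[col] for col in ("c2", "c3", "c4"))
    lo = bisect_left(keys, target)   # first record >= target
    hi = bisect_right(keys, target)  # first record > target
    return index[lo:hi]


written_order = full_match({"c2": "a", "c3": "a", "c4": "ab"})
swapped_order = full_match({"c4": "ab", "c3": "a", "c2": "a"})
print(written_order, swapped_order)
```

Both calls find the same single record, the one whose primary key c1 is 3.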

2.2 Match the left column

In fact, a search statement does not need to include all the columns of the joint index; it can include only the leftmost column, as in the following statement:

select * from demo7 where  c2='abc' ;

Or include multiple columns on the left:

select * from demo7 where c2='abcd' and c3='abcde';  

So why must the left column appear in the search condition for this B+ tree index to be usable? For example, does the following statement really not use the B+ tree index?

select * from demo7 where  c3='abcde' ;

Yes, the index cannot be used to locate these records, because the data pages and records of the B+ tree are sorted by the value of the c2 column first, and the c3 column is used for sorting only when the c2 values are the same. In other words, across records with different c2 values, the c3 values may be unordered. Here we skip the c2 column and search directly on the value of c3, which binary search cannot do. So what if we want to search through a B+ tree index using only the c3 column? That is easy to handle: create a B+ tree index on the c3 column.

But one thing needs special attention: if we want to use as many columns of the joint index as possible, the columns in the search condition must be consecutive columns starting from the leftmost column of the joint index. For example, the columns of the joint index idx_c2_c3_c4 are defined in the order c2, c3, c4; suppose our search conditions contain only c2 and c4, without c3 in the middle, as follows:

select * from demo7	where c2 = 'abcd' and c4='abcde';

In this case only the c2 column of the index can be used to locate records; the c4 condition cannot use the index, because records with the same c2 value are sorted by c3 first, and only records with the same c3 value are sorted by c4.
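The leftmost-column rule can be written down as a small Python sketch. usable_prefix is a hypothetical helper that only handles equality conditions; the real optimizer considers much more. The second half checks, on the c2 = 'abcd' records of the demo data, why a skipped c3 column makes c4 unusable for locating records.

```python
def usable_prefix(index_cols, equality_cols):
    """Return the leading index columns that equality conditions can use."""
    used = []
    for col in index_cols:
        if col not in equality_cols:
            break  # a missing column ends the usable prefix
        used.append(col)
    return used


idx = ("c2", "c3", "c4")
print(usable_prefix(idx, {"c2", "c3", "c4"}))  # full-value match
print(usable_prefix(idx, {"c2", "c4"}))        # c3 is skipped
print(usable_prefix(idx, {"c3"}))              # leftmost column is missing

# The three c2 = 'abcd' records in index order, as (c2, c3, c4).
# They are ordered by c3 first, so c4 on its own is not in sorted order.
group = [("abcd", "abcd", "abcd"), ("abcd", "abcd", "abcde"), ("abcd", "abcde", "abcd")]
c4_values = [rec[2] for rec in group]
print(c4_values == sorted(c4_values))
```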

2.3 Match column prefix

We said earlier that indexing a column means that the records of the corresponding B+ tree are sorted by the value of that column. For example, the joint index idx_c2_c3_c4 on the demo7 table sorts by the value of the c2 column first, so the c2 column of the records in the B+ tree of this joint index is arranged like this:

a
a
a
ab
ab
ab
abc
abc 
abc
abcd 
abcd 
abcd

The essence of string sorting is to compare which string is larger and which string is smaller. Comparing the size of strings uses the character set and comparison rules of the column, which we have already mentioned before. It should be noted here that the general comparison rules are to compare the size of characters one by one, that is to say, the process of comparing the size of two strings is actually like this:

  • First compare the first character of the string, and the string with the smaller first character is smaller.
  • If the first character of the two strings is the same, then compare the second character, and the string with the smaller second character is smaller.
  • If the second character of the two strings is also the same, then compare the third character, and so on.

So a sorted string column actually has the following characteristics:

  • First sort by the first character of the string.
  • If the first character is the same, sort according to the second character.
  • If the 2nd character is the same, sort according to the 3rd character, and so on.

That is to say, the first n characters of these strings, that is, their prefixes, are also sorted, so for an index column of string type we can locate records quickly by matching only a prefix. For example, to query the records whose c2 value starts with 'a', we can write a query statement like this:

select * from demo7 where c2 like 'a%'

However, note that if only a suffix or a string in the middle is given, such as this:
select * from demo7 where c2 like '%b%'
MySQL cannot locate the records quickly, because strings that contain 'b' somewhere in the middle are not stored next to each other in sorted order, so the whole table has to be scanned.
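The difference between the two like patterns can be sketched in Python. The sketch uses the prefix 'abc' instead of 'a' so that the match is selective. prefix_range and contains_scan are hypothetical helpers, and the upper bound relies on an assumption that no key contains the largest Unicode code point right after the prefix.

```python
from bisect import bisect_left

# c2 values in index order (the list shown above)
c2_keys = ["a"] * 3 + ["ab"] * 3 + ["abc"] * 3 + ["abcd"] * 3


def prefix_range(keys, prefix):
    """Positions [lo, hi) of the keys starting with prefix, via binary search."""
    lo = bisect_left(keys, prefix)
    # Sentinel (assumption of this sketch): the prefix followed by the largest
    # code point sorts after every ordinary string that starts with the prefix.
    hi = bisect_left(keys, prefix + "\U0010ffff")
    return lo, hi


def contains_scan(keys, fragment):
    """like '%fragment%': the sort order does not help, so check every key."""
    return [i for i, key in enumerate(keys) if fragment in key]


lo, hi = prefix_range(c2_keys, "abc")    # like 'abc%'
print(c2_keys[lo:hi])
print(len(contains_scan(c2_keys, "b")))  # like '%b%', visits all 12 keys
```

The prefix search touches only the two boundaries, while the substring search has to visit every key even though it returns fewer than all of them.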

2.4 Match range value

Looking back at the B+ tree diagram of our idx_c2_c3_c4 index, all records are sorted by the value of the index columns from small to large, which makes it very convenient to find records whose index column values fall within a certain range. For example, the following query statement:

select * from demo7 where c2 > 'a' and c2 < 'abcd';

Since the data pages and records in the B+ tree are first sorted by the c2 column, our above query process is actually as follows:

  • Find the record whose c2 value is a.
  • Find the record whose c2 value is abcd

Since all records are connected by linked lists (a singly linked list between records, and a doubly linked list between data pages), the records between these two positions can be read out easily. Their primary key values are then used to go back to the table, that is, to look up the complete records in the clustered index. However, be careful with range searches on a joint index: if you perform a range search on several columns at the same time, only the range on the leftmost index column can use the B+ tree index, for example:

select * from demo7 where c2 > 'a' and c2 < 'abcd' and c3 > 'a';

The above query can be broken down into two parts:

  • Use the conditions c2 > 'a' and c2 < 'abcd' to perform a range search on c2; the result may contain multiple records with different c2 values.
  • These records with different c2 values are then filtered by the c3 > 'a' condition.

In this way, only the c2 part of the joint index idx_c2_c3_c4 can be used, not the c3 part, because the c3 column determines the order only among records with the same c2 value. The records found by the range search on c2 may not be sorted by c3, so the c3 condition cannot use the B+ tree index to narrow the search further.
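A minimal Python model of this two-part execution, under the same assumptions as before (a sorted list of tuples standing in for the leaf records, Python string comparison standing in for the column's comparison rules): the c2 bounds are found by binary search, and the c3 condition is merely tested on each record inside those bounds.

```python
from bisect import bisect_left, bisect_right

# (c1, c2, c3, c4) for the 12 demo7 rows, copied from the insert statements above
rows = [
    (1, "a", "a", "a"), (2, "a", "ab", "a"), (3, "a", "a", "ab"),
    (4, "ab", "ab", "ab"), (5, "ab", "abc", "ab"), (6, "ab", "ab", "abc"),
    (7, "abc", "abc", "abc"), (8, "abc", "abcd", "abc"), (9, "abc", "abc", "abcd"),
    (10, "abcd", "abcd", "abcd"), (11, "abcd", "abcde", "abcd"),
    (12, "abcd", "abcd", "abcde"),
]
index = sorted((c2, c3, c4, c1) for c1, c2, c3, c4 in rows)
c2_keys = [rec[0] for rec in index]

# c2 > 'a' and c2 < 'abcd': both bounds are found by binary search on c2
lo = bisect_right(c2_keys, "a")    # first record with c2 > 'a'
hi = bisect_left(c2_keys, "abcd")  # first record with c2 >= 'abcd'
in_range = index[lo:hi]

# c3 > 'a' cannot narrow the scan; it is tested on each record in the range
result = [rec for rec in in_range if rec[1] > "a"]
print(lo, hi, [rec[3] for rec in result])
```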

2.5 Exactly match one column and range match another column

For the same joint index, although only the leftmost index column can be used for range search on multiple columns, if the left column is an exact search, the right column can be used for range search, for example:

select * from demo7 where c2='a' and c3 > 'a' and c3 <'ab' and c4>'a'

The conditions of this query can be divided into 3 parts:

  • c2 = 'a': an exact search on column c2, which can of course use the B+ tree index.
  • c3 > 'a' and c3 < 'ab': since c2 is matched exactly, all records found through the c2 = 'a' condition have the same c2 value and are therefore sorted by c3, so the range search on c3 can use the B+ tree index.
  • c4 > 'a': the records found by the range search on c3 may have different c3 values, so this condition can no longer use the B+ tree index and can only be checked against each record obtained in the previous step.

Similarly, the following query may also use this idx_c2_c3_c4 joint index:

select * from demo7 where c2='a' and c3= 'a' and c4>'a'
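For this last query, a Python sketch shows how all three columns can take part in locating the records: with c2 and c3 fixed, the matching records form one contiguous run that is ordered by c4. The lists keys2 and keys3 are hypothetical helpers holding the (c2, c3) and (c2, c3, c4) parts of each leaf record.

```python
from bisect import bisect_right

# (c1, c2, c3, c4) for the 12 demo7 rows, copied from the insert statements above
rows = [
    (1, "a", "a", "a"), (2, "a", "ab", "a"), (3, "a", "a", "ab"),
    (4, "ab", "ab", "ab"), (5, "ab", "abc", "ab"), (6, "ab", "ab", "abc"),
    (7, "abc", "abc", "abc"), (8, "abc", "abcd", "abc"), (9, "abc", "abc", "abcd"),
    (10, "abcd", "abcd", "abcd"), (11, "abcd", "abcde", "abcd"),
    (12, "abcd", "abcd", "abcde"),
]
index = sorted((c2, c3, c4, c1) for c1, c2, c3, c4 in rows)
keys2 = [rec[:2] for rec in index]  # (c2, c3)
keys3 = [rec[:3] for rec in index]  # (c2, c3, c4)

# c2 = 'a' and c3 = 'a' and c4 > 'a'
lo = bisect_right(keys3, ("a", "a", "a"))  # skip the records with c4 <= 'a'
hi = bisect_right(keys2, ("a", "a"))       # end of the (c2, c3) = ('a', 'a') run
print(index[lo:hi])
```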

2.6 For sorting

When we write query statements, we often need to sort the result through the order by clause. Normally the records have to be loaded into memory and sorted there with an algorithm such as quick sort or merge sort. Sometimes the result set is too large to sort in memory, and disk space is used temporarily to store intermediate results; the sorted result set is returned to the client after the sorting operation completes. In MySQL, sorting in memory or on disk in this way is collectively called file sorting (English name: filesort), and it is usually slow; compared with reading records in index order, it is like a snail next to a plane. But if the order by clause uses our index columns, the step of sorting in memory or on disk may be saved, as in the following simple query statement:

select * from demo7 order by c2,c3,c4 limit 10;

The result set of this query must be sorted by c2 first; records with the same c2 value are sorted by c3, and records with the same c3 value are sorted by c4. Look back at the schematic diagram of the idx_c2_c3_c4 index: the B+ tree index is already sorted according to exactly these rules, so the records can be read directly from the index in order, and a back-to-the-table operation then fetches the columns that are not included in the index. Simple, right? Yes, indexes are that awesome.

2.6.1 Considerations for sorting using a joint index

There is one point about the joint index that needs attention: the columns after the order by clause must be given in the order of the index columns. If they are given as order by c4, c3, c2, the B+ tree index cannot be used. The reason is the same as for the search conditions discussed above, so I won't go into details here.

Similarly, order by c2, order by c2, c3 can use part of the B+ tree index in the form of matching the left column of the index. When the value of the left column of the joint index is constant, you can also use the latter column for sorting, for example:

select * from demo7 where c2 ='a' order by c3,c4 limit 10;

This query can be sorted using the joint index because the records with the same value in the c2 column are sorted according to c3 and c4, which has been said many times.
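This can be checked on the demo data with a short Python sketch, again treating the index as a sorted list of (c2, c3, c4, c1) tuples: the records with c2 = 'a' are already in order by c3, c4, but not in order by c4, c3.

```python
# (c1, c2, c3, c4) for the 12 demo7 rows, copied from the insert statements above
rows = [
    (1, "a", "a", "a"), (2, "a", "ab", "a"), (3, "a", "a", "ab"),
    (4, "ab", "ab", "ab"), (5, "ab", "abc", "ab"), (6, "ab", "ab", "abc"),
    (7, "abc", "abc", "abc"), (8, "abc", "abcd", "abc"), (9, "abc", "abc", "abcd"),
    (10, "abcd", "abcd", "abcd"), (11, "abcd", "abcde", "abcd"),
    (12, "abcd", "abcd", "abcde"),
]
index = sorted((c2, c3, c4, c1) for c1, c2, c3, c4 in rows)

# where c2 = 'a': a contiguous run of the index, kept in index order
group = [rec for rec in index if rec[0] == "a"]

by_c3_c4 = sorted(group, key=lambda rec: (rec[1], rec[2]))  # order by c3, c4
by_c4_c3 = sorted(group, key=lambda rec: (rec[2], rec[1]))  # order by c4, c3

print(group == by_c3_c4)  # index order already satisfies order by c3, c4
print(group == by_c4_c3)  # the reversed column order needs a real sort
```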

2.6.2 Situations where indexes cannot be used for sorting

Mixed use of asc and desc

For sorting with a joint index, we require the sort direction of every sorting column to be the same: either all columns are sorted by the asc rule, or all columns are sorted by the desc rule. This applies to an index like idx_c2_c3_c4, in which every column is stored in ascending order.

Tip:
If a column after the order by clause has neither asc nor desc, it is sorted by the asc rule by default, that is, in ascending order.

Why is there such a weird rule? We have to go back and think about how the records in the idx_c2_c3_c4 joint index are arranged:

  • First, records are sorted in ascending order by the value of column c2.
  • Records with the same c2 value are sorted in ascending order by the value of column c3.
  • Records with the same c3 value are sorted in ascending order by the value of column c4.

If the sorting order of each sorting column in the query is consistent, for example, the following two cases:

  • order by c2, c3 limit 10
    In this case, just read 10 records from the far left of the index to the right.
  • order by c2 desc, c3 desc limit 10
    In this case, just read 10 records from the far right of the index to the left.

But suppose our query needs to sort in ascending order by column c2 and then in descending order by column c3, as in this query statement:
select * from demo7 order by c2,c3 desc limit 10;

If the index were used for sorting, the process would be like this:

  • First find the minimum value of column c2 at the far left of the index, then find all the records whose c2 equals this value, and read up to 10 records leftward starting from the rightmost of them.
  • If fewer than 10 records have the smallest c2 value, continue to the right to the records with the second smallest c2 value, and repeat the above process until 10 records are found.
  • The point is that the index cannot be used efficiently this way: a more complex algorithm is needed to fetch the data from the index, which is not as fast as a direct file sort. Therefore, it is stipulated that all sorting columns must have the same sort direction when a joint index is used for sorting.
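A Python sketch makes the three cases concrete. It takes the (c2, c3) part of the leaf records in index order and compares each requested ordering with a forward and a backward pass over the index. The mixed ordering is produced with a stable two-pass sort, which is simply a convenient way to express it in Python and says nothing about how MySQL sorts.

```python
# (c1, c2, c3, c4) for the 12 demo7 rows, copied from the insert statements above
rows = [
    (1, "a", "a", "a"), (2, "a", "ab", "a"), (3, "a", "a", "ab"),
    (4, "ab", "ab", "ab"), (5, "ab", "abc", "ab"), (6, "ab", "ab", "abc"),
    (7, "abc", "abc", "abc"), (8, "abc", "abcd", "abc"), (9, "abc", "abc", "abcd"),
    (10, "abcd", "abcd", "abcd"), (11, "abcd", "abcde", "abcd"),
    (12, "abcd", "abcd", "abcde"),
]
index = sorted((c2, c3, c4, c1) for c1, c2, c3, c4 in rows)
pairs = [rec[:2] for rec in index]  # (c2, c3) in index order

all_asc = sorted(pairs)                # order by c2, c3
all_desc = sorted(pairs, reverse=True) # order by c2 desc, c3 desc
# order by c2 asc, c3 desc: stable two-pass sort, secondary key first
mixed = sorted(sorted(pairs, key=lambda p: p[1], reverse=True), key=lambda p: p[0])

print(all_asc == pairs)                      # a forward index scan is enough
print(all_desc == pairs[::-1])               # a backward index scan is enough
print(mixed == pairs, mixed == pairs[::-1])  # neither scan direction works
```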

The where clause contains a column that is not in the index used for sorting

If the where clause contains a column that does not belong to the index used for sorting, the sort still cannot use the index, for example:

select * from demo7 where c5 =	'a' order by c2 limit 10;

This query can only extract the records that meet the search condition c5='a' and then sort them, and the index cannot be used. Note the difference from the following query:

select * from demo7 where c2='a' order by c3,c4 limit 10;

Although this query also has search conditions, c2 = 'a' can use the index idx_c2_c3_c4, and the remaining records after filtering are still sorted according to the c3 and c4 columns, so the index can still be used for sorting.

The sorted column contains columns that are not in the same index

Sometimes multiple columns used for sorting are not in an index, and in this case, the index cannot be used for sorting, for example:

select * from demo7 order by c2,c5 limit 10;

c2 and c5 do not belong to the same joint index, so the index cannot be used for sorting. To see why, look back at how the records in the index are ordered.

Sorting columns use complex expressions

If you want to use an index for sorting operations, you must ensure that the index column appears in the form of a separate column, not a modified form, for example:

select * from demo7 order by upper(c2) limit 10;

Columns modified with the upper function are not separate columns, so they cannot be sorted using indexes.

2.7 For grouping

Sometimes we group the records in the table according to certain columns in order to facilitate the statistics of some information in the table. For example, the group query below:

select c2,c3,c4,count(*) from demo7 group by c2,c3,c4

This query statement is equivalent to doing 3 grouping operations:

  • First group the records according to the c2 value, and all the records with the same c2 value are divided into one group.
  • Group the records in each group with the same value of c2 according to the value of c3, and put the records with the same value of c3 into a small group, so it looks like a large group is divided into many small groups.
  • Then divide the small groups produced in the previous step into even smaller groups according to the value of c4. Overall, the records are first divided into large groups, each large group is divided into several small groups, and each small group is subdivided into still smaller groups.

Then statistics are computed for the smallest groups; in our query statement, we count the number of records in each of them. Without an index, the whole grouping process has to be done in memory. With an index whose column order matches the grouping order, as is the case here, the B+ tree is already sorted by the index columns, so the B+ tree index can be used directly for grouping.

It is the same as using the b+ tree index for sorting. The order of the grouping columns must also be consistent with the order of the index columns, or only the left column in the index column can be used for grouping.
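The grouping argument can be illustrated with itertools.groupby, which, like a single pass over the index, only merges equal values that sit next to each other. This is a sketch over the same sorted list of tuples and does not reproduce how MySQL executes group by.

```python
from itertools import groupby

# (c1, c2, c3, c4) for the 12 demo7 rows, copied from the insert statements above
rows = [
    (1, "a", "a", "a"), (2, "a", "ab", "a"), (3, "a", "a", "ab"),
    (4, "ab", "ab", "ab"), (5, "ab", "abc", "ab"), (6, "ab", "ab", "abc"),
    (7, "abc", "abc", "abc"), (8, "abc", "abcd", "abc"), (9, "abc", "abc", "abcd"),
    (10, "abcd", "abcd", "abcd"), (11, "abcd", "abcde", "abcd"),
    (12, "abcd", "abcd", "abcde"),
]
index = sorted((c2, c3, c4, c1) for c1, c2, c3, c4 in rows)

# group by c2: equal c2 values are adjacent in the index, so one pass suffices
by_c2 = [(key, len(list(grp))) for key, grp in groupby(index, key=lambda rec: rec[0])]
print(by_c2)

# group by c4 alone: equal c4 values are scattered through the index, so one
# pass yields fragments (runs) rather than complete groups
c4_runs = [key for key, _ in groupby(index, key=lambda rec: rec[2])]
print(len(c4_runs), len(set(c4_runs)))
```

Grouping by the left column gives four complete groups in one pass, whereas c4 has only five distinct values but breaks into twelve separate runs.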

3. The cost of going back to the table

The discussion above has only touched on the term going back to the table, and you may not have a deep understanding of it yet. Let's talk about it in detail below. Still using the idx_c2_c3_c4 index as an example, see the following query:

select * from demo7 where c2>'a' and c2<'abcde';

When querying using the idx_c2_c3_c4 index, it can be divided into two steps:

  • From the b+ tree corresponding to the index idx_c2_c3_c4, take out the user records whose c2 value is between a and abcde.
  • Since the user records in the B+ tree of the index idx_c2_c3_c4 contain only the four fields c2, c3, c4, and c1, and the query list is *, all fields in the table must be returned, including the c5 field. So for each record obtained in the previous step, the c1 value is used to find the complete user record in the B+ tree of the clustered index, which is what we usually call going back to the table, and the complete user record is then returned to the client.

Since the records in the B+ tree of the index idx_c2_c3_c4 are sorted by the value of the c2 column first, the records with values between a and abcde are stored contiguously on disk, spread over one or more data pages, and can be read from disk quickly. This kind of reading is called sequential I/O. The c1 values of the records obtained in step 1, however, may not be contiguous, while the records in the clustered index are arranged in c1 (primary key) order. The complete user records reached through these scattered c1 values may therefore be spread over many different data pages, so reading them may require access to more data pages. This kind of reading is called random I/O. In general, sequential I/O performs much better than random I/O, so step 1 may execute quickly while step 2 is slower. So this query using the index idx_c2_c3_c4 has two characteristics:

  • Two b+ tree indexes, one secondary index and one clustered index will be used
  • Access to the secondary index uses sequential i/o, and access to the clustered index uses random i/o

The more records that need to go back to the table, the lower the performance of using the secondary index; for some queries a full table scan is even preferable to the secondary index. For example, suppose the user records whose c2 value is between a and abcde account for more than 90% of all records. If the idx_c2_c3_c4 index is used, more than 90% of the c1 values need to go back to the table. Isn't that a thankless task? It is better to scan the clustered index directly (that is, a full table scan).

So when is a full table scan used, and when is the secondary index plus going back to the table used to execute the query? That is the job of the legendary query optimizer. The query optimizer keeps some statistics about the records in the table and uses them to estimate, from the query conditions, how many records need to go back to the table. The more records that need to go back to the table, the more it leans toward a full table scan; the fewer, the more it leans toward the secondary index plus going back to the table. Of course, the optimizer's analysis is not that simple, but this is roughly the process. In general, limiting the number of records fetched by the query makes the optimizer more inclined to choose the secondary index plus going back to the table, because the fewer records go back to the table, the greater the performance gain. For example, the query above can be rewritten like this:
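The two steps can be modelled in Python to show where the random access comes from. In this sketch the clustered index is a plain dictionary keyed by c1, which hides pages entirely; the only point is that the primary keys produced by the secondary index scan do not arrive in primary key order.

```python
from bisect import bisect_left, bisect_right

# (c1, c2, c3, c4) for the 12 demo7 rows, copied from the insert statements above
rows = [
    (1, "a", "a", "a"), (2, "a", "ab", "a"), (3, "a", "a", "ab"),
    (4, "ab", "ab", "ab"), (5, "ab", "abc", "ab"), (6, "ab", "ab", "abc"),
    (7, "abc", "abc", "abc"), (8, "abc", "abcd", "abc"), (9, "abc", "abc", "abcd"),
    (10, "abcd", "abcd", "abcd"), (11, "abcd", "abcde", "abcd"),
    (12, "abcd", "abcd", "abcde"),
]
index = sorted((c2, c3, c4, c1) for c1, c2, c3, c4 in rows)
# Stand-in for the clustered index: complete rows by primary key (c5 is 'd' everywhere)
clustered = {c1: (c1, c2, c3, c4, "d") for c1, c2, c3, c4 in rows}
c2_keys = [rec[0] for rec in index]

# Step 1: range scan on the secondary index, c2 > 'a' and c2 < 'abcde'
lo = bisect_right(c2_keys, "a")
hi = bisect_left(c2_keys, "abcde")
primary_keys = [rec[3] for rec in index[lo:hi]]

# Step 2: one clustered index lookup per primary key (going back to the table)
full_rows = [clustered[pk] for pk in primary_keys]

print(primary_keys)
print(primary_keys == sorted(primary_keys))
```

Nine of the twelve rows match, so nine separate lookups are needed, and their primary keys jump back and forth instead of increasing steadily.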

select * from demo7 where c2>'a' and c2<'abcde' limit 10;

With limit 10 added, the optimizer is more likely to execute the query using the secondary index plus going back to the table.

For queries with sorting requirements, the above-mentioned conditions for querying by full table scan or secondary index + table return are also valid. For example, the following query:

select * from demo7 order by c2,c3,c4;

Since the query list is *, sorting with the secondary index would require every sorted secondary index record to go back to the table. That costs more than traversing the clustered index directly and then doing a file sort (filesort), so the optimizer tends to execute the query with a full table scan. If we add a limit clause, such as this:

select * from demo7 order by c2,c3,c4 limit 10;

In this way, there are very few records that need to be returned to the table, and the optimizer will tend to use the secondary index + return to the table to execute the query

Covering index

In order to completely bid farewell to the performance loss caused by the table return operation, we suggest that it is best to include only index columns in the query list, for example:

select c2,c3,c4 from demo7 where c2 >'a' and c2 < 'abcde';

Because we query only the values of the three index columns c2, c3, and c4, after the results are obtained through the idx_c2_c3_c4 index there is no need to look up the remaining column of the record, that is, the value of the c5 column, in the clustered index. This eliminates the performance loss caused by going back to the table. We call this way of querying, where only the index needs to be used, index covering (a covering index). Sorting operations also prefer to use a covering index, as in this query:

select c2,c3,c4 from demo7 order by c2,c3,c4;

Although there is no limit statement in this query, the covering index is used, so the query optimizer will directly use the idx_c2_c3_c4 index for sorting without returning to the table.
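Whether a query list is covered by an index comes down to a simple set check, sketched below. needs_table_lookup is a hypothetical helper; it relies on the fact stated earlier that secondary index leaf records hold the index columns plus the primary key.

```python
def needs_table_lookup(select_cols, index_cols, primary_key_cols):
    """True if some selected column is stored only in the clustered index."""
    available = set(index_cols) | set(primary_key_cols)
    return not set(select_cols) <= available


idx = ("c2", "c3", "c4")
pk = ("c1",)

print(needs_table_lookup(("c2", "c3", "c4"), idx, pk))              # covered
print(needs_table_lookup(("c1", "c2"), idx, pk))                    # covered, c1 is in the leaf record
print(needs_table_lookup(("c1", "c2", "c3", "c4", "c5"), idx, pk))  # select *, c5 forces a lookup
```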

Of course, if the business needs columns other than the index columns, meeting the business need comes first. However, we strongly discourage using the * symbol as the query list; it is best to list the columns we need explicitly.

4. How to choose an index

Above, we took the idx_c2_c3_c4 index as an example to explain the applicable conditions of the index in detail. Below, we will look at some matters that we should pay attention to when building an index or writing a query statement.

4.1 Create indexes only for columns used for searching, sorting, or grouping

That is, only create indexes for columns that appear in a where clause, join columns in a join statement, or columns that appear in an order by or group by clause. Columns that appear in the query list do not need to be indexed:

select c3,c5 from demo7 where c2 = 'abcd';

Columns such as c3 and c5 in the query list do not need to be indexed; we only need to create an index for the c2 column that appears in the where clause.

4.2 Consider column cardinality

The cardinality of a column is the number of distinct values in it. For example, suppose a column contains the values 2, 5, 8, 2, 5, 8, 2, 5, 8. Although there are 9 records, the cardinality of the column is 3. In other words, for a fixed number of rows, the larger the cardinality of a column, the more scattered its values are, and the smaller the cardinality, the more concentrated they are. Cardinality is an important metric because it directly affects whether we can use an index effectively. If the cardinality of a column is 1, that is, every record has the same value in that column, then building an index on it is useless: the values are all identical, so sorting them does nothing to narrow a search. Moreover, if a column with a secondary index has many duplicate values, a lookup through that index matches many records, and each of them may need to go back to the table, which costs even more performance. So the conclusion is: it is best to build indexes on columns with large cardinality; an index on a column whose cardinality is too small may not help much.
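Cardinality can be measured before deciding whether to build an index. The following is a minimal sketch against the demo7 table; the column aliases are made up for illustration. The first statement counts the distinct values exactly, and the second shows the estimated cardinality that MySQL keeps for each existing index.

```sql
-- Exact cardinality of c2, and its ratio to the number of rows.
-- A ratio close to 1 means the values are well scattered.
select count(distinct c2)            as c2_cardinality,
       count(*)                      as total_rows,
       count(distinct c2) / count(*) as c2_selectivity
from demo7;

-- The Cardinality column of this output is an estimate
-- maintained by the storage engine, not an exact count.
show index from demo7;
```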

4.3 The type of index column should be as small as possible

When we define the table structure, we need to specify the type of each column explicitly. Taking the integer types as an example, there are tinyint, mediumint, int, and bigint, among others, and the storage space they occupy increases in that order. The type size we refer to here is the size of the data range the type can represent, and the range of integers that can be represented also increases in that order. If we want to index an integer column, we should make the index column use the smallest type that the required integer range allows. For example, don't use bigint if int will do, and don't use int if mediumint will do. This is because:

  • The smaller the data type, the faster the comparison operations during a query (this happens at the CPU level).
  • The smaller the data type, the less storage space the index occupies, so more records fit in one data page. This reduces the performance loss caused by disk I/O and also means more data pages can be cached in memory, which speeds up reads and writes.

This suggestion applies even more to the primary key of the table, because the primary key value is stored not only in the clustered index but also in the nodes of every secondary index. If the primary key can use a smaller data type, that means more storage space saved and more efficient I/O.
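As a rough sketch of this advice, the hypothetical table below (demo_types and its columns are made up for illustration and do not appear elsewhere in this article) picks the smallest integer type that fits each column. The byte sizes in the comments are MySQL's fixed storage sizes for these types.

```sql
-- Hypothetical table, used only to illustrate type sizes.
create table demo_types(
	id int unsigned not null auto_increment,   -- 4 bytes, 0 to 4294967295
	user_no mediumint unsigned not null,       -- 3 bytes, 0 to 16777215
	level tinyint unsigned not null,           -- 1 byte, 0 to 255
	primary key(id),
	key idx_user_no(user_no)
);

-- Each record in idx_user_no stores user_no (3 bytes) plus the primary
-- key id (4 bytes). Had id been declared bigint (8 bytes), every
-- secondary index record would grow by 4 bytes.
```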

4.4 Prefixes for indexed string values

We know that a string is composed of a number of characters. If we use the utf8 character set to store strings in MySQL, encoding one character takes 1 to 3 bytes. If a string is very long, storing it requires a lot of space. When we create an index on such a string column, the corresponding B+ tree has two problems:

  • The records in the B+ tree index need to store the complete string of the column, and the longer the string, the larger the storage space occupied in the index.
  • If the strings stored in the index column in the B+ tree index are very long, it will take more time to compare the strings.

We said earlier that the string prefixes of an index column are themselves sorted, so the index designers came up with a scheme: index only the first few characters of the string, that is, keep only the first few characters of the string in the secondary index records. With this scheme, a search cannot pinpoint the exact record, but it can locate the position of the matching prefix; the complete string values of the records with that prefix are then fetched from the table by primary key and compared. Storing only the encoding of the first few characters in the B+ tree saves space and shortens string comparisons, while the index entries remain roughly in order. For example, the following table creation statement indexes only the first 10 characters of the c2 column:

create table demo7(
	c1 int not null auto_increment,
	c2 varchar(11) not null,
	c3 varchar(11) not null,
	c4 char(11) not null,
	c5 varchar(11) not null,
	primary key(c1), key idx_c2_c3_c4(c2(10),c3,c4) 
);  

c2(10) means that only the encoding of the first 10 characters of c2 is kept in the B+ tree index. This strategy of indexing only a prefix of the string value is highly encouraged, especially when the string type can hold many characters.
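How long should the prefix be? One common approach, sketched below, is to compare how many distinct values each candidate prefix length keeps relative to the full column. The lengths 3, 5, and 10 and the column aliases are arbitrary choices for illustration.

```sql
-- Ratio of distinct values to rows for several prefix lengths of c2.
-- The closer a prefix ratio is to sel_full, the less precision the
-- prefix index gives up.
select count(distinct left(c2, 3)) / count(*)  as sel_3,
       count(distinct left(c2, 5)) / count(*)  as sel_5,
       count(distinct left(c2, 10)) / count(*) as sel_10,
       count(distinct c2) / count(*)           as sel_full
from demo7;
```

The shortest length whose ratio is already close to sel_full is usually a reasonable candidate for the number written in c2(N).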

Effect of index column prefix on sorting

If an index column prefix is used, for example only the first 10 characters of column c2 are placed in the secondary index, the following query becomes a bit awkward:

select * from demo7 order by c2 limit 10;

Because the secondary index does not contain the complete c2 values, records whose first ten characters are the same but whose later characters differ cannot be ordered from the index alone. In other words, an index column prefix cannot support sorting by index; only filesort can be used.
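This can be observed in the execution plan. The sketch below assumes demo7 was created with the prefix index idx_c2_c3_c4(c2(10),c3,c4) shown earlier; in that case the Extra column of the explain output is expected to contain "Using filesort".

```sql
-- Assumption: the only index on c2 is the prefix index c2(10).
-- The index cannot deliver rows in full c2 order, so MySQL has to
-- sort the rows itself ("Using filesort" in the Extra column).
explain select * from demo7 order by c2 limit 10;
```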

4.5 Let the index column appear alone in the comparison expression

Suppose there is an integer column my_col in the table, and we have built an index for this column. Although the two where clauses below have the same semantics, they differ in efficiency:

  • where my_col * 2 < 4
  • where my_col < 4/2

In the first where clause, my_col does not appear as a standalone column but inside the expression my_col * 2. The storage engine has to traverse all records in turn and compute whether the value of this expression is less than 4, so the B+ tree index built for my_col cannot be used. In the second where clause, my_col does appear as a standalone column, so the B+ tree index can be used directly.

So the conclusion is: if the index column does not appear as a standalone column in the comparison expression, but inside an expression or a function call, the index cannot be used.
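The difference between the two where clauses can be checked with explain. The table demo_expr below is hypothetical and exists only for this sketch; the note column is included so that select * cannot be answered from idx_my_col alone.

```sql
-- Hypothetical table for comparing the two forms of the condition.
create table demo_expr(
	id int not null auto_increment,
	my_col int not null,
	note varchar(20) not null,
	primary key(id),
	key idx_my_col(my_col)
);

-- my_col is wrapped in an expression: the plan typically shows a
-- full table scan (type = ALL, key = NULL).
explain select * from demo_expr where my_col * 2 < 4;

-- my_col stands alone: idx_my_col becomes a candidate for a
-- range scan (it appears in possible_keys).
explain select * from demo_expr where my_col < 4 / 2;
```

Which plan the optimizer finally picks also depends on the data in the table, so the outputs are best compared on a table that already holds a realistic number of rows.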

4.6 Primary key insertion order

We know that for a table using the InnoDB storage engine, even when we do not explicitly create any index, the data in the table is stored in the leaf nodes of the clustered index. Records are stored in data pages, and both the data pages and the records within them are sorted by primary key value in ascending order. So if the primary key values of the records we insert increase steadily, we simply fill one data page and then move on to the next. If the primary key values we insert jump around, sometimes large and sometimes small, things get more troublesome. Suppose a data page is already full and holds primary key values between 1 and 100, and we now insert a record whose primary key value is less than 100 and so has to go into this page. We have to split the current page into two and move some of its records to the newly created page. What do page splits and record moves mean? Performance loss! To avoid this unnecessary cost as much as possible, it is best to make the primary key values of inserted records increase steadily. So we suggest giving the primary key the auto_increment attribute and letting the storage engine generate primary key values itself instead of inserting them manually. For example, we can define the demo7 table like this:

create table demo7(
	c1 int not null auto_increment,
	c2 varchar(11) not null,
	c3 varchar(11) not null,
	c4 char(11) not null,
	c5 varchar(11) not null,
	primary key(c1), key idx_c2_c3_c4(c2(10),c3,c4) 
); 

The primary key column c1 has the auto_increment attribute, so the storage engine automatically fills in an auto-incremented primary key value for us when records are inserted.
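In practice this simply means leaving c1 out of the insert statement. The following is a small sketch against the demo7 table defined above; the inserted values are arbitrary sample data.

```sql
-- c1 is omitted, so an increasing value is generated for each row
-- and new records are appended at the end of the clustered index.
insert into demo7(c2, c3, c4, c5) values
	('a', 'b', 'c', 'd'),
	('e', 'f', 'g', 'h');

-- Returns the first value generated by the most recent insert
-- on this connection.
select last_insert_id();

-- The generated primary key values can be seen here.
select c1, c2 from demo7 order by c1;
```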

4.7 Redundant and Duplicate Indexes

Sometimes people intentionally or unintentionally create multiple indexes on the same column, for example with a table creation statement like this:

create table demo7(
	c1 int not null auto_increment,
	c2 varchar(11) not null,
	c3 varchar(11) not null,
	c4 char(11) not null,
	c5 varchar(11) not null,
	primary key(c1), 
	key idx_c2_c3_c4(c2(10),c3,c4),
	key idx_c2(c2(10))
); 

We know that the c2 column can be quickly searched through the idx_c2_c3_c4 index, and creating an index specifically for the c2 column is considered a redundant index. Maintaining this index will only increase the maintenance cost and will not benefit the search.
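For a table that already exists with this definition, the redundant index can be inspected and removed. A minimal sketch:

```sql
-- List all indexes on demo7. idx_c2 and the first column of
-- idx_c2_c3_c4 both show c2 with the same prefix length.
show index from demo7;

-- Drop the redundant single-column index. Searches on c2 can
-- still use idx_c2_c3_c4.
alter table demo7 drop index idx_c2;
```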

In another case, we may repeatedly create an index on a column, for example like this:

create table demo7(
	c1 int primary key,
	c2 int,
	unique uidx_c1(c1),
	index idx_c1(c1)
);

Here c1 is already the primary key, yet a unique index and an ordinary index are also defined on it. The primary key itself produces the clustered index, so the unique index and the ordinary index are duplicates. This situation should be avoided.
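If a table like this already exists, both duplicate indexes can be dropped in one statement. The sketch below also queries the schema_redundant_indexes view of the sys schema, which can help locate such indexes across a database. It assumes a MySQL version where the sys schema is installed, and the view may not report every case described here.

```sql
-- Candidate redundant indexes in the current database, together
-- with a ready-made drop statement for each (requires the sys schema).
select table_name, redundant_index_name, dominant_index_name, sql_drop_index
from sys.schema_redundant_indexes
where table_schema = database();

-- Remove both duplicates of the primary key on c1.
alter table demo7 drop index uidx_c1, drop index idx_c1;
```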

Summary

The above are just some of the points to keep in mind when creating and using B+ tree indexes. We will introduce more optimization methods and precautions later, so stay tuned. The content of this article is summarized as follows:

  • B+ tree indexes have a cost in both space and time, so don't create an index without a good reason.
  • B+ tree indexes are suitable for the following situations:
    • Full value match
    • Matching the leftmost columns
    • Matching a range of values
    • Exact match on one column and range match on another
    • Sorting
    • Grouping
  • When using indexes, you need to pay attention to the following items:
    • Index only the columns used for searching, sorting or grouping
    • Create indexes for columns with large cardinality
    • The type of index column should be as small as possible
    • It is possible to index only prefixes of string values
    • Indexes can only be applied if the index column appears alone in the comparison expression
    • In order to minimize the occurrence of page splits and record shifts in the clustered index, it is recommended that the primary key have the auto_increment attribute.
    • Locate and drop duplicate and redundant indexes on tables
    • Try to use the covering index for query to avoid the performance loss caused by returning to the table.

That is all for today's study. I hope you become an indestructible version of yourself.

You can’t connect the dots looking forward; you can only connect them looking backwards. So you have to trust that the dots will somehow connect in your future. You have to trust in something - your gut, destiny, life, karma, whatever. This approach has never let me down, and it has made all the difference in my life.


Origin blog.csdn.net/liang921119/article/details/130647022