The pitfalls of MySQL indexes: you only know them once you've stepped in them...

Indexes are the beating heart of a database. A database without indexes loses much of its point; it is hardly different from an ordinary file. A good index is therefore especially important to a database system. Today, let's talk about MySQL indexes: starting from the details and from real business scenarios, we'll look at the benefits of the B+ tree index in MySQL and the points we need to watch when using indexes.

1. Rational use of indexes

At work, the most direct test for whether a field in a table needs an index is whether it often appears in our WHERE conditions. Broadly speaking, that heuristic is fine, but in the long run a more detailed question is sometimes needed: is a single-column index on this field really enough, or would a joint index over several fields serve better? Take a user table as an example: its fields might include the user's name, the user's ID card number, the user's home address, and so on.


1. Disadvantages of ordinary indexes

Suppose we now need to look up a user's name by their ID card number. The first method that comes to mind is obviously to create an index on id_card; strictly speaking, a unique index, since ID card numbers must be unique. Then we execute the following query:

SELECT name FROM user WHERE id_card=xxx

Its flow should be like this:

  • First search the id_card index tree to find the primary key id corresponding to that id_card
  • Then search the primary key index with that id to find the corresponding name

The result is correct, but the efficiency is questionable: this query walks two B+ trees. Assuming each tree has a height of 3, the two together are 6 levels. Since the root nodes stay in memory (two root nodes here), about 4 of those levels require disk IO. At roughly 10 ms per random disk IO, that comes to about 40 ms in total. This is average at best, not fast.
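To make this concrete, here is a minimal sketch of the table assumed in this article (the column types and index name are my assumptions, not from the original system), plus the query whose plan reveals the two-tree lookup:

CREATE TABLE user (
  id      INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
  name    VARCHAR(64)  NOT NULL,
  id_card VARCHAR(18)  NOT NULL,
  address VARCHAR(128) NOT NULL,
  UNIQUE KEY uk_id_card (id_card)
) ENGINE=InnoDB;

-- Walks uk_id_card to get the primary key id, then the clustered
-- index to fetch name (the "back to table" lookup described above):
EXPLAIN SELECT name FROM user WHERE id_card = '341124199408203232';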


2. The trap of the primary key index

Since the problem is the back-to-table lookup, which forces a search in both trees, the core question is whether we can search only one tree. From a business angle, you may have spotted an entry point here: the ID card number is unique, so why use the default auto-increment ID as the primary key at all? If we make the ID card number the primary key, the whole table needs only one index, and all the required data, including the name, can be found straight from the ID card number. It sounds reasonable at first: just set the id to the ID card number on every insert. But think more carefully and there seems to be a problem.

Start from the characteristics of the B+ tree: its data is stored on the leaf nodes, and the data is managed in pages of 16 KB each. What does this mean? Even a single row occupies a 16 KB data page, and a new page is allocated only when the current one fills up. The new page and the old page need not be physically adjacent; the key point is that although data pages may be physically discontinuous, the data remains logically ordered.


Maybe you're curious what this has to do with using the ID card number as the primary key. Pay attention to the keyword "continuous". ID card numbers are not sequential, and when we insert an out-of-order value, data must be moved to maintain the order. For example, a page already holds 1 -> 5, and now 3 is inserted: 5 has to shift to sit behind 3. You might say that's not much overhead, but if inserting 3 fills page A, then it depends on whether the following page B has room. If B has room, the entry that overflows from A becomes the new head of page B, and B's existing data must shift accordingly. If B has no room either, a new page C must be allocated, part of the data moved onto it, and C linked in between A and B: the link from A to B is cut and rerouted through C. At the code level, this is just switching pointers in a linked list.


To sum up, non-sequential ID card numbers as primary keys can cause page data movement, random IO, and the overhead of frequently allocating new pages. With an auto-increment primary key, the ids are strictly sequential, there is no data movement or random IO on insert, and the insertion cost is clearly lower.

In fact, there is another reason not to use the ID card number as the primary key: as a number it is too large and must be stored as a bigint, whereas an int is normally plenty (enough for all the students in a school, say). We know a page holds 16 KB; the more space each index entry occupies, the fewer entries fit on one page. With a given amount of data, bigint keys therefore need more pages than int keys, that is, more storage. Moreover, every secondary index stores the primary key in its leaf nodes, so a fat primary key bloats every index on the table.

3. The Spear and Shield of Joint Indexing

From the above, two conclusions can be drawn:

  • Try not to go back to the table
  • The ID card number is not suitable as a primary key

So a joint index naturally comes to mind: create a joint index of [ID card number + name]. Pay attention to the column order of a joint index; it must comply with the leftmost-prefix principle. Then when we execute the following SQL:


select name from user where id_card=xxx

we can get the name field we need without going back to the table. However, the problem that the ID card number itself takes too much space is still unsolved; that is a problem of the business data itself. To mitigate it, we can use a conversion algorithm to turn the large value into a small one, such as crc32:


crc32.ChecksumIEEE([]byte("341124199408203232")) // Go, from the hash/crc32 package; returns a 4-byte uint32

The ID card number that originally needed 8 bytes of storage can thus be replaced with a 4-byte crc code. The database needs an extra column crc_id_card, and the joint index changes from [ID card number + name] to [crc32(ID card number) + name], so the joint index takes less space (a SQL sketch of the scheme follows the cost list below). But this conversion also comes at a cost:

  • Every write computes an extra crc, using more CPU
  • The extra column shrinks the index, but it also takes up space itself
  • crc32 can collide, so after the query we must filter the rows by id_card; the filtering cost depends on the number of duplicates. The more collisions, the slower the filtering.
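As promised, here is a minimal SQL sketch of the scheme, assuming the user table from earlier; note that MySQL's built-in CRC32() function computes the same 4-byte checksum as the Go snippet:

ALTER TABLE user ADD COLUMN crc_id_card INT UNSIGNED NOT NULL DEFAULT 0;
UPDATE user SET crc_id_card = CRC32(id_card);
ALTER TABLE user ADD INDEX idx_crc_name (crc_id_card, name);

-- Because crc32 can collide, the real id_card is re-checked as well:
SELECT name FROM user
WHERE crc_id_card = CRC32('341124199408203232')
  AND id_card = '341124199408203232';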

On joint-index storage optimization, here is one more small detail. Suppose there are two fields A and B, occupying 8 and 20 bytes respectively. With the joint index [A, B] already in place, we also want to support querying by B alone, so naturally we create an index on B as well; the two indexes then occupy 8+20+20 = 48 bytes of key data per row. Either an A query or a B query can now use an index. But if the business allows it, we could instead build [B, A] plus an index on A. Queries by A alone or by B alone can still each use an index, and the keys occupy less space: 20+8+8 = 36.
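A sketch of the two layouts, assuming a table t where column A takes 8 bytes and column B takes 20 (the table and index names are illustrative):

-- Layout 1: (A, B) plus a standalone index on B: 8+20+20 = 48 bytes of keys per row
ALTER TABLE t ADD INDEX idx_a_b (A, B), ADD INDEX idx_b (B);

-- Layout 2: (B, A) plus a standalone index on A: 20+8+8 = 36 bytes per row,
-- and queries by A alone or B alone can still each use an index
ALTER TABLE t ADD INDEX idx_b_a (B, A), ADD INDEX idx_a (A);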

4. Short and concise prefix index

Sometimes the field we need to index is a string type, and a very long one. We want an index on it but don't want the index to take too much space. In this case we can consider a prefix index: build the index on the first few characters of the field, so we enjoy the index while saving space. Note that when the prefix is highly repetitive, a prefix index will be noticeably slower than an ordinary index.

alter table xx add index(name(7)); # index on the first 7 characters of name
select xx from xx where name="JamesBond"
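One way to pick a sensible prefix length is to compare the prefix's selectivity with the full column's; a sketch against the xx table above:

SELECT
  COUNT(DISTINCT name) / COUNT(*)          AS full_selectivity,
  COUNT(DISTINCT LEFT(name, 7)) / COUNT(*) AS prefix7_selectivity
FROM xx;
-- The closer prefix7_selectivity is to full_selectivity, the less the prefix
-- index gives up; if it is much lower, try a longer prefix.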

5. The speed and slowness of unique indexes

Before talking about unique indexes, let's first understand the behavior of ordinary indexes. We know that in a B+ tree, the data in the leaf nodes is ordered.


Suppose we want to query the value 2. When the index tree locates a 2, the storage engine does not stop searching, because there may be more than one 2: it keeps scanning forward along the leaf nodes. After finding a second 2, does it stop? The answer is no. The engine cannot know whether more 2s follow, so it scans until it hits the first value that is not 2, namely 3, and only then stops the retrieval. This is the search process of an ordinary index.

A unique index is different. Because of its uniqueness, duplicate data is impossible, so once the target is found it returns directly, without the extra forward scan of an ordinary index. In this respect the unique index is faster than the ordinary index, though when all the ordinary index's matching entries sit on one page the difference is negligible. For inserts, the unique index may be slightly worse: uniqueness means every insert must first check whether the value already exists, logic an ordinary index skips. And, importantly, a unique index cannot use the change buffer (see below).

6. Don’t blindly add indexes

At work, you may run into the question: does this field need an index? The common test is whether queries will use the field; if it often appears in query conditions, we consider indexing it. But judging by that condition alone, you may add a useless index. An example: suppose a user table has about one million rows, with a sex field marking male or female, split roughly half and half. Now we want to pull the information of all male users, so we index the sex field and write the SQL like this:

select * from user where sex="男"

If nothing unexpected happens, InnoDB will not choose the sex index. If it used the index, every hit would require a back-to-table lookup; with data at this scale, what would the consequence be?


The main cost is a huge amount of IO: recall from earlier that one row costs about 4 IOs, and here there are 500,000 matching rows. The result is predictable. So in this situation the MySQL optimizer will most likely take a full table scan, scanning the primary key index directly, because that probably performs better.
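You can verify the optimizer's choice yourself; a sketch, assuming the index was created with the name idx_sex:

-- With half the table matching, EXPLAIN normally reports type=ALL (full scan):
EXPLAIN SELECT * FROM user WHERE sex = '男';

-- Forcing the index makes the back-to-table cost easy to measure and compare:
SELECT * FROM user FORCE INDEX (idx_sex) WHERE sex = '男';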

7. Index failures

In some cases, MySQL does not use an index because of our own misuse; this commonly happens with type conversion. You may say: doesn't MySQL already support implicit conversion? For example, there is an integer user_id field with an index, and because we didn't pay attention when querying, we wrote:

select xx from user where user_id="1234"

Note that the 1234 here is a string, and in this case MySQL really is smart enough to convert the string '1234' to the number 1234 and happily use the user_id index. But if user_id is a string field with an index, and because we didn't pay attention when querying, we wrote:

select xx from user where user_id=1234

then there is a problem: the index will not be used. You may ask why MySQL doesn't convert the other way this time, turning the number 1234 into the string '1234'. The conversion rule has to be explained here: when comparing a string with a number, remember, MySQL converts the string to a number.

You may ask again: why does converting the string user_id field into a number prevent index use? This comes down to the structure of the B+ tree index: the tree branches and sorts by the index values. If we apply a conversion to the indexed field, the values change. For example, a stored value A becomes some value B after integer conversion (int(A)=B). The index tree was built and ordered according to A, not B, so it can no longer be used, and the index is abandoned.
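A sketch of both cases, assuming user_id here is an indexed VARCHAR column; EXPLAIN makes the difference visible:

-- The bare number forces MySQL to convert the *column* values to numbers,
-- so the index tree (ordered by strings) cannot be used: type=ALL
EXPLAIN SELECT * FROM user WHERE user_id = 1234;

-- Quoting the literal keeps the comparison on the indexed strings: type=ref
EXPLAIN SELECT * FROM user WHERE user_id = '1234';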

2. Index optimization

1. change buffer

We know that when updating a row, we first check whether the page holding that row is in memory. If it is, we update the in-memory page directly; if not, we have to read the corresponding data page from disk into memory and then update it. What's the problem with that?

  • Reading from disk is slow
  • If many rows are updated at once, there may be a lot of discrete (random) IO

To solve this speed problem, the change buffer was introduced. First of all, don't be misled by the word "buffer": besides living in the shared buffer pool, the change buffer is also persisted to disk. With the change buffer, if during an update the target data page is not in memory, we do not read it from disk; instead, the pending change is recorded in the change buffer. When is the change buffer synchronized to disk? And what if a read happens first? A background thread periodically merges the change buffer's data to disk; and if a read of the page occurs before the thread gets to it, the read itself also triggers the merge of the change buffer's data for that page.


Note that not all indexes can use the change buffer. Primary key and unique indexes cannot: because of uniqueness, an update must check whether the value already exists, so if the data page is not in memory it must be read from disk anyway. An ordinary index has no uniqueness to verify, so it doesn't matter.

The larger the change buffer, the greater the theoretical benefit: first, there are fewer discrete read IOs; second, when several changes land on one data page, only one merge to disk is needed. Of course, not every scenario suits the change buffer. If your business reads right after it writes, the change buffer backfires: the merge action is triggered constantly, random IO does not decrease, and on top of that you pay the overhead of maintaining the change buffer.
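The change buffer is tunable; these are real InnoDB variables, and the values shown are just examples:

-- Which operations are buffered: none / inserts / deletes / changes / purges / all
SHOW VARIABLES LIKE 'innodb_change_buffering';

-- Maximum size as a percentage of the buffer pool (the default is 25):
SET GLOBAL innodb_change_buffer_max_size = 25;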

2. Index pushdown

We mentioned joint indexes earlier. A joint index must satisfy the leftmost principle; that is, with a joint index on [A, B], the following SQL can use the index:

select * from table where A="xx"
select * from table where A="xx" AND B="xx"

In fact, a joint index can also use the leftmost prefix of a column's value, for example:

select * from table where A like "赵%" AND B="上海市"

Note, though, that because only part of A is used, before MySQL 5.6 the SQL above would, after retrieving every row whose A starts with "赵", immediately go back to the table (we selected *), and only then check whether B equals "上海市". A bit puzzling, isn't it? Why not check B directly on the joint index, so that fewer back-to-table lookups are needed? The cause is the leftmost prefix itself: although the index can use part of A, it doesn't touch B at all, which looks a bit "dumb". So MySQL 5.6 introduced index condition pushdown (ICP): even under a leftmost-prefix match, MySQL can both search for A% and filter out non-B rows on the joint index, which greatly reduces the number of back-to-table lookups.
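ICP shows up in EXPLAIN's Extra column as "Using index condition". A sketch, assuming a joint index over (name, address) playing the role of [A, B]; the optimizer_switch flag is real and on by default since 5.6:

SET optimizer_switch = 'index_condition_pushdown=on';
EXPLAIN SELECT * FROM user WHERE name LIKE '赵%' AND address = '上海市';
-- Extra: "Using index condition" means the address filter ran inside the index.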


3. Flushing adjacent pages

Before we talk about flushing adjacent pages, let's talk about dirty pages. We know that when updating a row, we first check whether the page holding that row is in memory; if not, we read the page into memory first, then update the data there. The in-memory page then holds the latest data while the page on disk still holds old data: that in-memory page is a dirty page, and it needs to be flushed to disk for consistency.

So the question is: when to flush, and how many dirty pages each time? Flushing on every change would hurt performance badly; flushing too rarely piles up dirty pages, leaving fewer usable pages in the buffer pool and affecting normal operation. So flushing must be neither too aggressive nor too late. MySQL has a cleaner thread that runs periodically, ensuring it is not too fast; and when there are too many dirty pages or the redo log is nearly full, a flush is triggered immediately, ensuring timeliness.


While flushing dirty pages, InnoDB has an optimization: if the neighbor pages of the dirty page being flushed are also dirty, it flushes them together. The advantage is less random IO; on mechanical disks this optimization should be quite significant. But there may be a pit here: if the neighbor pages we just flushed along are immediately dirtied again by new writes, the combined flush was superfluous, wasting time and effort. Worse, if the neighbors of those neighbor pages are also dirty... this chain reaction can cause short-lived performance problems.
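Neighbor flushing is controlled by a real InnoDB variable; on SSDs, where random IO is cheap, turning it off is common advice:

SHOW VARIABLES LIKE 'innodb_flush_neighbors';
-- 1 = also flush contiguous dirty neighbors, 0 = flush only the page itself
-- (the default is 1 in MySQL 5.7 and 0 in MySQL 8.0):
SET GLOBAL innodb_flush_neighbors = 0;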

4. MRR

In actual business, we may be told to use covering indexes as much as possible and avoid going back to the table, because back-to-table lookups cost more IO and take longer; more seriously, much of that IO is discrete (random). But sometimes we simply have to go back to the table:

select * from user where grade between 60 and 70

We want to query the users whose grade is between 60 and 70, so the SQL is written as above; the grade field is of course indexed. By common sense, the flow is: find the first grade=60 entry on the grade index, fetch the full row from the primary key index using its id, return to the grade index, and repeat the same action for the next entry...

Suppose grade=60 corresponds to id=1, whose row is on page_no_1; grade=61 corresponds to id=10, on page_no_2; and grade=62 corresponds to id=2, back on page_no_1. The real access pattern is: read page_no_1, switch to page_no_2, then switch back to page_no_1. Yet the id=1 and id=2 reads could have been combined into a single read of page_no_1, saving IO and avoiding random IO. That is exactly what MRR (Multi-Range Read) does: with MRR, the secondary index does not go back to the table immediately; instead it collects the primary key ids into a buffer, sorts them, and then reads the primary key index sequentially, greatly reducing discrete IO.
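MRR is visible in EXPLAIN's Extra column as "Using MRR"; these optimizer_switch flags are real (mrr_cost_based=off makes the optimizer prefer MRR whenever it applies):

SET optimizer_switch = 'mrr=on,mrr_cost_based=off';
EXPLAIN SELECT * FROM user WHERE grade BETWEEN 60 AND 70;
-- The buffer that holds and sorts the ids is bounded by read_rnd_buffer_size.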

Author丨Master Kong
Source丨Public Account: Pretend to understand programming (ID: suntalkrobot)
