Database indexing: everything you need to know

Overview

A database index is like the table of contents at the front of a book: it speeds up queries against the database. An index is a structure that sorts the values of one or more columns in a database table (for example, the 'name' column of the User table). If you want to find a specific user by name, the index helps you get there faster than searching every row in the table.

Indexing has the following advantages:

  • Greatly speeds up data retrieval;
  • A unique index guarantees the uniqueness of each row of data in the table;
  • Speeds up joins between tables;
  • When grouping and sorting clauses are used in data retrieval, the time spent grouping and sorting can be significantly reduced.

Of course, there are advantages and disadvantages. The disadvantages of indexes are as follows:

  • Indexes occupy physical storage space in addition to the data table;
  • Creating and maintaining indexes takes time;
  • When the table is updated, its indexes must be maintained as well, which slows down data modification.

Common index types

Indexes exist to improve query efficiency, but there are many ways to implement them, which is why we introduce the concept of an index model here. Many data structures can improve read and write efficiency; we will cover three common and relatively simple ones: the hash table, the ordered array, and the search tree.

Hash table

A hash table is a structure that stores data as key-value pairs. We only need to supply the value to be looked up, the key, to find its corresponding value. The idea of hashing is very simple: put the values in an array, use a hash function to convert each key into a position, and store the value at that position in the array.

Since the hash function may convert multiple keys to the same position, conflicts can occur. The way to resolve them is to chain a linked list at that position. This approach is similar to HashMap in Java.
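The chaining idea above can be sketched in a few lines of Python. This is a toy model; the key names and bucket count are illustrative, not InnoDB's or HashMap's actual implementation:

```python
# A minimal hash index with chaining: each bucket is a list that
# plays the role of the collision linked list.

class HashIndex:
    def __init__(self, buckets=8):
        self.buckets = [[] for _ in range(buckets)]

    def _slot(self, key):
        # hash function maps a key to a position in the array
        return hash(key) % len(self.buckets)

    def put(self, key, value):
        # appended at the end: inserts are fast, order is not kept
        self.buckets[self._slot(key)].append((key, value))

    def get(self, key):
        # traverse the bucket's chain in order, like finding User2
        for k, v in self.buckets[self._slot(key)]:
            if k == key:
                return v
        return None

idx = HashIndex()
idx.put("ID_card_n1", "User1")
idx.put("ID_card_n2", "User2")
print(idx.get("ID_card_n2"))  # -> User2
```

Note that a range query over the keys would have to scan every bucket, which is exactly the weakness discussed below.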

Suppose you are maintaining a table of ID numbers and names, and you need to look up a name by ID number. The corresponding hash index is diagrammed as follows:
(Figure: hash index mapping ID numbers to names, with a collision chain at position N)

In the figure, User2 and User4 both hash to position N based on their ID numbers, so that position becomes a linked list. Suppose you now want to find the name corresponding to ID_card_n2. The steps are: first, hash ID_card_n2 to obtain N; then traverse the linked list in order until User2 is found.

Note that the four ID_card_n values in the figure are not in increasing order. The advantage is that adding a new User is very fast: the entry is simply appended. The disadvantage is that, because the keys are unordered, a hash index is very slow for range queries.

You can imagine that if you are looking for all users whose ID numbers are in the range of [ID_card_X, ID_card_Y], you must scan them all.

Therefore, the hash table structure is only suitable for scenarios with equality queries (such as = and IN()), for example Memcached and other NoSQL engines.

Ordered array

Ordered arrays perform very well in both equality-query and range-query scenarios. Returning to the example of looking up a name by ID number, an ordered-array implementation is diagrammed as follows:
(Figure: ordered-array index on the ID-number table)

The array here is stored in increasing order of ID number. If you want to find the name corresponding to ID_card_n2, binary search gets it quickly, with time complexity O(log(N)).

This index structure obviously also supports range queries. To find all Users whose ID numbers fall in [ID_card_X, ID_card_Y], first use binary search to find ID_card_X (or, if it does not exist, the first ID greater than ID_card_X), then traverse to the right until the first ID number greater than ID_card_Y is reached, and exit the loop.
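Both lookups described above can be sketched with Python's bisect module. The ID values and names here are illustrative:

```python
# Ordered-array index: equality lookup and range query via binary search.
import bisect

ids = ["ID_card_a", "ID_card_c", "ID_card_e", "ID_card_g"]  # sorted keys
names = ["User1", "User2", "User3", "User4"]

def lookup(id_card):
    # equality query: O(log N) binary search
    i = bisect.bisect_left(ids, id_card)
    if i < len(ids) and ids[i] == id_card:
        return names[i]
    return None

def range_query(lo, hi):
    # find the first id >= lo, then scan right until id > hi
    i = bisect.bisect_left(ids, lo)
    out = []
    while i < len(ids) and ids[i] <= hi:
        out.append(names[i])
        i += 1
    return out

print(lookup("ID_card_c"))                     # -> User2
print(range_query("ID_card_b", "ID_card_f"))   # -> ['User2', 'User3']
```

An insertion into the middle of `ids`, by contrast, would shift every later element, which is the update cost discussed next.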

If you only look at query efficiency, an ordered array is the best data structure. However, it is troublesome when you need to update the data. If you insert a record in the middle, you have to move all subsequent records, which is too costly.

Therefore, an ordered-array index is only suitable for static storage engines, for example storing all the population information of a certain city in 2017: data of this kind will never be modified.

Search tree

The binary search tree is another classic textbook data structure. Returning to the example of looking up a name by ID number, a binary search tree implementation is diagrammed as follows:
(Figure: binary search tree over the ID-number table)

The characteristic of a binary search tree is that each node's left child is smaller than the node, which in turn is smaller than its right child. Thus, to look up ID_card_n2 in the figure, the search follows the path UserA -> UserC -> UserF -> User2, with time complexity O(log(N)).

From the perspective of queries alone, the binary tree is the most efficient to search, but in practice most database storage does not use binary trees. Instead, databases choose N-ary trees such as the B-tree and B+ tree, in which each node can have multiple children. The reasons are as follows:

When the amount of data is large, an index requires a great deal of storage, so keeping the whole index in memory is clearly unrealistic. The index is therefore stored on disk and read into memory when used; this is disk IO.

When using an index to query data, the computer reads the index from disk into memory one disk page at a time, and each disk page corresponds to one node of the tree. Visiting a tree node therefore costs one disk IO, and each disk IO takes roughly 8-10 ms. That overhead is huge, so the shorter the tree, the fewer nodes a query has to traverse and the less time it takes. This is why we use N-ary trees rather than binary trees!

Let's use a picture to see the difference between a binary tree and a polytree:
(Figure: height comparison of a binary tree and an N-ary tree)

Obviously, the more children each node can hold, the shorter the tree will be.

In InnoDB, an index node on an integer field can hold roughly 1200 children. A B+ tree of height 4 can therefore store about 1200 to the third power values, which is already 1.7 billion. Considering that the root block of the tree is always in memory, an index on an integer field of a 1-billion-row table requires at most 3 disk accesses to find a value. In fact, the second level of the tree is also very likely to be in memory, so the average number of disk accesses is even lower.
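These figures can be sanity-checked with back-of-envelope arithmetic. The 16 KB page size and per-entry byte counts below are assumptions for illustration, not exact InnoDB internals:

```python
# Fan-out estimate: one tree node = one disk page; each internal entry
# holds a key plus a child pointer.
page_size = 16 * 1024       # assumed InnoDB default page size (16 KB)
entry_size = 8 + 6          # ~8-byte bigint key + ~6-byte page pointer
fanout = page_size // entry_size
print(fanout)               # ~1170, close to the quoted 1200

# A tree of height 4 (root + two internal levels + leaves) addresses
# roughly fanout^3 leaf entries:
print(1200 ** 3)            # 1_728_000_000, i.e. ~1.7 billion
```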

InnoDB's index model

In MySQL, indexes are implemented at the storage engine layer, so there is no uniform index standard, that is, indexes of different storage engines work differently. And even if multiple storage engines support the same type of index, the underlying implementation may be different. Since the InnoDB storage engine is the most widely used in MySQL database, I will take InnoDB as an example to analyze the index model with you.

In InnoDB, tables are stored as indexes ordered by primary key; tables stored this way are called index-organized tables. And since, as mentioned earlier, InnoDB uses the B+ tree index model, the data is stored in B+ trees.

Each index corresponds to a B+ tree in InnoDB.

Suppose, we have a table with a primary key column of ID, a field k in the table, and an index on k.

The table building statement for this table is:

mysql> create table T(
id int primary key,
k int not null,
name varchar(16),
index (k))engine=InnoDB;

The (ID, k) values of rows R1~R5 in the table are (100,1), (200,2), (300,3), (500,5), and (600,6) respectively. The two trees are illustrated below.

(Figure: the primary key index tree and the k index tree of table T)

It is not difficult to see from the figure that according to the content of the leaf nodes, the index types are divided into primary key indexes and non-primary key indexes.

The leaf node of the primary key index stores the entire row of data. In InnoDB, the primary key index is also called a clustered index (clustered index).

The content of the leaf node of the non-primary key index is the value of the primary key. In InnoDB, non-primary key indexes are also called secondary indexes.

According to the above index structure description, let's discuss a question: What is the difference between a query based on a primary key index and an ordinary index?

If the statement is select * from T where ID=500, i.e. a primary-key query, only the ID B+ tree needs to be searched;
if the statement is select * from T where k=5, i.e. an ordinary-index query, the k index tree must be searched first to obtain the ID value 500, and then the ID index tree is searched once more. This process is called returning to the table.

In other words, queries based on non-primary key indexes need to scan one more index tree. Therefore, we should try to use primary key queries in our applications.
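The two query paths can be modeled as a toy in Python: the primary index maps ID to the whole row, and the secondary index maps k to the primary key. The name values ('n1' etc.) are placeholders, since the example rows only specify ID and k:

```python
# Toy model of table T: primary-index leaves store the entire row,
# secondary-index leaves store only the primary key value.
primary_index = {100: (100, 1, 'n1'), 200: (200, 2, 'n2'),
                 300: (300, 3, 'n3'), 500: (500, 5, 'n5'),
                 600: (600, 6, 'n6')}
secondary_index = {1: 100, 2: 200, 3: 300, 5: 500, 6: 600}

# select * from T where ID=500: one tree search
row = primary_index[500]

# select * from T where k=5: search the k tree, then return to the table
pk = secondary_index[5]       # first search gives ID = 500
row2 = primary_index[pk]      # second search fetches the full row
print(row == row2)            # True, but it cost two tree searches
```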

Index maintenance

To keep the index ordered, the B+ tree must do the necessary maintenance when inserting new values. Taking the figure above as an example, inserting a new row with ID 700 only requires appending a record after R5. Inserting an ID of 400 is more troublesome: the data that follows must be moved logically to make room.

To make matters worse, if the data page where R5 is located is full, according to the algorithm of the B+ tree, you need to apply for a new data page at this time, and then move some data in the past. This process is called page splitting. In this case, performance will naturally suffer.

In addition to performance, page splitting operations also affect the utilization of data pages. The data that was originally placed on one page is now divided into two pages, and the overall space utilization is reduced by about 50%.

Of course, where there is division, there is merger. When the utilization rate of two adjacent pages is very low due to data deletion, the data pages will be merged. The process of merging can be considered the inverse process of the splitting process.

Based on the above description of the index maintenance process, let's discuss a case:

You may have seen in some table-design guidelines the requirement that every table must have an auto-increment primary key. Of course, nothing is absolute; let's analyze which scenarios should use an auto-increment primary key and which should not.

An auto-increment primary key is defined on an auto-increment column; in the table-creation statement it is generally written as: NOT NULL PRIMARY KEY AUTO_INCREMENT.

When inserting a new record, it is not necessary to specify the ID value, and the system will obtain the current ID maximum value plus 1 as the ID value of the next record.

In other words, the data insertion mode of the self-incrementing primary key is in line with the incremental insertion scenario we mentioned earlier. Every time a new record is inserted, it is an append operation, it does not involve moving other records, and does not trigger the split of leaf nodes.

It is often not easy to ensure orderly insertion if fields with business logic are used as primary keys, so the cost of writing data is relatively high.

In addition to considering performance, we can also look at it from the perspective of storage space. Assuming that your table does have a unique field, such as a string type ID number, should I use the ID number as the primary key, or use the auto-increment field as the primary key?

Each leaf node of a non-primary-key index stores the value of the primary key. If the ID number is used as the primary key, each secondary-index leaf node occupies about 20 bytes; an integer primary key needs only 4 bytes, and a long integer (bigint) needs 8 bytes.

Obviously, the smaller the primary key length, the smaller the leaf nodes of the ordinary index, and the smaller the space occupied by the ordinary index.

Therefore, in terms of performance and storage space, auto-incrementing the primary key is often a more reasonable choice.
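The space argument is easy to quantify with rough arithmetic. The 100-million-row figure is an assumption for illustration:

```python
# Secondary-index leaves store the primary key, so the PK size is paid
# once per row in EVERY secondary index on the table.
rows = 100_000_000               # assumed table size
id_number_bytes = rows * 20      # ~20-byte string ID number as PK
bigint_bytes = rows * 8          # 8-byte bigint auto-increment PK

print(id_number_bytes // 10**9)  # -> 2   (~2 GB of key bytes per index)
print(bigint_bytes // 10**6)     # -> 800 (~800 MB per index)
```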

Are there any scenarios suitable for using business fields as primary keys directly? There are still. For example, some business scenario requirements are like this:

  1. Only one index;
  2. The index must be a unique index.

As you may have recognized, this is a typical key-value (KV) scenario. Since there are no other indexes, there is no need to consider the leaf-node size of other indexes.
At this time, we must give priority to the "try to use primary key query" principle mentioned in the previous paragraph, and directly set this index as the primary key to avoid the need to search two trees for each query.

Indexing principle

Before introducing the index principle, let's look at an example:

For table T below, if I execute select * from T where k between 3 and 5, how many tree searches are performed, and how many rows are scanned?

mysql> create table T (
ID int primary key,
k int NOT NULL DEFAULT 0,
s varchar(16) NOT NULL DEFAULT '',
index k(k))
engine=InnoDB;
insert into T values(100,1,'aa'),(200,2,'bb'),(300,3,'cc'),(500,5,'ee'),(600,6,'ff'),(700,7,'gg');

(Figure: the ID and k index trees of table T)

Now, let's take a look at the execution process of this SQL query statement:

  1. Find the record of k=3 on the k index tree, and get ID=300;
  2. Then go to the ID index tree to find R3 corresponding to ID=300;
  3. Take the next value k=5 in the k index tree, and get ID=500;
  4. Go back to the ID index tree and find R4 corresponding to ID=500;
  5. Take the next value k=6 in the k index tree. If the condition is not met, the loop ends.
In this process, going back to the primary key index tree to search is called returning to the table. As you can see, this query reads 3 records on the k index tree (steps 1, 3, and 5) and returns to the table twice (steps 2 and 4).

In this example, the data required by the query is available only on the primary key index, so returning to the table is unavoidable. So, is it possible to avoid returning to the table through index optimization?
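The five steps above can be simulated in Python to count index reads and table returns. This is a toy model of the two trees, not InnoDB itself; row contents follow the insert statement above:

```python
# "select * from T where k between 3 and 5" as an explicit traversal
# of sorted (k, ID) pairs plus an ID -> row primary "index".
import bisect

k_index = [(1, 100), (2, 200), (3, 300), (5, 500), (6, 600), (7, 700)]
primary = {100: (100, 1, 'aa'), 200: (200, 2, 'bb'), 300: (300, 3, 'cc'),
           500: (500, 5, 'ee'), 600: (600, 6, 'ff'), 700: (700, 7, 'gg')}

rows, k_reads, back_to_table = [], 0, 0
i = bisect.bisect_left(k_index, (3, 0))   # step 1: locate k=3 on k tree
while i < len(k_index):
    k, pk = k_index[i]
    k_reads += 1                          # one record read on index k
    if k > 5:                             # step 5: k=6 fails, loop ends
        break
    back_to_table += 1                    # steps 2 and 4: fetch the row
    rows.append(primary[pk])
    i += 1

print(k_reads, back_to_table)             # -> 3 2
```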

Covering index

If the executed statement is select ID from T where k between 3 and 5, then you only need to check the value of ID, and the value of ID is already on the k index tree, so you can directly provide the query results without returning to the table. In other words, in this query, the index k has "covered" our query requirements, and we call it a covering index.

Since a covering index can reduce the number of tree searches and significantly improve query performance, using a covering index is a common performance optimization method.

Note that inside the engine, the covering-index scan actually reads three records on index k (R3~R5, the record items on index k); but to MySQL's server layer, it simply asked the engine for records and got two back, so MySQL considers the number of scanned rows to be 2.
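The effect can be observed with SQLite, whose rowid tables are also index-organized. This is an analogue of the InnoDB behavior rather than InnoDB itself, and the table and index names are illustrative:

```python
# Selecting only ID lets the query run entirely on index k:
# SQLite's query plan reports this as a covering index.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("create table T (ID integer primary key, k int not null)")
con.execute("create index k_idx on T(k)")
con.executemany("insert into T values (?, ?)",
                [(100, 1), (200, 2), (300, 3), (500, 5), (600, 6)])

plan = con.execute(
    "explain query plan select ID from T where k between 3 and 5"
).fetchall()
print(plan)  # plan detail typically mentions: COVERING INDEX k_idx
```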

Based on the above description of covering indexes, let's discuss a question: is it necessary to create a joint index on ID number and name in a citizen-information table?

Assume that the statement for building the citizen information table is as follows:

CREATE TABLE `tuser` (
	`id` int(11) NOT NULL,
	`id_card` varchar(32) DEFAULT NULL,
	`name` varchar(32) DEFAULT NULL,
	`age` int(11) DEFAULT NULL,
	`ismale` tinyint(1) DEFAULT NULL,
	PRIMARY KEY (`id`),
	KEY `id_card` (`id_card`),
	KEY `name_age` (`name`,`age`)
) ENGINE=InnoDB

We know that the ID number is a citizen's unique identifier. In other words, if there is a need to query citizen information by ID number, we only need to build an index on the ID-number field. Isn't creating a joint index of (ID number, name) a waste of space?

If there is a high-frequency request now to query the citizen's name based on his ID number, this joint index will be meaningful. It can use the covering index on this high-frequency request, no longer need to go back to the table to check the entire row of records, reducing the execution time of the statement.

Of course, the maintenance of index fields always has a price. Therefore, it is necessary to weigh and consider when creating redundant indexes to support covering indexes. This is the job of the business DBA, or business data architect.

Leftmost prefix principle

Seeing this, you must have a question. If you design an index for each query, are there too many indexes? What if I want to check the citizen's home address according to his ID number? Although the probability of this query requirement appearing in the business is not high, you can't let it go through a full table scan, right? Conversely, it feels a bit wasteful to create an index of (ID number, address) for an infrequent request. What should I do?

Here, let me state the conclusion first: the B+ tree index structure can use the "leftmost prefix" of an index to locate records.

To illustrate this concept intuitively, we use the (name, age) joint index to analyze.
(Figure: the (name, age) joint index, entries sorted by name, then age)

As you can see, the index items are sorted according to the order of the fields that appear in the index definition.

When your logical requirement is to find all the people whose name is "Zhang San", you can quickly locate ID4, and then traverse backward to get all the required results.

If you are looking for all people whose name starts with "Zhang", the condition of your SQL statement is "where name like 'Zhang%'". In this case you can also use this index: find that the first eligible record is ID3, then traverse backwards until the condition is no longer met.

As you can see, the index can speed up retrieval not only on its full definition: as long as the leftmost prefix is satisfied, the index can be used. The leftmost prefix can be the leftmost N fields of a joint index, or the leftmost M characters of a string index.
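Leftmost-prefix matching can be sketched by modeling the (name, age) index as a sorted list of tuples. The names and IDs here are illustrative:

```python
# Joint-index entries sorted by (name, age); the last element stands in
# for the primary key stored at the leaf.
import bisect

index = sorted([("Li Si", 30, 5), ("Zhang San", 10, 4),
                ("Zhang San", 10, 3), ("Zhang Liu", 30, 6),
                ("Wang Wu", 25, 7)])

def find_by_name(name):
    # full leftmost field: locate the first entry with this name
    i = bisect.bisect_left(index, (name,))
    out = []
    while i < len(index) and index[i][0] == name:
        out.append(index[i][2])
        i += 1
    return out

def find_by_name_prefix(prefix):
    # leftmost M characters of the string field, like name like 'Zhang%'
    i = bisect.bisect_left(index, (prefix,))
    out = []
    while i < len(index) and index[i][0].startswith(prefix):
        out.append(index[i][2])
        i += 1
    return out

print(find_by_name("Zhang San"))      # -> [3, 4]
print(find_by_name_prefix("Zhang"))   # -> [6, 3, 4]
```

A condition on age alone, by contrast, cannot use this ordering at all, which motivates the field-order discussion below.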

Based on the above description of the leftmost prefix index, let's discuss a problem: how to arrange the order of the fields in the index when establishing a joint index.

Our evaluation criterion here is the reusability of the index. Because the leftmost prefix can be supported, when the joint index (a, b) already exists, there is generally no need to create an index on a separately. Therefore, the first principle is that if one less index can be maintained by adjusting the order, then this order is often a priority.

So now you know, in the question at the beginning of this paragraph, we are going to create a joint index of (ID number, name) for high-frequency requests, and use this index to support the requirement of "query address based on ID number".

So, what if there are both joint queries and queries on a and b individually? A query whose condition contains only b cannot use the joint index (a, b), so you have to maintain another index; that is, you need to maintain both the (a, b) and (b) indexes.

At this time, the principle we have to consider is space . For example, in the case of the citizen table above, the name field is larger than the age field, then I suggest you create a joint index of (name, age) and a single-field index of (age).

Index push down

In the previous paragraph, we said that when the principle of the leftmost prefix is ​​satisfied, the leftmost prefix can be used to locate records in the index. At this time, you may want to ask, what will happen to those parts that do not meet the leftmost prefix?

Let's take the joint index (name, age) of the citizen table as an example. Suppose there is now a requirement: retrieve all boys in the table whose name starts with "Zhang" and whose age is 10. The SQL statement is written like this:

mysql> select * from tuser where name like '张%' and age=10 and ismale=1;

You already know the prefix index rules, so when searching the index tree, you can only use "Zhang" to find the first record ID3 that meets the conditions. Of course, this is not bad, better than a full table scan.

and then?

Of course, it is to judge whether other conditions are met.

Before MySQL 5.6, the only option was to go back to the table row by row starting from ID3, find the data row on the primary key index, and compare the field values.

Index condition pushdown, introduced in MySQL 5.6, first evaluates conditions on the fields contained in the index during index traversal, directly filtering out records that do not match and reducing the number of times the engine goes back to the table.

The following is the execution flow chart of these two processes.
(Figure: execution without index condition pushdown)
(Figure: execution with index condition pushdown)

In the first figure, the age value in the (name, age) index is deliberately ignored. InnoDB does not look at age during this process; it simply takes each record whose name starts with "Zhang", in order, and goes back to the table for it. Therefore, it has to return to the table 4 times.

The difference in the second figure is that InnoDB checks whether age equals 10 inside the (name, age) index, and directly skips records where it does not. In our example, only the two records ID4 and ID5 need their rows fetched, so the table is returned to only twice.
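The reduction in table returns can be modeled with a small count over a toy index. The entries mirror the example's four "Zhang" records, two of them aged 10:

```python
# Counting table returns with and without index condition pushdown
# over (name, age, ID) index entries.
entries = [("Zhang San", 10, 4), ("Zhang San", 10, 5),
           ("Zhang Liu", 30, 6), ("Zhang Wei", 20, 7)]
matching = [e for e in entries if e[0].startswith("Zhang")]

# Before 5.6: every name match goes back to the table; age is only
# checked after fetching the row.
no_icp_returns = len(matching)

# With ICP: age is checked inside the index first, non-matches skipped.
icp_returns = len([e for e in matching if e[1] == 10])

print(no_icp_returns, icp_returns)   # -> 4 2
```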

Origin blog.csdn.net/jianzhang11/article/details/105986819