Demystifying database indexes

This article is reproduced from: https://blog.csdn.net/weizhiai12/article/details/68962145

I stumbled across this article while reading about database indexes. I thought it was well written and easy to understand, so I am reposting it here for everyone to learn from and discuss.

Some time ago, a website our company had just launched was suffering from slow page responses. A young woman who was in charge of the project, but was not from a technical background, came to me about it. My first reaction was that the database was the problem. I pretended to think it over for a while, then said with a deep, cool look, "Is it a problem with the database queries? Add an index to the table." She came back with one sentence: "Our website has too much traffic right now; adding an index may degrade performance when writing data and affect our users." I was stunned for a moment. It felt like my act had been seen through: I had just been schooled in my own professional field by a non-specialist colleague.

In fact, I am not telling this story to show off how professionally capable our colleagues are, or how great, secure, and fast our products are, to the point that even non-technical colleagues understand the technical details. I just want to point out that "database" and "database index" are two of the most widely used concepts in server-side development. Skilled use of databases and database indexes is an essential survival skill for developers in this industry, and non-technical people who deal with technical people all day long naturally pick up the vocabulary over time.

Using an index is very simple. If you can write a statement to create a table, you can certainly write one to create an index, and there is no server-side programmer in this world who cannot create a table. However, knowing how to use an index is one thing; deeply understanding how indexes work and using them properly is quite another, and the two are worlds apart (I have not reached that level myself). For most programmers, knowledge of indexes stops at "an index makes queries faster".

  • Why add a primary key to a table?

  • Why does adding an index make queries faster?

  • Why does adding an index make writing, modification, and deletion slower?

  • Under what circumstances do you want to build an index on two fields at the same time?

They may not have answers to these questions. What is the benefit of knowing the answers? If the table used by your application holds only 10,000 rows, it really makes no difference whether you know or not. However, if the application has to handle millions or even hundreds of millions of rows and you do not understand how indexes work, the program you write may not be able to run at all. It is like putting a car engine in a truck: can the truck still haul its load?

Next, I will explain some of the questions raised above, hoping to help readers.

Many articles on the Internet describe an index like this: "An index is like the table of contents of a book; through it, the specific content of the book can be located precisely." This is perfectly correct, but it is like taking off your pants to fart: it amounts to saying nothing. Finding content through the table of contents is naturally faster than flipping through the book page by page. Surely anyone who uses an index already knows that locating data through the index is faster than checking records one by one; otherwise, why would they build an index?

To understand how indexes work, you must know one data structure: the "balanced tree" (not a binary tree), that is, the B-tree or B+ tree. Important things are said three times: "balanced tree, balanced tree, balanced tree". Of course, some databases also use hash buckets as the index data structure. However, mainstream RDBMSs use the balanced tree as the default index data structure for tables.

We usually add a primary key when we create a table. In some relational databases, if no primary key is specified, the database will refuse to execute the create-table statement. In fact, a table with a primary key can hardly be called a "table" any more. In a table without a primary key, the data is stored on disk in no particular order, with the rows laid out neatly one after another, which is very close to what I think of as a "table". Once a primary key is added, the table's on-disk storage structure changes from that neat row-by-row layout to a tree, namely the "balanced tree" mentioned above. In other words, the entire table becomes an index. That's right, to repeat: the entire table becomes one index, the so-called "clustered index". This is why a table can have only one primary key and only one clustered index: the job of the primary key is to convert the table's data from the "table" format into the "index (balanced tree)" format.

The figure above shows the structure of a table with a primary key (clustered index). The drawing is rough, but it will do. The data in every node of the tree (except the bottom level) comes from the primary key field, which is usually the id field we designate as the primary key. The bottom level holds the actual table rows. Suppose we execute an SQL statement:

select * from table where id = 1256;

First, the index is used to locate the leaf node that contains the value 1256, and then the data row with id equal to 1256 is fetched from that leaf node. The operational details of the balanced tree are not explained here, but as the figure above shows, the tree has three levels, so the result is obtained after only three lookups on the way from the root node to the leaf node, as shown in the figure below.
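The descent from root to leaf described above can be sketched in a few lines of Python. This is only a toy model under stated assumptions: the separator keys, the fan-out, and the row contents are invented for illustration, and a real DBMS stores nodes as disk pages rather than dictionaries.

```python
from bisect import bisect_right

# A toy three-level tree keyed on id. Inner nodes hold separator keys and
# children; leaves hold the rows themselves, as in a clustered index.
# All keys and row contents here are made up for illustration.

def leaf(rows):
    return {"rows": dict(rows)}

def inner(keys, children):
    return {"keys": keys, "children": children}

root = inner(
    [1000, 2000],
    [
        inner([500], [leaf([(100, "row 100")]),
                      leaf([(500, "row 500"), (900, "row 900")])]),
        inner([1500], [leaf([(1256, "row 1256")]),
                       leaf([(1500, "row 1500")])]),
        inner([2500], [leaf([(2000, "row 2000")]),
                       leaf([(2500, "row 2500")])]),
    ],
)

def find(node, key):
    """Return (row, nodes_visited) for key, descending from the root."""
    visited = 1
    while "children" in node:
        # bisect_right picks the child whose key range contains `key`.
        node = node["children"][bisect_right(node["keys"], key)]
        visited += 1
    return node["rows"].get(key), visited

row, visited = find(root, 1256)
print(row, visited)  # row 1256 3
```

Each loop iteration stands in for one node read, so the number of reads equals the height of the tree, not the number of rows.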

If a table holds 100 million rows and you need to find one of them, the conventional approach of checking rows one by one needs, in the worst case, 100 million comparisons. In big O notation that is a worst-case time complexity of O(n), which is unacceptable. Obviously these 100 million rows cannot all be read into memory at once, so without any cache optimization the 100 million comparisons mean 100 million IO operations; given the current IO capability of disks and computing power of CPUs, getting the result could take days or longer. If the table is converted into a balanced tree (a very lush tree with a great many nodes) and we assume the tree has 10 levels, then only 10 IO operations are needed to find the required row, which is an exponential improvement. In big O notation this is O(log n), where n is the total number of records, the base of the logarithm is the number of branches per node, and the result is the number of levels in the tree. In other words, the number of lookups is the logarithm of the total number of records, taken to the base of the tree's branching factor. As a formula:

lookups = log_b(n), where b is the number of branches per node and n is the number of records; for example, log_10(100000000) = 8.

In code this is Math.Log(100000000, 10), where 100000000 is the number of records and 10 is the number of branches per node (in a real environment the number of branches is far greater than 10). The result is the number of lookups, which here drops from 100 million to a single digit. Using indexes can therefore bring astonishing performance improvements to database queries.
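To make the arithmetic concrete, here is a small Python sketch that counts how many levels a tree needs for a given number of records and branching factor. The function name levels_needed and the sample fan-outs are assumptions made for this illustration; integer multiplication is used instead of a floating-point logarithm to avoid rounding surprises.

```python
def levels_needed(records, fanout):
    """Smallest number of levels whose fan-out can address `records` rows."""
    levels, capacity = 0, 1
    while capacity < records:
        capacity *= fanout
        levels += 1
    return levels

for fanout in (10, 100, 1000):
    print(fanout, levels_needed(100_000_000, fanout))
# 10 8
# 100 4
# 1000 3
```

With a branching factor of 10 the answer is 8, matching the figure in the text; larger branching factors make the tree even shallower.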

However, everything has two sides. Indexes speed up queries but slow down writes. The reason is simple: the balanced tree must be kept in a correct state at all times, and inserting, deleting, and modifying data changes the index data in the tree's nodes and disturbs the tree structure. So every time the data changes, the DBMS must reorganize the tree (index) to keep it correct, which incurs considerable performance overhead. This is why indexes have side effects on operations other than queries.
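The write-side cost can be illustrated with a toy table in Python. This is a simplified sketch, not how any real DBMS is implemented: the Table class, its column names, and the use of sorted lists in place of balanced trees are all assumptions chosen to keep the example short.

```python
from bisect import insort

class Table:
    """Toy table: a dict of rows plus one sorted list per indexed column."""

    def __init__(self, indexed_columns):
        self.rows = {}
        self.indexes = {col: [] for col in indexed_columns}
        self.index_writes = 0

    def insert(self, row_id, row):
        self.rows[row_id] = row
        for col, entries in self.indexes.items():
            # Each index must stay sorted, so every insert pays for a
            # positioned write here (a real B-tree may also split nodes).
            insort(entries, (row[col], row_id))
            self.index_writes += 1

plain = Table(indexed_columns=[])
indexed = Table(indexed_columns=["name", "birthday"])
for i, name in enumerate(["carol", "alice", "bob"]):
    row = {"name": name, "birthday": f"1991-11-0{i + 1}"}
    plain.insert(i, row)
    indexed.insert(i, row)

print(plain.index_writes, indexed.index_writes)  # 0 6
print(indexed.indexes["name"])
# [('alice', 1), ('bob', 2), ('carol', 0)]
```

Every insert into the indexed table performs one extra ordered write per index, which is the overhead the paragraph above describes.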

After talking about the clustered index, let's talk about the non-clustered index, which is the conventional index that we often mention and use.

A non-clustered index, like a clustered index, uses a balanced tree as its data structure. The value in each node of the index tree comes from the indexed field of the table. If an index is added to the name field of the user table, the index is built from the values in the name field, and whenever the data changes the DBMS must keep the index structure correct. If you add indexes to several fields of a table, there will be several independent index structures, and the (non-clustered) indexes are unrelated to one another, as shown in the figure below.

Every time a new index is created for a field, the data in the field will be copied and used to generate the index. Therefore, adding an index to a table will increase the size of the table and occupy disk storage space.
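The storage cost can be observed with Python's built-in sqlite3 module. SQLite is used here only as a convenient stand-in and is not necessarily the database the author had in mind; the user_info table definition, the index name index_user_name, and the row count are assumptions made for this sketch.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE user_info (id INTEGER PRIMARY KEY, user_name TEXT)")
conn.executemany(
    "INSERT INTO user_info (user_name) VALUES (?)",
    ((f"user_{i:06d}",) for i in range(20000)),
)
conn.commit()

def page_count(conn):
    """Number of pages the database currently occupies."""
    return conn.execute("PRAGMA page_count").fetchone()[0]

before = page_count(conn)
# Building the index copies every user_name value into a separate tree.
conn.execute("CREATE INDEX index_user_name ON user_info(user_name)")
conn.commit()
after = page_count(conn)

print(before, after)
```

The exact page counts depend on the SQLite version and page size, but the count after creating the index should be larger than before, since the indexed values are stored a second time.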

The difference between a non-clustered index and a clustered index is this: a clustered index leads directly to the data being searched for, whereas a non-clustered index leads to the primary key value of the matching record, and that primary key value is then used to find the required data through the clustered index, as shown in the figure below.

No matter how the table is queried, the primary key is ultimately used to locate the data through the clustered index; the clustered index (primary key) is the only path to where the real data resides.
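The two-step lookup described above can be sketched in Python as follows. Dictionaries and sorted lists stand in for the balanced trees, and the sample rows, the city column, and the function name find_by_name are invented for illustration.

```python
from bisect import bisect_left

# Clustered index stand-in: primary key -> full row.
clustered = {
    1: {"id": 1, "name": "alice", "city": "Paris"},
    2: {"id": 2, "name": "carol", "city": "Oslo"},
    3: {"id": 3, "name": "bob", "city": "Lima"},
}

# Non-clustered index on name: sorted (name, primary key) pairs only.
name_index = sorted((row["name"], pk) for pk, row in clustered.items())

def find_by_name(name):
    """Step 1: search the name index for primary keys. Step 2: fetch rows."""
    rows = []
    i = bisect_left(name_index, (name,))
    while i < len(name_index) and name_index[i][0] == name:
        pk = name_index[i][1]        # all the secondary index can give us
        rows.append(clustered[pk])   # second lookup, through the primary key
        i += 1
    return rows

print(find_by_name("bob"))
# [{'id': 3, 'name': 'bob', 'city': 'Lima'}]
```

Note that the name index never stores the city value, so retrieving the full row always requires the second lookup through the primary key.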

However, there is one exception in which the required data can be retrieved without going through the clustered index. This less common approach is called a "covering index" query, and it is usually achieved with what is commonly called a compound index or multi-field index. As pointed out earlier, when a field is indexed, the content of that field is copied into the index. If two fields are specified for one index, the content of both fields is copied into the index.

First look at the following SQL statement

-- Create the index

create index index_birthday on user_info(birthday);

-- Query the user name of the user whose birthday is November 1, 1991

select user_name from user_info where birthday = '1991-11-1';

The execution process of this SQL statement is as follows:

First, the non-clustered index index_birthday is used to find the primary key ID values of all records whose birthday equals 1991-11-1.

Then, a clustered index lookup is performed with each primary key ID value obtained, to find where the real data (the data row) for that primary key ID is stored.

Finally, the value of the user_name field is read from the data row and returned, which gives the final result.

Now we change the index on the birthday field into a two-field covering index:

create index index_birthday_and_user_name on user_info(birthday, user_name);

The execution process of the query then becomes:

The non-clustered index index_birthday_and_user_name is used to find the leaf nodes whose birthday equals 1991-11-1. Besides the primary key ID of the user_info row, each leaf node also holds the value of the user_name field, so there is no need to use the primary key ID to find the real location of the data row: the value of user_name can be read directly from the leaf node and returned. Searching directly through a covering index in this way skips the last two steps of the search without a covering index, which greatly improves query performance, as shown in the following figure.

This is roughly how database indexes work. The details may differ slightly from one database to another, but that does not affect the concepts described here.
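As one concrete check of the covering-index behaviour described earlier, the sketch below asks SQLite for its query plan before and after switching to the two-field index. SQLite is only a convenient stand-in here, not necessarily the database the author used, and its internals differ in detail (its tables are keyed by rowid). The user_info column definitions are assumptions, since the article does not give the table schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema: the article only names the three columns.
conn.execute(
    "CREATE TABLE user_info ("
    " id INTEGER PRIMARY KEY, user_name TEXT, birthday TEXT)"
)

QUERY = "SELECT user_name FROM user_info WHERE birthday = '1991-11-1'"

def plan(conn, sql):
    """Return the human-readable detail column of EXPLAIN QUERY PLAN."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(row[-1] for row in rows)

# Single-field index: the row must still be fetched for user_name.
conn.execute("CREATE INDEX index_birthday ON user_info(birthday)")
single = plan(conn, QUERY)

# Two-field index: user_name can be read from the index itself.
conn.execute("DROP INDEX index_birthday")
conn.execute(
    "CREATE INDEX index_birthday_and_user_name"
    " ON user_info(birthday, user_name)"
)
covering = plan(conn, QUERY)

print(single)
print(covering)
```

The exact wording of the plan varies between SQLite versions, but the second plan should mention a covering index while the first should not, which corresponds to skipping the lookup through the primary key.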

