Indexing and Optimization Principles (Part 1)

About the author: Hello everyone, I am Brother Smart, a former architect at ZTE and Meituan, and now the CTO of an Internet company.

Contact QQ: 184480602 — add me to join the group, so we can learn together, make progress together, and weather the Internet winter together.

In the previous article, we revisited the history of database indexing and studied the B+ tree structure. In this article, we return to the real MySQL database, begin learning concrete SQL optimization principles, and try to explain, from the underlying principles of indexing, why there are so many "rules".

Why you should learn SQL optimization

My former employer was in the recruitment business, so querying industry classifications was unavoidable. Generally, the front end can load industry categories step by step based on the parentId, but some scenarios need a full nested query: fetch every industry category together with its subcategories.

Here, assume we query all categories and their subcategories in one go.

I designed a simple version of the table structure myself.
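Roughly like this (a hypothetical reconstruction; the column names code, parent_code and name are inferred from the Java code below):

CREATE TABLE `sys_position` (
  `id`          INT UNSIGNED NOT NULL AUTO_INCREMENT,
  `code`        VARCHAR(32) NOT NULL COMMENT 'category code',
  `parent_code` VARCHAR(32) NOT NULL DEFAULT '-1' COMMENT 'parent category code, -1 = top level',
  `name`        VARCHAR(64) NOT NULL COMMENT 'category name',
  PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;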

The table holds 1,106 rows of data.

We can easily write the following code:

/**
 * Query industry categories and their subcategories
 */
@Test
public void testCascade() {

    // Query the database to get all industry categories
    List<SysPosition> sysPositionList = sysPositionMapper.selectAll();

    long start = System.currentTimeMillis();

    Map<String, SysPosition> sysPositionMap = new HashMap<>();
    List<SysPosition> result = new ArrayList<>();

    // Step 1: convert the List to a Map keyed by category code
    for (SysPosition sysPosition : sysPositionList) {
        sysPositionMap.put(sysPosition.getCode(), sysPosition);
    }

    // Step 2: iterate over the List and use the Map to nest children under parents
    for (SysPosition sysPosition : sysPositionList) {
        if ("-1".equals(sysPosition.getParentCode())) {
            // parentCode of -1 marks a top-level category
            result.add(sysPosition);
        } else {
            // O(1) parent lookup instead of rescanning the list
            SysPosition parent = sysPositionMap.get(sysPosition.getParentCode());
            parent.getChildren().add(sysPosition);
        }
    }

    long end = System.currentTimeMillis();
    System.out.println("Elapsed: " + (end - start));
}

We analyzed the efficiency earlier: this List-to-Map approach takes 2N iterations, i.e., 2,212 iterations in this case.

Guess how long the above procedure takes?

Only 1 millisecond.

For the CPU, computing over data already in memory is extremely fast; a couple of thousand iterations are basically negligible.

Do you want to know how long the first version of the algorithm took?
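As a reminder, it looked roughly like this (a sketch from memory, reusing the same SysPosition model): a nested loop that rescans the entire list for every node, i.e., N × N iterations.

// Naive first version: for each node, scan the whole list again for its children
List<SysPosition> result = new ArrayList<>();
for (SysPosition sysPosition : sysPositionList) {
    if ("-1".equals(sysPosition.getParentCode())) {
        result.add(sysPosition);
    }
    // Inner loop: 1106 iterations for each of the 1106 outer iterations
    for (SysPosition candidate : sysPositionList) {
        if (sysPosition.getCode().equals(candidate.getParentCode())) {
            sysPosition.getChildren().add(candidate);
        }
    }
}

Timing it the same way: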

Just 33 milliseconds.

Bear in mind that the first version looped 1106 × 1106 = 1.2 million+ times in this case! Yet to the CPU that is still nothing. Of course, this is the difference for a single call. Imagine the interface being hit by hundreds of thousands or even millions of users a day — the cumulative difference is considerable.

What I want to convey through this case is: in most situations, the time spent processing data in memory is close to negligible.

Did you notice that the timings above exclude the Mapper's database query? Would a database SELECT be time-consuming?

I saw this passage about database inserts in an online column:

The time required to insert a row breaks down roughly as follows (see the MySQL 5.7 Reference Manual, 8.2.4.1 Optimizing INSERT Statements):

  • Connecting: 30%
  • Sending the query to the server: 20%
  • Parsing the query: 20%
  • Inserting the row: 10% × row size
  • Inserting indexes: 10% × number of indexes
  • Closing: 10%

Clearly, most of the time goes to client-server communication, so you can pack multiple VALUES rows into a single INSERT to cut down the round trips. It is easy to verify experimentally that inserting many rows at a time beats inserting one row at a time.
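For example (a hypothetical t_user table), the multi-row form pays the connect/send/parse overhead once instead of three times:

-- One round trip, three rows
INSERT INTO `t_user` (`user_name`, `user_age`) VALUES
  ('alice', 20),
  ('bob',   21),
  ('carol', 22);

-- Versus three separate round trips
INSERT INTO `t_user` (`user_name`, `user_age`) VALUES ('alice', 20);
INSERT INTO `t_user` (`user_name`, `user_age`) VALUES ('bob',   21);
INSERT INTO `t_user` (`user_name`, `user_age`) VALUES ('carol', 22);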

Although the figures above are for INSERT, SELECT behaves similarly. Now let me include the Mapper query in the timing:

It jumps to 496 milliseconds!

Well, this example tells us that network requests (and IO operations in general) are very expensive. We should avoid issuing network requests or performing IO inside a loop.
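For example, a hypothetical sketch (the orderMapper/userMapper names are made up; any data-access layer behaves the same way):

// Anti-pattern: one database query per loop iteration (N + 1 queries in total)
List<Order> orders = orderMapper.selectAll();
for (Order order : orders) {
    // A network round trip plus disk IO on every single iteration
    User user = userMapper.selectById(order.getUserId());
    order.setUserName(user.getName());
}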

This is very poor writing: the fix is the pattern we used above — query once, then assemble the data in memory with a Map.

OK, by now you should have formed an intuition: for a typical request, the likeliest performance bottlenecks are network requests and IO operations (and generally speaking, the bottleneck is the database).

To optimize data query, there are two general directions:

  • Optimize the relational database itself, e.g., by adding indexes
  • Offload query pressure to big-data infrastructure or Elasticsearch (which essentially sidesteps the relational database)

For ordinary small companies, big-data platforms and Elasticsearch are still rarities, so when we talk about performance optimization, SQL optimization is almost always the focus! Next to SQL-level gains, code optimization is sometimes insignificant — and even code optimization, in the final analysis, is about reducing the number of database requests. So be glad: you are finally stepping through the door into SQL optimization.

Unless there are special circumstances, the content discussed in this article is based on the InnoDB engine.

In my opinion, for general Java development, SQL optimization is divided into several levels:

  • Index optimization 70%
  • Transactions and locks 20%
  • Read and write separation, etc. 10%

Among them, index optimization is the most important and the most commonly used method by average Java developers.

Index types

Indexes can be classified along different dimensions. We are not chasing academic precision here; a feel for a few kinds is enough.

Open Navicat and, when creating an index, you will find four index types to choose from:

  • Full-text index
  • Normal index
  • Spatial index
  • Unique index

A normal index organizes the familiar tree structure; a unique index additionally requires that values in the indexed column not repeat. For example, suppose we add a unique index to the name column of a student table: if "Zhang San" already exists in the table, inserting another "Zhang San" raises an error.
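A quick sketch (assuming a hypothetical student table):

ALTER TABLE `student` ADD UNIQUE INDEX uk_name (`name`);

INSERT INTO `student` (`name`) VALUES ('Zhang San');  -- OK
INSERT INTO `student` (`name`) VALUES ('Zhang San');  -- fails with ERROR 1062: Duplicate entry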

Relational databases like MySQL are poorly suited to full-text retrieval (consider Elasticsearch instead), so full-text indexes are rarely used.

As for the spatial index, I don't know what it is.

The only indexes commonly used in real development are the normal index and the unique index; you can ignore the others (a primary key index is effectively a unique index plus NOT NULL).

How indexing is implemented

There are two common ways to implement an index: the B+ tree structure and the hash algorithm.

Pay special attention: although Navicat displays "BTREE" here, what MySQL actually uses is a B+Tree.

This does not conflict with the "index types" above.

A normal index, for example, can be organized as a B+ tree or implemented with a hash algorithm. After the previous article we already understand the B+ tree structure fairly well, so here let's talk about the hash index on its own.

A so-called hash index uses a hash algorithm to compute a storage address from the index column's value. Generally speaking, this address will not repeat (when two values do map to the same address, that is called a hash collision).

Take an example from teacher Yan Shiba:

Mount a spring on a wall, and suppose its elasticity never decays. Each time you press a different object against the spring to its limit and release it, different objects land at different spots. If you "stored" a book this way last time, then to find it again you only need to take an identical book and launch it the same way: the previous book will be lying at the new landing point.

A database hash index is designed similarly. Suppose you want to store the row with id=10086: run the id through the hash algorithm to obtain a storage location, then write the data there. The next time you query with id=10086, compute the same function again and you land on the data immediately. Fast, isn't it?

Note, however, that the spring metaphor for the hash algorithm may mislead you into thinking that heavier items land closer and lighter items land farther away, which would suggest that hash indexes can do range searches.

This is not the case.

The hash algorithm has a defining feature: even if the source data is correlated, the hashed results come out "very scattered", with no pattern to follow. Back to the example: you can think of weight as not being the only factor in where the book lands — its material and shape also play a part, ultimately showing up as an irregular landing point shaped by air resistance.

You may remember from your first encounter with HashMap in JavaSE that the order you get entries out is not necessarily the order you put them in. For example, you put 1000 g, 500 g, 300 g, but you get back 500 g, 300 g, 1000 g. In other words, after hashing, whatever correlation the data had is greatly weakened.
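A quick Java demonstration (the exact output order can vary with JDK version and map capacity, which is precisely the point — it is not the insertion order):

import java.util.HashMap;
import java.util.Map;

public class HashOrderDemo {
    public static void main(String[] args) {
        Map<String, String> map = new HashMap<>();
        map.put("1000g", "dictionary");   // put order: 1000g, 500g, 300g
        map.put("500g", "novel");
        map.put("300g", "magazine");
        // Iteration order follows the keys' hash buckets, not insertion order
        for (Map.Entry<String, String> e : map.entrySet()) {
            System.out.println(e.getKey() + " -> " + e.getValue());
        }
    }
}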

So when you want to find the books between 500 g and 1000 g, you cannot use the boundary values to do a range search. The B+ tree's leaf nodes, by contrast, form an ordered linked list, which makes range queries very convenient.

Besides range searches, hash indexes also cannot do fuzzy searches.

The hash algorithm is all about exact positioning: it derives a "unique" value from exactly the input it is given, so fuzzy matching is impossible. If you give me "bravo", I can compute a unique hash value. If you give me "bra%", I will assume this person is literally named "bra%" and compute a value for that string. That value is the landing point of the string "bra%", not the landing point of "all data starting with bra" — which is obviously wrong.

A B+ tree, however, can serve a fuzzy search: you can imagine it walking down the tree and, at the nodes holding data, comparing with something like Java's String#startsWith().

Advantages and disadvantages of hash index

  • Advantage: very fast — a single computation yields the address, O(1) time, versus O(log n) for a B+ tree
  • Disadvantages: no fuzzy queries, no range queries, no index-assisted sorting (the hashed layout is inherently unordered, so sorting has nothing to exploit)
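As an aside: in MySQL, an explicit hash index is only honored by the MEMORY engine; InnoDB accepts the USING HASH syntax but builds a B+ tree anyway (internally it maintains its own adaptive hash index). A sketch:

-- Only the MEMORY engine really builds a hash index
CREATE TABLE `hash_demo` (
  `id`   INT NOT NULL,
  `name` VARCHAR(32),
  KEY `idx_id` (`id`) USING HASH
) ENGINE=MEMORY;

-- Fast: exact match, a single hash computation
SELECT * FROM `hash_demo` WHERE id = 10086;

-- The hash index cannot help here: range predicate
SELECT * FROM `hash_demo` WHERE id BETWEEN 500 AND 1000;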

Finally, notice that our earlier trick of converting a List into a HashMap borrows exactly the idea behind a hash index!

Index creation

Indexes are generally created at one of two times:

  • Up front, when the table is created
  • Later, by altering the table structure (the usual case, because needs are hard to predict up front, and optimizing in advance amounts to blind optimization)

For example, you can create the indexes right when the table is created.
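A sketch of such a DDL (the columns are hypothetical; the table and index names match the ALTER TABLE example further down):

CREATE TABLE `moneywithdraw` (
  `id`         BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
  `auditor_id` BIGINT UNSIGNED NOT NULL COMMENT 'id of the auditing user',
  `amount`     DECIMAL(10, 2) NOT NULL,
  PRIMARY KEY (`id`),
  KEY `idx_auditor_id` (`auditor_id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;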

This table has two indexes: the primary key index and a normal index on auditor_id.

The primary key index is not one of the four index types listed above, but a PRIMARY KEY can be regarded as a unique index plus a NOT NULL constraint.

If you need to add an index later, you can do so in two ways:

  • SQL statement
  • Navicat graphical interface

Use SQL statements to add indexes:

-- 1. Add a PRIMARY KEY (primary key index)
ALTER TABLE `table_name` ADD PRIMARY KEY (`column`);
-- 2. Add a UNIQUE (unique index)
ALTER TABLE `table_name` ADD UNIQUE (`column`);
-- 3. Add an INDEX (normal index)
ALTER TABLE `table_name` ADD INDEX index_name (`column`);
-- 4. Add a FULLTEXT (full-text index)
ALTER TABLE `table_name` ADD FULLTEXT (`column`);
-- 5. Add a joint (composite) index
ALTER TABLE `table_name` ADD INDEX index_name (`column1`, `column2`, `column3`);

In this case, you can write:

ALTER TABLE `moneywithdraw` ADD INDEX idx_auditor_id (`auditor_id`);

Creating a single-column index with the Navicat graphical interface:

Creating a joint index with the Navicat graphical interface:

Oh, by the way: if a table already holds a large volume of data, don't add an index on your own, or you may lock the table... I'll cover that another time. In short, it's fine to "understand indexes", but think twice before "touching indexes".

The good and bad of indexing

When indexes come up, many people say: oh, indexes make queries faster. Generally speaking, people who say this may have studied well, but they definitely have not fully grasped the underlying principles of indexing.

If you think the benefit of an index is only faster queries, you are underestimating indexes.

The advantages of indexing are:

  • Speed up queries (including join queries)
  • Speed up sorting (ORDER BY)
  • Speed up grouping (GROUP BY)

Although faster sorting and grouping ultimately show up as faster queries, realizing this proactively is a breakthrough. Only once you know that indexes can accelerate sorting and grouping will you consciously lean on them (e.g., the leftmost-match principle) when writing ORDER BY and GROUP BY, and thereby write better SQL.

Disadvantages of indexing:

  • An index has a price, mainly maintenance cost, space cost, and table-return cost. In other words, an index improves query efficiency but usually slows down inserts, deletes, and updates (add a few hundred new characters to a dictionary, and they must be cataloged too, costing a few extra pages)
  • With a joint index, you must also watch out for index failure (joint indexes are introduced in the next article)
  • Too many indexes lengthen the query optimizer's decision time (too many choices are themselves a burden)

Principles of index building

Many people treat index creation as trivial — just one line of code, no big deal — while calling SQL optimization the top priority. But now that you know the pros and cons of indexing, you can see how hollow the slogan "index the right fields at the right time" really is. On what basis do you decide how to index?

There are 4 major principles for creating indexes:

  • More indexes are not better; a joint index is often better than several single-column indexes
  • Build indexes on highly differentiated fields
  • Index frequently queried fields; avoid indexing frequently modified fields
  • Avoid duplicate indexes

The reason behind the first principle is that a query will, in fact, walk only one index tree (table returns aside); put more professionally, each query is given exactly one execution plan. Even if you index every one of columns a, b, c, d, e, f, g..., a given SELECT xx, xxx FROM table WHERE ... will use only the one execution plan the database judges best.

Note also that every index you build means one more index tree to maintain, so it is not true that the more indexes, the better: inappropriate indexes add load to the database. By analogy, if you have built a directory for looking up Chinese characters by pinyin and now want to look them up by radical, there is no shortcut — you must go to the trouble of building a second directory.

Seeing this, you may ask: Damn, MySQL is too stupid. Why is it so stubborn that it only uses one index at a time?

The superficial reason: once you have found the character by pinyin, would you then look it up all over again by radical?

The deeper reason, as I personally understand it: the whole premise of an index is that "after walking one index, the database should be left with an exact result or a very small result set". From a cost perspective, it is then cheaper to scan that small result set directly than to walk a second index. Of course, getting an exact result set out of a single index walk takes skill — which field should carry the index? My advice: put indexes on highly differentiated fields whenever possible.

What counts as highly differentiated? That is the second principle. Say the table holds 1 million student rows: if you index the sex column, filtering on sex still leaves roughly 500,000 rows. The remaining result set is huge, which tells you the index was badly chosen — its differentiation is too low.
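A common way to eyeball differentiation (also called selectivity) is the ratio of distinct values to total rows — the closer to 1, the better the column suits an index. A sketch on the hypothetical student table:

SELECT COUNT(DISTINCT sex)  / COUNT(*) AS sex_selectivity,   -- 2 / 1,000,000: terrible
       COUNT(DISTINCT name) / COUNT(*) AS name_selectivity   -- close to 1: a good candidate
FROM student;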

The third principle means what it says. Imagine a dictionary whose table of contents is already compiled: if new characters keep being added, or pronunciations keep being revised, each operation disturbs the table of contents and forces a rearrangement. In other words, to keep the directory pointing at the right characters, every insert, delete, and update incurs an extra step: revising the directory.

In short, you must realize that an index almost always taxes modifications while it speeds up queries, so creating indexes is not so simple — it is truly an "art of balance".

The fourth principle: if a joint index index(a,b,c) has already been created, then a separate single-column index on a is redundant, because the joint index already guarantees that queries on a can use an index. Physically, the single-column index and index(a,b,c) are two independent B+ trees, and the duplicate index only adds maintenance cost.
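For example, on a hypothetical table t:

ALTER TABLE `t` ADD INDEX idx_abc (`a`, `b`, `c`);
-- Redundant: queries filtering on `a` alone can already use the leftmost column of idx_abc
ALTER TABLE `t` ADD INDEX idx_a (`a`);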

The above four principles will be mentioned again later.

MySQL common engines

MySQL has many engines, but the two you hear about most are MyISAM and InnoDB. In real development, close to 99% of projects use InnoDB, and since MySQL 5.5 the default engine has been InnoDB rather than MyISAM, so I will focus on InnoDB and introduce MyISAM only briefly.

Here I mainly want to discuss with you the differences in index organization between MyISAM and InnoDB. Everyone should already know that MyISAM and InnoDB store data differently.

Each MyISAM table is stored as 3 files:

  • Table structure (.frm)
  • Table data (.MYD)
  • Indexes (.MYI)

In other words, table data and indexes are stored separately.

An InnoDB table is stored as only 2 files:

  • Table structure (.frm)
  • Table data + indexes (.ibd)

Note that an InnoDB table's data and its indexes live in the same file (more on this in the next section).

Clustered index and non-clustered index

For BTREE indexes, from the perspective of how data is organized, indexes fall into two categories:

  • Clustered index
  • Non-clustered index

A clustered index can be simply understood as the index and the data being "clustered" together, while in a non-clustered index the data and the index are kept separate.

Under the InnoDB engine's primary key index, no table return is needed for a query: each complete row hangs directly off a leaf node and can be returned immediately. In other words, for InnoDB's primary key index, the data is the index and the index is the data.

MyISAM is not very important, so I won’t mention it.

Not every InnoDB index needs the table return, though. Divided by whether they need it, InnoDB indexes come in two kinds: the primary key index and auxiliary indexes (also called secondary indexes; loosely, the normal and unique indexes).

Why make this distinction?

Assume a scenario:

After a table is created, a primary key index exists naturally. Later you discover the name field is queried very frequently, so you add an index on name.

If the name index also hung full rows off its leaves the way the primary key index does, the two indexes would duplicate the data. Picture two trees on disk, one keyed by name and one keyed by id: their nodes differ, but the leaf level of each would hold complete table rows. That means two identical copies of the student data on disk — data redundancy for sure, and possibly inconsistency on update (multiple copies must be kept in sync).

So what InnoDB actually does is store only the index column + the primary key in an auxiliary index, and perform a "table return" when necessary:

In SELECT * FROM stu WHERE name='bravo', the requested data is *, the entire row. The auxiliary index above stores only the primary key + name, so a table return is unavoidable: take the primary key found there, walk the primary key index a second time, and only then return the whole row.

Now, we can make a brief summary of the index classifications of MyISAM and InnoDB:

  • MyISAM: non-clustered indexes; the index leaves point to rows stored elsewhere, so a lookup into the data file is needed
  • InnoDB:
    • Clustered index: the primary key index; leaf nodes are the table rows themselves, so no table return is needed
    • Non-clustered indexes: auxiliary indexes (unique and normal indexes); leaf nodes hold the primary key, and when necessary the row is fetched via the primary key index

Each InnoDB table has exactly one primary key index and may have many auxiliary indexes. There is only one copy of the table data, and it hangs under the primary key index.

Note that table returns should be avoided whenever possible. The essence of SQL optimization is really to reduce disk IO, and a table return inevitably adds disk IOs.

For example, suppose a table has exactly two index trees — the primary key index and an auxiliary index on name — and both trees have height 3. Since the table data hangs only off the primary key index, SELECT * FROM table WHERE name='xxx' must first walk the auxiliary index to get the id, then walk the primary key index with that id. If the target data sits at level 3 of both trees, this SQL costs 6 logical IO accesses. Querying directly by id uses the primary key index alone: 3 IOs.

So, since an auxiliary-index query normally has to return to the table — scanning one extra index tree (itself plus the primary key index) — you should prefer the primary key index when writing SQL.

Then under what circumstances can an auxiliary index avoid the table return?

Index coverage

The name "index coverage" sounds opaque at first, and many beginners never quite work out what it means. In fact, its biggest effect is simple: it avoids the table return.

Let me illustrate with a case.

Suppose the requirement is that the front end must support searching orders by user name, with the following fields needed on the page:

id | productName | price | userName  | userAge
---|-------------|-------|-----------|--------
1  | iphone12    | 5999  | bravo1988 | 18

A possible solution is:

  1. Search the user based on name in the t_user table and get user_id, user_name, user_age
  2. Query orders based on user_id in the t_order table
  3. Return after matching order and user data according to user_id in memory

Since the query condition on the t_user table here is user_name, you could add a normal index on user_name to speed that query up. But is that really the smart move? Hold on — don't be too clever just yet.

Bear in mind that from the t_user table we need not only user_name but also user_age and id. If only user_name is indexed, the index tree generated on disk looks like this:

The non-leaf nodes of this tree hold user_name and the leaf nodes hold the id. That is, from this tree we can only get user_name and the id; for user_age, MySQL has to jump out of the name index tree and walk over to the primary key index next door to fetch it. That detour is the table return, and a table return means one extra trip.

At this point we can instead add a joint index on user_name and user_age, which produces the so-called "index coverage".
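In SQL (using the t_user column names shown above):

ALTER TABLE `t_user` ADD INDEX idx_name_age (`user_name`, `user_age`);

-- Covered: id, user_name and user_age all live on the idx_name_age tree, so no table return
SELECT id, user_name, user_age FROM t_user WHERE user_name = 'bravo1988';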

When the fields stored on the auxiliary index fully satisfy the columns required by the query, that is index coverage. This is good news: no table return is needed, and query efficiency improves greatly. It is also why SQL optimization guides keep emphasizing: select only the fields you need and avoid SELECT * (the fewer fields you query, the higher the chance of index coverage).

Even if the table has only two fields and both are already covered by an index, do not write SELECT *. As the business grows, fields will be added to the table, and then SELECT * will no longer be covered!

As a memory aid, treat index coverage as: indexed fields ⊇ fields required by the query. If the joint index is index(a,b,c), then any query touching only some combination of a/b/c/id is covered (the primary key id rides along in every secondary index's leaves). The biggest benefit of index coverage is avoiding the table return.

It must be stressed that covering indexes and joint indexes are not inherently linked. For example, with only a single-column index on user_name, the statement

SELECT id, user_name FROM t_user WHERE user_name = 'bravo';

is also covered. So whether a query is covered does not depend on the index alone; it takes the query's cooperation.
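You can verify coverage with EXPLAIN: when a query is covered, MySQL reports "Using index" in the Extra column. A sketch:

EXPLAIN SELECT id, user_name FROM t_user WHERE user_name = 'bravo';
-- Extra: Using index  -> covered, no table return

EXPLAIN SELECT * FROM t_user WHERE user_name = 'bravo';
-- Extra does not say "Using index" -> SELECT * forces a table return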

Regarding the joint index, we will introduce it in the next article.

Summary of key points

Today we covered the classifications of indexes, their pros and cons, and several principles for creating them. We also introduced clustered and non-clustered indexes, which brought in the concept of the table return. Table returns cost extra IO, so to improve efficiency we also learned about "index coverage". I hope this article leaves you with a solid general impression of indexes.

But the hardest part of indexing is still ahead. In the next article we study the joint index — a very important topic that spawns countless questions: index failure, the leftmost-match principle, ORDER BY failing to use the index, and so on.

