MySQL logical architecture and performance optimization principles

MySQL logical architecture

If you can build a mental picture of how MySQL's components work together, it will help you understand the MySQL server in depth. The following figure shows MySQL's logical architecture.

The MySQL logical architecture is divided into three layers. The top layer is the client layer, which is not unique to MySQL: functions such as connection handling, authentication and authorization, and security are all handled here.

Most of MySQL's core services live in the middle layer, including query parsing, analysis, optimization, caching, and all the built-in functions (such as dates, math, and encryption). All cross-storage-engine functionality is also implemented in this layer: stored procedures, triggers, views, and so on.

The bottom layer is the storage engine, which is responsible for storing and retrieving data in MySQL. Much like file systems under Linux, each storage engine has its own strengths and weaknesses. The middle service layer communicates with the storage engines through APIs, and these API interfaces shield it from the differences between engines.

When a client initiates a new request, the server's connection/thread handler receives it and allocates memory for it: a new thread is created in the server process's address space to respond to that client. The user's query requests run within this thread, and the results are cached and returned from it as well. The reuse and destruction of threads are managed by the connection/thread handler.

To sum up: the user initiates a request, the connection/thread handler allocates memory, and the query machinery takes over.

MySQL query process

Users always hope that MySQL will deliver higher query performance, and the best way to get there is to understand how MySQL optimizes and executes queries. Once you understand this, you will find that much query optimization work simply consists of following certain principles so that the MySQL optimizer can operate in the reasonable way you expect.

What exactly does MySQL do when a client sends it a request? The following figure shows the query process of MySQL.

Client/server communication protocol

The MySQL client/server communication protocol is "half-duplex": at any moment, either the server is sending data to the client or the client is sending data to the server; the two cannot happen at the same time. Once one end starts sending a message, the other end must receive the entire message before it can respond. Consequently, we cannot (and need not) cut a message into small pieces to send them independently, and there is no flow control.

The client sends its query to the server in a single packet, so the max_allowed_packet parameter matters when query statements are very long. Note that if a query is too large, the server will refuse to receive any more data and throw an exception.

In contrast, the server usually responds to the client with a lot of data, spread across multiple packets. When the server responds, the client must receive the entire result in full; it cannot simply take the first few rows and tell the server to stop sending. This is why, in actual development, it is a very good habit to keep queries simple and return only the necessary data, reducing both the size and the number of packets exchanged. It is also one of the reasons to avoid SELECT * and to add a LIMIT clause to queries.

Query cache

Before parsing a query, if the query cache is enabled, MySQL checks whether the statement hits the query cache. If the current query does hit, the cached result is returned directly after a single check of the user's privileges. In this case the query is never parsed, no execution plan is generated, and nothing is executed.

MySQL stores cached results in a reference table (not to be understood as a database table; think of it as a data structure similar to a HashMap) indexed by a hash value. The hash is computed from the query text itself, the current database, the client protocol version, and other information that may affect the result. Therefore two queries that differ in any character (for example, spaces or comments) will not hit the same cache entry.
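A toy sketch can make that hashing behavior concrete. This is not MySQL's actual cache code — the function names and the SHA-256 choice here are illustrative assumptions — but it shows why any byte-level difference, such as an extra space, causes a cache miss:

```python
import hashlib

# Hypothetical sketch: a cache keyed on a hash of the raw query text plus
# other factors that can affect the result (database, protocol version).
def cache_key(sql, database, protocol_version):
    raw = f"{sql}|{database}|{protocol_version}".encode()
    return hashlib.sha256(raw).hexdigest()

query_cache = {}

def store(sql, result, database="test", protocol_version=10):
    query_cache[cache_key(sql, database, protocol_version)] = result

def lookup(sql, database="test", protocol_version=10):
    return query_cache.get(cache_key(sql, database, protocol_version))

store("SELECT * FROM t", [(1,), (2,)])
hit = lookup("SELECT * FROM t")      # identical text: cache hit
miss = lookup("SELECT  *  FROM t")   # extra spaces change the hash: miss
```

Because the key is a hash of the raw bytes, the cache never has to normalize or parse SQL, which keeps lookups cheap, at the price of being extremely sensitive to formatting.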

If a query contains any user-defined functions, stored functions, user variables, temporary tables, or system tables in the mysql database, its results will not be cached. For example, NOW() or CURRENT_DATE() return different results at different times, and statements containing CURRENT_USER or CONNECTION_ID() return different results for different users; caching such results makes no sense.

Since it is a cache, it must be invalidated at some point. When does the query cache become invalid? MySQL's query cache tracks every table involved in a query; if any of those tables changes (data or structure), all cached entries related to it are invalidated. Because of this, on any write operation MySQL must invalidate all caches for the affected tables. If the query cache is very large or heavily fragmented, this can be very expensive and may even make the system stall for a while. Moreover, the query cache adds overhead not only on writes but on reads as well:

1. Every query must be checked against the cache before execution, even if the statement will never hit it

2. If the result is cacheable, then after execution completes it is stored in the cache, which also incurs additional overhead

Based on this, be aware that the query cache does not improve performance in all circumstances: caching and invalidation both carry costs, and only when the resources saved by the cache exceed the resources it consumes does the system see a net gain. If the system does have performance problems, you can try enabling the query cache along with some database design optimizations, such as:

1. Use multiple small tables instead of one big table, but be careful not to over-design

2. Use batch inserts instead of single inserts in a loop

3. Control the size of the cache space reasonably. Generally speaking, a few tens of megabytes is appropriate

4. You can control whether a query statement needs to be cached through SQL_CACHE and SQL_NO_CACHE

The final piece of advice is not to enable the query cache lightly, especially for write-intensive applications. If you must use it, set query_cache_type to DEMAND; then only queries annotated with SQL_CACHE are cached and everything else is skipped, which gives you precise control over which queries are cached. (Note that the query cache was deprecated in MySQL 5.7.20 and removed entirely in MySQL 8.0.)

Syntax analysis and preprocessing

MySQL parses SQL statements by keywords and generates a corresponding parse tree. In this step, the parser mainly validates and parses using grammar rules, for example checking whether an invalid keyword is used or whether keywords appear in the wrong order. Preprocessing then checks, according to MySQL's rules, whether the parse tree is legal beyond the grammar, for example whether the tables and columns being queried actually exist.

Query optimization

A syntax tree that survives the previous steps is considered legal, and the optimizer converts it into a query plan. In most cases, a query can be executed in many different ways that all return the correct result; the optimizer's job is to find the best execution plan among them.

MySQL uses a cost-based optimizer: it tries to predict the cost of each candidate execution plan and chooses the cheapest one. You can inspect the cost of the most recent query by reading the Last_query_cost status value of the current session.

mysql> select * from t_message limit 10;
...(result set omitted)
mysql> show status like 'last_query_cost';
+-----------------+-------------+
| Variable_name   | Value       |
+-----------------+-------------+
| Last_query_cost | 6391.799000 |
+-----------------+-------------+

The result in the example means the optimizer estimates it needs about 6391 random data-page reads to complete the query. This estimate is computed from a series of statistics: the number of pages per table or index, the cardinality of indexes, the lengths of index entries and data rows, the distribution of the index, and so on.

There are many reasons why MySQL may choose a wrong execution plan: the statistics may be inaccurate; the optimizer does not consider costs outside its control (user-defined functions, stored procedures); and what MySQL considers optimal may differ from what we consider optimal (we want the shortest execution time, but MySQL picks the plan it estimates to be cheapest, and a low estimated cost does not necessarily mean a short execution time); and so on.

MySQL's query optimizer is a very complex component that applies many optimization strategies to generate an optimal execution plan:

1. Reordering table joins (when multiple tables are joined, the order written in the SQL is not necessarily the order used; there are also techniques for forcing a particular join order)

2. Optimizing the MIN() and MAX() functions (to find the minimum value of an indexed column, the optimizer only needs to read the leftmost entry of the B+Tree index; for the maximum, the rightmost entry. See below for the underlying principle)

3. Terminating the query early (for example, with LIMIT, the query stops as soon as enough rows have been found)

4. Optimizing sorts (older versions of MySQL used a two-pass sort: row pointers and the columns to sort by were sorted in memory first, and then the data rows were read back in sorted order. Newer versions use a single-pass sort: all needed columns are read at once and then sorted by the given columns. For I/O-intensive applications this is much more efficient)
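The difference between the two filesort strategies can be sketched in a few lines. This is a toy model, not MySQL's filesort implementation; the in-memory dict stands in for rows on disk, and the row layout (id, name, score) is an assumption for illustration:

```python
# Toy "table": id -> (name, score); sorting by score in both strategies.
rows = {1: ("Carol", 70), 2: ("Alice", 90), 3: ("Bob", 80)}

def two_pass_sort(rows):
    # Pass 1: sort only (sort_key, row_pointer) pairs, which fit in memory.
    pointers = sorted((score, rid) for rid, (_, score) in rows.items())
    # Pass 2: re-read each full row by pointer, in sorted order (extra I/O).
    return [(rid, *rows[rid]) for _, rid in pointers]

def single_pass_sort(rows):
    # Read all needed columns once, then sort the full tuples in memory.
    full = [(rid, name, score) for rid, (name, score) in rows.items()]
    return sorted(full, key=lambda r: r[2])
```

Both produce the same order; the two-pass version touches the row store a second time (one lookup per row), which is exactly the I/O cost the single-pass strategy avoids at the price of more sort memory.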

Query execution engine

After the parsing and optimization phases, MySQL has generated an execution plan, and the query execution engine carries out that plan's instructions step by step to produce the result. Most of the operations in this process are performed by calling interfaces implemented by the storage engine, known as the handler API. Each table in the query is represented by a handler instance; in fact, MySQL creates a handler instance for each table already during the optimization stage, and the optimizer uses these instances' interfaces to obtain table information such as column names and index statistics. The storage engine interface offers very rich functionality, yet underneath there are only a few dozen interfaces which, like building blocks, accomplish most of the operations of a query.

Return the result to the client

The final stage of query execution is returning the results to the client. Even when no rows match, MySQL still returns information about the query, such as the number of rows affected and the execution time.

If the query cache is turned on and the query can be cached, MySQL will also store the results in the cache.

Returning the result set to the client is an incremental process: MySQL may start streaming rows back as soon as the first result is produced. This way the server does not need to hold too many results in memory, and the client receives results as early as possible. Note that each row in the result set is sent in a packet that conforms to the client/server protocol and is then transmitted over TCP; during transmission, MySQL packets may be buffered and sent in batches.

To summarize, MySQL's entire query execution process is roughly divided into five steps:

1. The client sends a query request to the MySQL server

2. The server first checks the query cache, and if it hits the cache, it immediately returns the result stored in the cache. Otherwise go to the next stage

3. The server performs SQL analysis, preprocessing, and the optimizer generates the corresponding execution plan

4. MySQL calls the storage engine API to execute the query according to the execution plan

5. Return the results to the client and cache the query results
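The five steps above can be sketched as a toy pipeline. Every component here is reduced to a plain function with made-up names and results — no real parsing, optimizing, or engine calls happen — but the control flow (cache short-circuit, plan, execute, cache on the way out) mirrors the list:

```python
def check_cache(sql, cache):               # step 2: query cache lookup
    return cache.get(sql)

def parse_and_optimize(sql):               # step 3: parse, preprocess, plan
    return {"plan": "index_scan", "sql": sql}   # pretend execution plan

def call_storage_engine(plan):             # step 4: handler API calls
    return [("row1",), ("row2",)]          # pretend result set

def run_query(sql, cache):                 # step 1: client sends the query
    cached = check_cache(sql, cache)
    if cached is not None:                 # cache hit returns immediately
        return cached
    plan = parse_and_optimize(sql)
    result = call_storage_engine(plan)
    cache[sql] = result                    # step 5: cache, then return
    return result

cache = {}
first = run_query("SELECT 1", cache)       # runs the full pipeline
second = run_query("SELECT 1", cache)      # served from the cache
```

The second call never reaches the parser or the engine, which is exactly the shortcut the query-cache section described.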

Performance optimization recommendations

The following gives some optimization suggestions from three different angles. But first, a word of advice: do not take any "absolute truths" you read about optimization on faith, including the content of this article. Instead, verify your assumptions by testing execution plans and response times in your actual business scenario.

Table design and data type optimization

When choosing data types, it pays to follow the principle of small and simple. Smaller data types are usually faster, take up less disk and memory, and require fewer CPU cycles; simpler data types require fewer CPU cycles to process. For example, integer operations are cheaper than string operations, so use an integer type to store IP addresses, and use DATETIME rather than a string to store time.
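The IP-address trick works because an IPv4 address is just four octets packed into a 32-bit integer — the same idea behind MySQL's INET_ATON()/INET_NTOA() functions. A sketch of the conversion using Python's standard library:

```python
import ipaddress

def ip_to_int(ip: str) -> int:
    # Pack the 4 octets into one 32-bit integer, like MySQL's INET_ATON().
    return int(ipaddress.IPv4Address(ip))

def int_to_ip(n: int) -> str:
    # The inverse, like INET_NTOA().
    return str(ipaddress.IPv4Address(n))

n = ip_to_int("192.168.0.1")   # 192*2**24 + 168*2**16 + 0*2**8 + 1
```

Stored this way, an address fits in an UNSIGNED INT (4 bytes) instead of a VARCHAR(15), and range scans over subnets become simple integer comparisons.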

Here are a few tips that may be easy to understand:

1. Generally speaking, changing a nullable column to NOT NULL does not help much on its own, but if you plan to create an index on a column, you should declare it NOT NULL.

2. Specifying a display width for an integer type, such as INT(11), is useless. INT always uses 4 bytes of storage, so its value range is fixed; INT(1) and INT(20) are identical for storage and computation.

3. UNSIGNED disallows negative values, which roughly doubles the upper limit for positive numbers. For example, TINYINT ranges from -128 to 127, while UNSIGNED TINYINT ranges from 0 to 255.

4. Generally there is little need for the DECIMAL data type. Even for financial data you can still use BIGINT: for example, if you need precision to one ten-thousandth of a cent, multiply every amount by one million and store it as a BIGINT. This avoids both the inaccuracy of floating-point arithmetic and the high cost of exact DECIMAL arithmetic.
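The scaled-integer technique is easy to demonstrate. This sketch uses Python's Decimal only to parse the input string exactly; once scaled, everything is plain integer arithmetic, which is what a BIGINT column would hold:

```python
from decimal import Decimal

SCALE = 1_000_000  # the "multiply by one million" trick from the text

def to_fixed(amount: str) -> int:
    # Parse the decimal string exactly, then scale to a BIGINT-like integer.
    return int(Decimal(amount) * SCALE)

def to_display(fixed: int) -> str:
    return str(Decimal(fixed) / SCALE)

a = to_fixed("19.99")       # 19990000
b = to_fixed("0.000001")    # 1
total = a + b               # exact integer arithmetic, no float rounding error
```

Compare with floats: 19.99 + 0.000001 in binary floating point already carries representation error, while the integer sum is exact and sums of millions of rows stay exact.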

5. TIMESTAMP uses 4 bytes of storage while DATETIME uses 8. As a result, TIMESTAMP can only represent dates from 1970 to 2038, a much smaller range than DATETIME, and TIMESTAMP values vary with the time zone.

6. In most cases there is no need for enumerated types. One drawback is that the list of enumeration strings is fixed: adding or removing strings (enumeration options) requires ALTER TABLE (unless you are only appending elements to the end of the list, in which case the table does not need to be rebuilt).

7. Do not put too many columns in a table. The storage engine API copies data between the server layer and the storage engine layer in a row-buffer format, and the server layer then decodes the buffer into individual columns. This conversion is expensive: if a table has many columns but only a few are actually used, CPU usage can become high.

8. ALTER TABLE on a large table is very time-consuming. MySQL performs most table structure changes by creating an empty table with the new structure, copying all rows from the old table into the new one, and then deleting the old table. This takes even longer when memory is scarce, the table is large, or it has large indexes.

Create high-performance indexes

Indexes are an important means of improving MySQL query performance, but too many indexes can lead to high disk and memory usage, hurting the application's overall performance. You should avoid treating indexes as an afterthought, because later you may need to monitor a great deal of SQL to locate the problem, and the time spent adding an index then is far more than it would have taken to add it up front. Clearly, adding indexes well requires skill.

Index-related data structures and algorithms

The index we usually speak of is the B-Tree index, currently the most common and effective index for data lookup in relational databases, and one supported by most storage engines. The term B-Tree is used because MySQL uses this keyword in CREATE TABLE and other statements; in reality, different storage engines may use different data structures underneath. InnoDB, for example, uses a B+Tree.

The B in B+Tree stands for balance. It is worth noting that a B+Tree index cannot locate a specific row for a given key value by itself; it only finds the page on which the row resides. The database then reads that page into memory, searches within it, and finally obtains the row it is looking for.

As the data in a database grows, the index itself grows until it can no longer be held entirely in memory, so indexes are usually stored on disk as index files. In that case, searching an index incurs disk I/O, and compared with memory access, I/O access is several orders of magnitude more expensive. Imagine the depth of a binary tree with millions of nodes: if such a deep tree were stored on disk, every node visited would require one disk I/O read, and the total search time would clearly be unacceptable. So how do we reduce the number of I/O accesses during a search?

An effective solution is to reduce the depth of the tree by turning the binary tree into an m-ary tree (a multiway search tree), and B+Tree is one kind of multiway search tree. To understand B+Tree, you only need its two most important features. First, all keys (which you can think of as the data) are stored in the leaf nodes (leaf pages); the non-leaf nodes (index pages) store no real data, and all record nodes sit in the same leaf layer in key order. Second, all leaf nodes are connected by pointers. The figure below shows a simplified B+Tree of height 3.

Two search operations can be performed on B+ trees:

1. Search in order from the smallest keyword

2. Starting from the root node, perform a random search

In a random search, starting from the root node, the procedure is the same as in a B-tree, except that even if the target key is found in a non-terminal node, the search does not stop; it continues down to the leaf node containing that key. Therefore, in a B+ tree, every random search, successful or not, follows a path from the root to a leaf. For a sequential search, start at the leftmost leaf node containing the smallest key and, without passing through the branch (non-terminal) nodes, follow the leaf pointers from one leaf to the next to traverse all keys.
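Both search modes can be shown on a tiny hand-built structure. This is a deliberately minimal, fixed two-level sketch (real B+ trees split and rebalance pages dynamically), but it captures the two paths described above: root-to-leaf descent, and a range scan along the linked leaves:

```python
class Leaf:
    def __init__(self, keys):
        self.keys = keys   # sorted keys ("records") in this leaf page
        self.next = None   # pointer to the right sibling leaf

# Build leaves [1,4,7] -> [10,13,16] -> [20,25,31] and link them.
leaves = [Leaf([1, 4, 7]), Leaf([10, 13, 16]), Leaf([20, 25, 31])]
leaves[0].next, leaves[1].next = leaves[1], leaves[2]
# Root: (upper bound, child); None means "no upper bound" for the last child.
root = [(10, leaves[0]), (20, leaves[1]), (None, leaves[2])]

def random_search(key):
    # Walk root -> leaf, then search inside the leaf page (now "in memory").
    for bound, leaf in root:
        if bound is None or key < bound:
            return key in leaf.keys

def range_scan(lo, hi):
    # Descend to the starting leaf, then just follow leaf pointers.
    leaf = next(l for b, l in root if b is None or lo < b)
    out = []
    while leaf:
        out.extend(k for k in leaf.keys if lo <= k <= hi)
        leaf = leaf.next
    return out
```

The range scan is the point of the linked leaves: after one descent, it never revisits the root or branch nodes, which is why B+ trees handle range queries so well.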

The figure below shows one possible way of indexing. On the left is the data table: two columns and seven records, with the leftmost column being the physical address of each record (note that logically adjacent records are not necessarily physically adjacent on disk). To speed up searches on Col2, we can maintain the binary search tree shown on the right: each node holds an index key and a pointer to the physical address of the corresponding record, so binary search can retrieve the data in O(log2 N) time.

The main reason databases use B+ trees for indexes is that, while B-trees improve disk I/O performance, they do not solve the problem of inefficient element traversal; the B+ tree arose precisely to solve it. Traversing just the leaf nodes of a B+ tree traverses the entire tree. Moreover, range queries are very frequent in databases, and B-trees do not support them (or support them only inefficiently), whereas B+ tree traversal is extremely efficient and its structure is particularly well suited to range searches. For example, to find all students aged 18 to 22 in a school, you can randomly search from the root for the first 18-year-old student (arriving at a leaf node) and then follow the leaf chain in order to collect all records within the range.

Setting up an index on a table has a price: it increases the database's storage space, and inserts and updates take longer (because the index must be updated as well).

High performance strategy

From the above, you should now have a general understanding of the B+Tree data structure. But how does a MySQL index actually organize data storage? Consider a simple example with the following table:

CREATE TABLE People(
    last_name  varchar(50)   not null,
    first_name varchar(50)   not null,
    dob        date          not null,
    gender     enum('m','f') not null,
    key(last_name, first_name, dob)
);

For each row of data in the table, the index contains the values ​​of the last_name, first_name, and dob columns. The following figure shows how the index organizes data storage.

As you can see, the index is sorted first by the first column (last_name); when last names are equal, rows are sorted by the second column (first_name); and when both names are equal, by the third column (dob, the date of birth). This is exactly why the index "leftmost prefix" principle holds.
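That ordering is nothing more than sorting rows by the tuple (last_name, first_name, dob), which a few lines of Python (with made-up sample rows) can demonstrate — a leftmost prefix works because all matching rows form one contiguous run in the sorted order:

```python
import bisect

# Sample rows in (last_name, first_name, dob) order, like the index entries.
rows = [
    ("Smith", "Anna", "1990-01-01"),
    ("Allen", "Cuba", "1960-01-01"),
    ("Smith", "Bob",  "1985-06-15"),
    ("Allen", "Kim",  "1970-01-01"),
]
# Tuple comparison sorts by last_name, then first_name, then dob --
# exactly the multi-column index order.
index = sorted(rows)

# A query on the leftmost column binary-searches one contiguous run:
lo = bisect.bisect_left(index, ("Smith",))
hi = bisect.bisect_right(index, ("Smith", "\uffff"))
smiths = index[lo:hi]
```

A predicate on first_name alone, by contrast, matches rows scattered throughout the sorted order, so the index cannot narrow the search — the leftmost-prefix rule in miniature.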

1. MySQL cannot use an index on a column that is not "independent". An "independent" column means the indexed column is not part of an expression and is not an argument to a function. For example:

select * from t where id + 1 = 5

It is easy to see that this is equivalent to id = 4, but MySQL cannot parse the expression automatically; rewrite it as where id = 4 so the index can be used. The same applies to wrapping an indexed column in a function.

2. Prefix index

If the column is very long, you can usually index the beginning part of the characters, which can effectively save index space and improve index efficiency.
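A reasonable way to pick the prefix length is to compare the selectivity of candidate prefixes against the full column's selectivity. This sketch uses made-up city names; the goal is the shortest prefix whose selectivity matches the full column:

```python
# Made-up sample values for a long string column.
cities = ["San Francisco", "San Diego", "Santiago", "Saint Paul",
          "Berlin", "Beijing", "Bern", "London", "London"]

def selectivity(values, prefix_len=None):
    # COUNT(DISTINCT LEFT(col, n)) / COUNT(*) -- higher is better.
    cut = [v[:prefix_len] if prefix_len else v for v in values]
    return len(set(cut)) / len(values)

full = selectivity(cities)  # the full column's selectivity: the target
best = next(n for n in range(1, 20)
            if selectivity(cities, n) >= full)  # shortest good-enough prefix
```

Here a 5-character prefix already distinguishes the values as well as the full strings, so KEY(city(5)) would give nearly the same filtering power in a fraction of the index space.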

3. Multi-column index and index order

In most cases, creating independent indexes on several columns does not improve query performance, for a simple reason: MySQL cannot tell which single index would serve the query best. In older versions such as MySQL 5.0, it would simply pick one of the column indexes; newer versions use an index merge strategy. As a simple example, suppose a film_actor table has independent indexes on both the actor_id and film_id columns, and we run the following query:

select film_id,actor_id from film_actor where actor_id = 1 or film_id = 1

The old version of MySQL would pick one of the indexes, but the new version performs the following optimization:

select film_id,actor_id from film_actor where actor_id = 1 
union all 
select film_id,actor_id from film_actor where film_id = 1 and actor_id <> 1

1. When multiple indexes are intersected (several AND conditions, such as where film_id = 1 and actor_id = 1 in the example above), a single index containing all the relevant columns is generally better than multiple independent indexes.

2. When multiple indexes are unioned (several OR conditions, as in the query above), merging and sorting the result sets consumes a lot of CPU and memory, especially when some of the indexes have low selectivity and a large amount of data must be returned and merged. In such cases a full table scan may actually be cheaper.

Therefore, when EXPLAIN shows an index merge (Using union appears in the Extra column), check whether the query and the table structure are already optimal. If both are fine, the merge only indicates that the indexes are poorly designed, and you should reconsider them carefully; a multi-column index covering all the relevant columns may be more suitable.

We saw earlier how an index organizes data storage, and from the figure it is clear that column order in a multi-column index matters greatly for queries. Obviously, the more selective column should come first in the index, so that the first column filters out most of the non-matching rows.

After understanding the concept of index selectivity, it is not difficult to determine which column is more selective. Take this query, for example:

SELECT * FROM payment where staff_id = 2 and customer_id = 584

Should the index be (staff_id, customer_id), or should the order be reversed? Run the query below: whichever column's selectivity is closer to 1 should be placed first in the index.

select count(distinct staff_id)/count(*) as staff_id_selectivity,
count(distinct customer_id)/count(*) as customer_id_selectivity,
count(*) from payment

In most cases, there is no problem with using this principle.
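The selectivity calculation itself is just COUNT(DISTINCT col) / COUNT(*). A pure-Python stand-in over a few made-up payment rows (the values are illustrative, not from any real dataset):

```python
# Made-up payment rows standing in for the payment table.
payments = [
    {"staff_id": 1, "customer_id": 101},
    {"staff_id": 2, "customer_id": 102},
    {"staff_id": 1, "customer_id": 103},
    {"staff_id": 2, "customer_id": 104},
]

def selectivity(rows, column):
    # COUNT(DISTINCT column) / COUNT(*)
    return len({r[column] for r in rows}) / len(rows)

staff_sel = selectivity(payments, "staff_id")        # 2 distinct / 4 = 0.5
customer_sel = selectivity(payments, "customer_id")  # 4 distinct / 4 = 1.0
# customer_id is closer to 1, so it would lead: (customer_id, staff_id).
leading = "customer_id" if customer_sel >= staff_sel else "staff_id"
```

With these sample values, customer_id is fully selective while staff_id cuts the rows only in half, so customer_id would go first in the composite index.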

4. Avoid multiple range conditions

In actual development, we often use multiple range conditions, for example, we want to query users who have logged in during a certain period of time:

select user.* from user where login_time > '2017-04-01' and age between 18 and 30;

There is a problem with this query: it has two range conditions, on the login_time column and on the age column. MySQL can use an index on login_time or an index on age, but not both at once.

5. Redundant and duplicate indexes

Duplicate indexes are indexes of the same type created on the same columns in the same order; they should be avoided and removed as soon as they are found. Redundant indexes are slightly different: if an index (A, B) exists, another index (A) is redundant, because it is a prefix of the former. Redundant indexes often appear when new indexes are added to a table, for instance creating a new index (A, B) instead of extending an existing index (A).

In most cases, you should try to extend existing indexes rather than create new ones. However, in rare cases performance considerations justify a redundant index, for example when extending an existing index would make it so large that it hurts other queries that use it.

6. Delete long-unused indexes

Periodically deleting indexes that have gone unused for a long time is a very good habit.

I will stop here on the topic of indexing. Finally, remember that an index is not always the best tool; it is effective only when the benefit it brings to query speed outweighs the extra work it creates. For very small tables, a simple full table scan is more efficient. For medium and large tables, indexes are very effective. For very large tables, the cost of creating and maintaining indexes grows, and other techniques such as partitioned tables may work better. And, as advised at the start: always test.

In everyday work, the most common way to catch performance problems is to enable the slow query log and locate SQL statements that execute poorly. But locating a statement is not the end: we also need to know its execution plan, for example whether it performs a full table scan or an index scan, and this is what EXPLAIN is for. The EXPLAIN command is the main way to see how the optimizer decides to execute a query. It helps us understand MySQL's cost-based optimizer in depth and reveals many details of the access strategies the optimizer considers and which one it is expected to adopt when the statement runs.

To sum up

Understanding how a query is executed and where its time is spent, combined with some knowledge of the optimization process, helps you understand MySQL better and grasp the principles behind common optimization techniques.



Origin blog.csdn.net/yunzhaji3762/article/details/86511300