Explain the execution flow of a SELECT statement and an UPDATE statement in detail

  • Preface
  • The execution flow of a select statement
  • Establishing a connection
  • Query cache
  • Parser and preprocessor
  • Lexical and syntax analysis (Parser)
  • Preprocessor
  • Query Optimizer
  • What optimizations can the optimizer do
  • The optimizer is not a panacea
  • How the optimizer gets the query plan
  • Storage engine query
  • Return result
  • The execution flow of an update statement
  • Buffer Pool
  • redo log
  • Write-Ahead Logging (WAL)
  • How to flush the redo log
  • binlog
  • The difference between binlog and redo log
  • The execution flow of the update statement
  • Two-phase commit
  • What if two-phase commit is not used
  • Data recovery rules after a crash
  • Summary

Preface

This article is based on MySQL 5.7 version.

The previous articles in this MySQL series covered indexes, transactions, and locks, so today let us look at the steps MySQL goes through when we execute a select statement or an update statement to return the data we want.

The execution flow of a select statement

From a high-level perspective, MySQL can be divided into a Server layer and a storage engine layer. The Server layer includes the connector, query cache, parser, preprocessor, optimizer, executor, and so on; finally, the Server layer calls the corresponding storage engine through API interfaces, as shown in the figure below (the picture comes from "High Performance MySQL"):

[Figure: MySQL's Server layer and storage engine layer]

According to the flowchart, a select query roughly goes through the following six steps:
1. The client initiates a request and first establishes a connection
2. The server checks the query cache; on a hit it returns the result directly, otherwise it continues to the next step
3. The server parses the received SQL statement: lexical analysis, syntax analysis, and preprocessing
4. The optimizer generates an execution plan
5. The storage engine API is called to execute the query
6. The query result is returned

The query process can also be represented by the following figure (the picture comes from Ding Qi's MySQL 45 Lectures):

[Figure: the overall flow of a query statement]

Establishing a connection

The first step is to establish a connection. This is easy to understand. One thing worth pointing out is that communication between the MySQL server and the client uses a half-duplex protocol.

There are three main communication methods: simplex, half-duplex, and full-duplex, as shown in the figure below:

[Figure: simplex, half-duplex, and full-duplex communication]

  • Simplex: data can only be transmitted in one direction. For example, a remote control: we can use the remote to control the TV, but the TV cannot control the remote.
  • Half-duplex: data can be transmitted in both directions, but only one side can send at a time. While A is sending data to B, B cannot send to A; it must wait until A has finished. A walkie-talkie works this way.
  • Full duplex: data can be transmitted in both directions simultaneously, as in a phone call or a voice/video call in a messaging app.

The half-duplex protocol keeps MySQL's communication simple and fast, but it also limits performance to a certain extent, because once one end starts sending data, the other end must receive all of it before it can respond. So when inserting in bulk, try to split the work into multiple inserts rather than inserting too much data at once, and add a LIMIT clause to queries to avoid returning too much data in one go.
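As an illustrative sketch (the table and column names here are made up), a large insert can be split and result sets capped like this:

-- Split one huge multi-row insert into several smaller ones:
INSERT INTO test (id, name) VALUES (1, 'a'), (2, 'b'), (3, 'c');
INSERT INTO test (id, name) VALUES (4, 'd'), (5, 'e'), (6, 'f');

-- Cap the size of the result set coming back:
SELECT id, name FROM test ORDER BY id LIMIT 1000;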

The size of a single MySQL packet can be controlled with the parameter max_allowed_packet; the default is 4MB:

SHOW VARIABLES LIKE 'max_allowed_packet';

[Figure: the value of max_allowed_packet]

Query cache

Once the connection is established, if the query cache is enabled, the query enters the cache-lookup stage. You can check whether the cache is enabled with the following command:

SHOW VARIABLES LIKE 'query_cache_type';

[Figure: query_cache_type shown as OFF]

We can see that the cache is off by default. This is because MySQL's conditions for using the cache are very strict: lookups use a case-sensitive hash of the statement text, so even an extra space makes an otherwise identical query miss the cache. And once any row of a table changes, every cache entry for that table is invalidated. So the cache is generally not recommended, and the latest version, MySQL 8.0, has removed the query cache module entirely.
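For illustration, assuming the cache were enabled, the two statements below differ only in case and spacing, yet they hash differently and would occupy separate cache entries; cache activity in 5.7 can be watched through the Qcache status counters:

SELECT name FROM test WHERE id = 1;
select name  from  test where id = 1;  -- same query, different hash: cache miss

SHOW STATUS LIKE 'Qcache%';  -- hits, inserts, free memory, etc.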

Parser and preprocessor

After skipping the cache module, the query statement will enter the parser for analysis.

Lexical and syntax analysis (Parser)

The main task of this step is to check whether the SQL statement is grammatically correct. The statement is first broken into tokens. For example, select name from test where id=1 is broken into the 8 tokens select, name, from, test, where, id, =, 1, with keywords and non-keywords identified; a data structure is then built from the statement, called the parse tree (select_lex), as shown below:

[Figure: the parse tree (select_lex)]
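A quick way to see the parser at work: feed it a statement with a misspelled keyword, and it fails with a syntax error before any table is even looked at (a sketch; the exact message varies by version):

elect name from test where id = 1;
-- ERROR 1064 (42000): You have an error in your SQL syntax; ...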

Preprocessor

After the lexical and syntax analysis above, we know the statement is at least grammatically well-formed. What is left to check? Naturally, whether the table name, column names, and other referenced objects actually exist. Preprocessing checks the validity of such information as table names and field names.
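The contrast with the parser shows up in the error message. The statement below is grammatically fine, so it passes the parser, but the preprocessor rejects it (assuming test has no column called nickname):

SELECT nickname FROM test WHERE id = 1;
-- ERROR 1054 (42S22): Unknown column 'nickname' in 'field list'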

Query Optimizer

After the steps above, we have a valid SQL statement. A query, especially a complex multi-table query, can be executed in many different ways, and each way has different efficiency, so at this point the query optimizer must choose the most efficient execution plan.

The job of the query optimizer is to generate different execution plans (Execution Plan) based on the parse tree and then select an optimal one. MySQL uses a cost-based optimizer: whichever execution plan has the lowest estimated cost is the one chosen.

We can inspect the cost of the last query through the status variable Last_query_cost:

SELECT * FROM test;
show status like 'Last_query_cost';

[Figure: the Last_query_cost result]

The result in the figure above means MySQL estimates that SELECT * FROM test requires random reads of about 2 data pages to complete.
This figure comes from a series of calculations over statistics, including the number of pages in each table or index, the cardinality of indexes, the lengths of index entries and data rows, and the distribution of index values.

When evaluating cost, the optimizer does not take any cache into account; it assumes that reading any data requires a disk IO operation.

What optimizations can the optimizer do

The optimizer can perform many optimizations for us. Here are some common ones:

  • Reordering joins. The optimizer does not necessarily follow the join order written in the query; it executes in whatever order it estimates is best.
  • Converting outer joins to inner joins when possible.
  • Applying equivalence rules. For example, a<b and a=5 can be converted to a=5 and b>5.
  • Optimizing COUNT(), MIN() and MAX().
  • Evaluating and folding constant expressions.
  • Covering index scans, answering the query from the index alone without touching the rows.
  • Subquery optimization.
  • Terminating the query early. For example, an impossible condition lets the optimizer return immediately (see the EXPLAIN sketch after this list).
  • Equality propagation.
  • Optimizing the IN() operator. In many other databases IN is equivalent to a chain of ORs, but MySQL sorts the values in the IN list and then uses binary search to test whether the condition is met.
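A small sketch of early termination: the condition below can never hold, and EXPLAIN shows the optimizer notices this without touching the table at all:

EXPLAIN SELECT * FROM test WHERE 1 = 0;
-- the Extra column reports: Impossible WHERE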

In practice, the optimizer can do far more than what is listed above, and most of the time it is smarter than we are, so in most cases we can simply let it do its job. If we are certain the optimizer has not chosen the optimal plan, we can inform it with hints in the query, such as FORCE INDEX to force the use of an index, or STRAIGHT_JOIN to force the tables to be joined in the order we wrote them.
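Both hints can be sketched as follows (idx_name is a hypothetical index on test):

-- Force a particular index instead of letting the optimizer choose:
SELECT * FROM test FORCE INDEX (idx_name) WHERE name = 'a';

-- Force the join to proceed in the order the tables are written:
SELECT t1.name, t2.name
FROM test t1 STRAIGHT_JOIN test2 t2 ON t1.id = t2.id;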

The optimizer is not a panacea

The MySQL optimizer is not omnipotent: it cannot always turn a badly written SQL statement into an efficient query, and there are many reasons why it can make the wrong choice:

  • Inaccurate statistics. MySQL's cost estimates depend on statistics provided by the storage engine, and those statistics are sometimes far off.
  • The estimated cost of a plan is not the same as its actual execution cost. For example, the estimate ignores caching, while in reality some of the data may already be in memory.
  • What the optimizer considers best may not be what we need. For example, we may want the shortest execution time, but the optimizer picks the plan with the lowest estimated cost, which is not always the fastest.
  • The optimizer never considers other queries running concurrently.
  • The optimizer is not always purely cost-based; sometimes it follows fixed rules. For example, when a full-text index exists and the query contains a MATCH() clause, the optimizer will use the full-text index even if other indexes would be better.
  • The optimizer does not count costs beyond its control, such as the cost of executing stored procedures or user-defined functions.
  • Sometimes the optimizer cannot enumerate every possible execution plan, so it may miss the optimal one.

How the optimizer gets the query plan

The optimizer sounds rather abstract, something we can neither see nor touch, but in fact we can turn on optimizer tracing. Tracing is off by default because it hurts performance, so it is best to enable it only when you need to diagnose a problem, and turn it off again promptly.

SHOW VARIABLES LIKE 'optimizer_trace';
set optimizer_trace='enabled=on';

Next execute a query statement:

SELECT t1.name AS name1,t2.name AS name2 FROM test t1 INNER JOIN test2 t2 ON t1.id=t2.id

At this time, the analysis process of the optimizer has been recorded, and you can use the following statement to query:

SELECT * FROM information_schema.optimizer_trace;

Get the following results:

[Figure: the optimizer_trace result as shown in SQLyog]
The figure above only shows the shape of the data: it is the TRACE column queried directly in the SQLyog tool. To reproduce it yourself, run the query in a shell window; the TRACE column returned in the shell looks like this:

[Figure: the TRACE column returned in the shell]
As the outline in the screenshot shows, the trace is in JSON format.

The trace information is divided into three main parts (the screenshots above do not show the full content; try it yourself if you are interested, and remember to turn tracing off promptly afterwards):

  • Preparation phase (join_preparation): the query in expanded_query is the rewritten, optimized SQL
  • Optimization phase (join_optimization): considered_execution_plans lists all the candidate execution plans
  • Execution phase (join_execution)
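The whole tracing session, including the cleanup step, looks like this (a sketch using the tables from earlier):

SET optimizer_trace = 'enabled=on';   -- session scope; affects performance
SELECT t1.name AS name1, t2.name AS name2 FROM test t1 INNER JOIN test2 t2 ON t1.id = t2.id;
SELECT * FROM information_schema.optimizer_trace;  -- read the recorded trace
SET optimizer_trace = 'enabled=off';  -- turn it off again promptly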

Storage engine query

When the Server layer has obtained the execution plan for a SQL statement, it then calls the corresponding storage engine API to execute the query. Because MySQL's storage engines are pluggable, each storage engine exposes a set of corresponding APIs to the Server.

Return result

Finally, the query result is returned to the Server layer. If the cache is enabled, the Server layer writes the data into the cache while returning it.

MySQL returns query results incrementally: once execution begins and the first rows of the result are produced, MySQL can start streaming data to the client step by step. The advantage is that the server does not need to hold the complete result set, which reduces memory consumption. (We can ask for the opposite with the SQL_BUFFER_RESULT hint; like the FORCE INDEX and STRAIGHT_JOIN hints mentioned above, it forces the behavior we want rather than what the server would do by default.)
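A minimal sketch of that hint: SQL_BUFFER_RESULT asks the server to buffer the whole result in a temporary table before sending it, trading memory for releasing table locks sooner:

SELECT SQL_BUFFER_RESULT id, name FROM test;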

The execution flow of an update statement

Inserts, deletes, and updates all embed a query: the whole flow a select statement goes through, an update statement must go through as well, because the data to be updated has to be found (queried) before it can be changed.

Buffer Pool

InnoDB's data lives on disk, and there is an enormous gap between disk speed and CPU speed. To improve efficiency, a caching layer is introduced, which in InnoDB is called the Buffer Pool.

When data is read from disk, the page read is first placed into the Buffer Pool, so that the next read of the same page can be served directly from the pool.

When data is updated, InnoDB first checks whether the data is in the Buffer Pool; if it is, the data is modified directly in the pool. Note the precondition that no uniqueness check is needed on the data, because a uniqueness check would require loading the data from disk to decide whether the new value is unique.

If only the copy in the Buffer Pool is modified and the copy on disk is not, memory and disk become inconsistent; such a page is called a dirty page. A dedicated background thread in InnoDB writes Buffer Pool data back to disk, flushing multiple modifications at once at regular intervals. This action is called flushing.
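The pool size and the current number of dirty pages can both be inspected (MySQL 5.7 variable and status names):

SHOW VARIABLES LIKE 'innodb_buffer_pool_size';            -- pool size in bytes
SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_pages_dirty'; -- pages modified but not yet flushed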

This raises a problem: if every update had to be written through to the data files, the disk would have to seek to the corresponding record and then update it, and the IO and lookup costs of the whole process would be high. To solve this, InnoDB introduces the redo log and adopts the Write-Ahead Logging (WAL) approach.

redo log

The redo log is unique to the InnoDB engine and exists mainly to make MySQL crash-safe.

Write-Ahead Logging (WAL)

Write-Ahead Logging means writing the log first: when we perform an operation, it is first recorded in the log and only later written to the data files. You might ask: writing to the data file is a disk operation, and writing the redo log is also a disk operation, so why write the log first instead of writing the data directly? Isn't that an unnecessary detour?

Imagine that the data we need is scattered randomly across different sectors of different pages; finding and updating it in place is random IO, whereas the redo log is written cyclically, which is sequential IO. In one sentence:
flushing data pages is random I/O, while writing the log is sequential I/O, which is more efficient. Writing modifications to the log first therefore lets InnoDB postpone flushing, increasing system throughput.

How to flush the redo log

The redo log in InnoDB has a fixed size: it does not grow as it is written. The space is allocated up front, and once it is full, writing wraps around and overwrites the earliest records; a record may only be overwritten after it has been flushed to the data files, which is coordinated through the checkpoint. As shown below:

[Figure: circular writing of the redo log, with write pos and checkpoint]

The checkpoint is the position up to which records may be overwritten; write pos is the position where the log is currently being written. Writing proceeds cyclically, and a record must be applied to the data files before the old record can be overwritten. If write pos catches up with the checkpoint, the redo log is full; at that point some of the log must be applied to the data files on disk before writing can continue.
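The redo log's fixed size and flush-at-commit behavior are governed by a few variables (MySQL 5.7 names); the total redo capacity is innodb_log_file_size multiplied by innodb_log_files_in_group:

SHOW VARIABLES LIKE 'innodb_log_file_size';           -- size of each redo log file
SHOW VARIABLES LIKE 'innodb_log_files_in_group';      -- number of files written cyclically
SHOW VARIABLES LIKE 'innodb_flush_log_at_trx_commit'; -- 1 = flush redo to disk at every commit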

binlog

Looking at MySQL as a whole, there are really two parts: the Server layer, which handles things at the MySQL functional level, and the engine layer, which is responsible for storage-specific matters. The redo log discussed above is a log unique to the InnoDB engine; the Server layer also has its own log, called the binlog (binary log), also known as the archive log.

Some people may ask: why are there two logs?
Because at the beginning there was no InnoDB engine in MySQL. MySQL's own engine was MyISAM, but MyISAM supports neither transactions nor crash safety, and its binlog can only be used for archiving. Since InnoDB has to support transactions, it must be crash-safe, so it brought along its own log system and uses the redo log to achieve crash safety.

The difference between binlog and redo log

1. The redo log is unique to the InnoDB engine; the binlog is implemented by MySQL's Server layer and is available to all engines.
2. The redo log is a physical log, recording "what was changed on which data page"; the binlog is a logical log, recording the original logic of a statement, such as "add 1 to the c field of the row with id=2".
3. The redo log is written cyclically, reusing a fixed amount of space; the binlog is appended to: when a binlog file reaches a certain size, writing switches to the next file, and earlier logs are never overwritten.
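The binlog side can be inspected the same way (MySQL 5.7 commands):

SHOW VARIABLES LIKE 'log_bin';     -- whether binary logging is enabled
SHOW BINARY LOGS;                  -- the list of appended binlog files
SHOW VARIABLES LIKE 'sync_binlog'; -- 1 = fsync the binlog at every commit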

The execution flow of the update statement

All the groundwork above was mainly to establish the two concepts of redo log and binlog, because the update operation cannot do without these two files. Now let us formally return to the topic: how is an update statement executed? It can be represented by the following figure:

[Figure: the execution flow of an update statement]

The figure above can be summarized roughly as the following steps:
1. First, the record matching the update's condition is looked up; if it is cached, the cache is used
2. The Server layer calls the InnoDB engine's API; InnoDB writes the new data into memory and at the same time writes the redo log, marking the redo log record as prepare
3. InnoDB notifies the Server layer that the data can be formally committed
4. On receiving the notification, the Server layer immediately writes the binlog, then calls the corresponding InnoDB interface to issue a commit request
5. After receiving the commit request, InnoDB marks the data as committed
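Mapping the steps onto a concrete statement (the row and values echo the example in the next sections; the comments are annotations, not server output):

UPDATE test SET age = 19 WHERE id = 1;
-- 1. locate the row id=1, loading its page into the Buffer Pool if needed
-- 2. modify the page in memory and write the redo log record in prepare state
-- 3. the Server layer writes the change to the binlog
-- 4. commit: the redo log record is marked as committed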

In the steps above, notice that the redo log is written in two stages, prepare and then commit: this is two-phase commit.

Two-phase commit

Two-phase commit is a design idea from distributed transactions. A coordinator first sends a prepare request to every participant; only after all participants report that they are ready does it instruct them all to commit together.

Here, the redo log belongs to the storage engine layer and the binlog to the Server layer; they are two independent log files, and two-phase commit is what keeps the two logically consistent.

What if two-phase commit is not used

Suppose a row has id=1 and age=18, and we need to update its age to 19:

  • Write the redo log first, then the binlog.
    Suppose MySQL crashes after the redo log is written but before the binlog is. After restart, because the redo log is complete, the data is recovered automatically, so age=19. But the crash came before the statement reached the binlog, so the archived binlog does not contain it. If one day we lose the data and recover from the binlog, we will find this update missing.
  • Write the binlog first, then the redo log.
    Suppose MySQL crashes after the binlog is written but before the redo log is. After restart, because the redo log was never written, no automatic recovery happens, so the data is still age=18. But if one day we recover from the binlog, we will find the recovered data has age=19.

From these two scenarios we can see that without two-phase commit, the two logs end up inconsistent with each other. This matters especially when there are replicas, because primary/replica replication is based on the binlog: if the redo log and the binlog disagree, the primary and its replicas will hold different data.

Data recovery rules after a crash

1. If the transaction in the redo log is complete, that is, it already carries the commit mark, commit it directly;
2. If the transaction in the redo log has only a complete prepare record, check whether the corresponding transaction in the binlog exists and is complete: if so, commit the transaction; otherwise, roll it back.

Summary

This article analyzed the execution flows of select and update statements, and in analyzing the update flow it briefly introduced the related concepts of redo log and binlog. These topics are not explored in depth here; the introduction is only meant to give enough background to understand the update flow. Questions such as the relationship between the redo log and its corresponding in-memory buffer, redo log flushing strategies, binlog writing strategies, and why the redo log is needed alongside the binlog are not fully answered in this article, partly because of its length: going deeper would involve the InnoDB storage structures and other lower-level knowledge.

Origin: blog.csdn.net/qq_45401061/article/details/108647298