How is a SQL statement executed by each component in the MySQL architecture?

Abstract: How does each component in the MySQL architecture take part in executing a SQL statement? What does the executor do, and what does the storage engine do? How is a join query executed step by step by the storage engine and the executor? This article walks through the answers.

This article is shared from the HUAWEI CLOUD community post "How is a SQL statement executed by each component in the MySQL architecture?", author: Zhuanyeyang__.

1. How each component of the MySQL architecture executes a single-table query

Let's first picture the components of the MySQL architecture, and then walk through them with a SQL statement.

Suppose the SQL statement is:

SELECT class_no FROM student WHERE name = 'lcy' AND age > 18 GROUP BY class_no

Here name is an indexed column. Let's go through the components in chronological order.

1. Client: The client (such as MySQL command line tool, Navicat, MySQL Workbench or other applications) sends SQL queries to the MySQL server.

2. Connector: The connector is responsible for establishing and managing the connection with the client. When a client connects to a MySQL server, the connector verifies the client's username and password, and then allocates a thread to process the client's requests.

3. Query cache: The query cache stores previously executed queries and their results. When a new query arrives, MySQL first checks whether an identical query is already in the cache. If it is, MySQL returns the cached result without executing the query again; otherwise it continues with normal execution. The query cache was removed in MySQL 8.0, so it is not explained in detail here.

4. Parser (analyzer):

    • Parse the query statement and check the syntax.
    • Verify the correctness of table and column names.
    • Generate a query tree.

5. Optimizer: Analyze the query tree, consider various execution plans, estimate the cost of different execution plans, and choose the best execution plan. In this example, the optimizer may choose to use the name index for the query, because name is an indexed column.

6. Executor: According to the execution plan selected by the optimizer, it sends a request to the storage engine to obtain data rows that meet the conditions.

7. Storage engine (such as InnoDB):

    • Performs the actual index scan, here an equality lookup on the name index of the student table. Because the query also needs age and class_no, which are not stored in the name index, each match requires a lookup back into the table (the clustered index), which may access the disk.
    • Before accessing the disk, check whether the required data pages already exist in InnoDB's buffer pool (Buffer Pool). If there are eligible data pages in the buffer pool, use the cached data directly. If the required data page is not in the buffer pool, load the data page from disk into the buffer pool.

8. Executor:

    • For each record returned, checks again whether it satisfies the index condition on name. This is because the data pages loaded into memory for the index condition may also contain records that do not match it, so the name condition is evaluated again. If it holds, the age > 18 filter condition is evaluated next.
    • Group the records satisfying the condition according to class_no.
    • The executor returns the processed result set to the client.
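
The division of labor in steps 6 to 8 can be sketched as a toy model in Python. This is only an illustration, not how MySQL is implemented: the in-memory STUDENT table, the NAME_INDEX structure, and the function names are all assumptions made up for this sketch.

```python
from collections import defaultdict

# Hypothetical stand-in for the student table (clustered index: id -> full row).
STUDENT = {
    1: {"id": 1, "name": "lcy", "age": 20, "class_no": "A"},
    2: {"id": 2, "name": "lcy", "age": 17, "class_no": "A"},
    3: {"id": 3, "name": "tom", "age": 22, "class_no": "B"},
    4: {"id": 4, "name": "lcy", "age": 19, "class_no": "B"},
}

# Hypothetical secondary index on name: name -> list of primary keys.
NAME_INDEX = defaultdict(list)
for pk, row in STUDENT.items():
    NAME_INDEX[row["name"]].append(pk)


def engine_index_lookup(name):
    """Storage-engine side: search the name index, then fetch each full row by primary key."""
    for pk in NAME_INDEX.get(name, []):
        yield STUDENT[pk]  # the lookup back into the table


def executor(name, min_age):
    """Server side: re-check the conditions, then group by class_no."""
    groups = set()
    for row in engine_index_lookup(name):
        if row["name"] == name and row["age"] > min_age:
            groups.add(row["class_no"])  # one output row per distinct class_no
    return sorted(groups)


print(executor("lcy", 18))
```

The executor only ever sees rows handed over by the engine function, which mirrors the request and response relationship described above.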

Throughout query execution, these components work together. The client sends the query, the connector manages the client connection, the query cache tries to reuse earlier results, the parser parses the query, the optimizer chooses the best execution plan, the executor runs the plan selected by the optimizer, and the storage engine (such as InnoDB) manages data storage and access. Together they let MySQL execute queries efficiently and return result sets.

Loading index and data pages into memory according to the conditions on indexed columns is done by the storage engine. Once the records are in memory, the executor evaluates the filter conditions on both indexed and non-indexed columns.
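
The "check the buffer pool first, read from disk only on a miss" behavior can be sketched as a small page cache. This is a simplified model with plain LRU eviction; InnoDB's real buffer pool management is more elaborate, and the BufferPool class and its read_page_from_disk callback are hypothetical names used only here.

```python
from collections import OrderedDict


class BufferPool:
    """Toy page cache with plain LRU eviction (a simplification of InnoDB's buffer pool)."""

    def __init__(self, capacity, read_page_from_disk):
        self.capacity = capacity
        self.read_page_from_disk = read_page_from_disk
        self.pages = OrderedDict()  # page_no -> page contents, oldest first
        self.disk_reads = 0

    def get_page(self, page_no):
        if page_no in self.pages:
            self.pages.move_to_end(page_no)  # hit: mark as most recently used
            return self.pages[page_no]
        self.disk_reads += 1  # miss: load the page from "disk"
        page = self.read_page_from_disk(page_no)
        self.pages[page_no] = page
        if len(self.pages) > self.capacity:
            self.pages.popitem(last=False)  # evict the least recently used page
        return page


pool = BufferPool(capacity=2, read_page_from_disk=lambda n: f"page-{n}")
for page_no in [1, 2, 1, 3, 2]:
    pool.get_page(page_no)
print(pool.disk_reads)  # the second access to page 1 is served from memory
```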

2. Where are the clauses of a SELECT statement executed?

In logical order of execution:

(1) FROM: The FROM clause is used to specify the data tables involved in the query. During query execution, the executor needs to obtain the data of the specified table from the storage engine according to the execution plan selected by the optimizer.

(2) ON: The ON clause is used to specify the connection condition, and it is usually used together with the JOIN clause. During query execution, the executor will obtain records satisfying the conditions from the storage engine according to the conditions in the ON clause. If the join condition involves indexed columns, the storage engine may use the index for optimization.

(3) JOIN: The JOIN clause is used to specify the connection method between tables (such as INNER JOIN, LEFT JOIN, etc.). During query execution, the executor obtains the data of the tables to be joined from the storage engine according to the execution plan selected by the optimizer. Then, the executor performs connection operations on the data according to the type of the JOIN clause and the connection conditions in the ON clause.

(4) WHERE: The executor filters the data returned from the storage engine and keeps only the records that satisfy the WHERE clause. Conditions that can use an index have already narrowed down the records at the storage engine layer.

(5) GROUP BY: The executor groups the records that meet the conditions of the WHERE clause according to the columns specified in the GROUP BY clause.

(6) HAVING: After grouping, the executor further filters the grouped records according to the condition of the HAVING clause.

(7) SELECT: The executor produces the columns and expressions listed in the SELECT clause from the remaining rows, following the execution plan selected by the optimizer.

(8) DISTINCT: The executor deduplicates the query results and returns only non-duplicate records.

(9) ORDER BY: The executor sorts the query results according to the columns specified in the ORDER BY clause.

(10) LIMIT: The executor truncates the query results according to the LIMIT clause and returns only the specified number of records.
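
The logical order above can be traced on a single table with a few lines of Python. This is a sketch of the logical order only (the optimizer is free to choose a different physical plan), and the sample rows and the query in the docstring are assumptions invented for the illustration.

```python
# Hypothetical rows, as if handed over by the storage engine for FROM student.
ROWS = [
    {"class_no": "A", "age": 20},
    {"class_no": "A", "age": 17},
    {"class_no": "A", "age": 21},
    {"class_no": "B", "age": 22},
    {"class_no": "B", "age": 25},
    {"class_no": "C", "age": 30},
]


def run_query(rows):
    """SELECT DISTINCT class_no, COUNT(*) FROM student WHERE age > 18
    GROUP BY class_no HAVING COUNT(*) >= 2 ORDER BY class_no DESC LIMIT 1"""
    kept = [r for r in rows if r["age"] > 18]                  # WHERE
    groups = {}
    for r in kept:                                             # GROUP BY class_no
        groups.setdefault(r["class_no"], []).append(r)
    groups = {k: g for k, g in groups.items() if len(g) >= 2}  # HAVING
    selected = [(k, len(g)) for k, g in groups.items()]        # SELECT
    selected = list(dict.fromkeys(selected))                   # DISTINCT (no effect here)
    selected.sort(key=lambda t: t[0], reverse=True)            # ORDER BY class_no DESC
    return selected[:1]                                        # LIMIT 1


print(run_query(ROWS))
```

Because HAVING runs after grouping, it can refer to COUNT(*), while WHERE cannot.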

3. How each component of the MySQL architecture executes a join query

SELECT s.id, s.name, s.age, sc.subject, sc.score
FROM student s
JOIN score sc ON s.id = sc.student_id
WHERE s.age > 18 AND sc.subject = 'math' AND sc.score > 80;

In this example, the score table has a composite index on (student_id, subject), and the student table has an index on age.

Let's go through the components in chronological order.

1. Connector: When the client connects to the MySQL server, the connector is responsible for establishing and managing the connection. It verifies the username and password provided by the client, determines that the client has appropriate permissions, and establishes a connection.

2. Query cache: Before processing the query, the MySQL server checks the query cache. If the same query and its result set are already cached, the server returns the cached result directly and skips the remaining steps. Since the query cache was removed in MySQL 8.0, we won't discuss it in detail in this example.

3. Parser: The main task of the parser is to parse the SQL query statement to ensure that the query syntax is correct. The parser breaks down the query statement into components such as tables, columns, conditions, etc. In this example, the parser identifies the tables involved (student and score) and the required columns (id, name, age, subject, score).

4. Optimizer: The responsibility of the optimizer is to generate an execution plan based on the information provided by the parser. It analyzes several possible execution strategies and chooses the one with the lowest cost. In this example, the optimizer might analyze various combinations of table scans and index scans and choose the one with the lowest cost.

5. Executor: Process the query according to the execution plan generated by the optimizer, send a request to the storage engine, and obtain the data rows that meet the conditions.

6. Storage engine (such as InnoDB): The storage engine is responsible for managing the storage and retrieval of data.

  • The storage engine first receives a request from the executor. The request says which rows to fetch and which access method to use (such as a full table scan or an index scan).
  • Suppose the optimizer has chosen index scans. In this example, the storage engine may first scan the student table using the age index, and then look up the score table using the composite index on (student_id, subject).
  • The storage engine searches the corresponding index structures. In the student table, it finds the records that satisfy age > 18. In the score table, it uses the composite index to find the records with the matching student_id and subject = 'math'. The score column is not part of that index, so score > 80 can only be checked once the full row has been read.
  • Once the records that meet the conditions are found, the storage engine needs to load the data pages where these records are located from disk into memory. The storage engine first checks the buffer pool (InnoDB Buffer Pool) to see if these data pages already exist in memory. If it already exists, there is no need to load it from disk again. If not present, the storage engine loads these data pages from disk into the buffer pool.
  • Records loaded into the buffer pool can be shared by multiple queries, which helps improve query efficiency.

7. Executor: handles operations such as joining, sorting, aggregation, and filtering.

    • Performs the join in memory, combining the data rows of the student table and the score table.
    • Filters the joined result set and keeps only the rows that meet the query conditions (age > 18, subject = 'math', score > 80).
    • Returns the filtered rows to the client as the query result.

As mentioned earlier, the data pages that the storage engine loads into memory for the index conditions may contain records that do not meet those conditions. Unless the executor evaluates the conditions again, it cannot tell which records qualify. So although the conditions have already been used in the storage engine, the executor still evaluates age > 18, subject = 'math', and score > 80.
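
One way to picture the join described above is a nested-loop join in which the inner table is probed through its composite index. The sketch below is a simplified model under that assumption, not MySQL's actual join code; the sample rows, SCORE_INDEX, and join_query are hypothetical.

```python
STUDENT = [
    {"id": 1, "name": "ann", "age": 20},
    {"id": 2, "name": "bob", "age": 17},
    {"id": 3, "name": "cat", "age": 22},
]
SCORE = [
    {"student_id": 1, "subject": "math", "score": 90},
    {"student_id": 1, "subject": "art", "score": 95},
    {"student_id": 2, "subject": "math", "score": 99},
    {"student_id": 3, "subject": "math", "score": 70},
]

# Hypothetical composite index: (student_id, subject) -> matching score rows.
SCORE_INDEX = {}
for row in SCORE:
    SCORE_INDEX.setdefault((row["student_id"], row["subject"]), []).append(row)


def join_query():
    result = []
    for s in STUDENT:                        # outer (driving) table: student
        if not s["age"] > 18:                # condition on the driving table
            continue
        # inner table: probe the (student_id, subject) index for this student
        for sc in SCORE_INDEX.get((s["id"], "math"), []):
            if sc["score"] > 80:             # score is not in the index: check the row
                result.append((s["id"], s["name"], s["age"], sc["subject"], sc["score"]))
    return result


print(join_query())
```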

4. With LEFT JOIN, what is the difference between putting the filter conditions in subqueries before joining and putting them in the WHERE clause?

First, look at the examples.

Query 1

SELECT s.id, s.name, s.age, sc.subject, sc.score
FROM student s
LEFT JOIN score sc ON s.id = sc.student_id
WHERE s.age > 18 AND sc.subject = 'math' AND sc.score > 80;

Query 2

SELECT s.id, s.name, s.age, sc.subject, sc.score
FROM (SELECT id, name, age FROM student WHERE age > 18) s
LEFT JOIN (SELECT student_id, subject, score FROM score WHERE subject = 'math' AND score > 80) sc 
ON s.id = sc.student_id

Query 3

SELECT s.id, s.name, s.age, sc.subject, sc.score
FROM student s
LEFT JOIN score sc ON s.id = sc.student_id AND s.age > 18 AND sc.subject = 'math' AND sc.score > 80;

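Conclusion first: queries 2 and 3 treat the right-table conditions (subject and score) in the same way, because a condition in the ON clause of a LEFT JOIN restricts which right-table rows can match before unmatched left rows are padded with NULL. They are not fully equivalent, though: in query 3 the condition s.age > 18 is part of the ON clause, so students aged 18 or younger are still returned (with NULL score columns), whereas query 2 removes them in the subquery. Only queries 1 and 2 are discussed below. They differ: putting the filter conditions in the WHERE clause is not the same as putting them in subqueries and then joining the results.

Analysis

Looking at the results, for query 1: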

SELECT s.id, s.name, s.age, sc.subject, sc.score
FROM student s
LEFT JOIN score sc ON s.id = sc.student_id
WHERE s.age > 18 AND sc.subject = 'math' AND sc.score > 80;

In this query, a LEFT JOIN is first performed to connect the student table and the score table. The join operation is based on the s.id = sc.student_id condition. The LEFT JOIN operation keeps all the rows in the left table (student table), even if they have no matching rows in the right table (score table). If there is no matching row in the right table, then the column of the right table will be displayed as NULL.

Then, the WHERE clause filters the joined result set and keeps only the rows that satisfy s.age > 18, sc.subject = 'math', and sc.score > 80. Rows whose right-table columns are NULL are therefore excluded, because NULL cannot satisfy sc.subject = 'math' or sc.score > 80. In effect, query 1 returns the same rows as an INNER JOIN.

For query 2:

SELECT s.id, s.name, s.age, sc.subject, sc.score
FROM (select id, name, age from student where age > 18) s
LEFT JOIN (select student_id, subject, score from score where subject = 'math' AND score > 80) sc
ON s.id = sc.student_id

In this query, we first execute two subqueries. The first subquery selects all rows from the student table with age > 18, while the second subquery selects all rows from the score table with subject = 'math' and score > 80. This means, before performing the join operation, we have filtered the two tables separately.

Next, a LEFT JOIN joins the result sets of the s and sc subqueries on the condition s.id = sc.student_id. Because LEFT JOIN retains all rows of the left table (the result set of the s subquery), rows whose right-table columns are NULL are included.

Result difference:

The main difference between query 1 and query 2 is where the filter conditions are applied. Query 1 applies them after the join, so rows whose right-table columns are NULL are removed by the right-table conditions. Query 2 filters each table before the join, so its result contains every left-table row that passes the left-table filter, paired either with the right-table rows that pass the right-table filter or with NULL.

If query 1 should keep the rows whose right side is NULL, move the right-table conditions from the WHERE clause into the ON clause: LEFT JOIN score sc ON s.id = sc.student_id AND sc.subject = 'math' AND sc.score > 80 WHERE s.age > 18; written this way, query 1 returns the same result set as query 2. Rewriting only the filter as WHERE s.age > 18 AND (sc.student_id is null OR (sc.subject = 'math' AND sc.score > 80)) is not enough: a student older than 18 whose score rows all fail the filter is joined to non-NULL rows, which the WHERE clause then removes, while query 2 keeps that student with NULL score columns.
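
The placement of the conditions is easy to check on a tiny data set. The sketch below uses Python's built-in sqlite3 module as a stand-in for MySQL (LEFT JOIN semantics are standard SQL, but this is an assumption about portability, not a MySQL test), and the four students and three scores are invented so that every case appears: a match, a student who is too young, a student whose only score fails the filter, and a student with no score at all.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE student (id INTEGER PRIMARY KEY, name TEXT, age INTEGER);
    CREATE TABLE score (student_id INTEGER, subject TEXT, score INTEGER);
    INSERT INTO student VALUES (1, 'ann', 20), (2, 'bob', 17), (3, 'cat', 22), (4, 'dan', 25);
    INSERT INTO score VALUES (1, 'math', 90), (2, 'math', 95), (3, 'math', 70);
""")

SELECT = "SELECT s.id, sc.subject, sc.score FROM "
QUERIES = {
    # Query 1: every condition in WHERE.
    "where": SELECT + """student s
        LEFT JOIN score sc ON s.id = sc.student_id
        WHERE s.age > 18 AND sc.subject = 'math' AND sc.score > 80""",
    # Query 2: both tables filtered in subqueries before the join.
    "subquery": SELECT + """(SELECT id FROM student WHERE age > 18) s
        LEFT JOIN (SELECT student_id, subject, score FROM score
                   WHERE subject = 'math' AND score > 80) sc
        ON s.id = sc.student_id""",
    # Query 3: every condition in ON.
    "on_all": SELECT + """student s
        LEFT JOIN score sc ON s.id = sc.student_id AND s.age > 18
            AND sc.subject = 'math' AND sc.score > 80""",
    # Query 1 with an IS NULL escape hatch in WHERE.
    "where_or_null": SELECT + """student s
        LEFT JOIN score sc ON s.id = sc.student_id
        WHERE s.age > 18 AND (sc.student_id IS NULL
            OR (sc.subject = 'math' AND sc.score > 80))""",
    # Query 1 with the right-table conditions moved into ON.
    "on_right": SELECT + """student s
        LEFT JOIN score sc ON s.id = sc.student_id
            AND sc.subject = 'math' AND sc.score > 80
        WHERE s.age > 18""",
}


def run(name):
    """Run one variant, ordered by student id so results are comparable."""
    return conn.execute(QUERIES[name] + " ORDER BY s.id").fetchall()


for name in QUERIES:
    print(name, run(name))
```

Student 3 (older than 18, with a math score of 70) is the row that separates the variants: it disappears whenever the right-table filter is evaluated after the join.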

Let's analyze the difference between these two queries executed in each component of the MySQL architecture

For query 1:

SELECT s.id, s.name, s.age, sc.subject, sc.score
FROM student s
LEFT JOIN score sc ON s.id = sc.student_id
WHERE s.age > 18 AND sc.subject = 'math' AND sc.score > 80;

  1. Connector: The client establishes a connection with the server.
  2. Query cache (before MySQL 8.0): Checks whether the result of this query is cached. If so, returns the result directly; otherwise, continues execution.
  3. Parser: Parses the query statement and checks whether the syntax is correct.
  4. Optimizer: Optimizes the query and generates an execution plan, determining the join order, the order of the filter conditions, and so on.
  5. Executor: Sends requests to the storage engine to execute the query.
  6. Storage engine (InnoDB): Reads the rows that satisfy the join condition (s.id = sc.student_id) from the buffer pool or from disk. Because it is a left join, a student with no matching score row is still kept, with NULL in the score columns.
  7. Executor: Left joins the rows obtained from the storage engine, applies the filter conditions s.age > 18, sc.subject = 'math', and sc.score > 80, and returns the result set to the client.

When the query conditions include indexed columns, the storage engine first uses the index to locate the records that meet the index conditions. Next, it loads the data pages that hold those records into the buffer pool in memory. Then the executor filters these records further in memory, using the conditions on both indexed and non-indexed columns.

When the query uses a non-clustered (secondary) index and has to go back to the table, pages of both the secondary index and the clustered index are loaded into memory. If the query only involves the clustered index (such as a primary key lookup), only pages of the clustered index need to be loaded.

For query 2:

SELECT s.id, s.name, s.age, sc.subject, sc.score
FROM (SELECT id, name, age FROM student WHERE age > 18) s
LEFT JOIN (SELECT student_id, subject, score FROM score WHERE subject = 'math' AND score > 80) sc 
ON s.id = sc.student_id

  1. Connector: The client establishes a connection with the server.
  2. Query cache (before MySQL 8.0): Checks whether the result of this query is cached. If so, returns the result directly; otherwise, continues execution.
  3. Parser: Parses the query statement and checks whether the syntax is correct.
  4. Optimizer: Decides which indexes to use and determines the join order.
  5. Executor: Sends requests to the storage engine to execute the subqueries.
  6. Storage engine (InnoDB): First scans the student table and loads the data pages holding the records that satisfy age > 18 into the buffer pool (if they are not already there). Then scans the score table with subject = 'math' AND score > 80 and loads the data pages holding the matching records into the buffer pool (if they are not already there).
  7. Executor: Applies all filter conditions to the data obtained from the storage engine and stores the filtered results in temporary tables. It then executes the main query: it reads from the temporary tables, performs the left join between s and sc on s.id = sc.student_id, and returns the joined result to the client.

From this we can see that query 2 filters first and then joins. The indexes on each table are very important: without a suitable index, each single-table filter has to scan the whole table.

When writing SQL, which form should be used, query 1 or query 2?

It depends on the situation. Note that in query 2, when the result set of a subquery is stored in a temporary table, that temporary table does not inherit the indexes of the original table, neither the clustered index nor the secondary indexes, so in the example above s.id and sc.student_id in the temporary tables no longer carry the original indexes. (Newer MySQL versions may avoid the temporary table altogether by merging a simple derived table into the outer query.) For query 1, all records that satisfy the join condition s.id = sc.student_id are loaded into memory and then filtered.

  1. When the single-table filters leave only a small amount of data, query 2 may be the better choice, because it reduces the amount of data that takes part in the join and so improves query efficiency. In the subquery phase, MySQL still uses the indexes on the original tables for filtering; only after the subquery has run is the filtered data stored in a temporary table. The point to optimize in query 2 is therefore to make each single-table filter use an index.
  2. When the single-table filters leave a large amount of data, query 1 may be more suitable, because the join itself can use the indexes on the original tables, which reduces the cost of the join. In query 2, because the temporary tables do not inherit the indexes, the join is relatively expensive.
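
The cost difference between joining against an indexed table and joining against an unindexed temporary table can be sketched by counting how many inner rows each approach examines. This is a deliberately simplified nested-loop model; it ignores other strategies a real server may apply, and the function names and generated rows are assumptions for illustration only.

```python
def nested_loop_scan(left_ids, right_rows):
    """No index on the inner side: every outer row scans the whole inner table."""
    examined, matches = 0, []
    for sid in left_ids:
        for row in right_rows:
            examined += 1
            if row[0] == sid:
                matches.append((sid, row))
    return matches, examined


def nested_loop_index(left_ids, right_index):
    """Index on the inner join column: every outer row touches only its matches."""
    examined, matches = 0, []
    for sid in left_ids:
        for row in right_index.get(sid, []):
            examined += 1
            matches.append((sid, row))
    return matches, examined


# Hypothetical data: 100 students, one math score row each.
left_ids = list(range(100))
right_rows = [(sid, "math", 60 + sid % 40) for sid in range(100)]

# The index exists before the join starts, as it would on the base table.
right_index = {}
for row in right_rows:
    right_index.setdefault(row[0], []).append(row)

scan_matches, scan_examined = nested_loop_scan(left_ids, right_rows)
index_matches, index_examined = nested_loop_index(left_ids, right_index)
print(scan_examined, index_examined)
```

Both functions return the same matches; only the number of inner rows examined differs, and the gap grows with the product of the two table sizes.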

5. What is the difference between a clustered index lookup and a full table scan?

What is the difference between using the PRIMARY index (the clustered index) and a full table scan? To be precise, with the InnoDB storage engine, the data read by a full table scan and the data of the clustered index are stored in the same place in the InnoDB tablespace: they are the same pages. So both operate on the clustered index (the leaf nodes of the clustered-index B+ tree are the complete user records, sorted by primary key and containing all the columns of the table). The difference is:

A full table scan reads the leaf nodes of the clustered-index B+ tree sequentially from left to right and evaluates the conditions on every record.

A clustered-index lookup searches the B+ tree in a binary-search-like way to reach the specified range. For example, in select * from demo_info where id in (1, 2) the condition column is the primary key id, so the PRIMARY index can be used to locate the rows quickly.
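
The contrast can be sketched with a sorted list standing in for the leaf level of the clustered index. Here Python's bisect module plays the role of the B+ tree descent, which is a simplification (a real B+ tree walks pages from the root to a leaf); LEAF_ROWS and both functions are hypothetical.

```python
import bisect

# Hypothetical leaf records of the clustered index, sorted by primary key.
LEAF_ROWS = [(i, f"user{i}") for i in range(1, 1001)]
KEYS = [row[0] for row in LEAF_ROWS]


def full_table_scan(wanted):
    """Read every leaf record from left to right and test the condition."""
    examined, found = 0, []
    for row in LEAF_ROWS:
        examined += 1
        if row[0] in wanted:
            found.append(row)
    return found, examined


def primary_key_lookup(wanted):
    """Locate each key directly; binary search stands in for the B+ tree descent."""
    found = []
    for pk in sorted(wanted):
        i = bisect.bisect_left(KEYS, pk)
        if i < len(KEYS) and KEYS[i] == pk:
            found.append(LEAF_ROWS[i])
    return found


scan_rows, scan_examined = full_table_scan({1, 2})
print(scan_rows, scan_examined)
print(primary_key_lookup({1, 2}))
```

Both paths return the same two rows for id in (1, 2), but the scan has to look at all 1000 records to do so.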

In MyISAM, the table data and the index data are stored separately. However, InnoDB has replaced MyISAM as the recommended storage engine: since MySQL 5.5, InnoDB has been the default storage engine for MySQL.

Originally, InnoDB stored all data and indexes, including clustered indexes and secondary indexes (also known as non-clustered or auxiliary indexes), in a shared tablespace file named ibdata1. Since MySQL 5.6.6, innodb_file_per_table is enabled by default, so the data and indexes of each table are stored in that table's own .ibd file.

 


Origin my.oschina.net/u/4526289/blog/8703122