How a SQL Query Executes: An Analysis of the Underlying Principles

SQL is everywhere. It is no longer the exclusive skill of engineers; it seems that everyone can write SQL, just as everyone is a product manager. If you do back-end development, CRUD is routine. If you do data warehouse development, writing SQL may take up most of your working day. Once we know the SELECT syntax, we also need to understand how a SELECT is executed under the hood; only then do we really understand SQL. This article breaks the execution of a SQL query down step by step. I hope it is helpful to you.

Data preparation

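This article explains how SQL queries are executed, so it avoids complicated SQL operations. It uses only two tables, citizen and city, with the data shown below. The city names are stored in Chinese: 上海 (Shanghai), 北京 (Beijing), and 杭州 (Hangzhou); some of the result tables later in the article show the English names.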

CREATE TABLE citizen (
    name CHAR(20),
    city_id INT(10)
);

CREATE TABLE city (
    city_id INT(10),
    city_name CHAR(20)
);

INSERT INTO city
VALUES
    (1, '上海'),
    (2, '北京'),
    (3, '杭州');

INSERT INTO citizen
VALUES
    ('tom', 3),
    ('jack', 2),
    ('robin', 1),
    ('jasper', 3),
    ('kevin', 1),
    ('rachel', 2),
    ('trump', 3),
    ('lilei', 1),
    ('hanmeiei', 1);

Query execution order

The query used in this article is shown below. It joins the citizen table with the city table, filters out the rows whose city_name is '上海' (Shanghai), groups the remaining rows by city_name, keeps only the cities that have at least 2 citizens, and finally sorts the result by city name and returns at most 2 rows:

Query statement

SELECT 
    city.city_name AS "City",
    COUNT(*) AS "citizen_cnt"
FROM citizen
  JOIN city ON citizen.city_id = city.city_id 
WHERE city.city_name != '上海'
GROUP BY city.city_name
HAVING COUNT(*) >= 2
ORDER BY city.city_name ASC
LIMIT 2
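
To try the query without a MySQL server, the tables and the statement above can be loaded into an in-memory SQLite database from Python. This is only a sketch: SQLite is an assumption made here for convenience (the DDL in this article is MySQL-flavored), and the sort order of the Chinese city names depends on the collation of whichever engine is used.

```python
import sqlite3

# In-memory SQLite database, used here as a stand-in for the article's MySQL setup.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE citizen (name CHAR(20), city_id INT);
    CREATE TABLE city (city_id INT, city_name CHAR(20));
""")
conn.executemany("INSERT INTO city VALUES (?, ?)",
                 [(1, "上海"), (2, "北京"), (3, "杭州")])
conn.executemany("INSERT INTO citizen VALUES (?, ?)",
                 [("tom", 3), ("jack", 2), ("robin", 1), ("jasper", 3),
                  ("kevin", 1), ("rachel", 2), ("trump", 3), ("lilei", 1),
                  ("hanmeiei", 1)])

query = """
    SELECT city.city_name AS "City", COUNT(*) AS "citizen_cnt"
    FROM citizen
      JOIN city ON citizen.city_id = city.city_id
    WHERE city.city_name != '上海'
    GROUP BY city.city_name
    HAVING COUNT(*) >= 2
    ORDER BY city.city_name ASC
    LIMIT 2
"""
rows = conn.execute(query).fetchall()
print(rows)  # [('北京', 2), ('杭州', 3)]
```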

Steps

The clauses of the SQL query above are written in this order:

SELECT ... FROM ... WHERE ... GROUP BY ... HAVING ... ORDER BY ... LIMIT ...

They are not executed in that order, however. The actual execution order is:

1. Get data ( From, Join )
2. Filter data ( Where )
3. Group ( Group by )
4. Group Filtering ( Having )
5. Return the query fields ( Select )
6. Sorting and paging ( Order by & Limit / Offset )
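
One visible consequence of this order is that an aggregate cannot appear in WHERE, because no groups exist yet when WHERE runs, while the same predicate is valid in HAVING. The sketch below checks this with SQLite from Python; the helper try_query and the three-row sample are introduced here only for illustration, and the exact error text differs between engines.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE citizen (name CHAR(20), city_id INT)")
conn.executemany("INSERT INTO citizen VALUES (?, ?)",
                 [("tom", 3), ("jack", 2), ("jasper", 3)])

def try_query(sql):
    """Return the rows, or the error message if the engine rejects the query."""
    try:
        return conn.execute(sql).fetchall()
    except sqlite3.OperationalError as exc:
        return f"error: {exc}"

# WHERE runs before GROUP BY, so there is no group (and no COUNT) to test yet.
in_where = try_query(
    "SELECT city_id, COUNT(*) FROM citizen WHERE COUNT(*) >= 2 GROUP BY city_id")

# HAVING runs after GROUP BY, so the same predicate is accepted here.
in_having = try_query(
    "SELECT city_id, COUNT(*) FROM citizen GROUP BY city_id HAVING COUNT(*) >= 2")

print(in_where)   # an error message from the engine
print(in_having)  # [(3, 2)]
```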

Note: This article explains the general, logical principles of SQL execution. It does not consider optimization techniques such as predicate pushdown and projection pushdown.

The underlying principle of execution

The execution order described above is the so-called underlying principle. When a SELECT statement is executed, each step produces a virtual table, and that virtual table is the input of the next step. Note that these intermediate steps are invisible to the user; only the final result is returned.

Notice that a SELECT statement starts executing at the FROM step. If multiple tables are joined at this stage, the join goes through the following steps:

Get data ( From, Join )

1. First, a CROSS JOIN computes the Cartesian product of the tables, giving the virtual table vt1-1.
2. Next, the ON condition is applied as a filter: vt1-1 is the input and the virtual table vt1-2 is the output.
3. Finally, outer rows are added. With a LEFT JOIN, RIGHT JOIN, or FULL JOIN, the rows of the preserved table that found no match are added back to vt1-2, giving the virtual table vt1-3. An inner join, as in this article's query, has no outer rows to add.
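
The three sub-steps can be sketched in plain Python on a shortened copy of the data. The row ("nobody", 9) is hypothetical and is added only so that the outer-row step has something to do; the variable names mirror the virtual tables above. This illustrates the logical model, not how a database engine actually implements a join.

```python
from itertools import product

# Shortened sample; ("nobody", 9) is a hypothetical citizen with no matching city.
citizen = [("tom", 3), ("jack", 2), ("robin", 1), ("nobody", 9)]
city = [(1, "上海"), (2, "北京"), (3, "杭州")]

# vt1-1: Cartesian product of the two tables (CROSS JOIN).
# Each row is (name, citizen.city_id, city.city_id, city_name).
vt1_1 = [c + t for c, t in product(citizen, city)]

# vt1-2: keep only the combinations that satisfy the ON condition.
vt1_2 = [row for row in vt1_1 if row[1] == row[2]]

# vt1-3: for a LEFT JOIN, add back the left-table rows that found no match,
# padding the right-table columns with None (NULL).
matched = {row[:2] for row in vt1_2}
vt1_3 = vt1_2 + [c + (None, None) for c in citizen if c not in matched]

print(len(vt1_1), len(vt1_2), len(vt1_3))  # 12 3 4
```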

Filter data ( Where )

After the steps above we have the final virtual table vt1. The WHERE filter is applied to vt1 to remove the rows that do not satisfy the condition, giving the virtual table vt2.

Group ( Group by )

The WHERE filter produced vt2. Next, the GROUP BY operation is applied to vt2 to get the intermediate virtual table vt3.

Group Filtering ( Having )

On the basis of the virtual table vt3, HAVING filters out the groups that do not satisfy the condition, giving vt4.

Return the query fields ( Select )

Once the filtering is complete, we can pick the fields to extract from the table, which is the SELECT and DISTINCT stage. The SELECT phase first extracts the target fields, and the DISTINCT phase then removes duplicate rows; they produce the intermediate virtual tables vt5-1 and vt5-2 respectively.
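
The article's own query has no DISTINCT, so here is a small separate sketch of the two sub-phases, again using SQLite from Python as an assumed stand-in. Selecting city_id from citizen yields one row per citizen (vt5-1), and DISTINCT then collapses the duplicates (vt5-2).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE citizen (name CHAR(20), city_id INT)")
conn.executemany("INSERT INTO citizen VALUES (?, ?)",
                 [("tom", 3), ("jack", 2), ("robin", 1), ("jasper", 3),
                  ("kevin", 1), ("rachel", 2), ("trump", 3), ("lilei", 1),
                  ("hanmeiei", 1)])

# vt5-1: the SELECT phase extracts the target field, one row per input row.
vt5_1 = conn.execute("SELECT city_id FROM citizen").fetchall()

# vt5-2: the DISTINCT phase removes duplicate rows.
vt5_2 = conn.execute("SELECT DISTINCT city_id FROM citizen").fetchall()

print(len(vt5_1), len(vt5_2))  # 9 3
```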

Sorting and Paging ( Order by & Limit / Offset )

After extracting the desired fields, we can sort by the specified fields. This is the ORDER BY phase, which produces the virtual table vt6. Finally, the LIMIT phase takes the specified rows from vt6 and produces the virtual table vt7, which is the final result.
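
The chain of virtual tables from vt1 to vt7 can be imitated in plain Python, with each step consuming the output of the previous one. This is a sketch of the logical model only; a real engine is free to reorder and merge these steps. It starts from vt1 as already joined, reduced to the two columns the later steps need.

```python
from collections import defaultdict

# vt1: the joined rows (name, city_name), as produced by the FROM/JOIN phase.
vt1 = [("tom", "杭州"), ("jack", "北京"), ("robin", "上海"),
       ("jasper", "杭州"), ("kevin", "上海"), ("rachel", "北京"),
       ("trump", "杭州"), ("lilei", "上海"), ("hanmeiei", "上海")]

# vt2: WHERE city_name != '上海'
vt2 = [row for row in vt1 if row[1] != "上海"]

# vt3: GROUP BY city_name
vt3 = defaultdict(list)
for name, city_name in vt2:
    vt3[city_name].append(name)

# vt4: HAVING COUNT(*) >= 2
vt4 = {city: names for city, names in vt3.items() if len(names) >= 2}

# vt5: SELECT city_name AS "City", COUNT(*) AS "citizen_cnt"
vt5 = [(city, len(names)) for city, names in vt4.items()]

# vt6: ORDER BY city_name ASC (Python compares code points here; a database
# would use the column's collation instead).
vt6 = sorted(vt5, key=lambda row: row[0])

# vt7: LIMIT 2
vt7 = vt6[:2]
print(vt7)  # [('北京', 2), ('杭州', 3)]
```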

Detailed execution step analysis

Step 1: Get data ( From, Join )

FROM citizen
JOIN city

The first step is to execute the FROM clause and then the JOIN clause. Conceptually, the result of these operations is the Cartesian product of the two tables:

name	city_id	city_id	city_name
tom	3	1	Shanghai
tom	3	2	Beijing
tom	3	3	Hangzhou
jack	2	1	Shanghai
jack	2	2	Beijing
jack	2	3	Hangzhou
robin	1	1	Shanghai
robin	1	2	Beijing
robin	1	3	Hangzhou
jasper	3	1	Shanghai
jasper	3	2	Beijing
jasper	3	3	Hangzhou
kevin	1	1	Shanghai
kevin	1	2	Beijing
kevin	1	3	Hangzhou
rachel	2	1	Shanghai
rachel	2	2	Beijing
rachel	2	3	Hangzhou
trump	3	1	Shanghai
trump	3	2	Beijing
trump	3	3	Hangzhou
lilei	1	1	Shanghai
lilei	1	2	Beijing
lilei	1	3	Hangzhou
hanmeiei	1	1	Shanghai
hanmeiei	1	2	Beijing
hanmeiei	1	3	Hangzhou

After FROM and JOIN have been executed, the required rows are selected according to the ON condition of the JOIN:

ON citizen.city_id = city.city_id

name	city_id	city_id	city_name
tom	3	3	Hangzhou
jack	2	2	Beijing
robin	1	1	Shanghai
jasper	3	3	Hangzhou
kevin	1	1	Shanghai
rachel	2	2	Beijing
trump	3	3	Hangzhou
lilei	1	1	Shanghai
hanmeiei	1	1	Shanghai
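
The sizes of the two intermediate tables above can be checked quickly: 9 citizens times 3 cities gives 27 rows in the Cartesian product, and the ON condition keeps one row per citizen. The sketch below counts them with SQLite from Python, which is an assumption made for convenience.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE citizen (name CHAR(20), city_id INT)")
conn.execute("CREATE TABLE city (city_id INT, city_name CHAR(20))")
conn.executemany("INSERT INTO city VALUES (?, ?)",
                 [(1, "上海"), (2, "北京"), (3, "杭州")])
conn.executemany("INSERT INTO citizen VALUES (?, ?)",
                 [("tom", 3), ("jack", 2), ("robin", 1), ("jasper", 3),
                  ("kevin", 1), ("rachel", 2), ("trump", 3), ("lilei", 1),
                  ("hanmeiei", 1)])

# Row count of the Cartesian product (no join condition).
cross_cnt = conn.execute(
    "SELECT COUNT(*) FROM citizen CROSS JOIN city").fetchone()[0]

# Row count after the ON condition has been applied.
joined_cnt = conn.execute(
    "SELECT COUNT(*) FROM citizen JOIN city ON citizen.city_id = city.city_id"
).fetchone()[0]

print(cross_cnt, joined_cnt)  # 27 9
```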

Step 2: Filter data ( Where )

The rows that satisfy the join condition are then passed to the WHERE clause, which evaluates its conditional expression against each row. A row is removed from the set unless the expression evaluates to true.

WHERE city.city_name != '上海'

name	city_id	city_id	city_name
tom	3	3	Hangzhou
jack	2	2	Beijing
jasper	3	3	Hangzhou
rachel	2	2	Beijing
trump	3	3	Hangzhou
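
"Not true" is broader than "false": a comparison with NULL evaluates to NULL, so such a row is dropped as well. The sketch below shows this with SQLite from Python; the city with id 4 and a NULL name is hypothetical and is not part of the article's data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE city (city_id INT, city_name CHAR(20))")
conn.executemany("INSERT INTO city VALUES (?, ?)",
                 [(1, "上海"), (2, "北京"), (3, "杭州"),
                  (4, None)])  # hypothetical city whose name is NULL

# For row 4 the predicate NULL != '上海' evaluates to NULL, which is not true,
# so the row is filtered out together with the 上海 row.
kept = conn.execute(
    "SELECT city_id FROM city WHERE city_name != '上海' ORDER BY city_id"
).fetchall()

print(kept)  # [(2,), (3,)]
```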

Step 3: Group ( Group by )

The next step is to execute the Group by clause, which groups rows with the same value. After that, all Select expressions will be evaluated by group instead of row by row.

GROUP BY city.city_name

GROUP_CONCAT(citizen.`name`)	city_id	city_name
jack,rachel	2	Beijing
tom,jasper,trump	3	Hangzhou
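
The grouped view above can be reproduced with GROUP_CONCAT. The sketch below loads the five rows that survived the WHERE step into a table named vt2 (a name chosen here only for illustration) and runs the grouping in SQLite from Python. The order of names inside each concatenated string is not guaranteed, so the code sorts them before use.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE vt2 (name CHAR(20), city_id INT, city_name CHAR(20))")
conn.executemany("INSERT INTO vt2 VALUES (?, ?, ?)",
                 [("tom", 3, "杭州"), ("jack", 2, "北京"), ("jasper", 3, "杭州"),
                  ("rachel", 2, "北京"), ("trump", 3, "杭州")])

# One output row per group: the member names and the size of the group.
groups = {
    city_name: (sorted(names.split(",")), cnt)
    for names, city_name, cnt in conn.execute(
        "SELECT GROUP_CONCAT(name), city_name, COUNT(*) "
        "FROM vt2 GROUP BY city_name")
}

print(groups["北京"])  # (['jack', 'rachel'], 2)
print(groups["杭州"])  # (['jasper', 'tom', 'trump'], 3)
```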

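Step 4: Group Filtering ( Having )

The grouped data is filtered using the predicate in the Having clause. Both groups satisfy the predicate here (Beijing has 2 rows and Hangzhou has 3), so neither is removed.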

HAVING COUNT(*) >= 2

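Step 5: Return the query fields ( Select )

In this step, the processor evaluates what the query result will output and whether any functions have to be run on the data, such as Distinct, Max, Sqrt, Date, Lower, and so on. In this example, the SELECT clause outputs only the city name and the count(*) value of its group, and uses the identifier "City" as the alias of the city_name column.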

SELECT 
    city.city_name AS "City",
    COUNT(*) AS "citizen_cnt"

City	citizen_cnt
北京	2
杭州	3

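Step 6: Sorting and paging ( Order by & Limit / Offset )

The final processing step of the query deals with the ordering and size of the result set. In our example, the rows are sorted by city name in ascending order and at most two rows are output.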

ORDER BY city.city_name ASC
LIMIT 2

City	citizen_cnt
北京	2
杭州	3
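
The example uses LIMIT only. Paging adds OFFSET to skip the rows of earlier pages. The sketch below pages through the two result rows one at a time with SQLite from Python; the table name vt5 and the helper page are hypothetical names introduced for this illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE vt5 (City CHAR(20), citizen_cnt INT)")
conn.executemany("INSERT INTO vt5 VALUES (?, ?)", [("杭州", 3), ("北京", 2)])

def page(page_no, page_size=1):
    """Return one page of vt5 ordered by City; page_no starts at 1."""
    return conn.execute(
        "SELECT City, citizen_cnt FROM vt5 ORDER BY City ASC LIMIT ? OFFSET ?",
        (page_size, (page_no - 1) * page_size),
    ).fetchall()

print(page(1))  # [('北京', 2)]
print(page(2))  # [('杭州', 3)]
print(page(3))  # []
```

Without an ORDER BY, the rows that land on each page are not guaranteed to be stable, which is why sorting and paging are handled together in this step.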

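Summary

This article analyzed the execution order and underlying principles of a SQL statement; a basic SQL query goes through six major steps. Using a concrete example, it showed the detailed result of each step, which should give a deeper understanding of how a query is executed.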
