Analysis of the GaussDB SQL Query Statement Execution Process


This article is shared from the Huawei Cloud Community post "[GaussTech Issue 2] Analysis of the GaussDB SQL Query Statement Execution Process", author: GaussDB database.

The importance of SQL to relational databases is self-evident. Like the conductor of an orchestra, it guides the correct interpretation of the work and keeps the rhythm harmonious and unified. As a new-generation distributed relational database, Huawei Cloud GaussDB offers excellent technical performance and industry competitiveness. Many people are curious about GaussDB's key technologies and have left questions on the forum:

How are GaussDB SQL statements executed?

What is the principle of GaussDB SQL engine?

What are the key technical points of the GaussDB SQL engine?

…….

Today we will start with the GaussDB SQL engine and learn about the execution process of GaussDB SQL query statements, including the principles and key technical points of the GaussDB SQL engine.

If you have questions or are interested in any of the key technical points along the way, you can join the [Yunka Q&A] event to uncover the mystery of the GaussDB SQL engine. Leave a message to interact: experts will answer questions one-on-one, and you will have a chance to win a reward for asking.


First, let's briefly introduce the system architecture of GaussDB, and then analyze the execution process of GaussDB SQL query statements.

GaussDB system architecture

On Huawei Cloud, GaussDB has two deployment forms, centralized (Figure 1) and distributed (Figure 2):

Figure (1) Centralized

Figure (2) Distributed

During the execution of GaussDB SQL statements, the following key roles are involved:

GTM	Global Transaction Manager. Responsible for generating and maintaining globally unique information such as global transaction IDs, transaction snapshots, timestamps, and sequences.
CN	Coordinator Node. Receives access requests from applications and returns execution results to the client; decomposes tasks and schedules the task fragments for parallel execution on the DNs. Every CN is connected to every DN, and every CN holds an identical copy of the metadata.
DN	Data Node. Stores business data (row store, column store, and hybrid store are supported), executes data query tasks, and returns execution results to the CN.

Among them, DN is mainly responsible for the execution of GaussDB SQL statements. Its logical architecture is shown in the following figure:

Figure (3) GaussDB logical architecture

GaussDB includes two main engines: the SQL engine and the storage engine. The SQL engine, sometimes called the query processor, is responsible for SQL parsing, query optimization, and query execution. SQL parsing performs lexical, syntax, and semantic analysis on the input SQL statement to generate a query tree. The query tree then goes through rule-based optimization (RBO) and cost-based optimization (CBO) to produce an execution plan. Based on the execution plan, the executor reads, processes, updates, or deletes the relevant data to produce the result the user asked for.

The storage engine manages all data I/O. It covers data read/write processing (I/O requests for rows, indexes, pages, space allocation, and row versions), data buffer management (Buffer Pool), and transaction management. Transaction management includes the lock mechanism (Lock) and transaction log management (XLOG) that maintain the ACID properties.

Between the SQL engine and the storage engine sits the AM (Access Method) layer. The AM layer encapsulates the storage layer so that multiple storage engines (Astore, Ustore, and so on) can be supported. The SQL layer never calls the storage layer directly; it goes through the AM layer, which invokes the implementation that matches the storage engine in use.
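The role of the AM layer can be illustrated with a minimal Python sketch. Everything in it is hypothetical: the `AccessMethod` interface, the `open_table` registry, and the two toy classes are not GaussDB APIs, and the classes differ only in their internal container; they do not model how Astore or Ustore actually store data. The point is only that the SQL-layer function never touches a concrete storage engine.

```python
from abc import ABC, abstractmethod


class AccessMethod(ABC):
    """Hypothetical interface that the SQL layer programs against."""

    @abstractmethod
    def insert(self, row: dict) -> None: ...

    @abstractmethod
    def scan(self) -> list: ...


class AstoreAM(AccessMethod):
    """Toy engine backed by a list (illustrative only)."""

    def __init__(self) -> None:
        self._rows = []

    def insert(self, row: dict) -> None:
        self._rows.append(dict(row))

    def scan(self) -> list:
        return list(self._rows)


class UstoreAM(AccessMethod):
    """Toy engine backed by a dict keyed by an internal slot number."""

    def __init__(self) -> None:
        self._slots = {}

    def insert(self, row: dict) -> None:
        self._slots[len(self._slots)] = dict(row)

    def scan(self) -> list:
        return list(self._slots.values())


# Assumed registry: maps a table's storage type to its implementation.
AM_REGISTRY = {"astore": AstoreAM, "ustore": UstoreAM}


def open_table(storage_type: str) -> AccessMethod:
    """AM-layer dispatch: pick the implementation for the storage engine."""
    return AM_REGISTRY[storage_type]()


def count_rows(table: AccessMethod) -> int:
    """SQL-layer code: works with any engine through the AM interface."""
    return len(table.scan())
```

Adding a third engine in this sketch would mean adding one class and one registry entry, with no change to `count_rows`, which is the decoupling the text describes.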

It can be seen from the GaussDB logical architecture diagram that the GaussDB architecture design follows the design principles of modern software system abstraction and hierarchical decoupling, including: unified transaction mechanism, unified log system, unified concurrency control system, unified meta-information system, and unified cache management system. Therefore, the GaussDB technical architecture has the following main features:

Supports SQL optimization, execution, and storage layer decoupling;
Supports pluggable storage engines.

Execution process of SQL query statement

The execution process of a SQL query (SELECT) statement is as follows:

Figure (4) Query statement execution process

As the figure shows, a SQL statement first goes through SQL parsing to generate a query tree, then through query optimization to generate an execution plan; the execution plan is then handed to the query executor, which runs the physical operators.

SQL is a descriptive language between relational calculus and relational algebra. It absorbs the description of some logical operators in relational algebra, while abandoning the "procedural" part of relational algebra. The main function of SQL parsing is to compile a SQL statement into a query tree composed of relational operators, which usually includes lexical parsing, syntax parsing, and semantic parsing sub-modules.

Rule optimization (RBO) performs equivalent relational algebra transformation on the basis of the query tree, converting a SQL statement into a more efficient equivalent SQL, and plays a key role in the database optimizer. Especially in complex queries, it can bring orders of magnitude improvements in performance.

Query execution runs the SQL query statement according to the execution plan. How well the underlying storage method is chosen affects query execution efficiency.

Parser

GaussDB SQL parsing usually consists of lexical analysis, syntax analysis, and semantic analysis:

1. Lexical analysis: Identify the keywords, identifiers, operators, terminal symbols, and so on that the system supports in the query statement, and determine the type of each token.

The SQL standard defines the SQL keywords and grammar rules. During lexical analysis, GaussDB splits a SQL statement into independent atomic units based on the keywords and the separators between them; each unit is a token.

2. Syntax analysis: Match the tokens produced by lexical analysis against the defined SQL grammar rules and generate the corresponding abstract syntax tree (AST).

3. Semantic analysis: Check the validity of the syntax tree, verify that the tables, columns, functions, and expressions it references have corresponding metadata, and convert the abstract syntax tree into a query tree.

Semantic analysis is also the process of binding names to valid semantics. A syntax tree that passes these checks is converted into a query tree, which can be represented as a relational algebra expression.
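The three parsing stages can be sketched in Python for a deliberately tiny grammar, `SELECT col [, col ...] FROM table`. This is an illustration under stated assumptions, not GaussDB's parser: the token types, the `CATALOG` metadata, and the tuple-based query tree are all made up for the example.

```python
import re

KEYWORDS = {"SELECT", "FROM"}
TOKEN_RE = re.compile(r"\s*(?:(?P<ident>[A-Za-z_][A-Za-z_0-9]*)|(?P<comma>,))")
CATALOG = {"users": {"id", "name"}}  # hypothetical metadata: table -> columns


def lex(sql):
    """Lexical analysis: split the text into (token_type, value) pairs."""
    tokens, pos, sql = [], 0, sql.strip()
    while pos < len(sql):
        m = TOKEN_RE.match(sql, pos)
        if m is None:
            raise SyntaxError(f"unexpected character near position {pos}")
        if m.group("comma"):
            tokens.append(("COMMA", ","))
        elif m.group("ident").upper() in KEYWORDS:
            tokens.append(("KEYWORD", m.group("ident").upper()))
        else:
            tokens.append(("IDENT", m.group("ident")))
        pos = m.end()
    return tokens


def parse(tokens):
    """Syntax analysis: match tokens against the grammar and build an AST."""
    def expect(i, kind, value=None):
        ok = i < len(tokens) and tokens[i][0] == kind
        if not ok or (value is not None and tokens[i][1] != value):
            raise SyntaxError(f"expected {value or kind} at token {i}")
        return tokens[i][1]

    expect(0, "KEYWORD", "SELECT")
    columns, i = [expect(1, "IDENT")], 2
    while i < len(tokens) and tokens[i][0] == "COMMA":
        columns.append(expect(i + 1, "IDENT"))
        i += 2
    expect(i, "KEYWORD", "FROM")
    table = expect(i + 1, "IDENT")
    if i + 2 != len(tokens):
        raise SyntaxError("unexpected trailing tokens")
    return {"columns": columns, "table": table}


def analyze(ast, catalog=CATALOG):
    """Semantic analysis: check names against metadata, emit a query tree."""
    if ast["table"] not in catalog:
        raise LookupError(f"relation {ast['table']!r} does not exist")
    missing = [c for c in ast["columns"] if c not in catalog[ast["table"]]]
    if missing:
        raise LookupError(f"unknown columns: {missing}")
    # Query tree written as a relational algebra expression: Project(Scan).
    return ("Project", ast["columns"], ("Scan", ast["table"]))


query_tree = analyze(parse(lex("SELECT id, name FROM users")))
```

A statement such as `SELECT age FROM users` passes the first two stages here but fails in `analyze`, because `age` has no metadata in the assumed catalog.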

Optimizer

The optimizer is a very important means to improve query efficiency. It includes two parts: rule optimization and query optimization.

Rule Optimization (RBO)

Rule optimization is an equivalent relational algebra transformation based on the query tree. Since it is an optimization form based on relational algebra, it can also be called algebraic optimization. Although the results obtained by two relational algebra expressions are exactly the same, their execution costs may be very different, which forms the basis of rule optimization.

Rule optimization follows two basic principles:

(1) Equivalence: The output results of the original statement and the rewritten statement are the same.

(2) Efficiency: The rewritten statement takes less time to execute than the original statement and uses resources more efficiently.

GaussDB implements some key rule optimization techniques, such as:

Predicate pushdown: apply filter conditions earlier to reduce the number of rows processed

Elimination of redundant operations: remove redundant tables and columns to reduce the amount of computation

Subquery promotion: after promotion, more join orders can be matched

Outer-To-Inner conversion: an Inner Join can match more join orders

Sublink promotion: reduce subplan and broadcast operations

Elimination of unequal joins: reduce NestLoop and Broadcast operations
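The first rule in the list, predicate pushdown, illustrates both principles (equivalence and efficiency) and is easy to sketch. The Python example below uses made-up `orders` and `customers` data and a naive nested-loop join; it is not GaussDB code. It compares filtering after the join with filtering before the join and counts the rows the join produces in each case.

```python
def join(left, right, key):
    """Naive nested-loop equi-join on a shared key column."""
    return [{**l, **r} for l in left for r in right if l[key] == r[key]]


# Hypothetical data: 1000 orders spread evenly over 100 customers,
# of which 10 customers are in region "EU".
orders = [{"cust_id": i % 100, "amount": i} for i in range(1000)]
customers = [
    {"cust_id": i, "region": "EU" if i < 10 else "US"} for i in range(100)
]


def filter_after_join():
    """Original form: join everything, then apply the predicate."""
    joined = join(orders, customers, "cust_id")
    result = [row for row in joined if row["region"] == "EU"]
    return result, len(joined)


def filter_before_join():
    """Rewritten form: push the predicate below the join."""
    eu_customers = [c for c in customers if c["region"] == "EU"]
    joined = join(orders, eu_customers, "cust_id")
    return joined, len(joined)


rows_a, joined_a = filter_after_join()
rows_b, joined_b = filter_before_join()
```

Both forms return the same rows, but with this data the first one materializes 1000 joined rows while the pushed-down form materializes 100.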

In the process of serving a large number of customers, GaussDB abstracts business SQL usage patterns and implements some advanced rewriting rules. In future columns, we will introduce GaussDB’s rule optimization technology in detail.

Query optimization

Query optimization is based on the output of "rule optimization" and combined with the internal statistical information of the database to plan the specific execution method of the SQL statement, that is, the execution plan. Based on different optimization methods, query optimization technology can be divided into:

(1) CBO (Cost Based Optimization, cost-based query optimization): Estimate the cost of the candidate execution paths corresponding to the SQL statement, and select the lowest-cost execution path from the candidate paths as the final execution plan.

(2) ABO (AI Based Optimization, query optimization based on machine learning): By continuously learning from historical experience, ABO abstracts the patterns of the target scenario into a dynamic model and adapts its optimization to the user's actual workload, thereby obtaining the optimal execution plan.

GaussDB uses CBO-based optimization technology and combines it with ABO to actively explore modeling efficiency, estimation accuracy and adaptability. The steps are as follows:

Figure (5) Query optimization steps

Statistical information model: Statistics are the cornerstone of computing plan path costs. Their accuracy is crucial to row estimation and cost estimation in the cost model and directly affects the quality of the query plan. The statistics GaussDB keeps on base table data include distinct values, MCVs (most common values), histograms, and so on.

Row estimation: Once the selectivity of a constraint (predicate) has been determined, the number of rows each plan path needs to process can be determined, and the number of pages to process can be derived from the row count in preparation for cost estimation.
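As a rough illustration of how statistics turn into a row estimate, the sketch below estimates an equality predicate from an MCV list and a distinct-value count. The formula is a common textbook-style estimator and is an assumption here; the source does not state the formulas GaussDB uses. The `ColumnStats` structure and the fixed `rows_per_page` are likewise hypothetical.

```python
import math
from dataclasses import dataclass, field


@dataclass
class ColumnStats:
    total_rows: int
    n_distinct: int
    mcv: dict = field(default_factory=dict)  # value -> fraction of rows


def eq_selectivity(stats, value):
    """Selectivity of `column = value`."""
    if value in stats.mcv:
        # The value is a most common value: use its recorded frequency.
        return stats.mcv[value]
    # Otherwise assume the remaining rows are spread evenly over the
    # remaining distinct values.
    rest_fraction = 1.0 - sum(stats.mcv.values())
    rest_distinct = stats.n_distinct - len(stats.mcv)
    return rest_fraction / rest_distinct if rest_distinct > 0 else 0.0


def estimate_rows(stats, value):
    return round(stats.total_rows * eq_selectivity(stats, value))


def estimate_pages(rows, rows_per_page):
    """Assumes rows are packed densely, rows_per_page to a page."""
    return math.ceil(rows / rows_per_page)


stats = ColumnStats(total_rows=100_000, n_distinct=12,
                    mcv={"CN": 0.5, "US": 0.3})
common = estimate_rows(stats, "CN")
rare = estimate_rows(stats, "DE")
```

With these made-up statistics the common value is estimated at 50,000 rows and a value outside the MCV list at 2,000 rows, which shows why inaccurate statistics lead directly to inaccurate costs.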

Cost estimation: Estimate the execution cost of each operator based on the amount of data it processes. The sum of the operator costs is the total cost of the plan.

When the planned path processes pages, there is an I/O cost, and when the planned path processes tuples (for example, expression evaluation on tuples), there is a CPU cost. Since GaussDB is a distributed database, transmitting data between CN and DN will incur communication costs. Therefore, the total cost of a plan can be expressed as:

Total cost = IO cost + CPU cost + communication cost
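The formula can be turned into a small Python sketch. The unit costs in `CostParams` are invented for illustration and are not GaussDB's parameters or defaults; the example only shows how the three components add up and how a lower total cost picks the plan.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CostParams:
    # Hypothetical unit costs, chosen only for this example.
    io_cost_per_page: float = 1.0
    cpu_cost_per_tuple: float = 0.01
    comm_cost_per_tuple: float = 0.05


def plan_cost(pages, tuples, tuples_sent, params=CostParams()):
    """Total cost = IO cost + CPU cost + communication cost."""
    io = pages * params.io_cost_per_page
    cpu = tuples * params.cpu_cost_per_tuple
    comm = tuples_sent * params.comm_cost_per_tuple
    return io + cpu + comm


# Two candidate plans for the same query (made-up numbers): the second
# reads the same pages but ships far fewer tuples between nodes.
candidates = {
    "ship all rows to the CN": plan_cost(100, 10_000, 10_000),
    "filter on the DNs first": plan_cost(100, 10_000, 1_000),
}
cheapest = min(candidates, key=candidates.get)
```

In this example the communication term alone decides the choice, which is why a distributed optimizer has to model it explicitly.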

Path search: Search for join paths using an optimal-path algorithm (dynamic programming or a genetic algorithm) and find the optimal join path while keeping the search space as small as possible.

GaussDB uses a combination of bottom-up and random search modes. The search process is also a process of transforming from a query tree to an execution plan. For example, each table can have different scan operators, and logical join operators can also be converted into a variety of different physical join operators.
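The bottom-up part of the search can be sketched as a classic dynamic program over sets of tables. This is a generic textbook formulation, not GaussDB's implementation: the cost model (sum of intermediate result sizes), the `rows` and `selectivity` inputs, and the nested-tuple plan format are all assumptions, and the random (genetic) search mode is not shown.

```python
from itertools import combinations


def best_join_order(rows, selectivity):
    """rows: table -> row count.
    selectivity: frozenset({a, b}) -> join selectivity between a and b.
    Returns (cost, plan) where plan is a nested tuple of table names."""
    tables = sorted(rows)

    def cardinality(subset):
        card = 1.0
        for t in subset:
            card *= rows[t]
        for a, b in combinations(sorted(subset), 2):
            card *= selectivity.get(frozenset((a, b)), 1.0)
        return card

    # best[subset] = (cost, plan); scanning a single table costs 0 here.
    best = {frozenset([t]): (0.0, t) for t in tables}
    for size in range(2, len(tables) + 1):          # bottom-up by size
        for combo in combinations(tables, size):
            subset = frozenset(combo)
            for k in range(1, size // 2 + 1):       # every way to split
                for left_combo in combinations(combo, k):
                    left = frozenset(left_combo)
                    right = subset - left
                    cost = (best[left][0] + best[right][0]
                            + cardinality(subset))
                    if subset not in best or cost < best[subset][0]:
                        best[subset] = (cost, (best[left][1], best[right][1]))
    return best[frozenset(tables)]


cost, plan = best_join_order(
    rows={"A": 1000, "B": 100, "C": 10},
    selectivity={frozenset(("A", "B")): 0.01, frozenset(("B", "C")): 0.1},
)
```

For these made-up inputs the search joins `B` and `C` first, because that intermediate result (100 rows) is smaller than joining `A` with `B` (1000 rows) or forming the cross product of `A` and `C`.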

Plan generation: Convert the query execution path into an execution plan (PlanTree) and provide it to the executor for execution.

Query optimization can take a long time, especially when dealing with complex queries. Plan caching is an important feature of GaussDB. It can cache the execution plan of a query statement so that the execution plan in the cache can be directly used the next time the same query is executed, thereby avoiding repeated calculations and optimizing query performance.
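The basic idea of plan caching can be sketched in a few lines of Python. The `PlanCache` class, its whitespace-based cache key, and the `optimize` callback are hypothetical; GaussDB's actual cache, including its adaptive plan selection and automatic plan updates, is more involved than this.

```python
class PlanCache:
    """Minimal plan cache: optimize once per distinct query text."""

    def __init__(self, optimize):
        self._optimize = optimize  # the expensive optimizer call
        self._plans = {}
        self.hits = 0
        self.misses = 0

    @staticmethod
    def _key(sql):
        # Assumption: collapsing whitespace is enough to recognize
        # "the same query"; a real system would normalize more carefully.
        return " ".join(sql.split())

    def get_plan(self, sql):
        key = self._key(sql)
        if key in self._plans:
            self.hits += 1
        else:
            self.misses += 1
            self._plans[key] = self._optimize(sql)
        return self._plans[key]


optimizer_calls = []


def slow_optimize(sql):
    """Stand-in for the optimizer; records how often it is invoked."""
    optimizer_calls.append(sql)
    return ("Plan", " ".join(sql.split()))


cache = PlanCache(slow_optimize)
first = cache.get_plan("SELECT * FROM t WHERE id = 1")
second = cache.get_plan("SELECT  *  FROM t   WHERE id = 1")
```

The second call returns the cached plan object without invoking the optimizer again.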

[Key technical points] CBO + ABO: Introducing AI algorithms improves the CBO model and gives the query optimizer the ability to adjust its estimates dynamically based on the data.

[Key technical points] Plan cache: GaussDB's plan cache can adaptively select and automatically update plans. It automatically selects the best cached plan for different parameter configurations, keeping query performance stable and optimized.

Distributed query optimization

For a native distributed database, distributed query optimization is particularly important.

The GaussDB distributed architecture makes full use of the computing resources of each node, and its overall performance increases linearly as the node scale expands. In order to maximize performance and resource utilization under a distributed architecture, GaussDB supports four distributed plans, namely CN lightweight plan, FQS (Fast Query Shipping) plan, Stream plan and Remote-Query plan, as shown in the following figure:

Figure (6) Four distributed plans

CN lightweight plan (LIGHT_PROXY): the statement is sent directly to a single DN for execution
- Execution principle: The CN delivers the statement's QPBE message directly to the corresponding DN through a socket.
- Applicable scenarios: The statement can be executed entirely on one DN (single-shard statement).

FQS (Fast Query Shipping) plan (REMOTE_FQS_QUERY): the SQL statement itself is shipped to the DNs
- Execution principle: The CN generates a RemoteQuery plan directly, without going through the optimizer, and sends it to the DNs through the executor logic. Each DN generates and executes its own execution plan from the pushed-down statement, and the execution results are gathered on the CN.
- Applicable scenarios: The statement can be pushed down in full to multiple DNs for execution, and no data exchange is required between DNs.

Stream plan (STREAM): a distributed plan is generated and shipped to the DNs
- Execution principle: The CN uses the optimizer to generate an execution plan with stream operators from the original statement and sends it to the DNs for execution. Data is exchanged during DN execution (stream nodes): the stream operators establish connections between DNs to exchange data, and the CN gathers the execution results. The DNs do most of the computation.
- Applicable scenarios: Complex statements that require data exchange between CN and DN and between DNs during execution.

Remote-Query plan (REMOTE_QUERY): a distributed plan in which only part of the SQL statement is shipped to the DNs
- Execution principle: The CN uses the optimizer to generate RemoteQuery plans from parts of the original statement and sends each RemoteQuery to the DNs. After executing it, the DNs send the intermediate result data to the CN, which collects it and then executes the rest of the execution plan. The CN therefore does most of the computation.
- Applicable scenarios: The rare cases that do not meet the conditions for generating any of the first three plans; performance is very poor.
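The applicable scenarios of the four plans can be read as a decision order, sketched below in Python. This is a simplification derived only from the descriptions above: the four boolean inputs are hypothetical flags invented for the example, and the real conditions the CN checks are more detailed.

```python
from enum import Enum


class PlanType(Enum):
    LIGHT_PROXY = "CN lightweight"
    REMOTE_FQS_QUERY = "FQS"
    STREAM = "Stream"
    REMOTE_QUERY = "Remote-Query"


def choose_plan(single_shard, fully_pushable, needs_dn_exchange,
                stream_supported):
    """Pick a distributed plan type from (hypothetical) statement traits."""
    if single_shard:
        # The whole statement can run on one DN.
        return PlanType.LIGHT_PROXY
    if fully_pushable and not needs_dn_exchange:
        # Ship the statement to several DNs; the CN only gathers results.
        return PlanType.REMOTE_FQS_QUERY
    if stream_supported:
        # DNs exchange data through stream operators.
        return PlanType.STREAM
    # Fallback: the CN does most of the computation.
    return PlanType.REMOTE_QUERY


point_query = choose_plan(True, True, False, True)
join_on_other_key = choose_plan(False, False, True, True)
```

The ordering reflects the text: the cheaper plans are preferred whenever their conditions hold, and Remote-Query is the last resort.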

In a distributed architecture, the data of a single table is spread across different data nodes. When creating a table, you can choose to distribute its data across the DNs by hash or randomly. To perform a join between two tables correctly, the data of the two tables may need to be redistributed. For this reason, GaussDB's distributed execution plans add three Stream operators that put the data into a specific distribution.

Figure (7) Stream operator

When a distributed path is generated, the optimizer considers whether the rows of the two tables that match the join condition are located on the same data node. If they are not, it adds the corresponding data distribution operator. The Stream operator used for redistribution is chosen so as to minimize the data flowing between DNs.
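The principle of minimizing data flow between DNs can be sketched by counting the rows each strategy would move. The Python example below is a toy model: the strategy labels, the modulo-based placement standing in for a hash function, and the sample tables are assumptions, and it compares only two ways of moving data (rehashing on the join key and copying one table to every node), not the full set of Stream operators.

```python
def redistribute_cost(table, key, n_nodes):
    """Rows that must move so each row sits on node row[key] % n_nodes."""
    return sum(1 for node, row in table if row[key] % n_nodes != node)


def broadcast_cost(table, n_nodes):
    """Rows sent when every other node receives a full copy of the table."""
    return len(table) * (n_nodes - 1)


def choose_stream(left, right, key, n_nodes):
    """Return the cheapest strategy and the number of rows it moves."""
    options = {
        "redistribute both on join key": (
            redistribute_cost(left, key, n_nodes)
            + redistribute_cost(right, key, n_nodes)
        ),
        "broadcast left": broadcast_cost(left, n_nodes),
        "broadcast right": broadcast_cost(right, n_nodes),
    }
    name = min(options, key=options.get)
    return name, options[name]


N_NODES = 4
# Each table is a list of (current_node, row) pairs.
# big is distributed by "id", so its join key "cust_id" is scattered.
big = [(i % N_NODES, {"id": i, "cust_id": (i // 4) % 8}) for i in range(160)]
# small is already distributed by the join key.
small = [(c % N_NODES, {"cust_id": c}) for c in range(8)]

strategy, rows_moved = choose_stream(big, small, "cust_id", N_NODES)
```

With this data, rehashing the large table would move 120 rows, while copying the 8-row table to the other three nodes moves only 24, so broadcasting the small side wins. A cost of 0 for redistribution means the tables are already co-located on the join key and no data needs to move.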

It is precisely based on the reasonable use of Stream operators that it is possible for GaussDB to process large-scale data in a distributed architecture. Optimization of Stream operators is also an important part of GaussDB query optimization.

Figure (8) GaussDB distributed query optimization technology

[ Key technical points ] Distributed query optimization: four distributed execution plans and three Stream operators.

Executor

The executor receives as input the execution plan that the optimizer generated for the SQL query statement. The execution plan is made up of execution operators (Operators), expressions, and so on. It operates mainly on relations (sets of tuples) and finally outputs the result set the user wants. The common types of execution operators are:

1. Scan Plan Node

Scan nodes extract data from the underlying data source, which may be the file system or the network. Scan nodes generally sit at the leaves of the execution tree and act as the data input for execution. Typical examples are SeqScan, IndexScan, and SubQueryScan.

Key features: input data, leaf nodes, expression filtering

2. Control Plan Node

Control operators generally do not correspond to relational algebra operators. They are introduced so that the executor can carry out certain special processing flows. Examples are Limit, RecursiveUnion, and Union.

Key Features: Used to control data flow

3. Materialize Plan Node

Materialization operators are those whose algorithms require the data from lower-level operators to be cached while the operator logic runs. Because the amount of data returned by the lower-level operators cannot be predicted in advance, the algorithm must handle the case where not all of the data fits in memory. Typical examples are Agg and Sort.

Key feature: All data needs to be scanned before returning

4. Join Plan Node

Join operators handle the most common association (join) operations in a database. Depending on the processing algorithm and the data input source, they are divided into MergeJoin, NestLoop Join, and HashJoin.

Key feature: join queries

5. Other operators
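The scan, control, materialize, and join categories described above can be sketched with Python generators. The sketch assumes a pull-based, row-at-a-time iterator model, which the source does not state, and the function names and sample tables are illustrative only; it shows how the categories behave differently, not how GaussDB implements them.

```python
from itertools import islice


def seq_scan(table, predicate=lambda row: True):
    """Scan node: leaf of the plan tree; yields rows that pass the filter."""
    for row in table:
        if predicate(row):
            yield row


def hash_join(outer, inner, key):
    """Join node: build a hash table on the inner input, probe with outer."""
    buckets = {}
    for row in inner:
        buckets.setdefault(row[key], []).append(row)
    for row in outer:
        for match in buckets.get(row[key], []):
            yield {**row, **match}


def sort_node(child, key):
    """Materialize node: consumes all input before returning the first row."""
    yield from sorted(child, key=lambda row: row[key])


def limit_node(child, n):
    """Control node: stops pulling from its child after n rows."""
    yield from islice(child, n)


users = [{"user_id": 1, "name": "ann"}, {"user_id": 2, "name": "bob"}]
orders = [
    {"user_id": 1, "amount": 30},
    {"user_id": 2, "amount": 5},
    {"user_id": 2, "amount": 20},
    {"user_id": 1, "amount": 50},
]


def build_plan():
    """Plan tree: Limit(Sort(HashJoin(SeqScan(orders), SeqScan(users))))."""
    big_orders = seq_scan(orders, lambda row: row["amount"] > 10)
    joined = hash_join(big_orders, seq_scan(users), "user_id")
    return limit_node(sort_node(joined, "amount"), 2)


result = list(build_plan())
```

The scans are the leaves, the join combines two inputs, the sort has to see every joined row before it can emit one, and the limit simply stops asking for more.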

The architecture and technology of the executor also determine the overall operating efficiency of database query execution. The GaussDB execution engine fully combines the characteristics of modern hardware technology and uses a variety of modern software technologies such as vectorization engines and LLVM for efficient execution.

Figure (9) GaussDB fully parallel execution architecture

GaussDB also uses a variety of techniques to improve query execution performance when executing distributed plans. For example, when executing complex queries, compute-heavy operators such as AGG are pushed down to the DNs for execution. When a pushed-down operator runs, data locality is taken into account so that as much computation as possible happens locally, reducing the overhead of transferring data over the network.
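Aggregate pushdown can be sketched as a two-phase count: each DN aggregates its local rows, and the CN merges the small partial results. The Python example below is a toy model with made-up data and function names; it is not GaussDB's AGG operator, and it only compares how many rows cross the network in each approach.

```python
from collections import Counter


def partial_count(rows, key):
    """Runs on a DN: aggregate local rows only."""
    return Counter(row[key] for row in rows)


def final_count(partials):
    """Runs on the CN: merge the per-DN partial results."""
    total = Counter()
    for partial in partials:
        total += partial
    return total


# Hypothetical cluster: 3 DNs with 1000 rows each and 3 distinct cities.
dn_rows = [[{"city": "ABC"[i % 3]} for i in range(1000)] for _ in range(3)]

# Without pushdown: every row is shipped to the CN and counted there.
all_rows = [row for rows in dn_rows for row in rows]
rows_shipped_without_pushdown = len(all_rows)
cn_only_result = Counter(row["city"] for row in all_rows)

# With pushdown: each DN ships one partial count per distinct city.
partials = [partial_count(rows, "city") for rows in dn_rows]
rows_shipped_with_pushdown = sum(len(p) for p in partials)
pushdown_result = final_count(partials)
```

Both approaches produce the same counts, but in this example pushdown ships 9 partial results instead of 3000 rows.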

[Key technical points] Fully parallel execution architecture: MPP, SMP, LLVM, and SIMD are combined for fully parallel execution, exploiting the full performance of the system and making full use of CPU resources to improve query performance. We will introduce each of these technologies in later articles.
