[Advanced articles] Detailed explanation of MySQL's SQL parsing principle


0. Preface

  1. Do you have an in-depth understanding of the SQL parsing process in MySQL and the role each stage plays in it?
  2. Are you curious about how MySQL parses a SQL statement into a series of "Item" and "TABLE_LIST" structures, and then uses them to execute the statement and return the results?
  3. Do you know what MySQL does during query optimization and execution?

So, let's take a close look at MySQL's SQL parsing principle. This article draws on "Application of SQL Parsing in Meituan" by the Meituan technical team.

1. SQL parsing process

1. Lexical analysis

This step decomposes the SQL statement into lexical units (tokens).
For example, the SQL statement "SELECT * FROM table WHERE id=1" is decomposed into the tokens "SELECT", "*", "FROM", "table", "WHERE", "id", "=", and "1".

Lexical analysis is the first step in the SQL parsing process. It splits the input SQL text into tokens one by one, much as we read an article word by word from left to right. In a compiler or a database, these symbols or words are called tokens.

In a SQL statement, a token may be a keyword, such as "SELECT", "FROM", or "WHERE"; an identifier, such as a table name or a field name; an operator, such as "+", "-", "*", or "/"; or a literal value, such as a string, a number, or a date.

The goal of lexical analysis is to identify these tokens and prepare for the syntax analysis stage. The lexical analyzer skips the spaces, tabs, and newlines that separate tokens, and usually produces an internal data structure, such as a token sequence, for the syntax analysis stage to consume.
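The idea can be sketched with a small regular-expression tokenizer. This is only an illustration under simplifying assumptions: the token classes, the keyword set, and the function name `tokenize` are made up for this example and are far smaller than what MySQL's own lexer recognizes.

```python
import re

# Hypothetical, heavily simplified keyword set (an assumption for this sketch).
KEYWORDS = {"SELECT", "FROM", "WHERE", "AND", "OR"}

TOKEN_PATTERN = re.compile(r"""
    (?P<WS>\s+)                          # whitespace between tokens (skipped)
  | (?P<NUMBER>\d+(?:\.\d+)?)            # integer or decimal literal
  | (?P<STRING>'(?:[^']|'')*')           # single-quoted string literal
  | (?P<IDENT>[A-Za-z_][A-Za-z0-9_]*)    # keyword or identifier
  | (?P<OP><=|>=|<>|!=|[=<>+\-*/])       # operators
  | (?P<PUNCT>[(),;.])                   # punctuation
""", re.VERBOSE)


def tokenize(sql):
    """Return a list of (kind, text) pairs for the given SQL text."""
    tokens = []
    pos = 0
    while pos < len(sql):
        match = TOKEN_PATTERN.match(sql, pos)
        if match is None:
            raise ValueError(f"unexpected character {sql[pos]!r} at offset {pos}")
        kind, text = match.lastgroup, match.group()
        pos = match.end()
        if kind == "WS":
            continue  # whitespace only separates tokens
        if kind == "IDENT" and text.upper() in KEYWORDS:
            kind = "KEYWORD"  # keywords are matched case-insensitively
        tokens.append((kind, text))
    return tokens


for kind, text in tokenize("SELECT name FROM student WHERE age > 20;"):
    print(kind, text)
```

Each token carries its kind as well as its text, which is what lets the next stage check grammar rules without looking at individual characters again.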

2. Syntax analysis

Based on the lexical analysis, the grammatical analysis will check whether the combination of lexical units conforms to the grammatical rules according to the predefined SQL grammatical rules, and build a grammatical parsing tree (Parse Tree). If the SQL statement does not conform to the grammatical rules, parsing fails and the SQL statement is considered invalid.

Suppose we have a simple SQL query: SELECT name FROM student WHERE age > 20;

In the lexical analysis stage, this statement is decomposed into a series of tokens:

SELECT, name, FROM, student, WHERE, age, >, 20, ;

In the grammatical analysis stage, these Tokens will be organized into a grammatical tree according to the grammatical rules of SQL.

[Figure: syntax tree of the example query]

In this syntax tree, each node represents a grammatical unit: "SELECT", "FROM", and "WHERE" represent different SQL clauses, "name" and "student" represent a column name and a table name, ">" represents the comparison operation, and "20" represents the value being compared.

This syntax tree reflects the grammatical structure of the SQL statement and provides the basis for subsequent semantic analysis and query optimization.
This may sound abstract, but it should be easy to follow if you have studied compiler principles.
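To make this less abstract, here is a minimal sketch that turns the example query into a nested structure. It is a hand-written recursive-descent parser for one tiny statement shape, chosen because it is short to read; it is not how MySQL's parser works, and the class name `Parser`, the dictionary layout, and the crude `tokenize` helper are all assumptions made for the example.

```python
import re


def tokenize(sql):
    # Crude tokenizer for this sketch: words, integers, and a few symbols.
    return re.findall(r"[A-Za-z_]\w*|\d+|<=|>=|<>|[=<>*,;]", sql)


class Parser:
    """Parses: SELECT column FROM table [WHERE operand op operand] [;]"""

    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def expect(self, expected=None):
        # Consume one token; if `expected` is given, it must match (case-insensitive).
        token = self.peek()
        if token is None or (expected is not None and token.upper() != expected):
            raise SyntaxError(f"expected {expected or 'a token'}, got {token!r}")
        self.pos += 1
        return token

    def parse_select(self):
        self.expect("SELECT")
        column = self.expect()
        self.expect("FROM")
        table = self.expect()
        node = {"type": "select", "columns": [column], "from": table, "where": None}
        if self.peek() is not None and self.peek().upper() == "WHERE":
            self.expect("WHERE")
            node["where"] = self.parse_comparison()
        if self.peek() == ";":
            self.expect(";")
        if self.peek() is not None:
            raise SyntaxError(f"unexpected token {self.peek()!r}")
        return node

    def parse_comparison(self):
        left = self.expect()
        op = self.expect()
        if op not in {"=", "<", ">", "<=", ">=", "<>"}:
            raise SyntaxError(f"expected a comparison operator, got {op!r}")
        right = self.expect()
        return {"type": "comparison", "op": op, "left": left, "right": right}


tree = Parser(tokenize("SELECT name FROM student WHERE age > 20;")).parse_select()
print(tree)
```

A statement that breaks the rule order, such as "SELECT name student", raises SyntaxError, which mirrors the point above that a statement failing the grammar rules is rejected as invalid.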

Readers who have not been exposed to compiler implementation will naturally wonder how such a syntax tree is generated. The underlying theory belongs to compiler construction. You can refer to an article on Wikipedia and the reference books in this link. I read part of this material while studying the MySQL source code, but the topic is too large for me to cover in depth with limited time and energy. From an engineering point of view, learning how to use Bison to build syntax trees for practical problems is probably more helpful in day-to-day work, so the discussion below is based on Bison.

4. Parsing tree

A parse tree is a tree structure in which each node represents a grammatical structure (such as an expression or a clause). The root node of the tree represents the entire SQL statement, and the leaf nodes represent the lexical units. By traversing this tree, we can get the structure and semantics of the SQL statement.
A parse tree (also called a syntax tree or derivation tree) represents an input that conforms to a given grammar. In computer science, especially in the design and implementation of compilers, parse trees play an important role. Building one is usually an intermediate step of a compiler or interpreter, which checks the syntax of the input and converts it into an internal data structure for the subsequent processing steps.

For SQL statements, by constructing its syntax analysis tree, we can better understand the structure of SQL statements and perform corresponding operations, such as query optimization, statement rewriting, etc.

[Figure: parse tree of a SQL statement]

In this tree, each node represents a grammatical structure, from the entire SQL statement (root node), to each clause (non-leaf node), and then to a specific lexical unit (leaf node). By traversing this tree, we can get the structure and semantics of the SQL statement.
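The traversal described here can be sketched as follows. The `Node` class, the node labels such as `select_clause`, and the hand-built tree are assumptions for illustration; a real parser would build the tree itself and would use its own node types.

```python
class Node:
    """A parse-tree node: a label plus an ordered list of children."""

    def __init__(self, label, children=None):
        self.label = label
        self.children = children or []


def leaves(node):
    """Depth-first, left-to-right traversal that yields the leaf labels."""
    if not node.children:
        yield node.label
        return
    for child in node.children:
        yield from leaves(child)


def render(node, depth=0):
    """Return the tree as indented lines, one node per line."""
    lines = ["  " * depth + node.label]
    for child in node.children:
        lines.extend(render(child, depth + 1))
    return lines


# Hand-built tree for: SELECT name FROM student WHERE age > 20
tree = Node("query", [
    Node("select_clause", [Node("SELECT"), Node("name")]),
    Node("from_clause", [Node("FROM"), Node("student")]),
    Node("where_clause", [
        Node("WHERE"),
        Node("comparison", [Node("age"), Node(">"), Node("20")]),
    ]),
])

print("\n".join(render(tree)))
print(list(leaves(tree)))
```

Reading the leaves from left to right gives back the original token sequence, while the inner nodes record which clause each token belongs to.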

5. MySQL syntax analysis tree generation process

MySQL uses a tool called Bison to generate a parser. Bison will automatically generate a program that can assemble lexical units into a parse tree according to the grammar rules we provide.

Bison is an open source parser generator developed by the GNU Project. It can automatically generate the corresponding parser according to the given context-free grammar.

In MySQL, Bison is mainly responsible for assembling the lexical units output by the lexical analyzer into a syntax analysis tree. Note that MySQL's lexical analyzer is hand-written rather than generated by Flex. Bison itself does not output a tree structure directly. Instead, it generates a bottom-up, table-driven LR-family parser (LALR(1) by default), which gradually constructs the syntax analysis tree by running the predefined action attached to a rule each time that rule is reduced.

Specifically, MySQL's Bison input file defines a series of productions (that is, the rules of the context-free grammar) and the actions associated with them. Whenever a Bison parser recognizes a production on the input stream, it executes the corresponding action. These actions mainly include creating new syntax structure objects (such as expressions, queries) and adding them to the current syntax parse tree.

Therefore, through Bison, MySQL can parse the results of lexical analysis and generate a corresponding parsing tree for further processing, such as query optimization and execution plan generation.
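The way rule actions build the tree can be imitated in a few lines. The sketch below is a toy shift-reduce loop for a two-rule expression grammar, with one Python function per rule playing the part of a Bison action. It is an assumption-laden simplification: a parser generated by Bison decides when to shift or reduce from generated state tables and a lookahead token, not by greedily matching the top of the stack as done here, and the grammar and node layout are invented for the example.

```python
def act_column(values):
    # Action for the rule:  expr : IDENT
    return {"type": "column", "name": values[0]}


def act_add(values):
    # Action for the rule:  expr : expr '+' IDENT
    return {"type": "add", "left": values[0],
            "right": {"type": "column", "name": values[2]}}


# (left-hand side, right-hand side, action); the longest right-hand side comes first.
RULES = [
    ("expr", ["expr", "+", "IDENT"], act_add),
    ("expr", ["IDENT"], act_column),
]


def parse(tokens):
    symbols, values = [], []  # parallel stacks: grammar symbols and semantic values
    for token in tokens:
        # Shift: push the next token and its text.
        symbols.append("+" if token == "+" else "IDENT")
        values.append(token)
        # Reduce while the top of the stack matches some rule's right-hand side.
        reduced = True
        while reduced:
            reduced = False
            for lhs, rhs, action in RULES:
                if symbols[-len(rhs):] == rhs:
                    result = action(values[-len(rhs):])
                    del symbols[-len(rhs):]
                    del values[-len(rhs):]
                    symbols.append(lhs)
                    values.append(result)
                    reduced = True
                    break
    if symbols != ["expr"]:
        raise SyntaxError("input is not a valid expression")
    return values[0]


print(parse(["a", "+", "b", "+", "c"]))
```

Every reduction replaces the matched symbols with a single new node, so by the time the input is consumed the value stack holds one object: the root of the tree.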

About parser generators: if you do Java development, you probably know ANTLR, a powerful parser generator that supports multiple languages including Java, C#, Python, JavaScript, Ruby, and Swift. It is widely used, either directly or as a reference, in syntax analysis frameworks. There is also JavaCC (Java Compiler Compiler), a parser generator for Java.

6. Core data structures and their relationships

In MySQL, the core data structures are "Item" and "TABLE_LIST". "Item" represents an expression, and "TABLE_LIST" represents a table. They are linked together through various relationships (such as joins and subqueries), and together they make up the structure of the SQL statement.

  1. Item: In MySQL, "Item" is an abstract concept used to represent an expression in a SQL statement. This expression may be a constant, a variable, a function call, or a more complex expression. For example, in the SQL statement "SELECT a + b FROM t", "a + b" is an "Item".

  2. TABLE_LIST: This data structure represents a table in the SQL statement. It contains the table name, alias, and other table-related information. "TABLE_LIST" is an important data structure that needs to be parsed and processed first when MySQL processes tables in SQL statements.

These two structures occupy an important position in MySQL's processing flow. When parsing a SQL statement, MySQL first parses the statement into a series of "Item" and "TABLE_LIST" objects, and then performs its calculations and operations on them in the query optimization and execution phases.
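A rough model of these two structures, using the "SELECT a + b FROM t" example from above, might look like the sketch below. The Python classes `Item`, `TableRef`, and `Query` and their fields are hypothetical stand-ins invented for illustration; MySQL's real structures are C++ types with many more members.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Item:
    """An expression node: a column, a literal, or an operator/function with arguments."""
    kind: str    # "column", "literal", or "func"
    value: str   # column name, literal text, or operator/function name
    args: List["Item"] = field(default_factory=list)


@dataclass
class TableRef:
    """Simplified stand-in for TABLE_LIST: one table referenced by the statement."""
    name: str
    alias: Optional[str] = None


@dataclass
class Query:
    select_list: List[Item]
    tables: List[TableRef]
    where: Optional[Item] = None


def columns_used(item):
    """Collect column names by walking an Item tree from left to right."""
    if item.kind == "column":
        return [item.value]
    names = []
    for arg in item.args:
        names.extend(columns_used(arg))
    return names


# SELECT a + b FROM t
query = Query(
    select_list=[Item("func", "+", [Item("column", "a"), Item("column", "b")])],
    tables=[TableRef("t")],
)

print(columns_used(query.select_list[0]))
print([table.name for table in query.tables])
```

Because an expression can contain other expressions, "a + b" is one Item whose arguments are themselves Items, while the table reference lives in a separate list that later phases can resolve first.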

7. Application of SQL analysis

The main application of SQL parsing is to execute SQL statements in the database. In addition, it is also used in various database tools, such as performance optimization tools, SQL audit tools, etc.

  1. Removal of useless conditions: In the process of SQL parsing, we can find out and remove those useless conditions by analyzing the syntax parse tree. For example, "WHERE 1=1" is a useless condition.

  2. SQL feature generation: By analyzing the syntax parsing tree, we can extract various features of SQL statements, such as query tables, query columns, and used functions. These features can be used for SQL classification, SQL similarity calculation and other tasks.
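Both applications can be sketched on a small condition tree. Everything here is an assumption made for illustration: conditions are plain tuples of the form (operator, left, right), with ("col", name) and ("lit", value) as leaves, and the "fingerprint" that replaces literals with "?" is just one possible feature that could feed classification or similarity tasks.

```python
def is_always_true(node):
    """True for '=' applied to two identical literals, e.g. 1 = 1."""
    return (node[0] == "=" and node[1][0] == "lit" and node[2][0] == "lit"
            and node[1][1] == node[2][1])


def simplify(node):
    """Drop always-true operands of AND; return None if nothing is left to check."""
    if node[0] == "and":
        left, right = simplify(node[1]), simplify(node[2])
        if left is None:
            return right
        if right is None:
            return left
        return ("and", left, right)
    if is_always_true(node):
        return None
    return node  # other operators (including OR) are left untouched in this sketch


def columns(node):
    """Feature: the set of column names a condition refers to."""
    if node is None or node[0] == "lit":
        return set()
    if node[0] == "col":
        return {node[1]}
    return columns(node[1]) | columns(node[2])


def fingerprint(node):
    """Feature: the condition's shape, with every literal replaced by '?'."""
    if node[0] == "col":
        return node[1]
    if node[0] == "lit":
        return "?"
    return f"({fingerprint(node[1])} {node[0]} {fingerprint(node[2])})"


# WHERE 1=1 AND age > 20
condition = ("and",
             ("=", ("lit", 1), ("lit", 1)),
             (">", ("col", "age"), ("lit", 20)))

cleaned = simplify(condition)
print(cleaned)
print(columns(cleaned))
print(fingerprint(cleaned))
```

Working on the tree rather than on the SQL text means "1=1", "1 = 1", and the same condition split across lines are all handled the same way.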

2. Reference documents

The following article by Meituan covers the topic in great detail and is recommended reading.
"Application of SQL Parsing in Meituan", author: Guangyou. https://tech.meituan.com/2018/05/20/sql-parser-used-in-mtdp.html

Origin blog.csdn.net/wangshuai6707/article/details/132588539