Detailed explanation of the SQL statement parsing process

This article uses the Flex and Bison tools to implement a simple SQL parser that ultimately generates an abstract syntax tree.

It first introduces the principles of Flex and Bison respectively, and then gives a complete demo of the SQL parser.

1. Input the SQL statement

2. Flex lexical analyzer 

2.1 Principle of Flex

1. Use the flex tool to define regular expression rules to match different types of lexical units; for example, you can define the following rules: 

  • Match keywords: SELECT, FROM, WHERE, HAVING, etc.
  • Match identifiers: start with a letter or underscore, followed by letters, digits, or underscores.
  • Match operators: such as =, <, >, +, etc.
  • Match constants: including integers, floating-point numbers, strings, etc.

2. Generate lexical analyzer code: according to the defined lexical rules, use the Flex tool to generate the corresponding lexical analyzer code;

3. Input query string: provide the query string to be parsed as input to the lexical analyzer;

4. Scanning and matching: the lexical analyzer reads characters one by one from the input string and tries to match them with the defined lexical rules;

5. Generating lexical units: when the lexical analyzer matches a lexical rule, it generates the corresponding lexical unit (token) and returns it to the parser (see the sketch after this list). Each token usually contains two pieces of information:

  • Lexical unit type (token type): indicates the type of the lexical unit, such as keyword, identifier, operator, etc.;
  • Lexical unit value (token value): indicates the specific value of the lexical unit;

6. Continue scanning: the lexical analyzer will continue to read characters from the input string, and repeat steps 4 and 5 until the entire query string is completely parsed into a series of lexical units;

7. Return lexical unit sequence: When the entire query string is parsed, the lexical analyzer will return a sequence containing all lexical units to the parser for subsequent grammatical analysis processing;
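The following sketch shows roughly what steps 4 and 5 look like inside a Flex rules section. It is only an illustration: the token codes (SELECT, ID, INT) and the yylval fields (num, str) are assumed to come from a Bison-generated header and a %union declaration that are not shown here.

%{
#include "parser.tab.h"   /* assumed Bison-generated header defining SELECT, ID, INT and yylval */
#include <stdlib.h>
#include <string.h>
%}

%%
SELECT                  { return SELECT; }                              /* token type only: no extra value needed */
[0-9]+                  { yylval.num = atoi(yytext); return INT; }      /* token type INT plus its numeric value */
[A-Za-z_][A-Za-z0-9_]*  { yylval.str = strdup(yytext); return ID; }     /* identifier text as the token value */
[ \t\n]                 { /* skip whitespace */ }
%%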

2.2 Flex file code structure

The flex file code is as follows:

%option noyywrap
%{
definition
%}

%%
rules
%%
Code

(1) %option specifies characteristics of the Flex scanner. The yywrap function is normally only needed when scanning several input files in sequence. Some commonly used options are:

  • noyywrap: tells Flex not to call the yywrap function;
  • yylineno: tells Flex to maintain an integer variable named yylineno that holds the current line number;
  • case-insensitive: makes the regular expression rules case-insensitive;
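Several options can also be combined on a single %option line, for example:

%option noyywrap yylineno case-insensitive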

(2) The definition part contains header file inclusions, variable declarations, function declarations, comments, etc.; this part is copied verbatim into the generated .c file.

(3) The rules part defines the lexical rules: each regular expression pattern is followed by { } containing the action code that runs when the pattern is matched. "|" is a special action meaning that the pattern uses the same action as the next pattern. If a pattern's action is empty (for example just a ";"), the matched text is simply discarded.

(4) The code part is ordinary C code. yylex() is the scanning function generated by Flex; calling yylex() starts the scan.

2.3 Commonly used variables in Flex files

(1) yytext: yytext is a global character buffer in Flex that holds the text of the currently matched lexical unit. In the lexical rules, when a pattern is matched, the matched text can be obtained through yytext.

(2) yyleng: yyleng is a global integer variable in Flex that holds the length of the currently matched lexical unit. In the lexical rules, the length of the matched text can be obtained through yyleng.

(3) yylval: yylval is a variable shared with Bison (usually a union) that is used to pass values between the lexical analyzer and the parser. It can store values of different types, defined as needed. In the lexical rules, additional information can be passed to the parser by setting yylval.
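The type of yylval is declared on the Bison side. A minimal sketch (the field names num and str are illustrative assumptions, not part of this article's demo) looks like this; a Flex action such as yylval.str = strdup(yytext); return ID; then fills it in:

%union {
    int   num;    /* integer constants */
    char *str;    /* identifier and string text */
}
%token <num> INT
%token <str> ID STRING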

2.4 Specific cases of Flex files

1. Create a file named lexer.l, which contains the lexical rules;

%{
#include <stdio.h>
%}

%%
SELECT                  { printf("Keyword: SELECT\n"); }
FROM                    { printf("Keyword: FROM\n"); }
WHERE                   { printf("Keyword: WHERE\n"); }
AND                     { printf("Keyword: AND\n"); }
OR                      { printf("Keyword: OR\n"); }

[0-9]+                  { printf("Number: %s\n", yytext); }

[A-Za-z_][A-Za-z0-9_]*  { printf("Identifier: %s\n", yytext); }
[=><]+                  { printf("Operator: %s\n", yytext); }
[ \t\n]                 { /* skip whitespace */ }

.                       { printf("Unknown: %s\n",yytext); }
 
%%

int main() {    
    yylex();   
    return 0;
}

2. Use the flex command to compile the lexer.l file and generate the lexical analyzer code 

(1) Execute the following statement to generate lexical analyzer code

flex lexer.l

(2) Flex generates the lexical analyzer source file

lex.yy.c

(3) Compile the generated lexical analyzer code to generate an executable file

gcc -o lexer lex.yy.c -lfl

(4) Run the executable file and enter a simple SQL statement for testing

./lexer

Input: SELECT * FROM table;

(5) The execution results are as follows
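Based on the rules above, the output should look roughly like this ("*" and ";" have no dedicated rules, so they fall through to the "." rule):

Keyword: SELECT
Unknown: *
Keyword: FROM
Identifier: table
Unknown: ;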

Notes on the linker options:

  • -ll: the linking option used with older versions of Flex (e.g. Flex 2.5.4). It tells the linker to use a library file named libl.a or libl.so. In older versions, the scanner generated by Flex was named lex.yy.c and the support library's name began with "l", so -ll was the traditional choice.
  • -lg: the linking option used with some newer versions of the Flex generator (e.g. Flex 2.5.35). Similar to the older -ll, it tells the linker to use a library file named libg.a or libg.so; the new name avoids conflicts with other tools and libraries.
  • -lfl: the option for the scanner support library shipped with Flex. -lfl tells the linker to use a library file named libfl.a or libfl.so, which contains the runtime support functions required by Flex.

Notice:

        If linking the Flex-generated scanner reports an error such as:

        /opt/rh/devtoolset-11/root/usr/libexec/gcc/x86_64-redhat-linux/11/ld: cannot find -lfl

Solution:

        This error indicates that the linker cannot find the libfl library. This is usually because the libfl library is missing on your system, or the path to the library file is not configured correctly. To resolve this issue, you can try the following steps:

1. Confirm whether the library has been installed: First, please make sure that the libfl library has been installed on your system. You can try using your package manager to install it. On Red Hat based systems, you may need to execute a command similar to the following:

yum install flex-devel

2. Check the library file path: if the library is installed but the linker still cannot find it, the library path may not be configured correctly. You can specify the path to the library file manually. For example, assuming the libfl library file is located in the /usr/lib64 directory, you can use the following link command:

gcc -o myprogram lex.yy.c -L/usr/lib64 -lfl

3. Update the library file cache: If you have recently installed the libfl library, but the linker still cannot find it, you may need to update the library file cache. Run the following command to update the library file cache:

sudo ldconfig

3. Bison parser

        Bison (GNU Bison) is a tool for generating parsers based on an extended version of the Yacc (Yet Another Compiler Compiler) tool. Bison takes a context-free grammar as input and generates a LALR(1) (Look-Ahead LR(1)) parser.

3.1 Bison principle

(1) Defining a grammar: Use Bison's grammar to define a context-free grammar. This grammar describes the grammatical rules of the language to be analyzed.

(2) Generate parser code: Run the Bison tool, taking the defined grammar as input. Bison will generate a parser C source code file according to the grammar.

(3) Compile the parser: use the C compiler to compile the generated C source code file into an executable parser.

(4) Run the parser: Pass the input to be analyzed to the generated parser, and the parser will analyze it according to the defined grammar.

(5) Syntactic analysis: The parser uses the LALR (1) algorithm for grammatical analysis. It reads the input symbol stream and uses the state transition table to deduce whether the input symbol sequence conforms to the grammar rules.

(6) Grammatical error handling: If the input symbol sequence does not conform to the grammar rules, the parser will detect a grammatical error. At this point, Bison will call the yyerror function for error handling, and you can customize the yyerror function to handle errors.

(7) Semantic actions: semantic actions can be specified in grammar rules. They are code fragments executed during parsing and are used to build abstract syntax trees, perform semantic checks, and so on (a short sketch follows this list).

(8) Generate an abstract syntax tree: Through semantic actions, the parser can build an abstract syntax tree (AST), which represents the structure of the input conforming to the grammar rules.

(9) Subsequent processing: Once the parser completes the syntax analysis and generates an abstract syntax tree, you can perform further semantic analysis, code generation and other subsequent processing as needed.
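As a hedged illustration of steps (7) and (8) only (this is not the article's final demo; the node type and the new_node/new_leaf helpers are invented for illustration), a semantic action can build an AST node bottom-up like this:

%{
/* Illustrative AST node and constructors -- these names are assumptions, not from the demo */
typedef struct node { char op; struct node *left, *right; int value; } node;
node *new_node(char op, node *left, node *right);   /* builds an interior node */
node *new_leaf(int value);                          /* builds a leaf for a constant */
%}

%union { struct node *n; int num; }
%token <num> NUMBER
%type  <n>   expression
%left '+'

%%
expression : expression '+' expression { $$ = new_node('+', $1, $3); }  /* attach both subtrees to a new '+' node */
           | NUMBER                    { $$ = new_leaf($1); }           /* leaf node for a constant */
           ;
%%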

3.2 Bison file code structure

  The Bison file code is as follows:

%{
// Declarations of C code and header files
#include <stdio.h>
// Global variables, functions, etc. can be defined here
%}
// Bison directives
%verbose                // write a detailed description of the generated parser to an extra .output file

// Bison declarations section
%token NAME             // declare the names of terminal symbols (tokens)
%token NUMBER

%left '+' '-'           // operator precedence and associativity
%left '*' '/'

%{
// More C code can be written here
%}

// Bison rules section
%%
// Grammar rule definitions
expression : expression '+' expression
            | expression '-' expression
            | expression '*' expression
            | expression '/' expression
            | '(' expression ')'
            | NUMBER
            ;
// more grammar rules ...
%%

// C code section (the epilogue after the second %%)
// C code related to the grammar rules can be written here
int main() {
    yyparse();  // call the parse function generated by Bison
    return 0;
}

  The writing format of the bison file is basically the same as that of the flex file, but the definition syntax of the rules is different.

3.3 Special symbols commonly used in Bison files

(1) “Grammar”

        A "grammar" is a set of rules that describe a programming language or the grammatical structure of a language. These rules define the language's syntax (syntax), which combinations are valid, legal statements and expressions, and how they fit together. Grammar rules are expressed in the form of productions, which contain combinations of terminals and non-terminals.

        Grammar rules are represented in Bison files in the form of BNF (Backus-Naur Form) or EBNF (Extended Backus-Naur Form). BNF is a formal representation used to define context-free grammars (Context-Free Grammar), which are used to specify the grammar rules of programming languages.

expression : expression '+' term
           | expression '-' term
           | term
           ;

(2) %start

        The %start directive is used to specify the start nonterminal of the grammar. The start nonterminal is the entry point of the syntax analysis, that is, the grammar rule from which the parser begins building the parse tree.

%start program

%%

program    : statements ;

statements : statement
           | statements statement ;

statement  : assignment
           | if_statement
           | while_statement
           | /* ... other statement types ... */ ;

        %start program specifies the start nonterminal as program. This means that the parser will start from the program rule, gradually expand other non-terminal symbols, and finally build a parse tree. In actual grammar rules, the choice of the start nonterminal depends on the grammatical structure of the language you want to analyze.

(3) $

        In grammar rules, $ is used to refer to the symbols or values on the right-hand side of the current production: $1 refers to the first element (terminal or nonterminal) on the right-hand side, $2 refers to the second element, and so on. These references are used to pass values from the right-hand side of the production to its left-hand side. Note: the numbering of the right-hand side elements starts at 1.

(4) $$

        In grammar rules, $$ is used to refer to the result of the current production. When the Bison parser finishes analyzing a production and computes its result, the result is assigned to $$. This is typically used to build nodes of a parse tree or provide results for higher level grammar rules.
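A minimal sketch, assuming the semantic values are plain integers:

expression : expression '+' term { $$ = $1 + $3; }   /* $1 = left expression, $3 = term, result stored in $$ */
           | term                { $$ = $1; }        /* pass the value through unchanged */
           ;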

(5) |

        | is used to indicate a choice between multiple productions. It is used in context-free grammars to define different production forms of nonterminals. Each production is separated by a vertical bar, indicating that they are one of the possible forms of the nonterminal.

3.4 Specific cases of bison files

1. Create a file named parser.y, which contains the grammar rules;

%{
#include <stdio.h>
#include <stdlib.h>

int yylex(void);                // provided by the Flex-generated scanner
int yyerror(const char *s);     // error handler defined below
%}

// Declare terminal symbols (tokens)
%token SELECT INSERT UPDATE DELETE FROM WHERE
%token INTO VALUES SET
%token ID INT STRING

%%

// Grammar rules

statement: SELECT columns FROM table WHERE condition ';'
         | INSERT INTO table '(' columns ')' VALUES '(' values ')' ';'
         | UPDATE table SET assignments WHERE condition ';'
         | DELETE FROM table WHERE condition ';'
         ;

columns: ID
       | columns ',' ID
       ;

table: ID
     ;

assignments: ID '=' value
           | assignments ',' ID '=' value
           ;

values: value
      | values ',' value
      ;

value: INT
     | STRING
     ;

condition: ID '=' value
         ;

%%

int main() {
    yyparse();
    return 0;
}

int yyerror(const char *s) {
    printf("Error: %s\n", s);
    return 0;
}

2. Use the bison command to compile the parser.y file

bison -d parser.y

        This will generate two files, parser.tab.c and parser.tab.h. Next, you can compile these files with your compiler project and link them into your code.
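Assuming the file names used in this article (lexer.l and parser.y), and assuming lexer.l has been adjusted to include parser.tab.h and return the declared tokens instead of only printing them, a typical build looks roughly like this:

flex lexer.l                                   # generates lex.yy.c
bison -d parser.y                              # generates parser.tab.c and parser.tab.h
gcc -o sqlparser lex.yy.c parser.tab.c -lfl    # link the scanner and the parser together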

4. Complete Demo demonstration of SQL parser

The complete demo will be published next week!

If you want to learn more about abstract syntax tree generation and the parser part of compiler principles, see the link below:

https://mp.csdn.net/mp_blog/creation/editor/132252320


Origin blog.csdn.net/weixin_47156401/article/details/132514880