[Compiler principles] Structure of a Bison source file

0x00 What is BISON

BISON is a tool that generates parsers automatically, and it can be downloaded from the Internet. Taking some time to learn it and using it to parse the SQL language lets us focus on the grammar rules instead of hand-writing parsing functions. For a DBMS as a whole, generating the language processing code with automated tools makes the language analysis module one of the most reliable and easiest to maintain.

Structure of a BISON source file

We write BISON's source program (gramma.y) in the format BISON requires, and BISON translates it into a C file. BISON is therefore a compiler-compiler: a program that generates parsers. A BISON source file usually consists of eight parts:

1. Free definition part:

%{
%}

BISON copies this part verbatim into the output .c file.

2. The UNION structure of the grammar stack

The parser uses a stack to store the syntactic components that have been reduced. The stack is implemented as an array, and every element must be able to describe any syntactic component, so a UNION is used:

%union
{
}

Each member of the union corresponds to a nonterminal symbol of the grammar. Take the four arithmetic operations on integers as an example:

exp : exp '+' exp
     | exp '-' exp
     | exp '*' exp
     | exp '/' exp
     | '(' exp ')'
     | lt_integer
;

lt_integer: LT_INTEGER;

There are two grammar rules, corresponding to two nonterminal symbols: exp is an expression, and lt_integer is an integer constant (LT_INTEGER is a token that the lexer returns when it recognizes an integer). Accordingly, the union can be written as:

%union
{
  par_exp_t*      exp;
  int             lt_integer;
}

Here par_exp_t describes the recognized exp, and the int stores the value of the recognized integer. The example is very simple, so the union has only two fields; in DM's parser this union has about 490 fields, that is, rules for about 490 nonterminals.
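To make the role of the union more concrete, here is a rough C sketch of the kind of type that corresponds to the %union above, together with a toy value stack. The fields of par_exp_t (tag, left, right, value), the name sem_value_t (a stand-in for the YYSTYPE type Bison generates), and the push/pop helpers are assumptions made for illustration; the original only names the type par_exp_t.

```c
#include <stddef.h>

/* Hypothetical expression node; the field layout is assumed. */
typedef struct par_exp {
    int             tag;    /* which sub-rule produced this node */
    struct par_exp *left;   /* first sub-expression, or NULL */
    struct par_exp *right;  /* second sub-expression, or NULL */
    int             value;  /* integer constant for leaf nodes */
} par_exp_t;

/* Stand-in for the union type derived from the %union declaration. */
typedef union {
    par_exp_t *exp;         /* semantic value of the nonterminal exp */
    int        lt_integer;  /* semantic value of the nonterminal lt_integer */
} sem_value_t;

/* The value stack is an array of such unions; each slot holds the member
 * that matches the symbol stored there. No bounds checks in this sketch. */
#define STACK_DEPTH 16
static sem_value_t value_stack[STACK_DEPTH];
static size_t      stack_top = 0;

void push_integer(int v) { value_stack[stack_top++].lt_integer = v; }
int  pop_integer(void)   { return value_stack[--stack_top].lt_integer; }
```

Because every slot is the same union type, the stack can hold an exp pointer in one slot and a plain int in the next, which is why each nonterminal needs its own field.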

3. Type declarations for nonterminals

The UNION type of the parse stack is defined above; the field names must also be mapped to the nonterminal symbols of the grammar:

%type <field_name>  nonterminal_symbol

In the example above, this part should be written as:

%type <exp> exp
%type <lt_integer> lt_integer

This may seem redundant, since each line simply repeats a name. But the first name is the field in the UNION and the second is a grammar symbol; if we change the UNION to:

%union
{
par_exp_t*  eeee;
int          iiii;
}

Then the corresponding type declaration needs to be changed to:

%type <eeee> exp
%type <iiii> lt_integer

Using different names like this only causes confusion, so the DM system uses the consistent form shown earlier, in which the field name matches the nonterminal.

4. Token declarations

The input to the parser is a sequence of tokens, each with a definite meaning. Every token the parser accepts must be declared:

%token LT_INTEGER

For SQL syntax, keywords such as SELECT, FROM, and WHERE can be defined as tokens:

%token KW_SELECT KW_FROM
%token KW_WHERE
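The tokens declared here are produced by the lexer, which the parser calls once for each token. The following is a minimal hand-written sketch of such a lexer for the expression grammar. The names next_token, g_input, and g_int_value are hypothetical; in a real build the function would be called yylex and would store the integer in yylval. The token code 258 is only a placeholder, since Bison assigns the actual numbers in the header it generates.

```c
#include <ctype.h>

enum { LT_INTEGER = 258 };        /* placeholder code; Bison picks the real one */

static const char *g_input = "";  /* hypothetical cursor into the input text */
static int         g_int_value;   /* stand-in for yylval.lt_integer */

/* Returns LT_INTEGER for a run of digits, the character itself for
 * operators and parentheses, and 0 at the end of the input. */
int next_token(void)
{
    while (*g_input == ' ' || *g_input == '\t')
        g_input++;                              /* skip blanks */
    if (*g_input == '\0')
        return 0;                               /* end of input */
    if (isdigit((unsigned char)*g_input)) {
        g_int_value = 0;
        while (isdigit((unsigned char)*g_input)) {
            g_int_value = g_int_value * 10 + (*g_input - '0');
            g_input++;
        }
        return LT_INTEGER;
    }
    return (unsigned char)*g_input++;           /* single-character token */
}
```

Single-character tokens such as '+' are returned as their own character codes, which is why the grammar can use them as quoted literals without a %token declaration.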

5. Determine operator precedence

%left '-' '+'
%left '*' '/'
%left '(' ')'

%left means the operator is left-associative: the production on the left is reduced first. In expression evaluation this means:

1 + 2 + 3 is recognized as ((1 + 2) + 3), not (1 + (2 + 3))

Symbols of lower precedence are listed first and symbols of higher precedence last; symbols on the same line have the same precedence. The declarations above therefore follow the rule "multiply and divide before adding and subtracting, and parentheses take precedence".

Besides %left there are also %right and %nonassoc, for right-associative and non-associative operators respectively; see the bison manual for details.
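With '+' both groupings give the same result, so the effect of associativity is easier to see with '-'. The sketch below is not how bison works internally; it only evaluates the two possible groupings of a chain of subtractions. The function names are made up for this illustration.

```c
/* Left-associative grouping: ((v0 - v1) - v2) - ...
 * This is the grouping that %left selects. */
int fold_left_sub(const int *v, int n)
{
    int acc = v[0];
    for (int i = 1; i < n; i++)
        acc = acc - v[i];
    return acc;
}

/* Right-associative grouping: v0 - (v1 - (v2 - ...))
 * This is the grouping that %right would select. */
int fold_right_sub(const int *v, int n)
{
    int acc = v[n - 1];
    for (int i = n - 2; i >= 0; i--)
        acc = v[i] - acc;
    return acc;
}
```

For the operands 1, 2, 3 the left grouping gives (1 - 2) - 3 = -4, while the right grouping gives 1 - (2 - 3) = 2, so declaring '-' with %left is what makes the parse match ordinary arithmetic.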

6. Declaring the start symbol

%start exp

This tells bison which nonterminal the entire input must ultimately be reduced to.

7. Definition of grammar rules

This is the core of the parser definition. It starts with %%; the grammar rules for expressions were already listed earlier:

%%
exp : exp '+' exp
     | exp '-' exp
     | exp '*' exp
     | exp '/' exp
     | '(' exp ')'
     | lt_integer
;

lt_integer: LT_INTEGER;

8. Additional C source code

After the grammar rules, a second %% starts a section for auxiliary C code. This code is copied verbatim into the output .c file.

Attaching reduction actions to grammar rules

A reduction action is a piece of C code that the parser runs whenever it recognizes (reduces) a grammar symbol. We usually use this code to link the current syntax node to its child nodes. The reduction action must immediately follow its grammar rule.

Take the example above: [figure: BISON grammar rules for exp, with the action statements labeled A to F]

Only two of the sub-rules are shown here; the four statements A, B, C, and D make up the action block of the first sub-rule:

A:

Allocates a structure for the recognized exp and points $$ at it. $$ is a special symbol defined by bison; it denotes the semantic value of the rule being reduced, that is, the grammar stack slot of the left-hand side. If a rule has no action code, bison applies the default action $$ = $1. new_node is a function you must write yourself to create each node, and PAR_EXP is a predefined constant. Different rules need different node-type constants. Functions like new_node are usually placed in the last section of the .y file.

B:

Records which sub-rule was reduced. Here tag = 1 stands for the '+' operation on two sub-expressions.

C:

Keeps the first sub-expression; $1 is the value on the grammar stack of the first component of this production.

D:

Keeps the second sub-expression; $3 is the value on the grammar stack of the third component of this production. Note that '+' also occupies a position, $2; because tag = 1 already records that information in $$, $2 can be ignored.

E:

This statement is rather special: it assigns $$ to a global variable. Because exp is the start symbol, when parsing ends g_root is the root of the syntax tree.

F:

Because a parenthesized expression is equivalent to the expression inside it, $2 can be assigned directly to $$, and there is no need to create a new par_exp node.
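The statements A to F can be approximated in plain C as follows. This is a sketch reconstructed from the descriptions above, not the original code. The fields of par_exp_t, the value of PAR_EXP, the helper names reduce_add, reduce_paren, reduce_integer, and free_tree, and the placement of statement E inside the '+' action are all assumptions. Each reduce_* function plays the role of one action, with its parameters standing in for $1, $2, $3 and its return value for $$.

```c
#include <stdlib.h>

#define PAR_EXP 1                 /* node-type constant; the value is assumed */

typedef struct par_exp {
    int             type;         /* node type, e.g. PAR_EXP */
    int             tag;          /* 0 = integer leaf, 1 = '+' (assumed coding) */
    struct par_exp *left;
    struct par_exp *right;
    int             value;        /* used by integer leaves only */
} par_exp_t;

static par_exp_t *g_root = NULL;  /* root of the finished syntax tree */

/* Allocates a zeroed node of the given type. */
par_exp_t *new_node(int type)
{
    par_exp_t *node = calloc(1, sizeof *node);
    if (node == NULL)
        abort();
    node->type = type;
    return node;
}

/* Mirrors the action of: exp : exp '+' exp */
par_exp_t *reduce_add(par_exp_t *s1, par_exp_t *s3)
{
    par_exp_t *ss = new_node(PAR_EXP);  /* A: $$ = new_node(PAR_EXP) */
    ss->tag   = 1;                      /* B: mark the '+' sub-rule  */
    ss->left  = s1;                     /* C: keep $1                */
    ss->right = s3;                     /* D: keep $3                */
    g_root    = ss;                     /* E: remember the root      */
    return ss;
}

/* Mirrors the action of: exp : '(' exp ')'   (F: $$ = $2) */
par_exp_t *reduce_paren(par_exp_t *s2) { return s2; }

/* Mirrors: exp : lt_integer, wrapping the integer in a leaf node. */
par_exp_t *reduce_integer(int v)
{
    par_exp_t *ss = new_node(PAR_EXP);
    ss->value = v;
    return ss;
}

/* Releases a whole tree once it is no longer needed. */
void free_tree(par_exp_t *node)
{
    if (node == NULL)
        return;
    free_tree(node->left);
    free_tree(node->right);
    free(node);
}
```

For the input (1 + 2), the parser would in effect call reduce_integer twice, then reduce_add, then reduce_paren, and g_root ends up pointing at the '+' node.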

The final function yyparse()

yyparse() is the main function of the parser generated by bison. Call yyparse(), and if all goes well, g_root in the example above will point to a complete syntax tree.

Error handling

If the input string contains a syntax error, the parser stops parsing. Before yyparse() returns, it calls the function yyerror(char *s). The user must define this function, typically to report useful information such as the line number where the syntax error occurred and the nearby tokens.
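A minimal sketch of such an error function is shown below. The globals g_line_no, g_last_token, and g_error_count, and the helpers note_token and format_syntax_error, are hypothetical: they assume the lexer records the current line and the text of the most recent token. The parameter is declared const char *, the form used in the bison manual.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical state that the lexer is assumed to keep up to date. */
static int  g_line_no = 1;          /* current input line */
static char g_last_token[64] = "";  /* text of the most recent token */
static int  g_error_count = 0;      /* number of errors reported so far */

/* The lexer would call this for every token it returns. */
void note_token(const char *text)
{
    strncpy(g_last_token, text, sizeof g_last_token - 1);
    g_last_token[sizeof g_last_token - 1] = '\0';
}

/* Builds the message separately so callers can log or display it. */
int format_syntax_error(char *buf, size_t size, const char *msg)
{
    return snprintf(buf, size, "line %d: %s near '%s'",
                    g_line_no, msg, g_last_token);
}

/* Called by the generated parser when it detects a syntax error. */
void yyerror(const char *msg)
{
    char buf[256];
    format_syntax_error(buf, sizeof buf, msg);
    fprintf(stderr, "%s\n", buf);
    g_error_count++;
}
```

Keeping the formatting in its own function makes it possible to send the same message to a log or to a client instead of stderr.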

0x01 reference

[http://pkmonster.blog.51cto.com/390780/79366]
