Make your own compiler: implementing lexical analysis for the C language

In compiler design and development, being able to build a compiler for the C language is proof that you have truly gotten started. A finished C compiler is the "hello world" program of compiler principles. To confirm that the GoLex we developed is fully functional, let’s see whether it can correctly recognize the lexical elements of the C language.

First, we fix a bug in regular expression parsing by changing the term function in RegParser.go as follows:

...
else {
            /*
                Matching "." is really matching a character set that contains every ASCII character except \r and \n
            */
            start.edge = CCL
            if r.lexReader.Match(ANY) {
                for i := 0; i < ASCII_CHAR_NUM; i++ {
                    if i != int('\r') && i != int('\n') {
                        start.bitset[string(rune(i))] = true
                    }
                }
                // bug fix: advance past the '.'
                r.lexReader.Advance()
...

Without this change, parsing the expression "(.)" falls into an infinite loop, because the parser never moves past the '.'. The Advance function of LexReader also has a bug, which we fix as follows:

func (l *LexReader) Advance() TOKEN {
...
if l.inquoted || sawEsc {
        l.currentToken = L
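        // bug fix:
        // when a backslash is read, the code calls esc(), which has already moved past the character after the backslash, so the code below would skip two characters after the backslash
        //if sawEsc {
        //    // skip the character after the backslash
        //    l.currentInput = l.currentInput[1:]
        //}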
    } else {
        l.currentToken = l.tokenMap[l.Lexeme]
    }
...
}

We also fix a bug in EpsilonClosure in nfa_interpretion.go:

func EpsilonClosure(input []*NFA) *EpsilonResult {
...
if node.edge == EPSILON {
            // bug fix:
            /*
                result.results is the current epsilon closure, so we must check whether it already
                contains the given node rather than checking input, because node itself comes from
                the last element of input
            */
            if node.next != nil && !stackContains(result.results, node.next) {
                input = append(input, node.next)
            }

            // bug fix: the same check for the second edge
            if node.next2 != nil && !stackContains(result.results, node.next2) {
                input = append(input, node.next2)
            }
...
}
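
The reasoning in the comment above can be illustrated with a small worklist version of epsilon closure. This is only a sketch: it is written in C rather than Go, it uses an array-indexed NFA, and Node, MAX_NODES, NO_NODE, EPSILON and epsilon_closure are names invented for the illustration, not GoLex's types. The point it shows is that the membership test is made against the result set, so a node that was already processed is never pushed again, even when the NFA contains an epsilon cycle.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_NODES 16
#define EPSILON   256    /* hypothetical edge label, outside the ASCII range */
#define NO_NODE   (-1)   /* hypothetical marker for "no successor" */

/* A simplified NFA node: successors are indices into an array. */
typedef struct {
    int edge;   /* EPSILON or an input character */
    int next;   /* first successor, or NO_NODE */
    int next2;  /* second successor, or NO_NODE */
} Node;

/*
 * Marks in closure[] every node reachable from `start` through epsilon
 * edges alone and returns how many nodes were marked. The caller must
 * pass closure[] with all entries set to false.
 */
static int epsilon_closure(const Node *nfa, int start, bool closure[MAX_NODES]) {
    int stack[MAX_NODES];   /* each node is pushed at most once */
    int top = 0;
    int count = 0;

    assert(start >= 0 && start < MAX_NODES);
    closure[start] = true;
    stack[top++] = start;

    while (top > 0) {
        const Node *node = &nfa[stack[--top]];
        count++;
        if (node->edge != EPSILON) {
            continue;
        }
        int successors[2] = { node->next, node->next2 };
        for (int i = 0; i < 2; i++) {
            int s = successors[i];
            /* Test membership in the result set, not in the work stack:
               a node that was already popped would otherwise be pushed again. */
            if (s != NO_NODE && !closure[s]) {
                closure[s] = true;
                stack[top++] = s;
            }
        }
    }
    return count;
}
```

With nodes 0 and 1 pointing at each other through epsilon edges, the loop still terminates, because node 0 is already in the result set when node 1 is expanded.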

In the listings above, the faulty code in Advance is commented out, and the code comments explain each change. There is also a bug in ii_advance() in input.c of CLex, which we fix as follows:

int ii_advance() {
 ...
    // bug fix: return the current character, then move past it
    int c = *Next;
    Next++;
    return c;
}

Next, let’s see how to set up input.lex. First, here is the header of the template file:

%{
    /*
    Lexical analysis for C: yyout.h defines the token values, search.h declares the lookup interface for the keyword table
    */
#include "yyout.h"
#include "search.h"
%}

We include two header files at the top of the template file. yyout.h defines a series of token values that label the lexemes found in C code, such as NAME and STRING. search.h declares the function used to binary-search the keyword table. Let’s look at the contents of each file.

Create yyout.h in the CLex project with the following content:

//
// Created by MAC on 2023/11/30.
//

#ifndef UNTITLED_YYOUT_H
#define UNTITLED_YYOUT_H

/*         token                   value                    lexeme          */
#define    _EOI                     0                        /* end-of-input marker */
#define    NAME                     1                        /* identifier, e.g. int a; */
#define    STRING                   2                        /* string constant, e.g. char* c="abc"; */
#define    ICON                     3                        /* integer or character constant: 1,2,3 'a', 'b', 'c' */
#define    FCON                     4                        /* floating-point constant */
#define    PLUS                     5                        /* + */
#define    MINUS                    6                        /* - */
#define    START                    7                        /* * */
#define    AND                      8                        /* & */
#define    QUEST                    9                        /* ? */
#define    COLON                    10                       /* : */
#define    ANDAND                   11                       /* && */
#define    OROR                     12                       /* ||  */
#define    RELOP                    13                       /* > >= < <= */
#define    EQUOP                    14                       /* == != */
#define    DIVOP                    15                       /* / % */
#define    OR                       16                       /* |  */
#define    XOR                      17                       /* ^ */
#define    SHIFTOP                  18                       /* >> << */
#define    INCOP                    19                       /* ++ -- */
#define    UNOP                     20                       /* ! ~  */
#define    STRUCTOP                 21                       /* . -> */
#define    TYPE                     22                       /* int float char long ...*/
#define    CLASS                    23                       /* extern static typedef ...*/
#define    STRUCT                   24                       /* struct union */
#define    RETURN                   25                       /* return */
#define    GOTO                     26                       /* goto */
#define    IF                       27                       /* if */
#define    ELSE                     28                       /* else */
#define    SWITCH                   29                       /* switch */
#define    BREAK                    30                       /* break */
#define    CONTINUE                 31                       /* continue */
#define    WHILE                    32                       /* while */
#define    DO                       33                       /* do */
#define    FOR                      34                       /* for */
#define    DEFAULT                  35                       /* default */
#define    CASE                     36                       /* case */
#define    SIZEOF                   37                       /* sizeof */
#define    LP                       38                       /* ( left parenthesis */
#define    RP                       39                       /* ) right parenthesis */
#define    LC                       40                       /* { left brace */
#define    RC                       41                       /* } right brace */
#define    LB                       42                       /* [ left bracket */
#define    RB                       43                       /* ] right bracket */
#define    COMMA                    44                       /* , */
#define    SEMI                     45                       /* ; */
#define    EQUAL                    46                       /* = */
#define    ASSIGNOP                 47                       /* += -= */
#endif //UNTITLED_YYOUT_H

Add search.h and set the content as follows:

//
// Created by MAC on 2023/11/30.
//

#ifndef UNTITLED_SEARCH_H
#define UNTITLED_SEARCH_H
/**
 Binary search in the keyword table
*/
extern char* bsearch(char* key, char* base, int nel, int elsize, int (*compare)());
#endif //UNTITLED_SEARCH_H

We will explain the implementation of the bsearch function in detail later. Next, we add macro definitions for some regular expressions to the template file:

%{
    /*
    Lexical analysis for C: yyout.h defines the token values, search.h declares the lookup interface for the keyword table
    */
#include "yyout.h"
#include "search.h"
#include <stdio.h>
#include <stdarg.h>
void handle_comment();
void yyerror(char* fmt, ...);
%}
let     [_a-zA-Z]
alnum   [_a-zA-Z0-9]
h       [0-9a-fA-F]
o       [0-7]
d       [0-9]
suffix  [UulL]
white   [\x00-\s]
%%

In the definitions above, let matches a letter or an underscore, and alnum matches a letter, a digit, or an underscore. h matches a hexadecimal digit, o an octal digit, and d a decimal digit. suffix matches an integer suffix, and white treats every ASCII character from 0 up to and including the space as whitespace.
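
Because each macro is a plain character set, it can be mirrored by a small C predicate. The sketch below does that (all function names are made up for illustration and are not part of GoLex or CLex). It also shows why the upper bound of a range matters: in ASCII, six non-letter characters sit between 'Z' and 'a', so a range written A-z instead of A-Z would accept them as well.

```c
#include <assert.h>
#include <string.h>

/* let: a letter or an underscore */
static int is_let(int c) {
    return c == '_' || (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
}

/* alnum: a letter, a digit, or an underscore */
static int is_alnum_char(int c) {
    return is_let(c) || (c >= '0' && c <= '9');
}

/* h: a hexadecimal digit */
static int is_hex(int c) {
    return (c >= '0' && c <= '9') || (c >= 'a' && c <= 'f') || (c >= 'A' && c <= 'F');
}

/* o: an octal digit */
static int is_octal(int c) {
    return c >= '0' && c <= '7';
}

/* suffix: one of U, u, l, L */
static int is_suffix(int c) {
    return c != '\0' && strchr("UulL", c) != NULL;
}

/* white: every code from 0 up to and including the space */
static int is_white(int c) {
    return c >= 0 && c <= ' ';
}

/* Counts the characters in first..last that are not ASCII letters. */
static int non_letters_in_range(int first, int last) {
    int count = 0;
    assert(first <= last);
    for (int c = first; c <= last; c++) {
        if (!((c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z'))) {
            count++;
        }
    }
    return count;
}
```

Here non_letters_in_range('A', 'Z') is 0, while non_letters_in_range('A', 'z') is 6: the characters [ \ ] ^ _ and the backquote.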

Next, let’s look at the regular expressions for the C language:

"/*"        {handle_comment();}
\"(\\.|[^\"])*\"  {printf("this is a string: %s\n", yytext); /*return STRING*/;}
\"(\\.|[^\"])*[\r\n]   yyerror("Adding missing \" to string constant\n");

The first regular expression matches the string "/*", which opens a comment in C. Once we see these two characters, we call handle_comment() to process the comment. The expression \"(\\.|[^\"])*\" is harder to read. The backslash is an escape: \" means that the double quote is an ordinary character rather than a special symbol of the regular expression. This expression matches string constants in the C language, for example:

char* ptr = "hello world!";

In the declaration above, the string "hello world!" matches the expression. Once we see an opening double quote we enter string recognition, which lasts until the closing double quote. After the opening quote, every character that is not a double quote belongs to the string, which is what [^\"] expresses. We also match \\. explicitly, where the first backslash escapes the second: inside the string, a backslash followed by any single character is recognized as one unit, for example:

char* ptr = "hello \n world!";
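
As a rough illustration of what the string rule accepts, it can be approximated by a hand-written scanner, together with its companion rule for a constant that reaches the end of the line without a closing quote. This is a sketch and not the code GoLex generates: scan_string and StrResult are hypothetical names, the scanner always treats a backslash as escaping the next character (the regular expression is slightly more permissive, since [^\"] can also match a lone backslash), and it assumes that a carriage return or line feed always ends the constant.

```c
#include <assert.h>
#include <stddef.h>

typedef enum { STR_NONE, STR_COMPLETE, STR_UNTERMINATED } StrResult;

/*
 * Scans a string constant that starts at s[0]. On return, *len holds the
 * number of characters consumed (0 when s does not start with a quote).
 */
static StrResult scan_string(const char *s, size_t *len) {
    assert(s != NULL && len != NULL);
    *len = 0;
    if (s[0] != '"') {
        return STR_NONE;
    }
    size_t i = 1;
    while (s[i] != '\0') {
        if (s[i] == '\\' && s[i + 1] != '\0' && s[i + 1] != '\r' && s[i + 1] != '\n') {
            i += 2;                  /* \\. : backslash plus one character */
        } else if (s[i] == '"') {
            *len = i + 1;            /* closing quote found */
            return STR_COMPLETE;
        } else if (s[i] == '\r' || s[i] == '\n') {
            *len = i + 1;            /* the line ended before the closing quote */
            return STR_UNTERMINATED;
        } else {
            i++;                     /* [^\"] : an ordinary character */
        }
    }
    *len = i;
    return STR_UNTERMINATED;         /* the input ended inside the constant */
}
```

For the text "a \" b" followed by more input, the scanner stops after the eighth character, because the escaped quote in the middle does not close the constant.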

Note that \n in "hello \n world!" represents a single character, the newline, not two characters. The expression \"(\\.|[^\"])*[\r\n] handles a string that runs into a carriage return or line feed before its closing quote: all characters of a string must be on the same line, so a string cannot be split across two lines. The template also uses a function yyerror that reports errors. It is essentially a wrapper around printf, except that it writes to standard error, which is normally the console. It uses C's variable-length argument mechanism, so it accepts any number of arguments. We develop incrementally and first check whether GoLex handles the current template correctly. Here is the entire template so far:

%{
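    /*
    Lexical analysis for C: yyout.h defines the token values, search.h declares the lookup interface for the keyword table
    */
#include "yyout.h"
#include "search.h"
#include <stdio.h>
#include <stdarg.h>
void handle_comment();
void yyerror(char* fmt, ...);
%}

let     [_a-zA-Z]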
alnum   [_a-zA-Z0-9]
h       [0-9a-fA-F]
o       [0-7]
d       [0-9]
suffix  [UulL]
white   [\x00-\s]
%%
"/*"        {handle_comment();}
\"(\\.|[^\"])*\"    return STRING;
\"(\\.|[^\"])*[\r\n]   yyerror("Adding missing \" to string constant\n");
%%

void handle_comment() {
    int i;
    while ((i = ii_input()) != 0) {
        if (i < 0)
            ii_flushbuf();  // discard the lexeme collected so far
        else if (i == '*' && ii_lookahead(1) == '/') {
            // reached the end of the comment
            printf("at the end of comment...");
            ii_input();
            break;
        }
    }

    if (i == 0) {
        yyerror("End of file in comment\n");
    }
}

int main() {
    ii_newfile("/Users/my/Documents/CLex/input.txt");
    yylex();
    return 0;
}
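
The template calls yyerror, described earlier as a printf-style wrapper that writes to standard error. A minimal sketch of such a function follows. It is an assumption about how it could be written, not the author's code: the header line mimics the "ERROR on line ..., near <...>" format seen in the program output later in this article, yylineno is an assumed name for the current line counter, and vformat_error and format_error are helper names invented here so that the formatting can be checked without capturing standard error.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>

/* Stand-ins for state a generated lexer would provide. yylineno in
   particular is an assumed name, not taken from the CLex sources. */
static const char *yytext = "";
static int yylineno = 1;

/* Formats the error text into buf and returns the length the full
   message would have (negative on an encoding error). */
static int vformat_error(char *buf, size_t size, const char *fmt, va_list args) {
    assert(buf != NULL && fmt != NULL && size > 0);
    int head = snprintf(buf, size, "ERROR on line %d, near <%s>\n", yylineno, yytext);
    if (head < 0 || (size_t)head >= size) {
        return head;   /* encoding error, or the header alone filled buf */
    }
    int tail = vsnprintf(buf + head, size - (size_t)head, fmt, args);
    return tail < 0 ? tail : head + tail;
}

/* Variadic front end for vformat_error, convenient for testing. */
static int format_error(char *buf, size_t size, const char *fmt, ...) {
    va_list args;
    va_start(args, fmt);
    int n = vformat_error(buf, size, fmt, args);
    va_end(args);
    return n;
}

/* printf-style error reporter that writes to standard error. */
void yyerror(char *fmt, ...) {
    char buf[512];
    va_list args;
    va_start(args, fmt);
    vformat_error(buf, sizeof buf, fmt, args);
    va_end(args);
    fputs(buf, stderr);
}
```

Splitting the formatting from the output keeps yyerror itself to a few lines; a version that calls vfprintf(stderr, fmt, args) directly would work just as well.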

Pay attention to the implementation of handle_comment: it discards every character after /* until it encounters */. We run GoLex and copy the contents of the generated lex.yy.c to main.c in CLex to see how it works. In CLex's input.txt file, we set the following content to test the regular expressions:

/*this is a c comment*/
"this is a c string that contains \s \t and \" "
"this is a error c comment

After completing the above work, we compile CLex and see its running results:

[Screenshot: CLex output for the test input above]
The results show that the given regular expressions recognize the input as expected. We continue by adding new expressions to the template file and checking whether they are recognized correctly. The new expressions are as follows:

'.'|
'\\.'|
'\\{o}({o}{o}?)?'|
'\\x{h}({h}{h}?)?'|
0{o}*{suffix}?|
0x{h}+{suffix}?|
[1-9]{d}*{suffix}?    return ICON;

These expressions match the right-hand side of each assignment below:

int a = 'a';  // matches '.'
int b = '\t';  // matches '\\.'
int c = '\123';  // matches '\\{o}({o}{o}?)?'
int d = '\x123'; // matches '\\x{h}({h}{h}?)?'
int e = 012L; // matches 0{o}*{suffix}?
int f = 0x123L;  // matches 0x{h}+{suffix}?
int g = 123L;  // matches [1-9]{d}*{suffix}?

You can see that these values all correspond to integer constants in the C language. Now let’s look at the definition of floating-point numbers in C:

({d}+|{d}+\.{d}*|{d}*\.{d}+)([eE][-+]?{d}+)?[fF]?   return FCON;

Examples of text matched by the above expression:

float a = 1f;  // accepted by the expression, although a C compiler rejects 1f
float b = 3.14;
float c = 314e-10; // 3.14e-8
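
To make the numeric rules easier to check by hand, here is a sketch of a classifier that follows the integer and floating-point expressions for a whole lexeme. It is an illustration only (NumKind, match_icon, match_fcon and classify_number are invented names) and it leaves out the character-constant alternatives. The integer rule is tried first, mirroring the order of the rules in the template, which is why a plain "123" is reported as an integer even though the floating-point expression would also accept it.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

typedef enum { KIND_NONE, KIND_ICON, KIND_FCON } NumKind;

static int is_suffix(char c) { return c != '\0' && strchr("UulL", c) != NULL; }
static int is_digit(char c)  { return isdigit((unsigned char)c) != 0; }
static int is_hex(char c)    { return isxdigit((unsigned char)c) != 0; }
static int is_octal(char c)  { return c >= '0' && c <= '7'; }

/* 0x{h}+{suffix}? | 0{o}*{suffix}? | [1-9]{d}*{suffix}? */
static int match_icon(const char *s) {
    if (s[0] == '0' && s[1] == 'x') {
        s += 2;
        if (!is_hex(*s)) return 0;          /* at least one hex digit */
        while (is_hex(*s)) s++;
    } else if (s[0] == '0') {
        s++;
        while (is_octal(*s)) s++;
    } else if (s[0] >= '1' && s[0] <= '9') {
        while (is_digit(*s)) s++;
    } else {
        return 0;
    }
    if (is_suffix(*s)) s++;
    return *s == '\0';
}

/* ({d}+|{d}+\.{d}*|{d}*\.{d}+)([eE][-+]?{d}+)?[fF]? */
static int match_fcon(const char *s) {
    int int_digits = 0, frac_digits = 0;
    while (is_digit(*s)) { s++; int_digits++; }
    if (*s == '.') {
        s++;
        while (is_digit(*s)) { s++; frac_digits++; }
    }
    if (int_digits == 0 && frac_digits == 0) return 0;  /* no digits at all */
    if (*s == 'e' || *s == 'E') {
        s++;
        if (*s == '+' || *s == '-') s++;
        if (!is_digit(*s)) return 0;        /* exponent needs digits */
        while (is_digit(*s)) s++;
    }
    if (*s == 'f' || *s == 'F') s++;
    return *s == '\0';
}

/* Integer rule first, then the floating-point rule. */
static NumKind classify_number(const char *lexeme) {
    assert(lexeme != NULL);
    if (match_icon(lexeme)) return KIND_ICON;
    if (match_fcon(lexeme)) return KIND_FCON;
    return KIND_NONE;
}
```

For example, classify_number("0x123L") yields KIND_ICON and classify_number("123.456e-4f") yields KIND_FCON.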

After completing the above, run GoLex to generate lex.yy.c, copy its content to main.c of CLex, and add the following lines to input.txt in CLex for testing:

'a'
'b'
'\t'
'\f'
'\123'
012
0123L
0x123
0x123L
123
123L
3.14
123.456
1e3
123e+3
123.456e+3
1e-3
123.456e-4f

Finally, after we compile and run CLex, the results are as follows:

at the end of comment...
this is a string: "this is a c string that contains \s \t and \" "
find ICON: 'a'
find ICON: 'b'
find ICON: '\t'
find ICON: '\f'
find ICON: '\123'
find ICON: 012
find ICON: 0123L
find ICON: 0x123
find ICON: 0x123L
find ICON: 123
find ICON: 123L
find FCON: 3.14
find FCON: 123.456
find FCON: 1e3
find FCON: 123e+3
find FCON: 123.456e+3
find FCON: 1e-3
find FCON: 123.456e-4f
ERROR on line 4, near <"this is a error c comment
>
Adding missing " to string constant

Next we add lexical analysis of C language operators by adding the following to input.lex:

"("    {printf("it is LP\n"); /*return LP;*/}
")"    {printf("it is RP\n"); /*return RP;*/}
"{"    {printf("it is LC\n"); /*return LC;*/}
"}"    {printf("it is RC\n"); /*return RC;*/}
"["    {printf("it is LB\n"); /*return LB;*/}
"]"    {printf("it is RB\n"); /*return RB;*/}

"->"|
"."    {printf("struct operator:%s\n", yytext); /*return STRUCTOP;*/}

"++"|
"--"    {printf("INCOP: %s\n", yytext); /*return INCOP;*/}
"*"     {printf("START OP\n"); /*return START;*/}
[~!]    {printf("UNOP:%s\n", yytext); /*return UNOP;*/}
[/%]     {printf("DIVOP: %s\n", yytext); /*return DIVOP;*/}
"+"     {printf("PLUS\n"); /*return PLUS;*/}
"-"     {printf("MINUS\n"); /*return MINUS;*/}
<<|>>   {printf("SHIFTOP: %s\n",yytext); /*return SHIFTOP;*/}
[<>]=?  {printf("RELOP: %s\n", yytext); /*return RELOP;*/}
[!=]=   {printf("EQUOP: %s\n", yytext); /*return EQUOP;*/}
[*/%+-&|^]=|
(<<|>>)=  {printf("ASSIGN OP: %s\n", yytext); /*return ASSIGNOP;*/}
"="     {printf("EQUAL: %s\n", yytext); /*return EQUAL;*/}
"&"     {printf("AND: %s\n", yytext); /*return AND;*/}
"^"     {printf("XOR: %s\n", yytext); /*return XOR;*/}
"|"     {printf("OR: %s\n", yytext); /*return OR;*/}
"&&"    {printf("ANDAND: %s\n", yytext); /*return ANDAND;*/}
"||"    {printf("OROR: %s\n", yytext); /*return OROR;*/}
"?"     {printf("QUEST: %s\n", yytext); /*return QUEST;*/}
":"     {printf("COLON: %s\n", yytext); /*return COLON;*/}
","     {printf("COMMA: %s\n", yytext); /*return COMMA;*/}
";"     {printf("SEMI: %s\n", yytext); /*return SEMI;*/}

Then execute GoLex to generate a new lex.yy.c, copy it to CLex's main.c, and add the following new content for testing in CLex's input.txt:

(
)
{
}
[
]
->
.
++
--
/
%
+
-
<<
>>
<
>
<=
>=
!=
==
*=
/=
+=
-=
&=
|=
^=
<<=
>>=
=
^
|
&&
||
?
:
,
;

Then we execute CLex to parse the new string added, and the final result is as follows:

it is LP
it is RP
it is LC
it is RC
it is LB
it is RB
struct operator:->
struct operator:.
INCOP: ++
INCOP: --
DIVOP: /
DIVOP: %
PLUS
MINUS
SHIFTOP: <<
SHIFTOP: >>
RELOP: <
RELOP: >
RELOP: <=
RELOP: >=
EQUOP: !=
EQUOP: ==
ASSIGN OP: *=
ASSIGN OP: /=
ASSIGN OP: +=
MINUS
EQUAL: =
AND: &
EQUAL: =
ASSIGN OP: |=
ASSIGN OP: ^=
ASSIGN OP: <<=
ASSIGN OP: >>=
EQUAL: =
XOR: ^
OR: |
ANDAND: &&
OROR: ||
QUEST: ?
COLON: :
COMMA: ,
SEMI: ;

Finally, we need to recognize keywords. Many strings in the C language have a special function and cannot be used as variable names, such as int, float, and struct. When lexical analysis encounters one of these strings, it must treat it as a reserved word (keyword) rather than as a variable name. Our approach is to first recognize the current string as an identifier and then look it up in the keyword table to see whether it is a keyword. We continue by adding the following to input.lex in GoLex:

{let}{alnum}*  {return id_or_keyword(yytext);}
.    {yyerror("Illegal character<%s>\n", yytext);}
%%
// represents one entry in the keyword table
typedef struct {
    char* name;
    int val;
} KWORD;

KWORD Ktab[] = {
    {"auto", CLASS},
    {"break", BREAK},
    {"case", CASE},
    {"char", TYPE},
    {"continue", CONTINUE},
    {"default", DEFAULT},
    {"do",  DO},
    {"double", TYPE},
    {"else", ELSE},
    {"extern", CLASS},
    {"float", TYPE},
    {"for", FOR},
    {"goto", GOTO},
    {"if", IF},
    {"int", TYPE},
    {"long", TYPE},
    {"register", CLASS},
    {"return", RETURN},
    {"short", TYPE},
    {"sizeof", SIZEOF},
    {"static", CLASS},
    {"struct", STRUCT},
    {"switch", SWITCH},
    {"typedef", CLASS},
    {"union", STRUCT},
    {"unsigned", TYPE},
    {"void", TYPE},
    {"while", WHILE}
};

int cmp(KWORD*a, KWORD* b) {
    return strcmp(a->name, b->name);
}

int id_or_keyword(char* lex) {
    KWORD* p;
    KWORD  dummy;
    dummy.name = lex;
    p = (KWORD*)bsearch(&dummy, Ktab, sizeof(Ktab)/sizeof(KWORD), sizeof(KWORD),cmp);
    if (p) {
        printf("find keyword: %s\n", yytext);
        return p->val;
    }

    printf("find variable :%s\n", yytext);
    return NAME;
}
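
The listing above relies on the bsearch declared in search.h, whose implementation is not shown in this article. As a sketch under a different assumption, the same lookup can be written with the standard library's bsearch from <stdlib.h>, which takes const void* arguments and a comparison function of type int (*)(const void*, const void*). The table here is shortened to four entries for brevity, the token values are copied from yyout.h, and the printf calls are left out. Note that a program including <stdlib.h> cannot also use the differently typed bsearch declaration from search.h.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Token values copied from yyout.h for the entries used in this sketch. */
enum { NAME = 1, TYPE = 22, RETURN = 25, IF = 27, WHILE = 32 };

/* One entry in the keyword table. */
typedef struct {
    const char *name;
    int val;
} KWORD;

/* Must stay sorted by name, otherwise binary search gives wrong answers. */
static const KWORD Ktab[] = {
    {"if", IF},
    {"int", TYPE},
    {"return", RETURN},
    {"while", WHILE},
};

/* Comparison callback in the form the standard bsearch expects. */
static int cmp(const void *a, const void *b) {
    return strcmp(((const KWORD *)a)->name, ((const KWORD *)b)->name);
}

/* Returns the keyword's token value, or NAME for an ordinary identifier. */
static int id_or_keyword(const char *lex) {
    assert(lex != NULL);
    KWORD key = { lex, 0 };
    const KWORD *p = bsearch(&key, Ktab, sizeof Ktab / sizeof Ktab[0],
                             sizeof Ktab[0], cmp);
    return p ? p->val : NAME;
}
```

With this table, id_or_keyword("int") returns TYPE, while id_or_keyword("integer") returns NAME because only exact matches count as keywords.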

{let}{alnum}* means that a variable name must start with an underscore or a letter and may be followed by letters, digits, or underscores. The KWORD structure describes one entry in the keyword table, and Ktab lists the keywords of the C language. Note that the strings in Ktab are arranged in ascending order, that is, "auto" < "break" < ... < "while", so that we can binary-search the table. When a string that satisfies the rules for variable names is recognized, id_or_keyword is called. It searches Ktab for the string: if an entry is found, the string is a C keyword; otherwise it is an ordinary variable name. To see the modified code running, search for coding Disney on Bilibili to watch the debugging demonstration. The code download address for this article:
Link: https://pan.baidu.com/s/1ekBNQ94ajhswWVQSBIVZ7g Extraction code: wsir

Origin blog.csdn.net/tyler_download/article/details/134985568