Write your own compiler: a basic introduction to GoLex programs

The goal of this section is to convert a given regular expression into a nondeterministic finite automaton (NFA) data structure, from which a jump table is later generated to perform string matching. Let's look at the input first. The input is a file with the .lex suffix, and its basic content is as follows:

%{
    FCON = 1
    ICON = 2
%}
D  [0-9]
%%
(e{D}+)?
%%

Compiler development relies on a chain of tools. The first tool in the chain is called lex. You write the regular expressions that describe the strings to be recognized into a file, such as the one above, and then run lex. It reads the input file and outputs a C source file, which is the regular expression converted into equivalent executable C code. After compiling the generated code, we get an executable program that can recognize the specified strings.

The lex we are going to implement this time is based on the Python language. A lex file is divided into several parts. The first part is enclosed between %{ and %}. It is a piece of Python code, usually variable definitions. The lex program copies this part verbatim into the target file. Assuming we name the target file output.py, the statements:

FCON = 1
ICON = 2

are copied directly to the top of output.py. We will explain their exact role later. The next part holds the macro definitions, which run from %} to the first %%. The corresponding statement here is:

D  [0-9]

This is similar to a macro definition in C. The line above says that the name D stands for any one of the ten digits 0 to 9. The last part lies between the two %% markers and defines the regular expressions themselves. In the example above it is:

(e{D}+)?
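
To make the three-part layout concrete, here is a minimal Go sketch that splits the sample file into its header, macro, and rule sections. The names LexSpec and splitLexFile are hypothetical and chosen only for this illustration; they are not taken from the GoLex source, and the sketch handles only the simple layout shown above.

```go
package main

import (
	"fmt"
	"strings"
)

// LexSpec holds the three parts of a .lex file.
// The type and its field names are assumptions made for this sketch.
type LexSpec struct {
	Header []string          // lines between %{ and %}, copied verbatim
	Macros map[string]string // macro name -> replacement text
	Rules  []string          // regular expressions between the two %% lines
}

// Section markers used while scanning the file.
const (
	beforeHeader = iota
	inHeader
	inMacros
	inRules
	done
)

// splitLexFile walks the input line by line and files each line under
// the section that the delimiters seen so far say we are in.
func splitLexFile(src string) LexSpec {
	spec := LexSpec{Macros: map[string]string{}}
	state := beforeHeader
	for _, line := range strings.Split(src, "\n") {
		trimmed := strings.TrimSpace(line)
		switch {
		case trimmed == "":
			// Blank lines carry no information in this simple layout.
		case trimmed == "%{" && state == beforeHeader:
			state = inHeader
		case trimmed == "%}" && state == inHeader:
			state = inMacros
		case trimmed == "%%" && state == inMacros:
			state = inRules
		case trimmed == "%%" && state == inRules:
			state = done
		case state == inHeader:
			spec.Header = append(spec.Header, trimmed)
		case state == inMacros:
			// A macro line is "NAME  BODY", separated by whitespace.
			fields := strings.Fields(trimmed)
			if len(fields) >= 2 {
				spec.Macros[fields[0]] = strings.Join(fields[1:], " ")
			}
		case state == inRules:
			spec.Rules = append(spec.Rules, trimmed)
		}
	}
	return spec
}

// sampleLex is the example file from this article.
const sampleLex = `%{
    FCON = 1
    ICON = 2
%}
D  [0-9]
%%
(e{D}+)?
%%
`

func main() {
	spec := splitLexFile(sampleLex)
	fmt.Println("header:", spec.Header)
	fmt.Println("macro D:", spec.Macros["D"])
	fmt.Println("rules:", spec.Rules)
}
```

A small state variable is enough here because the delimiters always appear in the same order; a real reader would also need to report malformed files.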

Our finished code will read the regular expression (e{D}+)?, parse it character by character, and finally build a nondeterministic finite state automaton similar to the following:

23383f0175540afeb406cb3c0bc5d56e.png
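
As a rough illustration of what such an automaton does, the Go sketch below hand-builds an NFA for (e[0-9]+)? and simulates it on whole input strings. The state numbering, the state and edge types, and the function names are all assumptions made for this example; they do not necessarily match the figure or the structures that GoLex builds.

```go
package main

import "fmt"

// edge is a transition taken when the input byte lies in [lo, hi].
type edge struct {
	lo, hi byte
	to     int
}

// state is one NFA node. epsilon lists states reachable without input.
type state struct {
	accept  bool
	epsilon []int
	edges   []edge
}

// buildExampleNFA returns a hand-written NFA for (e[0-9]+)?.
// State 0 is the start state and state 3 is the only accepting state.
func buildExampleNFA() []state {
	return []state{
		0: {epsilon: []int{3}, edges: []edge{{'e', 'e', 1}}}, // skip the group, or read 'e'
		1: {edges: []edge{{'0', '9', 2}}},                    // read one digit
		2: {epsilon: []int{1, 3}},                            // loop for '+', or finish
		3: {accept: true},
	}
}

// closure returns every state reachable from seeds via epsilon edges only.
func closure(nfa []state, seeds []int) map[int]bool {
	seen := map[int]bool{}
	stack := append([]int(nil), seeds...)
	for len(stack) > 0 {
		s := stack[len(stack)-1]
		stack = stack[:len(stack)-1]
		if seen[s] {
			continue
		}
		seen[s] = true
		stack = append(stack, nfa[s].epsilon...)
	}
	return seen
}

// matches reports whether the NFA accepts the entire input string.
func matches(nfa []state, start int, input string) bool {
	current := closure(nfa, []int{start})
	for i := 0; i < len(input); i++ {
		c := input[i]
		var next []int
		for s := range current {
			for _, e := range nfa[s].edges {
				if e.lo <= c && c <= e.hi {
					next = append(next, e.to)
				}
			}
		}
		current = closure(nfa, next)
		if len(current) == 0 {
			return false // no state can consume this byte
		}
	}
	for s := range current {
		if nfa[s].accept {
			return true
		}
	}
	return false
}

func main() {
	nfa := buildExampleNFA()
	for _, in := range []string{"", "e5", "e123", "e", "5"} {
		fmt.Printf("%q -> %v\n", in, matches(nfa, 0, in))
	}
}
```

The simulation tracks a set of states rather than a single state, which is what makes the automaton nondeterministic: after reading a digit it is simultaneously ready to read another digit and ready to accept.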

For a demonstration of the finished project, you can search for coding Disney on Station B (Bilibili). Once the project in this section is completed, we will generate concrete Python code that implements the state machine shown above. Let's take a look at the basic directory structure of the code:
5fd04225d4eca2cf411ad81120d33ed9.png
The most complex files in the code are LexerReader.go and RegParser.go. The former reads information from the input file input.lex. When it reads a regular expression, it has two jobs. The first is to convert the characters it reads into tokens: for example, when it reads the character "(", it returns the corresponding token LEFT_PARAN, just as in our earlier lexical analysis. The second is to expand macro definitions: in the regular expression (e{D}+)?, when it reads {D}, it replaces it with [0-9], the body of the macro, before parsing continues.
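
The two jobs described above can be sketched in Go as follows. This is only an illustration under stated assumptions: LEFT_PARAN is the token name mentioned in the text, while every other token name, the Token type, and the functions expandMacros and tokenize are hypothetical and are not taken from LexerReader.go. The sketch also treats every {...} as a macro reference and does not handle escapes.

```go
package main

import (
	"fmt"
	"strings"
)

// TokenKind names a class of token. Only LEFT_PARAN appears in the
// article; the remaining names are assumptions for this sketch.
type TokenKind string

const (
	LEFT_PARAN    TokenKind = "LEFT_PARAN"
	RIGHT_PARAN   TokenKind = "RIGHT_PARAN"
	LEFT_BRACKET  TokenKind = "LEFT_BRACKET"
	RIGHT_BRACKET TokenKind = "RIGHT_BRACKET"
	DASH          TokenKind = "DASH"
	PLUS          TokenKind = "PLUS"
	OPTIONAL      TokenKind = "OPTIONAL"
	LITERAL       TokenKind = "LITERAL"
)

// Token pairs a kind with the character it was read from.
type Token struct {
	Kind TokenKind
	Char byte
}

// expandMacros replaces each {NAME} in pattern with the macro body.
func expandMacros(pattern string, macros map[string]string) (string, error) {
	var out strings.Builder
	for i := 0; i < len(pattern); i++ {
		if pattern[i] != '{' {
			out.WriteByte(pattern[i])
			continue
		}
		end := strings.IndexByte(pattern[i:], '}')
		if end < 0 {
			return "", fmt.Errorf("unterminated macro reference at offset %d", i)
		}
		name := pattern[i+1 : i+end]
		body, ok := macros[name]
		if !ok {
			return "", fmt.Errorf("undefined macro %q", name)
		}
		out.WriteString(body)
		i += end // land on '}', the loop increment then steps past it
	}
	return out.String(), nil
}

// tokenize maps each character of an expanded pattern to a token.
func tokenize(pattern string) []Token {
	kinds := map[byte]TokenKind{
		'(': LEFT_PARAN, ')': RIGHT_PARAN,
		'[': LEFT_BRACKET, ']': RIGHT_BRACKET,
		'-': DASH, '+': PLUS, '?': OPTIONAL,
	}
	tokens := make([]Token, 0, len(pattern))
	for i := 0; i < len(pattern); i++ {
		kind, special := kinds[pattern[i]]
		if !special {
			kind = LITERAL // anything else stands for itself
		}
		tokens = append(tokens, Token{Kind: kind, Char: pattern[i]})
	}
	return tokens
}

func main() {
	macros := map[string]string{"D": "[0-9]"}
	expanded, err := expandMacros("(e{D}+)?", macros)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(expanded)
	for _, tok := range tokenize(expanded) {
		fmt.Printf("%s %q\n", tok.Kind, tok.Char)
	}
}
```

Expanding macros before tokenizing keeps the tokenizer simple, since it never has to know that macros exist; a reader that expands lazily while scanning would work just as well.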

As with the earlier lexical analysis, RegParser takes the tokens it is given and builds a finite state automaton from them. To do this it must recognize the regular expression according to specific grammar rules. The details are explained in later chapters. Next, let's look at the output of the program:
0fd91988b799150085e18a37d62a1dc1.png
This is the debugging output. The screenshot above shows the calling sequence of a series of functions. These calls occur in RegParser.go: because the syntax analysis involves a series of recursive calls, we print the call stack to make it easier to analyze. This output is produced mainly by debugger.go. The final output is the NFA, that is, the information describing the finite state automaton:
f4642748aee5b867309a2d12eba78295.png
9f1fe7c216335954787897ea967a5c67.png
You can see that the printed information is consistent with the state machine transitions given above. In the next section we will go through the implementation of the code in detail. For more information, search for coding Disney on Station B.

Origin blog.csdn.net/tyler_download/article/details/127002965