[Compilation Principles] Lexical Analysis in Python

Table of Contents

1. Purpose of the Experiment

2. Experimental Tasks

3. Experimental Principle

1 Basic concepts of lexical analysis

2 Direct scanning method

3 Finite state automata

4 Introduction to flex

4. Experimental Process

1 Direct scanning method

2 Lexical analysis with flex

5. Experimental Results

1 Direct scanning method

2 Lexical analysis with flex

References

Appendix

1 Direct scanning code

2 The hide-digits.l file


1. Purpose of the Experiment

1. Master the direct scanning method;

2. Understand regular expressions and finite state automata;

3. Learn how to use tools such as flex to perform lexical analysis.

2. Experimental Tasks

1. Introduce the direct scanning method and finite state automata in the principle section;

2. Implement the direct scanning algorithm in code (required);

3. Use tools such as flex to perform lexical analysis (optional).

3. Experimental Principle

1 Basic concepts of lexical analysis

Lexical analysis is also called tokenization (word segmentation). In this stage, the compiler scans the source file from left to right and divides its character stream into tokens. A token is a sequence of characters in the source file that cannot be divided any further, analogous to a word in English or Chinese.

Figure 3.1 Schematic diagram of lexical analysis

Just as English has a finite vocabulary, a programming language has only a small, fixed set of token types. In general, the tokens of a programming language are: constants (integers, decimals, characters, strings, etc.), operators (arithmetic, comparison, and logical operators), separators (commas, semicolons, parentheses, etc.), reserved words, and identifiers (variable names, function names, class names, etc.).

Each time the lexical analyzer scans a complete token, it creates a new TokenRecord, stores the token's type in the record's type field and its literal value in the corresponding part of the value field, passes the TokenRecord on to the syntax analysis module (the next stage), and then scans the next token. Seen from the syntax analysis module, the source program thus becomes a continuous token stream.
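As a minimal sketch, the record described above might look as follows in Python (the field names type and value follow the text, while the parameter names and the token kinds shown are illustrative assumptions; the Token class in Appendix 1 plays the same role):

# Minimal sketch of a TokenRecord; the token kinds below are illustrative.
class TokenRecord:
    def __init__(self, token_type, token_value):
        self.type = token_type      # the token's category, e.g. "T_integer"
        self.value = token_value    # the token's literal value, e.g. "42"

# From the parser's point of view, the source "x = 42" becomes a token stream:
tokens = [TokenRecord("T_identifier", "x"),
          TokenRecord("T_assign", "="),
          TokenRecord("T_integer", "42")]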

2 Direct scanning method

The idea of the direct scanning method is simple: in each round of scanning, the token's type is determined from its first character, a complete token is then scanned using the strategy for that type, and the next round begins. In TinyC, considering only some simple cases, all tokens can be divided into the following 7 categories according to their first character:

  1. Type A single-character operators

Type A single-character operators include +, -, *, /, and %. Such a token consists of a single character: if the first character scanned in this round is one of these, the scanner immediately returns the token represented by that character, moves to the next character, and starts the next round of scanning.

  2. Type B single-character operators and double-character operators

Type B single-character operators are <, >, =, and !; the double-character operators are <=, >=, ==, and !=. If the first character scanned in this round is a type B single-character operator, the scanner first checks whether the next character is "=": if it is, it returns the token represented by the two characters together; if not, it returns the token represented by the single character. For example, on scanning ">" it checks whether the next character is "="; if so it returns T_GREATEREQUAL, otherwise T_GREATERTHAN. (A sketch of this dispatch follows the list.)

  3. Keywords and identifiers

Both start with a letter or underscore and consist only of letters, underscores, and digits. If the first character scanned in this round is a letter or underscore, the scanner reads ahead until it meets the first character that is neither a letter, an underscore, nor a digit, at which point a complete word has been scanned. It then checks whether the word is a keyword: if so, it returns the token represented by that keyword; if not, it returns T_IDENTIFIER together with the word's literal value.

  4. Integer constants

An integer constant starts with a digit. If the first character scanned in this round is a digit, the scanner reads ahead until the first non-digit character and then returns T_INTEGERCONSTANT together with the number.

  5. String constants

A string constant starts and ends with a double quotation mark. If the first character scanned in this round is a double quotation mark, the scanner reads ahead until the next double quotation mark and then returns T_STRINGCONSTANT together with the string.

  6. Whitespace

If the first character scanned in this round is a space, it is simply skipped.

  7. Comments

Only comments beginning with # are considered. If the first character scanned in this round is #, the rest of the line is skipped.
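A minimal sketch of one round of this dispatch in Python, covering only the type A and type B operators (the operator sets mirror the description above; the helper scan_one and its returning of raw token strings instead of named token constants are simplifications for illustration; the full implementation is in Appendix 1):

# Sketch of one scanning round (type A and type B operators only).
TYPE_A = {"+", "-", "*", "/", "%"}
TYPE_B = {"<", ">", "=", "!"}

def scan_one(s, i):
    """Scan one token of s starting at index i; return (token, next index)."""
    ch = s[i]
    if ch in TYPE_A:
        return ch, i + 1                        # single-character token
    if ch in TYPE_B:
        if i + 1 < len(s) and s[i + 1] == "=":  # look ahead for "="
            return ch + "=", i + 2              # e.g. ">=" (T_GREATEREQUAL)
        return ch, i + 1                        # e.g. ">" (T_GREATERTHAN)
    raise ValueError("first character not handled in this sketch: " + ch)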

3 Finite state automata

A finite automaton is an abstract machine used to determine whether a string (sentence) matches a regular expression. It has an alphabet Σ, a state set S, and a transition function T. When it is in some state and reads a character (which must belong to the alphabet), it switches to another state determined by the current state and the character read. It has one initial state and a set of so-called accepting states.

It works as follows: the automaton starts in the initial state and then reads the string character by character, each time switching to the next state according to the current state and the character read, until the string ends. If the automaton is then in an accepting state, the string is accepted by the automaton, as shown below:

Figure 3.2 A typical finite state automaton

The circles in the figure represent states, and each arrow, together with the character labeling it, represents a state transition. The automaton has exactly one initial state, pointed to by an arrow with no label; this can be regarded as the automaton's entry. An automaton can have one or more accepting states, drawn as double circles. The alphabet of the automaton in the figure is {a, b} and its initial state is S1. When it reads an a it moves to state S2; if instead it reads a b it moves to S4; and so on, changing state character by character. If the automaton is in an accepting state when the string ends, the string is accepted. By inspection, the strings accepted by this automaton are "ab", "abb", "abbb", and so on; that is, this automaton is equivalent to the regular expression ab+.
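As a minimal sketch, this automaton can be written as a transition table in Python (the initial state S1 and the states S2 and S4 follow the description above; treating S3 as the accepting state and S4 as a trap state are assumptions, since the figure itself is not reproduced here):

# DFA sketch equivalent to the regular expression ab+.
TRANSITIONS = {
    ("S1", "a"): "S2", ("S1", "b"): "S4",
    ("S2", "b"): "S3", ("S2", "a"): "S4",
    ("S3", "b"): "S3", ("S3", "a"): "S4",
    ("S4", "a"): "S4", ("S4", "b"): "S4",
}
ACCEPTING = {"S3"}

def accepts(string):
    """Return True if the automaton accepts the string (alphabet {a, b})."""
    state = "S1"                          # start in the initial state
    for ch in string:
        state = TRANSITIONS[(state, ch)]  # one transition per character
    return state in ACCEPTING             # accepted iff we end in an accepting state

assert accepts("ab") and accepts("abbb") and not accepts("a")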

Every regular expression has an equivalent finite state automaton, and every finite state automaton has an equivalent regular expression. Moreover, a finite state automaton scans the string only once, so the matching decision is very fast.

In short, regular expression matching can be carried out by constructing finite state automata. The general approach is to build basic automata first and then compose them following the structure of the regular expression. However, the algorithms for constructing finite state automata are quite involved, so the tool flex can be used instead to perform lexical analysis based on regular matching.

4 Introduction to flex

Flex is a fast lexical analyzer generator. From token-matching patterns that the user writes as regular expressions, it constructs a finite state automaton (emitted as a C function). Many compilers use it to generate their lexical analyzers.

4. Experimental Process

1 Direct scanning method

The direct scanning method is conceptually simple and needs very little code: scan1.py is only about 100 lines. Its disadvantages are speed and flexibility: an identifier token must be scanned at least twice, with string lookup and comparison on top of that, and the method is hard to extend, so it is only suitable for languages with a simple lexical structure.

The full code is given in Appendix 1.

2 Lexical analysis with flex

This part replaces every run of consecutive digits in the input with a specified character, here '?'. The specific steps are as follows:

  1. Install flex;
  2. Create a hide-digits.l file (see Appendix 2 for its contents);
  3. Run flex on this file: flex hide-digits.l;
  4. A lex.yy.c file now appears in the directory; compile and run this C file (for example, gcc lex.yy.c -lfl -o hide-digits, where -lfl links the flex runtime library);
  5. Type any text in the terminal and press Enter; typing '#' ends the program.

5. Experimental Results

1 Direct scanning method

The running result is shown in Figure 5.1: the program scans its own source file (scan1.py) and prints the type and value of each token.

Figure 5.1 Running result of the direct scanning method

2 Lexical analysis with flex

*Note: in the transcript below, each input line is followed by the program's output line (in the original screenshot the output was shown in bold).

Abcedfs

Abcedfs

12456789

?

Assdfasa1564

Assdfasa?

Adsfa123asdfaf56

Adsfa?asdfaf?

...

#

The results above show that the program converts every run of consecutive digits in the input to '?'. They also show how productive the flex lexical analyzer generator is: lexical analysis is implemented here with far less code, saving considerable effort compared with the direct scanning method.

References

  1. Lexical analysis: https://pandolia.net/tinyc/ch7_lexical_basic.html
  2. Use flex for lexical analysis: https://pandolia.net/tinyc/ch8_flex.html

 

 

Appendix

1 Direct scanning code

# -*- coding: utf-8 -*-

# Type A single-character operators: each one is a complete token by itself.
single_char_operators_typeA = {
    ";", ",", "(", ")", "{", "}", "[",
    "]", "/", "+", "-", "*", "%", ".",
    ":"
}

# Type B single-character operators: each may combine with a following "=".
single_char_operators_typeB = {
    "<", ">", "=", "!"
}

# The two-character operators formed from a type B character plus "=".
# (Not consulted by scan(); it documents the combined tokens.)
double_char_operators = {
    ">=", "<=", "==", "!="
}

reservedWords = {
    "class", "for", "while", "if", "else",
    "return", "break", "True", "False", "raise", "pass",
    "in", "continue", "elif", "yield", "not", "def"
}

class Token:
    def __init__(self, _type, _val = None):
        if _val is None:
            # Operators and keywords: the literal value is the type itself.
            self.type = "T_" + _type
            self.val = _type
        else:
            self.type, self.val = _type, _val

    def __str__(self):
        return "%-20s%s" % (self.type, self.val)

class NoneTerminateQuoteError(Exception):
    pass

def isWhiteSpace(ch):
    return ch in " \t\r\a\n"

def isDigit(ch):
    return ch in "0123456789"

def isLetter(ch):
    return ch in "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"

# Scan one source line s and yield its tokens. Error messages reference the
# global variable 'line', which is set by the loop in __main__ below.
def scan(s):
    n, i = len(s), 0
    while i < n:
        ch, i = s[i], i + 1

        if isWhiteSpace(ch):
            continue

        if ch == "#":       # comment: skip the rest of the line
            return

        if ch in single_char_operators_typeA:
            yield Token(ch)
        elif ch in single_char_operators_typeB:
            # Look ahead one character for a two-character operator.
            if i < n and s[i] == "=":
                yield Token(ch + "=")
                i += 1
            else:
                yield Token(ch)
        elif isLetter(ch) or ch == "_":
            # Keyword or identifier: scan to the first character that is
            # neither a letter, a digit, nor an underscore.
            begin = i - 1
            while i < n and (isLetter(s[i]) or isDigit(s[i]) or s[i] == "_"):
                i += 1
            word = s[begin:i]
            if word in reservedWords:
                yield Token(word)
            else:
                yield Token("T_identifier", word)
        elif isDigit(ch):
            # Integer or decimal constant: at most one dot is allowed.
            begin = i - 1
            aDot = False
            while i < n:
                if s[i] == ".":
                    if aDot:
                        raise Exception("Too many dots in a number!\n\tline:" + line)
                    aDot = True
                elif not isDigit(s[i]):
                    break
                i += 1
            yield Token("T_double" if aDot else "T_integer", s[begin:i])
        elif ord(ch) == 34:  # 34 is the ASCII code of '"'
            # String constant: scan to the closing double quotation mark.
            begin = i
            while i < n and ord(s[i]) != 34:
                i += 1
            if i == n:
                raise NoneTerminateQuoteError("Non-terminated string quote!\n\tline:" + line)
            yield Token("T_string", chr(34) + s[begin:i] + chr(34))
            i += 1
        else:
            raise Exception("Unknown symbol!\n\tline:" + line + "\n\tchar:" + ch)

if __name__ == "__main__":
    print("%-20s%s" % ("TOKEN TYPE", "TOKEN VALUE"))
    print("-" * 50)
    for line in open("scan1.py"):
        for token in scan(line):
            print(token)

2 The hide-digits.l file

%%
[0-9]+  printf("?");
#       return 0;
.       ECHO;
%%

/* yylex() runs the generated scanner on standard input until a rule
   returns; the '#' rule above makes it return 0, ending the program. */
int main(int argc, char* argv[]) {
    yylex();
    return 0;
}

/* Returning 1 tells the scanner there is no further input to process. */
int yywrap() {
    return 1;
}

 
