Writing a Compiler from Scratch (Part 1): The Input System and Lexical Analysis

Foreword

It has been a while since I finished my half-copied, half-modified compiler that translates C into Java bytecode. I have always wanted to write a series of posts to review and organize the process of writing it; consider these my study notes. Starting today, here it is.

The complete code for the project is in C2j-Compiler.

I will first write a C interpreter that executes by traversing the AST directly, and then add a code generation section afterwards, which compiles to Java bytecode.

It supports most of the C language (see the link above for specifics), though of course it is still a toy-grade compiler, if anything even more toy than toy.

Getting Started

A complete compiler consists of roughly these main parts:

  • lexical analysis

Usually implemented with a finite state automaton or written by hand; this step outputs the token sequence.

  • Syntax analysis

Parsing is divided into top-down and bottom-up approaches, generally implemented with recursive descent, LL(1), LR(1), or LALR(1). This step outputs the syntax tree.

  • Semantic Analysis

The main tasks of semantic analysis are to build the symbol table and to detect statements that violate the language's semantics. The output of this step is still the AST.

  • Code Generation

This stage generally generates a relatively platform-independent, low-level intermediate representation (IR). It takes the AST as input and outputs IR.

  • Code optimization

As the name suggests, this step optimizes the code to improve performance and so on.

  • Target code generation

The task of this step is to generate platform-dependent assembly code.

The above is roughly a compiler in the general sense; the process can also include invoking the assembler and linker to produce an executable file.
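
To make the data flow between these stages concrete, here is a minimal sketch of the pipeline as method signatures. All of the names and types here are illustrative assumptions, not C2j-Compiler's actual API:

import java.util.List;

// Illustrative stage signatures only; every name here is an assumption.
class SourceToken {}
class AstNode {}
class SymbolTable {}
class IrProgram {}

interface CompilerPipeline {
    List<SourceToken> lex(String source);            // characters -> token sequence
    AstNode parse(List<SourceToken> tokens);         // tokens -> syntax tree
    SymbolTable analyze(AstNode ast);                // build symbols, check semantics
    IrProgram lower(AstNode ast, SymbolTable syms);  // AST -> intermediate representation
    IrProgram optimize(IrProgram ir);                // improve the IR
    byte[] emit(IrProgram ir);                       // IR -> target code
}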

Since time was limited, C2j-Compiler handles the last three steps by simply traversing the AST and generating the target Java bytecode directly, without any optimization. The lexical analyzer is hand-written, and parsing is done with an LALR(1) parse table.

Input System

For a source file of a thousand lines or more, it is worth building an input system to make input efficient.

The input system consists of three files:

  • FileHandler.java
  • DiskFileHandler.java
  • Input.java

FileHandler

FileHandler is the input interface; DiskFileHandler implements it to read from a file. There are three main methods:

void open();                               // open the underlying input source
int close();                               // close it and return a status code
int read(byte[] buf, int begin, int end);  // read into buf starting at begin; 'end' is used as a length below

Here, read copies up to the requested number of bytes into the buffer, starting at the specified position, and returns the number of bytes actually read (or -1 on failure).
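
As a rough idea of what the implementation looks like, here is a minimal sketch of a DiskFileHandler built on FileInputStream. The constructor and error handling here are my assumptions; see the repository for the real code:

import java.io.FileInputStream;
import java.io.IOException;

// The interface from above, repeated so the sketch is self-contained.
interface FileHandler {
    void open();
    int close();
    int read(byte[] buf, int begin, int end);
}

// Minimal sketch of a disk-backed implementation; details are assumptions.
class DiskFileHandler implements FileHandler {
    private final String fileName;
    private FileInputStream in;

    DiskFileHandler(String fileName) {
        this.fileName = fileName;
    }

    @Override
    public void open() {
        try {
            in = new FileInputStream(fileName);
        } catch (IOException e) {
            System.err.println("Can't open input file: " + fileName);
        }
    }

    @Override
    public int close() {
        if (in == null) {
            return -1;
        }
        try {
            in.close();
            return 0;
        } catch (IOException e) {
            return -1;
        }
    }

    @Override
    public int read(byte[] buf, int begin, int end) {
        try {
            // Returns how many bytes were actually read, or -1 at end of stream.
            return in.read(buf, begin, end);
        } catch (IOException e) {
            return -1;
        }
    }
}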

The complete source code is in my repository: dejavudwh.

Input

Input is the key class of the whole input system. It uses a buffer to improve input efficiency: part of the file contents is first loaded into the buffer, and when the input pointer is about to cross into the danger zone, the buffer is refilled. This way the file is read in large chunks, avoiding many small IO operations.
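
Concretely, the buffer geometry looks something like this. The constant names echo the fields used in the code below, but the sizes are my assumptions:

// Sketch of the buffer geometry behind the "danger zone"; sizes are assumptions.
class BufferLayout {
    static final int MAXLOOK = 16;              // the most lookahead the lexer may need
    static final int BUFSIZE = 4096;            // total buffer capacity
    static final int END = BUFSIZE;             // one past the last valid slot
    static final int DANGER = END - MAXLOOK;    // once next passes this, refill the buffer
}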

inputAdvance advances the input by one position and returns the character read; before doing so, it first checks whether the buffer needs to be flushed.

public byte inputAdvance() {
    char enter = '\n';

    if (isReadEnd()) {
        return 0;
    }

    if (!readEof && flush(false) < 0) {
        // buffer error
        return -1;
    }

    if (inputBuf[next] == enter) {
        // keep the current line number up to date for diagnostics
        curCharLineno++;
    }

    endCurCharPos++;

    return inputBuf[next++];
}
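
A hypothetical use of the class, draining a file one byte at a time. The construction and wiring shown here are assumptions about the API, not the repository's actual setup code:

// Hypothetical usage; constructor and setFileHandler are assumed, not confirmed.
Input input = new Input();
input.setFileHandler(new DiskFileHandler("test.c"));
for (byte b = input.inputAdvance(); b > 0; b = input.inputAdvance()) {
    System.out.print((char) b);
}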

The main logic of flush is to check whether the next pointer has moved past the danger zone, or whether force is true to request a forced flush; in either case fillBuf is called to refill the buffer.

private int flush(boolean force) {
    int noMoreCharToRead = 0;
    int flushOk = 1;

    int copyPart, leftEdge;
    if (isReadEnd()) {
        return noMoreCharToRead;
    }

    if (readEof) {
        return flushOk;
    }

    if (next > DANGER || force) {
        // shift the unread tail of the buffer to the front...
        leftEdge = next;
        copyPart = bufferEndFlag - leftEdge;
        System.arraycopy(inputBuf, leftEdge, inputBuf, 0, copyPart);
        if (fillBuf(copyPart) == 0) {
            System.err.println("Internal Error, flush: Buffer full, can't read");
        }

        // ...and rebase all positions relative to the new buffer start
        startCurCharPos -= leftEdge;
        endCurCharPos -= leftEdge;
        next -= leftEdge;
    }

    return flushOk;
}

private int fillBuf(int startPos) {
    int need;
    int got;
    need = END - startPos;
    if (need < 0) {
        System.err.println("Internal Error (fill buf): Bad read-request starting addr.");
    }

    if (need == 0) {
        return 0;
    }

    if ((got = fileHandler.read(inputBuf, startPos, need)) == -1) {
        System.err.println("Can't read input file");
    }

    bufferEndFlag = startPos + got;
    if (got < need) {
        // the input stream has reached its end
        readEof = true;
    }

    return got;
}
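
To see the pointer arithmetic concretely, here is a worked trace of a single flush, with every number invented for illustration (using the sketched sizes BUFSIZE = 4096, DANGER = 4080 from above):

// One flush, step by step (all numbers invented):
// before: next = 4090, bufferEndFlag = 4096  -> next > DANGER, so we flush
// leftEdge = 4090; copyPart = 4096 - 4090 = 6
//   the 6 not-yet-consumed bytes are shifted to inputBuf[0..5]
// fillBuf(6) then reads up to END - 6 fresh bytes into inputBuf[6..]
// finally next -= leftEdge, so scanning resumes at next = 0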

Lexical Analysis

The job of lexical analysis is to split the source file's input stream into tokens; the Lexer's output looks something like <if, keyword>. In this section we recognize identifiers, numbers, and keywords.

The lexer consists of two files:

  • Token.java
  • Lexer.java

Token

Token is an enum that identifies each kind of token and is used mainly in the Lexer: NAME represents an identifier, NUMBER represents a number, STRUCT represents the struct keyword, and so on.

//terminals
NAME, TYPE, STRUCT, CLASS, LP, RP, LB, RB, PLUS, LC, RC, NUMBER, STRING, QUEST, COLON,
RELOP, ANDAND, OR, AND, EQUOP, SHIFTOP, DIVOP, XOR, MINUS, INCOP, DECOP, STRUCTOP,
RETURN, IF, ELSE, SWITCH, CASE, DEFAULT, BREAK, WHILE, FOR, DO, CONTINUE, GOTO,
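
Since lex() (shown below) returns these constants' ordinal() values, the integer can be mapped back to its enum constant when needed:

int tokenType = Token.SEMI.ordinal();     // the integer lex() returns for ';'
Token kind = Token.values()[tokenType];   // map the integer back to the constant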

Lexer

The Lexer uses the Input class described earlier to read the input stream, and outputs a stream of tokens.

public void advance() {
    // fetch the next token's type into the lookahead slot
    lookAhead = lex();
}
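
A parser typically drives the lexer through a match-style helper. A hedged sketch, assuming lookAhead is readable from outside the Lexer:

// Hypothetical helper: consume the lookahead token only if it has the expected type.
boolean match(Lexer lexer, Token expected) {
    if (lexer.lookAhead == expected.ordinal()) {
        lexer.advance();   // consume it and fetch the next token
        return true;
    }
    return false;
}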

The Lexer's main logic is in lex(). Each call uses inputAdvance to read from the input stream until it meets whitespace or a newline, which marks the end of at least one token (if a double quote has been met, meaning we are inside a string literal, spaces are not treated as whitespace), and then the analysis begins.

Only part of the code is shown here because it is too long; the logic is very simple. Also, I did not handle comments when I first wrote this, and never added support for them later.

for (int i = 0; i < current.length(); i++) {
    length = 0;
    text = current.substring(i, i + 1);
    switch (current.charAt(i)) {
        case ';':
            current = current.substring(1);
            return Token.SEMI.ordinal();
        case '+':
            // "++" is INCOP; a lone '+' is PLUS
            if (i + 1 < current.length() && current.charAt(i + 1) == '+') {
                current = current.substring(2);
                return Token.INCOP.ordinal();
            }

            current = current.substring(1);
            return Token.PLUS.ordinal();

        case '-':
            // "->" is STRUCTOP, "--" is DECOP, a lone '-' is MINUS
            if (i + 1 < current.length() && current.charAt(i + 1) == '>') {
                current = current.substring(2);
                return Token.STRUCTOP.ordinal();
            } else if (i + 1 < current.length() && current.charAt(i + 1) == '-') {
                current = current.substring(2);
                return Token.DECOP.ordinal();
            }

            current = current.substring(1);
            return Token.MINUS.ordinal();

            ...
            ...
}
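
The identifier, number, and keyword cases are among the parts elided above. Here is a hedged sketch of what that classification amounts to; the keyword table is deliberately tiny, and none of this is the repository's exact code:

import java.util.Map;

// Sketch only: classify a whitespace-delimited word as NUMBER, a keyword, or NAME.
private static final Map<String, Token> KEYWORDS = Map.of(
        "struct", Token.STRUCT, "if", Token.IF, "else", Token.ELSE,
        "while", Token.WHILE, "return", Token.RETURN);

private int classifyWord(String word) {
    if (Character.isDigit(word.charAt(0))) {
        return Token.NUMBER.ordinal();    // starts with a digit -> number literal
    }
    Token keyword = KEYWORDS.get(word);
    if (keyword != null) {
        return keyword.ordinal();         // reserved word
    }
    return Token.NAME.ordinal();          // anything else is an identifier
}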

With that, the input system and lexical analysis are done.

The job of the lexical analysis stage is to convert the input character stream into specific tokens. This step is a process of recognizing combinations of characters, mainly numbers, identifiers, and keywords. It should be the easiest part of the whole compiler.

Also see my GitHub blog: https://dejavudwh.cn/
