Writing a compiler from scratch (3): basic data structures for syntax analysis

The complete code for the project is in C2j-Compiler.

Foreword

I am writing this series as a record of my own process of learning to write a compiler. It does not cover the principles behind the algorithms; if you want those, the Dragon Book explains them very clearly. That said, I found it very hard going when I started and I still have not finished it. It reads like a book written for people who already have some background.

The parse package contains eight files in total. Everything for the syntax analysis phase lives in it:

  • Symbols.java
  • Production.java
  • SyntaxProductionInit.java
  • FirstSetBuilder.java
  • ProductionManager.java
  • ProductionsStateNode.java
  • StateNodeManager.java
  • LRStateTableParser.java

SyntaxProductionInit: initializing the grammar

As mentioned in the previous post, to check whether a sentence is correct we naturally need a grammar, that is, a set of productions to derive it from.

All grammar initialization is done in SyntaxProductionInit.

// EXT_DECL_LIST -> EXT_DECL_LIST COMMA EXT_DECL
right = getProductionRight(new int[]{Token.EXT_DECL_LIST.ordinal(), Token.COMMA.ordinal(), Token.EXT_DECL.ordinal()});
production = new Production(productionNum, Token.EXT_DECL_LIST.ordinal(), 0, right);
productionNum++;
addProduction(production, false);
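The snippet above passes an int[] of Token ordinals to getProductionRight and gets back the list used as the production's right-hand side. The post does not show that helper, so the following is only a minimal sketch of what it might do, assuming it simply boxes the array into an ArrayList<Integer>. The three-value Token enum here is a hypothetical stand-in for the project's own enum.

```java
import java.util.ArrayList;

class ProductionRightSketch {
    // Hypothetical subset of the project's Token enum; only ordinal() matters here.
    enum Token { EXT_DECL_LIST, EXT_DECL, COMMA }

    // Assumed behaviour of getProductionRight: copy the symbol values into a list.
    static ArrayList<Integer> getProductionRight(int[] symbols) {
        ArrayList<Integer> right = new ArrayList<>(symbols.length);
        for (int symbol : symbols) {
            right.add(symbol);
        }
        return right;
    }

    public static void main(String[] args) {
        // EXT_DECL_LIST -> EXT_DECL_LIST COMMA EXT_DECL
        ArrayList<Integer> right = getProductionRight(new int[]{
                Token.EXT_DECL_LIST.ordinal(), Token.COMMA.ordinal(), Token.EXT_DECL.ordinal()});
        System.out.println(right); // prints [0, 2, 1] with this sketch's enum order
    }
}
```

Storing symbols as plain integers keeps terminals and nonterminals in one value space, which is what lets a right-hand side mix both.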

For example, the productions listed in the comment further down correspond to variable declarations in C. PROGRAM is the start symbol of the whole grammar, and EXT_DEF_LIST is the list of declarations. Note that EXT_DEF_LIST -> EXT_DEF_LIST EXT_DEF is left recursive. An LR grammar can handle left recursion, as covered in an earlier post on this blog.

For example, through EXT_DECL_LIST -> EXT_DECL -> VAR_DECL, a declaration statement can derive either a single variable name or several variable names separated by commas.

VAR_DECL can be an identifier or a pointer, possibly with several levels of indirection.

In other words, parsing starts from the leaf nodes by reading terminals and works upward until the start symbol has been derived and the input stream has been fully consumed.
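To make the bottom-up idea concrete, here is a rough sketch that reduces a comma-separated list of names up to EXT_DECL_LIST. It is a greedy shift-and-reduce matcher written only for illustration, not the table-driven LR parser this project builds later, and it uses strings instead of Token ordinals for readability. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class BottomUpSketch {
    // Each rule is {left, right...}. The longest right-hand side comes first so the
    // naive matcher prefers EXT_DECL_LIST COMMA EXT_DECL over a lone EXT_DECL.
    static final String[][] RULES = {
        {"EXT_DECL_LIST", "EXT_DECL_LIST", "COMMA", "EXT_DECL"},
        {"EXT_DECL_LIST", "EXT_DECL"},
        {"EXT_DECL", "VAR_DECL"},
        {"VAR_DECL", "NEW_NAME"},
        {"NEW_NAME", "NAME"},
    };

    // Try to replace the top of the stack with the left side of one rule.
    static boolean reduceOnce(List<String> stack) {
        for (String[] rule : RULES) {
            int len = rule.length - 1;
            if (stack.size() < len) continue;
            List<String> top = stack.subList(stack.size() - len, stack.size());
            if (top.equals(Arrays.asList(rule).subList(1, rule.length))) {
                top.clear();          // pop the right-hand side
                stack.add(rule[0]);   // push the left-hand side
                return true;
            }
        }
        return false;
    }

    // Shift each token, then reduce for as long as some rule matches.
    static List<String> parse(String... tokens) {
        List<String> stack = new ArrayList<>();
        for (String token : tokens) {
            stack.add(token);
            while (reduceOnce(stack)) { /* keep reducing */ }
        }
        return stack;
    }

    public static void main(String[] args) {
        System.out.println(parse("NAME", "COMMA", "NAME")); // prints [EXT_DECL_LIST]
    }
}
```

The input is accepted by this sketch when the stack ends up holding a single EXT_DECL_LIST. Two names without a comma leave two symbols on the stack, so they are rejected.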

/*
*   PROGRAM -> EXT_DEF_LIST
*
*  EXT_DEF_LIST -> EXT_DEF_LIST EXT_DEF
*
*  EXT_DEF -> OPT_SPECIFIERS EXT_DECL_LIST  SEMI
*             | OPT_SPECIFIERS SEMI
*
*
*  EXT_DECL_LIST ->   EXT_DECL
*                   | EXT_DECL_LIST COMMA EXT_DECL
*
*  EXT_DECL -> VAR_DECL
*
*  OPT_SPECIFIERS -> CLASS TTYPE
*                   | TTYPE
*                   | SPECIFIERS
*                   | EMPTY?
*
*  SPECIFIERS -> TYPE_OR_CLASS
*                | SPECIFIERS TYPE_OR_CLASS
*
*
*  TYPE_OR_CLASS -> TYPE_SPECIFIER
*                   | CLASS
*
*  TYPE_SPECIFIER ->  TYPE
*
*  NEW_NAME -> NAME
*
*  NAME_NT -> NAME
*
*  VAR_DECL -> NEW_NAME
*              | START VAR_DECL
*
*/

While the grammar's productions are being initialized, three data structures are built in total:

private HashMap<Integer, ArrayList<Production>> productionMap = new HashMap<>();
private HashMap<Integer, Symbols> symbolMap = new HashMap<>();
private ArrayList<Symbols> symbolArray = new ArrayList<>();
  • productionMap: the key is the left-hand side of a production, and the value is the one or more productions that share that left-hand side
  • symbolMap is similar to productionMap: the key is the left-hand-side symbol, and the value is a Symbols object holding the right-hand sides of its one or more productions
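The grouping by left-hand side can be sketched as follows. This is a simplified illustration under stated assumptions: the project's addProduction takes a Production and a boolean, while this hypothetical version takes raw symbol values and stores int[] right-hand sides so that the example stays self-contained.

```java
import java.util.ArrayList;
import java.util.HashMap;

class ProductionMapSketch {
    // Maps a left-hand-side symbol value to every right-hand side registered for it.
    // The real project stores Production objects; int[] keeps this sketch short.
    private final HashMap<Integer, ArrayList<int[]>> productionMap = new HashMap<>();

    // Hypothetical simplified signature: left symbol plus its right-hand-side symbols.
    void addProduction(int left, int... right) {
        productionMap.computeIfAbsent(left, key -> new ArrayList<>()).add(right);
    }

    // Returns all alternatives for a left-hand side, or an empty list if none exist.
    ArrayList<int[]> getProductions(int left) {
        return productionMap.getOrDefault(left, new ArrayList<>());
    }

    public static void main(String[] args) {
        ProductionMapSketch grammar = new ProductionMapSketch();
        grammar.addProduction(0, 1);        // e.g. EXT_DECL_LIST -> EXT_DECL
        grammar.addProduction(0, 0, 2, 1);  // e.g. EXT_DECL_LIST -> EXT_DECL_LIST COMMA EXT_DECL
        System.out.println(grammar.getProductions(0).size()); // prints 2
    }
}
```

Looking up all alternatives of a nonterminal in one step is what later stages, such as building closures, would need most often.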

Symbols is similar to Production in that it is also used to represent productions, with one small difference: it covers terminals as well. This difference matters in later steps.

//Symbols
public int value;
public ArrayList<int[]> productions;
public ArrayList<Integer> firstSet = new ArrayList<>();
public boolean isNullable;

If a nonterminal can derive the empty string, we call it a nullable nonterminal.
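One common way to find the nullable nonterminals is a fixed-point iteration: keep marking a nonterminal as nullable whenever one of its right-hand sides consists only of nullable symbols, until nothing changes. The sketch below assumes a grammar given as a map from left-hand side to right-hand sides, with an empty array standing for an empty production. It is not taken from the project, and the names are hypothetical.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class NullableSketch {
    // grammar: left symbol -> right-hand sides; an empty int[] is an empty production.
    static Set<Integer> computeNullable(Map<Integer, List<int[]>> grammar) {
        Set<Integer> nullable = new HashSet<>();
        boolean changed = true;
        while (changed) {                       // repeat until no new symbol is added
            changed = false;
            for (Map.Entry<Integer, List<int[]>> entry : grammar.entrySet()) {
                if (nullable.contains(entry.getKey())) continue;
                for (int[] right : entry.getValue()) {
                    boolean allNullable = true;
                    for (int symbol : right) {
                        // Terminals are never keys of the grammar, so they never become nullable.
                        if (!nullable.contains(symbol)) { allNullable = false; break; }
                    }
                    if (allNullable) {
                        nullable.add(entry.getKey());
                        changed = true;
                        break;
                    }
                }
            }
        }
        return nullable;
    }
}
```

In the Symbols class shown above, the result of such a pass would be what the isNullable flag records for each symbol.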

  • symbolArray stores every Symbols object; it serves a different purpose in later steps

The Production class

As mentioned in the previous post, checking syntax amounts to finding a derivation of the input among a set of productions. The Production class represents a single production.

private int dotPos = 0;
private int left;
private ArrayList<Integer> right;
private ArrayList<Integer> lookAhead = new ArrayList<>();
private int productionNum = -1;

public Production(int productionNum, int left, int dot, ArrayList<Integer> right) {
    this.left = left;
    this.right = right;
    this.productionNum = productionNum;
    lookAhead.add(Token.SEMI.ordinal());

    if (dot >= right.size()) {
        dot = right.size();
    }
    this.dotPos = dot;
}
  1. left and right are the left-hand and right-hand sides of the production, expressed as the Token values introduced earlier
  2. lookAhead is the lookahead set, which is covered later
  3. dotPos assists in constructing the automaton and is explained in detail when it is used
  4. productionNum is the number of this production, assigned when the grammar is initialized
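To give a feel for how a dot position is typically used, the sketch below models a production with a dot and two operations on it: reading the symbol right after the dot and moving the dot one step forward. This is an illustrative assumption about where the series is heading, not the project's Production class. The class name, the method names and the NO_SYMBOL sentinel are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

class DottedProductionSketch {
    static final int NO_SYMBOL = -1;   // hypothetical sentinel: the dot is at the end

    final int left;
    final List<Integer> right;
    final int dotPos;

    DottedProductionSketch(int left, List<Integer> right, int dot) {
        this.left = left;
        this.right = new ArrayList<>(right);
        // Clamp the dot to the end of the right-hand side, like the constructor above.
        this.dotPos = Math.min(dot, right.size());
    }

    // True once every right-hand-side symbol lies to the left of the dot.
    boolean isComplete() {
        return dotPos == right.size();
    }

    // The symbol immediately after the dot, or NO_SYMBOL if the dot is at the end.
    int symbolAfterDot() {
        return isComplete() ? NO_SYMBOL : right.get(dotPos);
    }

    // A copy of this production with the dot moved one symbol to the right.
    DottedProductionSketch dotForward() {
        return new DottedProductionSketch(left, right, dotPos + 1);
    }
}
```

Returning a new object from dotForward keeps the original item unchanged, so the same production can appear in several automaton states with different dot positions.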

Summary

This post introduced several data structures that the finite-state automaton will later be built on. I originally wanted to cover building the state machine here as well, but that would have made the post too long and the material would feel scattered, so the construction of the automaton is left for the next post.

My GitHub blog: https://dejavudwh.cn/

Origin www.cnblogs.com/secoding/p/11367530.html