词法分析器实现过程（java和c++实现）

词法分析器是编译原理的一个实验，本文将会详细给出实现的具体步骤，利用java进行示例讲解，源码（包含java和c++两种实现方式）可在 https://download.csdn.net/download/qq_40121502/10926516 下载。

一、实验目的

设计、编写一个词法分析程序，加深对词法分析原理的理解。

二、实验原理

词法分析是从左向右一个字符、一个字符地读入源程序，扫描每行源程序的符号，依据词法规则，识别单词。执行词法分析的程序称为词法分析器，将给定的程序通过词法分析器，识别出一个个单词符号。单词符号常采用统一的二元式表示：（单词种别码，单词符号自身的值），单词种别码是语法分析所需要的信息，而单词符号自身属性值是编译其他阶段需要的信息。

三、实验内容

（1）程序利用C++或Java或其他编程语言，给出其词法的产生式（词法部分也可以用regular expression），包含数学运算表达式，赋值，函数调用，控制流语句（分支或循环），类型声明等基本元素。
（2）词法实验部分要求写出该语言的词法分析程序。要求识别出词法分析单元，给出词法单元的分类，按照不同类别填入相应的表中，表中给出词法单元在源程序中的位置。

四、实验方法

（1）词法的产生式（或regular expression）转换成NFA
（2）NFA确定化为DFA
（3）DFA最小化
（4）根据最小化后的DFA写出词法分析程序。
（5）使用选定的编程语言写几个测试用例文件作为输入，测试词法分析程序是否正确。

五、实验设计

（1）关键字：
“char”,“int”,“float”,“double”,“final”,“if”,“else”,“while”,“do”,“for”,“break”,“continue”,”void”,”return”
（2）运算符：=、==、>、<、>=、<=、+、-、*、/
（3）界限符：( ) [ ]{ } , : ;
（4）标识符（ID）：用字母、数字、下划线的连接用来表示变量名、过程名等的单词称为标识符（不能以数字开头、不与关键字重名、不包含$、#等无法识别的字符）
（5）常量（NUM）：整数、小数、浮点数。
（6）词法分析阶段通常忽略空格、制表符和换行符等。
根据以上的分类与分析，设置该语言中的单词符号及种别编码如下表：
在这里插入图片描述

六、状态图分析

1.关键字

本次设计中关键字有17个，分别包括：
“char”,“short”,“int”,“long”,“float”,“double”,“final”,“static”,“if”,“else”,“while”,
“do”,“for”,“break”,“continue”,“void”, “return”。
关键字都是由小写字母组成，在程序中，将17个关键字保存在一个String类型的数组中，然后做一次循环，将字符串逐个与17个关键字对比，相同则取出对应的种别码，存入ArrayList中。
关键字的正规表达式为：
key->letter( letter )*
letter->[a-z]
按照RE—>NFA—>DFA—>最小化DFA的顺序，过程如图所示：
在这里插入图片描述

2.运算符

本实验设计了十个运算符，分别为：=、==、>、<、>=、<=、+、-、、/
运算符表达式为：
operation -> ( = | == | > | < | >= | <= | + | - | * | / )
按照RE—>NFA—>DFA—>最小化DFA的顺序，过程如图所示：在这里插入图片描述

3.界限符

本实验设计的界限符有：( ) [ ]{ } , : ;
界限符表达式为：
delimiter -> ( ( | ) | [ | ] | { | } | , | : | ; )*
从RE—>NFA—>DFA—>最小化DFA的顺序，过程如图所示：
在这里插入图片描述

4.标识符

用字母、数字、下划线的连接用来表示变量名、过程名等的单词称为标识符（不能以数字开头、不与关键字重名、不包含$、#等无法识别的字符）。
标识符的正规表达式为：
ID ->( letter | _ ) ( letter | digit | _ )*
letter->[a-zA-Z]
digit->[0-9]
按照RE—>NFA—>DFA—>最小化DFA的顺序，过程如图所示：
在这里插入图片描述

5.常量

将整数，浮点数归并后分为无符号数和有符号数。
因为有符号数，除了开始有符号外，之后的判断与无符号数是一致的。所以在这里只针对无符号数的正规表达式构造NFA和DFA。
无符号数正规表达式：
NUM -> digits op_fra op_exp
其中：
digits -> d(d)*
d -> [0-9]
op_fra-> .digits | ε
op_exp -> ( E (+|-| ε) digits) | ε
将上述表达式整合得无符号数NUM的正规表达式为：
NUM->d(d)|d(d).d(d)|d(d)(E(+|-|ε)d(d))|d(d).d(d)E(+|-|ε)d(d)
d -> [0-9]
按照RE—>NFA—>DFA—>最小化DFA的顺序，过程如图所示：
在这里插入图片描述

七、数据结构

（1）使用ArrayList<String>来存放从文件内读取进来的每个单词（符号）。
（2）自定义类Pair，相当于键值对，用于存放单词种别码和对应的单词，包含一个Integer类型的key和一个String类型的value。
（3）使用ArrayList<Pair>来存放词法分析的结果。

八、实现代码

1.定义关键字

首先定义了一系列关键字及其对应的种别码，用于表示词法分析的结果。

// 单词种别码, 1-17为关键字种别码
public static final int CHAR = 1;
public static final int SHORT = 2;
public static final int INT = 3;
public static final int LONG = 4;
public static final int FLOAT = 5;
public static final int DOUBLE = 6;
public static final int FINAL = 7;
public static final int STATIC = 8;
public static final int IF = 9;
public static final int ELSE = 10;
public static final int WHILE = 11;
public static final int DO = 12;
public static final int FOR = 13;
public static final int BREAK = 14;
public static final int CONTINUE = 15;
public static final int VOID = 16;
public static final int RETURN = 17;

// 20为标识符种别码
public static final int ID = 20;

// 30为常量种别码
public static final int NUM = 30;

// 31-40为运算符种别码
public static final int AS = 31; // =
public static final int EQ = 32; // ==
public static final int GT = 33; // >
public static final int LT = 34; // <
public static final int GE = 35; // >=
public static final int LE = 36; // <=
public static final int ADD = 37; // +
public static final int SUB = 38; // -
public static final int MUL = 39; // *
public static final int DIV = 40; // /

// 41-49为界限符种别码
public static final int LP = 41; // (
public static final int RP = 42; // )
public static final int LBT = 43; // [
public static final int RBT = 44; // ]
public static final int LBS = 45; // {
public static final int RBS = 46; // }
public static final int COM = 47; // ,
public static final int COL = 48; // :
public static final int SEM = 49; // ;

// -1为无法识别的字符标志码
public static final int ERROR = -1;
public static int errorNum = 0; // 记录词法分析错误的个数

2.单词分割

对于输入的代码段，进行单词的分割，需要识别判断出分割的位置，最后将拆分好的单词序列放入ArrayList<String>中保存，用于后续的词法分析。

测试文件读入内存

final String blank = " ";
final String newLine = "\n";
Scanner scanner = new Scanner(System.in);
System.out.print("请输入测试文件的路径：");
String filepath = scanner.next();
StringBuffer sb = new StringBuffer();
String fileStr = "";
try {
    InputStream is = new FileInputStream(filepath);
    String line; // 用来保存每行读取的内容
    BufferedReader reader = new BufferedReader(new InputStreamReader(is));
    line = reader.readLine(); // 读取第一行
    while (line != null) { // 如果 line 为空说明读完了
        sb.append(line); // 将读到的内容添加到 buffer 中
        sb.append("\n"); // 添加换行符
        line = reader.readLine(); // 读取下一行
    }
    reader.close();
    is.close();
    fileStr = sb.toString();
} catch (IOException e) {
    e.printStackTrace();
}

System.out.println("文件内容如下：\n" + fileStr);

单词分割

采用循环读取的方式将输入的内容进行读取分析
getFirstChar()：去除字符串开头的空格和换行符，找到第一个有效字符的位置
getWord()：使用正则表达式进行匹配，获取一个单词
当文件读取结束时将抛出IndexOutOfBoundsException异常，捕获这个异常，在catch块中进行后续的词法分析。

int begin = 0, end = 0;   //分别表示单词的第一个和最后一个字符位置
ArrayList<String> arrayList = new ArrayList<>();
try {
    do {
        begin = getFirstChar(fileStr, begin);
        String word = getWord(fileStr, begin);
        end = begin + word.length() - 1; // 单词最后一个字符在原字符串的位置
        if (word.equals("")) {
            break;
        }
        if (!(word.equals(blank) || word.equals(newLine))) {
            arrayList.add(word);
        }
        begin = end + 1;
    } while (true);
} catch (IndexOutOfBoundsException e) { // 文件读取结束
    // 词法分析
}

/**
 * 去除字符串开头的空格和换行符，找到第一个有效字符的位置
 * @param str 目标字符串
 * @param begin 开始位置
 * @return 第一个有效字符在字符串中的位置
 */
public static int getFirstChar(String str, int begin) {
    while (true) {
        if (str.charAt(begin) != ' ' && str.charAt(begin) != '\n' ) {
            return begin;
        }
        begin++;
    }
}

/**
 * 获取一个单词
 * @param str 目标字符串
 * @param begin 查找的开始位置
 * @return 单词
 */
public static String getWord(String str, int begin) {
    String regEx = "\\s+|\\n|\\+|-|\\*|/|=|\\(|\\)|\\[|]|\\{|}|,|:|;"; // 正则
    Pattern p = Pattern.compile(regEx);
    Matcher m = p.matcher(str);
    int end;
    if (m.find(begin)) {
        end =  m.start();
    } else {
        return "";
    }
    if (begin == end) {
        return String.valueOf(str.charAt(begin));
    }
    return str.substring(begin, end);
}

3.词法分析

将分割好的单词和最开始定义的一系列关键字进行比对匹配，每个单词的分析结果由Pair类进行保存，Pair类仅代表键值对的形式，键为单词种别码，值为该单词。
Pair类定义如下：

public static class Pair {
    Integer key;
    String value;
    Pair(Integer key, String value) {
        this.key = key;
        this.value = value;
    }
}

词法分析函数：根据单词的长度进行分析比对

// 词法分析函数
public static ArrayList<Pair> analyse(ArrayList<String> arrayList) {
    ArrayList<Pair> result = new ArrayList<>();
    for (int i = 0; i < arrayList.size(); i++) {
        if (arrayList.get(i).length() == 1) {
            if (arrayList.get(i).equals("=")) { // 运算符"="
                if (arrayList.get(i+1).equals("=")) { // 若后面跟的是"="，则是运算符"=="
                    result.add(new Pair(EQ, arrayList.get(i) + arrayList.get(++i)));
                } else { // 否则是运算符"="
                    result.add(new Pair(AS, arrayList.get(i)));
                }
            } else if (arrayList.get(i).equals(">")) { // 运算符">"
                if (arrayList.get(i+1).equals("=")) { // 若后面跟的是"="，则是运算符">="
                    result.add(new Pair(GE, arrayList.get(i) + arrayList.get(++i)));
                } else { // 否则是运算符">"
                    result.add(new Pair(GT, arrayList.get(i)));
                }
            } else if (arrayList.get(i).equals("<")) { // 运算符"<"
                if (arrayList.get(i+1).equals("=")) { // 若后面跟的是"="，则是运算符"<="
                    result.add(new Pair(LE, arrayList.get(i) + arrayList.get(++i)));
                } else { // 否则是运算符"<"
                    result.add(new Pair(LT, arrayList.get(i)));
                }
            } else if (arrayList.get(i).equals("+")) { // 运算符"+"
                if ((arrayList.get(i-1).equals("=") || arrayList.get(i-1).equals("("))
                    && isNum(arrayList.get(i+1))) { // 判断是否是有符号常量（正数）
                    result.add(new Pair(NUM, arrayList.get(i) + arrayList.get(++i)));
                } else { // 否则是运算符"+"
                    result.add(new Pair(ADD, arrayList.get(i)));
                }
            } else if (arrayList.get(i).equals("-")) { // 运算符"-"
                if ((arrayList.get(i-1).equals("=") || arrayList.get(i-1).equals("("))
                        && isNum(arrayList.get(i+1))) { // 判断是否是有符号常量（负数）
                    result.add(new Pair(NUM, arrayList.get(i) + arrayList.get(++i)));
                } else { // 否则是运算符"-"
                    result.add(new Pair(SUB, arrayList.get(i)));
                }
            } else if (arrayList.get(i).equals("*")) { // 运算符"*"
                result.add(new Pair(MUL, arrayList.get(i)));
            } else if (arrayList.get(i).equals("/")) { // 运算符"/"
                result.add(new Pair(DIV, arrayList.get(i)));
            } else if (arrayList.get(i).equals("(")) { // 界限符"("
                result.add(new Pair(LP, arrayList.get(i)));
            } else if (arrayList.get(i).equals(")")) { // 界限符")"
                result.add(new Pair(RP, arrayList.get(i)));
            } else if (arrayList.get(i).equals("[")) { // 界限符"["
                result.add(new Pair(LBT, arrayList.get(i)));
            } else if (arrayList.get(i).equals("]")) { // 界限符"]"
                result.add(new Pair(RBT, arrayList.get(i)));
            } else if (arrayList.get(i).equals("{")) { // 界限符"{"
                result.add(new Pair(LBS, arrayList.get(i)));
            } else if (arrayList.get(i).equals("}")) { // 界限符"}"
                result.add(new Pair(RBS, arrayList.get(i)));
            } else if (arrayList.get(i).equals(",")) { // 界限符","
                result.add(new Pair(COM, arrayList.get(i)));
            } else if (arrayList.get(i).equals(":")) { // 界限符":"
                result.add(new Pair(COL, arrayList.get(i)));
            } else if (arrayList.get(i).equals(";")) { // 界限符";"
                result.add(new Pair(SEM, arrayList.get(i)));
            } else if (arrayList.get(i).charAt(0) >= '0' && arrayList.get(i).charAt(0) <= '9') { // 判断是否是一位数字常量
                result.add(new Pair(NUM, arrayList.get(i)));
            } else if (isLetter(arrayList.get(i).charAt(0))) { // 判断是否是一位字母标识符
                result.add(new Pair(ID, arrayList.get(i)));
            } else { // 否则是无法识别的字符
                result.add(new Pair(ERROR, arrayList.get(i)));
                errorNum++;
            }
        } else if ((arrayList.get(i).charAt(0) >= '0' && arrayList.get(i).charAt(0) <= '9')
                    || arrayList.get(i).charAt(0) == '.') { // 判断是否是正确的常量
            if (!isNum(arrayList.get(i))) { // 不是常量，则是无法识别的字符
                result.add(new Pair(ERROR, arrayList.get(i)));
                errorNum++;
            } else if ((arrayList.get(i+1).charAt(0) == '+' || arrayList.get(i+1).charAt(0) == '-')
                    && isNum(arrayList.get(i+2))) { // 判断是否是有符号的常量
                result.add(new Pair(NUM, arrayList.get(i) + arrayList.get(++i) + arrayList.get(++i)));
            } else { // 否则是无符号的常量
                result.add(new Pair(NUM, arrayList.get(i)));
            }
        } else if (isKeyID(arrayList.get(i)) != 0) { // 判断是否为关键字
            result.add(new Pair(isKeyID(arrayList.get(i)), arrayList.get(i)));
        } else if (isLetter(arrayList.get(i).charAt(0)) || arrayList.get(i).charAt(0) == '_') { // 判断是否为标识符（以字母或者下划线开头）
            result.add(new Pair(ID, arrayList.get(i)));
        } else { // 否则是无法识别的单词
            result.add(new Pair(ERROR, arrayList.get(i)));
            errorNum++;
        }
    }
    return result;
}

catch块内的处理部分：调用词法分析函数，输出结果到文件

ArrayList<Pair> result = analyse(arrayList);
StringBuffer outputBuffer = new StringBuffer("词法分析结果如下：\n<单词种别码，单词>          //所属类别\n");
System.out.println("词法分析结果如下：\n<单词种别码，单词>          //所属类别");
for (Pair pair : result) {
    outputBuffer.append("<   " + pair.key + "  ,  " + pair.value + "  >          ");
    System.out.print("<   " + pair.key + "  ,  " + pair.value + "  >          ");
    if (pair.key > 0 && pair.key < 20) {
        outputBuffer.append("//关键字\n");
        System.out.println("//关键字");
    } else if (pair.key == 20) {
        outputBuffer.append("//标识符\n");
        System.out.println("//标识符");
    } else if (pair.key == 30) {
        outputBuffer.append("//常量\n");
        System.out.println("//常量");
    } else if (pair.key > 30 && pair.key <= 40) {
        outputBuffer.append("//运算符\n");
        System.out.println("//运算符");
    } else if (pair.key > 40 && pair.key < 50) {
        outputBuffer.append("//界限符\n");
        System.out.println("//界限符");
    } else if (pair.key == -1) {
        outputBuffer.append("//无法识别的符号\n");
        System.out.println("//无法识别的符号");
    }
}
outputBuffer.append("词法分析结束！共" + errorNum + "个无法识别的符号\n");
System.out.println("词法分析结束！共" + errorNum + "个无法识别的符号");

System.out.print("请输入输出文件的路径：");
String outputPath = scanner.next();
// 将StringBuffer的内容输出到文件
File outputFile = new File(outputPath);
if (outputFile.exists()) {
    outputFile.delete();
}
PrintWriter writer = null;
try {
    outputFile.createNewFile();
    writer = new PrintWriter(new FileOutputStream(outputFile));
    writer.write(outputBuffer.toString());
} catch (IOException e1) {
    e1.printStackTrace();
} finally {
    if (writer != null) {
        writer.close();
    }
}

九、执行结果

在这里插入图片描述

至此，词法分析器编写完成。如有问题，欢迎交流指正。