PL/0 Compiler Principles Lab (NUAA), Part 2: Lexical Analysis

Principles

The automata theory behind lexical analysis is not covered in detail here.

If the theory is unclear, see:

https://www.cnblogs.com/X-Jun/p/11029594.html

The compiler principles textbook by Chen Huowang (陈火旺)

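Lexical analysis

Lexical analysis is the first pass over the PL/0 source code. It splits the source text into individual tokens, so the later syntax analysis only has to fetch tokens one at a time.

Every token must be a keyword, an identifier, an operator, a delimiter, or a number. Anything else is an illegal character and must be reported as an error.

The source text is read line by line. Each line is first split on whitespace, and each resulting word is then split further into tokens. Every token ends up in the list token_list.

Each token records three properties: value, line_num (the line it appears on), and attribute. The value is the token's text, used in later processing; the line number is used to point at the location of an error; the attribute says whether the token is a keyword, identifier, operator, or number, and is used by the syntax analysis. In the code below, a number's value is stored as an integer, and an operator or delimiter uses its own text as its attribute.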
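The first stage (splitting on whitespace) can be sketched as follows. This is a minimal illustration, assuming the source is held in a string rather than read from a file; split_words and the sample program are hypothetical names introduced only for this example.

```python
# Sample PL/0-like source, used only for illustration.
source = """var x, y;
begin
  x:=y+1;
end"""


def split_words(text):
    """Yield (line_num, word) pairs; lines are numbered from 1."""
    for line_num, line in enumerate(text.splitlines(), start=1):
        # str.split() with no argument splits on any run of whitespace
        # and drops empty strings, so indentation is ignored.
        for word in line.split():
            yield line_num, word


words = list(split_words(source))
for line_num, word in words:
    print(line_num, word)
```

Words such as `x,` and `x:=y+1;` still contain several tokens each, which is why a second, character-by-character pass is needed.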

Data structures

token_list: holds all tokens, accessed through the index token_index. Both are global variables.

Each token has three properties: value, line_num, and attribute.


token['value'] = value

token['line_num'] = line_num

token['attribute'] = attribute
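As a small sketch of what these dictionaries look like in practice, the tokens for the statement `x:=10;` on line 3 could be built as below. make_token is a hypothetical helper added for this example; the original code fills the dictionary inline.

```python
def make_token(value, line_num, attribute):
    """Build one token as a plain dict with the three properties."""
    return {'value': value, 'line_num': line_num, 'attribute': attribute}


# Tokens for the statement "x:=10;" on line 3.
tokens = [
    make_token('x', 3, 'identifier'),
    make_token(':=', 3, ':='),     # operators use their own text as attribute
    make_token(10, 3, 'number'),   # numbers are stored as int
    make_token(';', 3, ';'),       # delimiters also use their own text
]

for token in tokens:
    print(token['line_num'], token['attribute'], token['value'])
```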

Lexical analysis should not be hard; just be careful and cover every case.

Implementation

'''Keyword table, operator/delimiter table, and the generated token list used by the lexer'''
import sys

key_word = ['program', 'const', 'var', 'procedure', 'begin', 'end', 'if', 'then', 'while', 'do', 'call', 'read',
            'write']  # keywords of the language
symbol = ['+', '-', '*', '/', '(', ')', '=', ',', ';']  # arithmetic operators and delimiters
token_list = []
token_index = 0


# A word obtained by splitting on whitespace may still contain operators, delimiters, etc., so it is split further here
def deal_word(word, line_num):
    length = len(word)
    index = 0
    while index < length:
        token = dict()
        value = ''  # value of the token
        attribute = ''  # attribute of the token
        if word[index].isalpha():  # starts with a letter: identifier or keyword
            while index < length and (word[index].isalpha() or word[index].isdigit()):
                value += word[index]
                index += 1
            if value in key_word:  # check whether it is a keyword
                attribute = 'keyword'
            else:
                attribute = 'identifier'
        elif word[index].isdigit():  # starts with a digit: number
            while index < length and word[index].isdigit():
                value += word[index]
                index += 1
            value = int(value)
            attribute = 'number'
        elif word[index] == ':':  # starts with ':': assignment operator :=
            value += word[index]
            index += 1
            if index < length and word[index] == '=':
                value += word[index]
                index += 1
            attribute = value
        elif word[index] == '<':  # starts with '<': may be < <= <>
            value += word[index]
            index += 1
            if index < length and (word[index] == '=' or word[index] == '>'):
                value += word[index]
                index += 1
            attribute = value
        elif word[index] == '>':  # starts with '>': > or >=
            value += word[index]
            index += 1
            if index < length and word[index] == '=':
                value += word[index]
                index += 1
            attribute = value
        elif word[index] in symbol:  # single-character operator or delimiter
            value = word[index]
            attribute = word[index]
            index += 1
        else:
            print('Line ' + str(line_num) + ': illegal character ' + word[index])
            sys.exit(1)  # non-zero exit status signals the error
        # append the token to the token list
        token['value'] = value
        token['line_num'] = line_num
        token['attribute'] = attribute
        token_list.append(token)


# Return one token per call
def get_token():
    global token_index
    if token_index >= len(token_list):
        sys.exit(0)
    token = token_list[token_index]
    token_index += 1
    return token


if __name__ == '__main__':
    # Step 1, lexical analysis: read the source file and split it into tokens, recording each token's value, attribute, and line number
    with open('code.txt', 'r') as file:
        line_num = 1
        for line in file.readlines():
            for word in line.strip().split():
                deal_word(word, line_num)
            line_num += 1
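For comparison, the same token rules can be written as one regular expression. The sketch below is an alternative technique, not the method used above: it relies on Python's standard re module, and the names KEY_WORDS, TOKEN_RE, and tokenize_line are assumptions introduced for this example. It also differs in two details: it accepts only ASCII letters in identifiers, and it treats a lone ':' as an illegal character, whereas the code above emits it as a token.

```python
import re

KEY_WORDS = {'program', 'const', 'var', 'procedure', 'begin', 'end', 'if',
             'then', 'while', 'do', 'call', 'read', 'write'}

# One named group per token class; two-character operators come first
# inside the "op" group so that ":=" is not read as ":" followed by "=".
TOKEN_RE = re.compile(r'''
    (?P<word>[A-Za-z][A-Za-z0-9]*)
  | (?P<number>\d+)
  | (?P<op>:=|<=|<>|>=|[<>+\-*/()=,;])
  | (?P<space>\s+)
  | (?P<bad>.)
''', re.VERBOSE)


def tokenize_line(line, line_num):
    """Return the tokens of one source line as a list of dicts."""
    tokens = []
    for match in TOKEN_RE.finditer(line):
        kind, text = match.lastgroup, match.group()
        if kind == 'space':
            continue
        if kind == 'bad':
            raise ValueError(f'line {line_num}: illegal character {text!r}')
        if kind == 'word':
            value = text
            attribute = 'keyword' if text in KEY_WORDS else 'identifier'
        elif kind == 'number':
            value, attribute = int(text), 'number'
        else:  # operator or delimiter: attribute is the text itself
            value = attribute = text
        tokens.append({'value': value, 'line_num': line_num,
                       'attribute': attribute})
    return tokens


for token in tokenize_line('if a<>b then x:=y+10;', 3):
    print(token)
```

Raising ValueError instead of calling sys.exit leaves the decision about how to report the error to the caller, which makes the function easier to test.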