A Lexical Analyzer for Compiler Principles, Implemented in C++

1. Understanding and explanation of the topic

The Principles of Compilation course is one of the core courses of a computer science major: it studies what software is, why it works, and how it works. Improvements to the compilation system directly affect the execution efficiency and behavior of the application programs built on top of it. The purpose of a compiler is to translate a source language into a target language. The process is divided into five stages: lexical analysis, syntax analysis, semantic analysis and intermediate code generation, intermediate code optimization, and target code generation. This experiment addresses the first stage of the compiler: lexical analysis.

The lab requires implementing a lexical analyzer for a programming language. I take the grammar of the C/C++ language as an example and use C/C++ to implement an analyzer that performs lexical analysis on C/C++ source code.

The goal of the lexical analyzer is to process a C-language source program: filter out useless symbols, judge the legality of the words in the source program, decompose the program into correct words, and store them in a file as two-tuples (word, category code).

The lexical analyzer first preprocesses the source program (removing comments and useless carriage returns and line feeds, locating included files, and so on), and then decomposes the entire source program into words. These words fall into exactly five categories: identifiers, reserved words, constants, operators, and delimiters.

The object of lexical analysis is the single character; the purpose is to assemble characters into valid words (strings). Syntax analysis then takes the result of lexical analysis as input, checks whether it conforms to the grammar rules, performs semantic analysis under the guidance of the grammar, generates quadruples (intermediate code), optionally optimizes them, and finally generates the target code. Lexical analysis is therefore the foundation of the entire compiler: if it is done poorly, every subsequent stage is seriously affected.

2. Program function and framework

1. Recognize the class of words

The words to be recognized fall into five categories: identifiers, reserved words, constants, operators, and delimiters. To handle as many C-language statements as possible, I define the five categories as follows:

First category: identifiers — letter (letter | digit)*, an infinite set
Second category: constants — (digit)+, an infinite set
Third category: reserved words (32 in total):
auto break case char const continue
default do double else enum extern
float for goto if int long
register return short signed sizeof static
struct switch typedef union unsigned void
volatile while

Fourth category: delimiters — ';', '(', ')', '{', '}', '[', ']', '"', ''', ',', etc.
Fifth category: operators — <, <=, >, >=, =, +, -, *, /, ^, etc.
All of the countable symbols are then encoded:
<$,0>
<auto,1> ... <while,32>
<+,33> <-,34> <*,35> </,36> <<,37> <<=,38> <>,39> <>=,40> <=,41> <==,42>
<!=,43> <;,44> <(,45> <),46> <^,47> <,,48> <",49> <',50> <#,51> <&,52>
<&&,53> <|,54> <||,55> <%,56> <~,57> <<<,58> (left shift) <>>,59> (right shift)
<[,60> <],61> <{,62> <},63> <\,64> <.,65> <?,66> <:,67> <!,68>
<constant,99> <identifier,100>

In each of the above two-tuples, the left element is the word itself and the right element is its category code. Constants and identifiers are a bit special because they form infinite sets: a constant is represented by its own value with category code 99, and an identifier is represented by a pointer into the identifier table (it can also be displayed as itself, which is easier to observe) with category code 100. Under this convention, once the category code syn=63 is seen, the word '}' is uniquely determined.
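The linear mapping from operator/delimiter to category code can be checked with a small lookup sketch. This is an illustrative fragment, not the report's own code: `kOps` and `OpCode` are assumed names that mirror the OandD table and the `syn = 33 + i` convention used later.

```cpp
#include <cstring>

// Hypothetical sketch of the encoding convention above: the 36
// operators/delimiters occupy category codes 33..68 in table order.
static const char* kOps[36] = {
    "+",  "-",  "*",  "/",  "<",  "<=", ">",  ">=", "=",  "==", "!=", ";",
    "(",  ")",  "^",  ",",  "\"", "'",  "#",  "&",  "&&", "|",  "||", "%",
    "~",  "<<", ">>", "[",  "]",  "{",  "}",  "\\", ".",  "?",  ":",  "!"};

// Returns the category code (syn) for an operator/delimiter, or -1 if unknown.
int OpCode(const char* s) {
    for (int i = 0; i < 36; i++)
        if (std::strcmp(kOps[i], s) == 0)
            return 33 + i;  // the linear mapping described above
    return -1;
}
```

For example, `OpCode("}")` yields 63, matching the statement that syn=63 uniquely determines '}'.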

2. Program framework design

After determining the types of words, the program needs to implement the following functions:
reserved word recognition: int SearchRWord(char RW[][20], char s[])
letter discrimination: bool IsLetter(char letter)
digit discrimination: bool IsDigit(char digit)
preprocessing: void PreProcessing(char r[], int pProject)
scanner (the core of the algorithm): void Scanner(int &syn, char resourceProject[], char token[], int &pProject)

Among these functions, the core is the scanner. Its implementation is based on DFA theory: it scans the read-in, preprocessed character stream character by character, recognizes the words one by one with a finite-automaton algorithm, and produces the corresponding two-tuple (word, category code) for each, thereby identifying the type of every word.
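The DFA idea behind the scanner can be illustrated with a tiny two-state automaton for the identifier pattern letter(letter|digit)*. This is a minimal sketch for exposition, not the report's Scanner; `IsIdentifier` is an assumed name, and '_' counts as a letter as in the report's IsLetter.

```cpp
#include <cctype>
#include <string>

// Two-state DFA: state 0 = start, state 1 = inside an identifier (accepting).
// Accepts letter(letter|digit)*, with '_' treated as a letter.
bool IsIdentifier(const std::string& s) {
    int state = 0;
    for (char c : s) {
        bool letter = std::isalpha((unsigned char)c) != 0 || c == '_';
        bool digit  = std::isdigit((unsigned char)c) != 0;
        if (state == 0) {
            if (letter) state = 1;      // first character must be a letter
            else return false;
        } else {
            if (!letter && !digit)      // later characters: letter or digit only
                return false;
        }
    }
    return state == 1;  // accept only if at least one character was consumed
}
```

The report's Scanner realizes the same transitions implicitly with its while loops rather than an explicit state variable.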

The basic processing flow of the program is as follows:
(1) The lexical analysis program opens the source file and reads its content until the '$' end-of-file marker is encountered, then stops reading.
(2) Preprocess the read-in text: scan it from beginning to end, removing the contents of // and /* */ comments and useless symbols that affect program execution, such as line feeds, carriage returns, and tabs. Be careful not to remove spaces at this stage, because spaces matter in lexical analysis: in the statement int i=3;, removing spaces yields "inti=3", which loses the original meaning of the program.
(3) Scan the purified source from beginning to end. The scanner first asks whether the current character is a space; if so, it keeps scanning until a non-space character is found. If that character is a letter, it recognizes an identifier or reserved word; if it is a digit, it recognizes a number. Otherwise it checks the remaining possibilities for the character in turn; if all possibilities fail, the character is treated as an error symbol, which is output before the program ends. Each successfully recognized word is stored in token[], its category code is determined, and recognition of the next word begins. This scanner realizes the functions of a deterministic finite automaton, such as recognizing identifiers and numbers. For simplicity, the numbers here are integers only.
(4) The main driver judges the category code syn of each recognized word and responds according to its type: inserting identifiers into the identifier table, outputting the category code and mnemonic for reserved words, and so on, until syn=0 is encountered and the program ends.

3. Design Description

1. Variable storage

In the second part of the report, I introduced five types of words: identifiers, reserved words, constants, operators, and delimiters.
In the program implementation, I set the following data structures to store the above five types of words respectively:

/******* reserved word table *******/
static char RWord[32][20] = {
 "auto",     "break",    "case",     "char",
 "const",    "continue", "default",  "do",
 "double",   "else",     "enum",     "extern",
 "float",    "for",      "goto",     "if",
 "int",      "long",     "register", "return",
 "short",    "signed",   "sizeof",   "static",
 "struct",   "switch",   "typedef",  "union",
 "unsigned", "void",     "volatile", "while" };
/******* reserved word table *******/
/******* delimiter and operator table *******/
static char OandD[36][10] = {
 "+",  "-",  "*",  "/",  "<",  "<=",
 ">",  ">=", "=",  "==", "!=", ";",
 "(",  ")",  "^",  ",",  "\"", "\'",
 "#",  "&",  "&&", "|",  "||", "%",
 "~",  "<<", ">>", "[",  "]",  "{",
 "}",  "\\", ".",  "\?", ":",  "!" };
/******* delimiter and operator table *******/
/******* identifier table *******/
static char IDtable[1000][50] = { "" };  // initially empty
/******* identifier table *******/

2. Implementation of basic functions

/******* recognize reserved words *******/
int SearchRWord(char RW[][20], char s[]) {
 for (int i = 0; i < 32; i++) {
  if (strcmp(RW[i], s) == 0) {
   // compare the recognized word against each
   // entry of the reserved word table;
   // on a match, return its category code
   return i + 1;
  }
 }
 // no match: return -1, i.e. the word may be
 // an identifier or a misspelling
 return -1;
}
/******* recognize reserved words *******/
/******* letter discrimination *******/
bool IsLetter(char letter){
 // in C/C++ an underscore may also be part of an identifier,
 // at the beginning or elsewhere
 if (letter >= 'a' && letter <= 'z' || letter >= 'A' && letter <= 'Z' || letter == '_')
  return true;
 else
  return false;
}
/******* letter discrimination *******/
/******* digit discrimination *******/
bool IsDigit(char digit){
 if (digit >= '0' && digit <= '9')
  return true;
 else
  return false;
}
/******* digit discrimination *******/
/******* preprocessing: remove invalid characters and comments *******/
void PreProcessing(char r[], int pProject){
 char tempString[10000];
 int count = 0;
 for (int i = 0; i <= pProject; i++){
  if (r[i] == '/' && r[i + 1] == '/'){
   // single-line comment "//": discard everything up to the newline
   while (r[i] != '\n'){
    i++;  // scan forward
   }
  }
  if (r[i] == '/' && r[i + 1] == '*'){
   // multi-line comment "/*......*/": discard its contents
   i += 2;
   while (r[i] != '*' || r[i + 1] != '/'){
    i++;  // keep scanning
    if (r[i] == '$'){
     printf("Comment error: no closing */ found, program ends!\n");
     exit(0);
    }
   }
   i += 2;  // step over "*/"
  }
  if (r[i] != '\n' && r[i] != '\t' && r[i] != '\v' && r[i] != '\r'){
   // filter useless characters; load everything else
   tempString[count++] = r[i];
  }
 }
 tempString[count] = '\0';
 strcpy(r, tempString);  // return the purified source program in place
}
/******* preprocessing: remove invalid characters and comments *******/

The preprocessing pass filters the input character stream, deleting comments, line breaks, invalid characters, erroneous characters, and so on, to obtain a program character stream containing only the program text and spaces. Note: the space is the key symbol for separating the five types of words, so it must not be purified away.
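The same pass can be sketched as a self-contained function over std::string, which is easier to test than the in-place char-array version. This is an illustrative variant under the same rules (strip // and /* */ plus \n \t \v \r, keep spaces); `StripForLexing` is an assumed name, not the report's function.

```cpp
#include <string>

// Sketch of the preprocessing step: remove comments and whitespace
// control characters, but keep spaces (they separate words).
std::string StripForLexing(const std::string& src) {
    std::string out;
    for (size_t i = 0; i < src.size(); i++) {
        if (src[i] == '/' && i + 1 < src.size() && src[i + 1] == '/') {
            while (i < src.size() && src[i] != '\n') i++;        // skip to end of line
        } else if (src[i] == '/' && i + 1 < src.size() && src[i + 1] == '*') {
            i += 2;
            while (i + 1 < src.size() && !(src[i] == '*' && src[i + 1] == '/')) i++;
            i++;                                                  // land on '/'; loop ++ steps past it
        } else if (src[i] != '\n' && src[i] != '\t' && src[i] != '\v' && src[i] != '\r') {
            out += src[i];                                        // keep everything else, incl. spaces
        }
    }
    return out;
}
```

For example, `StripForLexing("int i=3; //note\nint j;")` yields "int i=3; int j;" with the space after the semicolon preserved.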

3. Implementation of the scanner

As the core of lexical analysis, the scanner classifies each word in the purified character stream and generates the corresponding two-tuple for writing to a file. Its design is based on the state-transition diagrams of DFA theory.

/******* analysis module: the core of the lexical analyzer *******/
// this module rests mainly on the design of DFA state-transition diagrams
void Scanner(int &syn, char resourceProject[], char token[], int &pProject){
 int i, count = 0;  // count indexes token[] while collecting useful characters
 char ch;           // the character under examination
 ch = resourceProject[pProject];
 while (ch == ' '){
  // filter spaces so the program does not stop on an unrecognized space
  pProject++;
  ch = resourceProject[pProject];
 }
 for (i = 0; i < 20; i++){
  // clear token[] before each collection
  token[i] = '\0';
 }
 if (IsLetter(resourceProject[pProject])){
  // starts with a letter
  token[count++] = resourceProject[pProject];  // collect
  pProject++;                                  // advance
  while (IsLetter(resourceProject[pProject]) || IsDigit(resourceProject[pProject])){
   // followed by letters or digits
   token[count++] = resourceProject[pProject];  // collect
   pProject++;                                  // advance
  }  // the extra character read is where the next scan will start
  token[count] = '\0';
  syn = SearchRWord(RWord, token);  // look up the category code
  if (syn == -1){
   // not a reserved word, hence an identifier
   syn = 100;  // identifier category code
  }
  return;
 }
 else if (IsDigit(resourceProject[pProject])){
  // starts with a digit
  while (IsDigit(resourceProject[pProject])){
   // followed by digits
   token[count++] = resourceProject[pProject];  // collect
   pProject++;
  }  // the extra character read is where the next scan will start
  token[count] = '\0';
  syn = 99;  // constant category code
 }
 else if (ch == '+' || ch == '-' || ch == '*' || ch == '/' || ch == ';' || ch == '(' || ch == ')' || ch == '^'
  || ch == ',' || ch == '\"' || ch == '\'' || ch == '~' || ch == '#' || ch == '%' || ch == '['
  || ch == ']' || ch == '{' || ch == '}' || ch == '\\' || ch == '.' || ch == '\?' || ch == ':'){
  // single-character operator or delimiter: look it up in the table
  token[0] = resourceProject[pProject];
  token[1] = '\0';  // form a one-character string
  for (i = 0; i < 36; i++)
  {
   // search the operator/delimiter table
   if (strcmp(token, OandD[i]) == 0){
    syn = 33 + i;  // category code via the linear mapping
    break;         // stop as soon as it is found
   }
  }
  pProject++;  // advance the pointer for the next scan
  return;
 }
 else if (resourceProject[pProject] == '<'){
  // <, <=, <<  (one character of lookahead)
  pProject++;
  if (resourceProject[pProject] == '='){
   syn = 38;
  }
  else if (resourceProject[pProject] == '<'){
   // left shift (the original version decremented pProject here,
   // which made the second '<' be scanned again as a separate '<')
   syn = 58;
  }
  else{
   pProject--;
   syn = 37;
  }
  pProject++;  // advance the pointer
  return;
 }
 else if (resourceProject[pProject] == '>'){
  // >, >=, >>
  pProject++;
  if (resourceProject[pProject] == '=')
   syn = 40;
  else if (resourceProject[pProject] == '>')
   syn = 59;
  else{
   pProject--;
   syn = 39;
  }
  pProject++;
  return;
 }
 else if (resourceProject[pProject] == '='){
  // =, ==
  pProject++;
  if (resourceProject[pProject] == '=')
   syn = 42;
  else{
   pProject--;
   syn = 41;
  }
  pProject++;
  return;
 }
 else if (resourceProject[pProject] == '!'){
  // !, !=
  pProject++;
  if (resourceProject[pProject] == '=')
   syn = 43;
  else{
   syn = 68;
   pProject--;
  }
  pProject++;
  return;
 }
 else if (resourceProject[pProject] == '&'){
  // &, &&
  pProject++;
  if (resourceProject[pProject] == '&')
   syn = 53;
  else{
   pProject--;
   syn = 52;
  }
  pProject++;
  return;
 }
 else if (resourceProject[pProject] == '|'){
  // |, ||
  pProject++;
  if (resourceProject[pProject] == '|')
   syn = 55;
  else{
   pProject--;
   syn = 54;
  }
  pProject++;
  return;
 }
 else if (resourceProject[pProject] == '$')  // end marker
  syn = 0;  // category code 0
 else{
  // unrecognizable by any of the rules above: report the error
  printf("error: there is no exist %c \n", ch);
  exit(0);
 }
}
/******* analysis module: the core of the lexical analyzer *******/

Its processing flow is exactly step (3) of the flow described earlier: skip spaces, branch on whether the first significant character is a letter, a digit, or one of the remaining symbols, collect the word into token[], determine its category code, and report an error for any character that matches no rule. The scanner thereby realizes the functions of a deterministic finite automaton for identifiers, numbers (integers only, for simplicity), operators, and delimiters.

4. Operation of the main function

The main function initializes the necessary variables and handles file reading and writing. The resourceProject variable stores the character stream read from the txt file; for convenience, the maximum size of the read-in stream is set to 10,000 chars. The token variable stores each recognized word so that its category code can be looked up, and syn stores the category code of the current word. I use '$' as the end-of-scan marker of the read-in txt text; its syn value is 0. When '$' is read, the scanned character stream has reached its end, i.e. when syn==0 the program exits Scanner and the lexical analysis ends. The pProject variable is the source program pointer, always pointing at the character currently being recognized. The flow of the main function is: open the preset txt file and read all of its characters into resourceProject; call the preprocessing routine to obtain the purified character stream and store it back into resourceProject; then call the scanner to recognize each word, starting from syn=-1 and pProject=0; finally, store the results in a txt file and output them.

int main(){
 // open a file and read the source program from it
 char resourceProject[10000];
 char token[20] = { 0 };
 int syn = -1, i;   // initialization
 int pProject = 0;  // source program pointer
 FILE *fp, *fp1;
 if ((fp = fopen("F:\\大三下课程\\编译原理(必修)\\词法分析器\\zyr_rc.txt", "r")) == NULL){
  // open the source program
  cout << "can't open this file";
  exit(0);
 }
 resourceProject[pProject] = fgetc(fp);
 while (resourceProject[pProject] != '$'){
  // read the source program into the resourceProject[] array
  pProject++;
  resourceProject[pProject] = fgetc(fp);
 }
 resourceProject[++pProject] = '\0';
 fclose(fp);
 cout << endl << "Source program:" << endl;
 cout << resourceProject << endl;
 // filter the source program
 PreProcessing(resourceProject, pProject);
 cout << endl << "Program after filtering:" << endl;
 cout << resourceProject << endl;
 pProject = 0;  // scan again from the beginning
 if ((fp1 = fopen("F:\\大三下课程\\编译原理(必修)\\词法分析器\\zyr_compile.txt", "w+")) == NULL){
  // open the output file
  cout << "can't open this file";
  exit(0);
 }
 while (syn != 0){
  // start scanning
  Scanner(syn, resourceProject, token, pProject);
  if (syn == 100){
   // identifier
   for (i = 0; i < 1000; i++){
    // insert into the identifier table
    if (strcmp(IDtable[i], token) == 0)  // already in the table
     break;
    if (strcmp(IDtable[i], "") == 0){
     // found a free slot
     strcpy(IDtable[i], token);
     break;
    }
   }
   printf("(identifier  ,%s)\n", token);
   fprintf(fp1, "(identifier   ,%s)\n", token);
  }
  else if (syn >= 1 && syn <= 32){
   // reserved word
   printf("(%s   ,  --)\n", RWord[syn - 1]);
   fprintf(fp1, "(%s   ,  --)\n", RWord[syn - 1]);
  }
  else if (syn == 99){
   // constant
   printf("(constant   ,   %s)\n", token);
   fprintf(fp1, "(constant   ,   %s)\n", token);
  }
  else if (syn >= 33 && syn <= 68){
   // operator or delimiter
   printf("(%s   ,   --)\n", OandD[syn - 33]);
   fprintf(fp1, "(%s   ,   --)\n", OandD[syn - 33]);
  }
 }
 for (i = 0; i < 100; i++){
  // print the identifier table
  if (strcmp(IDtable[i], "") == 0)
   break;
  printf("identifier %d:  %s\n", i + 1, IDtable[i]);
  fprintf(fp1, "identifier %d:  %s\n", i + 1, IDtable[i]);
 }
 fclose(fp1);
 return 0;
}

4. Test data and running results

Test content:
file path and its file name: F:\\大三下课程\\编译原理(必修)\\词法分析器\\zyr_rc.txt
file content:

int main(){
 // open a file and read the source program from it
 char resourceProject[10000];
 char token[20] = { 0 };
 int syn = -1, i;   // initialization
 int pProject = 0;  // source program pointer
 FILE *fp, *fp1;
 if ((fp = fopen("F:\\大三下课程\\编译原理(必修)\\词法分析器\\zyr_rc.txt", "r")) == NULL){
  // open the source program
  cout << "can't open this file";
  exit(0);
 }
 resourceProject[pProject] = fgetc(fp);
 while (resourceProject[pProject] != '$'){
  // read the source program into the resourceProject[] array
  pProject++;
  resourceProject[pProject] = fgetc(fp);
 }
 resourceProject[++pProject] = '\0';
 fclose(fp);
 cout << endl << "Source program:" << endl;
 cout << resourceProject << endl;
 // filter the source program
 PreProcessing(resourceProject, pProject);
 cout << endl << "Program after filtering:" << endl;
 cout << resourceProject << endl;
 pProject = 0;  // scan again from the beginning
 if ((fp1 = fopen("F:\\大三下课程\\编译原理(必修)\\词法分析器\\zyr_compile.txt", "w+")) == NULL){
  // open the output file
  cout << "can't open this file";
  exit(0);
 }
 $
 

The output is stored at the file path and file name: F:\\大三下课程\\编译原理(必修)\\词法分析器\\zyr_compile.txt
The running result is:
(int , --)
(identifier,main)
(( , --)
() , --)
({ , --)
(char , --)
(identifier,resourceProject)
([ , --)
(constant, 10000)
(] , --)
(; , --)
(char , --)
(identifier, token)
([ , --)
(constant, 20)
(] , --)
(= , --)
({ , --)
(constant, 0)
(} , --)
(; , --)
(int , --)
(identifier, syn)
(= , --)
(- , --)
(constant, 1)
(, , --)
(identifier,i)
(; , --)
(int , --)
(identifier, pProject)
(= , --)
(constant, 0)
(; , --)
(identifier,FILE)
(* , --)
(identifier,fp)
(, , --)
(* , --)
(identifier,fp1)
(; , --)
(if , --)
(( , --)
(( , --)
(identifier,fp)
(= , --)
(identifier,fopen)
(( , --)
(" , --)
(identifier,D)
(: , --)
(\ , --)
(\ , --)
(identifier,zyr_rc)
(. , --)
(identifier,txt)
(" , --)
(, , --)
(" , --)
(identifier,r)
(" , --)
() , --)
() , --)
(== , --)
(identifier, NULL)
() , --)
({ , --)
(identifier,cout)
(<< , --)
(< , --)
(" , --)
(identifier,can)
(' , --)
(identifier,t)
(identifier,open)
(identifier,this)
(identifier,file)
(" , --)
(; , --)
(identifier, exit)
(( , --)
(constant, 0)
() , --)
(; , --)
(} , --)
(identifier, resourceProject)
([ , --)
(identifier, pProject)
(] , --)
(= , --)
(identifier,fgetc)
(( , --)
(identifier,fp)
() , --)
(; , --)
(while , --)
(( , --)
(identifier,resourceProject)
([ , --)
(identifier,pProject)
(] , --)
(!= , --)
(' , --)
identifier 1: main
identifier 2: resourceProject
identifier 3: token
identifier 4: syn
identifier 5: i
identifier 6: pProject
identifier 7: FILE
identifier 8: fp
identifier 9: fp1
identifier 10: fopen
identifier 11: D
identifier 12: zyr_rc
identifier 13: txt
identifier 14: r
identifier 15: NULL
identifier 16: cout
identifier 17: can
identifier 18: t
identifier 19: open
identifier 20: this
identifier 21: file
identifier 22: exit
identifier 23: fgetc

5. Summary

In this experiment, the algorithm itself is not difficult: with some proficiency in DFAs, the scanner module is easy to write. The troublesome part is that the more symbols the five categories contain, the longer the program becomes; but in order to recognize most programs, I chose a fairly large subset and put in the corresponding effort. In the end this lexical analyzer can handle most C/C++ programs, and writing it deepened my understanding of character handling. The readability of the program is reasonable. What my program does not implement is the separation of all compound operators such as "+="; the principle is the same, simply scan one character ahead in the logic for "+", so I did not add it. My deepest impression is that to learn compiler principles one must do the experiments and write the programs: only then does hands-on ability improve and the difficult points sink in. Based on this experiment I have a basic understanding of the compiler, as well as some ideas for the subsequent syntax analysis experiment, which prepares me well for further study and practice.
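The one-character lookahead mentioned for "+=" can be sketched as follows. This is an illustrative fragment, not part of the report's code: `ScanPlus` is an assumed helper, code 33 for "+" follows the report's table, and code 70 for "+=" is an invented value since the report's encoding does not assign one.

```cpp
// Sketch of the suggested extension: after seeing '+', peek one
// character ahead to separate "+=" from a plain "+".
// p is the source pointer, advanced past the consumed characters.
int ScanPlus(const char* src, int& p) {
    p++;                    // consume '+'
    if (src[p] == '=') {    // lookahead: compound assignment
        p++;                // consume '='
        return 70;          // hypothetical category code for "+="
    }
    return 33;              // plain "+", per the report's table
}
```

The same pattern already appears in the Scanner's handling of <=, >=, ==, !=, &&, and ||.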

Origin blog.csdn.net/weixin_42529594/article/details/105622166