Compilation Principles (Dragon Book): Lexical Analysis

Lexical analysis

The role of the lexical analyzer

The main task of the lexical analyzer is to read the input characters of the source program, group them into lexemes, and output a sequence of lexical units (tokens), one for each lexeme.

Sometimes, the lexical analyzer can be divided into two cascaded processing stages:

  • The scanning stage handles simple processing that does not require producing lexical units, such as deleting comments and compressing runs of consecutive whitespace characters into a single character.
  • The lexical analysis stage is the more complex part: it processes the output of the scanning stage and produces the lexical units.

Lexical units, patterns, and lexemes

  • A lexical unit (token) consists of a token name and an optional attribute value.
  • A pattern describes the forms that the lexemes of a lexical unit may take.
  • A lexeme is a sequence of characters in the source program that matches the pattern of some lexical unit and is recognized by the lexical analyzer as an instance of that lexical unit.

 In many programming languages, the following categories cover most or all lexical units:

  • One lexical unit for each keyword. The pattern for a keyword is the keyword itself.
  • Lexical units for the operators, either one per operator or one per class of operators.
  • One lexical unit representing all identifiers.
  • One or more lexical units representing constants, such as numbers and literal strings.
  • One lexical unit for each punctuation symbol, such as left and right parentheses, commas, and semicolons.

Input buffering

Buffer pairs

Sentinels

A sentinel is a special character that cannot occur in the source text (the eof character), placed at the end of each buffer. With it, the test for reaching the end of a buffer and the test for which character was read can be combined into one: each time the forward pointer advances, we only check the character it points to.

An eof that appears anywhere other than the end of a buffer marks the end of the input.
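The sentinel idea can be sketched in a few lines of Python. This is only an illustration under assumptions that are not in the original notes: the class name BufferPair, the method next_char, the tiny half size of 8, and the use of the NUL character as eof are all made up for the example.

```python
import io

EOF = "\0"  # assumed sentinel: a character that never occurs in the source text


class BufferPair:
    """Two buffer halves of n characters, each followed by one sentinel slot."""

    def __init__(self, stream, n=8):
        self.stream = stream
        self.n = n
        self.buf = [EOF] * (2 * (n + 1))
        self.forward = 0
        self._load(0)

    def _load(self, start):
        # Fill one half and put the sentinel right after the data actually read.
        chunk = self.stream.read(self.n)
        for i, ch in enumerate(chunk):
            self.buf[start + i] = ch
        self.buf[start + len(chunk)] = EOF

    def next_char(self):
        """Return the next character, or None once the input is exhausted."""
        ch = self.buf[self.forward]
        self.forward += 1
        if ch != EOF:
            return ch  # common case: one test per character
        pos = self.forward - 1
        if pos == self.n:  # sentinel at the end of the first half
            self._load(self.n + 1)
            self.forward = self.n + 1
            return self.next_char()
        if pos == 2 * self.n + 1:  # sentinel at the end of the second half
            self._load(0)
            self.forward = 0
            return self.next_char()
        self.forward = pos  # eof inside a half: real end of input
        return None


def read_all(text, n=8):
    source = BufferPair(io.StringIO(text), n)
    out = []
    while True:
        ch = source.next_char()
        if ch is None:
            return "".join(out)
        out.append(ch)
```

Only when the character read is the sentinel does the code ask the second question, namely which of the three eof cases applies.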

Lexical unit specification

Strings and languages

The length of a string s, usually written |s|, is the number of occurrences of symbols in s. For example, banana is a string of length 6. The string of length 0 is called the empty string and is denoted ε.

Terms for parts of a string

Regular expressions

Regular definitions

A regular definition of the language corresponding to the C identifier:
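As a rough equivalent, the usual textbook definition (a letter or underscore, followed by any number of letters, underscores, or digits) can be written with Python's re module. The names LETTER_, DIGIT, ID, and is_c_identifier are chosen for this sketch, and the pattern is an assumption about what the missing definition contained.

```python
import re

# Building blocks, mirroring the named parts of a regular definition.
LETTER_ = r"[A-Za-z_]"
DIGIT = r"[0-9]"

# id -> letter_ ( letter_ | digit )*
ID = re.compile(rf"{LETTER_}(?:{LETTER_}|{DIGIT})*")


def is_c_identifier(s):
    """True if the whole string s matches the identifier pattern."""
    return ID.fullmatch(s) is not None
```

Note that this pattern also matches keywords such as if; telling keywords apart from identifiers is a separate step.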

Regular expression extensions

Identification of lexical units

The lexical analyzer is also responsible for stripping out whitespace, which it does by recognizing a special "lexical unit" called ws.

When ws is recognized, it is not returned to the syntax analyzer. Instead, lexical analysis restarts from the character that follows the whitespace.
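A minimal sketch of this behavior follows. It assumes that ws means one or more blanks, tabs, or newlines, and it uses a deliberately crude pattern for everything else (identifiers, integers, or any single character); both patterns and the function name tokens are illustrative, not taken from the original.

```python
import re

# Assumed definition: ws -> ( blank | tab | newline )+
WS = re.compile(r"[ \t\n]+")
# Crude stand-in for all other lexical units.
TOKEN = re.compile(r"[A-Za-z_][A-Za-z_0-9]*|[0-9]+|.", re.S)


def tokens(text):
    """Yield lexemes, recognizing ws but never returning it."""
    pos = 0
    while pos < len(text):
        m = WS.match(text, pos)
        if m:
            pos = m.end()  # restart after the whitespace
            continue
        m = TOKEN.match(text, pos)
        yield m.group()
        pos = m.end()
```

For example, list(tokens("  x =\t42\n")) yields the three lexemes x, =, and 42, with no entry for the whitespace between them.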

State transition diagram

A state transition diagram has a set of nodes, drawn as circles, called states.

The edges in the state diagram point from one state of the graph to another. The label of each edge contains one or more symbols.

Because the last character read is not part of the identifier, we must retract the input one position. In the diagram, a * on a state means that the input is retracted one position.
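The retraction step is easier to see in code. The sketch below hand-codes a three-state diagram for identifiers; the state numbers, the function name scan_identifier, and the ASCII-only character classes are assumptions made for the example.

```python
import string

LETTERS = set(string.ascii_letters + "_")
DIGITS = set(string.digits)


def scan_identifier(text, start):
    """Return (lexeme, next_position), or None if no identifier starts here."""
    state = 0
    forward = start
    while True:
        # An empty string stands for "no more input".
        ch = text[forward] if forward < len(text) else ""
        forward += 1
        if state == 0:
            if ch in LETTERS:
                state = 1
            else:
                return None  # the diagram fails on the first character
        elif state == 1:
            if ch not in LETTERS and ch not in DIGITS:
                state = 2  # read one character too many
        if state == 2:
            forward -= 1  # the * state: retract the input one position
            return text[start:forward], forward
```

Calling scan_identifier("count1+2", 0) returns ("count1", 6): the + was read in order to detect the end of the identifier, then given back so that the next lexical unit starts with it.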

The lexical analyzer generator Lex

Lex lets us describe the pattern of each lexical unit with a regular expression, and in this way specify a lexical analyzer. The input notation of the Lex tool is called the Lex language, and the tool itself is called the Lex compiler. At its core, the Lex compiler converts the input patterns into a state transition diagram, generates code that implements it, and stores that code in the file lex.yy.c.

Lex usage

Conflict resolution in Lex
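
When several prefixes of the input match one or more patterns, Lex chooses by two rules: prefer the longest matching prefix, and if the longest prefix matches more than one pattern, prefer the pattern listed first. The Python sketch below imitates these two rules; the rule table RULES and the function match_at are hypothetical and only serve to illustrate them.

```python
import re

# Order matters: earlier rules win ties, as in a Lex specification.
RULES = [
    ("IF", re.compile(r"if")),
    ("ID", re.compile(r"[A-Za-z_][A-Za-z_0-9]*")),
    ("LE", re.compile(r"<=")),
    ("LT", re.compile(r"<")),
]


def match_at(text, pos):
    """Return (rule_name, lexeme) for the best match at pos, or None."""
    best = None
    for name, pattern in RULES:
        m = pattern.match(text, pos)
        # Strict '>' keeps the earlier rule when two matches have equal length.
        if m and (best is None or m.end() > best[1]):
            best = (name, m.end())
    if best is None:
        return None
    name, end = best
    return name, text[pos:end]
```

With this table, the input if is reported as the keyword IF because that rule comes first, while iffy is reported as ID because the longer match wins.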

Finite automata

Nondeterministic Finite Automata (NFA)

A nondeterministic finite automaton (NFA) consists of the following parts:

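  • A finite set of states S
  • A set of input symbols Σ, the input alphabet. The empty string ε is never a member of Σ
  • A transition function that gives, for each state and each symbol in Σ ∪ {ε}, a set of next states
  • A state s0 in S, designated as the start state
  • A set of states F, a subset of S, designated as the accepting states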

 Transition diagram of an NFA that recognizes the language of the regular expression (a|b)*abb

Transition table

Each row of the table corresponds to a state, and each column corresponds to an input symbol or to ε. The entry for a given state and input is the value of the NFA's transition function applied to those arguments. If the transition function has no information about a state-input pair, we put ∅ in the table entry for that pair.

  Transition table for the NFA of Figure 3-24

An NFA accepts a string if and only if there is some path from the start state to an accepting state whose edge labels, read in order, spell out that string.

The language defined (or accepted) by an NFA is the set of strings labeling some path from the start state to an accepting state.
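
To make the acceptance condition concrete, here is a small Python sketch that stores a transition table as a dictionary and tracks the set of states reachable after each input symbol. The state numbering (0 to 3, with 3 accepting) is the usual textbook layout for (a|b)*abb and is an assumption here, since the figure itself is not reproduced. Missing dictionary entries play the role of ∅. This particular NFA has no ε-transitions, so the sketch skips ε-closure.

```python
# (state, symbol) -> set of next states; absent pairs mean the empty set.
TRANSITIONS = {
    (0, "a"): {0, 1},
    (0, "b"): {0},
    (1, "b"): {2},
    (2, "b"): {3},
}
START = 0
ACCEPTING = {3}


def move(states, symbol):
    """All states reachable from any state in 'states' on 'symbol'."""
    result = set()
    for s in states:
        result |= TRANSITIONS.get((s, symbol), set())
    return result


def accepts(string):
    """True if some path labeled 'string' leads from START to an accepting state."""
    current = {START}
    for symbol in string:
        current = move(current, symbol)
    return bool(current & ACCEPTING)
```

Following all possible paths at once, rather than guessing a single one, is what lets this deterministic loop decide whether an accepting path exists.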

Deterministic Finite Automata (DFA)

Origin blog.csdn.net/zaizai1007/article/details/133242290