Lexical analysis - using a regular grammar

(Travel around [http://www.cnblogs.com/naturemickey] Copyright, all rights reserved)


My previous article, "Designing a calculator according to compiler principles", gave a rough overview of the structure of a compiler and the ideas behind building one.

This article takes the lexical analysis part on its own and covers it in more detail.

 

First, what is lexical analysis

Lexical analysis is the first stage of a compiler. Its input is the text of a program, and its output is the sequence of lexical units in that text.

Continuing the example from the previous article: we feed the short program text (10 + pow (2, 3)) * sqrt (4) - 1 to the lexical analyzer, and it combines adjacent characters that form a single lexical unit, producing a list of lexical units, as follows:

( 10 + pow ( 2 , 3 ) ) * sqrt ( 4 ) - 1



That is all the work lexical analysis does.
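As a rough illustration of that input and output, here is a minimal Python sketch. It is not the analyzer from the previous article: the TOKEN pattern and the tokenize function are hypothetical names, and the sketch leans on Python's re module instead of a hand-built DFA.

```python
import re

# Hypothetical token pattern: a number, a name, or one operator/punctuation
# character. Leading whitespace before each token is skipped.
TOKEN = re.compile(r"\s*(\d+\.?\d*|\.\d+|[A-Za-z][A-Za-z0-9]*|[-+*/(),])")

def tokenize(text):
    """Split program text into a list of lexical units (as strings)."""
    text = text.rstrip()
    tokens, pos = [], 0
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if m is None:
            raise ValueError(f"unexpected character {text[pos]!r} at position {pos}")
        tokens.append(m.group(1))
        pos = m.end()
    return tokens

print(" ".join(tokenize("(10 + pow (2, 3)) * sqrt (4) - 1")))
# prints: ( 10 + pow ( 2 , 3 ) ) * sqrt ( 4 ) - 1
```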

 

Second, what is a regular grammar

In the lexical analysis part of the previous article I did not use a regular grammar, because the language we implemented there was simple enough that the DFA diagram could be drawn by hand. If the language we want to implement is more complex, drawing the diagram directly is no longer easy. We then need a simpler way to describe the lexical structure, plus a set of algorithms that turn that description into a DFA.

A fairly common way to describe lexical structure is a "regular grammar".

For example, the following regular grammar for numbers is very similar to what most programming languages use:

digit                       -> 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9

unsigned_integer  -> digit digit *

unsigned_number -> unsigned_integer (( . unsigned_integer ) | Ɛ )

1. In the grammar, the name to the left of the "->" symbol is the name of an expression, and the right side is the expression itself.

2. The first expression means: a digit is one decimal digit, that is 0 or 1 or 2 or 3 ... or 9.

3. The second expression means: an unsigned_integer starts with a digit, followed by zero or more digits (here the "*" symbol means zero or more). In other words, at least one digit, with no upper limit on how many, together form an unsigned_integer.

4. The third expression looks complicated but is easy to explain. Parentheses group part of the structure: ( . unsigned_integer ) groups the decimal point with the unsigned_integer that follows it. The Greek letter Ɛ stands for empty, meaning nothing at all. So (( . unsigned_integer ) | Ɛ) describes this structure: a decimal point followed by digits, or nothing. The whole expression unsigned_integer (( . unsigned_integer ) | Ɛ) therefore says: one or more digits, optionally followed by a decimal point and more digits (in effect the simplest form of the familiar double literal).

 

This notation is very similar to a "regular expression", but a regular grammar has two things that a regular expression does not:

1. An expression in a regular grammar can refer by name to other expressions already defined in the grammar, for example unsigned_integer refers to digit. A regular expression is a single whole and cannot refer to other expressions.

2. A regular grammar can express "empty", while a regular expression has no way to write empty.

Conversely, if a regular grammar never refers to other expressions and never uses empty, it becomes simply a series of regular expressions. For example, if we expand the references in digit digit * from the earlier example, it becomes (0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9) (0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9) *, which is equivalent to the original and is also a legal regular expression.
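The expansion of named references can be sketched mechanically. The following Python is only an illustration under assumptions of my own: GRAMMAR and expand are hypothetical names, rules may only refer to rules defined earlier, and Ɛ is written as an empty alternative after | because Python's re module accepts that.

```python
import re

# Rules in definition order; a rule may only refer to earlier rules.
# The empty alternative after "|" plays the role of Ɛ.
GRAMMAR = [
    ("digit", "0|1|2|3|4|5|6|7|8|9"),
    ("unsigned_integer", "digit digit*"),
    ("unsigned_number", r"unsigned_integer((\.unsigned_integer)|)"),
]

def expand(grammar):
    """Replace every rule name by its parenthesized definition."""
    expanded = {}
    for name, body in grammar:
        def substitute(match):
            word = match.group(0)
            return "(" + expanded[word] + ")" if word in expanded else word
        expanded[name] = re.sub(r"[A-Za-z_]+", substitute, body).replace(" ", "")
    return expanded

rules = expand(GRAMMAR)
print(rules["unsigned_integer"])
# prints: (0|1|2|3|4|5|6|7|8|9)(0|1|2|3|4|5|6|7|8|9)*

number = re.compile(rules["unsigned_number"])
print(bool(number.fullmatch("3.14")), bool(number.fullmatch("3.")))
# prints: True False
```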

 

Below we use a regular grammar to describe the lexical structure of the calculator language from the previous article. It looks roughly like this:

INT   -> (0|1|2|3|4|5|6|7|8|9)(0|1|2|3|4|5|6|7|8|9)*

NUM -> INT | (INT .) | (INT . INT) | (. INT)

FUN -> (pow) | (sqrt)

VAR -> (a | b | ... | z | A | B | ... | Z)(a | b | ... | z | A | B | ... | Z | 0 | 1 | 2 | ... | 8 | 9) *

ADD -> +

SUB -> -

MUL -> \*

DIV  -> /

LBT -> \(

RBT -> \)

COMMA -> ,

BLANKS -> (\t | \  | \n | \r) *

The ... in the grammar above is not part of regular grammar notation. I used it only because listing every character in between would be too long.

Regular expression engines in the Perl style usually build in shorthand for this, for example:

\d or [0-9] is shorthand for 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9.

\w or [_a-zA-Z0-9] is shorthand for all upper- and lowercase letters, digits, and the underscore.

This is what metacharacters are for in regular expressions. As we can see, a regular grammar without metacharacters is very tedious to use.

We will come back to metacharacters in a later part. For now we accept a little inconvenience and use a regular notation with only a few metacharacters.

Also, the grammar above does not say which rules are final states and which are not. For example, INT is a non-final state in our calculator. This is not important: when we actually write the grammar down, we just add a state attribute to each rule.

So for now we set aside the less important things (for example: metacharacters that are not strictly necessary; final and non-final states) and focus on the process of converting a regular grammar into a DFA.

 

Third, converting a regular grammar into an NFA and a DFA

There are many algorithms for converting a regular grammar into a DFA. Here we describe only one (compared with the others it is easier to understand, and the others can be regarded as ways of minimizing the DFA).

First we introduce the concepts of DFA and NFA.

Readers of the previous article should already have a fairly clear idea of what a DFA is: a state transition diagram with a start state and final states, in which each state has input conditions that lead to another state.

An NFA is very similar to a DFA, but in an NFA the input condition of a transition may be Ɛ (empty), and a state may reach several target states on the same input.

A diagram makes this clearer. For example, a DFA for the regular expression a* can be drawn like this:

And an NFA for the same regular expression a* can be drawn like this:

Or like this:

In the figures above, a black circle is a final (accepting) state and a white circle is a non-final state. An arrow shows the direction of a transition and the letter on it is the input; an arrow with no letter is an empty input.

An empty input means that one state can reach another directly, which is equivalent to reaching several states at the same time.

By now it should be fairly clear what a DFA and an NFA are, but let us still copy down their definitions.

 

1. What is an NFA?

A nondeterministic finite automaton (NFA) consists of the following components:

a). A finite set of states S.

b). A set of input symbols Σ, the input alphabet. We assume that Ɛ, which stands for empty, is never a member of Σ.

c). A transition function that gives, for each state and for each symbol in Σ ∪ {Ɛ}, a set of successor states.

d). A state s0 in S that is designated as the start state, or initial state.

e). A subset F of S that is designated as the set of accepting (or final) states.

 

2. What is a DFA?

A deterministic finite automaton (DFA) is a special case of a nondeterministic finite automaton, in which:

a). There are no transitions on input Ɛ.

b). For each state s and each input symbol a, there is exactly one edge labeled a leaving s.
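The two definitions translate almost directly into a data structure. Below is a minimal Python sketch, assuming the transition function is stored as a dict from (state, symbol) to a set of successor states, with None standing for Ɛ. The names EPS, is_dfa, nfa_delta and dfa_delta are my own, and the two small automata for a* are illustrations, not necessarily the ones in the figures.

```python
EPS = None  # stands for Ɛ, assumed never to be a member of the alphabet

def is_dfa(states, alphabet, delta):
    """delta maps (state, symbol) to the set of successor states."""
    for s in states:
        if delta.get((s, EPS)):
            return False          # violates a): an Ɛ-transition exists
        for a in alphabet:
            if len(delta.get((s, a), ())) != 1:
                return False      # violates b): not exactly one edge labeled a
    return True

# Two hypothetical automata for a* over the alphabet {a}.
nfa_delta = {(0, EPS): {1}, (1, "a"): {1}}
dfa_delta = {(0, "a"): {0}}

print(is_dfa({0, 1}, {"a"}, nfa_delta))  # False
print(is_dfa({0}, {"a"}, dfa_delta))     # True
```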

 

3. Representing a regular expression with an NFA.

The NFA for a regular expression is built from a few basic forms. Any complex regular expression can be built by combining the following forms:

A) An NFA that recognizes a single character a:

B) An NFA that recognizes two consecutive characters ab:

C) An NFA that recognizes either one of two characters a|b:

D) An NFA that recognizes any number of consecutive a (a*, the Kleene closure):

If an a or b in the four basic forms above is replaced by a complete NFA, the forms compose recursively into a larger NFA.

Let us use the forms above to build the NFA for a slightly more complex regular expression: ((ab)|c)*.

Here ab is form B. Taking ab as a whole, (ab)|c is form C. Taking (ab)|c as a whole, applying * is form D.

The NFA diagram looks like this:
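The recursive composition of forms A to D can also be sketched in code. This Python follows the usual Thompson-style construction, so the exact states and Ɛ edges are an assumption and may differ from the figures; Frag, char, concat, union, star, closure and accepts are hypothetical names.

```python
from itertools import count

EPS = None          # label of an empty (Ɛ) edge
_ids = count()      # source of fresh state numbers

class Frag:
    """An NFA fragment with one start state and one accepting state."""
    def __init__(self, start, accept, edges):
        self.start, self.accept, self.edges = start, accept, edges

def char(c):                      # form A: a single character
    s, t = next(_ids), next(_ids)
    return Frag(s, t, [(s, c, t)])

def concat(x, y):                 # form B: x followed by y
    return Frag(x.start, y.accept, x.edges + y.edges + [(x.accept, EPS, y.start)])

def union(x, y):                  # form C: x or y
    s, t = next(_ids), next(_ids)
    extra = [(s, EPS, x.start), (s, EPS, y.start), (x.accept, EPS, t), (y.accept, EPS, t)]
    return Frag(s, t, x.edges + y.edges + extra)

def star(x):                      # form D: zero or more x
    s, t = next(_ids), next(_ids)
    extra = [(s, EPS, x.start), (x.accept, EPS, x.start), (x.accept, EPS, t), (s, EPS, t)]
    return Frag(s, t, x.edges + extra)

def closure(states, edges):
    """All states reachable from `states` through Ɛ edges only."""
    stack, seen = list(states), set(states)
    while stack:
        u = stack.pop()
        for src, label, dst in edges:
            if src == u and label is EPS and dst not in seen:
                seen.add(dst)
                stack.append(dst)
    return seen

def accepts(frag, text):
    current = closure({frag.start}, frag.edges)
    for ch in text:
        moved = {dst for src, label, dst in frag.edges if src in current and label == ch}
        current = closure(moved, frag.edges)
    return frag.accept in current

# ((ab)|c)* built exactly as described: B inside C inside D.
nfa = star(union(concat(char("a"), char("b")), char("c")))
print(accepts(nfa, "abcab"), accepts(nfa, "a"))  # True False
```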

We can now build an NFA for any regular expression, but how do we express a regular grammar as an NFA?

I have quite a few books on compiler technology, and none of them says.

This is how I represent it. I am not sure it is how it should be done, but at least it works!

I connect all the rules of the grammar using the "or" form, and keep the final state at the last node of each rule.

For example, the following grammar:

A -> ab

B -> A | c

C -> B *

It gives the NFA diagram below (I forgot to label the edges with letters, but I think you will understand).

This NFA has several different accepting states, which I have not seen in any compiler book, so I am not sure it still counts as an NFA. For convenience I will keep calling it an NFA in the description below.

 

At run time, such an NFA simply matches greedily until it cannot go any further, then looks at which black node was reached last and recognizes all the input up to that black node as one lexical element.

It then returns to the start state of the whole NFA and continues recognizing a new lexical element from the input that follows the end of the previous one.
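The greedy run described above can be sketched as follows. This is an illustration with a tiny hand-written NFA for a made-up grammar (INT -> digit digit*, ADD -> +), joined at state 0 with Ɛ edges; the names edges, accepting, next_token and tokenize are hypothetical, and conflicts between rules are not handled in this sketch.

```python
EPS = None

# State 0 branches by Ɛ into the INT rule (1 -> 2) and the ADD rule (3 -> 4).
edges = {(0, EPS): {1, 3}, (3, "+"): {4}}
for d in "0123456789":
    edges[(1, d)] = {2}
    edges[(2, d)] = {2}
accepting = {2: "INT", 4: "ADD"}   # each black node remembers its rule

def closure(states):
    stack, seen = list(states), set(states)
    while stack:
        for nxt in edges.get((stack.pop(), EPS), ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen

def next_token(text, pos):
    current, last, i = closure({0}), None, pos
    while i < len(text) and current:
        moved = set()
        for s in current:
            moved |= edges.get((s, text[i]), set())
        current = closure(moved)
        i += 1
        tags = [accepting[s] for s in current if s in accepting]
        if tags:
            last = (tags[0], i)    # the last black node seen so far
    if last is None:
        raise ValueError(f"no lexical element at position {pos}")
    kind, end = last
    return kind, text[pos:end], end

def tokenize(text):
    pos, out = 0, []
    while pos < len(text):         # restart from the start state each time
        kind, lexeme, pos = next_token(text, pos)
        out.append((kind, lexeme))
    return out

print(tokenize("12+345"))
# prints: [('INT', '12'), ('ADD', '+'), ('INT', '345')]
```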

So the NFA can already run, and we can do lexical analysis without constructing a DFA at all. The only benefit of a DFA is that it analyzes somewhat faster than the NFA.

If you want a little more detail on how the NFA runs, you can skip to the fourth part of this article.

If you want to read step by step, the construction of the DFA begins below.

 

4. Constructing a DFA equivalent to an NFA.

There is always more than one algorithm. "Dragon Book Three" introduces the subset construction algorithm:

The algorithm relies on three operations:

Operation       Description
Ɛ-closure(s)    The set of NFA states reachable from NFA state s on Ɛ-transitions alone
Ɛ-closure(T)    The set of NFA states reachable from some NFA state s in T on Ɛ-transitions alone
move(T, a)      The set of NFA states reachable from some state s in T on a transition labeled a

 

 

 

 

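The algorithm is the following pseudocode:

Initially, Ɛ-closure(s0) is the only state in Dstates, and it is unmarked;

while (there is an unmarked state T in Dstates) {

        mark T;

        for (each input symbol a) {

                U = Ɛ-closure(move(T, a));

                if (U is not in Dstates)

                        add U to Dstates as an unmarked state;

                Dtran[T, a] = U;

        }

}

Here Dstates is the set of states of the DFA we are constructing. From the algorithm we can see that each state of this DFA is actually a subset of the states of the NFA (which is why it is called the subset construction algorithm). Dtran is the transition function of the DFA we are constructing.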

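After running this algorithm, a DFA has been constructed.

Let us take an example:

Again using ((ab)|c)* as the example:

Omitted! Drawing the diagrams is too much trouble, and drawing by hand is simpler. I plan to present this material face to face in a training session, so I am being lazy and not drawing it here; I will draw it by hand on the meeting-room whiteboard later.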
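In place of the drawing, the subset construction for ((ab)|c)* can at least be run in code. The Python below is a sketch under assumptions: the five-state NFA is a compact one I wrote by hand for this expression, not the one from the figures, and the names nfa, eps_closure, move, subset_construction and dfa_accepts are hypothetical. Following the pseudocode literally, the empty set also becomes a DFA state; an implementation could equally skip it and treat a missing transition as failure.

```python
EPS = None

# Hand-written NFA for ((ab)|c)*: start state 0, accepting state 4.
nfa = {(0, EPS): {1, 4}, (1, "a"): {2}, (2, "b"): {3},
       (1, "c"): {3}, (3, EPS): {1, 4}}
nfa_start, nfa_accept, alphabet = 0, {4}, "abc"

def eps_closure(states):
    stack, seen = list(states), set(states)
    while stack:
        for nxt in nfa.get((stack.pop(), EPS), ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return frozenset(seen)

def move(T, a):
    return {t for s in T for t in nfa.get((s, a), ())}

def subset_construction():
    start = eps_closure({nfa_start})
    dstates, unmarked, dtran = {start}, [start], {}
    while unmarked:
        T = unmarked.pop()                 # taking T off the list marks it
        for a in alphabet:
            U = eps_closure(move(T, a))
            if U not in dstates:
                dstates.add(U)
                unmarked.append(U)
            dtran[(T, a)] = U
    accepting = {T for T in dstates if T & nfa_accept}
    return start, dstates, dtran, accepting

start, dstates, dtran, accepting = subset_construction()

def dfa_accepts(text):
    """Run the constructed DFA; text may only use symbols of the alphabet."""
    state = start
    for ch in text:
        state = dtran[(state, ch)]
    return state in accepting

print(len(dstates))                            # 4
print(dfa_accepts("abcab"), dfa_accepts("a"))  # True False
```

The four DFA states are the subsets {0, 1, 4}, {2}, {1, 3, 4} and the empty set.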

 

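5. How to minimize a DFA.

First let us copy the algorithm from "Dragon Book Three", then explain a little:

a). First construct an initial partition P with two groups, F and S-F, which are the accepting states and the non-accepting states of D.

b). Apply the following procedure to construct a new partition Pnew:

    Initially, let Pnew = P;

    for (each group G of P) {

        partition G into smaller groups such that two states s and t are in the same subgroup if and only if, for every input symbol a, the transitions of s and t on a lead to states in the same group of P;

        /* in the worst case, every state ends up in a group by itself */

        replace G in Pnew with the subgroups obtained by partitioning G;

    }

c). If Pnew = P, let Pfinal = P and continue with step d); otherwise replace P with Pnew and repeat step b).

d). Choose one state in each group of the partition Pfinal as the representative of that group. The representatives are the states of the minimum-state DFA (below, D2 denotes the minimized DFA and D denotes the DFA before minimization). The other components of D2 are constructed as follows:

    1). The start state of D2 is the representative of the group that contains the start state of D.

    2). The accepting states of D2 are the representatives of the groups that contain accepting states of D. Note that each group contains either only accepting states or only non-accepting states, because we separated the two kinds at the very beginning, and the procedure in step b) only ever forms new groups by splitting groups that already exist.

    3). Let s be the representative of some group G of Pfinal, and let the transition of D out of s on input a lead to state t. Let r be the representative of the group H that contains t. Then D2 has a transition from s to r on input a. Note that in D every state in group G must go to some state in group H on input a; otherwise group G would already have been split into smaller groups by the procedure in step b).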

 

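The biggest problem in applying this algorithm is again the case of several different accepting states (I described above how I understand the NFA representation of a regular grammar). For the initial partition, my approach is to split into several groups: one group holds all the non-accepting states, and each of the other groups holds a different kind of accepting state.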
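The partition steps a) to c), with the multi-group initial partition just described, can be sketched like this. It is an illustration, not my actual implementation: minimize_partition and the example DFA are hypothetical, accept_tag maps each accepting state to the rule it accepts, and a missing transition is treated as its own target (None).

```python
def minimize_partition(states, alphabet, dtran, accept_tag):
    """Return the final partition of the DFA states as a list of frozensets."""
    # Initial partition: non-accepting states share the key None,
    # and each accepting tag (kind of accepting state) gets its own group.
    groups = {}
    for s in sorted(states):
        groups.setdefault(accept_tag.get(s), set()).add(s)
    partition = [frozenset(g) for g in groups.values()]
    while True:
        group_of = {s: i for i, g in enumerate(partition) for s in g}
        refined = []
        for g in partition:
            buckets = {}
            for s in sorted(g):
                # Signature: the group reached on each symbol (None if no edge).
                sig = tuple(group_of.get(dtran.get((s, a))) for a in alphabet)
                buckets.setdefault(sig, set()).add(s)
            refined.extend(frozenset(b) for b in buckets.values())
        if len(refined) == len(partition):   # nothing was split, so Pnew = P
            return refined
        partition = refined

# Hypothetical DFA over the symbols d (a digit) and +.
# States 1 and 2 both accept INT and behave identically; state 3 accepts ADD.
dtran = {(0, "d"): 1, (0, "+"): 3, (1, "d"): 2, (2, "d"): 1}
accept_tag = {1: "INT", 2: "INT", 3: "ADD"}
final = minimize_partition({0, 1, 2, 3}, "d+", dtran, accept_tag)
print(sorted(sorted(g) for g in final))  # [[0], [1, 2], [3]]
```

Choosing one representative per group, as in step d), then merges states 1 and 2 into a single INT state.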

 

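6. Removing dead states from the DFA.

Several books say that the DFA minimization algorithm above may produce a dead state (a non-accepting state that goes to itself on every input symbol). But none of them gives an example of this, or says how to construct such an extreme case, and I have never run into a dead state myself.

So my approach to eliminating dead states is:

a). First look for a dead state.

b). If a dead state is found, throw an exception.

That way, if I am ever lucky enough to meet a dead state, I will know right away and learn something new.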
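The two steps can be sketched in a few lines of Python. The names DeadStateError, find_dead_states and check_no_dead_states are hypothetical. The example DFA is my own: it recognizes exactly the string a, and every state is given a transition on every symbol, which is one situation where a state matching the definition above does show up.

```python
class DeadStateError(Exception):
    pass

def find_dead_states(states, alphabet, dtran, accepting):
    """Non-accepting states that go to themselves on every input symbol."""
    return {s for s in states
            if s not in accepting
            and all(dtran.get((s, a)) == s for a in alphabet)}

def check_no_dead_states(states, alphabet, dtran, accepting):
    dead = find_dead_states(states, alphabet, dtran, accepting)
    if dead:
        raise DeadStateError(f"dead states found: {sorted(dead)}")

# Hypothetical DFA for the single string "a" over {a, b}, with a transition
# defined for every state and symbol. State 2 is the trap for all bad input.
dtran = {(0, "a"): 1, (0, "b"): 2,
         (1, "a"): 2, (1, "b"): 2,
         (2, "a"): 2, (2, "b"): 2}

print(find_dead_states({0, 1, 2}, "ab", dtran, accepting={1}))  # {2}
```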

 

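Fourth, running the NFA and DFA

Running a DFA was described in some detail in my previous post, so here I only cover running an NFA.

There are only two differences between an NFA and a DFA: 1. There are edges whose input is Ɛ. 2. After reading one character, a state may reach several states.

For the first point, we handle it like this: when we arrive at a state node, the nodes reached by its Ɛ edges are reached at the same time. That is, what we arrive at each time is a set of states.

For the second point, we handle it like this: we follow every possible direction until none of them can go any further, then see which direction recognizes the longest word (the greedy principle), and that is the word we recognize. If the grammar we designed has conflicts (that is, two paths may recognize the same word at the same time), we need a rule for resolving conflicts (usually the rule listed earlier in the grammar has higher priority).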
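The longest-word rule and the "earlier rule wins" tie-break can be shown in isolation. The sketch below is a simplification: instead of walking an NFA it uses Python's re module as a stand-in matcher for each rule, and RULES and longest_match are hypothetical names. The rule order follows the calculator grammar, where FUN comes before VAR.

```python
import re

# Rules in grammar order; earlier rules have higher priority on a tie.
RULES = [
    ("FUN", re.compile(r"pow|sqrt")),
    ("VAR", re.compile(r"[A-Za-z][A-Za-z0-9]*")),
    ("INT", re.compile(r"[0-9]+")),
]

def longest_match(text, pos=0):
    """Return (rule name, word) for the longest word starting at pos."""
    best = None
    for name, pattern in RULES:
        m = pattern.match(text, pos)
        # Only a strictly longer match replaces the current best,
        # so on a tie the earlier rule is kept.
        if m and m.end() > pos and (best is None or m.end() > best[1]):
            best = (name, m.end())
    if best is None:
        raise ValueError(f"no rule matches at position {pos}")
    return best[0], text[pos:best[1]]

print(longest_match("pow"))    # ('FUN', 'pow')
print(longest_match("power"))  # ('VAR', 'power')
```

Both FUN and VAR recognize pow with the same length, so FUN wins by priority; for power, VAR recognizes the longer word and wins by the greedy principle.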

 

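Fifth, metacharacters beyond the basic regular notation

In the most basic regular notation we need only two metacharacters: one is | and the other is *

All other metacharacters can be expressed in terms of the basic ones, for example:

?, as in a?, which recognizes zero or one a, but can also be written (a|Ɛ).

+, as in a+, which recognizes one or more a, but can also be written aa*.

Metacharacters like these exist only to make the notation more convenient.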
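These rewrites are mechanical, as the following sketch shows. It assumes a regular expression has already been parsed into a small tuple-based tree; the node kinds ("char", "opt", "plus", "star", "or", "cat", "eps") and the function desugar are hypothetical names chosen for this illustration.

```python
EPS = ("eps",)   # the empty expression Ɛ

def desugar(node):
    """Rewrite ? and + using only |, concatenation, * and Ɛ."""
    kind = node[0]
    if kind in ("char", "eps"):
        return node
    children = [desugar(child) for child in node[1:]]
    if kind == "opt":                        # a?  becomes  (a|Ɛ)
        return ("or", children[0], EPS)
    if kind == "plus":                       # a+  becomes  aa*
        return ("cat", children[0], ("star", children[0]))
    return (kind, *children)

a = ("char", "a")
print(desugar(("opt", a)))   # ('or', ('char', 'a'), ('eps',))
print(desugar(("plus", a)))  # ('cat', ('char', 'a'), ('star', ('char', 'a')))
```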

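There are other metacharacters as well. For example, parentheses are used in the textual form of a grammar to group part of an expression. If we did not use parentheses we would have to use some other symbol (but parentheses are what everyone is most used to), so metacharacters like these are necessary.

Where there are metacharacters there must be an escape character, because the text we want to recognize may itself contain text that looks like a metacharacter. For example, we may need to recognize parentheses in a language, and then we put a backslash in front of the metacharacter: \(.

Many regular expression engines have many built-in escapes, for example:

\d stands for a digit between 0 and 9 (including 0 and 9)

\n stands for a newline

\s stands for a whitespace character (space, horizontal tab, vertical tab, ...)

Some of these escapes conflict in what they recognize, for example \w and \d.

What should we do if the escapes supported by a regular expression engine we write ourselves conflict in this way?

The books do not give a solution to this problem, but it has to be solved. Otherwise, if two conflicting escapes are both used as input on edges, the result is no longer a DFA.

My solution to this problem is ... omitted here for now.

When designing our own regular expression engine, we can also let users define their own escapes. This gives users more freedom, but makes conflicts harder to resolve.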
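For reference, one common way to handle overlapping character classes (not necessarily the solution omitted above) is to split them into disjoint parts, so that each character belongs to exactly one part and every edge labeled with a class becomes edges labeled with its parts. A minimal Python sketch, in which CLASSES and disjoint_parts are hypothetical names and only ASCII characters are considered:

```python
import string

# Hypothetical class definitions, restricted to ASCII for this sketch.
CLASSES = {
    r"\d": set(string.digits),
    r"\w": set(string.ascii_letters + string.digits + "_"),
}

def disjoint_parts(classes):
    """Group characters by the exact set of classes they belong to."""
    parts = {}
    for ch in set().union(*classes.values()):
        key = frozenset(name for name, chars in classes.items() if ch in chars)
        parts.setdefault(key, set()).add(ch)
    return parts

parts = disjoint_parts(CLASSES)
for key, chars in sorted(parts.items(), key=lambda item: sorted(item[0])):
    print(sorted(key), "".join(sorted(chars)))
```

Here the digits form one part (members of both \d and \w) and the letters plus underscore form another (members of \w only), so an edge on \w turns into two edges and an edge on \d into one, with no overlap between them.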

 

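Sixth, limitations of regular grammars

On the limitations of the grammar, as far as I recall only the third item below is something I have seen in a book.

I only list them here without going into detail.

1. A regular grammar has no recursive definitions.

2. A regular grammar cannot recognize context.

3. A regular grammar has no ability to count.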

 

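Seventh, proofs of several related algorithms

Overly academic material is not my strength! I really cannot produce these proofs without following a book, although I can write a lexical analyzer without opening one, just by typing the code.

So this part is omitted!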

 

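I am posting some of the core code here first and will add more later (Java 8 was released recently, and to learn the new features I wrote all the code with JDK 8. This is my first time writing a general-purpose lexical analysis tool in Java; I have written one in C/C++ before, and one in Scala).

 

/****************
 *
 * The code that was here has been deleted.
 *
 ****************/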



 

 

Reposted from: https://www.cnblogs.com/naturemickey/p/3667571.html


Origin blog.csdn.net/weixin_33682790/article/details/93434692