PEG parser series by the creator of Python, part 5: left-recursive PEG grammars

Original title | Left-recursive PEG grammars

Author | Guido van Rossum (creator of Python)

Translator | Pea Cat (author of the "Python Cat" public account)

Disclaimer | This translation is intended for learning and exchange and is released under the CC BY-NC-SA 4.0 license. Minor changes have been made to the content for readability.

I have mentioned several times that left recursion is a stumbling block, and it is time to tackle it. The basic problem is that in a recursive descent parser, left recursion causes a stack overflow that terminates the program.

[This is part 5 of my PEG series. See the series overview for the other articles.]

Consider the following grammar rule:

expr: expr '+' term | term

If we naively translate this grammar fragment into a fragment of a recursive descent parser, we get the following:

def expr():
    if expr() and expect('+') and term():
        return True
    if term():
        return True
    return False

That is, expr() starts by calling expr(), which also starts by calling expr(), and so on. This can only end in a stack overflow, raised as a RecursionError exception.

The traditional remedy is to rewrite the grammar, and I did exactly that in the previous articles. In fact, the grammar above can still be recognized if we rewrite it like this:
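The failure is easy to reproduce. The snippet below is a minimal sketch: `expect()` and `term()` are placeholder stubs (assumptions, not the series' real helpers), since the naive `expr()` overflows the stack before either of them is ever called.

```python
def expect(token):
    # Placeholder stub (assumption): never reached in this demo.
    return False


def term():
    # Placeholder stub (assumption): never reached in this demo.
    return True


def expr():
    # Naive translation of: expr: expr '+' term | term
    if expr() and expect('+') and term():
        return True
    if term():
        return True
    return False


overflowed = False
try:
    expr()
except RecursionError:
    # Python turns the runaway recursion into an exception.
    overflowed = True

print("stack overflow:", overflowed)
```

The first thing `expr()` does is call itself, without consuming any input, so no amount of input can make the recursion stop.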

expr: term '+' expr | term

However, if we use it to generate a parse tree, the tree will have a different shape, and this causes real trouble once we add a '-' operator to the grammar (because a - (b - c) is not the same as (a - b) - c).

This is usually solved with more powerful PEG features, such as grouping and repetition. We can rewrite the rule above as:
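A quick arithmetic check makes the difference concrete. The values below are arbitrary and chosen only for illustration:

```python
a, b, c = 10, 4, 3

left_assoc = (a - b) - c   # shape produced by the left-recursive rule
right_assoc = a - (b - c)  # shape produced by the right-recursive rewrite

print(left_assoc, right_assoc)
```

Python itself evaluates `a - b - c` with the left-associative grouping, so a parse tree with the right-recursive shape would compute the wrong value.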

expr: term ('+' term)*

In fact, this is how Python's current grammar is written for the pgen parser generator (pgen has the same problem with left-recursive rules).

But this still has a problem: operators like '+' and '-' are basically binary (in Python), so when we parse something like a + b + c, we have to walk over the result of the parse (essentially the list ['a', '+', 'b', '+', 'c']) to construct a left-recursive parse tree (similar to [['a', '+', 'b'], '+', 'c']).

The original left-recursive grammar already expresses the desired associativity, so it would be nice if we could generate a parser directly from that form. We can! A reader pointed out a very nice technique to me; it comes with a mathematical proof and is easy to implement. I will try to explain it here.

Let us consider the input foo + bar + baz as an example. The parse tree we want corresponds to (foo + bar) + baz. This requires three left-recursive calls to expr(): one corresponding to the top-level '+' operator (i.e., the second one); one corresponding to the inner '+' operator (i.e., the first one); and a third that chooses the second alternative (i.e., term).

Since I am not good at drawing actual diagrams on a computer, I will show it here using ASCII art:
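The post-processing step described above might look like the following sketch. The function name `fold_left` and the list-based tree representation are assumptions made for illustration; they are not taken from pgen or from the series' code.

```python
def fold_left(items):
    """Fold a flat [operand, op, operand, op, ...] list into a
    left-nested tree such as [['a', '+', 'b'], '+', 'c']."""
    tree = items[0]
    # Walk the remaining items in (operator, operand) pairs.
    for i in range(1, len(items), 2):
        operator, operand = items[i], items[i + 1]
        tree = [tree, operator, operand]
    return tree


flat = ['a', '+', 'b', '+', 'c']
tree = fold_left(flat)
print(tree)
```

This works, but it is extra code that has to be written for every such rule, which is why generating the left-nested tree directly from the grammar is attractive.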

expr------------+------+
  |              \      \
expr--+------+   '+'   term
  |    \      \          |
expr   '+'   term        |
  |            |         |
term           |         |
  |            |         |
'foo'        'bar'     'baz'

The idea is that we would like to have an "oracle" in the expr() function that tells us whether to take the first alternative (i.e., the recursive call to expr()) or the second (i.e., the call to term()). In the first call to expr(), the oracle should return true; in the second (recursive) call, it should also return true; but in the third call, it should return false, so that we can call term().

In code, it would look like this:

def expr():
    if oracle() and expr() and expect('+') and term():
        return True
    if term():
        return True
    return False

How would we write such an oracle? Let's give it a try. We could keep track of the number of (left-recursive) calls to expr() on the call stack and compare it with the number of '+' operators in the expression that follows. If the call stack is deeper than the number of operators, the oracle should return false.

I was almost tempted to implement this with sys._getframe(), but there is a better way: let's turn the call stack around!

The idea is that we start with the call where the oracle returns false, and save the result. This gives us expr() -> term() -> 'foo'. (It should return the parse tree for the initial term, i.e. 'foo'. The code above simply returns True, but in the second article of this series I showed how to return a parse tree.) Such an oracle is easy to write: it just returns false the first time it is called, with no need to inspect the stack or look ahead.

Then we call expr() again. This time the oracle returns true, but instead of making a left-recursive call to expr(), we substitute the result saved from the previous call. And lo and behold, the expected '+' operator and a subsequent term are there, so we get a parse tree for foo + bar.

We repeat this process once more, and again things work out: this time we get a parse tree for the full expression, and it is properly left-recursive ((foo + bar) + baz).

Then we repeat the process one more time. This time the oracle returns true and the result saved from the previous call is available, but there is no further '+' operator, so the first alternative fails. We therefore try the second alternative, which succeeds but finds only the initial term ('foo'). This is a worse result than the one from the previous call, so we stop here and keep the longest parse (i.e. (foo + bar) + baz).

To turn this into actual working code, I first want to rewrite the code slightly to combine the oracle() call with the left-recursive call to expr(). Let's call the combination oracle_expr(). Code:

def expr():
    if oracle_expr() and expect('+') and term():
        return True
    if term():
        return True
    return False

Next, we write a wrapper that implements the logic described above. It uses a global variable (do not worry, I will get rid of it later). The oracle_expr() function reads the global variable, and the wrapper manipulates it:

saved_result = None

def oracle_expr():
    if saved_result is None:
        return False
    return saved_result

def expr_wrapper():
    global saved_result
    saved_result = None
    parsed_length = 0
    while True:
        new_result = expr()
        if not new_result:
            break
        new_parsed_length = <calculate size of new_result>
        if new_parsed_length <= parsed_length:
            break
        saved_result = new_result
        parsed_length = new_parsed_length
    return saved_result
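To make the wrapper above runnable, the size calculation has to be filled in somehow. The sketch below does so by tracking a global token position, which is an assumption of this example: `tokens`, `pos`, `parse()` and the tuple-based trees are hypothetical stand-ins rather than the series' real tokenizer and Node class, and the saved result is stored together with the position where it ended.

```python
tokens = []          # hypothetical token list (assumption)
pos = 0              # index of the next token to read
saved_result = None  # (tree, end position) from the previous iteration


def expect(text):
    global pos
    if pos < len(tokens) and tokens[pos] == text:
        pos += 1
        return text
    return None


def term():
    global pos
    if pos < len(tokens) and tokens[pos].isidentifier():
        pos += 1
        return tokens[pos - 1]
    return None


def oracle_expr():
    # "False" on the first round; afterwards replay the saved parse.
    global pos
    if saved_result is None:
        return None
    tree, pos = saved_result
    return tree


def expr():
    global pos
    start = pos
    left = oracle_expr()
    if left is not None and expect('+'):
        right = term()
        if right is not None:
            return (left, '+', right)
    pos = start
    return term()


def expr_wrapper():
    global pos, saved_result
    start = pos
    saved_result = None
    parsed_length = 0
    while True:
        pos = start
        new_result = expr()
        if new_result is None:
            break
        new_parsed_length = pos - start  # tokens consumed this round
        if new_parsed_length <= parsed_length:
            break
        saved_result = (new_result, pos)
        parsed_length = new_parsed_length
    if saved_result is None:
        pos = start
        return None
    tree, pos = saved_result
    return tree


def parse(text):
    global tokens, pos
    tokens, pos = text.split(), 0
    return expr_wrapper()


print(parse("foo + bar + baz"))
```

Each round of the loop grows the tree by one '+' on the left-hand side; the loop stops on the round that can only find the bare term again.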

This is of course rather pitiful, but it conveys the gist of the code, so let's work on it and evolve it into something we can be proud of.

The crucial insight (it is my own, though I am probably not the first to have it) is that we can use the memoization cache instead of a global variable to save the result from one call to the next. Then we no longer need the extra oracle_expr() function: we can generate a standard call to expr(), whether or not it is in a left-recursive position.

To do this, we need a separate @memoize_left_rec decorator that is used only for left-recursive rules. It plays the role of the oracle_expr() function by fetching the saved value from the memoization cache, and it contains a loop that calls expr() repeatedly for as long as each new result covers a longer portion of the input than the previous one.

Of course, because the memoization cache is keyed by input position and by parsing method, it is unaffected by backtracking or by multiple recursive rules (for example, in the toy grammar I have been using, both expr and term are left-recursive).

Another nice property of the infrastructure I created in Part 3 is that it makes it easy to check whether a new result is longer than the old one: the mark() method returns an index into the array of input tokens, so we can use that instead of parsed_length above.

I am not going to prove why this algorithm always works, no matter how crazy the grammar. That is because I have not actually read the proof. I can see that it works for simple cases such as expr in the toy grammar, and also for somewhat more complicated cases (for example, left recursion hidden behind optional items in an alternative, or mutual recursion between multiple rules). The most complicated case I can think of in Python's grammar is still fairly tame, so I am willing to trust the theorem and the people who proved it.

So let's stick with it and show some real code.

First, the parser generator must detect which rules are left-recursive. This is a solved problem in graph theory. I will not show the algorithm here; in fact, I will simplify things further and assume that the only left-recursive rules in the grammar are directly left-recursive, like expr in our toy grammar. Checking for left recursion then only requires looking for an alternative that begins with the name of the current rule. We can write:

def is_left_recursive(rule):
    for alt in rule.alts:
        if alt[0] == rule.name:
            return True
    return False
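As a quick check, the function can be exercised on two hand-built rules. The `Rule` shape used here (a `name` plus a list of alternatives, each a list of item strings) is an assumption for illustration; the generator's real Rule class may differ.

```python
from collections import namedtuple

# Hypothetical rule representation (assumption, not the generator's class).
Rule = namedtuple("Rule", ["name", "alts"])


def is_left_recursive(rule):
    for alt in rule.alts:
        if alt[0] == rule.name:
            return True
    return False


# expr: expr '+' term | term
expr_rule = Rule("expr", [["expr", "'+'", "term"], ["term"]])
# atom: NAME | NUMBER
atom_rule = Rule("atom", [["NAME"], ["NUMBER"]])

print(is_left_recursive(expr_rule), is_left_recursive(atom_rule))
```

As the text notes, this only catches direct left recursion. A rule that reaches itself through another rule would need the graph-based analysis.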

Now we modify the parser generator so that it generates a different decorator for left-recursive rules. Recall that in Part 3 we decorated all parsing methods with @memoize. We now make one small change to the generator so that left-recursive rules use @memoize_left_rec instead, and then we work our magic in the memoize_left_rec decorator. The rest of the generator and the supporting code do not need to change! (However, I did have to fiddle with the visualization code.)

For reference, here is the original @memoize decorator, copied from Part 3. Recall that self is a Parser instance, which has a memos attribute (initialized to an empty dictionary) and mark() and reset() methods that get and set the tokenizer's current position:

def memoize(func):
    def memoize_wrapper(self, *args):
        pos = self.mark()
        memo = self.memos.get(pos)
        if memo is None:
            memo = self.memos[pos] = {}
        
        key = (func, args)
        if key in memo:
            res, endpos = memo[key]
            self.reset(endpos)
        else:
            res = func(self, *args)
            endpos = self.mark()
            memo[key] = res, endpos
        return res
    return memoize_wrapper

The @memoize decorator remembers previous calls for each input position: there is a separate memo dictionary for each position in the (lazy) array of input tokens. The first four lines of the memoize_wrapper function are concerned with getting the right memo dictionary.

Here is @memoize_left_rec. Only the else branch differs from @memoize above:

def memoize_left_rec(func):
    def memoize_left_rec_wrapper(self, *args):
        pos = self.mark()
        memo = self.memos.get(pos)
        if memo is None:
            memo = self.memos[pos] = {}
        key = (func, args)
        if key in memo:
            res, endpos = memo[key]
            self.reset(endpos)
        else:
            # Prime the cache with a failure.
            memo[key] = lastres, lastpos = None, pos
            # Loop until no longer parse is obtained.
            while True:
                self.reset(pos)
                res = func(self, *args)
                endpos = self.mark()
                if endpos <= lastpos:
                    break
                memo[key] = lastres, lastpos = res, endpos
            res = lastres
            self.reset(lastpos)
        return res
    return memoize_left_rec_wrapper

It probably helps to show the generated expr() method, so that we can trace the flow between the decorator and the decorated method:

    @memoize_left_rec 
    def expr(self):
        pos = self.mark()
        if ((expr := self.expr()) and
            self.expect('+') and
            (term := self.term())):
            return Node('expr', [expr, term])
        self.reset(pos)
        if term := self.term():
            return Node('term', [term])
        self.reset(pos)
        return None

Let us try to parse foo + bar + baz.

Whenever you call the decorated expr() function, the decorator "intercepts" the call and looks for a previous call at the current position. On the first call it enters the else branch, where it repeatedly calls the undecorated function. When the undecorated function calls expr(), this of course refers to the decorated version, so the recursive call is intercepted again. The recursion stops here, because the memo cache now has a hit.

What happens next? The initial cache value comes from this line:

            # Prime the cache with a failure.
            memo[key] = lastres, lastpos = None, pos

This makes the decorated expr() return None, so the first if in expr() fails (at expr := self.expr()). We therefore continue to the second if, which successfully recognizes a term (in our example, 'foo'), and expr returns a Node instance. Where does it return to? To the while loop in the decorator. The loop updates the memo cache with the new result (that Node instance) and then starts the next iteration.

The undecorated expr() is called again, and this time the intercepted recursive call returns the newly cached Node instance (a term). Since that is a success, the call continues to expect('+'). That succeeds too, and we are now past the first '+' operator. After that we look for a term, which also succeeds (finding 'bar').

So the bare expr() has now recognized foo + bar and returns to the while loop, which goes through the same motions: it updates the memo cache with the new (longer) result and starts the next iteration.

The game repeats once more. The intercepted recursive expr() call again retrieves the new result from the cache (this time foo + bar), and we expect and find another '+' (the second) and another term ('baz'). We construct a Node representing (foo + bar) + baz and return it to the while loop, which stores it in the memo cache and iterates again.

But the next time, things go differently. With the new result we look for another '+', but do not find one! So this expr() call falls back to its second alternative and returns a measly term. When this reaches the while loop, the loop finds to its disappointment that the result is shorter than the last one, so it breaks and returns the longer result ((foo + bar) + baz) to the original caller, wherever the outer expr() call was initiated (for example, a statement() call, not shown here).

That is the end of today's story: we have successfully tamed left recursion in a PEG(-ish) parser. Next week I plan to discuss adding "actions" to the grammar, which let us customize the result returned by the parsing method for a given alternative (instead of always returning a Node instance).

If you want to play with the code, see the GitHub repository. (I have also added visualization code for left recursion, but I am not particularly happy with it, so I am not linking to it here.)
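The walk-through above can be reproduced with a small self-contained sketch. The decorator follows the one shown earlier, except for one `trace.append()` line added purely for illustration. `ToyParser`, its token list, and the namedtuple-based `Node` are hypothetical stand-ins for the series' generated parser, tokenizer and Node class. The walrus operator requires Python 3.8 or later.

```python
from collections import namedtuple

# Minimal stand-in for the series' Node class (assumption).
Node = namedtuple("Node", ["type", "children"])


def memoize_left_rec(func):
    def memoize_left_rec_wrapper(self, *args):
        pos = self.mark()
        memo = self.memos.setdefault(pos, {})
        key = (func, args)
        if key in memo:
            res, endpos = memo[key]
            self.reset(endpos)
            return res
        # Prime the cache with a failure.
        memo[key] = lastres, lastpos = None, pos
        # Loop until no longer parse is obtained.
        while True:
            self.reset(pos)
            res = func(self, *args)
            endpos = self.mark()
            self.trace.append(endpos)  # added for illustration only
            if endpos <= lastpos:
                break
            memo[key] = lastres, lastpos = res, endpos
        self.reset(lastpos)
        return lastres
    return memoize_left_rec_wrapper


class ToyParser:
    """Hypothetical hand-written parser over a list of string tokens."""

    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0
        self.memos = {}
        self.trace = []  # end position reached by each loop iteration

    def mark(self):
        return self.pos

    def reset(self, pos):
        self.pos = pos

    def expect(self, text):
        if self.pos < len(self.tokens) and self.tokens[self.pos] == text:
            self.pos += 1
            return text
        return None

    def term(self):
        # In this sketch a term is a single identifier token.
        if self.pos < len(self.tokens) and self.tokens[self.pos].isidentifier():
            self.pos += 1
            return self.tokens[self.pos - 1]
        return None

    @memoize_left_rec
    def expr(self):
        pos = self.mark()
        if ((expr := self.expr()) and
                self.expect('+') and
                (term := self.term())):
            return Node('expr', [expr, term])
        self.reset(pos)
        if term := self.term():
            return Node('term', [term])
        self.reset(pos)
        return None


parser = ToyParser("foo + bar + baz".split())
tree = parser.expr()
print(tree)
print(parser.trace)
```

After the run, `parser.trace` holds `[1, 3, 5, 1]`: three iterations that each reach further into the input, followed by the fourth one that falls back to the bare term and ends the loop.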

License for this article and the sample code: CC BY-NC-SA 4.0

About the author: Guido van Rossum, the creator of Python, served as its "Benevolent Dictator For Life" until he stepped down on July 12, 2018. He is currently one of the five members of the new top decision-making body and remains active in the community. This article comes from his PEG parser series on Medium; the series is still ongoing and is updated every Sunday.

About the translator: Pea Cat was born in Guangdong, graduated from Wuhan University, and now works as a programmer in Suzhou, with a bit of geek thinking, a bit of humanistic feeling, some warmth, and some attitude. Public account: "Python Cat" (python_cat).

The public account [Python Cat] serializes several series of high-quality articles, including the Philosophical Cat series, the Python Advanced series, book recommendations, technical writing, and recommended translations of high-quality English articles. You are welcome to follow it.
