Noeud:Looking for Balanced Expressions, Noeud « Next »:, Noeud « Up »:Parsing



Looking for Balanced Expressions

We have seen that Flex supports abbreviations, see Flex Regular Expressions. Using a more pleasant syntax, we could write for instance:

     digit:  [0-9];
     number: digit+;
     

It is then tempting to describe possibly parenthesized expressions using these abbreviations:

     expr: '(' expr ')' | number
     
     Example 7.1: A Simple Balanced Expression Grammar
     

to describe numbers nested in balanced parentheses.

Of course, Flex abbreviations cannot be recursive, since recursion can be used as above to describe balanced parentheses, which fall out of Flex' expressive power: Flex produces finite state recognizers, and FSR cannot recognized balanced parentheses because they have finite memory (voir Start Conditions).

This suggests that we need some form of virtually infinite memory to recognize such a language. The most primitive form of an infinite memory device is probably stacks, so let's try to design a recognizer for expr using a stack. Such automata are named pushdown automata.

Intuitively, if the language was reduced to balanced parentheses without any nested number, we could simply use the stack to push the opening parentheses, and pop them when finding closing parentheses. Slightly generalizing this approach to the present case, it seems a good idea to push the tokens onto the stack, and to pop them when we find what can be done out of them:

Step Stack Input
1. ( ( number ) )
2. ( ( number ) )
3. ( ( number ) )
4. ( ( number ) )

At this stage, our automaton should recognize that the number on top of its stack is a form of expr. It should therefore replace number with expr:

5. ( ( expr ) )
6. ( ( expr ) )

Now, the top of the stack, ( expr ), clearly denotes an expr, according to our rules:

7. ( expr )
8. ( expr )
9. expr

Finally, we managed to recognize that ( ( number ) ) is indeed an expression according to our definition: the whole input has been processed (i.e., there remains nothing in the input), and the stack contains a single nonterminal: expr. To emphasize that the whole input must be processed, henceforth we will add an additional token, $, standing for end of file, and an additional rule:

     $axiom: expr $;
     

If you look at the way our model works, there are basically two distinct operations to perform:

shift
Shifting a token, i.e., fetching the next token from the input, and pushing it onto the stack. Steps performed from the states 1, 2, 3, 5, and 7 are shifts.
reduce
Reducing a rule, i.e., recognizing that the top of the stack represents some right hand size of a rule, and replacing it with its left hand side. For instance at stage 4, the top stack contains number which can be reduced to expr thanks to the rule expr: number;. At stages 6 and 8, the top of the stack contains ( expr ), which, according to expr: '(' expr ')'; can be reduced to expr.

The initial rule, $axiom: expr $; is special: reducing it means we recognized the whole input. This is called accepting.

The tricky part is precisely knowing when to shift, and when to reduce: the strategy we followed consists in simply reducing as soon as we can, but how can we recognize stacks that can be reduced? To answer this question, we must also consider the cases where the input contains a syntax error, such as number number. Finally, our question amounts to "what are the valid stacks ready to be reduced". It is easy to see there are two cases: a possibly empty series of opening parentheses followed by either a lone number, or else by a (, then an expr, and a ), or, finally a single expr and end of file.

Hey! "A possibly empty series of opening parentheses followed by...": this is typically what we used to denote with regular expressions! Here:

     (* ( expr $ | ( expr ) | number )
     

If we managed to describe our reducible stacks with regular expressions, it means we can recognize them using a finite state recognizer!

This FSR is easy to compute by hand, since its language is fairly small. It is depicted below, with an additional terminal symbol, $, denoting the end of file, and N standing for number:

                    `('
                   ,----.
                   |    |
                   v    |
                ,---------. expr ,--------------. `)' ,------------------.
            ,-->| 1: `('+ |----->| 3: `('+ expr |---->| 4: `('+ expr `)' |-->
           /    `---------'      `--------------'     `------------------'
      `(' /          |
         /           |`N'
        /            v
     ,---. `N'  ,-------------.
     | 0 |----->| 2: `('* `N' |-->
     `---'      `-------------'
        \
         \ expr
          \
           \    ,---------.  $   ,-----------.
            `-->| 5: expr |----->| 6: expr $ |-->
                `---------'      `-----------'
     
     Example 7.2: A Stack Recognizer for Balanced Expressions
     

As expected, each final state denotes the recognition of a rule to reduce. For instance state 4 describes the completion of the first rule, expr: '(' expr ')', which we will denote as reduce 1. State 6 describes the acceptance of the whole text, which we will denote accept. Transitions labelled with tokens are shifts, for instance if the automaton is in state 1 and sees an N, then it will shift 2. An transition labelled with a non terminal does not change the stack, for instance an expr in state 1 makes the automaton go to 4. Any impossible transition (for instance receiving a closing parenthesis in state 0) denotes a syntax error: the text is not conform to the grammar.

In practice, because we want to produce efficient parsers, we won't restart the automaton from the beginning of the stack at each turn. Rather, we will remember the different states we reached. For instance the previous stack traces can be decorated with the states and actions as follows:

Step Stack Input Action
1. 0 ( ( number ) ) $ shift 1
2. 0 ( 1 ( number ) ) $ shift 1
3. 0 ( 1 ( 1 number ) ) $ shift 2
4. 0 ( 1 ( 1 number 2 ) ) $ reduce 2
5. 0 ( 1 ( 1 expr ) ) $ go to 3
6. 0 ( 1 ( 1 expr 3 ) ) $ shift 4
7. 0 ( 1 ( 1 expr 3 ) 4 ) $ reduce 1
8. 0 ( 1 expr ) $ go to 2
9. 0 ( 1 expr 2 ) $ shift 4
10. 0 ( 1 expr 2 ) 4 $ reduce 1
11. 0 expr $ go to 5
12. 0 expr 5 $ shift 6
13. 0 expr 5 $ 6 accept

     Example 7.3: Step by Step Analysis of ( ( number ) )
     


This technology is named LR(0) parsing, "L" standing for Left to Right Parsing, "R" standing for Rightmost Derivation1, and "0" standing for no lookahead symbol: we never had to look at the input to decide whether to shift or to reduce.

Again, while the theory is simple, computing the valid stacks, designing an finite state stack recognizer which shifts or reduces, etc. is extremely tedious and error prone. A little of automation is most welcome.


Notes de bas de page

  1. When we reduce a rule, it is always at the top of our stack, corresponding to the rightmost part of the text input so forth. Some other parsing techniques, completely different, first handle the leftmost possible reductions.