Programming with GNU Software

Noeud:Looking for Arithmetics, Noeud Ť Next ť:What is Bison, Noeud Ť Previous ť:Looking for Balanced Expressions, Noeud Ť Up ť:Parsing

Looking for Arithmetics

Except for a few insignificant details, the syntax introduced in the previous section is called BNF, standing for Backus-Naur form, from the name of its inventors: they used it to formalize the Algol 60 programming language. It is used to define grammars: a set of rules which describe the structure of a language, just as we used grammars to describe the English sentences at school. As for natural languages, two kinds of symbols are used to describe artificial languages: terminal symbols (e.g., he, she, etc. or +, -, etc.) and nonterminal symbols (e.g., "subject", or "operator"). Examples of grammars include

     sentence:    subject predicate;
     subject:     she | he | it;
     predicate:   verb noun-phrase | verb;
     verb:        eats;
     noun-phrase: article noun | noun ;
     article:     the;
     noun:        bananas | coconuts;

     expr: expr op expr | ( expr ) | number;
     op:   + | - | * | /;

     Example 7.4: A Grammar for Arithmetics

Such rules, which left hand side is always reduced to a single nonterminal, define so called context free grammars. Context free grammars are properly more powerful than regular expressions: any regular expression can be represented by a context free grammar, and there are context free grammars defining languages that regular expressions cannot denote (e.g., the nested parentheses).

Grammars can be ambiguous: some sentences can be understood in different ways. For instance the sentence number + number * number can be understood in two different ways by the grammar of the example 7.4:

                       ,  expr .                  ,  expr .
                      /    |    \                /    |    \
                     /     |     \              /     |     \
                    /      |      \            /      |      \
                   v       v       v          v       v       v
            , expr.       op      expr       expr     op       , expr.
           /   |   \       |       |          |       |       /   |   \
          /    |    \      |       |          |       |      /    |    \
         v     v     v     |       |          |       |     v     v     v
       expr   op    expr   |       |          |       |   expr   op    expr
         |     |     |     |       |          |       |     |     |     |
         v     |     v     v       v          v       v     v     |     v
     number + number*   number   number  +number * number

     Example 7.5: Non Equivalent Parse Trees for number + number * number

Because having different interpretations of a given sentence cannot be accepted for artificial languages (a satellite could easily be wasted if programmed in an ambiguous language), we will work with unambiguous grammars exclusively. For instance, we will have to refine the grammar of the example 7.4 in order to use it with Bison.

Please note that even if limited to the minus operator, this grammar is still ambiguous: two parse trees represent number - number - number¹. We, humans, don't find it ambiguous, because we know that by convention - is executed from left to right, or, in terms of parenthesis, it forms groups of two from left to right. This is called left associativity. There exist right associative operators, such as power, or assignment in C, that handle their rightmost parts first.

The following unambiguous grammar denotes the same language, but keeps only the conventional interpretation of subtraction: left associativity.

     expr: expr - number | number;

     Example 7.6: An Unambiguous Grammar for - Arithmetics

Let's look at our stack recognizer again, on the input number - number:

Step Stack Input
1. number - number
2. number - number
3. expr - number
4. expr - number
5. expr - number
6. expr

     Example 7.7: An LR(1) Parsing of number - number

This time, we no longer can systematically apply the rule expr: number each time we see a number on the top of the stack. In both step 2 and step 5, the top of the stack contains a number which can be reduced into an expr. We did reduce from step 2, but in step 5 we must not. If we did, our stack would then contain expr - expr, out of which nothing can be done: all the numbers except the last one, must be converted into an expr. We observe that looking at the next token, named the lookahead, solves the issue: if the top stack is number, then if the lookahead is minus, shift, if the lookahead is end-of-file, reduce the rule expr: num, and any other token is an error (think of number number for instance).

The theory we presented in the Looking for Balanced Expressions can be extended to recognize patterns on the stack plus one lookahead symbol. This is called LR(1) parsing. As you will have surely guessed, LR(k) denotes this technique when peeking at the k next tokens in the input.

Unfortunately, even for reasonably simple grammars the automaton is quickly huge, so in practice, one limits herself to k = 1. Yet with a single lookahead, the automaton remains huge: a lot of research has been devoted to design reasonable limitations or compression techniques to have it fit into a reasonable machine. A successful limitation, which description falls out of the scope of this book, is known as LALR(1). A promising approach, DRLR, consists in recognizing the stack from its top instead of from its bottom. Intuitively there are less patterns to recognize since the pertinent information is usually near the top, hence the automaton is smaller.

In A Discriminative Reverse Approach to LR(k) Parsing, Fortes Gálvez José compares the handling of three grammars:

arithmetics A small grammar similar to that of the example 7.4, but unambiguous. There are 5 terminals, 3 nonterminals, and 6 rules.
medium A medium size grammar comprising 41 terminals, 38 nonterminals, and 93 rules.
programming A real size grammar of a programming language, composed of 82 terminals, 68 nonterminals, and 234 rules.

by the three approaches. The size of the automaton is measured by its number of states and entries (roughly, its transitions):

LR(1) LR(1) LALR(1) LALR(1) DRLR(1) DRLR(1)
Grammar States Entries States Entries States Entries

arithmetics 22 71 12 26 8 23
medium 773 6,874 173 520 118 781
programming 1,000+ 15,000+ 439 3,155 270 3,145

As you can see, LR(1) parsers are too expensive and as matter of fact there is no wide spread and used implementation, LALR(1) is reasonable and can be found in numerous tools such as Yacc and all its clones, and finally, because DRLR(1) addresses all the LR(1) grammars, it is an appealing alternative to Yacc. Bison is an implementation of Yacc, hence it handles LALR(1) grammars, but might support DRLR(1) some day, providing the full expressive power of LR(1).

Notes de bas de page

These two parse trees are those of the example 7.5, with - replacing * and +.

Step	Stack	Input
1.		`number - number`
2.	`number`	`- number`
3.	`expr`	`- number`
4.	`expr -`	`number`
5.	`expr - number`
6.	`expr`

arithmetics	A small grammar similar to that of the example 7.4, but unambiguous. There are 5 terminals, 3 nonterminals, and 6 rules.
medium	A medium size grammar comprising 41 terminals, 38 nonterminals, and 93 rules.
programming	A real size grammar of a programming language, composed of 82 terminals, 68 nonterminals, and 234 rules.

	LR(1)	LR(1)	LALR(1)	LALR(1)	DRLR(1)	DRLR(1)
Grammar	States	Entries	States	Entries	States	Entries

arithmetics	22	71	12	26	8	23
medium	773	6,874	173	520	118	781
programming	1,000+	15,000+	439	3,155	270	3,145