Noeud:Looking for Arithmetics, Noeud « Next »:What is Bison, Noeud « Previous »:Looking for Balanced Expressions, Noeud « Up »:Parsing
Except for a few insignificant details, the syntax introduced in the
previous section is called BNF, standing for Backus-Naur form, from
the name of its inventors: they used it to formalize the Algol 60
programming language. It is used to define grammars: a set of
rules which describe the structure of a language, just as we used
grammars to describe the English sentences at school. As for natural
languages, two kinds of symbols are used to describe artificial
languages: terminal symbols (e.g., he
, she
, etc. or
+
, -
, etc.) and nonterminal symbols (e.g.,
"subject", or "operator"). Examples of grammars include
sentence: subject predicate; subject:she
|he
|it
; predicate: verb noun-phrase | verb; verb:eats
; noun-phrase: article noun | noun ; article:the
; noun:bananas
|coconuts
;
or
expr: expr op expr |(
expr)
|number
; op:+
|-
|*
|/
;
Example 7.4: A Grammar for Arithmetics
Such rules, which left hand side is always reduced to a single nonterminal, define so called context free grammars. Context free grammars are properly more powerful than regular expressions: any regular expression can be represented by a context free grammar, and there are context free grammars defining languages that regular expressions cannot denote (e.g., the nested parentheses).
Grammars can be ambiguous: some sentences can be understood in
different ways. For instance the sentence number + number *
number
can be understood in two different ways by the grammar of the
example 7.4:
, expr . , expr . / | \ / | \ / | \ / | \ / | \ / | \ v v v v v v , expr. op expr expr op , expr. / | \ | | | | / | \ / | \ | | | | / | \ v v v | | | | v v v expr op expr | | | | expr op expr | | | | | | | | | | v | v v v v v v | vnumber
+
number
*
number
number
+
number
*
number
Example 7.5: Non Equivalent Parse Trees for number + number * number
Because having different interpretations of a given sentence cannot be
accepted for artificial languages (a satellite could easily be wasted if
programmed in an ambiguous language), we will work with unambiguous
grammars exclusively. For instance, we will have to refine the grammar
of the example 7.4 in order to use it with
Bison.
Please note that even if limited to the minus operator, this grammar is
still ambiguous: two parse trees represent number - number -
number
1. We, humans, don't find it ambiguous, because we know that by
convention -
is executed from left to right, or, in terms of
parenthesis, it forms groups of two from left to right. This is called
left associativity. There exist right associative
operators, such as power, or assignment in C, that handle their
rightmost parts first.
The following unambiguous grammar denotes the same language, but keeps only the conventional interpretation of subtraction: left associativity.
expr: expr-
number
|number
;
Example 7.6: An Unambiguous Grammar for -
Arithmetics
Let's look at our stack recognizer again, on the input number -
number
:
Step | Stack | Input
| |
1. | | number - number
| |
2. | number | - number
| |
3. | expr | - number
| |
4. | expr - | number
| |
5. | expr - number |
| |
6. | expr |
|
Example 7.7: An LR(1) Parsing of number - number
This time, we no longer can systematically apply the rule expr:
number
each time we see a number
on the top of the stack.
In both step 2 and step 5, the top of the stack contains a number
which can be reduced into an expr
. We did reduce from step 2,
but in step 5 we must not. If we did, our stack would then contain
expr - expr
, out of which nothing can be done: all the
number
s except the last one, must be converted into an
expr
. We observe that looking at the next token, named the
lookahead, solves the issue: if the top stack is number
,
then if the lookahead is minus
, shift, if the lookahead is
end-of-file, reduce the rule expr: num
, and any other
token is an error (think of number number
for instance).
The theory we presented in the Looking for Balanced Expressions can be extended to recognize patterns on the stack plus one lookahead symbol. This is called LR(1) parsing. As you will have surely guessed, LR(k) denotes this technique when peeking at the k next tokens in the input.
Unfortunately, even for reasonably simple grammars the automaton is quickly huge, so in practice, one limits herself to k = 1. Yet with a single lookahead, the automaton remains huge: a lot of research has been devoted to design reasonable limitations or compression techniques to have it fit into a reasonable machine. A successful limitation, which description falls out of the scope of this book, is known as LALR(1). A promising approach, DRLR, consists in recognizing the stack from its top instead of from its bottom. Intuitively there are less patterns to recognize since the pertinent information is usually near the top, hence the automaton is smaller.
In A Discriminative Reverse Approach to LR(k) Parsing, Fortes Gálvez José compares the handling of three grammars:
arithmetics |
A small grammar similar to that of the example 7.4, but unambiguous. There are 5 terminals, 3 nonterminals,
and 6 rules.
|
medium |
A medium size grammar comprising 41 terminals, 38 nonterminals, and 93
rules.
|
programming |
A real size grammar of a programming language, composed of 82 terminals,
68 nonterminals, and 234 rules.
|
by the three approaches. The size of the automaton is measured by its number of states and entries (roughly, its transitions):
LR(1) | LR(1) | LALR(1) | LALR(1) | DRLR(1) | DRLR(1)
| |
Grammar | States | Entries | States | Entries | States | Entries
|
arithmetics | 22 | 71 | 12 | 26 | 8 | 23
|
medium | 773 | 6,874 | 173 | 520 | 118 | 781
|
programming | 1,000+ | 15,000+ | 439 | 3,155 | 270 | 3,145
|
As you can see, LR(1) parsers are too expensive and as matter of fact there is no wide spread and used implementation, LALR(1) is reasonable and can be found in numerous tools such as Yacc and all its clones, and finally, because DRLR(1) addresses all the LR(1) grammars, it is an appealing alternative to Yacc. Bison is an implementation of Yacc, hence it handles LALR(1) grammars, but might support DRLR(1) some day, providing the full expressive power of LR(1).