Programming with GNU Software

Noeud:Advanced Use of Bison, Noeud Ť Next ť:The yleval Module, Noeud Ť Previous ť:Using Actions, Noeud Ť Up ť:Parsing

Advanced Use of Bison

The example of the previous section is very representative of real uses of Bison, except for its scale. Nevertheless, there remains a few issues to learn before your Bison expert graduation. In this section, we complete a real case example started in Advanced Use of Flex: a reimplementation of M4's eval builtin, using Flex and Bison.

Bison parsers are often used on top of Flex scanners. As already explained, the scanner then needs to know the definition of the token types, and the token value types: to this end use --defines or %defines to produce a header containing their definition.

The real added value of good parsers over correct parsers is in the handling of errors: the accuracy and readability of the error messages, and their ability to proceed as much as possible instead of merely dying at the first glitch¹.

It is an absolute necessity for error messages to specify the precise location of the errors. To this end, when given --locations or %locations, bison provides an additional type, yyltype which defaults to:

     typedef struct yyltype
     {
       int first_line;
       int first_column;
     
       int last_line;
       int last_column;
     } yyltype;

and another global variable, yylloc, which the scanner is expected to set for each token (voir Advanced Use of Flex). Then, the parser will automatically keep track of the locations not only of the tokens, but also of nonterminals. For instance, it knows that the location of 1 + 20 * 300 starts where 1 starts, and ends where 300 does. As with $$, $1 etc., you may use @$, @1 to set/access the location of a symbol, for instance in "division by 0" error messages.

It is unfortunate, but the simple result of history, that improving the error messages requires some black incantations:

     #define YYERROR_VERBOSE 1
     #define yyerror(Msg) yyerror (&yylloc, Msg)

in other words, even with %locations Bison does not pass the location to yyerror by default. Similarly, enriching the traces with the locations requires:

     /* FIXME: There used to be locations here.  */
     #define YYPRINT(File, Type, Value) yyprint (File, /* &yylloc, */ Type, &Value)

In the real world, using a fixed set of names such as yylex, yyparse, yydebug and so on is wrong: you may have several parsers in the same program, or, in the case of libraries, you must stay within your own name space. Similarly, using global variables such as yylval, yylloc is dangerous, and prevents any recursive use of the parser.

These two problems can be solved with Bison: use %name-prefix="prefix" to rename of the yyfoos into prefixfoos, and use %pure-parser to make Bison define yylval etc. as variables of yyparse instead of global variables.

If you are looking for global-free parsers, then, obviously, you need to be able to exchange information with yyparse, otherwise, how could it return something to you! Bison provides an optional user parameter, which is typically a structure in which you include all the information you might need. This structure is conventionally called a parser control structure, and for consistency I suggest naming it yycontrol. To define it, merely define YYPARSE_PARAM:

     #define YYPARSE_PARAM yycontrol

Moving away from yy via %name-prefix, and from global variables via %pure-parser also deeply change the protocol between the scanner and the parser: before, yylex expected to find a global variable named yylval, but now it is named foolval and it is not global any more!

Scanners written with Flex can easily be adjusted: give it %option prefix="prefix" to change the yy prefix, and explain, (i) to Bison that you want to pass an additional parameter to yylex:

     #define YYLEX_PARAM yycontrol

and (ii) on the other hand, to Flex that the prototype of yylex is to be changed:

     #define YY_DECL \
       int yylex (yystype *yylval, yyltype *yylloc, yycontrol_t *yycontrol)

Putting this all together for our eval reimplementation gives:

     %debug
     %defines
     %locations
     %pure-parser
     %name-prefix="yleval_"
     /* Request detailed parse error messages. */
     %error-verbose
     
     %{
     #if HAVE_CONFIG_H
     #  include <config.h>
     #endif
     
     #include <m4module.h>
     #include "yleval.h"
     
     /* When debugging our pure parser, we want to see values and locations
        of the tokens.  */
     /* FIXME: Locations. */
     #define YYPRINT(File, Type, Value) \
             yyprint (File, /* FIXME: &yylloc, */ Type, &Value)
     static void yyprint (FILE *file, /* FIXME: const yyltype *loc, */
                          int type, const yystype *value);
     /*
     void yyerror (const YYLTYPE *location, yleval_control_t *yleval_control,
                   const char *message, ...);
     */
     
     %}
     
     /* Pass the control structure to YYPARSE and YYLEX. */
     %parse-param "yleval_control_t *yycontrol", "yycontrol"
     %lex-param   "yleval_control_t *yycontrol", "yycontrol"
     
     /* Only NUMBERS have a value.  */
     %union
     {
       number number;
     };

The name of the tokens is an implementation detail: the user has no reason to know NUMBER is your token type for numbers. Unfortunately, these token names appear in the verbose error messages, and we want them! For instance, in the previous section, we implemented calc:

     $ ./calc '(' 2 + ')'
     error-->./calc: parse error, unexpected ')', expecting NUMBER or '('

Bison lets you specify the "visible" symbol names to produce:

     $ ./calc '(' 2 + ')'
     error-->./calc: parse error, unexpected ')', expecting "number" or '('

which is not perfect, but still better. In our case:

     /* Define the tokens together with there human representation. */
     %token YLEVAL_EOF 0 "end of string"
     %token <number> NUMBER "number"
     %token LEFTP "(" RIGHTP ")"
     %token LOR  "||"
     %token LAND "&&"
     %token OR   "|"
     %token XOR  "^"
     %token AND  "&"
     %token EQ "=" NOTEQ "!="
     %token GT ">" GTEQ ">=" LS "<" LSEQ "<="
     %token LSHIFT "<<" RSHIFT ">>"
     %token PLUS "+" MINUS "-"
     %token TIMES "*" DIVIDE "/" MODULO "%" RATIO ":"
     %token EXPONENT "**"
     %token LNOT "!" NOT "~" UNARY
     
     %type <number> exp

There remains to define the precedences for all our operators, which ends our prologue:

     /* Specify associativities and precedences from the lowest to the
        highest.  */
     %left LOR
     %left LAND
     %left OR
     %left XOR
     %left AND
     /* These operators are non associative, in other words, we forbid
        `-1 < 0 < 1'.  C allows this, but understands it as
        `(-1 < 0) < 1' which evaluates to... false.  */
     %nonassoc EQ NOTEQ
     %nonassoc GT GTEQ LS LSEQ
     %nonassoc LSHIFT RSHIFT
     %left  PLUS MINUS
     %left  TIMES DIVIDE MODULO RATIO
     %right EXPONENT
     /* UNARY is used only as a precedence place holder for the
        unary plus/minus rules.  */
     %right LNOT NOT UNARY
     %%

The next section of ylparse.y is the core of the grammar: its rules. The very first rule deserves special attention:

     result: { LOCATION_RESET (yylloc) } exp { yycontrol->result = $2; };

it aims at initializing yylloc before the parsing actually starts. It does so via a so called mid-rule action, i.e., a rule which is executed before the whole rule is recognized². Then it looks for a single expression, which value consists in the result of the parse: we store it in the parser control structure.

The following chunk addresses the token NUMBER. It relies on the default action: $$ = $1.

     /* Plain numbers. */
     exp:
       NUMBER
     ;
     
     /* Parentheses. */
     exp:
       LEFTP exp RIGHTP { $$ = $2; }
     ;

The next bit of the grammar describes arithmetics. Two treatments deserve attention:

Unary Operators: We want the unary minus and plus to bind extremely tightly: their precedence is higher than that binary plus and minus. But, as already explained, the precedence of a rule is that of its last token... by default... To override this default, we use %prec which you can put anywhere in the rule. The rules below read as "the rules for unary minus and plus have the precedence of the precedence place holder UNARY".
Semantic Errors: Not all the exponentiations are valid (2 ^ -1), nor are all the divisions and moduli (1 / 0, 1 % 0). When such errors are detected, we abort the parsing by invoking YYABORT.

     /* Arithmetics. */
     exp:
       PLUS  exp        { $$ =   $2; }   %prec UNARY
     | MINUS exp        { $$ = - $2; }   %prec UNARY
     | exp PLUS     exp { $$ = $1 + $3; }
     | exp MINUS    exp { $$ = $1 - $3; }
     | exp TIMES    exp { $$ = $1 * $3; }
     | exp DIVIDE   exp { if (!yldiv (yycontrol, &@$, &$$, $1, $3)) YYABORT; }
     | exp MODULO   exp { if (!ylmod (yycontrol, &@$, &$$, $1, $3)) YYABORT; }
     | exp RATIO    exp { if (!yldiv (yycontrol, &@$, &$$, $1, $3)) YYABORT; }
     | exp EXPONENT exp { if (!ylpow (yycontrol, &@$, &$$, $1, $3)) YYABORT; }
     ;
     
     /* Booleans. */
     exp:
       LNOT exp     { $$ = ! $2; }
     | exp LAND exp { $$ = $1 && $3; }
     | exp LOR  exp { $$ = $1 || $3; }
     ;
     
     /* Comparisons. */
     exp:
       exp EQ    exp { $$ = $1 == $3; }
     | exp NOTEQ exp { $$ = $1 != $3; }
     | exp LS    exp { $$ = $1 <  $3; }
     | exp LSEQ  exp { $$ = $1 <= $3; }
     | exp GT    exp { $$ = $1 >  $3; }
     | exp GTEQ  exp { $$ = $1 >= $3; }
     ;
     
     /* Bitwise. */
     exp:
       NOT exp        { $$ = ~ $2; }
     | exp AND    exp { $$ = $1 & $3; }
     | exp OR     exp { $$ = $1 | $3; }
     | exp LSHIFT exp { $$ = $1 << $3; }
     | exp RSHIFT exp { $$ = $1 >> $3; }
     ;
     %%

Finally, the epilogue of our parser consists in the definition of the tracing function, yyprint:

     /*------------------------------------------------------------------.
     | When debugging the parser, display tokens' locations and values.  |
     `------------------------------------------------------------------*/
     
     static void
     yyprint (FILE *file,
              /* FIXME: const yyltype *loc, */ int type, const yystype *value)
     {
       fputs (" (", file);
       /* FIXME: LOCATION_PRINT (file, *loc); */
       fputs (")", file);
       switch (type)
         {
         case NUMBER:
           fprintf (file, " = %ld", value->number);
           break;
         }
     }

Notes de bas de page

No one would ever accept a compiler which reports only the first error it finds, and then exits!
The attentive reader will notice that the notion of mid-rule action does not fit with LALR(1) techniques: an action can only be performed once a whole rule is recognized. The paradox is simple to solve: internally Bison rewrites the above fragment as
```
     result: @1 exp    { yycontrol->result = $2; };
     @1: /* Empty. */  { LOCATION_RESET (yylloc); };
     
```
where @1 is a fresh nonterminal. You are invited to read the --verbose report which does include these invisible symbols and rules.