Noeud:Advanced Use of Bison, Noeud « Next »:The yleval Module, Noeud « Previous »:Using Actions, Noeud « Up »:Parsing
The example of the previous section is very representative of real uses
of Bison, except for its scale. Nevertheless, there remains a few
issues to learn before your Bison expert graduation. In this section,
we complete a real case example started in Advanced Use of Flex: a
reimplementation of M4's eval
builtin, using Flex and Bison.
Bison parsers are often used on top of Flex scanners. As already
explained, the scanner then needs to know the definition of the token
types, and the token value types: to this end use --defines
or
%defines
to produce a header containing their definition.
The real added value of good parsers over correct parsers is in the handling of errors: the accuracy and readability of the error messages, and their ability to proceed as much as possible instead of merely dying at the first glitch1.
It is an absolute necessity for error messages to specify the precise
location of the errors. To this end, when given --locations
or
%locations
, bison
provides an additional type,
yyltype
which defaults to:
typedef struct yyltype { int first_line; int first_column; int last_line; int last_column; } yyltype;
and another global variable, yylloc
, which the scanner is
expected to set for each token (voir Advanced Use of Flex). Then,
the parser will automatically keep track of the locations not only of
the tokens, but also of nonterminals. For instance, it knows that the
location of 1 + 20 * 300
starts where 1
starts, and ends
where 300
does. As with $$
, $1
etc., you may use
@$
, @1
to set/access the location of a symbol, for
instance in "division by 0" error messages.
It is unfortunate, but the simple result of history, that improving the error messages requires some black incantations:
#define YYERROR_VERBOSE 1 #define yyerror(Msg) yyerror (&yylloc, Msg)
in other words, even with %locations
Bison does not pass the
location to yyerror
by default. Similarly, enriching the traces
with the locations requires:
/* FIXME: There used to be locations here. */ #define YYPRINT(File, Type, Value) yyprint (File, /* &yylloc, */ Type, &Value)
In the real world, using a fixed set of names such as yylex
,
yyparse
, yydebug
and so on is wrong: you may have several
parsers in the same program, or, in the case of libraries, you must stay
within your own name space. Similarly, using global variables such as
yylval
, yylloc
is dangerous, and prevents any recursive
use of the parser.
These two problems can be solved with Bison: use
%name-prefix="
prefix"
to rename of the yyfoo
s into
prefix
foo
s, and use %pure-parser
to make Bison
define yylval
etc. as variables of yyparse
instead of
global variables.
If you are looking for global-free parsers, then, obviously, you need to
be able to exchange information with yyparse
, otherwise, how
could it return something to you! Bison provides an optional user
parameter, which is typically a structure in which you include all the
information you might need. This structure is conventionally called a
parser control structure, and for consistency I suggest naming it
yycontrol
. To define it, merely define YYPARSE_PARAM
:
#define YYPARSE_PARAM yycontrol
Moving away from yy
via %name-prefix
, and from global
variables via %pure-parser
also deeply change the protocol
between the scanner and the parser: before, yylex
expected to
find a global variable named yylval
, but now it is named
foolval
and it is not global any more!
Scanners written with Flex can easily be adjusted: give it %option
prefix="
prefix"
to change the yy
prefix, and explain, (i)
to Bison that you want to pass an additional parameter to yylex
:
#define YYLEX_PARAM yycontrol
and (ii) on the other hand, to Flex that the prototype of yylex
is to be changed:
#define YY_DECL \ int yylex (yystype *yylval, yyltype *yylloc, yycontrol_t *yycontrol)
Putting this all together for our eval
reimplementation gives:
%debug %defines %locations %pure-parser %name-prefix="yleval_" /* Request detailed parse error messages. */ %error-verbose %{ #if HAVE_CONFIG_H # include <config.h> #endif #include <m4module.h> #include "yleval.h" /* When debugging our pure parser, we want to see values and locations of the tokens. */ /* FIXME: Locations. */ #define YYPRINT(File, Type, Value) \ yyprint (File, /* FIXME: &yylloc, */ Type, &Value) static void yyprint (FILE *file, /* FIXME: const yyltype *loc, */ int type, const yystype *value); /* void yyerror (const YYLTYPE *location, yleval_control_t *yleval_control, const char *message, ...); */ %} /* Pass the control structure to YYPARSE and YYLEX. */ %parse-param "yleval_control_t *yycontrol", "yycontrol" %lex-param "yleval_control_t *yycontrol", "yycontrol" /* Only NUMBERS have a value. */ %union { number number; };
The name of the tokens is an implementation detail: the user has no
reason to know NUMBER
is your token type for numbers.
Unfortunately, these token names appear in the verbose error messages,
and we want them! For instance, in the previous section, we implemented
calc
:
$ ./calc '(' 2 + ')' error-->./calc: parse error, unexpected ')', expecting NUMBER or '('
Bison lets you specify the "visible" symbol names to produce:
$ ./calc '(' 2 + ')' error-->./calc: parse error, unexpected ')', expecting "number" or '('
which is not perfect, but still better. In our case:
/* Define the tokens together with there human representation. */ %token YLEVAL_EOF 0 "end of string" %token <number> NUMBER "number" %token LEFTP "(" RIGHTP ")" %token LOR "||" %token LAND "&&" %token OR "|" %token XOR "^" %token AND "&" %token EQ "=" NOTEQ "!=" %token GT ">" GTEQ ">=" LS "<" LSEQ "<=" %token LSHIFT "<<" RSHIFT ">>" %token PLUS "+" MINUS "-" %token TIMES "*" DIVIDE "/" MODULO "%" RATIO ":" %token EXPONENT "**" %token LNOT "!" NOT "~" UNARY %type <number> exp
There remains to define the precedences for all our operators, which ends our prologue:
/* Specify associativities and precedences from the lowest to the highest. */ %left LOR %left LAND %left OR %left XOR %left AND /* These operators are non associative, in other words, we forbid `-1 < 0 < 1'. C allows this, but understands it as `(-1 < 0) < 1' which evaluates to... false. */ %nonassoc EQ NOTEQ %nonassoc GT GTEQ LS LSEQ %nonassoc LSHIFT RSHIFT %left PLUS MINUS %left TIMES DIVIDE MODULO RATIO %right EXPONENT /* UNARY is used only as a precedence place holder for the unary plus/minus rules. */ %right LNOT NOT UNARY %%
The next section of ylparse.y
is the core of the grammar: its
rules. The very first rule deserves special attention:
result: { LOCATION_RESET (yylloc) } exp { yycontrol->result = $2; };
it aims at initializing yylloc
before the parsing actually
starts. It does so via a so called mid-rule action, i.e., a rule
which is executed before the whole rule is recognized2. Then it looks for a single expression, which value consists in the
result of the parse: we store it in the parser control structure.
The following chunk addresses the token NUMBER
. It relies on the
default action: $$ = $1
.
/* Plain numbers. */ exp: NUMBER ; /* Parentheses. */ exp: LEFTP exp RIGHTP { $$ = $2; } ;
The next bit of the grammar describes arithmetics. Two treatments deserve attention:
%prec
which you can
put anywhere in the rule. The rules below read as "the rules for unary
minus and plus have the precedence of the precedence place holder
UNARY
".
2 ^ -1
), nor are all the
divisions and moduli (1 / 0
, 1 % 0
). When such errors are
detected, we abort the parsing by invoking YYABORT
.
/* Arithmetics. */ exp: PLUS exp { $$ = $2; } %prec UNARY | MINUS exp { $$ = - $2; } %prec UNARY | exp PLUS exp { $$ = $1 + $3; } | exp MINUS exp { $$ = $1 - $3; } | exp TIMES exp { $$ = $1 * $3; } | exp DIVIDE exp { if (!yldiv (yycontrol, &@$, &$$, $1, $3)) YYABORT; } | exp MODULO exp { if (!ylmod (yycontrol, &@$, &$$, $1, $3)) YYABORT; } | exp RATIO exp { if (!yldiv (yycontrol, &@$, &$$, $1, $3)) YYABORT; } | exp EXPONENT exp { if (!ylpow (yycontrol, &@$, &$$, $1, $3)) YYABORT; } ; /* Booleans. */ exp: LNOT exp { $$ = ! $2; } | exp LAND exp { $$ = $1 && $3; } | exp LOR exp { $$ = $1 || $3; } ; /* Comparisons. */ exp: exp EQ exp { $$ = $1 == $3; } | exp NOTEQ exp { $$ = $1 != $3; } | exp LS exp { $$ = $1 < $3; } | exp LSEQ exp { $$ = $1 <= $3; } | exp GT exp { $$ = $1 > $3; } | exp GTEQ exp { $$ = $1 >= $3; } ; /* Bitwise. */ exp: NOT exp { $$ = ~ $2; } | exp AND exp { $$ = $1 & $3; } | exp OR exp { $$ = $1 | $3; } | exp LSHIFT exp { $$ = $1 << $3; } | exp RSHIFT exp { $$ = $1 >> $3; } ; %%
Finally, the epilogue of our parser consists in the definition of the
tracing function, yyprint
:
/*------------------------------------------------------------------. | When debugging the parser, display tokens' locations and values. | `------------------------------------------------------------------*/ static void yyprint (FILE *file, /* FIXME: const yyltype *loc, */ int type, const yystype *value) { fputs (" (", file); /* FIXME: LOCATION_PRINT (file, *loc); */ fputs (")", file); switch (type) { case NUMBER: fprintf (file, " = %ld", value->number); break; } }
No one would ever accept a compiler which reports only the first error it finds, and then exits!
The attentive reader will notice that the notion of mid-rule action does not fit with LALR(1) techniques: an action can only be performed once a whole rule is recognized. The paradox is simple to solve: internally Bison rewrites the above fragment as
result: @1 exp { yycontrol->result = $2; }; @1: /* Empty. */ { LOCATION_RESET (yylloc); };where
@1
is a fresh nonterminal. You are invited to read the
--verbose
report which does include these invisible symbols and
rules.