Noeud:Advanced Use of Gperf, Noeud « Next »:Using Gperf with the GNU Build System, Noeud « Previous »:Using Gperf, Noeud « Up »:Scanning with Gperf
This section is devoted to a real application of Gperf: we will design
numeral
, an M4 module providing the builtin numeral
, which
converts numbers written in American English1 into their actual
value2.
While it is obvious that we will need to map tokens (e.g., two
,
forty
, thousand
...) onto their values (2
,
40
, 1000
...), the algorithm to reconstruct a value is less
obvious. For a start, we won't try to handle "and" as in "one
hundred and fifty".
Looking at two hundred two
it is obvious that, reading from left
to right, if the value of a token, two
, is smaller than that of
the next token, hundred
, then we have to multiply, 2 * 100 = 200.
If not, hundred
followed by two
, then we need to add, 200
+ 2. It seems that using two variables --res
for the current
result, and prev_value
to remember the value of the previous
token-- is enough.
Unfortunately, it is not that simple. For instance two thousand
two hundred
would produce ((2 * 1000) + 2) * 100, while the correct
equation is (2 * 1000) + (2 * 100). Since we are reading from left to
right, we will need an additional variable to store the hundreds,
hundreds
. If the value of the current token is smaller that
1000, then we apply the previous algorithm to compute hundreds
(add if smaller, multiply if greater), otherwise multiply the current
hundreds
by this value, and add to the global result.
if token value >= 1000 then res += hundreds * token value hundreds = 0 elseif previous value <= token value then hundreds *= token value else hundreds += token value end if
Finally, note that for hundreds *= token value
,
hundreds
should be initialized to 1, while in
hundreds += token value
, 0 is the proper initialization.
Instead of additional if
s, we will use the GNU C extension,
foo
? :
bar standing for
foo
?
foo
:
bar3.
Finally, let the file atoms.gperf
be:
%{ /* -*- C -*- */ #if HAVE_CONFIG_H # include <config.h> #endif #include <m4module.h> #include <stdint.h> #include "numeral.h" %} struct atom_s { const char *name; const uintmax_t value; }; %% # Units. zero, 0 one, 1 two, 2 three, 3 four, 4 five, 5 six, 6 seven, 7 eight, 8 nine, 9 # Teens. ten, 10 eleven, 11 twelve, 12 thirteen, 13 fourteen, 14 fifteen, 15 sixteen, 16 seventeen, 17 eighteen, 18 nineteen, 19 # Tens. twenty, 20 thirty, 30 forty, 40 fifty, 50 sixty, 60 seventy, 70 eighty, 80 ninety, 90 # Hundreds. hundred, 100 hundreds, 100 # Powers of thousand. thousands, 1000 thousand, 1000 million, 1000000 millions, 1000000 billion, 1000000000 billions, 1000000000 trillion, 1000000000000 trillions, 1000000000000 quadrillion, 1000000000000000 quadrillions, 1000000000000000 quintillion, 1000000000000000000 quintillions, 1000000000000000000 %% uintmax_t numeral_convert (const char *str) { const char *alphabet = "abcdefghijklmnopqrstuvwxyz"; uintmax_t res = 0, hundreds = 0, prev_value = 0; while ((str = strpbrk (str, alphabet))) { size_t len = strspn (str, alphabet); struct atom_s *atom = in_word_set (str, len); if (!atom) { numeral_error = strndup (str, len); return 0; } if (atom->value >= 1000) { res += (hundreds ? : 1) * atom->value; hundreds = 0; } else if (prev_value <= atom->value) { hundreds = (hundreds ? : 1) * atom->value; } else { hundreds += atom->value; } prev_value = atom->value; str += len; } return res + hundreds; }
Example 6.4: atoms.gperf
-- Converting Numbers in American
English into Integers
English into Integers
Finally, numeral.c
contains all the needed hooks to create an M4
module, and the definition of the builtin, numeral
:
/*--------------------. | numeral(EXPRESSION) | `--------------------*/ M4BUILTIN_HANDLER (numeral) { uintmax_t res; char buf[128]; res = numeral_convert (M4ARG (1)); if (!numeral_error) { sprintf (buf, "%ju", numeral_convert (M4ARG (1))); obstack_grow (obs, buf, strlen (buf)); } else { M4ERROR ((warning_status, 0, _("Warning: %s: invalid number component: %s"), M4ARG (0), numeral_error)); free (numeral_error); numeral_error = NULL; } }
Example 6.5:numeral.c
-- M4 Module Wrappingatoms.gperf
Let us run gperf
to create atoms.c
:
$ gperf --switch=1 --language=ANSI-C --struct-type atoms.gperf >atoms.c error-->Key link: "eight" = "three", with key set "et". error-->Key link: "thirty" = "twenty", with key set "ty". error-->Key link: "fifty" = "forty", with key set "fy". error-->Key link: "trillion" = "thirteen", with key set "nt". error-->Key link: "trillions" = "thousands", with key set "st". error-->Key link: "quintillion" = "quadrillion", with key set "nq". error-->Key link: "quintillions" = "quadrillions", with key set "qs". error-->7 input keys have identical hash values, error-->try different key positions or use option -D.
Eek! Gperf recognizes that its simple heuristic, based apparently on
peeking only at the first and last characters of each word (and their
lengths), fails, and it lists the clashes. If performances do not
matter too much for your application, such as our, then using
--duplicates
, -D
, asks Gperf to handle the collisions
using successive string comparisons:
$ gperf -S 1 -L ANSI-C -t -D atoms.gperf >atoms.c error-->7 input keys have identical hash values, examine output carefully...
This time the message is only informative, the hash function will just
be efficient, not extremely efficient, that's all. If, on the contrary,
performances matter to you, then you may play with the various options
of gperf
, see Using Gperf. We will try to find a better
set of character positions to peek at, using --key-positions
,
-k
.
Notice that the last characters are often not very informative: n
is frequently used because of -teen
and -lion
, y
because of yty
, and s
because of the plurals etc. This
suggests not to use $
in our key positions. Since ten
is
our shortest word, we need to consider at least one of 1, 2, and 3 as
key position. If you try each one, 3 is the best choice with 7
collisions, against 13 for 1, and 14 for 2:
$ gperf -t -k3 atoms.gperf >/dev/null error-->Key link: "twelve" = "eleven", with key set "e". error-->Key link: "twenty" = "eleven", with key set "e". error-->Key link: "forty" = "three", with key set "r". error-->Key link: "hundreds" = "nineteen", with key set "n". error-->Key link: "billion" = "million", with key set "l". error-->Key link: "billions" = "millions", with key set "l". error-->Key link: "trillion" = "thirteen", with key set "i". error-->7 input keys have identical hash values, error-->try different key positions or use option -D.
Choosing 1 is seducing, since most of the collided words start with a different character:
$ gperf -t -k1,3 atoms.gperf >/dev/null error-->Key link: "twenty" = "twelve", with key set "et". error-->Key link: "trillion" = "thirteen", with key set "it". error-->2 input keys have identical hash values, error-->try different key positions or use option -D.
Finally, obviously, the fourth, fifth and sixth characters will solve
the remaining conflicts. Choosing the fourth, we can run
gperf
with all the options required by our application:
$ gperf -S 1 -L ANSI-C -t -k1,3,4 atoms.gperf >atoms.c
then compile our numeral module, and try it:
$ m4 -m numeral numeral(forty) =>40 numeral(two) =>2 numeral(forty-two) error-->m4: stdin: 3: Warning: numeral: invalid number component: forty =>
Huh? What the f* invalid number component: forty
? Clearly this
Gperf recognizer demonstrated that it does know forty
and
two
, but it does not recognize forty
in forty-two
.
How come?
We just fell into a huge trap left wide open in Gperf, meant by its
authors: although we do specify the length of the word to check,
Gperf uses strcmp
! In the present case, it is comparing
forty-two
against forty
using strcmp
, even though
we did mention the five first characters of forty-two
were to be
checked.
In fact, what we told Gperf is that if it wants to peek at the last character of the word, then this character is at the fifth position, but when comparing strings to make sure the word and the keyword it might be are equal, it uses a plain regular C string comparison.
I cannot emphasize this too much: if you are not submitting 0
ended strings then use --compare-strncmp
.
Trying again:
$ gperf -S 1 -L ANSI-C -t -k1,3,4 --compare-strncmp atoms.gperf >atoms.c
and then:
$ m4 -m numeral numeral(forty-two) =>42 numeral( twelve quintillion three hundred forty-five quadrillion six hundred seventy-eight trillion nine hundred one billion two hundred thirty-four million five hundred sixty-seven thousand eight hundred ninety ) =>12345678901234567890
In American English, the words ``billion'', ``trillion'', etc. denote a different value than in Commonwealth English.
To be more rigorous, internally they are converted into
uintmax_t
, and output as strings of digits.
There are not strictly equivalent, since in foo
?
foo
:
bar the expression foo is evaluated twice, while
in
foo
? :
bar it is evaluated only once.