Noeud:Advanced Use of Gperf, Noeud « Next »:, Noeud « Previous »:Using Gperf, Noeud « Up »:Scanning with Gperf

#### Advanced Use of Gperf

This section is devoted to a real application of Gperf: we will design `numeral`, an M4 module providing the builtin `numeral`, which converts numbers written in American English1 into their actual value2.

While it is obvious that we will need to map tokens (e.g., `two`, `forty`, `thousand`...) onto their values (`2`, `40`, `1000`...), the algorithm to reconstruct a value is less obvious. For a start, we won't try to handle "and" as in "one hundred and fifty".

Looking at `two hundred two` it is obvious that, reading from left to right, if the value of a token, `two`, is smaller than that of the next token, `hundred`, then we have to multiply, 2 * 100 = 200. If not, `hundred` followed by `two`, then we need to add, 200 + 2. It seems that using two variables --`res` for the current result, and `prev_value` to remember the value of the previous token-- is enough.

Unfortunately, it is not that simple. For instance ```two thousand two hundred``` would produce ((2 * 1000) + 2) * 100, while the correct equation is (2 * 1000) + (2 * 100). Since we are reading from left to right, we will need an additional variable to store the hundreds, `hundreds`. If the value of the current token is smaller that 1000, then we apply the previous algorithm to compute `hundreds` (add if smaller, multiply if greater), otherwise multiply the current `hundreds` by this value, and add to the global result.

```     if token value >= 1000 then
res += hundreds * token value
hundreds = 0
elseif previous value <= token value then
hundreds *= token value
else
hundreds += token value
end if
```

Finally, note that for `hundreds *= token value`, `hundreds` should be initialized to 1, while in `hundreds += token value`, 0 is the proper initialization. Instead of additional `if`s, we will use the GNU C extension, foo` ? : `bar standing for foo` ? `foo``` : ```bar3.

Finally, let the file `atoms.gperf` be:

```     %{	/* -*- C -*- */
#if HAVE_CONFIG_H
#  include <config.h>
#endif

#include <m4module.h>
#include <stdint.h>
#include "numeral.h"

%}
struct atom_s
{
const char *name;
const uintmax_t value;
};
%%
# Units.
zero,  0
one,   1
two,   2
three, 3
four,  4
five,  5
six,   6
seven, 7
eight, 8
nine,  9
# Teens.
ten,       10
eleven,    11
twelve,    12
thirteen,  13
fourteen,  14
fifteen,   15
sixteen,   16
seventeen, 17
eighteen,  18
nineteen,  19
# Tens.
twenty,   20
thirty,   30
forty,    40
fifty,    50
sixty,    60
seventy,  70
eighty,   80
ninety,   90
# Hundreds.
hundred,  100
hundreds, 100
# Powers of thousand.
thousands,    1000
thousand,     1000
million,      1000000
millions,     1000000
billion,      1000000000
billions,     1000000000
trillion,     1000000000000
trillions,    1000000000000
quintillion,  1000000000000000000
quintillions, 1000000000000000000
%%
uintmax_t
numeral_convert (const char *str)
{
const char *alphabet = "abcdefghijklmnopqrstuvwxyz";
uintmax_t res = 0, hundreds = 0, prev_value = 0;

while ((str = strpbrk (str, alphabet)))
{
size_t len = strspn (str, alphabet);
struct atom_s *atom = in_word_set (str, len);
if (!atom)
{
numeral_error = strndup (str, len);
return 0;
}
if (atom->value >= 1000)
{
res += (hundreds ? : 1) * atom->value;
hundreds = 0;
}
else if (prev_value <= atom->value)
{
hundreds = (hundreds ? : 1) * atom->value;
}
else
{
hundreds += atom->value;
}
prev_value = atom->value;
str += len;
}
return res + hundreds;
}
```
```     Example 6.4: `atoms.gperf` -- Converting Numbers in American
English into Integers
```
English into Integers

Finally, `numeral.c` contains all the needed hooks to create an M4 module, and the definition of the builtin, `numeral`:

```     /*--------------------.
| numeral(EXPRESSION) |
`--------------------*/

M4BUILTIN_HANDLER (numeral)
{
uintmax_t res;
char buf;

res = numeral_convert (M4ARG (1));
if (!numeral_error)
{
sprintf (buf, "%ju", numeral_convert (M4ARG (1)));
obstack_grow (obs, buf, strlen (buf));
}
else
{
M4ERROR ((warning_status, 0,
_("Warning: %s: invalid number component: %s"),
M4ARG (0), numeral_error));
free (numeral_error);
numeral_error = NULL;
}
}
```
```     Example 6.5: `numeral.c` -- M4 Module Wrapping `atoms.gperf`
```

Let us run `gperf` to create `atoms.c`:

```     \$ gperf --switch=1 --language=ANSI-C --struct-type atoms.gperf >atoms.c
error-->Key link: "eight" = "three", with key set "et".
error-->Key link: "thirty" = "twenty", with key set "ty".
error-->Key link: "fifty" = "forty", with key set "fy".
error-->Key link: "trillion" = "thirteen", with key set "nt".
error-->Key link: "trillions" = "thousands", with key set "st".
error-->Key link: "quintillion" = "quadrillion", with key set "nq".
error-->Key link: "quintillions" = "quadrillions", with key set "qs".
error-->7 input keys have identical hash values,
error-->try different key positions or use option -D.
```

Eek! Gperf recognizes that its simple heuristic, based apparently on peeking only at the first and last characters of each word (and their lengths), fails, and it lists the clashes. If performances do not matter too much for your application, such as our, then using `--duplicates`, `-D`, asks Gperf to handle the collisions using successive string comparisons:

```     \$ gperf -S 1 -L ANSI-C -t -D atoms.gperf >atoms.c
error-->7 input keys have identical hash values, examine output carefully...
```

This time the message is only informative, the hash function will just be efficient, not extremely efficient, that's all. If, on the contrary, performances matter to you, then you may play with the various options of `gperf`, see Using Gperf. We will try to find a better set of character positions to peek at, using `--key-positions`, `-k`.

Notice that the last characters are often not very informative: `n` is frequently used because of `-teen` and `-lion`, `y` because of `yty`, and `s` because of the plurals etc. This suggests not to use `\$` in our key positions. Since `ten` is our shortest word, we need to consider at least one of 1, 2, and 3 as key position. If you try each one, 3 is the best choice with 7 collisions, against 13 for 1, and 14 for 2:

```     \$ gperf -t -k3 atoms.gperf >/dev/null
error-->Key link: "twelve" = "eleven", with key set "e".
error-->Key link: "twenty" = "eleven", with key set "e".
error-->Key link: "forty" = "three", with key set "r".
error-->Key link: "hundreds" = "nineteen", with key set "n".
error-->Key link: "billion" = "million", with key set "l".
error-->Key link: "billions" = "millions", with key set "l".
error-->Key link: "trillion" = "thirteen", with key set "i".
error-->7 input keys have identical hash values,
error-->try different key positions or use option -D.
```

Choosing 1 is seducing, since most of the collided words start with a different character:

```     \$ gperf -t -k1,3 atoms.gperf >/dev/null
error-->Key link: "twenty" = "twelve", with key set "et".
error-->Key link: "trillion" = "thirteen", with key set "it".
error-->2 input keys have identical hash values,
error-->try different key positions or use option -D.
```

Finally, obviously, the fourth, fifth and sixth characters will solve the remaining conflicts. Choosing the fourth, we can run `gperf` with all the options required by our application:

```     \$ gperf -S 1 -L ANSI-C -t -k1,3,4 atoms.gperf >atoms.c
```

then compile our numeral module, and try it:

```     \$ m4 -m numeral
numeral(forty)
=>40
numeral(two)
=>2
numeral(forty-two)
error-->m4: stdin: 3: Warning: numeral: invalid number component: forty
=>
```

Huh? What the f* `invalid number component: forty`? Clearly this Gperf recognizer demonstrated that it does know `forty` and `two`, but it does not recognize `forty` in `forty-two`. How come?

We just fell into a huge trap left wide open in Gperf, meant by its authors: although we do specify the length of the word to check, Gperf uses `strcmp`! In the present case, it is comparing `forty-two` against `forty` using `strcmp`, even though we did mention the five first characters of `forty-two` were to be checked.

In fact, what we told Gperf is that if it wants to peek at the last character of the word, then this character is at the fifth position, but when comparing strings to make sure the word and the keyword it might be are equal, it uses a plain regular C string comparison.

I cannot emphasize this too much: if you are not submitting 0 ended strings then use `--compare-strncmp`.

Trying again:

```     \$ gperf -S 1 -L ANSI-C -t -k1,3,4 --compare-strncmp atoms.gperf >atoms.c
```

and then:

```     \$ m4 -m numeral
numeral(forty-two)
=>42
numeral(
twelve quintillion
three hundred forty-five quadrillion
six hundred seventy-eight trillion
nine hundred one billion
two hundred thirty-four million
five hundred sixty-seven thousand
eight hundred ninety
)
=>12345678901234567890
```

#### Notes de bas de page

1. In American English, the words ``billion'', ``trillion'', etc. denote a different value than in Commonwealth English.

2. To be more rigorous, internally they are converted into `uintmax_t`, and output as strings of digits.

3. There are not strictly equivalent, since in foo` ? `foo``` : ```bar the expression foo is evaluated twice, while in foo` ? : `bar it is evaluated only once.