Noeud:Advanced Use of Gperf, Noeud « Next »:, Noeud « Previous »:Using Gperf, Noeud « Up »:Scanning with Gperf



Advanced Use of Gperf

This section is devoted to a real application of Gperf: we will design numeral, an M4 module providing the builtin numeral, which converts numbers written in American English1 into their actual value2.

While it is obvious that we will need to map tokens (e.g., two, forty, thousand...) onto their values (2, 40, 1000...), the algorithm to reconstruct a value is less obvious. For a start, we won't try to handle "and" as in "one hundred and fifty".

Looking at two hundred two it is obvious that, reading from left to right, if the value of a token, two, is smaller than that of the next token, hundred, then we have to multiply, 2 * 100 = 200. If not, hundred followed by two, then we need to add, 200 + 2. It seems that using two variables --res for the current result, and prev_value to remember the value of the previous token-- is enough.

Unfortunately, it is not that simple. For instance two thousand two hundred would produce ((2 * 1000) + 2) * 100, while the correct equation is (2 * 1000) + (2 * 100). Since we are reading from left to right, we will need an additional variable to store the hundreds, hundreds. If the value of the current token is smaller that 1000, then we apply the previous algorithm to compute hundreds (add if smaller, multiply if greater), otherwise multiply the current hundreds by this value, and add to the global result.

     if token value >= 1000 then
       res += hundreds * token value
       hundreds = 0
     elseif previous value <= token value then
       hundreds *= token value
     else
       hundreds += token value
     end if
     

Finally, note that for hundreds *= token value, hundreds should be initialized to 1, while in hundreds += token value, 0 is the proper initialization. Instead of additional ifs, we will use the GNU C extension, foo ? : bar standing for foo ? foo : bar3.

Finally, let the file atoms.gperf be:

     %{	/* -*- C -*- */
     #if HAVE_CONFIG_H
     #  include <config.h>
     #endif
     
     #include <m4module.h>
     #include <stdint.h>
     #include "numeral.h"
     
     %}
     struct atom_s
     {
       const char *name;
       const uintmax_t value;
     };
     %%
     # Units.
     zero,  0
     one,   1
     two,   2
     three, 3
     four,  4
     five,  5
     six,   6
     seven, 7
     eight, 8
     nine,  9
     # Teens.
     ten,       10
     eleven,    11
     twelve,    12
     thirteen,  13
     fourteen,  14
     fifteen,   15
     sixteen,   16
     seventeen, 17
     eighteen,  18
     nineteen,  19
     # Tens.
     twenty,   20
     thirty,   30
     forty,    40
     fifty,    50
     sixty,    60
     seventy,  70
     eighty,   80
     ninety,   90
     # Hundreds.
     hundred,  100
     hundreds, 100
     # Powers of thousand.
     thousands,    1000
     thousand,     1000
     million,      1000000
     millions,     1000000
     billion,      1000000000
     billions,     1000000000
     trillion,     1000000000000
     trillions,    1000000000000
     quadrillion,  1000000000000000
     quadrillions, 1000000000000000
     quintillion,  1000000000000000000
     quintillions, 1000000000000000000
     %%
     uintmax_t
     numeral_convert (const char *str)
     {
       const char *alphabet = "abcdefghijklmnopqrstuvwxyz";
       uintmax_t res = 0, hundreds = 0, prev_value = 0;
     
       while ((str = strpbrk (str, alphabet)))
         {
           size_t len = strspn (str, alphabet);
           struct atom_s *atom = in_word_set (str, len);
           if (!atom)
             {
               numeral_error = strndup (str, len);
               return 0;
             }
           if (atom->value >= 1000)
             {
               res += (hundreds ? : 1) * atom->value;
               hundreds = 0;
             }
           else if (prev_value <= atom->value)
             {
               hundreds = (hundreds ? : 1) * atom->value;
             }
           else
             {
               hundreds += atom->value;
             }
           prev_value = atom->value;
           str += len;
         }
       return res + hundreds;
     }
     
     Example 6.4: atoms.gperf -- Converting Numbers in American
     English into Integers
     
English into Integers

Finally, numeral.c contains all the needed hooks to create an M4 module, and the definition of the builtin, numeral:

     /*--------------------.
     | numeral(EXPRESSION) |
     `--------------------*/
     
     M4BUILTIN_HANDLER (numeral)
     {
       uintmax_t res;
       char buf[128];
     
       res = numeral_convert (M4ARG (1));
       if (!numeral_error)
         {
           sprintf (buf, "%ju", numeral_convert (M4ARG (1)));
           obstack_grow (obs, buf, strlen (buf));
         }
       else
         {
           M4ERROR ((warning_status, 0,
                     _("Warning: %s: invalid number component: %s"),
                     M4ARG (0), numeral_error));
           free (numeral_error);
           numeral_error = NULL;
         }
     }
     
     Example 6.5: numeral.c -- M4 Module Wrapping atoms.gperf
     

Let us run gperf to create atoms.c:

     $ gperf --switch=1 --language=ANSI-C --struct-type atoms.gperf >atoms.c
     error-->Key link: "eight" = "three", with key set "et".
     error-->Key link: "thirty" = "twenty", with key set "ty".
     error-->Key link: "fifty" = "forty", with key set "fy".
     error-->Key link: "trillion" = "thirteen", with key set "nt".
     error-->Key link: "trillions" = "thousands", with key set "st".
     error-->Key link: "quintillion" = "quadrillion", with key set "nq".
     error-->Key link: "quintillions" = "quadrillions", with key set "qs".
     error-->7 input keys have identical hash values,
     error-->try different key positions or use option -D.
     

Eek! Gperf recognizes that its simple heuristic, based apparently on peeking only at the first and last characters of each word (and their lengths), fails, and it lists the clashes. If performances do not matter too much for your application, such as our, then using --duplicates, -D, asks Gperf to handle the collisions using successive string comparisons:

     $ gperf -S 1 -L ANSI-C -t -D atoms.gperf >atoms.c
     error-->7 input keys have identical hash values, examine output carefully...
     

This time the message is only informative, the hash function will just be efficient, not extremely efficient, that's all. If, on the contrary, performances matter to you, then you may play with the various options of gperf, see Using Gperf. We will try to find a better set of character positions to peek at, using --key-positions, -k.

Notice that the last characters are often not very informative: n is frequently used because of -teen and -lion, y because of yty, and s because of the plurals etc. This suggests not to use $ in our key positions. Since ten is our shortest word, we need to consider at least one of 1, 2, and 3 as key position. If you try each one, 3 is the best choice with 7 collisions, against 13 for 1, and 14 for 2:

     $ gperf -t -k3 atoms.gperf >/dev/null
     error-->Key link: "twelve" = "eleven", with key set "e".
     error-->Key link: "twenty" = "eleven", with key set "e".
     error-->Key link: "forty" = "three", with key set "r".
     error-->Key link: "hundreds" = "nineteen", with key set "n".
     error-->Key link: "billion" = "million", with key set "l".
     error-->Key link: "billions" = "millions", with key set "l".
     error-->Key link: "trillion" = "thirteen", with key set "i".
     error-->7 input keys have identical hash values,
     error-->try different key positions or use option -D.
     

Choosing 1 is seducing, since most of the collided words start with a different character:

     $ gperf -t -k1,3 atoms.gperf >/dev/null
     error-->Key link: "twenty" = "twelve", with key set "et".
     error-->Key link: "trillion" = "thirteen", with key set "it".
     error-->2 input keys have identical hash values,
     error-->try different key positions or use option -D.
     

Finally, obviously, the fourth, fifth and sixth characters will solve the remaining conflicts. Choosing the fourth, we can run gperf with all the options required by our application:

     $ gperf -S 1 -L ANSI-C -t -k1,3,4 atoms.gperf >atoms.c
     

then compile our numeral module, and try it:

     $ m4 -m numeral
     numeral(forty)
     =>40
     numeral(two)
     =>2
     numeral(forty-two)
     error-->m4: stdin: 3: Warning: numeral: invalid number component: forty
     =>
     

Huh? What the f* invalid number component: forty? Clearly this Gperf recognizer demonstrated that it does know forty and two, but it does not recognize forty in forty-two. How come?

We just fell into a huge trap left wide open in Gperf, meant by its authors: although we do specify the length of the word to check, Gperf uses strcmp! In the present case, it is comparing forty-two against forty using strcmp, even though we did mention the five first characters of forty-two were to be checked.

In fact, what we told Gperf is that if it wants to peek at the last character of the word, then this character is at the fifth position, but when comparing strings to make sure the word and the keyword it might be are equal, it uses a plain regular C string comparison.

I cannot emphasize this too much: if you are not submitting 0 ended strings then use --compare-strncmp.

Trying again:

     $ gperf -S 1 -L ANSI-C -t -k1,3,4 --compare-strncmp atoms.gperf >atoms.c
     

and then:

     $ m4 -m numeral
     numeral(forty-two)
     =>42
     numeral(
                       twelve quintillion
     three hundred forty-five quadrillion
       six hundred seventy-eight trillion
                 nine hundred one billion
          two hundred thirty-four million
        five hundred sixty-seven thousand
                     eight hundred ninety
     )
     =>12345678901234567890
     

Notes de bas de page

  1. In American English, the words ``billion'', ``trillion'', etc. denote a different value than in Commonwealth English.

  2. To be more rigorous, internally they are converted into uintmax_t, and output as strings of digits.

  3. There are not strictly equivalent, since in foo ? foo : bar the expression foo is evaluated twice, while in foo ? : bar it is evaluated only once.