Noeud:Looking for Keywords, Noeud « Next »:, Noeud « Up »:Scanning with Gperf



Looking for Keywords

Suppose you face the following problem: you are given a fixed and finite set K of k words --henceforth keywords--, find whether a candidate word belongs to K.

There are tons of solutions, of varying efficiency. The question essentially amounts to writing a generalization of switch operating on literal strings:

     switch (word)
     {
       case "foo":
         /* perform `foo' operations. */
         break;
       case "bar":
         /* perform `bar' operations. */
         break;
       ...
       default:
         /* WORD is not a keyword. */
         break;
     }
     

A first implementation could be

     if (!strcmp (word, "foo"))
       /* perform `foo' operations. */
     else if (!strcmp (word, "bar"))
       /* perform `bar' operations. */
       ...
     else
       /* WORD is not a keyword. */
     

which of course is extremely inefficient both when submitted a keyword (think of the last one) and worse yet, when submitted a plain word. In the worst case, we perform k string comparisons. But it is good enough for a few lookups in a small set of words, for instance command line arguments.

The proper ways to implement this by hand are well known, binary search for a start:

     /* Compare the keys of KEY1 and KEY2 with strcmp. */
     int keyword_cmp (const keyword_t *key1, const keyword_t *key2);
     
     /* The list of keywords, ordered lexicographically. */
     const keyword_t keywords[] =
     {
       ...
       { "bar", &bar_function },
       { "baz", &baz_function },
       ...
     }
     
     /* Number of keywords. */
     const size_t keywords_num = sizeof (keywords) / sizeof (*keywords);
     
     {
       /* Look for WORD in the list of keywords. */
       keyword = bsearch (word, keywords, keyword_num,
                          sizeof (keyword_t), keyword_cmp);
       if (keyword)
         /* Perform the associated action. */
         keyword->function ();
       else
         /* Not a keyword. */
         default_action ();
       ...
     

This is very efficient --the number of string comparisons is bounded by the logarithm of k-- and way enough for most uses. Nevertheless, you have to be sure that your array is properly sorted. For sake of reliability, sorting the array beforehand (e.g., using qsort at program startup) might be a good idea, but of course incurs some performance penalty.

But initialization is unlikely to matter, especially if keyword lookup is frequently performed. If you are crazy for speed, always looking for wilder sensations, you will notice that once we reached some performance bounds with respect the number of string comparisons, then it is the number of character comparisons that matters. Imagine you have 256 keywords, then the first character of a non keyword will be compared at least 8 times, no matter whether any keyword actually starts with this letter. All its other characters are also likely to be read and compared several times. Therefore, instead of considering the problem looking at rows of strings, you will look at the columns of characters. You will end up with something like:

     switch (word[0])
     {
       ...
       case 'b':                          /* `b' */
         switch (word[1])
         {
            ...
            case 'a':                     /* `ba' */
              switch (word[2])
              {
                 ...
                 case 'r':                /* `bar' */
                   switch (word[3])
                   {
                     case '\0':           /* `bar\0' */
                       /* Perform `bar' function. */
                       break;
                     ...
                     default:
                       /* Not a keyword. */
                       break;
                   } /* switch (word[3]) */
                   break;
                 ...
              } /* switch (word[2]) */
              break;
            ...
          } /* switch (word[1]) */
          break;
       ...
     } /* switch (word[0]) */
     

In other words, you will implement by hand a small automaton, reduced to a simple tree:

                               ,-----. \0 ,-------.
                           r.--| bar |----| bar\0 |-->
              ,---. a ,----/   `-----'    `-------'
              | b |---| ba |
              /---'   `----\   ,-----. \0 ,-------.
            b/             z`--| baz |----| baz\0 |-->
        ,---/                  `-----'    `-------'
     -->|   |--- ...
        `---\
            f\
              \---. o ,----. o ,-----. \0 ,-------.
              | f |---| fo |---| foo |----| foo\0 |-->
              `---'   `----'   `-----'    `-------'
     
     Example 6.1: A Fast Keyword Tree-like Recognizer
     

where (i) the entering arrow denotes the initial state, (ii) exiting arrows denote successful keyword recognitions, and (iii), any characters that cannot be "accepted" by a state (i.e., there is no transition, no arrow, exiting from this node, labeled with the character), result in the successful recognition of a non keyword.

This automaton is definitely unbeatable at recognizing keywords: how could you be faster given that each character is compared only once1! The average time taken by linear approach, first exposed, is proportional to the number of keywords; using a binary search cuts it down to its logarithm; and finally the automaton needs virtually a single string comparison: its efficiency is independent of the number of keywords!

Can you do better than that?...

In typical compilers input, you have to take user identifiers into account, and on average, being the fastest at recognizing keywords is not enough: you would like to discriminate non keywords as soon as possible. Suppose you found out that all the keywords end with _t, or even that the same character always appears at some fixed position. Starting by checking the characters at these positions will let you recognize some non keywords faster than using the automaton above.

Generalizing this idea, we are looking for a function which will let us know with a high probability whether a word is not a keyword just by looking at a few well chosen characters. We can improve even further this idea: if we choose the characters well enough, it might give us an intuition of the keyword the word might be. Let us name this function hash, and let it compute an integer, so that processing its result will be cheap. The algorithm will be:

     compute the hash code: the value of hash (word)
     
     if the hash code is not representative of any keyword then
       return failure
     else
       for each keyword corresponding to the hash code do
         if this keyword matches then
           return it
         end if
       end for each
       return failure
     end if
     

This algorithm, based on a hash function, is as efficient as the automaton was: its performances are independent of the number of keywords!

It is crucial that the hash code be simple to compute, but the time spent in designing hash function is inessential: once you wrote it for a given set of keywords, it will be potentially used millions of times. But of course a little of automation would be most welcome, since laziness is definitely a quality of programmers...


Notes de bas de page

  1. There is room for debate here: the compiler will transform most of the switch into plain ifs, instead of computed gotos, especially when there are few cases, hence there will be several comparisons. A strict study depends upon the compiler. In addition, the price of the gotos should be taken into account too. Let us merely say ``each character is consulted once''.