Suppose you face the following problem: given a fixed and finite set K of k words --henceforth keywords--, find whether a candidate word belongs to K.
There are tons of solutions, of varying efficiency. The question
essentially amounts to writing a generalization of switch
operating on literal strings:
switch (word)
{
  case "foo":
    /* Perform `foo' operations. */
    break;
  case "bar":
    /* Perform `bar' operations. */
    break;
  ...
  default:
    /* WORD is not a keyword. */
    break;
}
A first implementation could be
if (!strcmp (word, "foo"))
  /* Perform `foo' operations. */
else if (!strcmp (word, "bar"))
  /* Perform `bar' operations. */
...
else
  /* WORD is not a keyword. */
which of course is extremely inefficient both when submitted a keyword
(think of the last one) and, worse yet, when submitted a plain word: in
the worst case, we perform k string comparisons. But it is good enough
for a few lookups in a small set of words, for instance command line
arguments.
The proper ways to implement this by hand are well known; binary
search, for a start:
/* Compare the key WORD with the name of the keyword *ELT,
   using strcmp.  */
int keyword_cmp (const void *word, const void *elt);

/* The list of keywords, ordered lexicographically.  */
const keyword_t keywords[] =
{
  ...
  { "bar", &bar_function },
  { "baz", &baz_function },
  ...
};

/* Number of keywords.  */
const size_t keywords_num = sizeof (keywords) / sizeof (*keywords);

{
  /* Look for WORD in the list of keywords.  */
  keyword = bsearch (word, keywords, keywords_num,
                     sizeof (keyword_t), keyword_cmp);
  if (keyword)
    /* Perform the associated action.  */
    keyword->function ();
  else
    /* Not a keyword.  */
    default_action ();
  ...
This is very efficient --the number of string comparisons is bounded by
the logarithm of k-- and more than enough for most uses. Nevertheless,
you have to be sure that your array is properly sorted. For the sake
of reliability, sorting the array beforehand (e.g., using qsort
at program startup) might be a good idea, but of course it incurs some
performance penalty.
But initialization is unlikely to matter, especially if keyword lookup
is frequently performed.

If you are crazy for speed, always looking for wilder sensations, you
will notice that once we have reached this performance bound on the
number of string comparisons, it is the number of character
comparisons that matters. Imagine you have 256 keywords: then the
first character of a non keyword will be compared at least 8 times, no
matter whether any keyword actually starts with that letter. All its
other characters are also likely to be read and compared several
times. Therefore, instead of considering the problem by looking at
rows of strings, you will look at the columns of characters. You will
end up with something like:
switch (word[0])
{
  ...
  case 'b':   /* `b' */
    switch (word[1])
    {
      ...
      case 'a':   /* `ba' */
        switch (word[2])
        {
          ...
          case 'r':   /* `bar' */
            switch (word[3])
            {
              case '\0':   /* `bar\0' */
                /* Perform `bar' function. */
                break;
              ...
              default:
                /* Not a keyword. */
                break;
            } /* switch (word[3]) */
            break;
          ...
        } /* switch (word[2]) */
        break;
      ...
    } /* switch (word[1]) */
    break;
  ...
} /* switch (word[0]) */
In other words, you will implement by hand a small automaton, reduced to a simple tree:
                                r  ,-----.  \0  ,-------.
                             ,----| bar |------| bar\0 |-->
        b  ,---.  a  ,----. /      `-----'      `-------'
      ,----| b |-----| ba |
     /     `---'     `----' \  z   ,-----.  \0  ,-------.
   ,---.                     `----| baz |------| baz\0 |-->
-->|   |--- ...                    `-----'      `-------'
   `---'
     \  f  ,---.  o  ,----.  o  ,-----.  \0  ,-------.
      `----| f |-----| fo |-----| foo |------| foo\0 |-->
           `---'     `----'     `-----'      `-------'
Example 6.1: A Fast Keyword Tree-like Recognizer
where (i) the entering arrow denotes the initial state, (ii) exiting
arrows denote successful keyword recognitions, and (iii) any character
that cannot be "accepted" by a state (i.e., there is no transition, no
arrow, exiting from this node labeled with the character) results in
the successful recognition of a non keyword.
This automaton is definitely unbeatable at recognizing keywords: how
could you be faster, given that each character is compared only
once[1]! The average time taken by the linear approach, presented
first, is proportional to the number of keywords; using a binary
search cuts it down to its logarithm; and finally the automaton needs
virtually a single string comparison: its efficiency is independent of
the number of keywords!
Can you do better than that?...
In typical compiler input, you have to take user identifiers into
account, and on average, being the fastest at recognizing keywords is
not enough: you would like to discriminate non keywords as soon as
possible. Suppose you found out that all the keywords end with _t, or
even that the same character always appears at some fixed position.
Starting by checking the characters at these positions will let you
recognize some non keywords faster than using the automaton above.
Generalizing this idea, we are looking for a function which will let
us know, with high probability, whether a word is not a keyword
just by looking at a few well chosen characters. We can push this idea
even further: if we choose the characters well enough, it might give
us an intuition of which keyword the word might be. Let us name this
function hash, and let it compute an integer, so that processing its
result will be cheap. The algorithm will be:
compute the hash code: the value of hash (word)
if the hash code is not representative of any keyword then
  return failure
else
  for each keyword corresponding to the hash code do
    if this keyword matches then
      return it
    end if
  end for each
  return failure
end if
This algorithm, based on a hash function, is as efficient as the
automaton: its performance is independent of the number of keywords!
It is crucial that the hash code be simple to compute, but the time
spent designing the hash function is immaterial: once you have written
it for a given set of keywords, it will potentially be used millions
of times. But of course a little automation would be most welcome,
since laziness is definitely a quality of programmers...
[1] There is room for debate here: the compiler will transform most of
the switches into plain ifs, instead of computed gotos, especially
when there are few cases, hence there will be several comparisons. A
strict study depends upon the compiler. In addition, the price of the
gotos should be taken into account too. Let us merely say ``each
character is consulted once''.