Noeud:Wildcard Matching, Noeud « Next »:, Noeud « Up »:Pattern Matching



Wildcard Matching

This is the kind of pattern matching will be immediately familiar if you have ever used a UNIX command shell. Here is a short snippet of shell code you might use to add a new directory to your PATH, unless that directory was already included:

     case :$PATH: in
       *:/opt/gnu/bin:*) ;;
       *)                PATH=/opt/gnu/bin:$PATH ;;
     esac
     

The Bourne shell case construct tries to match the argument string, expanded from :$PATH:, against each of the patterns that follow, and executes the code in the first matching branch. Specifically, if the expansion of :$PATH: already contains the substring :/opt/gnu/bin:, then the first empty branch is executed. Otherwise, the second branch is executed, and /opt/gnu/bin: is prepended to the existing value of PATH.

There are, then, two patterns in this case statement, the first describes the set of all strings that contain the substring :/opt/gnu/bin:, and the second describes an infinite set that will match any string at all! Except for a few special wildcard characters, each character in the pattern must match itself in the test string in order for the pattern as a whole to match successfully.

As evidenced by the second pattern in the example above, the * wildcard character will match successfully against any string of characters, or indeed against no characters at all. And hence the other pattern in the example could be read as: "The set of strings which begin with zero or more characters followed by the substring :/opt/gnu/bin:, followed by zero or more characters".

The idiom of adding an extra delimiter at either end of a test expression, for example the extra : in :$PATH: saves you from having to worry about needing to match the first and last items in the expansion of $PATH explicity. Without the extra :s, you would need to write this:

     case $PATH in
       /opt/gnu/bin:*)   ;;
       *:/opt/gnu/bin:*) ;;
       *:/opt/gnu/bin)   ;;
       *)                PATH=/opt/gnu/bin:$PATH ;;
     esac
     

In addition, wildcard patterns have a number of other special characters:

?
This wildcard will successfully match any single character.
[ab]
[x-y]
Square brackets describe character sets which will successfully match against any one of the characters listed explicitly, or within a range bounded by x-y. For example [0-9] will match against any single digit.

To match the character ] with a character set, it must be the first character listed. Similarly, to match a literal -, the - must be the first or last character listed in the set.

[:class:]
As an alternative to listing character sets explicitly, wildcard patterns can also specify some commonly used sets by name. Valid names are alnum, alpha, ascii, blank, cntrl, digit, graph, lower, print, punct, space, upper and xdigit, which correspond to the C functions isalpha, isascii etc.

Each of these is an example of an atom, as are each of the non-wildcard self-matching characters. In the context of pattern matching, an atom is any minimal constituent part of the pattern that potentially matches one or more characters. For example a is an atom that matches the character a; [0-9] is an atom that matches any of the characters 0 through 9 (so is [[:digit:]]); and so on. None can be broken down into smaller valid sub-patterns.

Sometimes the set of strings you are trying to describe with a pattern should include a literal *, or another of the characters that are interpreted as wildcards when you use them in a pattern. By prefixing those characters with a \ character, the special meaning is turned off (or escaped) allowing the character to be matched exactly. Hence, \ is not an atom, since it cannot match anything by itself. Rather, \ might modify the meaning of the character that comes right after it: \* is an atom that matches only the character *; \a will match only the character a; and \\ matches the character \.

(FIXME: ISTR some pecularity with and character sets.)

The GNU C library provides a function which will tell you whether specific strings match against a given pattern:

int fnmatch (const char *pattern, const char *string, int flags) Function
This function returns zero if pattern describes a set that contains string. Conversely, if pattern does not match string, this function returns non-zero.

The flags argument is a bit field built by bitwise oring of the various options that tweak the behaviour of the matching process. Some of the values you could pass are as follows:

FNM_NO_ESCAPE
Normally, if you want to match a \ character, you must write \\ in pattern because the normal behaviour of \ is to escape any special behaviour of the following character. Passing this flag makes \ behave like a normal character, but leaves you with no way to escape special characters in pattern.
FNM_CASEFOLD
Passing this flag causes the pattern to make no distinction between upper and lower case characters when deciding whether pattern matches string.
FNM_EXTMATCH
When you pass this flag, you can use the following additional atoms in pattern to help describe the set of matching strings, where pattern-list is a list of | delimited patterns:
!(pattern-list)
This pattern matches if none of the patterns in pattern-list could match string.
*(pattern-list)
This pattern matches if zero or more occurences of the patterns in pattern-list match string.
?(pattern-list)
This pattern matches if zero or one occurences of the patterns in pattern-list match string.
@(pattern-list)
This pattern matches if exactly one occurence of any pattern in pattern-list match string.
+(pattern-list)
This pattern matches if one or more occurences of the patterns in pattern-list match string.

There are other flags that can be added to the flags argument of fnmatch, which you can find in your system documentation.

     #include <stdio.h>
     #include <fnmatch.h>
     
     int
     main (int argc, const char *argv[])
     {
       const char *pattern;
       int count;
       int arg;
     
       if (argc < 3)
         {
           fprintf (stderr, "USAGE: %s <pattern> <string> ...\n", argv[0]);
           return 1;
         }
     
       pattern = argv[1];
     
       for (count = 0, arg = 2; arg < argc; ++arg)
         {
           switch (fnmatch (pattern, argv[arg], FNM_CASEFOLD|FNM_EXTMATCH))
             {
     	case FNM_NOMATCH:
     	  break;
     
     	case 0:
               printf ("\"%s\" matches \"%s\"\n", pattern, argv[arg]);
               ++count;
     	  break;
     
     	default:
     	  perror ("fnmatch");
     	  return 1;
             }
         }
     
       if (!count)
         printf ("\"%s\" does not match any of the other strings.\n");
     
       return 0;
     }
     
     Example 2.16: Using fnmatch