This is the kind of pattern matching will be immediately familiar if you
have ever used a UNIX command shell. Here is a short snippet of
shell code you might use to add a new directory to your PATH
,
unless that directory was already included:
case :$PATH: in *:/opt/gnu/bin:*) ;; *) PATH=/opt/gnu/bin:$PATH ;; esac
The Bourne shell case
construct tries to match the argument
string, expanded from :$PATH:
, against each of the patterns that
follow, and executes the code in the first matching branch.
Specifically, if the expansion of :$PATH:
already contains the
substring :/opt/gnu/bin:
, then the first empty branch is
executed. Otherwise, the second branch is executed, and
/opt/gnu/bin:
is prepended to the existing value of PATH
.
There are, then, two patterns in this case
statement, the
first describes the set of all strings that contain the substring
:/opt/gnu/bin:
, and the second describes an infinite set that
will match any string at all! Except for a few special wildcard
characters, each character in the pattern must match itself in the test
string in order for the pattern as a whole to match successfully.
As evidenced by the second pattern in the example above, the *
wildcard character will match successfully against any string of
characters, or indeed against no characters at all. And hence the other
pattern in the example could be read as: "The set of strings which
begin with zero or more characters followed by the substring
:/opt/gnu/bin:
, followed by zero or more characters".
The idiom of adding an extra delimiter at either end of a test
expression, for example the extra :
in :$PATH:
saves you
from having to worry about needing to match the first and last items in
the expansion of $PATH
explicity. Without the extra :
s,
you would need to write this:
case $PATH in /opt/gnu/bin:*) ;; *:/opt/gnu/bin:*) ;; *:/opt/gnu/bin) ;; *) PATH=/opt/gnu/bin:$PATH ;; esac
In addition, wildcard patterns have a number of other special characters:
?
[
a
b]
[
x-
y]
x-
y
. For example [0-9]
will
match against any single digit.
To match the character ]
with a character set, it must be the
first character listed. Similarly, to match a literal -
, the
-
must be the first or last character listed in the set.
[:
class:]
alnum
, alpha
, ascii
, blank
,
cntrl
, digit
, graph
, lower
, print
,
punct
, space
, upper
and xdigit
, which
correspond to the C functions isalpha
, isascii
etc.
Each of these is an example of an atom, as are each of the
non-wildcard self-matching characters. In the context of pattern
matching, an atom is any minimal constituent part of the pattern that
potentially matches one or more characters. For example a
is an
atom that matches the character a
; [0-9]
is an atom that
matches any of the characters 0
through 9
(so is
[[:digit:]]
); and so on. None can be broken down into smaller
valid sub-patterns.
Sometimes the set of strings you are trying to describe with a pattern
should include a literal *
, or another of the characters that are
interpreted as wildcards when you use them in a pattern. By prefixing
those characters with a \
character, the special meaning is turned
off (or escaped) allowing the character to be matched exactly.
Hence, \
is not an atom, since it cannot match anything by
itself. Rather, \
might modify the meaning of the character that
comes right after it: \*
is an atom that matches only the
character *
; \a
will match only the character a
;
and \\
matches the character \
.
(FIXME: ISTR some pecularity with and character sets.)
The GNU C library provides a function which will tell you whether specific strings match against a given pattern:
int fnmatch (const char *pattern, const char *string, int flags) | Function |
This function returns zero if pattern describes a set that contains string. Conversely, if pattern does not match string, this function returns non-zero. |
The flags argument is a bit field built by bitwise oring of the various options that tweak the behaviour of the matching process. Some of the values you could pass are as follows:
FNM_NO_ESCAPE
\
character, you must write
\\
in pattern because the normal behaviour of \
is to
escape any special behaviour of the following character. Passing this
flag makes \
behave like a normal character, but leaves you with
no way to escape special characters in pattern.
FNM_CASEFOLD
FNM_EXTMATCH
|
delimited patterns:
!(
pattern-list)
*(
pattern-list)
?(
pattern-list)
@(
pattern-list)
+(
pattern-list)
There are other flags that can be added to the flags argument of
fnmatch
, which you can find in your system documentation.
#include <stdio.h> #include <fnmatch.h> int main (int argc, const char *argv[]) { const char *pattern; int count; int arg; if (argc < 3) { fprintf (stderr, "USAGE: %s <pattern> <string> ...\n", argv[0]); return 1; } pattern = argv[1]; for (count = 0, arg = 2; arg < argc; ++arg) { switch (fnmatch (pattern, argv[arg], FNM_CASEFOLD|FNM_EXTMATCH)) { case FNM_NOMATCH: break; case 0: printf ("\"%s\" matches \"%s\"\n", pattern, argv[arg]); ++count; break; default: perror ("fnmatch"); return 1; } } if (!count) printf ("\"%s\" does not match any of the other strings.\n"); return 0; }
Example 2.16: Using fnmatch