Wildcard matching algorithms


by Alessandro Cantatore

[2003 Apr 25] Download



Introduction

The code described below is focused on the design of an algorithm to compare a pattern string containing wildcards (also known as metacharacters, global characters or joker characters), with a text string (typically a file name).

All the various algorithms have been written in a unique module with the purpose of testing both their speed and their correctness.

All these routines have been designed to work on OS/2 and follow these simple rules:

A global array of characters, mapped to uppercase, initialized when the program starts, is used for comparing characters case insensitively.

License


Recursive algorithms

While recursive algorithms are usually easier to design (or understand) they are usually much slower and use more resources than non-recursive ones.

A first algorithm I designed a few years ago

I designed this algorithm a few years ago, but revised it recently as the original was buggy and a bit slower than the one I'm presenting here.
This should be quite easy to understand: both the pattern and the string to be tested are scanned character by character and according to the current pattern character (a metacharacter or a ordinary character) the proper action is taken:

  1. if the pattern character is a question mark the match is always considered valid unless the character is a period;
  2. if the pattern character is an asterisk the procedure recursively calls itself, increasing the string pointer, until the string terminates (mismatch) or until a match is found during recursion;
  3. if the pattern character is an ordinary character just return FALSE in case of mismatch.
BOOL szWildMatch1(PSZ pat, PSZ str) {
   while (*str) {
      switch (*pat) {
         case '?':
            if (*str == '.') return FALSE;
            break;
         case '*':
            do { ++pat; } while (*pat == '*'); /* enddo */
            if (!*pat) return TRUE;
            while (*str) if (szWildMatch1(pat, str++)) return TRUE;
            return FALSE;
         default:
            if (mapCaseTable[*str] != mapCaseTable[*pat]) return FALSE;
            break;
      } /* endswitch */
      ++pat, ++str;
   } /* endwhile */
   while (*pat == '*') ++pat;
   return !*pat;
}

A modified version using a slightly different flow

If you understood the logic of the previous routine should have no problem with this one. It just uses if else-if rather than switch and start the loop comparing the pattern and the string characters.

BOOL szWildMatch2(PSZ pat, PSZ str) {
   while (*str) {
      if (mapCaseTable[*str] == mapCaseTable[*pat]) {
         ++pat, ++str;
      } else if (*pat == '?') {
         if (*str == '.') return FALSE;
         ++pat, ++str;
      } else if (*pat == '*') {
         do { ++pat; } while (*pat == '*'); /* enddo */
         if (!*pat) return TRUE;
         while (*str) if (szWildMatch2(pat, str++)) return TRUE;
         return FALSE;
      } else {
         return FALSE;
      } /* endif */
   } /* endwhile */
   while (*pat == '*') ++pat;
   return !*pat;
}

A public domain algorithm published by snippets.org

This was suggested by an anonymous poster in comp.os.os2.programmer.
It was published by snippets.org (file XSTRCMP.C).

As reported in the source file this is:

Derived from code by Arjan Kentner (submitted by Steve Summit),  modified by Bob Stout.
No use restriction is mentioned in the source file.

I modified it to prevent matching between '?' and '.' and to use the mapCaseTable[] array rather than toupper(). I also modified the name of the variables to be consistent in all the algorithms.

While this code is short and looks nice, its flow, on a first sight, is not so easy to understand and, what is worse, it performs around twice slower than the previous routines, as recursion is used much more extensively.
BOOL szWildMatch3(PSZ pat, PSZ str) {
   switch (*pat) {
      case '\0':
         return !*str;
      case '*' :
         return szWildMatch3(pat+1, str) || *str && szWildMatch3(pat, str+1);
      case '?' :
         return *str && (*str != '.') && szWildMatch3(pat+1, str+1);
      default  :
         return (mapCaseTable[*str] == mapCaseTable[*pat]) &&
                 szWildMatch3(pat+1, str+1);
   } /* endswitch */
}
The previous algorithm modified to reduce recursion

In order to improve the performances of the previous algorithm, I modified the program logic to reduce recursion using a flow similar to the faster szWildMatch1.
The resulting routine, while working faster (about 30 % - 80 %) than the previous, is still quite slower than the first.

BOOL szWildMatch4(PSZ pat, PSZ str) {
   while (*str) {
      switch (*pat) {
         case '?':
            if (*str == '.') return FALSE;
            break;
         case '*':
            return !*(pat + 1) ||
                   szWildMatch4(pat + 1, str) ||
                   szWildMatch4(pat, str + 1);
         default:
            if (mapCaseTable[*str] != mapCaseTable[*pat]) return FALSE;
            break;
      } /* endswitch */
      ++str, ++pat;
   } /* endwhile */
   while (*pat == '*') ++pat;
   return !*pat;
}

Non-recursive algorithms

I always knew that non-recursive algorithms are faster than recursive ones, so I was quite excited when, on an old CD I found a non-recursive algorithm.
A public domain algorithm published by the C/C++ Users Journal

I found this on a Walnut Creek CD (C/C++ user group library).

The original code is from the C/C++ Users Journal. The author is Mike Cornelison.
No use restriction is mentioned in the source file or the other files I found in the CD.

I modified this to prevent matching between '?' and '.' and to use the mapCaseTable[] array (the original routine didn't perform case insensitive matching). I also modified the name of the variables to be consistent in all the algorithms and slightly changed the code executed after the new_segment label to avoid unneeded re-assignments to the variable star.

On my PC this performed almost up to 100 times faster than the routine from snippets.org .

The working flow may not look so obvious and some people may object to the use of goto, but what do you think it is better: code just looking pretty to read or code performing fast and efficiently?

BOOL szWildMatch5(PSZ pat, PSZ str) {
   int i, star;

new_segment:

   star = 0;
   if (*pat == '*') {
      star = 1;
      do { pat++; } while (*pat == '*'); /* enddo */
   } /* endif */

test_match:

   for (i = 0; pat[i] && (pat[i] != '*'); i++) {
      if (mapCaseTable[str[i]] != mapCaseTable[pat[i]]) {
         if (!str[i]) return 0;
         if ((pat[i] == '?') && (str[i] != '.')) continue;
         if (!star) return 0;
         str++;
         goto test_match;
      }
   }
   if (pat[i] == '*') {
      str += i;
      pat += i;
      goto new_segment;
   }
   if (!str[i]) return 1;
   if (i && pat[i - 1] == '*') return 1;
   if (!star) return 0;
   str++;
   goto test_match;
}

Algorithm 1 modified to run in non-recursive mode

Encouraged by the performances of the previous algorithm, I decided to redesign my own algorithm, the faster among the recursive ones, to work in non-recursive mode.

The work flow of the program is not much different from that of the recursive method: rather than calling recursively the routine, a label (loopStart), within the routine itself, is recursively called as needed.
An array index keeps track of the current position within the pattern and the tested string.
A different label (starCheck) is used to mimic returning a value from the routine called recursively.

The resulting routine performed up to about 60 % faster than the previous one.

BOOL szWildMatch6(PSZ pat, PSZ str) {
   int i;
   BOOL star = FALSE;

loopStart:
   for (i = 0; str[i]; i++) {
      switch (pat[i]) {
         case '?':
            if (str[i] == '.') goto starCheck;
            break;
         case '*':
            star = TRUE;
            str += i, pat += i;
            do { ++pat; } while (*pat == '*');
            if (!*pat) return TRUE;
            goto loopStart;
         default:
            if (mapCaseTable[str[i]] != mapCaseTable[pat[i]])
               goto starCheck;
            break;
      } /* endswitch */
   } /* endfor */
   while (pat[i] == '*') ++i;
   return (!pat[i]);

starCheck:
   if (!star) return FALSE;
   str++;
   goto loopStart;
}

Evolution of algorithm 6 (pointers are better than array indexes)

Although already satisfied by the speed results, but taken by the coding fever, I decided to check if I could make better than that.

To achieve that I substituted array indexing with self incrementing pointers, anyway the resulting algorithm managed to get just a minimal speed increase (ranging from -1% to +10%).

BOOL szWildMatch7(PSZ pat, PSZ str) {
   PSZ s, p;
   BOOL star = FALSE;

loopStart:
   for (s = str, p = pat; *s; ++s, ++p) {
      switch (*p) {
         case '?':
            if (*s == '.') goto starCheck;
            break;
         case '*':
            star = TRUE;
            str = s, pat = p;
            do { ++pat; } while (*pat == '*');
            if (!*pat) return TRUE;
            goto loopStart;
         default:
            if (mapCaseTable[*s] != mapCaseTable[*p])
               goto starCheck;
            break;
      } /* endswitch */
   } /* endfor */
   while (*p == '*') ++p;
   return (!*p);

starCheck:
   if (!star) return FALSE;
   str++;
   goto loopStart;
}

Final choice

Since this kind of algorithm is likely to be called multiple times in a loop to check multiple file names, a further small improvement might be achieved by preprocessing the pattern string, in order to remove consecutive asterisks, before calling the pattern matching procedure in a DosFindFirst-DosFindNext loop.

The following algorithm is a modified version of the previous one: all the code used to check for consecutive asterisks has been removed.

BOOL szWildMatch8(PSZ pat, PSZ str) {
   PSZ s, p;
   BOOL star = FALSE;

loopStart:
   for (s = str, p = pat; *s; ++s, ++p) {
      switch (*p) {
         case '?':
            if (*s == '.') goto starCheck;
            break;
         case '*':
            star = TRUE;
            str = s, pat = p;
            if (!*++pat) return TRUE;
            goto loopStart;
         default:
            if (mapCaseTable[*s] != mapCaseTable[*p])
               goto starCheck;
            break;
      } /* endswitch */
   } /* endfor */
   if (*p == '*') ++p;
   return (!*p);

starCheck:
   if (!star) return FALSE;
   str++;
   goto loopStart;
}

Speed and correctness tests

The test program, executed without parameters, performs a speed test of all the algorithms.
To check how an algorithm behaves with a pattern and a text string, you have to specifiy, as program parameters, the algorithm number, the pattern and the text string.

To make everithing easier I prepared a REXX script (WILDTEST.CMD) listing various kind of patterns and test strings and a batch file (TESTALL.CMD) to execute those test for all the algorithms.
Then I checked carefully that the first algorithm behaved as expected and that the results of all the other algorithms matched perfectly that first result.

If you want to check that yourself and want to be sure I did not miss any pattern or want to check the test program yourself, with a different compiler or with a different machine (I used a P4 2400@2600), you can download the whole package (30 KB).


Conclusions

The tests I performed showed that non-recursive algorithms are always much faster than the recursive ones.
The fastest algorithms were szWildMatch7() and szWildMatch8(), (remember, you have to preprocess the pattern string to use szWildMatch8).

If you have time to test those with another compiler or another machine and get different results or find any kind of bug please let me know.

Thanks.