Crafting Your Own Regular Expressions
Up to this point, this chapter has introduced you to the reFind() and reReplace() functions (and their case-insensitive counterparts). Along the way, you learned about a number of RegEx concepts, such as subexpressions and backreferences. You've also seen some decent examples of actual RegEx criteria syntax (that is, the various wildcards you can use in regular expressions), but you haven't been formally introduced to what each of the wildcards does. The remainder of this chapter will focus on the regular expressions themselves.
Understanding Literals and Metacharacters
Every regular expression you write includes two types of characters: literals and metacharacters.
Literals, or literal characters, are normal text characters that represent themselves literally. In other words, literals are all the characters in a RegEx that aren't wildcards of one form or another. In the email RegEx that's been used several times in this chapter (see Listing 13.1), the only literal character is the @ sign. If your search involves the word dog, your RegEx will likely contain the literal d, o, and g characters.
Metacharacters are the various special characters (what I've been calling wildcards up to this point) that have special meaning to the regular expression engine. You've already seen a few of the most common metacharacters, such as the [, ], {, }, and + characters. You'll learn about all the rest in the pages to come.
NOTE: Up to this point, I've been using the term wildcard as an approximate synonym for metacharacter. Wildcard is less technical and perhaps a bit less precise, but it rolls off the tongue a lot more easily and is more intuitively understood. I imagine you've understood what I've meant by wildcard all along, whereas metacharacter might have tripped us up a bit. I'll continue to use wildcard during the less formal parts of the remaining discussion.
Including Metacharacters Literally
Sometimes, you need to include one of the metacharacters as a literal. To do so, you escape the metacharacter by preceding it with a backslash. You saw this demonstrated in Listing 13.2, where the sequences \( and \) were used to denote literal parentheses characters (that is, parentheses that should actually be searched for, rather than having their usual special meaning of indicating a subexpression).
NOTE: If you need to search for a literal backslash, escape the backslash with another backslash. Just use two backslashes together, as in \\.
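As a small sketch of escaping in practice, the following snippet searches an invented sample string for literal parentheses and for a literal backslash. The variable names and the sample text are assumptions made for illustration only.

```cfml
<!--- Sketch: escaping metacharacters so they are matched literally. --->
<!--- The sample string and variable names are illustrative only. --->
<cfset sample = "Call (212) 555-1234 or see C:\temp\notes.txt">

<!--- \( and \) match real parentheses instead of starting a subexpression. --->
<cfset parenPos = reFind("\([0-9]{3}\)", sample)>

<!--- \\ matches a single literal backslash. --->
<cfset slashPos = reFind("\\", sample)>

<cfoutput>
Parenthesized area code starts at position #parenPos#.<br>
First backslash is at position #slashPos#.
</cfoutput>
```

Because CFML strings do not treat the backslash as an escape character, the backslashes reach the RegEx engine exactly as typed.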
Introducing the Cast of Metacharacters
The RegEx implementation in ColdFusion supports a lot of metacharacters, which can be broken into the conceptual groups shown in Table 13.6.
TYPE | DESCRIPTION |
---|---|
Character classes | Character classes define a set of characters that will match. They are defined with square brackets: [aeiou] matches any single vowel; [0-9] matches any single digit, and [^0-9] matches any single character except a digit. There are also special shortcuts for often-used sets of characters, such as \w for any letter or number, or \s for any whitespace character. Finally, there's the dot character (.), which matches any character at all. |
Quantifiers | These metacharacters allow you to specify how many times a certain item can appear to still be considered a match. Quantifiers include ? for optional matches, + for one or more matches, and * for any number of matches (including none). There are also the interval quantifiers: {num} for num number of matches; {num,max} for num to max number of matches; and {num,} for num or more matches. |
Alternation | You can establish OR conditions in your regular expressions with the | character. Parentheses constrain how far the | reaches, so (you|we) matches you or we. |
String anchors | String anchors let you specify that a match must occur at a particular location in a chunk of text. Anchors include ^ for matches at the beginning of the text (or line) and $ for matches at the end. There are also the \A and \Z anchors, which are similar but always refer to the beginning and end of the entire text, even in multiline mode. |
Escape sequences | Escape sequences are mostly for matching certain unprintable characters; for example, \t to match tabs or \n to match newlines. |
Modifiers | Modifiers allow you to turn on different types of RegEx behavior for use in special cases. Modifiers include (?m) for line-by-line matching and (?=) for lookahead matching. |
Metacharacters 101: Character Classes
Of all of the metacharacters available in regular expressions, character classes are probably the most important. Character classes are a way of specifying a set of characters, any one of which can be considered a match. You can specify your own classes or use any number of predefined classes supported by RegEx.
Specifying Character Classes with [ ]
You can specify any set of characters as a character class with the square bracket characters [ and ]. The class [aeiouAEIOU] will match any vowel; [12345] will match a 1, 2, 3, 4, or 5 character. For instance, perhaps your last name is Andersen and people often misspell it as Anderson or forget to capitalize the first letter. You could find any of the various spellings using [Aa]nders[eo]n as the regular expression.
The hyphen character has special meaning when it is between a set of square brackets: It indicates a range of acceptable characters. For instance, [1-5] is easier to type than [12345] and will still match a 1, 2, 3, 4, or 5 character. Very common character classes are [A-Za-z] for matching any letter, and [0-9] for matching any single digit. If your company uses an ID number composed of two letters followed by a dash and then three numbers, you could use this as the regular expression:
[A-Z][A-Z]-[0-9][0-9][0-9]
As you'll learn in Metacharacters 102, you could use quantifiers as an easier way of specifying the part consisting of three numbers at the end.
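As a rough illustration of these two character-class expressions, the following sketch runs them against made-up sample strings. The variable names and sample text are assumptions, and reMatch() assumes ColdFusion 8 or later.

```cfml
<!--- Sample text is invented for illustration. --->
<cfset names = "anderson, Andersen, Anderson, Sanders">

<!--- Collect every spelling variant; Sanders lacks the trailing [eo]n. --->
<cfset variants = reMatch("[Aa]nders[eo]n", names)>

<!--- Two uppercase letters, a dash, then three digits. --->
<cfset idPos = reFind("[A-Z][A-Z]-[0-9][0-9][0-9]", "Badge NY-042 issued")>

<cfoutput>
Spellings found: #arrayLen(variants)#<br>
ID starts at position #idPos#
</cfoutput>
```

Note that reFind() and reMatch() are case sensitive, which is why the class [Aa] is needed to accept both capitalizations.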
Negating a Character Class with ^
If the square bracket contents for a character class start with a caret character, the character class is negated, meaning that the class will match any character that isn't in the class. For example, [^A-Za-z0-9] matches anything other than a number or letter, and [^aeiouAEIOU] matches anything other than a vowel.
NOTE: Keep in mind that there are lots of characters other than letters and numbers, including unprintable characters such as tabs and newlines. So, while you may think at first glance that [^aeiouAEIOU] would simply match all consonants, that's not all it will match. It will also match unprintable characters, and all other characters, too, including punctuation characters (commas, periods, and the like).
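One common use of a negated class is stripping out everything you don't want. The sketch below, using invented sample values and variable names, removes every non-digit from a phone number and then shows that [^aeiouAEIOU] removes far more than consonants.

```cfml
<!--- Sample values are illustrative only. --->
<cfset phone = "(212) 555-1234">

<!--- Remove every character that is not a digit. --->
<cfset digitsOnly = reReplace(phone, "[^0-9]", "", "ALL")>

<!--- Remove everything that is not a vowel: consonants, the comma, and the space all go. --->
<cfset onlyVowels = reReplace("Hello, World", "[^aeiouAEIOU]", "", "ALL")>

<cfoutput>
#digitsOnly#<br>
#onlyVowels#
</cfoutput>
```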
Common Character Classes
Because certain character classes are called for frequently (such as [A-Za-z] for matching any letter, and [0-9] for matching any digit), ColdFusion supports a number of shortcuts for the most commonly needed character classes. Different regular expression tools support slightly different ways of specifying these shortcuts, but most adhere to the shortcuts supported by Perl or by POSIX. ColdFusion's RegEx implementation supports both. The Perl shortcuts, in particular, are really easy to type.
Table 13.7 shows common character classes you might need to use in your regular expressions, with the Perl-style and POSIX-style shortcuts for each. The Normal column shows how to write the character class using the normal square bracket syntax. The Perl Shortcut and POSIX Shortcut columns show the shortcuts for each class; for some of the classes, there is a POSIX shortcut but no corresponding Perl shortcut, in which case the Perl Shortcut column is left blank. A few shortcuts shown at the bottom of the table would be virtually impossible to type using the manual [ ] syntax, so the Normal column is left blank.
NORMAL | PERL SHORTCUT | POSIX SHORTCUT | MATCHES |
---|---|---|---|
[A-Z] | | [[:upper:]] | Any uppercase letter. |
[a-z] | | [[:lower:]] | Any lowercase letter. |
[A-Za-z] | | [[:alpha:]] | Any letter, regardless of case. |
[0-9] | \d | [[:digit:]] | Any number character (digit). |
[^0-9] | \D | [^[:digit:]] | Any character other than a number. |
[0-9A-Za-z] | \w | [[:alnum:]] | Any letter or number character. |
[^0-9A-Za-z] | \W | [^[:alnum:]] | Any character other than a number or letter. |
[ \t] | | [[:blank:]] | A space or a tab. |
[ \t\n\r\f] | \s | [[:space:]] | Any whitespace character, which means any spaces, tabs, or any of the end-of-line indicators (newlines, form feeds, and carriage returns). |
[^ \t\n\r\f] | \S | [[:graph:]] | Any nonwhitespace character. |
 | . (dot) | | Any character at all. It's important to understand that in ColdFusion, the dot character always matches newlines, which is not always the case with Perl. |
As an example, recall the ID-number expression from earlier:
[A-Z][A-Z]-[0-9][0-9][0-9]
You can use Perl-style shortcuts to make the RegEx easier to type and look at, like this:
[A-Z][A-Z]-\d\d\d
Or, you can use POSIX-style shortcuts, like so:
[[:upper:]][[:upper:]]-[[:digit:]][[:digit:]][[:digit:]]
Feel free to mix and match the two types of shortcuts, like so:
[[:upper:]][[:upper:]]-\d\d\d
NOTE: The POSIX shortcuts can be negated with the ^ character, as shown in the POSIX Shortcut column for the [^0-9] class in Table 13.7.
NOTE: You might be wondering why you would use [[:upper:]] instead of [A-Z], because it doesn't seem to be much of a shortcut at all (there's actually more to type). The main benefit is that the POSIX shortcuts attempt to understand uppercase and lowercase letters for each language, whereas something like [A-Z] will work only for English and other Roman-style character sets.
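To see that the bracket, Perl-style, and POSIX-style spellings of the ID expression are interchangeable, the following sketch runs all three against the same invented string. The variable names and the sample ID are assumptions for illustration.

```cfml
<!--- Sample text is illustrative only. --->
<cfset idText = "Employee ID: QA-417">

<!--- The same pattern written three ways. --->
<cfset plainPos = reFind("[A-Z][A-Z]-[0-9][0-9][0-9]", idText)>
<cfset perlPos  = reFind("[A-Z][A-Z]-\d\d\d", idText)>
<cfset posixPos = reFind("[[:upper:]][[:upper:]]-[[:digit:]][[:digit:]][[:digit:]]", idText)>

<cfoutput>
Bracket syntax: #plainPos#<br>
Perl shortcuts: #perlPos#<br>
POSIX shortcuts: #posixPos#
</cfoutput>
```

All three calls should report the same position, because the letters ID are followed by a colon rather than a dash and so do not satisfy the pattern.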
Metacharacters 102: Quantifiers
As you learned in Table 13.6, quantifiers allow you to specify how many times certain parts of a RegEx can match for the overall regular expression to be considered a match. You will learn about the many quantifiers in detail as we work through the Metacharacters section.
Regardless of which quantifier you're using, you always place it right after the item that you want to affect. That item might be a single character, a character class, or the set of parentheses that sets off a subexpression. If character classes are the foundation of what regular expressions are about, quantifiers give the technology its muscle; without them, it would be hard to solve anything but simple problems with RegEx.
Table 13.8 lists the quantifier metacharacters available for your use.
Using Quantifiers
Let's look at a few examples of using character classes and quantifiers. Say you need to create a regular expression that will match a U.S. ZIP code. Let's start off with the simple five-digit version of a ZIP code. Using the character class skills you learned in Metacharacters 101, you know you could use this:
[0-9][0-9][0-9][0-9][0-9]
or this:
\d\d\d\d\d
You can use the {num} quantifier from Table 13.8 to avoid having to type a separate class for each digit, like so:
[0-9]{5}
or like so:
\d{5}
Now let's say you want to match the nine-digit version of a ZIP code. Just add another class and quantifier sequence, like so:
\d{5}-\d{4}
NOTE: For those of you who aren't from the U.S., sorry to use such a culturally myopic example. It's just a natural one to start off with. Anyway, U.S. ZIP codes are just the postal code used in a mailing address. ZIP codes come in two forms. For a long time, they were simply five-digit numbers. Later, the postal service introduced a nine-digit version, in the form 99999-9999. Both forms are used in practice today.
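As a quick sketch of the interval quantifier at work, the following snippet pulls ZIP codes out of an invented sentence. The sample text and variable names are assumptions, and reMatch() assumes ColdFusion 8 or later.

```cfml
<!--- Sample text is illustrative only. --->
<cfset address = "Ship to 01201-9809 or 90210">

<!--- Every run of exactly five digits in a row. --->
<cfset fiveDigit = reMatch("[0-9]{5}", address)>

<!--- Only the nine-digit form: five digits, a hyphen, four digits. --->
<cfset nineDigit = reMatch("\d{5}-\d{4}", address)>

<cfoutput>
Five-digit runs: #arrayLen(fiveDigit)#<br>
Nine-digit ZIP codes: #arrayLen(nineDigit)#
</cfoutput>
```

The four digits after the hyphen are not reported by the first expression, because {5} requires five digits in a row.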
Making Certain Portions Be Optional with ?
Okay, what if you wanted to accept either five- or nine-digit ZIP codes? You can use the ? quantifier to say that the second portion of the code is optional, as in the following (Figure 13.13):
\d{5}(-\d{4})?
Figure 13.13. The ? operator handles items that don't necessarily need to be present.
Including One or More Matches with +
Another cool quantifier is the + metacharacter. Because it matches one or more times, + is essential for matching substrings that will vary in length. That turns out to describe the majority of regular expression problems, so you'll be using + a lot.
The following matches one or more digits:
[0-9]+
Like ? and all the other quantifiers, the + character respects parentheses. When it follows a parenthesized group, + matches the entire group one or more times. You can also nest these sets of parentheses within one another, an approach that forms the basis of the email address RegEx you have seen throughout this chapter:
[\w._]+@[\w_]+(\.[\w_]+)+
That looks complex at first, but it's not so bad if you concentrate on each portion separately. The first portion is in charge of matching the username part of the email address (the part before the @ sign). I came up with [\w._]+ for this part, which matches one or more letters, numbers, dots, or underscores. After the @ sign, the next portion is [\w_]+, which is almost the same except that it does not allow dots. The last portion, (\.[\w_]+)+, matches a literal dot followed by one or more letters, numbers, or underscores, and the trailing + lets that whole group repeat. The version used in the earlier listings wraps each portion in additional parentheses (see Figure 13.4). Those parentheses don't have anything to do with the + sign, and don't affect which addresses actually match. They just make it possible to capture each portion of the match separately.
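To make the email expression more concrete, here is a minimal sketch that extracts addresses from an invented sentence. The addresses and variable names are made up for illustration, and reMatch() assumes ColdFusion 8 or later.

```cfml
<!--- Sample text is illustrative only. --->
<cfset note = "Contact nate@example.com or belinda.foxile@mail.example.org today.">

<!--- Username, @ sign, first domain part, then one or more dot-separated parts. --->
<cfset emails = reMatch("[\w._]+@[\w_]+(\.[\w_]+)+", note)>

<cfoutput>
<cfloop array="#emails#" index="address">
  #address#<br>
</cfloop>
</cfoutput>
```

The final (\.[\w_]+)+ group matches once for .com and twice for .example.org, which is the repetition that + applies to a parenthesized group.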
Matching Any Number of Matches with *
The * metacharacter is similar to + in that it will match one, two, or any other number of whatever preceded it. The difference is that it will also match zero times: It matches even if the preceding item isn't present at all. I like to think of this quantifier as meaning "any amount of the preceding, but let it be optional."
For instance, it could be used to find <b> (boldface) tags in a chunk of text:
<b>.*</b>
In plain English, this means to match a <b>, then any amount of anything, then </b>. This seems sensible enough. If you try it against this text:
The <b>Bear</b> walked alone
you will find that the <b>Bear</b> part is what matches, which is what you would expect. However, if you try it against this text:
The <b>Bear</b> and the <b>Fox</b> walked hand in hand.
it will match the <b>Bear</b> and the <b>Fox</b> part of the text. That is, the RegEx engine finds the first <b>, then matches everything up to the last </b>. What's going on? Although it might seem counterintuitive at first, it's important to understand that the .* part really does mean "any number of any characters." There's nothing in the <b>.*</b> expression that says that the .* part isn't supposed to match the characters in the </b> part. This concept is crucial to understand when crafting regular expressions.
By default, regular expressions are "greedy," which means that the engine always takes the most permissive interpretation of your RegEx. To put it another way, the engine will always assume that you want the longest possible match. The ColdFusion documentation refers to this as maximal matching, but most regular expression references call it greedy matching. One way to fix the boldface-text example is to replace the .* with [^<]*, like so:
<b>[^<]*</b>
See the difference? In plain English, this now means "match <b>, then match any number of anything that isn't a <, then match </b>." When used against the previous text sample, this version of the RegEx will correctly match <b>Bear</b> and <b>Fox</b>, making it a pretty good solution to the problem. However, it will fail if the text contains any < characters between the <b> and </b>, like this:
The <b><i>Bear</i></b> and the <b>Fox</b> walked hand in hand.
Using this text, the <b>[^<]*</b> expression will only match <b>Fox</b>. Bummer. All is not lost, though. You can tell the RegEx engine not to use greedy matching, which brings us to our next topic.
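The difference between the greedy .* and the more restrictive [^<]* can be sketched with the two-tag sample text from this section. The variable names are assumptions for illustration, and reMatch() assumes ColdFusion 8 or later.

```cfml
<cfset story = "The <b>Bear</b> and the <b>Fox</b> walked hand in hand.">

<!--- Greedy: .* runs from the first <b> all the way to the last </b>. --->
<cfset greedy = reMatch("<b>.*</b>", story)>

<!--- Restricted: [^<]* cannot cross a < character, so each tag pair matches on its own. --->
<cfset restricted = reMatch("<b>[^<]*</b>", story)>

<cfoutput>
Greedy matches: #arrayLen(greedy)#<br>
Restricted matches: #arrayLen(restricted)#
</cfoutput>
```

The greedy version yields a single long match that swallows both tags, while the restricted version yields one match per tag pair.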
Using Minimal Matching (Non-Greedy) Quantifiers
As you have seen, the fact that regular expressions match the longest possible substring by default (maximal, or greedy, matching) can sometimes be a problem. In such situations, you can use slightly different quantifiers to tell the RegEx engine to match the shortest possible substring instead. The ColdFusion documentation refers to this as minimal matching (as opposed to maximal matching), but most RegEx texts call it non-greedy matching.
There is a non-greedy version of each of the quantifiers shown in Table 13.8. To indicate that you want to use the non-greedy version, follow the quantifier with a ? character, as shown in Table 13.9.
QUANTIFIER | DESCRIPTION |
---|---|
?? | Non-greedy version of ?, which means that the preceding item is optional. The difference in the non-greedy version is that the RegEx engine will first try to match based on the item's absence. In other words, the item will only be included in the match if it is not possible to get a match without the item. |
+? | Non-greedy version of +, which means that the preceding item will match at least once, but as few times as possible. |
*? | Non-greedy version of *, which means that the preceding item can appear any number of times (including none at all), but the shortest possible string will always be found. |
{num,max}? | Non-greedy version of {num,max}, which means that the preceding item will match between num and max times, but as few times as actually possible. |
{num,}? | Non-greedy version of {num,}, which means that the preceding item will match at least num times, but as few times as actually possible. |
Applied to the boldface example from the previous section, the non-greedy version of * looks like this:
<b>(.*?)</b>
If you wanted to ensure that there was at least one character between the <b> and </b> tags, you could use the non-greedy version of + instead of *, like so:
<b>(.+?)</b>
This expression will match all bold text, but not empty <b></b> tags.
NOTE: Non-greedy matching is sometimes called lazy matching, meaning that the RegEx engine is "lazily" trying to match as little text as possible.
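Here is a minimal sketch of non-greedy matching against the nested-tag sample that defeated [^<]* earlier. The variable names are assumptions for illustration, and reMatch() assumes ColdFusion 8 or later.

```cfml
<cfset story = "The <b><i>Bear</i></b> and the <b>Fox</b> walked hand in hand.">

<!--- Non-greedy: .*? stops at the first </b> it can reach. --->
<cfset boldParts = reMatch("<b>.*?</b>", story)>

<!--- The subexpression captures the text between the tags; \1 puts it back without them. --->
<cfset unbolded = reReplace(story, "<b>(.*?)</b>", "\1", "ALL")>

<cfoutput>
Bold sections found: #arrayLen(boldParts)#<br>
#htmlEditFormat(unbolded)#
</cfoutput>
```

The nested <i> tag no longer causes a problem, because .*? is free to match a < character and simply stops at the earliest closing </b>.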
Metacharacters 201: Alternation
Sometimes you might need to find matches that contain one string or pattern, or another string or pattern. That is, sometimes you need the conceptual equivalent of what would be called an "or" in normal programming languages, or the OR part of a SQL query.
To perform "or" matches with regular expressions, use the | character (usually called the pipe character). Each pipe represents the idea of "or." Just as in normal programming, the | character's effect can be constrained with parentheses, so Number (1|2) is different from Number 1|2. The first would match the string Number 1 or Number 2, whereas the second would match Number 1 or just the number 2.
The following RegEx would match the phrase My Red Fox, My Brown Fox, or My Beige Fox. It would also match My 1 Fox, My 2 Foxes, My 3 Foxes, or any other number of foxes:
My ((Red|Brown|Beige|1) Fox|[0-9]+ Foxes)\b
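The effect of parentheses on alternation can be sketched as follows. The sample strings and variable names are invented for illustration, the fox expression is written with a space between the number and Foxes, and reMatch() assumes ColdFusion 8 or later.

```cfml
<!--- With parentheses, the alternation covers only the digit. --->
<cfset groupedPos = reFind("Number (1|2)", "Item 2")>

<!--- Without parentheses, the alternatives are "Number 1" and a bare "2". --->
<cfset ungroupedPos = reFind("Number 1|2", "Item 2")>

<!--- Nested alternation: a color or 1 followed by Fox, or a number followed by Foxes. --->
<cfset foxes = reMatch("My ((Red|Brown|Beige|1) Fox|[0-9]+ Foxes)\b", "My Red Fox, My 2 Foxes, My Green Fox")>

<cfoutput>
Grouped: #groupedPos#, ungrouped: #ungroupedPos#<br>
Fox phrases found: #arrayLen(foxes)#
</cfoutput>
```

The grouped expression finds nothing in Item 2, while the ungrouped one matches the lone 2, which is usually not what was intended.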
Metacharacters 202: Word Boundaries
Often, you will need to write regular expressions that are aware of word boundaries. ColdFusion supports the Perl-style \b and \B boundary sequences, as described in Table 13.10.
SEQUENCE | MEANING |
---|---|
\b | Matches what can generally be described in plain English as a word boundary. Technically, a boundary is defined as the transition between an alphanumeric character and a nonalphanumeric character. |
\B | The opposite of \b, matching any position that is not a word boundary. Generally less useful than \b in most scenarios. |
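As a small sketch of why \b matters, the following snippet replaces the word cat only where it stands alone. The sample sentence and variable names are assumptions for illustration.

```cfml
<!--- Sample text is illustrative only. --->
<cfset sentence = "The cat sat near the catalog; concatenate cat.">

<!--- Without boundaries, every occurrence of the letters c-a-t is replaced. --->
<cfset naive = reReplace(sentence, "cat", "dog", "ALL")>

<!--- With \b on both sides, only the standalone word is replaced. --->
<cfset careful = reReplace(sentence, "\bcat\b", "dog", "ALL")>

<cfoutput>
#naive#<br>
#careful#
</cfoutput>
```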
Metacharacters 203: String Anchors
String anchors are conceptually similar to boundary sequences (see the preceding section), because they are another way of making sure that your regular expression doesn't find undesired "partial matches." Whereas boundaries are about making sure the match "bumps up" against the beginning or end of a word, string anchors are about making sure the match "bumps up" against the beginning or end of the entire chunk of text being searched.
The RegEx string anchors are listed in Table 13.11.
ANCHOR | DESCRIPTION |
---|---|
^ | Matches the beginning of the chunk of text being searched. Or, in multiline mode, matches the beginning of a line (multiline mode is discussed next). |
$ | Matches the end of the text being searched. Or, in multiline mode, matches the end of a line. |
\A | Always matches the beginning of the chunk of text being searched, regardless of whether multiline mode is being used. |
\Z | Always matches the end of the text being searched, regardless of multiline mode. |
For instance, suppose you use the following code to check that a form field contains a nine-digit ZIP code:
<cfif reFind("\d{5}-\d{4}", FORM.zipCodePlus4)>
Okay
<cfelse>
Not Valid
</cfif>
This regular expression seems to do the job. It displays "Okay" if the user enters something like 01201-9809, and "Not Valid" if the user enters 01201-98 or just 01201. However, it will also display "Okay" if the user types Foo 01201-9809 or 01201-9809Bar, because there is nothing about the regular expression that says the ZIP code must be the only thing the user enters. The solution is to anchor the regular expression to the beginning and end of the string using ^ and $, like so:
<cfif reFind("^\d{5}-\d{4}$", FORM.zipCodePlus4)>
Okay
<cfelse>
Not Valid
</cfif>
Alternatively, you could use the \A and \Z sequences, like so:
<cfif reFind("\A\d{5}-\d{4}\Z", FORM.zipCodePlus4)>
Okay
<cfelse>
Not Valid
</cfif>
These two snippets will perform the same way, because ^ is synonymous with \A (and $ is synonymous with \Z) unless the regular expression uses multiline mode.
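Anchors combine naturally with the optional-group technique from Metacharacters 102. The following sketch wraps an anchored expression in a small UDF that accepts either the five-digit or the nine-digit form; the function name isUsZipCode() and its argument name are hypothetical, not part of ColdFusion or of this chapter's UDF library.

```cfml
<!--- Hypothetical helper: true only if the whole value is a five- or nine-digit ZIP code. --->
<cffunction name="isUsZipCode" returntype="boolean" output="false">
  <cfargument name="value" type="string" required="true">
  <!--- ^ and $ force the pattern to account for the entire string. --->
  <cfreturn reFind("^\d{5}(-\d{4})?$", arguments.value) GT 0>
</cffunction>

<cfoutput>
01201: #isUsZipCode("01201")#<br>
01201-9809: #isUsZipCode("01201-9809")#<br>
Foo 01201-9809: #isUsZipCode("Foo 01201-9809")#<br>
01201-98: #isUsZipCode("01201-98")#
</cfoutput>
```

Without the anchors, the last two values would be accepted, because a valid five-digit run appears somewhere inside each of them.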
Understanding Multiline Mode
If you start your regular expression with the special sequence (?m), the regular expression is processed in what the ColdFusion and Perl engines call multiline mode. Multiline mode means that the ^ and $ characters match the beginning and end of a line within the chunk of text being searched, rather than the beginning and end of the entire chunk of text (Figure 13.14).
Figure 13.14. Multiline mode anchors matches to lines in the text being searched.
For example, suppose the chunk of text being searched contains these four lines:
1 frog a leaping
2 foxes jumping
100 programs crashing
5 golden rings
The following regular expression would get only the first line; because multiline mode is not in effect, ^ will match only the very beginning of the text:
^\d+[[:print:]]+
This one matches all four lines; because multiline mode is on, ^ matches the beginning of a line:
(?m)^\d+[[:print:]]+
This next one matches the first three lines (because they all end with ing), but not the last line (Figure 13.14):
(?m)^\d+[[:print:]]+ing$
All this said, it is very important to understand what the definition of a line is for the purposes of multiline mode processing. When you use (?m) with ColdFusion, each linefeed character (that's ASCII character 10) is considered to start a new line; this is the Unix method of indicating new lines. Carriage return characters (ASCII code 13) are not considered the start of new lines, which means that:
- Multiline mode processing won't work correctly with chunks of text that originate on Macintosh computers, because the text might contain only carriage return characters and no linefeeds.
- Chunks of text that originate on Windows/MS-DOS machines probably contain CRLF sequences (a carriage return followed by a linefeed) to separate the lines. As far as multiline mode processing is concerned, a carriage return character sits at the very end of every line, which means that $ will not work properly, because it matches only just before a linefeed, not before a carriage return.
- Chunks of text that originate on Unix machines will work fine (but if the chunks of text are coming from the public, it's unlikely that they are using Unix browsers).
Therefore, if you are going to use multiline mode, I recommend that you use ColdFusion's normal replace() method to massage the chunk of text that you're going to search. First, replace each CRLF with a linefeed (that should take care of the Windows text), and then replace any remaining carriage returns with linefeeds (to deal with the Mac text). Assuming that the chunk of text you will be searching is in a string variable called str, the following two lines will do the job:
<cfset str = replace(str, Chr(13)&Chr(10), Chr(10), "ALL")>
<cfset str = replace(str, Chr(13), Chr(10), "ALL")>
Another option would be to use the adjustNewlinesToLinefeeds() function included in the RegExFunctions.cfm UDF library (Table 13.5), like so:
<cfset str = adjustNewlinesToLinefeeds(str)>
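Putting the pieces together, the following sketch builds the four-line sample with Windows-style CRLF line endings, normalizes it to linefeeds with replace(), and then applies the multiline expression. The variable names are assumptions for illustration, and reMatch() assumes ColdFusion 8 or later.

```cfml
<!--- Build the sample text with Windows-style line endings. --->
<cfset crlf = chr(13) & chr(10)>
<cfset raw = "1 frog a leaping" & crlf & "2 foxes jumping" & crlf
           & "100 programs crashing" & crlf & "5 golden rings">

<!--- Normalize: CRLF pairs first, then any stray carriage returns. --->
<cfset clean = replace(raw, crlf, chr(10), "ALL")>
<cfset clean = replace(clean, chr(13), chr(10), "ALL")>

<!--- In multiline mode, ^ and $ anchor to each line of the normalized text. --->
<cfset ingLines = reMatch("(?m)^\d+[[:print:]]+ing$", clean)>

<cfoutput>
Lines ending in ing: #arrayLen(ingLines)#
</cfoutput>
```

After normalization, the first three lines should match and the last one should not, because rings ends in s rather than ing.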
Metacharacters 301: Match Modifiers
Perl 5 introduced a number of special modifiers that begin with the sequence (?, as listed in Table 13.12. Most of these modifiers are discussed elsewhere in this chapter, as indicated.
NOTE: The ColdFusion documentation implies that you can use only (?x) or (?m) or (?i) at the very beginning of a regular expression. Actually, you can use them anywhere in the expression, but they always affect the whole expression, ignoring parentheses. There is no way to say that you only want part of the expression to be affected by (?i), for instance. This is consistent with Perl's behavior. Just the same, I recommend putting these match modifiers at the beginning of the expression, because that's the documented usage.
As an example of using the (?x) modifier, consider the simple phone number RegEx that has been used elsewhere in this chapter. When used in a reFind(), it can look a bit unwieldy and somewhat inscrutable:
<cfset match = reFind("(\([0-9]{3}\))([0-9]{3}-[0-9]{4})", text, 1, true)>
Using (?x), you can spread the regular expression over as many lines as you want, using whatever indentation you want. You can also use the # sign to add comments, like this:
<cfset match = reFind("(?x)
  (               ## (begin capturing area code with subexpression)
    \([0-9]{3}\)  ## Area Code portion, surrounded by literal parentheses
  )               ## (end capturing of area code)
  (               ## (begin capturing actual phone number)
    [0-9]{3}      ## 'Exchange' portion of phone number,
    -             ## then a hyphen,
    [0-9]{4}      ## then the last four digits of phone number
  )               ## (end capturing of phone number)
", text, 1, True)>
Anything from a ## to the end of the line is considered to be a comment.
NOTE: Actually, the RegEx comment indicator is a single #, not ##, but because # has special meaning to ColdFusion, you need to use two pound signs together in order to get the # character into the RegEx string. This is the case anytime you need to embed # within a quoted string in CFML. For the same reason, avoid double quotation marks inside the comments, because they would end the CFML string early.
TIP: If you need to match a space character while using (?x), escape the space character by typing a \ followed by a space. That tells the processor to consider the space as an actual part of the match criteria, rather than part of the indentation and other decorative whitespace.
Metacharacters 302: Lookahead Matching
As noted in Table 13.12, you can use the positive lookahead modifier at the beginning of any parenthesized set of items. Positive lookahead means that you want to test that a pattern exists, but without it actually being considered part of the match. For instance, consider the following regular expression:
\bBelinda (?=Foxile)
This expression will match Belinda in a chunk of text, but only if it is followed by Foxile. Belinda followed by Carlisle will not match.
Negative lookahead, conversely, means that you want to test that a pattern does not exist. Conceptually, it's kind of like being able to say "this but not that." The following expression will match any Belinda, as long as it's not Belinda Carlisle:
\bBelinda (?!Carlisle)
Here's another example of using lookahead. Say you are using a simple regular expression such as the following to match telephone numbers in the form (999)999-9999:
(\([0-9]{3}\))([0-9]{3}-[0-9]{4})
The following variation adds negative lookahead to match only the phone numbers that are not in the 212 area code (see Figure 13.15):
(\((?!212)[0-9]{3}\))([0-9]{3}-[0-9]{4})
Figure 13.15. Lookahead matching allows for "this but not that" matches.
This next variation also adds positive lookahead, so that a phone number matches only if it is followed by the text (new listing):
(\((?!212)[0-9]{3}\))([0-9]{3}-[0-9]{4})\s+(?=\(new listing\))
NOTE: ColdFusion does not support lookbehind processing (Perl's (?<=) and (?<!) sequences).
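As a rough sketch of negative lookahead in use, the following snippet collects every phone number outside the 212 area code from an invented list. The phone numbers and variable names are made up for illustration, and reMatch() assumes ColdFusion 8 or later.

```cfml
<!--- Sample text is illustrative only. --->
<cfset listings = "(212)555-1212, (718)555-0100, (415)555-0199">

<!--- (?!212) rejects the match if the three digits after the parenthesis are 212. --->
<cfset outside212 = reMatch("\((?!212)[0-9]{3}\)[0-9]{3}-[0-9]{4}", listings)>

<cfoutput>
<cfloop array="#outside212#" index="number">
  #number#<br>
</cfloop>
</cfoutput>
```

The lookahead only peeks at the upcoming characters; the area code digits are still consumed by the [0-9]{3} that follows it.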
Metacharacters 303: Backreferences Redux
Earlier in this chapter, you learned about using backreferences such as \1 and \2 in the replacement string when using reReplace(), which allowed you to perform replacements that were far more intelligent than with static replacement strings. You can also use backreferences within the regular expression itself: Each backreference is like a variable that holds the value of the corresponding subexpression.
For instance, let's look at our telephone number RegEx again. Here's the normal version of the expression:
(\([0-9]{3}\))([0-9]{3}-[0-9]{4})
The following variation matches only those phone numbers where the last four digits are the same:
(\([0-9]{3}\))([0-9]{3}-(\d)\3\3\3)
This variation adds negative lookahead (discussed in the preceding section) to match only phone numbers in which the last four digits are not the same. The [0-9]{4} after the lookahead is what actually consumes the four digits once the lookahead has ruled out a repeated digit:
(\([0-9]{3}\))([0-9]{3}-(?!(\d)\3\3\3)[0-9]{4})
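Backreferences inside the pattern can be sketched with two short examples: a simplified version of the repeated-digit test, and a cleanup of accidentally doubled words. The sample strings and variable names are assumptions for illustration.

```cfml
<!--- (\d) captures one digit; \1\1\1 requires that same digit three more times. --->
<cfset sameDigits = reFind("[0-9]{3}-(\d)\1\1\1", "555-7777")>
<cfset mixedDigits = reFind("[0-9]{3}-(\d)\1\1\1", "555-1234")>

<!--- (\w+) captures a word; \1 in the pattern requires the same word again, --->
<!--- and \1 in the replacement string writes it back just once. --->
<cfset fixed = reReplace("This is is a test", "\b(\w+) \1\b", "\1", "ALL")>

<cfoutput>
Repeated digits found at: #sameDigits#<br>
Mixed digits found at: #mixedDigits#<br>
#fixed#
</cfoutput>
```

The same \1 notation appears in both places, but inside the pattern it means "match the captured text again," whereas in the replacement string it means "insert the captured text here."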
Metacharacters 304: Escape Sequences
ColdFusion supports the use of normal Perl escape sequences in regular expressions, as shown in Table 13.13. Previously, you needed to add these special characters to your RegEx string using the Chr() function. You can still do so, but these escape sequences are more standard and easier to type and read.