Using Regular Expressions in ColdFusion
The next two portions of this chapter will teach you about two concepts:
- How to use CFML's RegEx functions (reFind() and the others listed in Table 13.2) to actually perform regular expression operations within your ColdFusion pages.
- How to craft the regular expression for a particular task, using the various RegEx wildcards available to you.
This is a kind of chicken-and-egg scenario for me. How can I explain how to incorporate regular expressions like ([\w._]+)\@([\w_]+(\.[\w_]+)+) in your CFML code if you don't yet understand what those wildcards mean? Feel free to skip ahead to the "Crafting Your Own Regular Expressions" section if you don't like looking at all these wildcards without understanding what they mean.
Finding Matches with reFind()
Assuming you have already crafted the wildcard-laden RegEx criteria you want, you can use the reFind() function to tell ColdFusion to search a chunk of text with the criteria, like this:
reFind(regex, string [, start] [, returnSubExpressions] )
Table 13.3 describes each of the reFind() arguments.
ARGUMENT | DESCRIPTION |
---|---|
regex | Required. The regular expression that describes the text that you want to find. |
string | Required. The text that you want to search. |
start | Optional. The starting position for the search. The default is 1, meaning that the entire string is searched. If you provide a start value of 50, then only the portion of the string after the first 49 characters is searched. |
returnSubExpressions | Optional. A Boolean value indicating whether you want to obtain information about the position and length of the actual text that was found by the various portions of the regular expression. The default is False. You will learn more about this topic in the section "Getting the Matched Text Using returnSubExpressions" later in this chapter. |
What reFind() returns depends on the returnSubExpressions argument:
- Assuming that returnSubExpressions is False (the default), the function returns the character position of the text that's found (that is, the first substring that matches the search criteria). If no match is found in the text, the function returns 0 (zero). This behavior is consistent with the ordinary, non-RegEx find() function.
- If returnSubExpressions is True, the function returns a CFML structure composed of two arrays called pos and len. These arrays contain the position and length of the first substring that matches the search criteria. The first values in the arrays (that is, pos[1] and len[1]) correspond to the match as a whole. The remaining values in the arrays correspond to any subexpressions defined by the regular expression.
The bit about the subexpressions might be confusing at this point, since you haven't learned what subexpressions actually are. Don't worry about it for the moment. Just think of the returnSubExpressions argument as something you should set to True if you need to get the actual text that was found.
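To make the two return forms concrete, here is a minimal sketch that calls reFind() three ways on the same text. The sample string, the [0-9]+ pattern, and the variable names (sample, firstPos, laterPos, info) are made up for illustration and do not come from the chapter's listings.

```cfml
<!--- Hypothetical sample text, not taken from the chapter's listings --->
<cfset sample = "Order 66 shipped, order 77 pending">

<!--- Default form: a single number (position of the first match, or 0) --->
<cfset firstPos = reFind("[0-9]+", sample)>

<!--- With a start value, the search begins at that character position --->
<cfset laterPos = reFind("[0-9]+", sample, 10)>

<!--- With returnSubExpressions set to True: a structure of pos and len arrays --->
<cfset info = reFind("[0-9]+", sample, 1, true)>

<cfoutput>
<p>First match starts at position #firstPos#.</p>
<p>Searching from position 10, the next match starts at #laterPos#.</p>
<p>First match: position #info.pos[1]#, length #info.len[1]#.</p>
</cfoutput>
```

Note that the third call has to supply the start argument (1) explicitly in order to reach the fourth argument.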
A Simple Example
For the moment, accept it on faith that the following regular expression will find a sensibly formed Internet email address (such as nate@nateweiss.com or nate@nateweiss.co.uk):
([\w._]+)\@([\w_]+(\.[\w_]+)+)
Listing 13.1 shows how to use this regular expression to find an email address within a chunk of text.
Listing 13.1. RegExFindEmail1.cfm: A Simple Regular Expression Example
If you visit this page with your browser, the character position of the email address is displayed (Figure 13.1). If you change the text variable so that it no longer contains an Internet-style email address, the listing displays "No matches were found."
<!---
Filename: RegExFindEmail1.cfm
Author: Nate Weiss (NMW)
Purpose: Demonstrates basic use of reFind()
--->
<html>
<head><title>Using a Regular Expression</title></head>
<body>
<!--- The text to search --->
<cfset text = "My email address is nate@nateweiss.com. Write to me anytime.">
<!--- Attempt to find a match --->
<cfset foundPos = reFind("([\w._]+)@([\w_]+(\.[\w_]+)+)", text)>
<!--- Display the result --->
<cfif foundPos gt 0>
<cfoutput>
<p>A match was found at position #foundPos#.</p>
</cfoutput>
<cfelse>
<p>No matches were found.</p>
</cfif>
</body>
</html>
Figure 13.1. Regular expressions can search for email addresses, phone numbers, and the like.

Ignoring Capitalization with reFindNoCase()
Internet email addresses aren't generally considered to be case-sensitive, so you might want to tell ColdFusion to perform the match without respect to case. To do so, use reFindNoCase() instead of reFind(). Both functions take the same arguments and are used in exactly the same way, so there's no need to provide a separate example listing for reFindNoCase().
In short, anywhere you see reFind() in this chapter, you could use reFindNoCase() instead, and vice versa. Just use the one that's appropriate for the task at hand. Also, note that a regular expression can itself be written to be case-insensitive, in which case reFindNoCase() is unnecessary.
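As a quick illustration of the difference, the following sketch runs the same pattern through both functions. The text, the simplified pattern, and the variable names are invented for this example; the pattern is deliberately much looser than the email expression used elsewhere in this chapter.

```cfml
<!--- Hypothetical text with an address typed in capitals --->
<cfset text = "Write to NATE@NATEWEISS.COM today.">

<!--- A simplified, lowercase-only pattern used only for this comparison --->
<cfset pattern = "nate@[a-z.]+">

<!--- Case-sensitive search: the lowercase pattern cannot match the capitals --->
<cfset strictPos = reFind(pattern, text)>

<!--- Case-insensitive search: same arguments, different function --->
<cfset relaxedPos = reFindNoCase(pattern, text)>

<cfoutput>
<p>reFind() returned #strictPos#.</p>
<p>reFindNoCase() returned #relaxedPos#.</p>
</cfoutput>
```

Here reFind() should report 0 (no match), while reFindNoCase() should report the position where the address begins.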
Getting the Matched Text Using the Found Position
Sometimes you just want to find out whether a match exists within a chunk of text. In such a case, you would use the reFind() function as it was used in Listing 13.1.
You can also use that form of reFind() if the nature of the RegEx is such that the actual match will always have the same length. For instance, if you were searching specifically for a U.S. telephone number in the form (999)999-9999 (where each of the 9s represents a digit), you could use the following regular expression:
\([0-9]{3}\)[0-9]{3}-[0-9]{4}
Because a phone number in this form always has the same length, it's a simple matter to extract the actual phone number that was found. You use ColdFusion's built-in mid() function, feeding it the position returned by the reFind() function (as shown in Figure 13.1) as the start position, and the number 13 as the length.
Listing 13.2 puts these concepts together, displaying the actual phone number found in text (Figure 13.2).
Listing 13.2. RegExFindPhone1.cfm: Using mid() to Extract the Matched Text
<!---
Filename: RegExFindPhone1.cfm
Author: Nate Weiss (NMW)
Purpose: Demonstrates basic use of reFind()
--->
<html>
<head><title>Using a Regular Expression</title></head>
<body>
<!--- The text to search --->
<cfset text = "My phone number is (718)555-1212. Call me anytime.">
<!--- Attempt to find a match --->
<cfset matchPos = reFind("(\([0-9]{3}\))([0-9]{3}-[0-9]{4})", text)>
<!--- Display the result --->
<cfif matchPos gt 0>
<cfset foundString = mid(text, matchPos, 13)>
<cfoutput>
<p>A match was found at position #matchPos#.</p>
<p>The actual match is: #foundString#</p>
</cfoutput>
<cfelse>
<p>No matches were found.</p>
</cfif>
</body>
</html>
Figure 13.2. If you know its length ahead of time, it's easy to display the matched text.

Getting the Matched Text Using returnSubExpressions
If you want to adjust the email address example in Listing 13.1 so that it displays the actual email address found, the task is a bit more complicated because not all email addresses are the same length. What would you supply to the third argument of the mid() function? You can't use a constant number in the manner shown in Listing 13.2. Clearly, you need some way of telling reFind() to return the length, in addition to the position, of the match.
This is when the returnSubExpressions argument comes into play. If you set this argument to True when you use reFind(), the function will return a structure that contains the position and length of the match. (The structure also includes the position and length that correspond to any subexpressions in the regular expression, but don't worry about that right now.)
Listing 13.3 shows how to use this parameter of the reFind() function. It uses the first element in the pos and len arrays to determine the position and length of the matched text and then displays the match (Figure 13.3).
Listing 13.3. RegExFindEmail2.cfm: Using reFind()'s returnSubExpressions Argument
<!---
Filename: RegExFindEmail2.cfm
Author: Nate Weiss (NMW)
Purpose: Demonstrates basic use of reFind()
--->
<html>
<head><title>Using a Regular Expression</title></head>
<body>
<!--- The text to search --->
<cfset text = "My email address is nate@nateweiss.com. Write to me anytime.">
<!--- Attempt to find a match --->
<cfset matchStruct = reFind("([\w._]+)\@([\w_]+(\.[\w_]+)+)", text, 1, True)>
<!--- Display the result --->
<cfif matchStruct.pos[1] gt 0>
<cfset foundString = mid(text, matchStruct.pos[1], matchStruct.len[1])>
<cfoutput>
<p>A match was found at position #matchStruct.pos[1]#.</p>
<p>The actual match is: #foundString#</p>
</cfoutput>
<cfelse>
<p>No matches were found.</p>
</cfif>
</body>
</html>
Figure 13.3. It's easy to display a matched substring, even if its length will vary at run time.

Working with Subexpressions
As exhibited by the last example, the first values in the pos and len arrays correspond to the position and length of the match found by the reFind() function. Those values (pos[1] and len[1]) will always exist. So why are pos and len implemented as arrays if the first value in each is the only interesting value? What other information do they hold?
The answer is this: If your regular expression contains any subexpressions, there will be an additional value in the pos and len arrays for each subexpression, corresponding to the actual text matched by that subexpression. If your regular expression has two subexpressions, pos[2] and len[2] are the position and length of the first subexpression's match, and pos[3] and len[3] are the position and length for the second subexpression.
So, what's a subexpression? When you are using regular expressions to solve specific problems (such as finding email addresses or phone numbers in a chunk of text), you are often looking for several different patterns of text, one after another. That is, the nature of the problem is often such that the regular expression is made up of several parts ("look for this, followed by that"), where all of the parts must be found in order for the whole regular expression to be satisfied. If you place parentheses around each of the parts, the parts become subexpressions.
Subexpressions do two things:
- They make the overall RegEx criteria more flexible, because you can use many regular expression wildcards on each subexpression. This capability allows you to say that some subexpressions must be found while others are optional, or that a particular subexpression can be repeated multiple times, and so on. To put it another way, the parentheses allow you to work with the enclosed characters or wildcards as an isolated group. This isn't so different conceptually from the way parentheses work in <cfif> statements or SQL criteria.
- The match for each subexpression is included in the len and pos arrays, so you can easily find out what specific text was actually matched by each part of your RegEx criteria. You get position and length information not only for the match as a whole, but for each of its constituent parts.
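The second point can be sketched by looping over the pos and len arrays returned for the email expression used earlier in this chapter. This is only an illustrative sketch: the loop variable i and the output wording are assumptions, and the guard on len[i] is a defensive choice so that mid() is never called with an unusable position.

```cfml
<!--- Same sample text and expression as the email listings --->
<cfset text = "My email address is nate@nateweiss.com. Write to me anytime.">
<cfset matchStruct = reFind("([\w._]+)\@([\w_]+(\.[\w_]+)+)", text, 1, true)>

<cfif matchStruct.pos[1] gt 0>
  <cfoutput>
  <!--- Element 1 is the whole match; later elements are the subexpressions --->
  <cfloop from="1" to="#arrayLen(matchStruct.pos)#" index="i">
    <!--- Skip empty entries so mid() always receives a valid start and length --->
    <cfif matchStruct.len[i] gt 0>
      <p>Part #i#: #mid(text, matchStruct.pos[i], matchStruct.len[i])#</p>
    </cfif>
  </cfloop>
  </cfoutput>
<cfelse>
  <p>No matches were found.</p>
</cfif>
```

For this sample text, part 1 should be the full address, followed by nate, nateweiss.com, and .com, one entry per pair of parentheses in the expression, counted by the order of their opening parentheses.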