5.4 Extended Examples
The next few examples illustrate some important techniques about regular expressions.
The discussions are longer, and show more of the thought processes and
mistaken paths that eventually lead to a working conclusion.
5.4.1 Keeping in Sync with Your Data
Let''''s look at a lengthy example that might seem a bit contrived, but which illustrates
some excellent points on why it''''s important to keep in sync with what
you''''re trying to match (and provides some methods to do so).
Let''''s say that your data is a series of five-digit US postal codes (ZIP codes) that are
run together, and that you need to retrieve all that begin with, say, 44. Here is a
sample line of data, with the codes we want to retrieve in bold:
03824531449411615213441829503544272752010217443235
Building Up a Regex Through Variables in Java
String SubDomain = "(?i:[a-z0-9]|[a-z0-9][-a-z0-9]*[a-z0-9])";
String TopDomains = "(?x-i:com\\b \n" +
" |edu\\b \n" +
" |biz\\b \n" +
" |in(?:t|fo)\\b \n" +
" |mil\\b \n" +
" |net\\b \n" +
" |org\\b \n" +
" |[a-z][a-z]\\b \n" + // country codes
") \n";
String Hostname = "(?:" + SubDomain + "\\.)+" + TopDomains;
String NOT_IN = ";\"''''<>()\\[\\]\\{\\}\\s\\x7F-\\xFF";
String NOT_END = ".,?";
String ANYWHERE = "[^" + NOT_IN + NOT_END + "]";
String EMBEDDED = "[" + NOT_END + "]";
String UrlPath = "/"+ANYWHERE + "*("+EMBEDDED+"+"+ANYWHERE+"+)*";
String Url =
"(?x: \n"+
" \\b \n"+
" ## match the hostname part \n"+
" ( \n"+
" (?: ftp | http s? ): // [-\\w]+(\\.\\w[-\\w]*)+ \n"+
" | \n"+
" " + Hostname + " \n"+
" ) \n"+
" # allow optional port \n"+
" (?: \\d+ )? \n"+
" \n"+
" # rest of url is optional, and begins with / \n"+
" (?: " + UrlPath + ")? \n"+
")";
// Now convert string we''''ve built up into a real regex object
Pattern UrlRegex = Pattern.compile(Url);
// Now ready to apply to raw text to find urls . . .
·
·
·
As a starting point, consider that
\d\d\d\d\d
can be used repeatedly to match all
the ZIP codes. In Perl, this is as simple as
@zips = m/\d\d\d\d\d/g;
to create a
list with one ZIP code per element. (To keep these examples less cluttered, they
assume the text to be matched is in Perl''''s default target variable
$_
see Section 2.3.7.) With
other languages, it''''s usually a simple matter to call the regex "find" method in a
loop. I''''d like to concentrate on the regular expression rather than that mechanics
particular to each language, so will continue to use Perl to show the examples.
Back to
\d\d\d\d\d
. Here''''s a point whose importance will soon become apparent:
the regex never fails until the entire list has been parsedthere are absolutely no bump-and-retries by the transmission. (I''''m assuming we''''ll have only proper
data, an assumption that is very situation specific.)
So, it should be apparent that changing
\d\d\d\d\d
to
44\d\d\d
in an attempt
to find only ZIP codes starting with 44 is silly once a match attempt fails, the
transmission bumps along one character, thus putting the match for the
44···
out of
sync with the start of each ZIP code. Using
44\d\d\d
incorrectly finds a match at ''''···5314494116···''''.
You could, of course, put a caret or
\A
at the head of the regex, but they allow a
target ZIP code to match only if it''''s the first in the string. We need to keep the
regex engine in sync manually by writing our regex to pass over undesired ZIP
codes as needed. The key here is that it must pass over full ZIP codes, not single
characters as with the automatic bump-along.
5.4.1.1 Keeping the match in sync with expectations
The following are a few ways to pass over undesired ZIP codes. Inserting them
before what we want (
(44\d\d\d)
) achieves the desired effect. Non-capturing
(?:···)
parentheses are used to match undesired ZIP codes, effectively allowing us
to pass them over on the way toward matching a desired ZIP code within the $1
capturing parentheses:
(?:[^4]\d\d\d\d|\d[^4]\d\d\d)*···
This brute-force method actively skips ZIP codes that start with something
other than 44. (Well, it''''s probably better to use
[1235-9]
instead of
[^4]
,
but as I said earlier, I am assuming properly formatted data.) By the way,
we can''''t use
(?:[^4][^4]\d\d\d)*
, as it does not match (and thus does
not pass over) undesired ZIP codes like 43210.
(?:(?!44)\d\d\d\d\d)*···
This method, which uses negative lookahead, actively skips ZIP codes that
do not start with 44. This English description sounds virtually identical to
the previous one, but when rendered into a regular expression looks quite
different. Compare the two descriptions and related expressions. In this
case, a desired ZIP code (beginning with 44) causes
(?!44)
to fail, thus
causing the skipping to stop.
(?:\d\d\d\d\d)*?···
This method uses a lazy quantifier to skip ZIP codes only when needed. We
use it before a subexpression matching what we do want, so that if that
subexpression fails, this one matches a ZIP. It''''s the laziness of
(···)*?
that
allows this to happen. Because of the lazy quantifier,
(?:\d\d\d\d\d)
is
not even attempted until whatever follows has failed. The star assures that it
is repeatedly attempted until whatever follows finally does match, thus
effectively skipping only what we want to skip.
Combining this last method with
(44\d\d\d)
gives us
@zips = m/
(?:\d\d\d\d\d)*?(44\d\d\d)/g;
and picks out the desired ''''44xxx
'''' codes, actively skipping undesired ones that
intervene. (In this "
@array = m/···/g
" situation, Perl fills the array with what''''s
matched by capturing parentheses during each match attempt see Section 7.5.3.3.) This regex
can work repeatedly on the string because we know each match always leaves the
"current match position" at the start of the next ZIP code, thereby priming the next
match to start at the beginning of a ZIP code as the regex expects.
5.4.1.2 Maintaining sync after a non-match as well
Have we really ensured that the regex is always applied only at the start of a ZIP
code? No! We have manually skipped intervening undesired ZIP codes, but once
there are no more desired ones, the regex finally fails. As always, the bump-alongand-
retry happens, thereby starting the match from a position within a ZIP code
something our approach relies on never happening.
Let''''s look at our sample data again:
038245314494116
15213441829503544272
7
5
2
010217443235
Here, the matched codes are bold (the third of which is undesired), the codes we
actively skipped are underlined, and characters skipped via bump-along-and-retry
are marked. After the match of 44272, no more target codes are able to be
matched, so the subsequent attempt fails. Does the whole match attempt end? Of
course not. The transmission bumps along to apply the regex at the next character,
putting us out of sync with the real ZIP codes. After the fourth such bump-along,
the regex skips 10217 as it matches 44323, reporting it falsely as a desired code.
Any of our three expressions work smoothly so long as they are applied at the
start of a ZIP code, but the transmission''''s bump-along defeats them. This can be
solved by ensuring that the transmission doesn''''t bump along, or that a bumpalong
doesn''''t cause problems.
One way to ensure that the transmission doesn''''t bump along, at least for the first
two methods, is to make
(44\d\d\d\)
greedily optional by appending
?
. This
plays off the knowledge that the prepended
(?:(?!44)\d\d\d\d\d)*···
or
(?:[^4]\d\d\d\d|\d[^4]\d\d\d)*···
finish only when at a desired code, or
when there are no more codes (which is why it can''''t be used for the third, nongr
eedy method.) Thus,
(44\d\d\d)?
matches the desired ZIP code if it''''s there, but
doesn''''t force a backtrack if it''''s not.
There are some problems with this solution. One is that because we can now have
a regex match even when we don''''t have a target ZIP code, the handling code must
be a bit more complex. However, to its benefit, it is fast, since it doesn''''t involve
much backtracking, nor any bump-alongs by the transmission.
5.4.1.3 Maintaining sync with \G
A more general approach is to simply prepend
\G
(see Section 3.4.3.3) to any of the three
methods'''' expressions. Because we crafted each to explicitly end on a ZIP code
boundary, we''''re assured that any subsequent match that has had no intervening
bump-along begins on that same ZIP boundary. And if there has been a bumpalong,
the leading
\G
fails immediately, because with most flavors, it''''s successful
only when there''''s been no intervening bump-along. (This is not true for Ruby and
other flavors whose
\G
means "start of the current match" instead of "end of the
previous match" see Section 3.4.3.4.)
So, using the second expression, we end up with
@zips = m/\G(?:(?!44)\d\d\d\d\d)*(44\d\d\d\d)/g;
without the need for any special after-match checking.
5.4.1.4 This example in perspective
I''''ll be the first to admit that this example is contrived, but nevertheless, it shows a
number of valuable lessons about how to keep a regex in sync with the data. Still,
were I really to need to do this in real life, I would probably not try to solve it
completely with regular expressions. I would simply use
\d\d\d\d\d
to grab
each ZIP code, then discard it if it doesn''''t begin with ''''44''''. In Perl, this looks like:
@zips = ( ); # Ensure the array is empty
while (m/(\d\d\d\d\d)/) {
$zip = $1;
if (substr($zip, 0, 2) eq "44") {
push @zips, $zip;
}
}
Also, see the sidebar in Section 3.4.3.4 for a particularly interesting use of
\G
, although
one available at the time of this writing only in Perl.
5.4.2 Parsing CSV Files
As anyone who''''s ever tried to parse a CSV (Comma Separated Values) file can tell
you, it can be a bit tricky. The biggest problem is that it seems every program that
produces a CSV file has a different idea of just what the format should be. In this
section, I''''ll start off with methods to parse the kind of CSV file that Microsoft Excel
generates, and we''''ll move from there to look at some other format
permutations.[3]
Section 6.6.7.3) after the
efficiency issues discussed in that chapter are taken into consideration.
Luckily, the Microsoft format is one of the simplest. The values, separated by commas,
are either "raw" (just sitting there between the commas), or within double quotes (and within the double quotes, a double quote itself is represented by a
pair of double quotes in a row).
Here''''s an example:
Ten Thousand,10000, 2710 ,,"10,000","It''''s "10 Grand", baby",10K
This row represents seven fields:
Ten•Thousand
10000
•2710•
an empty field
10,000
It''''s•"10•Grand",•baby
10K
So, to parse out the fields from a line, we need an expression to cover each of two
field types. The non-quoted ones are easythey contain anything except commas
and quotes, so they are matched by
[^",]+
.
A double-quoted field can contain commas, spaces, and in fact anything except a
double quote. It can also contain the two quotes in a row that represent one quote
in the final value. So, a double-quoted field is matched by any number of
[^"]|"
between
"···"
, which gives us
"(?:[^"]|")*"
. (Actually, for efficiency, I can use
atomic grouping,
(?>···)
instead of
(?:···)
, but I''''ll leave that discussion until the
next chapter; see Section 6.5.7.)
Putting this all together results in
[^",]+|"(?:[^"]|")*"
to match a single
field. That might be getting a bit hard to read, so I''''ll rewrite it in a free-spacing
mode (see Section 3.3.3.2):
# Either some non-quote/non-comma text . . .
[^",]+
# . . . or . . .
|
# . . . a double-quoted
field (inside, paired
double quotes are allowed)
" # field''''s opening quote
(?: [^"] ; " )*
" # field''''s closing quote
Now, to use this in practice, we can apply it repeatedly to a string containing a
CSV row, but if we want to actually do anything productive with the results of the
match, we should know which alternative matched. If it''''s the double-quoted field,
we need to remove the outer quotes and replace internal paired double quotes
with one double quote to yield the original data.
I can think of two approaches to this. One is to just look at the text matched and
see whether the first character is a double quote. If so, we know that we must
strip the first and last characters (the double quotes) and replace any internal ''''"'''' by ''''"''''. That''''s simple enough, but it''''s even simpler if we are clever with capturing
parentheses. If we put capturing parentheses around each of the subexpressions
that match actual field data, we can inspect them after the match to see which
group has a value:
# Either some non-quote/non-comma text . . .
( [^",]+ )
# . . . or . . .
|
# . .. a double-quoted field (inside, paired
double quotes are allowed)
" # field''''s opening quote
( (?: [^"] | " )* )
" # field''''s closing quote
Now, if we see that the first group captured, we can just use the value as is. If the
second group captured, we merely need to replace any ''''"'''' with ''''"'''' and we can
use the value.
I''''ll show the example now in Perl, and a bit later (after we flush out some bugs) in
Java and VB.NET. Here''''s the snippet in Perl, assuming our line is in $html and has
had any newline removed from the end (we don''''t want the newline to be part of
the last field!):
while ($line =~ m{
# Either some non-quote/non-comma text . . .
( [^",]+ )
# . . . or . . .
|
# . . . a double-quoted field (" allowed inside)
" # field''''s opening quote
( (?: [^"] | " )* )
" # field''''s closing quote
}gx)
{
if (defined $1) {
$field = $1;
} else {
$field = $2;
$field =~ s/"/"/g;
}
print "[$field]"; # print the field, for debugging
Can work with $field now . . .
}
Applying this against our test data, the output is:
[Ten•Thousand][10000][•2710•]
[10,000][It''''s•"10•Grand",•baby][10K]
This looks mostly good, but unfortunately doesn''''t give us anything for that empty
fourth field. If the program''''s "work with $field" is to save the field value to an
array, once we''''re all done, we''''d want access to the fifth element of the array to
yield the fifth field ("10,000"). That won''''t work if we don''''t fill an element of the
array with an empty value for each empty field.
The first idea that might come to mind for matching an empty field is to change
[^",]+
to
[^",]*
. Well, that may seem obvious, but does it really work?
Let''''s test it. Here''''s the output:
[Ten•Thousand][][10000][][•2710•][][][][
10,000][][][It''''s•"10•Grand", . . .
Oops, we somehow got a bunch of extra fields! Well, in retrospect, we shouldn''''t
be surprised. By using
(···)*
to match a field, we don''''t actually require anything
to match. That works to our advantage when we have an empty field, but consider
after having matched the first field, the next application of the regex starts at
''''Ten•Thousand,10000···''''. If nothing in the regex can match that raw comma (as is
the case), yet an empty match is considered successful, the empty match will
indeed be successful at that point. In fact, it could be successful an infinite number
of times at that point if the regex engine doesn''''t realize, as modern ones do,
that it''''s in such a situation and force a bump-along so that two zero-width matches
don''''t happen in a row (see Section 3.4.3.4). That''''s why there''''s one empty match between each
valid match (and although not shown, there''''s an empty match at the end).
5.4.2.1 Distrusting the bump-along
The problem here stems from us having relied on the transmission''''s bump-along
to get us past the separating commas. To solve it, we need to take that control into
our own hands. Two approaches come to mind:
We could try to match the commas ourselves. If we take this approach, we
must be sure to match a comma as part of matching a regular field, using it to
"pace ourselves" through the string.
We could check to be sure that each match start is consistent with locations
that we know can start a field. A field starts either at the beginning of the
line, or after a comma.
Perhaps even better, we can combine the two. Starting with the first approach
(matching the commas ourselves), we can simply require a comma before each
field except the first. Alternatively, we can require a comma after each field except
the last. We can do this by prepending
^|,
or appending
$|,
to our regex, with
appropriate parentheses to control the scope. Let''''s try prepending, which gives us:
(?:^|,)
(?:
# Either some non-quote/non-comma text....
( [^",]* )
# ··· or···
|
# ··· a double-quoted field (inside, paired
double quotes are allowed)
" # field''''s opening quote
( (?: [^"] | " )* )
" # field''''s closing quote
)
This really sounds like it should work, but plugging it into our test program, the
result is disappointing:
[Ten•Thousand][10000][•2710•]
[][][000][][•baby][10K]
Remember, we''''re expecting:
[Ten•Thousand][10000][•2710•]
[][10,000][It''''s•"10•Grand",•baby][10K]
Why didn''''t this one work? It seems that the double-quoted fields didn''''t come out
right, so the problem must be with the part that matches a double-quoted field,
right? No, the problem is before that. Perhaps the moral from Section 4.5.10.1 can help:
when more than one alternative can potentially match from the same point, care
must be taken when selecting the order of the alternatives. Since the first alternative,
[^",]*
, requires nothing to be successful, the second alternative never gets a
chance to match unless forced by something that must match later in the regex.
Our regex doesn''''t have anything after these alternatives, so as it''''s written, the second
alternative is never even reached. Doh!
Wow, you''''ve really got to keep your wits about you. Okay, let''''s swap the alternatives
and try again:
(?:^|,)
(?: # Now, match either a double-quoted field
(inside, paired double quotes are allowed) . . .
" # (double-quoted field''''s opening quote)
( (?: [^"] | " )* )
" # (double-quoted field''''s closing quote)
|
# . . . or, some non-quote/non-comma text . . .
( [^",]* )
)
Now, it works! Well, at least for our test data. Could it fail with other data? This
section is named "Distrusting the bump-along," and while nothing takes the place
of some thought backed up with good testing, we can use
\G
to ensure that each
match begins exactly at the point that the previous match ended. We believe that
should be true already because of how we''''ve constructed and apply our regex. If
we start out the regex with
\G
, we disallow any match after the engine''''s transmission
is forced to bump along. We hope it will never matter, but doing so may
make an error more apparent. Had we done this with our previously-failing regex
that had given us
[Ten•Thousand][10000][•2710•][][][000][][•baby][10K]
we would have gotten
[Ten•Thousand][10000][•2710•][][]
instead. This perhaps would have made the error more apparent, had we missed it
the first time.
CSV Processing in Java
Here''''s the CSV example with Sun''''s java.util.regex. This is designed to be
clear and simplea more efficient version is given in Chapter 8 (see Section 8.4.4.3).
import java.util.regex.*;
·
·
·
Pattern fieldRegex = Pattern.compile(
"\\G(?:^|,) \n"+
"(?: \n"+
" # Either a double-quoted field ... \n"+
" \" # field''''s opening quote \n"+
" ( (?: [^\"]++ | \"\" )*+ ) \n"+
" \" # field''''s closing quote \n"+
" # ... or ... \n"+
" | \n"+
" # ... some non-quote/non-comma text ... \n"+
" ( [^\",]* ) \n"+
" ) \n", Pattern.COMMENTS);
Pattern quotesRegex = Pattern.compile("\"\");
·
·
·
// Given the string in ''''line'''', find all the fields . . .
Matcher m = fieldRegex.matcher(line);
while (m.find())
{
String field;
if (m.group(1) != null) {
field = quotesRegex.matcher(m.group(1)).replaceAll("\");
} else {
field = m.group(2);
}
// We can now work with the field . . .
System.out.println("[" + field + "]");
}
Another approach. The beginning of this section noted two approaches to
ensuring we stay properly aligned with the fields. The other is to be sure that a
match begins only where a field is allowed. On the surface, this is similar to
prepending
^|,
, except using lookbehind as with
(?<=^|,)
.
Unfortunately, as the section in the previous chapter (see Section 3.4.3.6) explains, even if
lookbehind is supported, variable-length lookbehind sometimes isn''''t, so this
approach may not work. If the variable length is the issue, we could replace
(?<=^|,)
with
(?:^|(?<=,))
, but this seems overly complex considering that we
already have the first approach working. Also, it reverts to relying on the transmission''''s
bump-along to bypass the commas, so if we''''ve made a mistake elsewhere, it
could allow a match to begin at a location like ''''···"10,000"···''''. All in all, it just
seems safer to use the first approach.
However, we can use a twist on this approachrequiring a match to end before a
comma (or before the end of the line). Adding
(?=$|,)
to the end of our regex
adds yet another assurance that it won''''t match where we don''''t want it to. In practice,
would I do add this? Well, frankly, I feel pretty comfortable with what we
came up with before, so I''''d probably not add it in this exact situation, but it''''s a
good technique to have on hand when you need it.
5.4.2.2 One change for the sake of efficiency
Although I don''''t concentrate on efficiency until the next chapter, I''''d like to make
one efficiency-related change now, for systems that support atomic grouping
(see Section 3.4.5.3). If supported, I''''d change the part that matches the values of doublequoted
fields from
(?:[^"]|")*
to
(?>[^"]+|")*
. The VB.NET example in
the sidebar below shows this.
CSV Processing in VB.NET
Imports System.Text.RegularExpressions
·
·
·
Dim FieldRegex as Regex = New Regex( _
"(?:^|,) " & _
"(?: " & _
" (?# Either a doublequoted field ...) " & _
" " (?# field''''s opening quote ) " & _
" ( (?> [^"]+ | "" )* ) " & _
" " (?# field''''s closing quote ) " & _
" (?# ... or ...) " & _
" | " & _
" (?# ... some non-quote/non-comma text ...) " & _
" ( [^",]*) " & _
" )", RegexOptions.IgnorePatternWhitespace)
Dim QuotesRegex as Regex = New Regex(" " " ")
''''A string with two double quotes
·
·
·
Dim FieldMatch as Match = FieldRegex.Match(Line)
While FieldMatch.Success
Dim Field as String
If FieldMatch.Groups(1).Success
Field = QuotesRegex.Replace(FieldMatch.Groups(1).Value, "")
Else
Field = FieldMatch.Groups(2).Value
End If
Console.WriteLine("[" & Field & "]")
'''' Can now work with ''''Field''''.···
FieldMatch = FieldMatch.NextMatch
End While
If possessive quantifiers (see Section 3.4.5.8) are supported, as they are with Sun''''s Java regex
package, they can be used instead. The sidebar with the Java CSV code shows this.
The reasoning behind these changes is discussed in the Chapter 6, and eventually
we end up with a particularly efficient version, shown in Section 6.6.7.3.
5.4.2.3 Other CSV formats
Microsoft''''s CSV format is popular because it''''s Microsoft''''s CSV format, but it''''s not
necessarily what other programs use. Here are some twists I''''ve seen:
Using another character, such as '''';'''' or a tab, as the separator (which makes
one wonder why the format is still called "comma-separated values").
Allowing spaces after the separators, but not counting them as part of the
value.
Using backslashes to escape quotes (e.g., using ''''\"'''' rather than ''''"'''' to include
a quote within the value). This usually means that a backslash is allowed (and
ignored) before any character.
These changes are easily accommodated. Do the first by replacing each comma in
the regex with the desired separator; the second by adding
\s*
after the first separator,
e.g., starting out with
(?:^|,\s*)
.
For the third, we can use what we developed earlier (see Section 5.2.7), replacing
[^"]+|"
with
[^"\\]+|\\.
. Of course, we''''d also have to change the subsequent
s/"/"/g
to the more general
s/\\(.)/$1/g
, or our target language''''s equivalent.