6.7 The Freeflowing Regex
We just spent some time constructing a regex to match a C comment, but left off
with the problem of how to stop comment-like items within strings from being
matched. Using Perl, we might mistakenly try to remove comments with:
$prog =~ s{/\*[^*]*\*+(?:[^/*][^*]*\*+)*/}{}g; # remove
C comments (and more!)
Text in the variable $prog that is matched by our regex is removed (that is,
replaced by nothing). The problem with this is that there''''s nothing to stop a match
from starting within a string, as in this C snippet:
char *CommentStart = "/*"; /* start of comment */
char *CommentEnd = "*/"; /* end of comment */
Here, the underlined portions are what the regex finds, but the bold portions are
what we wish to be found. When the engine is searching for a match, it tries to
match the expression at each point in the target. Since it is successful only from
where a comment begins (or where it looks like one begins), it doesn''''t match at
most locations, so the transmission bump-along bumps us right into the doublequoted
string, whose contents look like the start of a comment. It would be nice if
we could tell the regex engine that when it hits a double-quoted string, it should
zip right on past it. Well, we can.
6.7.1 A Helping Hand to Guide the Match
Consider:
$COMMENT = qr{/\*[^*]*\*+(?:[^/*][^*]*\*+)*/}; #
regex to match a comment
$DOUBLE = qr{"(?:\\.|[^"\\])*"}; #regex to
match double-quoted string
$text =~ s/$DOUBLE|$COMMENT//g;
There are two new things here. One is that this time the regex operand,
$DOUBLE|$COMMENT, is made up of two variables, each of which is constructed
with Perl''''s special qr/···/ regex-style "
double-quoted string" operator. As discussed
at length in Chapter 3 (see Section 3.3), one must be careful when using strings that are
meant to be interpreted as regular expressions. Perl alleviates this problem by providing
the qr/···/ operator, which treats its operand as a regular expression, but
doesn''''t actually apply it. Rather, it returns a "regex object" value that can later be
used to build up a larger regular expression. It''''s extremely convenient, as we saw
briefly in Chapter 2 (see Section 2.3.6.7). Like m/···/ and s/···/···/, you can pick delimiters to
suit your needs (see Section 2.3.6.4), as we''''ve done here using braces.
The other new thing here is the matching of
double-quoted strings via the
$DOUBLE portion. When the transmission has brought us to a position where the
$DOUBLE part can match, it will do so, thereby bypassing the whole string in one
fell swoop. It''''s possible to have both alternatives because they''''re entirely unambiguous with respect to each other. When applied to a string, as you read from the beginning, any point you that the text is:
Matchable by the comment part, thereby skipping immediately to the end of
the comment, or...
Matchable by the
double-quoted string part, thereby skipping immediately to
the end of the string, or...
Not matchable by either, causing the attempt to fail. This means that the normal
bump-along will skip only the one, uninteresting character.
This way, the regex will never be started from within a string or comment, the
key to its success. Well, actually, right now it isn''''t helpful yet, since it removes the
strings as well as the comments, but a slight change puts us back on track.
Consider:
$COMMENT = qr{/\*[^*]*\*+(?:[^/*][^*]*\*+)*/}; #
regex to match a comment
$DOUBLE = qr{"(?:\\.|[^"\\])*"}; # Regex to match
double-quoted string
$text =~ s/
($DOUBLE)|$COMMENT/
$1/g;
The only differences are that we''''ve:
Added the parentheses to fill $1 if the match is via the string alternative. If the
match is via the comment alternative, $1 is left empty.
Made the replacement value that same $1. The effect is that if a double-quoted
string is matched, the replacement is that same
double-quoted string the
string is not removed and the substitute becomes an effective no-op (but has
the side effect of getting us past the string, which is the reason to add it in the
first place). On the other hand, if the comment alternative is the one that
matches, the $1 is empty, so the comment is replaced by nothingness just as
we want.
[9]
[9] In Perl, if $1 is not filled during the match, it''''s given a special "no value" value "undef". When used
in the replacement value, undef is treated as an empty string, so it works as we want. But, if you
have Perl warnings turned on (as every good programmer should), the use of an undef value in this way causes a warning to be printed. To avoid this, you can use the ''''no warnings;'''' pragma before the regular expression is used, or use this special Perl form of the substitute operator:
$text =~ s/($DOUBLE )|$COMMENT/
defined($1) ? $1 : "
/ge;
Finally, we need to take care of single-quoted C constants such as ''''\t'''' and the like. This is easywe simply add another alternative inside the parentheses. If we
would like to removeC++ /Java/C# style // comments too, that''''s as simple as
adding
//[^\n]*
as a fourth alternative, outside the parentheses:
$COMMENT = qr{/\*[^*]*\*+(?:[^/*][^*]*\*+)*/};
# regex to match a comment
$COMMENT2 = qr{//[^\n]*};
# regex to match a C++ // comment
$DOUBLE = qr{"(?:\\.|[^"\\])*"};
# regex to match double-quoted string
$SINGLE = qr{''''(?:\\.|[^''''\\])*''''};
# regex to match single-quoted string
$text =~ s/($DOUBLE|$SINGLE)
|$COMMENT|$COMMENT2/$1/g;
The basic premise is quite slick: when the engine checks the text, it quickly grabs
(and if appropriate, removes) these special constructs. On my system, this Perl
snippet took about 16.4 seconds to remove all the comments from a 16-megabyte,
500,000-line test file. This is fast, but we''''ll speed it up considerably.
6.7.2 A Well-Guided Regex is a Fast Regex
With just a little hand holding, we can help direct the flow of the regex engine''''s
attention to match much faster. Let''''s consider the long spans of normal C code
between the comments and strings. For each such character, the regex engine has
to try each of the four alternatives to see whether it''''s something that should be
gobbled up, and only if all four fail does it bump-along to bypass the character as
uninteresting. This is a lot of work that we really don''''t need to do.
We know, for example, that for any of the alternatives to have a chance at matching,
the lead character must be a slash, a single quote, or a double quote. One of
these doesn''''t guarantee a match, but not being one does guarantee a non-match.
So, rather than letting the engine figure this out the slow and painful way, let''''s just
tell it directly by adding
[^''''"/]
as an alternative. In fact, any number of such
characters in a row can be scooped right up, so let''''s use
[^''''"/]+
instead. If you
remember the neverending match, you might feel worried about the added plus.
Indeed, it could be of great concern if it were within some kind of (···)* loop, but
in this stand-alone case it''''s quite fine (there''''s nothing that follows that could force
it to backtrack at all). So, adding:
$OTHER = qr{[^"''''/]}; # Stuff that couldn''''t
possibly begin one of the other alternatives
.
.
.
$text =~ s/($DOUBLE|$SINGLE|$OTHER+)|$COMMENT|$COMMENT2/$1/g;
For reasons that will become apparent after a bit, I''''ve put the plus quantifier after
$OTHER, rather than part of the contents of $OTHER.
So, I retry my benchmarks, and wow, this one change cuts the time by over 75%!
We''''ve crafted the regex to remove most of the overhead of having to try all the
alternatives so often. There are still a few cases where none of the alternatives can
match (such as at ''''c / 3.14''''), and at such times, we''''ll have to be content with the bump-along to get us by.
However, we''''re not done yetwe can still help the engine flow to a faster match:
In most cases, the most popular alternative will be
$OTHER+
, so let''''s put that
first inside the parentheses. This isn''''t an issue for a POSIX NFA engine because
it must always check all alternatives anyway, but for a Traditional NFA, which
stops once a match has been found, why make it check for relatively rare
matches before checking the one we believe will match most often?
After one of the quoted items matches, it will likely be followed by some
$OTHER before another string or a comment is found. If we add
$OTHER*
after
each item, we tell the engine that it can immediately flow right into matching
$OTHER without bothering with the /g looping. This is similar to the unrollingthe-
loop technique. In fact, unrolling the loop gains much of its speed from
the way it leads the regex engine to a match, using our global knowledge to
create the local optimizations that feed the engine just what it needs to work
quickly.
Note that it is very important that this $OTHER, added after each string-matching
subexpression, be quantified with star, while the previous $OTHER (the one
we moved to the head of the alternation) be quantified by plus. If it''''s not
clear, consider what could happen if the appended $OTHER had plus and there
were, say, two double-quoted strings right in a row. Also, if the leading
$OTHER used star, it would always match!
These changes yield
($OTHER+|$DOUBLE$OTHER*|$SINGLE$OTHER*)|$COMMENT|$COMMENT2
as the regex, and further cuts the time by an additional five percent.
Let''''s step back and think about these last two changes. If we go to the trouble of
scooping up $OTHER* after each quoted string, there are only two situations in
which the original $OTHER+ (which we moved to be the first alternative) can
match: 1> at the very start of the whole s/···/···/g, before any of the quoted strings
get a chance to match, and 2) after any comment. You might be tempted to think
"Hey, to take care of point #2, let''''s just add $OTHER* after the comments as well!"
This would be nice, except everything we want to keep must be inside that first
set of parenthesesputting it after the comments would throw out the baby code
with the comment bathwater.
So, if the original $OTHER+ is useful primarily only after a comment, do we really
want to put it first? I guess that depends on the dataif there are more comments
than quoted strings, then yes, placing it first makes sense. Otherwise, I''''d place it
later. As it turns out with my test data, placing it first yields better results. Placing it
later takes away about half the gains we achieved in the last step.
6.7.3 Wrapup
We''''re not quite done yet. Don''''t forget, each of the quoted-string subexpressions is
ripe for unrolling heck, we spent a long section of this chapter on that very
topic. So, as a final change, let''''s replace the two string subexpressions with:
$DOUBLE = qr{"[^"\\]*(?:\\.[^"\\]*)*"};
$SINGLE = qr{''''[^''''\\]*(?:\\.[^''''\\]*)*''''};
This change yields yet another 15 percent gain. Just a few changes has sped things
up from 16.4 seconds to 2.3 secondsa speedup of over 7x.
This last change also shows how convenient a technique it can be to use variables
to build up a regular expression. Individual components, such as $DOUBLE, can be
considered in relative isolation, and can be changed without having to wade into
the full expression. There are still some overall issues (the counting of capturing
parentheses, among others) that must be kept in mind, but it''''s a wonderful
technique.
One of the features that makes it so convenient in this case is Perl''''s qr/···/ operator,
which acts like a regex-related type of "string." Other languages don''''t have this
exact functionality, but many languages do have string types that are amenable to
building regular expressions. See "Strings as Regular Expressions" starting on
Section 3.3.1 .
You''''ll particularly appreciate the building up of regular expressions this way when
you see the raw regex. Here it is, broken across lines to fit the page:
([^"\''''/]+|"[^"\\]*(?:\\.[^"\\]*)*"[^"\''''/]*|''''[^''''\\]*
(?:\\.[^''''\\]*)*''''[^"\''''/]*)|/\*[^*]*\*+(?:[^/*][^*]*\*+)*/|//[^\n]*