Mastering Regular Expressions (2nd Edition)
Jeffrey E. F. Friedl

6.4 Common Optimizations


A smart regex implementation has many ways to optimize how quickly it produces
the results you ask of it. Optimizations usually fall into two classes:

Doing something faster: Some types of operations, such as \d+, are so common
that the engine might have special-case handling set up to execute them
faster than the general engine mechanics would.

Avoiding work: If the engine can decide that some particular operation is
unneeded in producing a correct result, or perhaps that some operation can
be applied to less text than originally thought, skipping those operations can
result in a time savings. For example, a regex beginning with \A (start-of-string)
can match only when started at the beginning of the string, so if no match is
found there, the transmission need not bother checking from other positions.


Over the next dozen or so sections, we'll look at many of the different and ingenious
optimizations that I've seen. No one language or tool has them all, or even
the same mix as another language or tool, and I'm sure that there are plenty of
other optimizations that I've not yet seen, but this chapter should leave you much
more empowered to take advantage of whatever optimizations your tool offers.


6.4.1 No Free Lunch


Optimizations often result in a savings, but not always. There's a benefit only if the
amount of time saved is more than the extra time spent checking to see whether
the optimization is applicable in the first place. In fact, if the engine checks to see
if an optimization is applicable and the answer is "no," the overall result is slower
because it includes the fruitless check on top of the subsequent normal application
of the regex. So, there's a balance among how much time an optimization takes,
how much time it saves, and importantly, how likely it is to be invoked.

Let's look at an example. The expression \b\B (word boundary at the same location
as a non-word boundary) can't possibly match. If an engine were to realize
that a regex contained \b\B in such a way that it was required for any match, the
engine would know that the overall regex could never match, and hence never
have to actually apply that regex. Rather, it could always immediately report failure.
If applied to long strings, the savings could be substantial.

Yet, no engine that I know of actually uses this optimization. Why not? Well, first
of all, it's not necessarily easy to decide whether it applies to a particular regex.
It's certainly possible for a regex to have \b\B somewhere in it, yet still match,[5] so
the engine has to do extra work ahead of time to be absolutely certain. Still, the
savings could be truly substantial, so it could be worth doing the extra work if
\b\B were expected to be common. But, it's not common (I think it's silly!), so
even though the savings could be huge, it's not worth slowing every other regex
by the extra overhead required to do the check.

[5] I've used \b\B before to cause one part of a larger expression to fail, during testing. For example, I
might insert it just after 'this' in ···(this|this other)··· to guarantee failure of the first alternative.
These days, when I need a "must fail" component, I use (?!). You can see an interesting
Perl-specific example of this in Section 7.8.2.2.
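
A tiny test shows the (?!) technique in action (a hedged Perl sketch of my own; the sample strings are made up):

    # (?!) is a negative lookahead of nothing, so it can never succeed:
    # a handy "must fail" component when testing alternatives.
    my $re = qr/th(?:is(?!)|is other)/;   # the first alternative is forced to fail
    for my $str ('this', 'this other') {
        print "'$str' ", ($str =~ $re ? "matches" : "fails"), "\n";
    }
    # Expected output: 'this' fails; 'this other' matches.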



6.4.2 Everyone's Lunch is Different


Keep this in mind when looking at the various kinds of optimizations that this
chapter discusses. Even though I've tried to pick simple, clean names for each
one, it may well be that every engine that implements it does so in a different
way. A seemingly innocuous change in a regex can cause it to become substantially
faster with one implementation, but substantially slower with another.


6.4.3 The Mechanics of Regex Application


Before looking at the ways advanced systems optimize their regex performance,
and ways we can take advantage of those optimizations, it's important to first
understand the basics of regex application. We've already covered the details
about backtracking, but in this short section, we'll step back a bit to look at the
broader picture.

Here are the main steps taken in applying a regular expression to a target string:

Regex Compilation: The regex is inspected for errors, and if valid, compiled
into an internal form.

Transmission Begins: The transmission "positions" the engine at the start of
the target string.

Component Tests: The engine works through the regex and the text, moving
from component to component in the regex, as described in Chapter 4. We've
already covered backtracking for NFAs in great detail, but there are a few
additional points to mention:

With components next to each other, as with the S, u, b, j, e ... of
Subject, each component is tried in turn, stopping only if one fails.

With quantifiers, control jumps between the quantifier (to see whether
the quantifier should continue to make additional attempts) and the component
quantified (to test whether it matches).

There is some overhead when control enters or exits a set of capturing
parentheses. The actual text matched by the parentheses must be remembered
so that $1 and the like are supported. Since a set of parentheses
may be "backtracked out of," the state of the parentheses is part of the
states used for backtracking, so entering and exiting capturing parentheses
requires some modification of that state.

Finding a Match: If a match is found, a Traditional NFA "locks in" the current
state and reports overall success. On the other hand, a POSIX NFA merely
remembers the possible match if it is the longest seen so far, and continues
with any saved states still available. Once no more states are left, the longest
match that was seen is the one reported.

Transmission Bump-Along: If no match is found, the transmission bumps the
engine along to the next character in the text, and the engine applies the
regex all over again (going back to step 3).

Overall Failure: If no match is found after having applied the engine at every
character in the target string (and after the last character as well), overall
failure must be reported.
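
To make these steps concrete, here is a rough emulation of the transmission in Perl (a sketch of my own, not how real engines are implemented; the helper name is hypothetical):

    # Try an anchored match at each position, bumping along one character
    # at a time until a match is found or overall failure is reached.
    sub naive_transmission {
        my ($regex, $target) = @_;
        for my $start (0 .. length $target) {   # includes the position after the last char
            if (substr($target, $start) =~ /\A$regex/) {
                return ($start, $&);            # a Traditional NFA locks in the first match
            }
        }
        return;                                 # overall failure
    }

    my ($offset, $match) = naive_transmission(qr/\d+/, 'abc123');
    print defined $offset ? "matched '$match' at offset $offset\n" : "no match\n";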


The next few sections discuss the many ways this work can be reduced by smart
implementations, and taken advantage of by smart users.


6.4.4 Pre-Application Optimizations


A good regex engine implementation can reduce the amount of work that needs
to be done before the actual application of a regex, and sometimes can even
decide quickly beforehand that the regex can never match, thereby avoiding the
need to even apply the regex in the first place.


6.4.4.1 Compile caching


Recall the mini mail program from Chapter 2 (see Section 2.3.4). The skeleton of the main
loop, which processes every line of the header, looks like:

    while (···) {
        if ($line =~ m/^\s*$/)                     ···
        if ($line =~ m/^Subject: (.*)/)            ···
        if ($line =~ m/^Date: (.*)/)               ···
        if ($line =~ m/^Reply-To: (\S+)/)          ···
        if ($line =~ m/^From: (\S+) \(([^()]*)\)/) ···
        ···
    }

The first thing that must be done before a regular expression can be used is that it
must be inspected for errors, and compiled into an internal form. Once compiled,
that internal form can be used in checking many strings, but will it? It would certainly
be a waste of time to recompile each regex each time through the loop.
Rather, it is much more time efficient (at the cost of some memory) to save, or
cache, the internal form after it's first compiled, and then use that same internal
form for each subsequent application during the loop.

The extent to which this can be done depends on the type of regular-expression
handling the application offers. As described starting in Section 3.2, the three types
of handling are integrated, procedural, and object-oriented.


6.4.4.1.1 Compile caching in the integrated approach


An integrated approach, like Perl's and awk's, allows compile caching to be done
with ease. Internally, each regex is associated with a particular part of the code,
and the compiled form can be associated with the code the first time it's executed,
and merely referenced subsequent times. This provides for the maximum optimization
of speed at the cost of the memory needed to hold all the cached
expressions.

The ability to interpolate variables into the regex operand (that is, use the contents
of a variable as part of the regular expression) throws somewhat of a monkey
wrench into the caching plan. When variables are interpolated, as with something
like m/^Subject:•\Q$DesiredSubject\E\s*$/, the actual regular expression may
change from iteration to iteration because it depends on the value in the variable,
which can change from iteration to iteration. If it changes every time, the
regex must be compiled every time, so nothing can be reused.


DFAs, Tcl, and Hand-Tuning Regular Expressions


For the most part, the optimizations described in this chapter simply don't
apply to DFAs. The compile caching optimization, discussed in Section 6.4.4.1,
does apply to all types of engines, but none of the techniques for hand-tuning
discussed throughout this chapter apply to DFAs. As Chapter 4 makes
clear (see Section 4.4), expressions that are logically equivalent (this|that
and th(is|at), for example) are equivalent to a DFA. It's because they're not
necessarily equivalent to an NFA that this chapter exists.

But what about Tcl, which has a hybrid DFA/NFA engine? Tcl's regex engine
was custom built for Tcl by regular-expression legend Henry Spencer (see Section 3.1.1.6),
who has done a fantastic job blending the best of both DFA and NFA worlds.
Henry noted himself in an April 2000 Usenet posting:


In general, the Tcl RE-matching engine is much less sensitive to the exact
form of the RE than traditional matching engines. Things that it does
quickly will be fast no matter how you write them; things that it does
slowly will be slow no matter how you write them. The old folklore about
hand-optimizing your REs simply does not apply.


Henry's Tcl regex engine is an important step forward. If this technology
were more widespread, much of this chapter would not be needed.

Well, the regular expression might change with each iteration, but that doesn't
mean it needs to be recompiled each time. An intermediate optimization is to
check the results of the interpolation (the actual value to be used as the regular
expression), and recompile only if it's different from the previous time. If the value
actually changes each time, there's no optimization, as the regex indeed must be
recompiled each time. But, if it changes only sporadically, the regular expression
need only be checked (but not compiled) most times, yielding a handsome
optimization.
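
Written out by hand, that intermediate optimization might look like the following Perl sketch (Perl actually performs this same-as-last-time check internally; $DesiredSubject and the surrounding loop are placeholders of mine):

    my ($prev_value, $compiled);
    while (my $line = <STDIN>) {
        # Recompile only when the interpolated value differs from last time.
        if (!defined($prev_value) or $DesiredSubject ne $prev_value) {
            $prev_value = $DesiredSubject;
            $compiled   = qr/^Subject: \Q$DesiredSubject\E\s*$/;
        }
        print "found it\n" if $line =~ $compiled;
    }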


6.4.4.1.2 Compile caching in the procedural approach


With an integrated approach, regex use is associated with a particular location in a
program, so the compiled version of the regex can be cached and used the next
time that location in the program is executed. But, with a procedural approach,
there is just a general "apply this regex" function that is called as needed. This
means that there's no location in a program with which to associate the compiled
form, so the next time the function is called, the regex must be compiled from
scratch again. That's how it works in theory, but in practice, it's much too inefficient to abandon all attempts at caching. Rather, what's often done is that a
mapping of recently used regex patterns is maintained, linking each pattern to its
resulting compiled form.

When the apply-this-regex function is called, it compares the pattern argument
with those in the cache of saved regular expressions, and uses the cached version
if it's there. If it's not, it goes ahead and compiles the regex, saving it to the cache
(and perhaps flushing an old one out, if the cache has a size limit). When the
cache has become full and a compiled form must be thrown out, it's usually the
least recently used one.

GNU Emacs keeps a cache of 20 expressions, while Tcl keeps 30. A large cache
size is important because if more regular expressions are used within a loop than
the size of the cache, by the time the loop restarts, the first regex will have been
flushed from the cache, guaranteeing that every expression will have to be compiled
from scratch every time.
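
A toy version of such a cache might look like this in Perl (a sketch of my own; for brevity it flushes the oldest entry rather than the least recently used one):

    my %cache;        # pattern string  =>  compiled form (a qr// object)
    my @order;        # insertion order, oldest first
    my $LIMIT = 30;   # e.g., Tcl keeps 30

    sub apply_regex {                # the general "apply this regex" function
        my ($pattern, $target) = @_;
        unless (exists $cache{$pattern}) {
            $cache{$pattern} = qr/$pattern/;                   # compile from scratch
            push @order, $pattern;
            delete $cache{ shift @order } if @order > $LIMIT;  # flush when full
        }
        return $target =~ $cache{$pattern};
    }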


6.4.4.1.3 Compile caching in the object-oriented approach


The object-oriented approach puts control of when a regex is compiled directly
into the programmer's hands. Compilation of the regex is exposed to the user via
object constructors such as New Regex, re.compile, and Pattern.compile
(which are from .NET, Python, and java.util.regex, respectively). In the simple examples
from Chapter 3 where these are introduced (starting in Section 3.2.2), the compilation
is done just before the regex is actually used, but there's no reason that it can't
be done earlier (such as sometime before a loop, or even at program initialization)
and then used freely as needed. This is done in the benchmarking examples in
Section 6.3.2, Section 6.3.3, and Section 6.3.4.

The object-oriented approach also affords the programmer control over when a
compiled form is thrown away, via the object's destructor. Being able to immediately
throw away compiled forms that will no longer be needed saves memory.
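
Perl's qr// provides similar control in the integrated world; a minimal sketch of compile-early, use-often:

    # Compile once, up front (analogous to re.compile or Pattern.compile),
    # then reuse the compiled object freely inside the loop.
    my $subject_regex = qr/^Subject: (.*)/;
    while (my $line = <STDIN>) {
        if ($line =~ $subject_regex) {
            print "subject: $1\n";
        }
    }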


6.4.4.2 Pre-check of required character/substring optimization


Searching a string for a particular character (or perhaps some literal substring) is a
much "lighter" operation than applying a full NFA regular expression, so some systems
do extra analysis of the regex during compilation to determine if there are
any characters or substrings that are required to exist in the target for a possible
match. Then, before actually applying the regex to a string, the string is quickly
checked for the required character or string; if it's not found, the entire application
of the regex can be bypassed.

For example, with ^Subject:•(.*), the string 'Subject:•' is required. A program
can look for the entire string, perhaps using the Boyer-Moore search algorithm
(which is a fast way to search for literal strings within text; the longer the literal
string, the more efficient the search). A program not wishing to implement
the Boyer-Moore algorithm can still gain a benefit by picking a required character
and just checking every character in the target text. Picking a character less likely
to be found in the target (such as picking ':' over 't' from our 'Subject:•' example)
is likely to yield better results.
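
When an engine doesn't do this pre-check for you, it can be written by hand. A hedged Perl sketch, using index (Perl's fast literal-substring search):

    # Skip the regex engine entirely for lines that can't possibly match.
    if (index($line, 'Subject: ') >= 0 and $line =~ /^Subject: (.*)/) {
        my $subject = $1;
        ···
    }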

While it's trivial for a regex engine to realize what part of ^Subject:•(.*) is a
fixed literal string required for any match, it's more work to recognize that 'th' is
required for any match of this|that|other, and most don't do it. It's not exactly
black and white: an implementation not realizing that 'th' is required may well
still be able to easily realize that 'h' and 't' are required, and so at least do a
one-character check.

There is a great variety in how well different applications can recognize required
characters and strings. Most are thwarted by the use of alternation. With such systems,
using th(is|at) can provide an improvement over this|that. Also, be
sure to see the related "Initial character/class/substring discrimination" optimization
in Section 6.4.5.4.


6.4.4.3 Length-cognizance optimization



^Subject:•(.*) can match arbitrarily long text, but any match is certainly at least
nine characters long. Therefore, the engine need not be started up and applied to
strings shorter than that length. Of course, the benefit is more pronounced with a
regex with a longer required length, such as :\d{79}: (81 characters in any
match).

Also see the length-cognizance transmission optimization in Section 6.4.5.6.
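
Hand-rolled, length cognizance is just a length test before the match; a small sketch under the same 'Subject:•' example:

    # Any match of /^Subject: (.*)/ is at least nine characters long,
    # so don't even start the engine on shorter strings.
    if (length($line) >= 9 and $line =~ /^Subject: (.*)/) {
        my $subject = $1;
        ···
    }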


6.4.5 Optimizations with the Transmission


If the regex engine can't decide ahead of time that a particular string can never
match, it may still be able to reduce the number of locations that the transmission
actually has to apply the regex.


6.4.5.1 Start of string/line anchor optimization


This optimization recognizes that any regex that begins with ^ can match only
when applied where ^ can match, and so need be applied at those locations only.

The comments in the earlier "Pre-check of required character/substring" section
(Section 6.4.4.2) about the regex engine's ability to derive just when the optimization
is applicable to a regex are also valid here. Any implementation attempting this
optimization should be able to recognize that ^(this|that) can match starting
only at locations where ^ can match, but many won't come to the same realization
with ^this|^that. In such situations, writing ^(this|that) or (even better)
^(?:this|that) can allow a match to be performed much faster.

Similar optimizations involve \A and, for repeated matches, \G.


6.4.5.2 Implicit-anchor optimization


An engine with this optimization realizes that if a regex begins with .* or .+,
and has no global alternation, an implicit ^ can be prepended to the regex. This
allows the start of string/line anchor optimization of the previous section to be
used, which can provide a lot of savings.

More advanced systems may realize that the same optimization can also be
applied when the leading .* or .+ is within parentheses, but care must be taken
when the parentheses are capturing. For example, the regex (.+)x\1 finds locations
where a string is repeated on either side of 'x', and an implicit leading ^
causes it to improperly not match '1234x2345'.[6]

[6] It's interesting to note that Perl had this over-optimization bug unnoticed for over 10 years, until
Perl developer Jeff Pinyan discovered (and fixed) it in early 2002. Apparently, regular expressions like
(.+)x\1 aren't used often, or the bug would have been discovered sooner. Most regex engines don't
have this bug because they don't have this optimization, but some still do have the bug. These
include Ruby, PCRE, and tools that use PCRE, such as PHP.
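
A quick way to probe an engine for this bug is to try the example from the text (Perl shown; a correct engine finds the match starting at the second character):

    # (.+)x\1 should match '234x234' inside '1234x2345'.  An engine that
    # wrongly prepends an implicit ^ before capturing parentheses fails here.
    if ('1234x2345' =~ /(.+)x\1/) {
        print "ok: matched '$&' with \$1 = '$1'\n";   # expect $1 = '234'
    } else {
        print "bug: implicit-anchor over-optimization?\n";
    }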



6.4.5.3 End of string / line anchor optimization


This optimization recognizes that some regexes ending with $ or other end
anchors (see Section 3.4.3) have matches that start within a certain number of bytes from the
end of the string. For example, with regex(es)?$, any match must start no more
than eight[7] characters from the end of the string, so the transmission can jump
directly to that spot, potentially bypassing many positions if the target string
is long.

[7] I say eight characters rather than seven because in many flavors, $ can match before a string-ending
newline (see Section 3.4.3.2).



6.4.5.4 Initial character/class/substring discrimination optimization


A more generalized version of the pre-check of required character/string optimization,
this optimization uses the same information (that any match by the regex
must begin with a specific character or literal substring) to let the transmission use
a fast substring check so that it need apply the regex only at appropriate spots in
the string. For example, this|that|other can match only at locations beginning
with [ot], so having the transmission pre-check each character in the string and
apply the regex only at matching positions can afford a huge savings. The
longer the substring that can be pre-checked, the fewer "false starts" are likely.


6.4.5.5 Embedded literal string check optimization


This is almost exactly like the initial string discrimination optimization, but is
more advanced in that it works for literal strings embedded a known distance into
any match. \b(perl|java)\.regex\.info\b, for example, has '.regex.info'
four characters into any match, so a smart transmission can use a fast Boyer-Moore
literal-string check to find '.regex.info', and then actually apply the regex starting
four characters before.

In general, this works only when the literal string is embedded a fixed distance
into any match. It doesn't apply to \b(vb|java)\.regex\.info\b, which does
have a literal string, but one that's embedded either two or four characters into
any match. It also doesn't apply to \b(\w+)\.regex\.info\b, whose literal
string is embedded any number of characters into any match.


6.4.5.6 Length-cognizance transmission optimization


Directly related to the Length-cognizance optimization in Section 6.4.4.3, this optimization
allows the transmission to abandon the attempt if it's gotten too close to the
end of the string for a match to be possible.


6.4.6 Optimizations of the Regex Itself



6.4.6.1 Literal string concatenation optimization


Perhaps the most basic optimization is that abc can be treated by the engine as
"one part," rather than the three parts "a then b then c." If this is done, the one
part can be applied by one iteration of the engine mechanics, avoiding the overhead
of three separate iterations.


6.4.6.2 Simple quantifier optimization


Uses of star, plus, and friends that apply to simple items, such as literal characters
and character classes, are often optimized such that much of the step-by-step overhead
of a normal NFA engine is removed. The main control loop inside a regex
engine must be general enough to deal with all the constructs the engine supports.
In programming, "general" often means "slow," so this important optimization
makes simple quantifiers like .* into one "part," replacing the general engine
mechanics of quantifier processing with fast, specialized processing. Thus, the
general engine is short-circuited for these tests.

For example, .* and (?:.)* are logically identical, but for systems with this optimization,
the simple .* is substantially faster than (?:.)*. A few examples: with
Sun's Java regex package, it's about 10% faster, but with Ruby and the .NET languages,
it's about two and a half times faster. With Python, it's about 50 times
faster, and with PCRE/PHP, it's about 150 times faster. Because Perl has the optimization
discussed in the next section, both .* and (?:.)* are the same
speed. (Be sure to see the sidebar below for a discussion of how to interpret
these numbers.)
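
Comparisons like these are easy to reproduce with Perl's standard Benchmark module (a sketch; in Perl itself the two forms should clock in about the same, as just noted):

    use Benchmark qw(cmpthese);

    my $text = 'x' x 1_000_000;          # a long target string
    cmpthese(-2, {                       # run each for at least 2 CPU seconds
        'simple .*'      => sub { $text =~ /.*/     },
        'wrapped (?:.)*' => sub { $text =~ /(?:.)*/ },
    });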


6.4.6.3 Needless parentheses elimination


If an implementation can realize that (?:.)* is exactly the same as .*, the
needless parentheses can be removed, which then opens the expression up to the
previous optimization.


Understanding Benchmarks in This Chapter


For the most part, benchmarks in this chapter are reported as relative ratios
for a given language. For example, in Section 6.4.6.2, I note that a certain optimized
construct is 10% faster than the unoptimized construct, at least with
Sun's Java regex package. In the .NET Framework, the optimized and unoptimized
constructs differ by a factor of two and a half, but in PCRE, it's a whopping
factor of about 150. In Perl, it's a factor of one (i.e., they are the same
speed; no difference).

From this, what can you infer about the speed of one language compared to
another? Absolutely nothing. The 150x speedup for the optimization in PCRE
may mean that the optimization has been implemented particularly well, relative
to the other languages, or it may mean that the unoptimized version is
particularly slow. For the most part, I report very little timing information
about how languages compare against each other, since that's of interest
mostly for bragging rights among language developers.

But, for what it's worth, it may be interesting to see the details behind such
different results as Java's 10% speedup and PCRE's 150x speedup. It turns out
that PCRE's unoptimized (?:.)* is about 11 times slower than Java's, but its
optimized .* is about 13 times faster. Java's and Ruby's optimized versions
are about the same speed, but Ruby's unoptimized version is about 2.5 times
slower than Java's unoptimized version. Ruby's unoptimized version is only
about 10% slower than Python's unoptimized version, but Python's optimized
version is about 20 times faster than Ruby's optimized version.

All of these are slower than Perl's. Both Perl's optimized and unoptimized
versions are 10% faster than Python's fastest. Note that each language has its
own strong points, and these numbers are for only one specific test case.

For an example of a head-to-head comparison, see "Warning: Benchmark
results can cause drowsiness!" in Chapter 8 (see Section 8.3.2.1).


6.4.6.4 Needless character class elimination


A character class with a single character in it is a bit silly, because it invokes the
processing overhead of a character class, but without any benefits of one. So, a
smarter implementation internally converts something like [.] to \. before
matching.


6.4.6.5 Character following lazy quantifier optimization


With a lazy quantifier, as in "(.*?)", the engine normally must jump between
checking what the quantifier controls (the dot) and checking what comes after
(the "). For this and other reasons, lazy quantifiers are generally much slower than
greedy ones, especially compared to greedy ones that are optimized with the
simple quantifier optimization discussed two sections ago. Another factor is that if the lazy
quantifier is inside capturing parentheses, control must repeatedly jump in and out
of the capturing, which causes additional overhead.

So, this optimization involves the realization that if a literal character follows the
lazy quantifier, the lazy quantifier can act like a normal greedy quantifier so long
as the engine is not at that literal character. Thus, implementations with this optimization
switch to specialized lazy quantifier processing for use in these situations,
which quickly checks the target text for the literal character, bypassing the normal
"skip this attempt" if the target text is not at that special literal character.

Variations on this optimization might include the ability to pre-check for a class of
characters, rather than just a specific literal character (for instance, a pre-check for
['"] with ['"](.*?)["'], which is similar to the initial character discrimination
optimization discussed in Section 6.4.5.4).
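
Where this optimization is absent, the common hand-tuning is to replace the lazy quantifier with a negated class; a brief sketch of both forms (for single-line values the two are equivalent):

    # Both extract the first double-quoted value, but the negated class
    # avoids the per-character jumping that a lazy quantifier requires.
    my ($lazy)  = $line =~ /"(.*?)"/;
    my ($class) = $line =~ /"([^"]*)"/;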


6.4.6.6 "Excessive" backtracking detection


The problem revealed with the "Reality Check" in Section 6.1.4 is that certain combinations
of quantifiers, such as (.+)*, can create an exponential amount of backtracking.
One simple way to avoid this is to keep a count of the backtracking, and
abort the match when there's "too much." This is certainly useful in the reality-check
situation, but it puts an artificial limit on the amount of text that some regular
expressions can be used with.

For example, if the limit is 10,000 backtracks, .*? can never match text longer
than 10,000 characters, since each character matched involves a backtrack. Working
with these amounts of text is not all that uncommon, particularly when working
with, say, web pages, so the limitation is unfortunate.
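
The pathological behavior itself is easy to demonstrate; a hedged Perl sketch (keep $n small: on an engine without the short-circuiting described in the next section, each added character roughly doubles the work):

    use Time::HiRes qw(time);

    my $n     = 24;
    my $text  = 'x' x $n;      # no trailing 'y', so the match must fail
    my $start = time;
    $text =~ /^(x+)*y$/;       # quantifier of a quantifier: exponential backtracking
    printf "n=%d: failed after %.2f seconds\n", $n, time - $start;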

For different reasons, some implementations have a limit on the size of the backtrack
stack (on how many saved states there can be at any one time). For example,
Python allows at most 10,000. Like a backtrack limit, this limits the length of text
some regular expressions can work with.

This issue made constructing some of the benchmarks used while researching this
book rather difficult. To get the best results, the timed portion of a benchmark
should do as much of the target work as possible, so I created huge strings and
compared the time it took to execute, say, "(.*)", "(.)*", "(.)*?", and
"([^"])*?". To keep meaningful results, I had to limit the length of the strings so
as not to trip the backtrack-count or stack-size limitations. You can see an example
in Section 6.3.3.


6.4.6.7 Exponential (a.k.a. super-linear) short-circuiting


A better solution to avoid matching forever on an exponential match is to detect
when the match attempt has gone super-linear. You can then make the extra effort
to keep track of the position at which each quantifier's subexpression has been
attempted, and short-circuit repeat attempts.

It's actually fairly easy to detect when a match has gone super-linear. A quantifier
should rarely "iterate" (loop) more times than there are characters in the target
string. If it does, it's a good sign that it may be an exponential match. Having been
given this clue that matching may go on forever, it's a more complex issue to
detect and eliminate redundant matches, but since the alternative is matching for a
very, very long time, it's probably a good investment.

One negative side effect of detecting a super-linear match and returning a quick
failure is that a truly inefficient regex now has its inefficiency mostly hidden. Even
with exponential short-circuiting, these matches are much slower than they need
to be, but no longer slow enough to be easily detected by a human (instead of finishing
long after the sun has gone dormant, it may take 1/100 of a second: quick
to us, but still an eternity in computer time).

Still, the overall benefit is probably worth it. There are many people who don't
care about regex efficiency; they're scared of regular expressions and just want
the thing to work, and don't care how. (You may have been this way before, but I
hope reading this book emboldens you, like the title says, to master the use of
regular expressions.)


6.4.6.8 State-suppression with possessive quantifiers


After something with a normal quantifier has matched, a number of "try the non-match
option" states have been created (one state per iteration of the quantifier).
Possessive quantifiers (see Section 3.4.5.10) don't leave those states around. This can be accomplished
by removing the extra states after the quantifier has run its course, or it
can be done more efficiently by removing the previous iteration's state while
adding the current iteration's. (During the matching, one state is always required
so that the regex can continue once the quantified item can no longer match.)

The reason the on-the-fly removal is more efficient is that it takes less memory.
Applying .* leaves one state per character matched, which could consume a
vast amount of memory if the string is long.


Automatic "Possessification"


Recall the example from Chapter 4 (see Section 4.5.6.1.1) where ^\w+: is applied to
'Subject'. Once \w+ matches to the end of the string, the subsequent colon
can't match, and the engine must waste the effort of trying : at each position
where backtracking forces \w+ to give up a character. The example then
concluded that we could have the engine avoid that extra work by using
atomic grouping, ^(?>\w+):, or possessive quantifiers, ^\w++:.

A smart implementation should be able to do this for you. When the regex is
first compiled, the engine can see that what follows the quantifier can't be
matched by what is quantified, so the quantifier can be automatically turned
into a possessive one.

Although I know of no system that currently has this optimization, I include
it here to encourage developers to consider it, for I believe it can have a substantial
positive impact.
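
Until engines do this automatically, the hand-written forms from the example look like this in Perl 5.10 or later (a sketch; the first two fail on 'Subject' immediately instead of backtracking):

    my $str = 'Subject';
    $str =~ /^(?>\w+):/;   # atomic grouping: saved states are discarded on exit
    $str =~ /^\w++:/;      # possessive quantifier: the same effect
    $str =~ /^\w+:/;       # plain greedy: retries ':' at every backtracked position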


6.4.6.9 Small quantifier equivalence


Some people like to write \d\d\d\d directly, while some like to use a small quantifier
and write \d{4}. Is one more efficient than the other? For an NFA, the
answer is almost certainly "yes," but which is faster depends on the tool. If the
tool's quantifier has been optimized, the \d{4} version is likely faster unless the
version without the quantifier can somehow be optimized more. Sound a bit confusing?
It is.

My tests show that with Perl, Python, PCRE, and .NET, \d{4} is faster by as much
as 20%. On the other hand, with Ruby and Sun's Java regex package, \d\d\d\d is
faster, sometimes several times faster. So, this seems to make it clear that the
small quantifier is better for some, but worse for others. But, it can be more complex
than that.

Compare ==== with ={4}. This is a quite different example because this time, the
subject of the repetition is a literal character, and perhaps using ==== directly
makes it easier for the regex engine to recognize the literal substring. If it can, the
highly effective initial character/substring discrimination optimization (see Section 6.4.5.4) can
kick in, if supported. This is exactly the case for Python and Sun's Java regex
package, for which the ==== version can be up to 100 times faster than ={4}.

More advanced still, Perl, Ruby, and .NET recognize this optimization with either
==== or ={4}, and as such, both are equally fast (and in either case, can be hundreds
or thousands of times faster than the \d\d\d\d and \d{4} counterparts).
On the other hand, PCRE doesn't recognize it in either case.
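
Such comparisons are easy to run for your own data; a hedged Benchmark sketch (the sample string is my own, and the exact ratios depend heavily on version and data):

    use Benchmark qw(cmpthese);

    my $text = ('x' x 100_000) . '====1234';   # the matches sit far into the string
    cmpthese(-2, {
        'spelled-out \d' => sub { $text =~ /\d\d\d\d/ },
        'quantified \d'  => sub { $text =~ /\d{4}/    },
        'spelled-out ='  => sub { $text =~ /====/     },
        'quantified ='   => sub { $text =~ /={4}/     },
    });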


6.4.6.10 Need cognizance


One simple optimization is that if the engine realizes that some aspect of the match
result isn't needed (say, the capturing aspect of capturing parentheses), it can eliminate
the work to support it. The ability to detect such a thing is very language
dependent, but this optimization can be gained as easily as allowing an extra
match-time option to disable various high-cost features.

One example of a system that has this optimization is Tcl. Its capturing parentheses
don't actually capture unless you explicitly ask. Conversely, the .NET Framework
regular expressions have an option that allows the programmer to indicate
that capturing parentheses shouldn't capture.

