Mastering Regular Expressions (2nd Edition)

Jeffrey E. F. Friedl

4.6 NFA, DFA, and POSIX



4.6.1 "The Longest-Leftmost"





Let me repeat what I've said before: when the transmission starts a DFA engine
from some particular point in the string, and there exists a match or matches to be
found at that position, the DFA finds the longest possible match, period. Since it's
the longest from among all possible matches that start equally furthest to the left,
it's the "longest-leftmost" match.





4.6.1.1 Really, the longest





Issues of which match is longest aren't confined to alternation. Consider how an
NFA matches the (horribly contrived) one(self)?(selfsufficient)? against the
string oneselfsufficient. An NFA first matches one and then the greedy (self)?,
leaving (selfsufficient)? left to try against sufficient. It doesn't match, but
that's okay since it is optional. So, the Traditional NFA returns oneself and
discards the untried states. (A POSIX NFA is another story that we'll get to shortly.)




On the other hand, a DFA finds the longer oneselfsufficient. An NFA would also
find that match if the initial (self)? were to somehow go unmatched, as that
would leave (selfsufficient)? then able to match. A Traditional NFA doesn't do
that, but the DFA finds it nevertheless, since it's the longest possible match
available to the current attempt. It can do this because it keeps track of all
matches simultaneously, and knows at all times about all possible matches.
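
A small illustrative check, using Python's re module (a backtracking engine in the Traditional NFA family), bears this out:

    import re

    # The greedy (self)? grabs 'self', leaving (selfsufficient)? to fail
    # harmlessly against 'sufficient'; the shorter match is reported at once.
    m = re.match(r'one(self)?(selfsufficient)?', 'oneselfsufficient')
    print(m.group(0))   # prints: oneself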




I chose this silly example because it's easy to talk about, but I want you to realize
that this issue is important in real life. For example, consider trying to match
continuation lines. It's not uncommon for a data specification to allow one logical line
to extend across multiple real lines if the real lines end with a backslash before the
newline. As an example, consider the following:




     SRC= builtin.c eval.c field.c gawkmisc.c io.c main.c \
missing.c msg.c node.c re.c version.c




You might normally want to use ^\w+=.* to match this kind of "var = value"
assignment line, but this regex doesn't consider the continuation lines. (I'm assuming
for the example that the tool's dot won't match a newline.) To match continuation
lines, you might consider appending (\\\n.*)* to the regex, yielding
^\w+=.*(\\\n.*)*. Ostensibly, this says that any number of additional logical
lines are allowed so long as they each follow an escaped newline. This seems
reasonable, but it will never work with a traditional NFA. By the time the original .*
has reached the newline, it has already passed the backslash, and nothing in what was
added forces it to backtrack (see Section 4.2.4.2). Yet, a DFA finds the longer
multiline match if available, simply because it is, indeed, the longest.
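
A brief illustrative check of this failure, again using Python's backtracking re module as a stand-in for a Traditional NFA (its dot does not match a newline by default, matching the assumption above); the sample text is the SRC= line shown earlier:

    import re

    data = ("SRC= builtin.c eval.c field.c gawkmisc.c io.c main.c \\\n"
            "missing.c msg.c node.c re.c version.c\n")

    # The greedy .* consumes the whole first line, backslash included.
    # (\\\n.*)* then matches zero times, so the attempt simply succeeds
    # without ever reaching the continuation line.
    m = re.match(r'\w+=.*(\\\n.*)*', data)
    print(repr(m.group(0)))   # only the first physical line is matched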




If you have lazy quantifiers available, you might consider using them, such as with
^\w+=.*?(\\\n.*?)*. This allows the escaped newline part to be tested each time
before the first dot actually matches anything, so the thought is that the \\ then
gets to match the backslash before the newline. Again, this won't work. A
lazy quantifier actually ends up matching something optional only when forced to
do so, but in this case, everything after the = is optional, so there's nothing to
force the lazy quantifiers to match anything. Our lazy example matches only
'SRC=', so it's certainly not the answer.
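
The lazy variant behaves just as described; nothing forces the lazy quantifiers to consume anything, so the match stops right after the equals sign (same illustrative Python setup as above):

    import re

    data = ("SRC= builtin.c eval.c field.c gawkmisc.c io.c main.c \\\n"
            "missing.c msg.c node.c re.c version.c\n")

    # .*? happily matches nothing, and (\\\n.*?)* finds no backslash at
    # the current position, so the overall match is just 'SRC='.
    m = re.match(r'\w+=.*?(\\\n.*?)*', data)
    print(m.group(0))   # prints: SRC=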




There are other approaches to solving this problem; we'll continue with this example
in the next chapter (see Section 5.1).





4.6.2 POSIX and the Longest-Leftmost Rule





The POSIX standard requires that if you have multiple possible matches that start at
the same position, the one matching the most text must be the one returned.




The POSIX standard document uses the phrase "longest of the leftmost." It doesn't
say you have to use a DFA, so if you want to use an NFA when creating a POSIX
tool, what's a programmer to do? If you want to implement a POSIX NFA, you'd have
to find the full oneselfsufficient and all the continuation lines, despite these
results being "unnatural" to your NFA.




A Traditional NFA engine stops with the first match it finds, but what if it were to
continue to try options (states) that might remain? Each time it reached the end of
the regex, it would have another plausible match. By the time all options are
exhausted, it could simply report the longest of the plausible matches it had
found. Thus, a POSIX NFA.




An NFA applied to the first example would, in matching (self)?, have saved an
option noting that it could pick up matching one(self)?(selfsufficient)? at the
point in oneselfsufficient just after one. Even after finding the oneself that a
Traditional NFA stops at, a POSIX NFA continues to exhaustively check the remaining
options, eventually realizing that yes, there is a way to match the longer (and in
fact, longest) oneselfsufficient.
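
This "keep going and report the longest" idea can be mimicked by hand for the toy example. The sketch below is not a real POSIX engine; it merely enumerates the ways the optional groups could participate and keeps the longest result, which is what a POSIX NFA's exhaustive search eventually arrives at:

    import re

    text = 'oneselfsufficient'

    # Each variant forces the two optional groups present or absent.
    variants = [
        r'one(self)(selfsufficient)',
        r'one(self)',
        r'one(selfsufficient)',
        r'one',
    ]
    matches = [m.group(0)
               for m in (re.match(v, text) for v in variants) if m]
    print(max(matches, key=len))   # prints: oneselfsufficient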




In Chapter 7, we'll see a method to trick Perl into mimicking POSIX semantics, having it report the longest match (see Section 7.8.3).





4.6.3 Speed and Efficiency





If efficiency is an issue with a Traditional NFA (and with backtracking, believe me,
it can be), it is doubly so with a POSIX NFA since there can be so much more backtracking.
A POSIX NFA engine really does have to try every possible permutation of
the regex, every time. Examples in Chapter 6 show that poorly written regexes can
suffer extremely severe performance penalties.





4.6.3.1 DFA efficiency





The text-directed DFA is a really fantastic way around all the inefficiency of backtracking.
It gets its matching speed by keeping track of all possible ongoing
matches at once. How does it achieve this magic?




The DFA engine spends extra time and memory when it first sees the regular
expression, before any match attempts are made, to analyze the regular expression
more thoroughly (and in a different way) than an NFA does. Once it starts actually
attempting a match, it has an internal map describing "If I read such-and-such a
character now, it will be part of this-and-that possible match." As each character of
the string is checked, the engine simply follows the map.
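
A toy sketch of what such a map might look like, hand-built for the simple pattern ab*c (the states and transitions here are my own illustration, not any real engine's internal form):

    # DFA "map" for ab*c, applied to the start of the text: one table
    # lookup per character, with no backtracking.
    TRANSITIONS = {
        (0, 'a'): 1,   # 'a' begins a potential match
        (1, 'b'): 1,   # any number of 'b's stays in state 1
        (1, 'c'): 2,   # 'c' completes the match
    }
    ACCEPTING = {2}

    def dfa_match_at_start(text):
        state = 0
        for ch in text:
            if state in ACCEPTING:
                break
            state = TRANSITIONS.get((state, ch))
            if state is None:
                return False
        return state in ACCEPTING

    print(dfa_match_at_start('abbbc'))   # True
    print(dfa_match_at_start('abx'))     # False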




Building that map can sometimes take a fair amount of time and memory, but
once it is done for any particular regular expression, the results can be applied to
an unlimited amount of text. It's sort of like charging the batteries of your electric
car. First, your car sits in the garage for a while, plugged into the wall, but when
you actually use it, you get consistent, clean power.





NFA: Theory Versus Reality





The true mathematical and computational meaning of "NFA" is different from
what is commonly called an "NFA regex engine." In theory, NFA and DFA
engines should match exactly the same text and have exactly the same features.
In practice, the desire for richer, more expressive regular expressions
has caused their semantics to diverge. An example is the support for
backreferences.




The design of a DFA engine precludes backreferences, but it's a relatively
small task to add backreference support to a true (mathematically speaking) NFA
engine. In doing so, you create a more powerful tool, but you also make it
decidedly nonregular (mathematically speaking). What does this mean? At most,
that you should probably stop calling it an NFA, and start using the phrase
"nonregular expressions," since that describes (mathematically speaking) the new
situation. No one has actually done this, so the name "NFA" has lingered,
even though the implementation is no longer (mathematically speaking) an NFA.




What does all this mean to you, as a user? Absolutely nothing. As a user, you
don't care if it's regular, nonregular, unregular, irregular, or incontinent. So
long as you know what you can expect from it (something this chapter
shows you), you know all you need to care about.




For those wishing to learn more about the theory of regular expressions, the
classic computer-science text is chapter 3 of Aho, Sethi, and Ullman's Compilers:
Principles, Techniques, and Tools (Addison-Wesley, 1986), commonly called "The
Dragon Book" due to the cover design. More specifically, this is
the "red dragon." The "green dragon" is its predecessor, Aho and Ullman's
Principles of Compiler Design.







The work done when a regex is first seen (the once-per-regex overhead) is called
compiling the regex. The map-building is what a DFA does. An NFA also builds an
internal representation of the regex, but an NFA's representation is like a mini program
that the engine then executes.





4.6.4 Summary: NFA and DFA in Comparison





Both DFA and NFA engines have their good and bad points.





4.6.4.1 DFA versus NFA: Differences in the pre-use compile





Before applying a regex to a search, both types of engines compile the regex to an
internal form suited to their respective match algorithms. An NFA compile is generally
faster, and requires less memory. There's no real difference between a Traditional
and POSIX NFA compile.





4.6.4.2 DFA versus NFA: Differences in match speed





For simple literal-match tests in "normal" situations, both types match at about the
same rate. A DFA's match speed is generally unrelated to the particular regex, but
an NFA's is directly related.




A Traditional NFA must try every possible permutation of the regex before it can
conclude that there's no match. This is why I spend an entire chapter (Chapter 6)
on techniques to write NFA expressions that match quickly. As we'll see, an NFA
match can sometimes take forever. If it's a Traditional NFA, it can at least stop if
and when it finds a match.
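
As a cautionary illustration of just how long "forever" can be, the nested-quantifier pattern below is a standard example of the problem (chosen for this note, not taken from this section): when the match must fail, a backtracking engine tries an exponential number of ways to divide up the a's before giving up.

    import re
    import time

    subject = 'a' * 24 + 'b'     # the trailing 'b' guarantees failure

    start = time.perf_counter()
    print(re.match(r'(a+)+$', subject))                 # prints: None
    print(f"{time.perf_counter() - start:.2f} seconds")
    # Each additional 'a' roughly doubles the time taken to fail.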




A POSIX NFA, on the other hand, must always try every possible permutation of the
regex to ensure that it has found the longest possible match, so it generally takes
the same (possibly very long) amount of time to complete a successful match as it
does to confirm a failure. Writing efficient expressions is doubly important for a
POSIX NFA.




In one sense, I speak a bit too strongly, since optimizations can often reduce the
work needed to return an answer. We've already seen that an optimized engine
doesn't try ^-anchored regexes beyond the start of the string (see Section 4.2.3),
and we'll see many more optimizations in Chapter 6.




The need for optimizations is less pressing with a DFA since its matching is so fast
to begin with, but for the most part, the extra work done during the DFA's pre-use
compile affords better optimizations than most NFA engines take the trouble to do.




Modern DFA engines often try to reduce the time and memory used during the
compile by postponing some work until a match is attempted. Often, much of the
compile-time work goes unused because of the nature of the text actually
checked. A fair amount of time and memory can sometimes be saved by postponing
the work until it's actually needed during the match. (The technobabble term
for this is lazy evaluation.) It does, however, create cases where there can be a
relationship among the regex, the text being checked, and the match speed.





4.6.4.3 DFA versus NFA: Differences in what is matched





A DFA (or anything POSIX) finds the longest leftmost match. A Traditional NFA
might also, or it might find something else. Any individual engine always treats the
same regex/text combination in the same way, so in that sense, it's not "random,"
but other NFA engines may decide to do slightly different things. Virtually all
Traditional NFA engines I've seen work exactly the way I've described here, but it's not
something absolutely guaranteed by any standard.





4.6.4.4 DFA versus NFA: Differences in capabilities





An NFA engine can support many things that a DFA cannot. Among them are:







Capturing text matched by a parenthesized subexpression. Related features are
backreferences and after-match information saying where in the matched text
each parenthesized subexpression matched.








Lookaround, and other complex zero-width assertions (see Section 3.4.3.6).







[9] lex has trailing context, which is exactly the same thing as zero-width positive lookahead at the end of the regex, but it can't be generalized and put to use for embedded lookahead.








Non-greedy quantifiers and ordered alternation. A DFA could easily support a
guaranteed shortest overall match (although for whatever reason, this option
never seems to be made available to the user), but it cannot implement the
local laziness and ordered alternation that we've talked about. (A short
demonstration follows this list.)







Possessive quantifiers (see Section 3.4.5.8) and atomic grouping (see Section 3.4.5.3).
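
Two of these differences are easy to observe with a backtracking engine such as Python's re module (a brief illustrative aside; the snippets are mine, not the book's):

    import re

    # Ordered alternation: the first alternative that can match wins,
    # even though a longer overall match ('ab') exists.
    print(re.match(r'a|ab', 'ab').group(0))     # prints: a

    # Capturing: after-match information about each parenthesized
    # subexpression, including where in the text it matched.
    m = re.match(r'(\w+)=(\w+)', 'var=value')
    print(m.group(1), m.span(2))                # prints: var (4, 9)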






4.6.4.5 DFA versus NFA: Differences in ease of implementation





Although they have limitations, simple versions of DFA and NFA engines are easy
enough to understand and to implement. The desire for efficiency (both in time
and memory) and enhanced features drives the implementation to greater and
greater complexity.




With code length as a metric, consider that the NFA regex support in the Version 7
(January 1979) edition of ed was less than 350 lines of C code. (For that matter, the
entire source for grep was a scant 478 lines.) Henry Spencer's 1986 freely available
implementation of the Version 8 regex routines was almost 1,900 lines of C,
and Tom Lord's 1992 POSIX NFA package rx (used in GNU sed, among other tools)
is a stunning 9,700 lines.




For DFA implementations, the Version 7 egrep regex engine was a bit over 400
lines long, while Henry Spencer's 1992 full-featured POSIX DFA package is over
4,500 lines long.




To provide the best of both worlds, GNU egrep Version 2.4.2 utilizes two fully
functional engines (about 8,900 lines of code); Tcl takes a different route with its
hybrid DFA/NFA engine (see the sidebar "DFA Speed with NFA Capabilities: Regex
Nirvana?").

Some implementations are simple, but that doesn't necessarily mean they are short
on features. I once wanted to use regular expressions for some text processing in
Pascal. I hadn't used Pascal since college, but it still didn't take long to write a simple
NFA regex engine. It didn't have a lot of bells and whistles, and wasn't built for
maximum speed, but the flavor was relatively full-featured and was quite useful.





DFA Speed with NFA Capabilities: Regex Nirvana?





I've said several times that a DFA can't provide capturing parentheses or
backreferences. This is quite true, but it certainly doesn't preclude hybrid
approaches that mix technologies in an attempt to reach regex nirvana. The
sidebar in Section 4.6.3.1 told how NFAs have diverged from the theoretical
straight and narrow in search of more power, and it's only natural that the
same happens with DFAs. A DFA's construction makes it more difficult, but
that doesn't mean impossible.




GNU grep takes a simple but effective approach. It uses a DFA when possible,
reverting to an NFA when backreferences are used. GNU awk does something
similar: it uses GNU grep's fast shortest-leftmost DFA engine for simple "does it
match" checks, and reverts to a different engine for checks where the
actual extent of the match must be known. Since that other engine is an NFA,
GNU awk can conveniently offer capturing parentheses, and it does via its
special gensub function.




Tcl's regex engine is a true hybrid, custom built by Henry Spencer (whom
you may remember having played an important part in the early development
and popularization of regular expressions; see Section 3.1.1.6). The Tcl engine
sometimes appears to be an NFA: it has lookaround, capturing parentheses,
backreferences, and lazy quantifiers. Yet, it has true POSIX longest-leftmost match
(see Chapter 6). It really seems quite wonderful.




Currently, this engine is available only to Tcl, but Henry tells me that it's on
his to-do list to break it out into a separate package that can be used by
others.







