Mastering Regular Expressions (2nd Edition) [Electronic resources] نسخه متنی

4.3 Regex-Directed Versus Text-Directed

The two basic engine types reflect a fundamental difference in algorithms available
for applying a regular expression to a string. I call the gasoline-driven NFA engine
"regex-directed," and the electric-driven DFA "text-directed."

4.3.1 NFA Engine: Regex-Directed

Let's consider one way an engine might match to(nite|knight|night) against
the text '···tonight···' . Starting with the t, the regular expression is examined one
component at a time, and the "current text" is checked to see whether it is
matched by the current component of the regex. If it does, the next component is
checked, and so on, until all components have matched, indicating that an overall
match has been achieved.

With the to(nite;knight;night) example, the first component is t, which repeatedly fails until a 't' is reached in the target string. Once that happens, the o
is checked against the next character, and if it matches, control moves to the next
component. In this case, the "next component" is (nite|knight|night) which
really means " nite or knight or night. " Faced with three possibilities, the engine just tries each in turn. We (humans with advanced neural nets between our ears) can see that if we're matching tonight, the third alternative is the one that leads to a match. Despite their brainy origins (see Section 3.1.1)), a regex-directed engine can't come to that conclusion until actually going through the motions to check.

Attempting the first alternative, nite, involves the same component-at-a-time
treatment as before: " Try to match n, then i, then t, and finally e. " If this fails,
as it eventually does, the engine tries another alternative, and so on until it
achieves a match or must report failure. Control moves within the regex from component
to component, so I call it "regex-directed."

4.3.1.1 The control benefits of an NFA engine

In essence, each subexpression of a regex in a regex-directed match is checked
independently of the others. Other than backreferences, there's no interrelation
among subexpressions, except for the relation implied by virtue of being thrown
together to make a larger expression. The layout of the subexpressions and regex
control structures (e.g., alternation, parentheses, and quantifiers) controls an
engine's overall movement through a match.

Since the regex directs the NFA engine, the driver (the writer of the regular expression)
has considerable opportunity to craft just what he or she wants to happen.
(Chapters 5 and 6 show how to put this to use to get a job done correctly and efficiently.) What this really means may seem vague now, but it will all be spelled out soon.

4.3.2 DFA Engine: Text-Directed

Contrast the regex-directed NFA engine with an engine that, while scanning the
string, keeps track of all matches "currently in the works." In the tonight example,
the moment the engine hits t, it adds a potential match to its list of those currently
in progress:

in string in regex

after ···tonight···

possible matches: to(nite|knight|night).

Each subsequent character scanned updates the list of possible matches. After a
few more characters are matched, the situation becomes

in string in regex

after ···tonight···

possible matches: to(nite|knight|night)

with two possible matches in the works (and one alternative, knight, ruled out).
With the g that follows, only the third alternative remains viable. Once the h and t are scanned as well, the engine realizes it has a complete match and can return
success.

I call this "text-directed" matching because each character scanned from the text
controls the engine. As in the example, a partial match might be the start of any
number of different, yet possible, matches. Matches that are no longer viable are
pruned as subsequent characters are scanned. There are even situations where a
"partial match in progress" is also a full match. If the regex were to(···)?, for
example, the parenthesized expression becomes optional, but it's still greedy, so
it's always attempted. All the time that a partial match is in progress inside those
parentheses, a full match (of 'to') is already confirmed and in reserve in case the
longer matches don't pan out.

If the engine reaches a character in the text that invalidates all the matches in the
works, it must revert to one of the full matches in reserve. If there are none, it
must declare that there are no matches at the current attempt's starting point.

4.3.3 First Thoughts: NFA and DFA in Comparison

If you compare these two engines based only on what I've mentioned so far, you
might conclude that the text-directed DFA engine is generally faster. The regexdir
ected NFA engine might waste time attempting to match different subexpressions
against the same text (such as the three alternatives in the example).

You would be right. During the course of an NFA match, the same character of the
target might be checked by many different parts of the regex (or even by the same
part, over and over). Even if a subexpression can match, it might have to be
applied again (and again and again) as it works in concert with the rest of the
regex to find a match. A local subexpression can fail or match, but you just never
know about the overall match until you eventually work your way to the end of
the regex. (If I could find a way to include "It's not over until the fat lady sings." in
this paragraph, I would.) On the other hand, a DFA engine is deterministic each character in the target is checked once (at most). When a character matches, you
don't know yet if it will be part of the final match (it could be part of a possible
match that doesn't pan out), but since the engine keeps track of all possible
matches in parallel, it needs to be checked only once, period.

The two basic technologies behind regular-expression engines have the somewhat
imposing names Nondeterministic Finite Automaton (NFA) and Deterministic Finite Automaton (DFA). With mouthfuls like this, you see why I stick to just "NFA" and "DFA." We won't be seeing these phrases spelled out again.[5]

[5] I suppose I could explain the underlying theory that goes into these names, if I only knew it! As I hinted, the word deterministic is pretty important, but for the most part the theory is not relevant, so long as we understand the practical effects. By the end of this chapter, we will.

4.3.3.1 Consequences to us as users

Because of the regex-directed nature of an NFA, the details of how the engine
attempts a match are very important. As I said before, the writer can exercise a fair
amount of control simply by changing how the regex is written. With the tonight
example, perhaps less work would have been wasted had the regex been written
differently, such as in one of the following ways:

to(ni(ght|te)|knight)

tonite|toknight|tonight

to(k?night|nite)

With any given text, these all end up matching exactly the same thing, but in
doing so direct the engine in different ways. At this point, we don't know enough
to judge which of these, if any, are better than the others, but that's coming soon.

It's the exact opposite with a DFA since the engine keeps track of all matches
simultaneously, none of these differences in representation matter so long as in
the end they all represent the same set of possible matches. There could be a hundr
ed different ways to achieve the same result, but since the DFA keeps track of
them all simultaneously (almost magically more on this later), it doesn't matter
which form the regex takes. To a pure DFA, even expressions that appear as different
as abc and [aa-a](b|b{1}|b)c are utterly indistinguishable.

Three things come to my mind when describing a DFA engine:

DFA matching is very fast.

DFA matching is very consistent.

Talking about DFA matching is very boring.

I'll eventually expand on all these points.

The regex-directed nature of an NFA makes it interesting to talk about. NFAs provide
plenty of room for creative juices to flow. There are great benefits in crafting
an expression well, and even greater penalties for doing it poorly. A gasoline
engine is not the only engine that can stall and conk out completely. To get to the
bottom of this, we need to look at the essence of an NFA engine: backtracking.

Mastering Regular Expressions (2nd Edition) [Electronic resources] نسخه متنی

فارسی

کردی

العربیه

اردو

Türkçe

Русский

English

Français

کانال فیلم من

تبیان من

فایلهای من

کتابخانه من

پنل پیامکی

وبلاگ من

اینجــــا یک کتابخانه دیجیتالی است

با بیش از 100000 منبع الکترونیکی رایگان به زبان فارسی ، عربی و انگلیسی