Mastering Regular Expressions (2nd Edition) [Electronic resources] نسخه متنی

اینجــــا یک کتابخانه دیجیتالی است

با بیش از 100000 منبع الکترونیکی رایگان به زبان فارسی ، عربی و انگلیسی

Mastering Regular Expressions (2nd Edition) [Electronic resources] - نسخه متنی

Jeffrey E. F. Friedl

| نمايش فراداده ، افزودن یک نقد و بررسی
افزودن به کتابخانه شخصی
ارسال به دوستان
جستجو در متن کتاب
بیشتر
تنظیمات قلم

فونت

اندازه قلم

+ - پیش فرض

حالت نمایش

روز نیمروز شب
جستجو در لغت نامه
بیشتر
لیست موضوعات
توضیحات
افزودن یادداشت جدید










4.1 Start Your Engines!


Let's see how much I can milk this engine analogy. The whole point of having an
engine is so that you can get from Point A to Point B without doing much work.
The engine does the work for you so you can relax and enjoy the sound system.
The engine's primary task is to turn the wheels, and how it does that isn't really a
concern of yours. Or is it?


4.1.1 Two Kinds of Engines


Well, what if you had an electric car? They've been around for a long time, but
they aren't as common as gas cars because they're hard to design well. If you had
one, though, you would have to remember not to put gas in it. If you had a gasoline
engine, well, watch out for sparks! An electric engine more or less just runs,
but a gas engine might need some babysitting. You can get much better performance
just by changing little things like your spark plug gaps, air filter, or brand of
gas. Do it wrong and the engine's performance deteriorates, or, worse yet, it stalls.

Each engine might do its work differently, but the end result is that the wheels
turn. You still have to steer properly if you want to get anywhere, but that's an
entirely different issue.


4.1.2 New Standards


Let's stoke the fire by adding another variable: the California Emissions Standards.[1] Some engines adhere to California's strict pollution standards, and some engines
don't. These aren't really different kinds of engines, just new variations on what's
already around. The standard regulates a result of the engine's work, the emissions,
but doesn't say anything about how the engine should go about achieving
those cleaner results. So, our two classes of engine are divided into four types:
electric (adhering and non-adhering) and gasoline (adhering and non-adhering).

[1] California has rather strict standards regulating the amount of pollution a car can produce. Because of this, many cars sold in America come in "California" and "non-California" models.


Come to think of it, I bet that an electric engine can qualify for the standard without
much changethe standard just "blesses" the clean results that are already par
for the course. The gas engine, on the other hand, needs some major tweaking
and a bit of re-tooling before it can qualify. Owners of this kind of engine need to
pay particular care to what they feed ituse the wrong kind of gas and you're in
big trouble.


4.1.2.1 The impact of standards


Better pollution standards are a good thing, but they require that the driver exercise
more thought and foresight (well, at least for gas engines). Frankly, however,
the standard doesn't impact most people since all the other states still do their own
thing and don't follow California's standard.

So, you realize that these four types of engines can be classified into three groups
(the two kinds for gas, and electric in general). You know about the differences,
and that in the end they all still turn the wheels. What you don't know is what the
heck this has to do with regular expressions! More than you might imagine.


4.1.3 Regex Engine Types


There are two fundamentally different types of regex engines: one called "DFA"
(the electric engine of our story) and one called "NFA" (the gas engine). The
details of what NFA and DFA mean follow shortly (see Section 4.3.3), but for now just consider them names, like Bill and Ted. Or electric and gas.

Both engine types have been around for a long time, but like its gasoline counterpart,
the NFA type seems to be used more often. Tools that use an NFA engine
include the .NET languages, Ruby, Perl, Python, GNU Emacs, ed, sed, PHP, vi, most
versions of grep, and even a few versions of egrep and awk. On the other hand, a
DFA engine is found in almost all versions of egrep and awk, as well as lex and
flex . Some systems have a multi-engine hybrid system, using the most appropriate
engine for the job (or even one that swaps between engines for different parts of
the same regex, as needed to get the best combination of features and speed). Table 4-1 lists a few common programs and the regex engine that most versions use.
If your favorite program is not in the list, the section Testing the Engine Type can help you find out which it is.

Table 1. Some Tools and Their Regex Engines

Engine typePrograms
DFAawk (most versions), egrep (most versions), flex, lex, MySQL, Procmail
Traditional NFAGNU Emacs, Java, grep (most versions), less, more, .NET languages, PCRE library, Perl, PHP (pcre routines), Python, Ruby, sed (most versions), vi
POSIX NFA

mawk, Mortice Kern Systems' utilities, GNU Emacs (when requested)

Hybrid NFA/DFAGNU awk, GNU grep / egrep, Tcl

As Chapter 3 illustrated, 20 years of development with both DFAs and NFAs
resulted in a lot of needless variety. Things were dirty. The POSIX standard came
in to clean things up by clearly specifying not only which metacharacters and featur
es an engine should support, as mentioned in the previous chapter, but also
exactly the results you could expect from them. Superficial details aside, the DFAs
(our electric engines) were already well suited to adhere to this new standard, but
the kind of results an NFA traditionally provided were different, so changes were
needed. As a result, broadly speaking, there are three types of regex engines:

DFA (POSIX or notsimilar either way)

Traditional NFA (most common: Perl, .NET, Java, Python, . . . )

POSIX NFA


Here, we use "POSIX" to refer to the match semanticsthe expected operation of a regex that the POSIX standard specifies (which we'll get to later in this chapter).
Don't confuse this use of "POSIX" with uses surrounding regex features introduced in that same standard (see Section 3.4.2.7). Many programs support the features without supporting the full match semantics.

Old (and minimally featured) programs like egrep, awk, and lex were normally
implemented with the electric DFA engine, so the new standard primarily just con-
firmed the status quo no big changes. However, there were some gas-powered versions of these programs which had to be changed if they wanted to be POSIXcompliant.
The gas engines that passed the California Emission Standards tests
(POSIX NFA) were fine in that they produced results according to the standard, but
the necessary changes only increased how fickle they were to proper tuning.
Where before you might get by with slightly misaligned spark plugs, you now
have absolutely no tolerance. Gasoline that used to be "good enough" now causes
knocks and pings. But, so long as you know how to maintain your baby, the
engine runs smoothly and cleanly.


4.1.4 From the Department of Redundancy Department


At this point, I'll ask you to go back and review the story about engines. Every
sentence there rings with some truth about regular expressions. A second reading
should raise some questions. Particularly, what does it mean that an electric DFA
regex engine more or less "just runs?" What affects a gas-powered NFA? How can I
tune my regular expressions to run as I want on an NFA? What special concerns
does an emissions-controlled POSIX NFA have? What's a "stalled engine" in the
regex world?


4.1.5 Testing the Engine Type


Because the type of engine used in a tool influences the type of features it can
offer, and how those features appear to work, we can often learn the type of
engine a tool has merely by checking to see how it handles a few test expressions.
(After all, if you can't tell the difference, the difference doesn't matter.) At this
point in the book, I wouldn't expect you to understand why the following test
results indicate what they do, but I want to offer these tests now so that if your
favorite tool is not listed in Table 4-1, you can investigate before continuing with this and the subsequent chapters.


4.1.5.1 Traditional NFA or not?


The most commonly used engine is a Traditional NFA, and spotting it is easy. First,
are lazy quantifiers (see Section 3.4.5.8) supported? If so, it's almost certainly a Traditional NFA.
As we'll see, lazy quantifiers are not possible with a DFA, nor would they make
any sense with a POSIX NFA. However, to make sure, simply apply the regex
nfa|nfa•not
to the string 'nfa•not'if only 'nfa' matches, it's a Traditional NFA.
If the entire 'nfa•not' matches, it's either a POSIX NFA or a DFA.


4.1.5.2 DFA or POSIX NFA?


Differentiating between a POSIX NFA and a DFA is sometimes just as simple. Capturing
parentheses and backreferences are not supported by a DFA, so that can be
one hint, but there are systems that are a hybrid mix between the two engine
types, and so may end up using a DFA if there are no capturing parentheses.

Here's a simple test that can tell you a lot. Apply
X(.+)+X
to a string like '=XX======================', as with this egrep command:

    echo =XX========================================= | egrep 'X(.+)+X'

If it takes a long time to finish, it's an NFA (and if not a Traditional NFA as per the
test in the previous section, it must be a POSIX NFA). If it finishes quickly, it's either
a DFA or an NFA with some advanced optimization. Does it display a warning message
about a stack overflow or long match aborted? If so, it's an NFA.


/ 83