Mastering Regular Expressions (2nd Edition) [Electronic resources] (text version)


Jeffrey E. F. Friedl


7.3 Regex-Related Perlisms



A variety of general Perl concepts pertain to our study of regular expressions. The
next few sections discuss:



Context. An important concept in Perl is that many functions and operators
respond to the context they're used in. For example, Perl expects a scalar
value as the conditional of a while loop, but a list of values as the arguments
to a print statement. Since Perl allows expressions to "respond" to the context
in which they're used, identical expressions in each case might produce
wildly different results.



Dynamic Scope. Most programming languages support the concept of local
and global variables, but Perl provides an additional twist with something
known as dynamic scoping. Dynamic scoping temporarily "protects" a global
variable by saving a copy of its value and automatically restoring it later. It's
an intriguing concept that's important for us because it affects $1 and other
match-related variables.




7.3.1 Expression Context



The notion of context is important throughout Perl, and in particular, to the match
operator. An expression might find itself in one of three contexts, list, scalar, or void, indicating the type of value expected from the expression. Not surprisingly, a
list context is one where a list of values is expected of an expression. A scalar
context is one where a single value is expected. These two are very common and
of great interest to our use of regular expressions. Void context is one in which no
value is expected.


Consider the two assignments:


    $s = expression one;
    @a = expression two;


Because $s is a simple scalar variable (it holds a single value, not a list), it expects
a simple scalar value, so the first expression, whatever it may be, finds itself in a
scalar context. Similarly, because @a is an array variable and expects a list of values,
the second expression finds itself in a list context. Even though the two
expressions might be exactly the same, they might return completely different values,
and cause completely different side effects while they're at it. Exactly what
happens depends on each expression.


For example, the localtime function, if used in a list context, returns a list of values
representing the current year, month, date, hour, etc. But if used in a scalar
context, it returns a textual version of the current time along the lines of
'Mon Jan 20 22:05:15 2003'.


As another example, an I/O operator such as <MYDATA> returns the next line of the
file in a scalar context, but returns a list of all (remaining) lines in a list context.


Like localtime and the I/O operator, many Perl constructs respond to their context.
The regex operators do as well: the match operator m/···/, for example,
sometimes returns a simple true/false value, and sometimes a list of certain match
results. All the details are found later in this chapter.
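
As a small illustration of the point above, the following sketch puts the same expressions into scalar and list context. The sample string and variable names are made up for the example, and the exact text that localtime produces depends on when and where it is run.

```perl
use strict;
use warnings;

# Scalar context: localtime returns one human-readable string.
my $stamp = localtime;

# List context: localtime returns a list of numeric fields.
my @fields = localtime;

# The match operator responds to context as well.
my $text  = "width=80";
my $ok    = $text =~ m/(\w+)=(\d+)/;   # scalar context: true/false
my @parts = $text =~ m/(\w+)=(\d+)/;   # list context: the captured substrings

print "scalar localtime: $stamp\n";
print "list localtime has ", scalar(@fields), " fields\n";
print "match succeeded\n" if $ok;
print "captured: @parts\n";
```

The expressions on the right-hand side are identical in each pair; only the kind of variable being assigned to differs.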



7.3.1.1 Contorting an expression



Not all expressions are natively context-sensitive, so Perl has rules about what
happens when a general expression is used in a context that doesn't exactly match
the type of value the expression normally returns. To make the square peg fit into
a round hole, Perl "contorts" the value to make it fit. If a scalar value is returned in
a list context, Perl makes a list containing the single value on the fly. Thus,
@a = 42 is the same as @a = (42).


On the other hand, there's no general rule for converting a list to a scalar. If a literal
list is given, such as with


     $var = ($this, &is, 0xA, 'list');


the comma-operator returns the last element, 'list', for $var. If an array is
given, as with $var = @array, the length of the array is returned.
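
The three conversions just described can be tried directly. This is only a sketch with made-up values; the "void" warnings are switched off because the literal list deliberately throws away all but its last element.

```perl
use strict;
use warnings;
no warnings 'void';   # the literal list below discards values on purpose

my @a     = 42;                          # scalar in list context: becomes (42)
my @array = ('x', 'y', 'z');

my $last  = ('this', 'is', 0xA, 'list'); # comma operator: last element
my $count = @array;                      # array in scalar context: its length

print "\@a holds ", scalar(@a), " element: $a[0]\n";
print "last = $last, count = $count\n";
```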


Some words used to describe how other languages deal with this issue are cast,
promote, coerce, and convert, but I feel they are a bit too consistent (boring?) to describe Perl's attitude in this respect, so I use "contort."



7.3.2 Dynamic Scope and Regex Match Effects



Perl's two types of storage (global and private variables) and its concept of
dynamic scoping are important to understand in their own right, but are of particular
interest to our study of regular expressions because of how after-match information
is made available to the rest of the program. The next sections describe
these concepts, and their relation to regular expressions.



7.3.2.1 Global and private variables



On a broad scale, Perl offers two types of variables: global and private. Private
variables are declared using my(···). Global variables are not declared, but just pop
into existence when you use them. Global variables are always visible from anywhere
and everywhere within the program, while private variables are visible, lexically,
only to the end of their enclosing block. That is, the only Perl code that can
directly access the private variable is the code that falls between the my declaration
and the end of the block of code that encloses the my.


The use of global variables is normally discouraged, except for special cases, such
as the myriad of special variables like $1, $_, and @ARGV. Regular user variables
are global unless declared with my, even if they might "look" private. Perl allows
the names of global variables to be partitioned into groups called packages, but
the variables are still global. A global variable $Debug within the package
Acme::Widget has a fully qualified name of $Acme::Widget::Debug, but no
matter how it's referenced, it's still the same global variable. If you "use strict;",
all (non-special) globals must either be referenced via fully-qualified names, or via
a name declared with our (our declares a name, not a new variable; see the Perl
documentation for details).
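
A minimal sketch of these rules follows. The Acme::Widget package here is a stand-in written just for the example, and debug_state is a hypothetical helper, not part of any real module.

```perl
use strict;
use warnings;

package Acme::Widget;
our $Debug = 0;      # declares the short name $Debug for $Acme::Widget::Debug

sub debug_state { return $Debug ? "on" : "off" }   # hypothetical helper

package main;

$Acme::Widget::Debug = 1;                  # fully qualified name, same global
my $state = Acme::Widget::debug_state();   # the package sees the change

{
    my $private = "visible only inside this block";
    print "$private\n";
}
# $private is out of scope here; under strict, naming it would not compile.

print "debugging is $state\n";
```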



7.3.2.2 Dynamically scoped values



Dynamic scoping is an interesting concept that few programming languages provide.
We'll see the relevance to regular expressions soon, but in a nutshell, you
can have Perl save a copy of the value of a global variable that you intend to
modify within a block, and restore the original copy automatically at the time
when the block ends. Saving a copy is called creating a new dynamic scope, or
localizing.


One reason that you might want to do this is to temporarily update some kind of
global state that's maintained in a global variable. Let's say that you're using a
package, Acme::Widget, and it provides a debugging flag via the global variable
$Acme::Widget::Debug. You can temporarily ensure that debugging is turned on
with code like:


    ...
    {
        local($Acme::Widget::Debug) = 1; # Ensure it's turned on
        # work with Acme::Widget while debugging is on
        ...
    }
    # $Acme::Widget::Debug is now back to whatever it had been before
    ...


It's that extremely ill-named function local that creates a new dynamic scope. Let
me say up front that the call to local does not create a new variable. local is an
action, not a declaration. Given a global variable, local does three things:



Saves an internal copy of the variable's value



Copies a new value into the variable (either undef, or a value assigned to the
local)



Slates the variable to have its original value restored when execution runs off
the end of the block enclosing the local




This means that "local" refers only to how long any changes to the variable will
last. The localized value lasts as long as the enclosing block is executing. Even if a
subroutine is called from within that block, the localized value is seen. (After all,
the variable is still a global variable.) The only difference from a non-localized
global variable is that when execution of the enclosing block finally ends, the previous
value is automatically restored.


An automatic save and restore of a global variable's value is pretty much all there
is to local. For all the misunderstanding that has accompanied local, it's no
more complex than the snippet on the right of Table 7-4 illustrates.


As a matter of convenience, you can assign a value to local($SomeVar), which is
exactly the same as assigning to $SomeVar in place of the undef assignment. Also,
the parentheses can be omitted to force a scalar context.
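
The save, modify, and restore behavior can be watched in a few lines. This is a sketch only; $Level and report are hypothetical names chosen for the example.

```perl
use strict;
use warnings;

our $Level = 'quiet';                     # a global variable (hypothetical)

sub report { return "level is $Level" }   # reads the global as it is right now

my @seen;
push @seen, report();            # before the block
{
    local $Level = 'verbose';    # save the old value, install a new one
    push @seen, report();        # the called subroutine sees the new value
}
push @seen, report();            # block exited: old value is back

print "$_\n" for @seen;
```

The subroutine is defined outside the block, yet it sees 'verbose' while the block is executing, because the variable it reads is still the one global.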


As a practical example, consider having to call a function in a poorly written
library that generates a lot of "Use of uninitialized value" warnings. You use Perl's
-w option, as all good Perl programmers should, but the library author apparently
didn't. You are exceedingly annoyed by the warnings, but if you can't change the
library, what can you do short of stop using -w altogether? Well, you could set a
local value of $^W, the in-code debugging flag (the variable name ^W can be
either the two characters, caret and 'W', or an actual control-W character):



    {
        local $^W = 0; # Ensure warnings are off.
        UnrulyFunction(···);
    }
    # Exiting the block restores the original value of $^W.


Table 7-4. The Meaning of local

Normal Perl                             Equivalent Meaning

{                                       {
    local($SomeVar); # save copy            my $TempCopy = $SomeVar;
                                            $SomeVar = undef;
    $SomeVar = 'My Value';                  $SomeVar = 'My Value';
    ...                                     ...
} # Value automatically restored            $SomeVar = $TempCopy;
                                        }


The call to local saves an internal copy of the value of the global variable $^W,
whatever it might be. Then that same $^W receives the new value of zero that we
immediately scribble in. When UnrulyFunction is executing, Perl checks $^W and
sees the zero we wrote, so doesn't issue warnings. When the function returns, our
value of zero is still in effect.


So far, everything appears to work just as if local isn't used. However, when the
block is exited right after the subroutine returns, the original value of $^W is
restored. Your change of the value was local, in time, to the life of the block.
You'd get the same effect by making and restoring a copy yourself, as in Table 7-4,
but local conveniently takes care of it for you.


For completeness, let's consider what happens if I use my instead of local.[4] Using
my creates a new variable with an initially undefined value. It is visible only within
the lexical block it is declared in (that is, visible only by the code written between
the my and the end of the enclosing block). It does not change, modify, or in any
other way refer to or affect other variables, including any global variable of the
same name that might exist. The newly created variable is not visible elsewhere in
the program, including from within UnrulyFunction. In our example snippet, the
new $^W is immediately set to zero but is never again used or referenced, so it's
pretty much a waste of effort. (While executing UnrulyFunction and deciding
whether to issue warnings, Perl checks the unrelated global variable $^W.)
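
Since my cannot be applied to $^W, the comparison can be sketched with an ordinary global instead. The names $Mode and current_mode are hypothetical, standing in for $^W and for the check that Perl makes while UnrulyFunction runs.

```perl
use strict;
use warnings;

our $Mode = 'global';                # hypothetical global standing in for $^W

sub current_mode { return $Mode }    # always consults the global

my ($with_my, $with_local);
{
    my $Mode = 'private';            # a brand-new lexical; the global is untouched
    $with_my = current_mode();
}
{
    local $Mode = 'localized';       # temporarily changes the global itself
    $with_local = current_mode();
}

print "with my:    $with_my\n";
print "with local: $with_local\n";
print "afterward:  ", current_mode(), "\n";
```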



[4] Perl doesn't allow the use of my with this special variable name, so the comparison is only academic.




7.3.2.3 A better analogy: clear transparencies



A useful analogy for local is that it provides a clear transparency (like those used with
an overhead projector) over a variable on which you scribble your own changes.
You (and anyone else that happens to look, such as subroutines and signal handlers)
will see the new values. They shadow the previous value until the point in
time that the block is finally exited. At that point, the transparency is automatically
removed, in effect, removing any changes that might have been made since the
local.


This analogy is actually much closer to reality than saying "an internal copy is
made." Using local doesn't actually make a copy, but instead puts your new
value earlier in the list of those checked whenever a variable's value is accessed
(that is, it shadows the original). Exiting a block removes any shadowing values
added since the block started. Values are added manually, with local, but here's
the whole reason we've been looking at localization: regex side-effect variables have
their values dynamically scoped automatically.



7.3.2.4 Regex side effects and dynamic scoping



What does dynamic scoping have to do with regular expressions? A lot. A number
of variables like $& (refers to the text matched) and $1 (refers to the text matched
by the first parenthesized subexpression) are automatically set as a side effect of a
successful match. They are discussed in detail in the next section. These variables
have their value dynamically scoped automatically upon entry to every block.


To see the benefit of this design choice, realize that each call to a subroutine
involves starting a new block, which means a new dynamic scope is created for
these variables. Because the values before the block are restored when the block
exits (that is, when the subroutine returns), the subroutine can't change the values
that the caller sees.


As an example, consider:


    if ( m/(···)/ )
    {
        DoSomeOtherStuff();
        print "the matched text was $1.\n";
    }


Because the value of $1 is dynamically scoped automatically upon entering each
block, this code snippet neither cares, nor needs to care, whether the function
DoSomeOtherStuff changes the value of $1 or not. Any changes to $1 by the
function are contained within the block that the function defines, or perhaps
within a sub-block of the function. Therefore, they can't affect the value this snippet
sees with the print after the function returns.
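
A runnable version of the same idea, as a sketch: the helper do_some_other_stuff and the sample strings are invented for the example, and the helper deliberately performs its own capturing match.

```perl
use strict;
use warnings;

# A helper that runs its own match, setting its own $1 along the way.
sub do_some_other_stuff {
    my $inner = "id=42";
    $inner =~ m/(\d+)/ or die "no digits found";
    my $digits = $1;                 # "42", as seen inside this subroutine
    return $digits;
}

my ($inner_capture, $outer_capture);
if ("name=Fred" =~ m/=(\w+)/) {
    $inner_capture = do_some_other_stuff();
    $outer_capture = $1;             # still the caller's own capture
}

print "inner saw $inner_capture, outer still sees $outer_capture\n";
```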


The automatic dynamic scoping is helpful even when not so apparent:


    if ($result =~ m/ERROR=(.*)/) {
        warn "Hey, tell $Config{perladmin} about $1!\n";
    }


The standard library module Config defines an associative array %Config, of
which the member $Config{perladmin} holds the email address of the local
Perlmaster. This code could be very surprising if $1 were not automatically
dynamically scoped, because %Config is actually a tied variable. That means any
reference to it involves a behind-the-scenes subroutine call, and the subroutine
within Config that fetches the appropriate value when $Config{···} is used
invokes a regex match. That match lies between your match and your use of $1,
so if $1 were not dynamically scoped, it would be destroyed before you used it.
As it is, any changes in $1 during the $Config{···} processing are safely hidden
by dynamic scoping.



7.3.2.5 Dynamic scoping versus lexical scoping



Dynamic scoping provides many rewards if used effectively, but haphazard
dynamic scoping with local can create a maintenance nightmare, as readers of a
program find it difficult to understand the increasingly complex interactions among
the lexically dispersed uses of local, subroutine calls, and references to localized variables.


As I mentioned, the my(···) declaration creates a private variable with lexical scope. A private variable's lexical scope is the opposite of a global variable's global
scope, but it has little to do with dynamic scoping (except that you can't local
the value of a my variable). Remember, local is just an action, while my is both an action and, importantly, a declaration.



7.3.3 Special Variables Modified by a Match



A successful match or substitution sets a variety of global, read-only variables that
are always automatically dynamically scoped. These values never change if a
match attempt is unsuccessful, and are always set when a match is successful.
When appropriate, they are set to the empty string (a string with no characters in
it), or undefined (a "no value" value, similar to, yet testably distinct from, an
empty string). Table 7-5 shows examples.


In more detail, here are the variables set after a match:




$&

A copy of the text successfully matched by the regex. This variable (along
with $` and $', described next) is best avoided for performance reasons.
(See the discussion in Section 7.9.3.3.) $& is never undefined after a successful match, although it can be an empty string.



Table 7-5. Example Showing After-Match Special Variables

After the match of a regex against the target text 'Pi•is•3.14159,•roughly' (• marks a space), the following special variables are given the values shown.

Variable   Meaning                                          Value
$`         Text before match                                Pi•is•
$&         Text matched                                     3.14159
$'         Text after match                                 ,•roughly
$1         Text matched within 1st set of parentheses       3.14159
$2         Text matched within 2nd set of parentheses       undef
$3         Text matched within 3rd set of parentheses       3.14159
$4         Text matched within 4th set of parentheses       .14159
$+         Text from highest-numbered $1, $2, etc.          .14159
$^N        Text from most recently closed $1, $2, etc.      3.14159
@-         Array of match-start indices into target text    (6, 6, undef, 6, 7)
@+         Array of match-end indices into target text      (13, 13, undef, 13, 13)






$`

A copy of the target text in front of (to the left of) the match's start. When
used in conjunction with the /g modifier, you might wish $` to be the text
from the start of the match attempt, but it's the text from the start of the whole
string, each time. $` is never undefined after a successful match.


$'

A copy of the target text after (to the right of) the successfully matched text.
$' is never undefined after a successful match. After a successful match, the
string "$`$&$'" is always a copy of the original target text.[5]


$1, $2, $3, etc.

The text matched by the 1st, 2nd, 3rd, etc., set of capturing parentheses. (Note that $0 is not included here; it is a copy of the script name and not related
to regular expressions.) These are guaranteed to be undefined if they refer to a set of parentheses that doesn't exist in the regex, or to a set that wasn't
actually involved in the match.

These variables are available after a match, including in the replacement operand of s/···/···/. They can also be used within the code parts of an embedded-code or dynamic-regex construct (see Section 7.8). Otherwise, it makes
little sense to use them within the regex itself. (That's what \1 and friends are for.) See "Using $1 Within a Regex?" in Section 7.3.3.1.

The difference between (\w+) and (\w)+ can be seen in how $1 is set.
Both regexes match exactly the same text, but they differ in what
subexpression falls within the parentheses. Matching against the string
'tubby', the first one results in $1 having the full 'tubby', while the latter
one results in it having only 'y': with (\w)+, the plus is outside the parentheses,
so each iteration causes them to start capturing anew, leaving only
the last character in $1.

Also, note the difference between (x)? and (x?). With the former, the
parentheses and what they enclose are optional, so $1 would be either 'x'
or undefined. But with (x?), the parentheses enclose a match; what is optional are the contents. If the overall regex matches, the contents match something, although that something might be the nothingness x? allows. Thus, with (x?) the possible values of $1 are 'x' and an empty string. The following table shows some examples:


Sample Match                 Resulting $1

"::" =~ m/:(A?):/            empty string
"::" =~ m/:(A)?:/            undefined
":A:" =~ m/:(A?):/           A
":A:" =~ m/:(A)?:/           A

"::" =~ m/:(\w*):/           empty string
"::" =~ m/:(\w)*:/           undefined
":Word:" =~ m/:(\w*):/       Word
":Word:" =~ m/:(\w)*:/       d



When adding parentheses just for capturing, as was done here, the decision
of which to use is dependent only upon the semantics you want. In these
examples, since the added parentheses have no effect on the overall match
(they all match the same text), the only difference among them is in the
side effect of how $1 is set.
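
The distinctions described for $1 can be checked with a short sketch. It uses a list-context match to pick up $1 directly; the variable names are chosen only for the example.

```perl
use strict;
use warnings;

my ($whole)   = 'tubby' =~ m/(\w+)/;   # quantifier inside: captures 'tubby'
my ($last)    = 'tubby' =~ m/(\w)+/;   # quantifier outside: last iteration, 'y'

my ($empty)   = '::' =~ m/:(A?):/;     # group takes part, matches nothing: ''
my ($skipped) = '::' =~ m/:(A)?:/;     # group skipped entirely: undef

print "(\\w+) gives '$whole', (\\w)+ gives '$last'\n";
print "(A?) gives ", defined $empty   ? "'$empty'"   : 'undef', "\n";
print "(A)? gives ", defined $skipped ? "'$skipped'" : 'undef', "\n";
```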


$+

This is a copy of the highest numbered $1, $2, etc. explicitly set during the
match. This might be useful after something like

    $url =~ m{
        href \s* = \s*      # Match the "href=" part, then the value . . .
        (?: "([^"]*)"       # a double-quoted value, or . . .
          | '([^']*)'       # a single-quoted value, or . . .
          | ([^'"<>]+) )    # an unquoted value.
    }ix;

to access the value of the href. Without $+, you would have to check each
of $1, $2, and $3 and use the one that's not undefined.
If there are no capturing parentheses in the regex (or none are used during
the match), it becomes undefined.
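
To see $+ pick out whichever alternative took part, here is a sketch that runs the same regex over three made-up tags, one for each quoting style.

```perl
use strict;
use warnings;

my @hrefs;
for my $tag (q{<a href="one.html">}, q{<a href='two.html'>}, q{<a href=three.html>}) {
    if ($tag =~ m{
            href \s* = \s*       # the "href=" part, then the value
            (?: "([^"]*)"        # a double-quoted value, or
              | '([^']*)'        # a single-quoted value, or
              | ([^'"<>]+) )     # an unquoted value
        }ix)
    {
        push @hrefs, $+;         # whichever of $1, $2, $3 took part
    }
}

print "$_\n" for @hrefs;
```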


$^N

A copy of the most-recently-closed $1, $2, etc. explicitly set during the
match (i.e., the $1, $2, etc., associated with the final closing parenthesis). If there are no capturing parentheses in the regex (or none used during the match), it becomes undefined. A good example of its use is given starting
in Section 7.8.8.


@- and @+

These are arrays of starting and ending offsets (string indices) into the target
text. They might be a bit confusing to work with, due to their odd names.
The first element of each refers to the overall match. That is, the first element
of @-, accessed with $-[0], is the offset from the beginning of the target
string to where the match started. Thus, after

    $text = "Version 6 coming soon?";
    ...
    $text =~ m/\d+/;

the value of $-[0] is 8, indicating that the match started eight characters
into the target string. (In Perl, indices are counted starting at zero.)
The first element of @+, accessed with $+[0], is the offset to the end of the
match. With this example, it contains 9, indicating that the overall match
ended nine characters from the start of the string. So, using them together,

substr($text, $-[0], $+[0] - $-[0]) is the same as $& if $text has not been modified, but doesn't have the performance penalty that $& has
(see Section 7.9.3.2.1). Here's an example showing a simple use of @-:

     1 while $line =~ s/\t/' ' x (8 - $-[0] % 8)/e;

Given a line of text, it replaces tabs with the appropriate number of spaces.[6]
Subsequent elements of each array are the starting and ending offsets for
captured groups. The pair $-[1] and $+[1] are the offsets into the target
text where $1 was taken, $-[2] and $+[2] for $2, and so on.
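
Both uses of the offset arrays can be run together. This sketch reuses the "Version 6" text from above and applies the tab-expansion line to a small sample string invented for the example.

```perl
use strict;
use warnings;

my $text = "Version 6 coming soon?";
my ($start, $end, $matched);
if ($text =~ m/\d+/) {
    ($start, $end) = ($-[0], $+[0]);                   # 8 and 9 for this text
    $matched = substr($text, $start, $end - $start);   # same text as $&
}

# The tab-expansion line, applied to a sample with two tabs.
my $line = "a\tbc\td";
1 while $line =~ s/\t/' ' x (8 - $-[0] % 8)/e;

print "match '$matched' spans offsets $start..$end\n";
print "expanded line is ", length($line), " characters long\n";
```

Each tab is replaced by just enough spaces to reach the next multiple of eight, so 'bc' lands at column 8 and 'd' at column 16.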


$^R

This variable is useful only within the code parts of embedded-code and
dynamic-regex constructs (see Section 7.8), and has no value outside of a regex. It is
the resulting value of the most recently executed embedded-code construct,
except that an embedded-code construct used as the if of a (? if then | else ) conditional (see Section 3.4.5.6) does not set $^R. It is automatically localized to each part of the match, so values of $^R set by code that gets "unmatched" due to backtracking are properly forgotten. Put another way, it has the "most recent" value with respect to the match path that got the engine to the current location.



[5] Actually, if the original target is undefined, but the match is successful (unlikely, but possible), "$`$&$'" would be an empty string, not undefined. This is the only situation where the two differ.




[6] ... Section 3.3.2.2).




When a regex is applied repeatedly with the /g modifier, each iteration sets these
variables afresh. That's why, for instance, you can use $1 within the replacement
operand of s/···/···/g and have it represent a new slice of text with each match.
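
The variables described in this section can be inspected together after a single match. In this sketch the regex is an assumption: it was chosen because it yields values that agree with Table 7-5 for the target text 'Pi is 3.14159, roughly', and it should not be read as the regex the table was built from. The helper show is likewise hypothetical.

```perl
use strict;
use warnings;

my $target = "Pi is 3.14159, roughly";

# Assumed regex: the first alternative cannot match this target,
# so its group, $2, is left undefined.
my ($pre, $hit, $post, $plus, $last_closed);
my (@groups, @starts, @ends);
if ($target =~ m/\b((tasty|fattening)|(\d+(\.\d*)?))\b/) {
    ($pre, $hit, $post) = ($`, $&, $');
    @groups      = ($1, $2, $3, $4);
    $plus        = $+;
    $last_closed = $^N;
    @starts      = @-;
    @ends        = @+;
}

# Render undef visibly instead of triggering an "uninitialized" warning.
sub show { return join ', ', map { defined $_ ? $_ : 'undef' } @_ }

print "before/match/after: [$pre] [$hit] [$post]\n";
print "groups 1-4: ", show(@groups), "\n";
print "\$+ = $plus, \$^N = $last_closed\n";
print "\@- = (", show(@starts), ")\n";
print "\@+ = (", show(@ends), ")\n";

# With /g, the variables are set afresh for every match, so $1 and $2
# name a different slice of text in each replacement.
(my $swapped = "a1 b2 c3") =~ s/(\w)(\d)/$2$1/g;
print "$swapped\n";
```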



7.3.3.1 Using $1 within a regex?



The Perl man page makes a concerted effort to point out that \1 is not available as a backreference outside of a regex. (Use the variable $1 instead.) The variable
$1 refers to a string of static text matched during some previously completed successful
match. On the other hand, \1 is a true regex metacharacter that matches text similar to that matched within the first parenthesized subexpression at the time
that the regex-directed NFA reaches the \1. What it matches might change over the course of an attempt as the NFA tracks and backtracks in search of a match.


The opposite question is whether $1 and other after-match variables are available
within a regex operand. They are commonly used within the code parts of embedded-code
and dynamic-regex constructs (see Section 7.8), but otherwise make little sense
within a regex. A $1 appearing in the "regex part" of a regex operand is treated
exactly like any other variable: its value is interpolated before the match or substitution
operation even begins. Thus, as far as the regex is concerned, the value of
$1 has nothing to do with the current match, but rather is left over from some previous
match.
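
The difference between the two can be made concrete with a sketch. The strings are invented for the example; the first match exists only to leave a value in $1.

```perl
use strict;
use warnings;

# A previous, completed match leaves $1 holding 'abc'.
"key=abc" =~ m/=(\w+)/ or die "setup match failed";

# $1 in the regex operand is interpolated before matching starts, so this
# pattern is really ^(\w+) abc\z and cannot match "xyz xyz".
my $with_variable = ("xyz xyz" =~ m/^(\w+) $1\z/) ? "matched" : "failed";

# \1 refers to what this match's own first group captured.
my $with_backref  = ("xyz xyz" =~ m/^(\w+) \1\z/) ? "matched" : "failed";

print "using \$1 in the regex: $with_variable\n";
print "using \\1 in the regex: $with_backref\n";
```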


