Perl CD Bookshelf [Electronic resources], text version

6.15. Greedy and Non-Greedy Matches


6.15.1. Problem


You have a pattern with a greedy quantifier like *, +, ?, or { }, and
you want to stop it from being greedy.

A classic example is the naïve substitution to remove tags from HTML.
Although it looks appealing, s#<TT>.*</TT>##gsi deletes everything
from the first open TT tag through the last closing one. This would
turn "Even <TT>vi</TT> can edit <TT>troff</TT> effectively." into
"Even  effectively.", completely changing the meaning of the sentence!

6.15.2. Solution


Replace the offending greedy quantifier with the corresponding
non-greedy version. That is, change *, +, ?, and { } into *?, +?, ??,
and { }?, respectively.
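
As a minimal sketch of that change, the following applies both forms
of the substitution to the sentence from the Problem section. The
variable names are illustrative only.

```perl
use strict;
use warnings;

my $text = "Even <TT>vi</TT> can edit <TT>troff</TT> effectively.";

# Copy the text, then edit the copy, so both results can be compared.
(my $greedy = $text) =~ s#<TT>.*</TT>##gsi;    # .* runs to the last </TT>
(my $lazy   = $text) =~ s#<TT>.*?</TT>##gsi;   # .*? stops at the nearest </TT>

print "greedy: $greedy\n";    # greedy: Even  effectively.
print "lazy:   $lazy\n";      # lazy:   Even  can edit  effectively.
```

The non-greedy version removes each TT element separately and leaves
the words between them alone.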

6.15.3. Discussion



Perl has two sets of quantifiers: the maximal ones (*, +, ?, and { })
and the minimal ones (*?, +?, ??, and { }?). Less formally, these two
sorts of quantifiers are often referred to as greedy and non-greedy
(or sometimes lazy), respectively. For instance, given the string
"Perl is a Swiss Army Chainsaw!", the pattern /(r.*s)/ matches
"rl is a Swiss Army Chains", whereas /(r.*?s)/ matches "rl is".
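
The two matches just described can be checked with a short script.
This is only a sketch; the variable names are made up for the
illustration.

```perl
use strict;
use warnings;

my $string = "Perl is a Swiss Army Chainsaw!";

my ($maximal) = $string =~ /(r.*s)/;     # extends to the last "s"
my ($minimal) = $string =~ /(r.*?s)/;    # stops at the first "s" after "r"

printf "maximal: '%s'\n", $maximal;    # maximal: 'rl is a Swiss Army Chains'
printf "minimal: '%s'\n", $minimal;    # minimal: 'rl is'
```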

With maximal quantifiers, when you ask to match a variable number of
times, such as zero or more times for * or one or
more times for +, the matching engine prefers the
"or more" portion of that description. Thus
/foo.*bar/ matches the first
"foo" through the last "bar" in
the string, rather than only through the next
"bar" as some might expect. That's because the
greedy .* first expands to the rest of the string,
but since that wouldn't leave any characters for
"bar" to match, the engine backs up one character
at a time until it finds "bar".

To make any repetition operator match minimally instead of maximally,
add an extra ?. So *? still
matches zero or more times, but rather than match as much as it can,
the way * would, it matches as little as it can.
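
To see all four pairs side by side, here is a small sketch that runs
each maximal quantifier and its minimal counterpart against the same
string. The test string "aaaa" and the %matched hash are assumptions
made for the illustration.

```perl
use strict;
use warnings;

my $string = "aaaa";

# Each row pairs a maximal quantifier with its minimal counterpart.
my @pairs = (
    [ 'a*',     'a*?'     ],
    [ 'a+',     'a+?'     ],
    [ 'a?',     'a??'     ],
    [ 'a{2,3}', 'a{2,3}?' ],
);

my %matched;    # pattern => text that pattern matched in $string
for my $pair (@pairs) {
    for my $pattern (@$pair) {
        ($matched{$pattern}) = $string =~ /($pattern)/;
        printf "%-8s matched '%s'\n", $pattern, $matched{$pattern};
    }
}
```

The maximal forms match "aaaa", "aaaa", "a", and "aaa"; the minimal
forms match the empty string, "a", the empty string, and "aa".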

# greedy pattern
s/<.*>//gs; # try to remove tags, very badly
# nongreedy pattern
s/<.*?>//gs; # try to remove tags, better (but still rather badly)

This approach doesn't remove tags from all possible HTML correctly,
because a single regular expression is seldom an acceptable
replacement for a real parser. See Recipe 20.6 for the right way.
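
One way the non-greedy pattern still goes wrong is a ">" inside an
attribute value. The sample markup below is invented for this sketch;
it is not taken from any real page.

```perl
use strict;
use warnings;

# Hypothetical input: the alt attribute contains a literal ">".
my $html = '<p>Is <img src="gt.png" alt="a > b"> true?</p>';

(my $greedy = $html) =~ s/<.*>//gs;     # swallows everything
(my $lazy   = $html) =~ s/<.*?>//gs;    # cuts the img tag off too early

printf "greedy: '%s'\n", $greedy;    # greedy: ''
printf "lazy:   '%s'\n", $lazy;      # lazy:   'Is  b"> true?'
```

The non-greedy match ends at the first ">", which here sits inside the
quoted attribute, so part of the tag is left behind in the text.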

Minimal matching isn't all it's cracked up to be. Don't fall into the
trap of thinking that including the partial pattern BEGIN.*?END in a
pattern amidst other elements will always match the shortest amount of
text between occurrences of BEGIN and END. Consider the pattern
/BEGIN(.*?)END/. If matched against the string
"BEGIN and BEGIN and END", $1 would contain "and BEGIN and". This is
probably not what you want.
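
A quick check of that claim, as a sketch with an illustrative variable
name. Note that the captured text also includes the spaces on either
side.

```perl
use strict;
use warnings;

my $string = "BEGIN and BEGIN and END";

# The match starts at the leftmost BEGIN; .*? only decides where it ends.
my ($captured) = $string =~ /BEGIN(.*?)END/;

printf "captured: '%s'\n", $captured;    # captured: ' and BEGIN and '
```

The regex engine always prefers the leftmost starting point, so a
minimal quantifier shortens the match only from the right.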

Imagine trying to pull out everything between bold-italic pairs:

<b><i>this</i> and <i>that</i> are important</b>
Oh, <b><i>me too!</i></b>

A pattern to find only text between bold-italic
HTML pairs, that is, text that doesn't include them, might appear to
be:

m{ <b><i>(.*?)</i></b> }sx

You might be surprised to learn that the pattern doesn't find such
pairs. Many people incorrectly understand this as matching a
"<b><i>" sequence, then anything up to a "</i></b>" sequence, leaving
the intervening text in $1. While it often works out that way due to
the input data, that's not what it says. There's nothing in that
pattern that says .*? can't match "<b><i>" again (and again and again)
before it comes to "</i></b>". If the intention were to extract only
stuff between "<b><i>" and its corresponding "</i></b>", with no other
bold-italic tags in between, that pattern would be incorrect.
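
Running the pattern against the two sample lines shows the problem.
This is a sketch; $html and $found are names chosen for the
illustration.

```perl
use strict;
use warnings;

my $html = "<b><i>this</i> and <i>that</i> are important</b>\n"
         . "Oh, <b><i>me too!</i></b>";

my ($found) = $html =~ m{ <b><i>(.*?)</i></b> }sx;

print "$found\n";
# Prints two lines:
#   this</i> and <i>that</i> are important</b>
#   Oh, <b><i>me too!
```

The match begins at the first "<b><i>" and runs to the only
"</i></b>" in the input, taking the second "<b><i>" along with it.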

If the string in question is just one character, and if what follows
the minimal match is not a literal character, a negated class is
remarkably more efficient than a minimal match, as in
/X([^X]*)X/. But the general way to say "match
BEGIN, then not BEGIN, then END" for any arbitrary values of BEGIN
and END would be as follows (this also stores the intervening part in
$1):

/BEGIN((?:(?!BEGIN).)*)END/s

or, more legibly:

m{
    BEGIN               # locate initial portion
    (                   # save this group into $1
        (?:             # non-capturing group
            (?! BEGIN)  # assert: can't be at another BEGIN
            .           # now match any one character
        ) *             # entire group 0 or more
    )                   # end $1 group
    END                 # locate final portion
}sx
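
Applied to the earlier "BEGIN and BEGIN and END" string, the lookahead
version behaves as follows. This is a sketch using the compact form of
the pattern and illustrative variable names.

```perl
use strict;
use warnings;

my $string = "BEGIN and BEGIN and END";

# (?!BEGIN) stops the repeated group from stepping over a second BEGIN,
# so the attempt starting at the first BEGIN fails and the second one wins.
my ($captured) = $string =~ /BEGIN((?:(?!BEGIN).)*)END/s;

printf "captured: '%s'\n", $captured;    # captured: ' and '
```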

However, this might not be what you're after, either. The greedy star
quantifier means that the non-BEGIN portion in $1 will be maximized,
so the match runs from the last BEGIN through the last END rather than
the first END that follows it. So if your string were:

$_ = "BEGIN1 BEGIN2 BEGIN3 3END 2END 1END";

$1 would contain "3 3END 2END 1". Making the quantifier a minimal
matching one:

/BEGIN((?:(?!BEGIN).)*?)END/s

puts "3 3" in $1 for you. To exclude END explicitly rather than rely
on minimal matching, add another lookahead negation, (?!END), next to
the existing one. Written out with plenty of whitespace, we now have:

m{
    BEGIN               # locate initial portion
    (                   # save this group into $1
        (?:             # non-capturing group
            (?! BEGIN ) # can't be at a BEGIN
            (?! END )   # also can't be at an END
            .           # finally, match any one char
        ) *             # repeat entire group ad libitum
    )                   # end $1 capture
    END
}sx
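
The three variants can be compared on the nested sample string. This
is a sketch; the variable names are illustrative.

```perl
use strict;
use warnings;

my $string = "BEGIN1 BEGIN2 BEGIN3 3END 2END 1END";

# Greedy star, one lookahead: runs on to the last END.
my ($greedy)  = $string =~ /BEGIN((?:(?!BEGIN).)*)END/s;

# Minimal star, one lookahead: stops at the first END.
my ($minimal) = $string =~ /BEGIN((?:(?!BEGIN).)*?)END/s;

# Greedy star, two lookaheads: cannot step over an END at all.
my ($guarded) = $string =~ /BEGIN((?:(?!BEGIN)(?!END).)*)END/s;

printf "greedy:  '%s'\n", $greedy;     # greedy:  '3 3END 2END 1'
printf "minimal: '%s'\n", $minimal;    # minimal: '3 3'
printf "guarded: '%s'\n", $guarded;    # guarded: '3 3'
```

On this input the minimal and two-lookahead forms capture the same
text.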

Instead of adding another lookahead, another possibility is to use
alternation within the existing one:
(?!BEGIN|END). Applying this approach to the
HTML-matching code, we end up with something like:

m{ <b><i>(  (?: (?!</b>|</i>). )*  ) </i></b> }sx

or perhaps:

m{ <b><i>(  (?: (?!</[ib]>). )*  ) </i></b> }sx
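
Trying the second form on the two sample lines from earlier gives the
innermost pair only. This is a sketch with illustrative variable
names.

```perl
use strict;
use warnings;

my $html = "<b><i>this</i> and <i>that</i> are important</b>\n"
         . "Oh, <b><i>me too!</i></b>";

# The lookahead refuses to step over any closing b or i tag.
my ($found) = $html =~ m{ <b><i>(  (?: (?!</[ib]>). )*  ) </i></b> }sx;

printf "found: '%s'\n", $found;    # found: 'me too!'
```

The first "<b><i>" is skipped because its text reaches a lone "</i>"
before any "</i></b>", so the attempt starting there fails.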

Jeffrey Friedl points out that this quick-and-dirty method isn't
particularly efficient. He suggests crafting a more elaborate pattern
when speed matters, such as:

m{
    <b><i>
    [^<]*               # stuff not possibly bad, and not possibly the end.
    (?:
        # at this point, we can have '<' if not part of something bad
        (?! </?[ib]> )  # what we can't have
        <               # okay, so match the '<'
        [^<]*           # and continue with more safe stuff
    ) *
    </i> </b>
}sx
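
Here is a sketch of that pattern in use. The second input line is
invented for the illustration so that a harmless "<" appears inside
the bold-italic text; the pattern has no capture group, so the whole
match is read from $&.

```perl
use strict;
use warnings;

# Hypothetical input: the TT tags are allowed inside, b and i tags are not.
my $html = "<b><i>this</i> and <i>that</i> are important</b>\n"
         . "Oh, <b><i>see <TT>perlre</TT> now</i></b>";

my $unrolled = qr{
    <b><i>
    [^<]*                  # text that cannot start a tag
    (?:
        (?! </?[ib]> )     # refuse any opening or closing b or i tag
        <                  # some other '<' is acceptable
        [^<]*              # then more text that cannot start a tag
    ) *
    </i> </b>
}sx;

my $match = $html =~ $unrolled ? $& : undef;

print "$match\n";    # <b><i>see <TT>perlre</TT> now</i></b>
```

Most characters are consumed by [^<]* in a single step, and the
lookahead is tested only where a "<" actually occurs.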

This is a variation on Jeffrey's unrolling-the-loop technique,
described in Chapter 6 of Mastering Regular Expressions, Second
Edition.

6.15.4. See Also



The non-greedy quantifiers in the "Regular Expressions" section of
perlre(1) and in Chapter 5 of Programming Perl
