Perl Cd Bookshelf [Electronic resources] نسخه متنی

با بیش از 100000 منبع الکترونیکی رایگان به زبان فارسی ، عربی و انگلیسی

6.16. Detecting Doubled Words

6.16.1. Problem

You want to check for doubled
words in a document.

6.16.2. Solution

Use backreferences in your
pattern.

6.16.3. Discussion

Parentheses in a pattern make the matching engine remember what text
that portion of the pattern matched. Later in the pattern, refer to
the actual string that matched with \1 (indicating
the string matched by the first set of parentheses),
\2 (for the string matched by the second set of
parentheses), and so on. Don't use $1 within a
regex, because it would be a variable interpolated before the match
began. The pattern /([A-Z])\1/ matches a capital
letter followed not just by any capital letter, but by whichever one
was just matched (i.e., captured by the first set of parentheses in
that pattern).

The next sample code reads its input files by paragraph, with the
definition of paragraph following Perl's
notion of a paragraph—a chunk of text terminated by two or more
contiguous newlines. Within each paragraph, the code finds all
doubled words. It ignores case and can match across newlines.

Here we use /x to embed whitespace and comments to
improve readability. The /i permits both instances
of "is" in the sentence "Is
is this ok?"
to match, even though they differ in case. We use
/g in a while loop to keep
finding doubled words until we run out of text.

$/ = ';                      # paragrep mode
while (<>) {
while ( m{
\b            # start at a word boundary (begin letters)
(\S+)         # find chunk of non-whitespace
\b            # until another word boundary (end letters)
(
\s+       # separated by some whitespace
\1        # and that very same chunk again
\b        # until another word boundary
) +           # one or more sets of those
}xig
)
{
print "dup word '$1' at paragraph $.\n";
}
}

That code finds the duplicated test in the
following paragraph:

This is a test
test of the doubled word finder.

Word boundary anchors surrounding \S+ are often a
bad idea because they do something you might not be expecting. That's
because word boundaries in Perl are defined as transitions between
alphanumunders (that's a \w) and either the edge
of the string or a non-alphanumunder. Surrounding
\S+ with \b subtly changes
\S+ from its normal meaning of one or more
non-whitespace characters to a stretch of non-whitespace whose first
and last character must be an alphanumunder.

Sometimes, though, this might be just what you're looking for.
Consider the string:

$string = q("I can't see this," she remarked.);
@a = $string =~ /\b\S+\b/g;
@b = $string =~ /\S+/g;

The elements of @a are now:

0  I
1  can't
2  see
3  this
4  she
5  remarked

but those of @b are:

0  "I
1  can't
2  see
3  this,"
4  she
5  remarked.

Here's another interesting demonstration of backreferences. Imagine
two words in which the end of the first word is the same as the start
of the next one, such as "nobody" and
"bodysnatcher". You'd like to find that
overlapping part and come up with
"nobodysnatcher". This is a variant on the doubled
word problem.

Conventional character-by-character processing the way a C programmer
would write it would take a great deal of tricky code. But with a
backtracking pattern matcher, it just takes one simple pattern match.

$a = 'nobody';
$b = 'bodysnatcher';
if ("$a $b" =~ /^(\w+)(\w+) \2(\w+)$/) {
print "$2 overlaps in $1-$2-$3\n";
}
body overlaps in no-body-snatcher

You might think that $1 would first grab up all of
"nobody" due to greediness. It does—for a
while. But once it's done so, there aren't any more characters to put
in $2. So the engine backs up, and
$1 begrudgingly gives up one character to
$2. The space character matches successfully, but
then sees \2, which currently holds a lone
"y". The next character in the string is not a
"y", but a "b". This makes the
engine back up, eventually forcing $1 to surrender
enough to $2 that the pattern can match some
string, a space, and then that same string again.

That won't quite work out if the overlap is itself the product of a
doubling, as in "rococo" and
"cocoon". The preceding algorithm would have
decided that the overlapping string, $2, must be
just "co" rather than "coco".
But we don't want a "rocococoon"; we want a
"rococoon". Adding a minimal matching quantifier
to the $1 part gives the much better pattern:

/^(\w+?)(\w+) \2(\w+)$/,

which solves this problem.

Backtracking is more powerful than you
might imagine. Example 6-8 offers another take on
the prime factorization problem from Chapter 1.

Example 6-8. prime-pattern

  #!/usr/bin/perl
# prime_pattern -- find prime factors of argument using pattern matching
for ($N = ('o' x shift); $N =~ /^(oo+?)\1+$/; $N =~ s/$1/o/g) {
print length($1), " ";
}
print length ($N), "\n";

Although not practical, this approach marvelously demonstrates the
power of backtracking.

Here's another example. Using a brilliant insight first illustrated
by Doug McIlroy (or so says Andrew Hume), you can find solutions to
Diophantine equations of order one with regular expressions. Consider
the equation 12x +
15y + 16z
= 281. Can you think of
possible values for x, y,
and z? Perl can!

# solve for 12x + 15y + 16z = 281, maximizing x
if (($X, $Y, $Z)  =
(('o' x 281)  =~ /^(o*)\1{11}(o*)\2{14}(o*)\3{15}$/))
{
($x, $y, $z) = (length($X), length($Y), length($Z));
print "One solution is: x=$x; y=$y; z=$z.\n";
} else {
print "No solution.\n";
}
One solution is: x=17; y=3; z=2.

Because the first o* was greedy,
x was allowed to grow as large as it could.
Changing one or more * quantifiers to
*?, +, or +?
can produce different solutions.

('o' x 281)  =~ /^(o+)\1{11}(o+)\2{14}(o+)\3{15}$/
One solution is: x=17; y=3; z=2
('o' x 281)  =~ /^(o*?)\1{11}(o*)\2{14}(o*)\3{15}$/
One solution is: x=0; y=7; z=11.
('o' x 281)  =~ /^(o+?)\1{11}(o*)\2{14}(o*)\3{15}$/
One solution is: x=1; y=3; z=14.

An important lesson to be learned from these amazing feats of
mathematical prowess by a lowly pattern matcher is that a
pattern-matching engine, particularly a backtracking one, very much
wants to give you an answer, and it will work phenomenally hard to do
so. But solving a regular expression with backreferences can take
time exponentially proportional to the length of the input to
complete. For all but trivial inputs, such algorithms make
continental drift seem brisk.

6.16.4. See Also

The explanation of backreferences in the "Regular Expressions"
section of perlre(1), and in "The Little Engine
That /Could(n't)?/" section of Chapter 5 of Programming
Perl; the "The Doubled-Word Thing" section in Chapter 2
of Mastering Regular Expressions

Perl Cd Bookshelf [Electronic resources] نسخه متنی

فارسی

کردی

العربیه

اردو

Türkçe

Русский

English

Français

کانال فیلم من

تبیان من

فایلهای من

کتابخانه من

پنل پیامکی

وبلاگ من

اینجــــا یک کتابخانه دیجیتالی است

با بیش از 100000 منبع الکترونیکی رایگان به زبان فارسی ، عربی و انگلیسی