6.6. Matching Within Multiple Lines
6.6.1. Problem
You want to use regular expressions on a
string containing more than one logical line, but the special
characters . (any character but newline),
^ (start of string), and $ (end
of string) don't seem to work for you. This might happen if you're
reading in multiline records or the whole file at
once.
6.6.2. Solution
Use /m,
/s, or both as pattern modifiers.
/s allows . to match a newline
(normally it doesn't). If the target string has more than one line in
it, /foo.*bar/s could match a
"foo" on one line and a "bar"
on a following line. This doesn't affect dots in character classes
like [#%.], since they are literal periods anyway.
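A minimal sketch of what /s changes, assuming a made-up two-line string (the variable names here are illustrative and not part of the recipe):

```perl
use strict;
use warnings;

# Hypothetical two-line record, invented for illustration.
my $text = "foo one\ntwo bar\n";

# Without /s, dot stops at the newline, so .* cannot reach "bar".
my $plain  = $text =~ /foo.*bar/  ? 1 : 0;

# With /s, dot also matches "\n", so the match can span both lines.
my $dotall = $text =~ /foo.*bar/s ? 1 : 0;

# A period inside a character class is a literal period, with or without /s.
my $class  = "\n" =~ /[#%.]/s ? 1 : 0;

print "without /s: $plain, with /s: $dotall, class matches newline: $class\n";
# prints: without /s: 0, with /s: 1, class matches newline: 0
```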
The /m modifier
allows ^ and $ to match
immediately before and after an embedded newline, respectively.
/^=head[1-7]/m would match that pattern not just
at the beginning of the record, but anywhere right after a newline as
well.
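The effect of /m can be sketched the same way. The record below is a hypothetical POD-like string invented for illustration:

```perl
use strict;
use warnings;

# Hypothetical record with two heading lines, neither at the start.
my $record = "some text\n=head1 NAME\nmore text\n=head2 Details\n";

# Without /m, ^ matches only at the start of the string: nothing is found.
my @no_m   = $record =~ /^=head[1-7]/g;

# With /m, ^ also matches right after each embedded newline.
my @with_m = $record =~ /^=head[1-7]/mg;

print scalar(@no_m), " without /m, ", scalar(@with_m), " with /m\n";
# prints: 0 without /m, 2 with /m
```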
6.6.3. Discussion
A common, brute-force approach to parsing documents where newlines
are not significant is to read the file one paragraph at a time (or
sometimes even the entire file as one string) and then extract tokens
one by one. If the pattern involves dot, such as
.+ or .*?, and must match
across newlines, you need to do something special to make dot match a
newline; ordinarily, it does not. When you've read more than one line
into a string, you'll probably prefer to have ^
and $ match beginning- and end-of-line, not just
beginning- and end-of-string.
The difference between /m and
/s is important: /m allows
^ and $ to match next to an
embedded newline, whereas /s allows
. to match newlines. You can even use them
together; they're not mutually exclusive.
Example 6-2 creates a simplistic filter to strip
HTML tags out of each file in @ARGV and then send
those results to STDOUT. First we undefine the
record separator so each read operation fetches one entire file.
(There could be more than one file, because @ARGV
could have several arguments in it. If so, each readline would fetch
the entire contents of one file.) Then we strip out instances of
beginning and ending angle brackets, plus anything in between them.
We can't use just .* for two reasons: first, it
would match closing angle brackets, and second, the dot wouldn't
cross newline boundaries. Using .*? in conjunction
with /s solves these problems.
Example 6-2. killtags
#!/usr/bin/perl
# killtags - very bad html tag killer
undef $/;                   # each read is whole file
while (<>) {                # get one whole file at a time
    s/<.*?>//gs;            # strip tags (terribly)
    print;                  # print file to STDOUT
}
Because the closing delimiter is just a single character, it would be much faster to
use s/<[^>]*>//gs, but that's still a
naïve approach: it doesn't correctly handle tags inside
HTML comments or angle brackets in quotes (<IMG
SRC=">" ALT="<<Ooh
la la!>>">). Recipe 20.6 explains how to avoid these problems.
Example 6-3 takes a plain text document and looks
for lines at the start of paragraphs that look like
"Chapter 20:
Better Living
Through Chemisery". It wraps
these with an appropriate HTML level-one header. Because the pattern
is relatively complex, we use the /x modifier so
we can embed whitespace and comments.
Example 6-3. headerfy
#!/usr/bin/perl
# headerfy: change certain chapter headers to html
$/ = '';                        # paragraph read mode
while (<>) {                    # fetch a paragraph
    s{
        \A                      # start of record
        (                       # capture in $1
            Chapter             # text string
            \s+                 # mandatory whitespace
            \d+                 # decimal number
            \s*                 # optional whitespace
            :                   # a real colon
            .*                  # anything not a newline till end of line
        )
    }{<H1>$1</H1>}gx;
    print;
}
Here it is as a one-liner from the command line for those of you for
whom the extended comments just get in the way of understanding:
% perl -00pe 's{\A(Chapter\s+\d+\s*:.*)}{<H1>$1</H1>}gx' datafile
This problem is interesting because we need to be able to specify
start-of-record and end-of-line in the same pattern. Here dot without
/s stops at a newline, so .* already runs to end-of-line and the
pattern needs no explicit anchor there. If you want an explicit
$ to indicate not only end-of-record but
end-of-line as well, you must add the /m modifier, which
changes both ^ and $. That is why we match
beginning-of-record with \A rather than ^:
\A keeps its meaning with or without /m. We're not using it here, but in case
you're interested, the version of $ that always
matches end-of-record with an optional newline, even in the presence
of /m, is \Z. To match the real
end without the optional newline, use \z.
The following example demonstrates using /s and
/m together, because we want
^ to match the beginning of any line in the
paragraph; we also want dot to match a newline. The predefined
variable $. represents the record number of the
filehandle most recently read from using
readline(FH) or <FH>. The
predefined variable $ARGV is the name of the file
that's automatically opened by implicit
<ARGV> processing.
$/ = '';                            # paragraph read mode
while (<ARGV>) {
    while (/^START(.*?)^END/gsm) {  # /s makes . span line boundaries
                                    # /m makes ^ match near newlines
                                    # /g advances to the next chunk each pass
        print "chunk $. in $ARGV has <<$1>>\n";
    }
}
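To try the same pattern without any input files, here is a sketch that runs it against an in-memory paragraph. The string is hypothetical, and the list-context match with /g is a variation on the loop above, not the recipe's own code:

```perl
use strict;
use warnings;

# Hypothetical paragraph with two START/END blocks, invented for illustration.
my $para = "START alpha\nbeta\nEND\nnoise\nSTART gamma\nEND\n";

# In list context, /g returns every captured chunk at once.
# /s lets . cross newlines; /m lets ^ match after each newline.
my @chunks = $para =~ /^START(.*?)^END/gsm;

print "found ", scalar(@chunks), " chunks\n";
print "<<$_>>\n" for @chunks;
```

Each captured chunk keeps its embedded newlines, which is what /s makes possible here.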
If you're already committed to the /m modifier,
use \A and \Z for the old
meanings of ^ and $,
respectively. But what if you've used the /s
modifier and want the original meaning of dot? You use
[^\n].
Finally, although $ and \Z can
match one character before the end of a string if that last character is a
newline, \z matches only at the very end of the
string. We can use lookaheads to define the other two as shortcuts
involving \z:
$ without /m    (?=\n?\z)
$ with /m       (?=\n|\z)
\Z always       (?=\n?\z)
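The differences among the three end anchors can be checked with a short sketch. The test strings and hash keys are hypothetical, and the lookahead spellings are this sketch's own rendering of the idea, written with \z inside the lookahead:

```perl
use strict;
use warnings;

my $one = "abc\n";        # hypothetical string ending in a newline
my $two = "abc\ndef\n";   # hypothetical two-line string

my %r = (
    dollar     => ($one =~ /abc$/         ? 1 : 0),  # matches before the final newline
    big_Z      => ($one =~ /abc\Z/        ? 1 : 0),  # same as $ without /m
    small_z    => ($one =~ /abc\z/        ? 1 : 0),  # only the very end, so this fails
    look_Z     => ($one =~ /abc(?=\n?\z)/ ? 1 : 0),  # lookahead spelling of \Z
    dollar_two => ($two =~ /abc$/         ? 1 : 0),  # embedded newline: fails without /m
    dollar_m   => ($two =~ /abc$/m        ? 1 : 0),  # /m lets $ match before any newline
    look_m     => ($two =~ /abc(?=\n|\z)/ ? 1 : 0),  # lookahead spelling of $ under /m
);

printf "%-11s %d\n", $_, $r{$_} for sort keys %r;
```

Only small_z and dollar_two come out as 0; every other entry is 1.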
6.6.4. See Also
The $/ variable in perlvar(1)
and in the "Per-Filehandle Variables" section of Chapter 28 of
Programming Perl; the /s
and /m modifiers in perlre(1)
and "The Fine Print" section of Chapter 2 of Programming
Perl; the "Anchors and Other Zero-Width Assertions"
section in Chapter 3 of Mastering Regular
Expressions; we talk more about the special variable
$/ in Chapter 8.