Mastering Regular Expressions (2nd Edition) [Electronic resources] - text version

Jeffrey E. F. Friedl

7.7 The Split Operator



The multifaceted split operator (often called a function in casual conversation) is commonly used as the converse of a list-context m/···/g (see Section 7.5.3.3). The latter returns text matched by the regex, while a split with the same regex returns text separated by matches. The normal match $text =~ m/:/g applied against a $text of 'IO.SYS:225558:95-10-03:-a-sh:optional' returns the four-element list


     ( ':', ':', ':', ':' )


which doesn't seem useful. On the other hand, split(/:/, $text) returns the five-element list:


     ( 'IO.SYS', '225558', '95-10-03', '-a-sh', 'optional' )


Both examples reflect that : matches four times. With split, those four matches partition a copy of the target into five chunks, which are returned as a list of five strings.
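
The contrast between the two operators can be tried directly. The following is a minimal sketch using the sample data above; the variable names @matched and @fields are chosen here for illustration only.

```perl
use strict;
use warnings;

my $text = 'IO.SYS:225558:95-10-03:-a-sh:optional';

# List-context match: returns the text matched by each of the four matches.
my @matched = $text =~ m/:/g;

# split with the same regex: returns the text between those matches.
my @fields = split(/:/, $text);

print scalar(@matched), " matches, ", scalar(@fields), " fields\n";
print join(' | ', @fields), "\n";
```

The first line printed should report four matches and five fields, reflecting that four separators partition the string into five chunks.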


That example splits the target string on a single character, but you can split on
any arbitrary regular expression. For example,


     @Paragraphs = split(m/\s*<p>\s*/i, $html);


splits the HTML in $html into chunks, at <p> or <P>, surrounded by optional whitespace. You can even split on locations, as with


     @Lines = split(m/^/m, $lines);


to break a string into its logical lines.
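
Both splits can be exercised on small inputs. This is only a sketch: the contents of $html and $lines are made-up sample data, not taken from the text.

```perl
use strict;
use warnings;

# Made-up sample data for illustration.
my $html  = "First para\n<P>\nSecond para <p> Third para";
my $lines = "line one\nline two\nline three\n";

# Split at <p> or <P>, swallowing any whitespace around the tag.
my @Paragraphs = split(m/\s*<p>\s*/i, $html);

# Split at the start of each logical line; each chunk keeps its newline.
my @Lines = split(m/^/m, $lines);

print scalar(@Paragraphs), " paragraphs, ", scalar(@Lines), " lines\n";
print "[$_]\n" for @Paragraphs;
```

Because m/^/m matches a location rather than any text, nothing is removed from the string, so each element of @Lines still ends with its newline.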


In its simplest form, with simple data like this, split is as easy to understand
as it is useful. However, there are many options, special cases, and special
situations that complicate things. Before getting into the details, let me show two
particularly useful special cases:



The special match operand // causes the target string to be split into its component characters. Thus, split(//, "short test") returns a list of ten elements: ("s", "h", "o", ···, "s", "t").



The special match operand "•" (a normal string with a single space) causes the target string to be split on whitespace, similar to using m/\s+/ as the operand, except that any leading and trailing whitespace are ignored. Thus, split("•", "•••a•short•••test•••") returns the strings 'a', 'short', and 'test'.




These and other special cases are discussed a bit later, but first, the next sections
go over the basics.
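
As a quick check of the two special cases just described, here is a minimal sketch. The array names are illustrative, and the spaces that the text marks with • are written as ordinary spaces in the code.

```perl
use strict;
use warnings;

# Empty regex: split into individual characters (the space counts as one).
my @chars = split(//, "short test");

# Single-space string: split on runs of whitespace, ignoring leading
# and trailing whitespace.
my @words = split(' ', "   a short   test   ");

print scalar(@chars), " characters\n";
print join(',', @words), "\n";
```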



7.7.1 Basic Split



split is an operator that looks like a function, and takes up to three operands:


     split(match operand, target string, chunk-limit operand)


The parentheses are optional. Default values (discussed later in this section) are
provided for operands left off the end.


split is almost always used in a list context. Common usage patterns include:


     ($var1, $var2, $var3, ···) = split(···);

     @array = split(···);

     for my $item (split(···)) {
          ···
     }


7.7.1.1 Basic match operand



The match operand has several special-case situations, but it is normally the same
as the regex operand of the match operator. That means that you can use /···/ and
m{···} and the like, a regex object, or any expression that can evaluate to a string.
Only the core modifiers described in Section 7.2.3 are supported.


If you need parentheses for grouping, be sure to use the (?:···) non-capturing kind. As we'll see in a few pages, the use of capturing parentheses with split turns on a very special feature.
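
To illustrate the forms a match operand can take, the sketch below uses non-capturing parentheses for grouping and passes the same regex once as a literal and once as a regex object. The sample string and the names $list and $separator are assumptions made for the example.

```perl
use strict;
use warnings;

# Made-up sample data: items separated by commas or semicolons.
my $list = "red, green;blue ;  cyan";

# Grouping with (?:...) so that no captured text is added to the result.
my @from_literal = split(/\s*(?:,|;)\s*/, $list);

# The same regex supplied as a regex object.
my $separator   = qr/\s*(?:,|;)\s*/;
my @from_object = split($separator, $list);

print join('|', @from_literal), "\n";
print join('|', @from_object), "\n";
```

Both calls should produce the same four items.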



7.7.1.2 Target string operand



The target string is inspected, but is never modified by split. The content of $_ is
the default if no target string is provided.



7.7.1.3 Basic chunk-limit operand



In its primary role, the chunk-limit operand specifies a limit to the number of
chunks that split partitions the string into. With the sample data from the first
example, split(/:/, $text, 3) returns:


     ( 'IO.SYS', '225558', '95-10-03:-a-sh:optional' )


This shows that split stopped after /:/ matched twice, resulting in the
requested three-chunk partition. It could have matched additional times, but that's
irrelevant because of this example's chunk limit. The limit is an upper bound, so
no more than that many elements will ever be returned (unless the regex has capturing
parentheses, which is covered in a later section). You may still get fewer
elements than the chunk limit; if the data can't be partitioned enough to begin
with, nothing extra is produced to "fill the count." With our example data, split(/:/, $text, 99) still returns only a five-element list. However, there is an important difference between split(/:/, $text) and split(/:/, $text, 99) which does not manifest itself with this example; keep this in mind when the details are discussed later.


Remember that the chunk-limit operand refers to the chunks between the matches, not to the number of matches themselves. If the limit were to refer to the
matches themselves, the previous example with a limit of three would produce


     ( 'IO.SYS', '225558', '95-10-03', '-a-sh:optional' )


which is not what actually happens.
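
The effect of the chunk limit on the sample data can be confirmed with a short sketch (the array names are illustrative):

```perl
use strict;
use warnings;

my $text = 'IO.SYS:225558:95-10-03:-a-sh:optional';

# Limit of 3: the regex is applied twice, and the rest stays in one chunk.
my @three = split(/:/, $text, 3);

# Limit larger than the data allows: nothing is added to fill the count.
my @many = split(/:/, $text, 99);

print scalar(@three), " chunks, last is '$three[-1]'\n";
print scalar(@many), " chunks with a limit of 99\n";
```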


One comment on efficiency: let's say you intended to fetch only the first few
fields, such as with:


     ($filename, $size, $date) = split(/:/, $text);


As a performance enhancement, Perl stops splitting after the fields you've
requested have been filled. It does this by automatically providing a chunk limit of
one more than the number of items in the list.



7.7.1.4 Advanced split



split can be simple to use, as with the examples we've seen so far, but it has
three special issues that can make it somewhat complex in practice:



Returning empty elements



Special regex operands



A regex with capturing parentheses




The next sections cover these in detail.



7.7.2 Returning Empty Elements



The basic premise of split is that it returns the text separated by matches, but
there are times when that returned text is an empty string (a string of length zero,
e.g., ""). For example, consider


     @nums = split(m/:/, "12:34::78");


This returns


     ("12", "34", "", "78")


The regex : matches three times, so four elements are returned. The empty third element reflects that the regex matched twice in a row, with no text in between.



7.7.2.1 Trailing empty elements



Normally, trailing empty elements are not returned. For example,


     @nums = split(m/:/, "12:34::78:::");


sets @nums to the same four elements


     ("12", "34", "", "78")


as the previous example, even though the regex was able to match a few extra
times at the end of the string. By default, split does not return empty elements at
the end of the list. However, you can have split return all trailing elements by
using an appropriate chunk-limit operand . . .



7.7.2.2 The chunk-limit operand's second job



In addition to possibly limiting the number of chunks, any non-zero chunk-limit
operand also preserves trailing empty items. (A chunk limit given as zero is exactly
the same as if no chunk limit is given at all.) If you don't want to limit the number
of chunks returned, but do want to leave trailing empty elements intact, simply
choose a very large limit. Or, better yet, use -1, because a negative chunk limit is
taken as an arbitrarily large limit: split(/:/, $text, -1) returns all elements,
including any trailing empty ones.


At the other extreme, if you want to remove all empty items, you could put grep {length} before the split. This use of grep passes only list elements with non-zero lengths (in other words, elements that aren't empty):


     my @NonEmpty = grep { length } split(/:/, $text);
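
The different treatments of empty elements can be compared side by side. This is a sketch using the "12:34::78:::" string from the earlier example; the array names are illustrative.

```perl
use strict;
use warnings;

my $text = "12:34::78:::";

my @default  = split(/:/, $text);       # trailing empty elements dropped
my @zero     = split(/:/, $text, 0);    # a limit of zero acts like no limit
my @all      = split(/:/, $text, -1);   # trailing empty elements preserved
my @NonEmpty = grep { length } split(/:/, $text);  # every empty element removed

printf "default=%d zero=%d all=%d nonempty=%d\n",
    scalar(@default), scalar(@zero), scalar(@all), scalar(@NonEmpty);
```

The string contains six colons, so the unrestricted partition has seven chunks; only the -1 form returns all of them.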


7.7.2.3 Special matches at the ends of the string



A match at the very beginning normally produces an empty element:


     @nums = split(m/:/, ":12:34::78");


That sets @nums to:


     ("", "12", "34", "", "78")


The initial empty element reflects the fact that the regex matched at the beginning
of the string. However, as a special case, if the regex doesn't actually match any
text when it matches at the start or end of the string, leading and/or trailing empty
elements are not produced. A simple example is split(/\b/, "a simple test"), which can match at six locations in 'a•simple•test': before and after each of the three words. Even though it matches six times, it doesn't return seven elements, but rather only the five elements: ("a", "•", "simple", "•", "test"). Actually, we've already seen this special case, with the @Lines = split(m/^/m, $lines) example in Section 7.7.
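
The difference between a match that consumes text at the start of the string and one that matches only a location can be seen in this minimal sketch (array names are illustrative):

```perl
use strict;
use warnings;

# The colon at the very start matches real text: a leading empty element.
my @nums = split(m/:/, ":12:34::78");

# \b matches only locations, so no empty element is produced at either end.
my @pieces = split(/\b/, "a simple test");

print scalar(@nums), " elements, first is ",
      (length $nums[0] ? "non-empty" : "empty"), "\n";
print join('|', @pieces), "\n";
```

The second line printed shows the three words with the single spaces between them as separate elements.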



7.7.3 Split's Special Regex Operands



split's match operand is normally a regex literal or a regex object, as with the
match operator, but there are some special cases:



An empty regex for split does not mean "Use the current default regex," but rather
means to split the target string into a list of characters. We saw this before at the start
of the split discussion, noting that split(//, "short test") returns a list
of ten elements: ("s", "h", "o", ···, "s", "t").


A match operand that is a string (not a regex) consisting of exactly one space
is a special case. It's almost the same as /\s+/, except that leading whitespace
is skipped. Trailing whitespace normally produces no element either, because
trailing empty elements are dropped by default; it shows up as a trailing empty
element only if an appropriately large (or negative) chunk-limit operand is given.
This is all meant to simulate the default field splitting that awk does with its
input, although it can certainly be quite useful for general use.


If you'd like to keep leading whitespace, just use m/\s+/ directly. If you'd like
to keep trailing whitespace, use -1 as the chunk-limit operand.



If no regex operand is given, a string consisting of one space (the special case
in the previous point) is used as the default. Thus, a raw split without any
operands is the same as split('•', $_, 0).



If the regex ^ is used, the /m modifier (for the enhanced line-anchor match
mode) is automatically supplied for you. (For some reason, this does not happen
for $.) Since it's so easy to just use m/^/m explicitly, I would recommend
doing so, for clarity. Splitting on m/^/m is an easy way to break a multiline
string into individual lines.
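
The special operands in this list can be compared with a short sketch. The sample strings and array names are assumptions made for the example.

```perl
use strict;
use warnings;

my $str = "  a short  test  ";

my @awk_style = split(' ', $str);       # leading whitespace skipped
my @regex     = split(/\s+/, $str);     # leading match gives an empty element
my @with_tail = split(' ', $str, -1);   # non-zero limit keeps a trailing empty element

# With no operands at all, split works on $_ as if given ' '.
$_ = $str;
my @bare = split;

# A lone ^ is treated as if /m had been given.
my @lines = split(/^/, "one\ntwo\n");

printf "awk=%d regex=%d tail=%d bare=%d lines=%d\n",
    scalar(@awk_style), scalar(@regex), scalar(@with_tail),
    scalar(@bare), scalar(@lines);
```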




7.7.3.1 Split has no side effects



Note that a split match operand often looks like a match operator, but it has none of the side effects of one. The use of a regex with split doesn't affect the
default regex for later match or substitution operators. The variables $&, $', $1,
and so on are not set or otherwise affected by a split. A split is completely isolated
from the rest of the program with respect to side effects.[8]



[8] Actually, there is one side effect remaining from a feature that has been deprecated for many years,
but has not actually been removed from the language yet. If split is used in a scalar context, it
writes its results to the @_ variable (which is also the variable used to pass function arguments, so be
careful not to use split in a scalar context by accident). use warnings or the -w command-line
argument will warn you if split is used in a scalar context.
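
The isolation described in this section can be observed by running a split between an ordinary match and a later look at its capture variables. This is only a sketch; the strings and the names $first and $second are chosen for illustration.

```perl
use strict;
use warnings;

# An ordinary match sets $1 and $2.
"key=value" =~ /(\w+)=(\w+)/ or die "sample match failed";

# A split, even one with capturing parentheses, runs in between.
my @parts = split(/(,)/, "a,b,c");

# Copy the capture variables to see whether the split disturbed them.
my ($first, $second) = ($1, $2);

print "parts: @parts\n";
print "captures after split: $first, $second\n";
```

If split is free of side effects as described, the captures still hold the text from the earlier match rather than anything from the split.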




7.7.4 Split's Match Operand with Capturing Parentheses



Capturing parentheses change the whole face of split. When they are used, the
returned list has additional, independent elements interjected for the item(s)
captured by the parentheses. This means that some or all text normally not returned
by split is now included in the returned list.


For example, as part of HTML processing, split(/(<[^>]*>)/) turns


     ···•and•<B>very•<FONT•color=red>very</FONT>•much</B>•effort···


into:


     ( '···•and•', '<B>', 'very•', '<FONT•color=red>',
       'very', '</FONT>', '•much', '</B>', '•effort···' )


With the capturing parentheses removed, split(/<[^>]*>/) returns:


     ( '···•and•', 'very•', 'very', '•much', '•effort···' )


The added elements do not count against a chunk limit. (The chunk limit limits the
chunks that the original string is partitioned into, not the number of elements
returned.)


If there are multiple sets of capturing parentheses, multiple items are added to the
list with each match. If there are sets of capturing parentheses that don't contribute
to a match, undef elements are inserted for them.
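
The points in this section can be checked with a short sketch. The HTML is the sample above without its elided ends; the other strings and all variable names are assumptions made for the example.

```perl
use strict;
use warnings;

my $html = 'and <B>very <FONT color=red>very</FONT> much</B> effort';

my @with    = split(/(<[^>]*>)/, $html);   # tags are interjected into the list
my @without = split(/<[^>]*>/, $html);     # tags are discarded

# Captured separators do not count against the chunk limit of 2.
my @limited = split(/(:)/, 'a:b:c:d', 2);

# Two sets of parentheses: the set that did not take part yields undef.
my @multi = split(/(-)|(,)/, 'a-b,c');

print scalar(@with), " with captures, ", scalar(@without), " without\n";
print scalar(@limited), " elements under a chunk limit of 2\n";
print join(' ', map { defined $_ ? "'$_'" : 'undef' } @multi), "\n";
```

With a chunk limit of 2 the string is still partitioned into only two chunks, yet three elements come back because the captured separator is returned between them.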


