Mastering Perl for Bioinformatics [Electronic resources] (text version)

A.16 Regular Expressions

Regular expressions are, in effect, an extra language that lives
inside the Perl language, and Perl's version has a great many
features. First, I'll summarize how regular expressions work in Perl;
then, I'll present some of their many features.

A.16.1 Overview

Regular expressions describe patterns in strings. The pattern
described by a single regular expression may match many different
strings.

Regular expressions are used in pattern matching, that is, when you
look to see if a certain pattern exists in a string. They can also
change strings, as with the s/// operator, which
substitutes a replacement for the pattern, if found. The related
tr operator transliterates several characters into replacement
characters throughout a string; it is bound to a string with the same
=~ syntax, but it takes lists of characters rather than a regular
expression. Regular expressions are case-sensitive unless explicitly
told otherwise.
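
As a quick illustration of the three operations just mentioned, here
is a minimal sketch. The string in $dna is made up for the example,
and the variable names are mine, not part of any library.

```perl
use strict;
use warnings;

my $dna = 'ACGTacgtNN';   # made-up sample string

# Pattern match: case matters, so /acgt/ finds only the lowercase run.
my $found = ($dna =~ /acgt/) ? 'yes' : 'no';

# Substitution: work on a copy and replace the first 'NN' with '--'.
(my $masked = $dna) =~ s/NN/--/;

# Transliteration: work on a copy and map each lowercase base
# to its uppercase form, one character at a time.
(my $upper = $dna) =~ tr/acgt/ACGT/;

print "match: $found\n";
print "substituted: $masked\n";
print "transliterated: $upper\n";
```

Copying the string before changing it, as in (my $masked = $dna),
leaves the original $dna untouched.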

The simplest pattern match is a string that matches itself. For
instance, to see if the pattern 'abc' appears in
the string 'abcdefghijklmnopqrstuvwxyz', write the
following in Perl:

$alphabet = 'abcdefghijklmnopqrstuvwxyz';
if( $alphabet =~ /abc/ ) {
    print $&;
}

The =~ operator binds a pattern match to a string.
/abc/ is the pattern abc,
enclosed in forward slashes to indicate that it's a
regular-expression pattern. $& is set to the
matched pattern, if any. In this case, the match succeeds, since
'abc' appears in the string
$alphabet, and the code just given prints out
abc.

Regular expressions are made from two kinds of characters. Many
characters, such as 'a' or 'Z',
match themselves. Metacharacters have a special meaning in the
regular-expression language. For instance, parentheses are used to
group other characters and don't match themselves.
If you want to match a metacharacter such as ( in
a string, you have to precede it with the backslash metacharacter,
writing \( in the pattern.

There are three basic ideas behind regular expressions. The first is
concatenation: two items next to each other in a regular-expression
pattern (that's the string between the forward
slashes in the examples) must match two items next to each other in
the string being matched (the $alphabet in the
examples). So, to match 'abc' followed by
'def', concatenate them in the regular expression:

$alphabet = 'abcdefghijklmnopqrstuvwxyz';
if( $alphabet =~ /abcdef/ ) {
    print $&;
}

This prints:

abcdef

The second major idea is alternation. Items separated by the
| metacharacter match any one of the items. For
example, the following:

$alphabet = 'abcdefghijklmnopqrstuvwxyz';
if( $alphabet =~ /a(b|c|d)c/ ) {
    print $&;
}

prints:

abc

The example also shows how parentheses group things in a regular
expression. The parentheses are metacharacters that
aren't matched in the string; rather, they group the
alternation, given as b|c|d, meaning any one of
b, c, or d
at that position in the pattern. Since b is
actually in $alphabet at that position, the
alternation, and indeed the entire pattern
a(b|c|d)c, matches in the
$alphabet. (One additional point:
ab|cd means (ab)|(cd), not
a(b|c)d.)
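
To make the difference between ab|cd and a(b|c)d concrete, here is a
small sketch that tries both patterns on two made-up strings and
records what each one matched.

```perl
use strict;
use warnings;

my @report;
for my $string ('abd', 'acd') {
    # /ab|cd/ means "ab" or "cd": the alternation spans the whole pattern.
    my $whole = ($string =~ /ab|cd/) ? $& : 'no match';

    # /a(b|c)d/ means "a", then "b" or "c", then "d".
    my $grouped = ($string =~ /a(b|c)d/) ? $& : 'no match';

    push @report, "$string: /ab|cd/ -> $whole, /a(b|c)d/ -> $grouped";
}
print "$_\n" for @report;
```

For 'abd' the first pattern matches only ab, while the grouped
pattern matches all of abd; for 'acd' the first pattern matches only
cd.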

The third major idea of regular expressions is repetition (or
closure). This is indicated in a pattern with the quantifier
metacharacter *, sometimes called the Kleene star
after one of the inventors of regular expressions. When
* appears after an item, it means that the item
may appear 0, 1, or any number of times at that place in the string.
So, for example, all of the following pattern matches will succeed:

'AC' =~ /AB*C/;
'ABC' =~ /AB*C/;
'ABBBBBBBBBBBC' =~ /AB*C/;

A.16.2 Metacharacters

The following are metacharacters:

\ | ( ) [ { ^ $ * + ? .

A.16.2.1 Escaping with \

A backslash \ before a metacharacter causes
it to match itself; for instance, \\ matches a
single \ in the string.
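
Here is a short sketch of escaping in practice. The annotation string
is made up for the example. It also uses Perl's built-in quotemeta
function, which this section doesn't otherwise cover, to add the
backslashes automatically.

```perl
use strict;
use warnings;

my $annotation = 'kinase (EC 2.7.11.1)';   # made-up sample text

# \( and \) match literal parentheses; \. matches a literal dot.
my $has_ec = ($annotation =~ /\(EC 2\.7\.11\.1\)/) ? 1 : 0;

# quotemeta() puts a backslash before every non-word character,
# so arbitrary text can be used as a literal pattern.
my $literal = quotemeta('(EC 2.7.11.1)');
my $same    = ($annotation =~ /$literal/) ? 1 : 0;

print "$has_ec $same\n";
print "$literal\n";
```

Without the backslashes, the parentheses would group and each dot
would match any character, so the pattern would describe something
different from the literal text.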

A.16.2.2 Alternation with |

The pipe | indicates alternation, as described previously.

A.16.2.3 Grouping with ( )

The parentheses ( ) provide grouping, as described
previously.

A.16.2.4 Character classes

Square brackets [ ] specify a character class. A character class
matches one character, which can be any character specified. For
instance, [abc] matches either a, or b, or c at that position (so
it's the same as a|b|c). Inside a character class, A-Z is a range
that matches any uppercase letter, a-z matches any lowercase letter,
and 0-9 matches any digit. For instance, [A-Za-z0-9] matches any
single letter or digit at that position. If the first character in a
character class is ^, any character except those specified matches;
for instance, [^0-9] matches any character that isn't a digit.
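
A negated character class is a handy way to check input. The
following sketch, using made-up sequences, reports the first
character in each string that isn't an uppercase A, C, G, or T.

```perl
use strict;
use warnings;

my @sequences = ('ACGTTGCA', 'ACGTNGCA', 'acgt');   # made-up samples
my @verdicts;
for my $seq (@sequences) {
    # [^ACGT] matches any single character that is not A, C, G, or T.
    if ($seq =~ /[^ACGT]/) {
        push @verdicts, "$seq: unexpected character '$&'";
    } else {
        push @verdicts, "$seq: only A, C, G, T";
    }
}
print "$_\n" for @verdicts;
```

The last string is flagged because character classes, like the rest
of a pattern, are case-sensitive by default.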

A.16.2.5 Matching any character with a dot

The period or dot . represents any character except a newline. (The
pattern modifier /s makes it also match a newline.) So, . is like a
character class that specifies every character except the newline.
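
The effect of /s on the dot can be seen in a small sketch. The
two-line string is made up for the example.

```perl
use strict;
use warnings;

my $record = "ACGT\nTTGA";   # made-up string with an embedded newline

# Without /s the dot cannot stand for the newline between the two Ts.
my $plain = ($record =~ /T.T/) ? 'match' : 'no match';

# With /s the dot matches the newline as well.
my $with_s = ($record =~ /T.T/s) ? 'match' : 'no match';

print "without /s: $plain\n";
print "with /s: $with_s\n";
```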

A.16.2.6 Beginning and end of strings with ^ and $

The ^ metacharacter
doesn't match a character; rather, it asserts that
the item that follows must be at the beginning of the string.
Similarly, the $
metacharacter doesn't match a character but asserts
that the item that precedes it must be at the end of the string (or
before the final newline). For example, /^Watson and Crick/ matches
if the string starts with Watson and Crick, and /Watson and Crick$/
matches if the string ends with Watson and Crick or
Watson and Crick\n.
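
The following sketch tries both anchored patterns on four made-up
strings and builds one digit per string, 1 for a match and 0 for a
failure. It also shows that $ tolerates only a single trailing
newline.

```perl
use strict;
use warnings;

my @strings = (
    "Watson and Crick met",
    "I met Watson and Crick",
    "I met Watson and Crick\n",
    "I met Watson and Crick\n\n",
);

my ($starts, $ends) = ('', '');
for my $string (@strings) {
    # ^ ties the pattern to the beginning of the string.
    $starts .= ($string =~ /^Watson and Crick/) ? 1 : 0;
    # $ ties it to the end, or to just before one final newline.
    $ends .= ($string =~ /Watson and Crick$/) ? 1 : 0;
}
print "starts: $starts\n";
print "ends: $ends\n";
```

Only the first string starts with the phrase. The second and third
end with it, but the fourth does not, because two newlines follow the
phrase.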

A.16.2.7 Quantifiers

These metacharacters
indicate the repetition of an item. The *
metacharacter indicates zero, one, or more of the preceding item. The
+ metacharacter indicates one or more of the preceding item. The
brace { } metacharacters
let you specify an exact count of the preceding item, or a range. For
instance, {3} means exactly three of the preceding
item; {3,7} means three, four, five, six, or seven
of the preceding item; and {3,} means three or
more of the preceding item. The ? matches none or
one of the preceding item.
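
To compare the quantifiers side by side, here is a sketch that builds
the made-up strings CG, CAG, CAAG, and so on up to eight As, and
records which counts of A each pattern accepts. The patterns are kept
in ordinary strings and interpolated into the match.

```perl
use strict;
use warnings;

my @patterns = ('CA*G', 'CA+G', 'CA?G', 'CA{3}G', 'CA{3,7}G', 'CA{3,}G');
my %accepted;

for my $pattern (@patterns) {
    my @counts;
    for my $n (0 .. 8) {
        my $string = 'C' . ('A' x $n) . 'G';   # C, then $n copies of A, then G
        push @counts, $n if $string =~ /$pattern/;
    }
    $accepted{$pattern} = join(',', @counts);
    print "$pattern accepts A-counts: $accepted{$pattern}\n";
}
```

Because each string has exactly one C and one G, the pattern can
match only if the whole run of As fits the quantifier.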

A.16.2.8 Making quantifiers match minimally with ?

The quantifiers just shown are greedy
(or maximal) by default, meaning that they match as many items as
possible. Sometimes, you want a minimal match that will match as few
items as possible. You get that by following any of the quantifiers
*, +, ?, or { } with a ?. So, for instance,
*? tries to match as few as possible, perhaps even
none, of the preceding item before it tries to match one or more of
the preceding item. Here's a maximal match:

'hear ye hear ye hear ye' =~ /hear.*ye/;
print $&;

This matches 'hear' followed by
.* (as many characters as possible), followed by
'ye', and prints:

hear ye hear ye hear ye

Here is a minimal match:

'hear ye hear ye hear ye' =~ /hear.*?ye/;
print $&;

This matches 'hear' followed by
.*? (the fewest number of characters possible),
followed by 'ye', and prints:

hear ye
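
The same difference shows up when pulling a bracketed item out of a
line of text. This is a minimal sketch on a made-up annotation line.

```perl
use strict;
use warnings;

my $header = 'kinase [Homo sapiens] [partial]';   # made-up annotation line

# Greedy: .* runs on to the last ] in the string.
my $greedy = ($header =~ /\[.*\]/) ? $& : '';

# Minimal: .*? stops at the first ] that lets the match succeed.
my $minimal = ($header =~ /\[.*?\]/) ? $& : '';

print "greedy: $greedy\n";
print "minimal: $minimal\n";
```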

A.16.3 Capturing Matched Patterns

You can place parentheses around parts of
the pattern for which you want to know the matched string. Take, for
example, the following:

$alphabet = 'abcdefghijklmnopqrstuvwxyz';
$alphabet =~ /k(lmnop)q/;
print $1;

This prints:

lmnop

You can place as many pairs of parentheses in a regular expression as
you like; Perl automatically stores their matched substrings in
special variables named $1, $2,
and so on. The matches are numbered in order of the left-to-right
appearance of their opening parenthesis.

Here's a more intricate example of capturing parts
of a matched pattern in a string:

$alphabet = 'abcdefghijklmnopqrstuvwxyz';
$alphabet =~ /(((a)b)c)/;
print "First pattern = ", $1,"\n";
print "Second pattern = ", $2,"\n";
print "Third pattern = ", $3,"\n";

This prints:

First pattern = abc
Second pattern = ab
Third pattern = a
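
Captured substrings are often copied into named variables right away.
Here is a sketch that splits a made-up FASTA-style header into an
identifier and a description; the header text and the variable names
are assumptions for the example.

```perl
use strict;
use warnings;

my $header = '>seq42 hypothetical protein';   # made-up header line

# Test the match first; $1 and $2 are only meaningful if it succeeded.
my ($id, $description) = ('', '');
if ($header =~ /^>([A-Za-z0-9]+) (.+)$/) {
    ($id, $description) = ($1, $2);
}

# In list context, a match returns the captured substrings directly.
my ($id_again, $description_again) = $header =~ /^>([A-Za-z0-9]+) (.+)$/;

print "$id | $description\n";
print "$id_again | $description_again\n";
```

Both forms give the same result here. The list-context form is
shorter, and it returns an empty list if the match fails.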

A.16.4 Metasymbols

Metasymbols are sequences of two or more characters consisting of a
backslash before one or more normal characters. These
metasymbols have special meanings in Perl regular expressions (and in
double-quoted strings for most of them). There are quite a few of
them, but that's because they're so
useful. Table A-3 lists most of these metasymbols.
The column "Atomic" indicates Yes
if the metasymbol matches an item, No if the metasymbol just makes an
assertion, and - if it takes some other action.

Table A-3. Alphanumeric metasymbols
Symbol	Atomic	Meaning
\0	Yes	Match the null character (ASCII NULL)
\NNN	Yes	Match the character given in octal, up to 377
\1, \2, ...	Yes	Match the nth previously captured string (n given in decimal)
\a	Yes	Match the alarm character (BEL)
\A	No	True at the beginning of a string
\b	Yes	Match the backspace character (BS); this meaning applies only inside a character class
\b	No	True at word boundary
\B	No	True when not at word boundary
\cX	Yes	Match the control character Control-X
\d	Yes	Match any digit character
\D	Yes	Match any nondigit character
\e	Yes	Match the escape character (ASCII ESC, not backslash)
\E	-	End case (\L, \U) or metaquote (\Q) translation
\f	Yes	Match the formfeed character (FF)
\G	No	True at end-of-match position of prior m//g
\l	-	Lowercase the next character only
\L	-	Lowercase till \E
\n	Yes	Match the newline character (usually NL, but CR on Macs)
\Q	-	Quote (de-meta) metacharacters till \E
\r	Yes	Match the return character (usually CR, but NL on Macs)
\s	Yes	Match any whitespace character
\S	Yes	Match any nonwhitespace character
\t	Yes	Match the tab character (HT)
\u	-	Titlecase the next character only
\U	-	Uppercase (not titlecase) till \E
\w	Yes	Match any "word" character (alphanumerics plus _ )
\W	Yes	Match any nonword character
\x{abcd}	Yes	Match the character given in hexadecimal
\z	No	True at end of string only
\Z	No	True at end of string or before optional newline
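
A few of these metasymbols working together are sketched below. The
record in $line is made up for the example, as are the variable
names.

```perl
use strict;
use warnings;

my $line = "gene42\tchr7  1200 1200";   # made-up, whitespace-separated record

# \w+ is a run of word characters, \s+ a run of whitespace,
# and \d+ a run of digits.
my ($name, $chrom, $start) = $line =~ /^(\w+)\s+(\w+)\s+(\d+)/;

# \b asserts a word boundary; \1 must match whatever the first pair of
# parentheses captured, so this finds a number that is repeated.
my $repeated = ($line =~ /\b(\d+)\s+\1\b/) ? $1 : 'none';

# \Z tolerates one trailing newline; \z insists on the very end.
my $with_newline = "ACGT\n";
my $big_z   = ($with_newline =~ /T\Z/) ? 1 : 0;
my $small_z = ($with_newline =~ /T\z/) ? 1 : 0;

print "$name $chrom $start\n";
print "repeated: $repeated\n";
print "\\Z: $big_z, \\z: $small_z\n";
```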

A.16.5 Extending Regular-Expression Sequences

Table A-4 includes several useful features that
have been added to Perl's regular-expression
capabilities.

Table A-4. Extended regular-expression sequences
Extension	Atomic	Meaning
(?#...)	No	Comment, discard
(?:...)	Yes	Cluster-only parentheses, no capturing
(?imsx-imsx)	No	Enable/disable pattern modifiers
(?imsx-imsx:...)	Yes	Cluster-only parentheses plus modifiers
(?=...)	No	True if lookahead assertion succeeds
(?!...)	No	True if lookahead assertion fails
(?<=...)	No	True if lookbehind assertion succeeds
(?<!...)	No	True if lookbehind assertion fails
(?>...)	Yes	Match nonbacktracking subpattern
(?{...})	No	Execute embedded Perl code
(??{...})	Yes	Match regex from embedded Perl code
(?(...)...|...)	Yes	Match with if-then-else pattern
(?(...)...)	Yes	Match with if-then pattern
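
Three of these extensions are sketched below on a made-up sequence.
The example also uses the built-in array @-, whose first element
holds the offset at which the last successful match started; that
variable isn't otherwise covered in this section.

```perl
use strict;
use warnings;

my $seq = 'ATATGCCATGAAA';   # made-up sequence

# (?:...) groups AT for the {2} quantifier without capturing,
# so $1 refers to the only capturing pair, (G).
my ($after_repeat) = $seq =~ /(?:AT){2}(G)/;

# (?=...) looks ahead without consuming characters:
# find an ATG that is immediately followed by AAA.
my ($offset, $matched) = ('none', '');
if ($seq =~ /ATG(?=AAA)/) {
    ($offset, $matched) = ($-[0], $&);
}

# (?i:...) ignores case for just the enclosed part of the pattern.
my $inline_i = ($seq =~ /(?i:atg)CC/) ? $& : 'no match';

print "$after_repeat\n";
print "$offset $matched\n";
print "$inline_i\n";
```

The lookahead match reports only ATG as the matched text, since the
AAA is checked but not included in the match.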

A.16.6 Pattern Modifiers

Pattern modifiers are single-letter commands placed after the forward
slashes that delimit a regular expression or a substitution. They
change the behavior of some regular-expression features. Table A-5
lists the most common pattern modifiers and is followed by an example.

Table A-5. Pattern modifiers
Modifier	Meaning
/i	Ignore upper- or lowercase distinctions
/s	Let . match newline
/m	Let ^ and $ match next to embedded \n
/x	Ignore (most) whitespace and permit comments in patterns
/o	Compile pattern once only
/g	Find all matches, not just the first one

As an example, say you were looking for a name in text, but you
didn't know if the name had an initial capital letter or was all
capitalized. You can use the /i modifier, like so:

$text = "WATSON and CRICK won the Nobel Prize";
$text =~ /Watson/i;
print $&;

This matches (since /i causes upper- and lowercase
distinctions to be ignored) and prints out the matched string
WATSON.
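
The other modifiers in Table A-5 can be combined in the same way.
Here is a sketch of /g, /i, /m, and /x on a made-up two-line string;
the variable names are mine.

```perl
use strict;
use warnings;

my $text = "ATGCATGC\natgaaa";   # two made-up lines in one string

# /g in list context returns every match, not just the first.
my @upper_only = $text =~ /ATG/g;

# /i can be combined with /g to ignore case as well.
my @any_case = $text =~ /ATG/gi;

# /m lets ^ match at the start of each embedded line.
my @line_starts = $text =~ /^([A-Za-z]{3})/mg;

# /x ignores whitespace in the pattern and permits comments.
my ($first_codon) = $text =~ /
    ^                # beginning of the string
    ( [ACGT]{3} )    # capture the first three bases
/x;

print scalar(@upper_only), " ", scalar(@any_case), "\n";
print "@line_starts\n";
print "$first_codon\n";
```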
