Web Database Applications with PHP and MySQL (2nd Edition)

3.3 Regular Expressions

In this section, we show how regular expressions
can achieve more sophisticated
pattern matching to find, extract, and replace complex substrings
within a string. While regular expressions provide capabilities
beyond those described in the last section, complex pattern matching
isn't as efficient as simple string comparisons. The
functions described in the previous section are more efficient than
those that use regular expressions and should be used if complex
pattern searches aren't required.

This section begins with a brief description of the POSIX regular
expression syntax. This isn't a complete description
of all of the capabilities, but we do provide enough details to
create quite powerful regular expressions. The second half of the
section describes the functions that use POSIX regular expressions.
Examples of regular expressions can also be found in Chapter 9.

3.3.1 Regular Expression Syntax

A regular expression follows a strict syntax to describe patterns of
characters. PHP has two sets of functions that use regular
expressions: one set supports the Perl Compatible Regular Expression
(PCRE) syntax, and the other supports the POSIX extended regular
expression syntax. In this book, we use the POSIX functions.

To demonstrate the syntax of regular expressions, we introduce the
function ereg( ):

boolean ereg(string pattern, string subject [, array var])

ereg( ) returns true if the
regular expression pattern is found in the
subject string. We discuss how the
ereg( ) function can extract values into the
optional array variable var later in this
section.

The following trivial example shows how ereg( )
is called to find the literal pattern cat in the
subject string "raining cats and dogs":

// prints "Found 'cat'"
if (ereg("cat", "raining cats and dogs"))
    print "Found 'cat'";

The regular expression cat matches the
subject string, and the fragment prints
"Found 'cat'".

3.3.1.1 Characters and wildcards

To represent any character in a pattern, a period is used as a
wildcard. The pattern c.. matches any three-letter
string that begins with a lowercase c; for
example, cat, cow,
cop, and so on. To express a pattern that actually
matches a period, use the backslash character \.
For example, .com matches both
.com and xcom but
\.com matches only .com.

The use of the backslash in a regular expression can cause confusion.
To include a backslash in a double-quoted string, you need to escape
the meaning of the backslash with a backslash. The following example
shows how the regular expression pattern "\.com"
is represented:

// Sets $found to true
$found = ereg("\\.com", "www.ora.com");

It's better to avoid the confusion and use single
quotes when passing a string as a regular expression:

$found = ereg('\.com', "www.ora.com");
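As a quick check of the wildcard and the escaped period, the following
sketch runs both patterns against the same subjects. It assumes a PHP
version where the POSIX functions are unavailable (they were removed in
PHP 7.0), so it uses preg_match( ), the PCRE counterpart, which needs
delimiters such as / around each pattern. The subject strings are
illustrative only.

```php
<?php
// Sketch only: preg_match() stands in for ereg(), which was removed in PHP 7.0.
// The subjects below are illustrative.
$subjects = array(".com", "xcom", "www.ora.com");
$results = array();
foreach ($subjects as $subject) {
    $results[$subject] = array(
        // Unescaped period: any character followed by "com"
        'wildcard' => preg_match('/.com/', $subject) === 1,
        // Escaped period: a literal "." followed by "com"
        'literal'  => preg_match('/\.com/', $subject) === 1,
    );
}
// "xcom" satisfies the wildcard pattern but not the escaped one
var_dump($results["xcom"]);
```

Single quotes are used for both patterns, so the backslash reaches the
regular expression function unchanged.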

3.3.1.2 Character lists

Rather than using a wildcard that
matches any character, a list of characters enclosed in
brackets can be specified
within a pattern. For example, to match a three-character string that
starts with a "p", ends with a
"p", and contains a vowel as the middle letter,
you can use the following expression:

ereg("p[aeiou]p", $var)

This returns true for any string that contains
"pap", "pep",
"pip", "pop", or
"pup". The character list in the regular
expression "p[aeiou]p" matches exactly one
character, so strings like "paep"
don't match.

A range of characters can also be
specified; for example, "[0-9]" specifies the
numbers 0 through 9:

// Matches "A1", "A2", "A3", "B1", ...
$found = ereg("[ABC][123]", "A1 Quality");  // true
// Matches "00" to "39"
$found = ereg("[0-3][0-9]", "27");  //true
$found = ereg("[0-3][0-9]", "42");  //false

A list can specify characters that must not match by using the
not operator ^
as the first character in the brackets. The pattern
"[^123]" matches any character other than 1, 2, or
3. The following examples show regular expressions that make use of
the not operator in lists:

// true for "pap", "pbp", "pcp", etc. but not "php"
$found = ereg("p[^h]p", "pap"); //true
// true if the subject contains at least one non-alphanumeric character
$found = ereg("[^0-9a-zA-Z]", "123abc"); // false

The ^ character can be used without its special meaning by
placing it in a position other than the start of the characters
enclosed in the brackets. For example,
"[0-9^]" matches the characters
0 to 9 and the ^ character. Similarly, the -
character can be matched by placing it at the start or the end of the
list; for example, "[-123]"
matches the characters -, 1,
2, or 3. The characters
^ and - have different meanings outside the
[] character lists.
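The effect of where ^ and - are placed in a list can be checked with a
few one-line tests. This is a minimal sketch: it uses preg_match( ) in
place of ereg( ) (the POSIX functions were removed in PHP 7.0), and
listFound( ) is a hypothetical helper introduced only for this
illustration.

```php
<?php
// Sketch only: preg_match() stands in for ereg(), which was removed in PHP 7.0.
// listFound() is a hypothetical helper: true if $pattern occurs in $subject.
function listFound($pattern, $subject)
{
    return preg_match('/' . $pattern . '/', $subject) === 1;
}

$caretFirst  = listFound('[^0-9]', "2^3");  // true: "^" is a non-digit
$caretLast   = listFound('[0-9^]', "^");    // true: "^" is a member of the list
$digitsOnly  = listFound('[^0-9]', "23");   // false: every character is a digit
$hyphenFirst = listFound('[-123]', "a-b");  // true: "-" is a member of the list
$hyphenRange = listFound('[1-3]', "a-b");   // false: "-" only marks the range 1 to 3
```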

3.3.1.3 Anchors

A regular expression can specify that
a pattern occurs at the start or end of a subject string using
anchors. The ^ character anchors a pattern to
the start, and the $ character anchors a pattern to the end of a string.
(Don't confuse this use of ^ with its completely
different use in character lists in the previous section.) For
example, the expression:

 ereg("^php", $var)

matches strings that start with "php" but not
others. The following code shows the operation of both:

$var = "to be or not to be";
$match = ereg('^to', $var); // true
$match = ereg('be$', $var); // true
$match = ereg('^or', $var); // false

The following illustrates the difference between the use of
^ as an anchor and the use of ^
in a character list:

$var = "123467";
// match strings that start with a digit
$match = ereg("^[0-9]", $var); // true
// match strings that contain any character other than a digit
$match = ereg("[^0-9]", $var); // false

Both start and end anchors can be used in a single regular expression
to match a whole string. The following example illustrates this:

// Must match "Yes" exactly
$match = ereg('^Yes$', "Yes");     // true
$match = ereg('^Yes$', "Yes sir"); // false
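The anchor examples can be reproduced on current PHP versions with
preg_match( ), used here as a stand-in because the POSIX functions were
removed in PHP 7.0. The sketch also shows one PCRE-specific detail that
matters when anchors are used to validate a whole string; the variable
names are illustrative.

```php
<?php
// Sketch only: preg_match() stands in for ereg(), which was removed in PHP 7.0.
$var = "to be or not to be";

$startsWithTo = preg_match('/^to/', $var) === 1;    // true
$endsWithBe   = preg_match('/be$/', $var) === 1;    // true
$containsOr   = preg_match('/or/', $var) === 1;     // true: no anchors
$isExactlyOr  = preg_match('/^or$/', $var) === 1;   // false: whole string must be "or"

// PCRE-specific detail: $ also matches just before a final newline
$withNewline  = preg_match('/^Yes$/', "Yes\n") === 1;   // true
// The D modifier restricts $ to the very end of the subject
$strictEnd    = preg_match('/^Yes$/D', "Yes\n") === 1;  // false
```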

3.3.1.4 Optional and repeating characters

When a character in a regular
expression is followed by a ? operator, the character matches zero
or one time. In other words, ? marks something
that is optional. A character followed by + matches one or more
times, and a character followed by * matches zero
or more times. Let's look at concrete examples of
these powerful operators.

The ? operator allows zero or one occurrence of a
character, so the expression:

ereg("pe?p", $var)

matches either "pep" or "pp",
but not the string "peep". The
* operator allows zero or many occurrences of the
"o" in the expression:

ereg("po*p", $var)

and matches "pp", "pop",
"poop", "pooop", and so on.
Finally, the + operator allows one to many
occurrences of "b" in the expression:

ereg("ab+a", $var)

so while strings such as "aba",
"abba", and "abbba" match,
"aa" doesn't.

The operators ?, *, and
+ can also be used with a wildcard or a list of
characters. The following examples show you how:

$var = "www.rmit.edu.au";
// True for strings that start with "www" and end with "au"
$matches = ereg('^www.*au$', $var); // true
$hexString = "x01ff";
// True for strings that end with 'x' followed by at least 
// one hexadecimal digit
$matches = ereg('x[0-9a-fA-F]+$', $hexString); // true

The first example matches any string that starts with
"www" and ends with "au"; the
pattern ".*" matches a sequence of any characters,
including an empty string. The second example matches any string
that ends with the character "x" followed by one
or more characters from the list [0-9a-fA-F].
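To compare the three operators side by side, the sketch below tests
each of the patterns discussed above against a handful of subjects. It
is an illustration under two assumptions: preg_match( ) is used in
place of ereg( ), which was removed in PHP 7.0, and both anchors are
added so that the whole subject has to match.

```php
<?php
// Sketch only: preg_match() stands in for ereg(), which was removed in PHP 7.0.
$patterns = array('pe?p', 'po*p', 'ab+a');
$subjects = array("pp", "pep", "peep", "pooop", "aa", "abba");

$table = array();
foreach ($patterns as $pattern) {
    foreach ($subjects as $subject) {
        // Anchor both ends so the whole subject has to match
        $table[$pattern][$subject] =
            preg_match('/^' . $pattern . '$/', $subject) === 1;
    }
}

// Shows which subjects consist of "p", zero or more "o", then "p"
print_r($table['po*p']);
```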

A fixed number of occurrences can be specified in
braces. For example, the
pattern "[0-7]{3}" matches three-character numbers
that contain the digits 0 through 7:

$valid = ereg("[0-7]{3}", "075"); // true
$valid = ereg("[0-7]{3}", "75");  // false

The braces syntax also allows the minimum and maximum occurrences of
a pattern to be specified as demonstrated in the following examples:

$val = "58273";
// true if $val contains numerals from start to end
// and is between 4 and 6 characters in length
$valid = ereg('^[0-9]{4,6}$', $val); // true
$val = "5827003";
$valid = ereg('^[0-9]{4,6}$', $val); // false
// Without the anchors at the start and end, the 
// matching pattern "582768" is found
$val = "582768986456245003";
$valid = ereg("[0-9]{4,6}", $val);   // true
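To see which characters an unanchored pattern actually consumes, the
matched text can be captured. The following sketch assumes
preg_match( ) as a replacement for ereg( ) (removed in PHP 7.0); its
third argument receives the matched substring in element 0.

```php
<?php
// Sketch only: preg_match() stands in for ereg(), which was removed in PHP 7.0.
$val = "582768986456245003";
$match = array();

// Unanchored: the first run of 4 to 6 digits is enough
$unanchored = preg_match('/[0-9]{4,6}/', $val, $match) === 1;
// The repetition is greedy, so six digits are taken
$matched = $match[0];

// Anchored: the whole 18-digit string would have to fit, so this fails
$anchored = preg_match('/^[0-9]{4,6}$/', $val) === 1;

// An exact count: three octal digits and nothing else
$octalOk  = preg_match('/^[0-7]{3}$/', "075") === 1;
$octalBad = preg_match('/^[0-7]{3}$/', "078") === 1;
```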

3.3.1.5 Groups

Subpatterns in a regular expression can be grouped by placing
parentheses around
them. This allows the optional and repeating operators to be applied
to groups rather than just a single character. For example, the
expression:

 ereg("(123)+", $var)

matches "123", "123123",
"123123123", and so on. Grouping characters allows
complex patterns to be expressed, as in the following example that
matches an alphabetic-only URL:

// A simple, incomplete, HTTP URL regular expression 
// that doesn't allow numbers
$pattern = '^(http://)?[a-zA-Z]+(\.[a-zA-Z]+)+$';
$found = ereg($pattern, "www.ora.com"); // true

Figure 3-1 shows the parts of this complex regular
expression and how they're interpreted. The regular
expression assigned to $pattern includes both the
start and end anchors, ^ and $,
so the whole subject string,
"www.ora.com", must match the pattern. The start of
the pattern is the optional group of characters
"http://", as specified by
"(http://)?". This doesn't match
any of the subject string in the example but doesn't
rule out a match, because the "http://" pattern is
optional. Next, the "[a-zA-Z]+" pattern specifies
one or more alphabetic characters, and this matches
"www" from the subject
string. The next pattern is the group
"(\.[a-zA-Z]+)". This pattern must start with a
period (the wildcard meaning of . is escaped with
the backslash) followed by one or more alphabetic characters. The
pattern in this group is followed by the +
operator, so the pattern must occur at least once in the subject and
can repeat many times. In the example, the first occurrence is
".ora" and the second occurrence is
".com".

Figure 3-1. Regular expression with groups

Groups can also define subpatterns when ereg( )
extracts values into an array. We discuss the use of ereg( )
to extract values later in this section.
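The alphabetic-only URL pattern can be exercised against a few
candidates to see what each group accepts and rejects. This is a
sketch under stated assumptions: preg_match( ) replaces ereg( ) (removed
in PHP 7.0), # is chosen as the pattern delimiter because the pattern
itself contains slashes, both character lists are written as [a-zA-Z],
and the candidate strings are made up for illustration.

```php
<?php
// Sketch only: preg_match() stands in for ereg(), which was removed in PHP 7.0.
// "#" is the delimiter because the pattern contains "/" characters.
$pattern = '#^(http://)?[a-zA-Z]+(\.[a-zA-Z]+)+$#';

$candidates = array(
    "www.ora.com",         // name followed by two ".name" groups
    "http://www.ora.com",  // optional prefix present
    "localhost",           // no ".name" group at all
    "www.example2.com",    // contains a digit
    "www..com",            // empty name between the periods
);

$results = array();
foreach ($candidates as $candidate) {
    $results[$candidate] = preg_match($pattern, $candidate) === 1;
}
print_r($results);
```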

3.3.1.6 Alternative patterns

Alternatives in a pattern are
specified with the | operator; for example, the
pattern "cat|bat|rat" matches
"cat", "bat", or
"rat". The | operator has the
lowest precedence of the regular expression operators, treating the
largest surrounding expressions as alternative patterns. To match
"cat", "bat", or
"rat" another way, the following expression can be
used:

$var = "bat";
$found = ereg("(c|b|r)at", $var);  // true

Another example shows alternative endings to a pattern:

// match some URL domains
$pattern = '(com$|net$|gov$|edu$)';
$found = ereg($pattern, "http://www.ora.com"); // true
$found = ereg($pattern, "http://www.rmit.edu.au"); // false
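Because | has the lowest precedence, anchors written next to an
alternative belong to that alternative only. The sketch below contrasts
an ungrouped and a grouped form; it assumes preg_match( ) in place of
ereg( ) (removed in PHP 7.0), and the patterns and subjects are
illustrative.

```php
<?php
// Sketch only: preg_match() stands in for ereg(), which was removed in PHP 7.0.
$loose  = '/^cat|rat$/';    // "starts with cat" OR "ends with rat"
$strict = '/^(cat|rat)$/';  // exactly "cat" or exactly "rat"

$subjects = array("catalog", "brat", "rat", "bat");
$results = array();
foreach ($subjects as $subject) {
    $results[$subject] = array(
        'loose'  => preg_match($loose, $subject) === 1,
        'strict' => preg_match($strict, $subject) === 1,
    );
}
print_r($results);
```

Grouping the alternatives, as in the second pattern, is the usual way
to apply one pair of anchors to every alternative.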

3.3.1.7 Escaping special characters

We've already
discussed the need to escape the special meaning of characters used
as operators in a regular expression. However, when to escape the
meaning depends on how the character is used. Escaping the special
meaning of a character is done with the backslash character,
as in the expression "2\+3", which matches the
string "2+3". If the +
isn't escaped, the pattern matches one or more
occurrences of the character 2 followed by the
character 3. Another way to write this expression
is to place the + in a list of characters, as in
"2[+]3". Because +
doesn't have the same meaning in a list, it
doesn't need to be escaped in that context.

Using character lists in this way can improve readability. The following
examples show how escaping is used and avoided:

// need to escape '(' and ')'
$phone = "(03) 9429 5555";
$found = ereg("^\([0-9]{2,3}\)", $phone); // true
// No need to escape (*.+?)| within brackets
$special = "Special Characters are (, ), *, +, ?, |";
$found = ereg("[(*.+?)|]", $special); // true
// The backslash always needs to be quoted
$backSlash = 'The backslash \ character';
$found = ereg('^[a-zA-Z \\]*$', $backSlash); //true
// Don't need to escape the dot within brackets
$domain = "www.ora.com";
$found = ereg("[.]com", $domain); //true

Another complication arises due to the fact that a regular expression
is passed as a string to the regular expression functions. Strings in
PHP can also use the backslash character to escape quotes and to
encode tabs, newlines, and so on. Consider the following example,
which matches a backslash character:

// single-quoted string containing a backslash
$backSlash = '\ backslash';
// Evaluates to true 
$found = ereg("^\\\\ backslash", $backSlash);

The regular expression looks quite odd: to match a backslash, the
regular expression function needs to escape the meaning of backslash,
but because we are using a double-quoted string, each of the two
backslashes needs to be escaped.
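The two layers of escaping can be made visible by inspecting the string
before it is handed to the regular expression function. This sketch
assumes preg_match( ) as a substitute for ereg( ) (removed in PHP 7.0);
the variable names are illustrative.

```php
<?php
// Sketch only: preg_match() stands in for ereg(), which was removed in PHP 7.0.
$subject = '\ backslash';        // one backslash, a space, then "backslash"

$double = "^\\\\ backslash";     // double-quoted: four typed, two stored
$single = '^\\\\ backslash';     // single-quoted: \\ also collapses to one

// Both literals produce the same text: ^\\ backslash
$sameText = ($double === $single);
// Count the backslashes that actually reach the regular expression
$stored = substr_count($double, '\\');

// The regular expression then reads \\ as one literal backslash
$found = preg_match('/' . $double . '/', $subject) === 1;
```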

3.3.1.8 Metacharacters

Metacharacters
can also be used in regular expressions. For example, the tab
character is represented as \t and the
newline character as \n. There are also
shortcuts: \d means any digit, and
\s means any whitespace. The following example
returns true because the tab character,
\t, is contained in the $source
string:

$source = "fast\tfood";
$result = ereg('\s', $source); // true

Special metacharacters in the form [:...:] can be
used in character lists to match other character classes. For
example, the character class specification
[:alnum:] can be used to check for alphanumeric
strings:

$str = "abc123";
// Evaluates to true
$result = ereg('^[[:alnum:]]+$', $str);
$str = "abc\xf623";
// Evaluates to false because of the \xf6 character
$result = ereg('^[[:alnum:]]+$', $str);

Be careful to use special metacharacter specifications only within a
character list. Outside this context, the regular expression
evaluator treats the sequence as a list specification:

$str = "abc123";
// Oops, left out the enclosing [] pair; evaluates to false
$result = ereg('^[:alnum:]+$', $str);

Table 3-2 shows the POSIX character class specifications
supported by PHP.

Table 3-2. POSIX character classes
Pattern	Matches
[:alnum:]	Letters and digits
[:alpha:]	Letters
[:blank:]	The Space and Tab characters
[:cntrl:]	Control characters: those with an ASCII code less than 32
[:digit:]	Digits. Equivalent to \d
[:graph:]	Characters represented with a visible character
[:lower:]	Lowercase letters
[:print:]	Characters represented with a visible character, and the space and tab characters
[:space:]	Whitespace characters. Equivalent to \s
[:upper:]	Uppercase letters
[:xdigit:]	Hexadecimal digits

The behavior of these character class specifications depends on your
locale settings. By default, the classes are interpreted for the
English language; however, other interpretations can be achieved by
calling setlocale( ) as discussed in Chapter 9.
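Several of the classes in Table 3-2 can be tried with one small helper.
The sketch is an illustration, not the book's code: it uses
preg_match( ) because ereg( ) was removed in PHP 7.0 (PCRE accepts the
same [:name:] classes inside a character list), it sticks to ASCII
subjects so the result doesn't depend on the locale, and
consistsOf( ) is a hypothetical helper name.

```php
<?php
// Sketch only: preg_match() stands in for ereg(), which was removed in PHP 7.0.
// consistsOf() is a hypothetical helper: true if every character of
// $subject belongs to the named POSIX class.
function consistsOf($class, $subject)
{
    // The class name goes inside a character list: [[:name:]]
    return preg_match('/^[[:' . $class . ':]]+$/', $subject) === 1;
}

$alnum     = consistsOf('alnum', "abc123");   // true
$alpha     = consistsOf('alpha', "abc123");   // false: digits are not letters
$digit     = consistsOf('digit', "2007");     // true
$xdigit    = consistsOf('xdigit', "01ff");    // true
$xdigitBad = consistsOf('xdigit', "01fg");    // false: "g" is not hexadecimal
$upper     = consistsOf('upper', "PHP");      // true
$space     = consistsOf('space', " \t\n");    // true
```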

3.3.2 Regular Expression Functions

PHP has several functions that use POSIX regular expressions to find and
extract substrings, replace substrings, and split a string into an
array. The functions to perform these tasks come in pairs: a
case-sensitive version and a case-insensitive version.

3.3.2.1 Finding and extracting values

The ereg( ) function, and the case-insensitive
version eregi( ), are defined as:

boolean ereg(string pattern, string subject [, array var])
boolean eregi(string pattern, string subject [, array var])

Both functions return true if the regular
expression pattern is found in the
subject string. An optional array variable
var can be passed as the third argument; it is
populated with the portions of subject that are
matched by up to nine grouped subexpressions in
pattern. Subexpressions consist of characters
enclosed in parentheses. Both functions return
false if the pattern
isn't found in the subject.

To extract values from a string into an array, patterns can be
arranged in groups contained by parentheses in the regular
expression. The following example shows how the year, month, and day
components of a date can be extracted into an array:

$parts = array( );
$value = "2007-04-12";
$pattern = '^([0-9]{4})-([0-9]{2})-([0-9]{2})$';
ereg($pattern, $value, $parts);
// Array ( [0] => 2007-04-12  [1] => 2007  [2] => 04  [3] => 12 )
print_r($parts);

The expression:

'^([0-9]{4})-([0-9]{2})-([0-9]{2})$'

matches dates in the format YYYY-MM-DD. After
calling ereg( ), $parts[0] is
assigned the portion of the string that matches the whole regular
expression, in this case the whole string
2007-04-12. The portion of the date that matches
each group in the expression is assigned to the following array
elements: $parts[1] contains the year matched by
([0-9]{4}), $parts[2] contains
the month matched by ([0-9]{2}), and
$parts[3] contains the day matched by
([0-9]{2}).
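The same extraction can be wrapped in a small function that also checks
the extracted values. This is a sketch under stated assumptions:
preg_match( ) is used instead of ereg( ) (removed in PHP 7.0) and fills
its third argument with the same layout described above, checkdate( )
is PHP's built-in calendar check, and splitDate( ) is a hypothetical
helper introduced for the example.

```php
<?php
// Sketch only: preg_match() stands in for ereg(), which was removed in PHP 7.0.
// splitDate() is a hypothetical helper, not part of PHP.
function splitDate($value)
{
    $parts = array();
    $pattern = '/^([0-9]{4})-([0-9]{2})-([0-9]{2})$/';
    if (preg_match($pattern, $value, $parts) !== 1) {
        return false;                  // not in YYYY-MM-DD format
    }
    // $parts[0] is the whole match; the groups follow in order
    list(, $year, $month, $day) = $parts;
    // checkdate() takes month, day, year and rejects dates such as 30 February
    if (!checkdate((int) $month, (int) $day, (int) $year)) {
        return false;
    }
    return array('year' => $year, 'month' => $month, 'day' => $day);
}

print_r(splitDate("2007-04-12"));
```

The pattern only checks the shape of the string, which is why the
calendar check is done separately.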

3.3.2.2 Replacing substrings

The following functions create new
strings by replacing substrings:

string ereg_replace(string pattern, string replacement, string source)
string eregi_replace(string pattern, string replacement, string source)

They create a new string by replacing substrings of the
source string that match the regular expression
pattern with a replacement
string. These functions are similar to the str_replace( )
function described earlier in
"Replacing Characters and
Substrings," except that the replaced substrings are
identified using a regular expression. Consider the examples:

$source = "The quick red fox jumps";
// prints "The quick brown fox jumps"
print ereg_replace("red", "brown", $source);
$source = "The quick brown fox jumps
over    the   lazy    dog";
// replace all whitespace sequences with a single space
// prints "The quick brown fox jumps over the lazy dog"; 
print ereg_replace("[[:space:]]+", " ", $source);

You can also include patterns matched by subexpressions in the
replacement string. The following example replaces all occurrences of
uppercase letters with the matched letter surrounded by
<b> and </b> tags:

$source = "The quick red fox jumps over the lazy Dog.";
// prints "<b>T</b>he quick red fox jumps over the lazy <b>D</b>og."
print ereg_replace("([A-Z])", '<b>\1</b>', $source);

The grouped subexpression is referenced in the replacement string
with the \1 sequence. Multiple subexpressions can
be referenced with \2, \3, and
so on. The following example uses three subexpressions to rearrange a
date from YYYY-MM-DD format to
DD/MM/YYYY format:

$value = "2004-08-24";
$pattern = '^([0-9]{4})-([0-9]{2})-([0-9]{2})$';
// prints "24/08/2004"
print ereg_replace($pattern, '\3/\2/\1', $value);
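On PHP versions without the POSIX functions (they were removed in PHP
7.0), the same replacements can be sketched with preg_replace( ). The
example assumes the $1, $2, $3 reference form, which preg_replace( )
accepts in the replacement string; the variable names are illustrative.

```php
<?php
// Sketch only: preg_replace() stands in for ereg_replace(), removed in PHP 7.0.
$source = "The quick red fox jumps over the lazy Dog.";
// $1 refers to the first group; single quotes stop PHP from
// treating it as a variable
$bold = preg_replace('/([A-Z])/', '<b>$1</b>', $source);

$value = "2004-08-24";
$pattern = '/^([0-9]{4})-([0-9]{2})-([0-9]{2})$/';
$rearranged = preg_replace($pattern, '$3/$2/$1', $value);

// A subject that doesn't match the pattern is returned unchanged
$unchanged = preg_replace($pattern, '$3/$2/$1', "24 August 2004");
```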

3.3.2.3 Splitting a string into an array

The following two functions split
strings:

array split(string pattern, string source [, integer limit])
array spliti(string pattern, string source [, integer limit])

They split the source string into an array,
breaking the string where the matching pattern
is found. These functions perform a similar task to the
explode( ) function described earlier and, as
with explode( ), a limit
can be specified to determine the maximum number of elements in the
array.

The following simple example shows how split( )
can break a sentence into an array of
"words" by recognizing any sequence
of non-alphabetic characters as separators:

$sentence = "I wonder why he does\nBuzz, buzz, buzz";
$words = split("[^a-zA-Z]+", $sentence);
print_r($words);

The $words array now contains each word as an
element:

Array
(
[0] => I
[1] => wonder
[2] => why
[3] => he
[4] => does
[5] => Buzz
[6] => buzz
[7] => buzz
)

When complex patterns aren't needed to break a
string into an array, the explode( ) function is
a better, faster choice.
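The choice between the two approaches can be sketched as follows. The
example assumes preg_split( ) as a stand-in for split( ), which was
removed in PHP 7.0, and shows one detail worth knowing when the subject
ends with a separator; the sample strings other than $sentence are
illustrative.

```php
<?php
// Sketch only: preg_split() stands in for split(), which was removed in PHP 7.0.
$sentence = "I wonder why he does\nBuzz, buzz, buzz";
$words = preg_split('/[^a-zA-Z]+/', $sentence);

// A trailing separator produces an empty last element ...
$withEmpty = preg_split('/[^a-zA-Z]+/', "Buzz, buzz, buzz!");
// ... unless PREG_SPLIT_NO_EMPTY is passed (-1 means "no limit")
$noEmpty = preg_split('/[^a-zA-Z]+/', "Buzz, buzz, buzz!", -1, PREG_SPLIT_NO_EMPTY);

// A fixed separator needs no regular expression
$fields = explode(",", "red,green,blue");

print_r($words);
```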