String Replacement with Regular Expressions
Using ereg or eregi with the regular expression syntax we've just learned, we can easily detect the
presence of tags in a given text string. However, what we need to do is pinpoint
those tags, and replace them with appropriate HTML tags. To achieve this,
we need to look at a couple more regular expression functions offered by PHP: ereg_replace and eregi_replace.

ereg_replace, like ereg, accepts
a regular expression and a string of text, and attempts to match the regular
expression in the string. In addition, ereg_replace takes
a second string of text, and replaces every match of the regular expression
with that string.

The syntax for ereg_replace is as follows:
$newstr = ereg_replace(regexp, replacewith, oldstr);
Here, regexp is the regular
expression, and replacewith is
the string that will replace matches of regexp in oldstr. The function returns the new string that's the outcome of the replacement
operation. In the above, this newly-generated string is stored in $newstr.

eregi_replace, as you might expect, is identical
to ereg_replace, except that the case of letters is not
considered when searching for matches.

We're now ready to build our custom markup language.
Boldface and Italic Text
Let's start by implementing tags that create boldface and italic text. Let's say we want [B] to
begin bold text and [EB] to end bold text. Obviously, we
must replace [B] with <strong> and [EB] with </strong>[2].
Achieving this is a simple application of eregi_replace[3]:
$joketext = eregi_replace('\[b]','<strong>',$joketext);
$joketext = eregi_replace('\[eb]','</strong>',$joketext);
Notice that, because [ normally indicates the start
of a set of acceptable characters in a regular expression, we put a backslash
before it in order to remove its special meaning. Without a matching [, the ] loses
its special meaning, so it doesn't need a backslash, although you could put
a backslash in front of it as well if you wanted to be thorough.

Also notice that, as we're using eregi_replace,
which is case insensitive, both [B] and [b] will
work as tags in our custom markup language.

Italic text can be done the same way:
$joketext = eregi_replace('\[i]','<em>',$joketext);
$joketext = eregi_replace('\[ei]','</em>',$joketext);
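The ereg family of functions was deprecated in PHP 5.3 and removed in PHP 7, so the calls above only run on older installations. As a rough sketch of the same four replacements on a newer PHP, the following uses preg_replace with the i (case-insensitive) modifier. The function name convertBoldItalic is a hypothetical helper introduced for illustration; it is not part of the original code.

```php
<?php
// Hypothetical helper: the four bold/italic replacements shown above,
// with preg_replace (PCRE) assumed as a substitute for eregi_replace.
function convertBoldItalic($text)
{
    // The trailing i makes each match case insensitive, like eregi_replace.
    // As in the patterns above, only the opening [ needs a backslash.
    $text = preg_replace('/\[b]/i', '<strong>', $text);
    $text = preg_replace('/\[eb]/i', '</strong>', $text);
    $text = preg_replace('/\[i]/i', '<em>', $text);
    $text = preg_replace('/\[ei]/i', '</em>', $text);
    return $text;
}

echo convertBoldItalic('A [B]bold[EB] and [i]italic[ei] joke.');
```

Run as is, this should print A <strong>bold</strong> and <em>italic</em> joke.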
Paragraphs
While we could create tags for paragraphs just as we did for boldface
and italicized text above, a simpler approach makes even more sense. As users
will type the content into a form field that allows them to format text using
the enter key, we shall take a single new line (\n) to
indicate a line break (<br />) and a double new line
(\n\n) to indicate a new paragraph (</p><p>).
Of course, because Windows computers represent an end-of-line as a carriage
return/new line pair (\r\n) and older Macintosh computers (before Mac OS X) represent it
as a lone carriage return (\r), we must convert both
to a plain new line first. The code for all this is as follows:
// Convert Windows (\r\n) line endings to new lines
$joketext = ereg_replace("\r\n","\n",$joketext);
// Convert old Macintosh (\r) line endings to new lines
$joketext = ereg_replace("\r","\n",$joketext);
// Handle paragraphs
$joketext = ereg_replace("\n\n",'</p><p>',$joketext);
// Handle line breaks
$joketext = ereg_replace("\n",'<br />',$joketext);
That's it! The text will now appear in the paragraphs expected by the
user, who hasn't had to learn any custom tags to format content into paragraphs.
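As a minimal sketch, the paragraph rules can be collected into one function. Because every search string here is fixed text rather than a pattern, the sketch uses PHP's plain str_replace function, which also keeps it runnable on PHP 7 and later, where the ereg functions are no longer available. The name formatParagraphs and the decision to add the outer <p> and </p> inside the function are assumptions made for illustration.

```php
<?php
// Hypothetical helper: normalizes line endings, then applies the
// paragraph and line break rules described above.
function formatParagraphs($text)
{
    // Convert Windows (\r\n) pairs first, then any lone carriage returns.
    $text = str_replace(array("\r\n", "\r"), "\n", $text);
    // A double new line closes one paragraph and opens the next.
    $text = str_replace("\n\n", '</p><p>', $text);
    // Any remaining single new line becomes a line break.
    $text = str_replace("\n", '<br />', $text);
    // Assumption: the caller wants the surrounding paragraph tags added here.
    return '<p>' . $text . '</p>';
}

echo formatParagraphs("First line\r\nsecond line\r\n\r\nNew paragraph");
```

Run as is, this should print <p>First line<br />second line</p><p>New paragraph</p>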
Hyperlinks
While supporting the inclusion of hyperlinks in the text of jokes may
seem unnecessary, this feature makes plenty of sense in other applications.
Hyperlinks are a little more complicated than the simple conversion of a fixed
code fragment into an HTML tag. We need to be able to output a URL, as well
as the text that should appear as the link.

Another feature of ereg_replace and eregi_replace comes
into play here. If you surround a portion of the regular expression with parentheses,
you can capture the corresponding portion of the matched
text, and use it in the replace string. To do this, you'll
use the code \\n, where n is
1 for the first parenthesized portion of the regular expression, 2 for the
second, up to 9 for the 9th. Consider this example:
$text = 'banana';
$text = eregi_replace('(.*)(nana)', '\\2\\1', $text);
echo($text); // outputs "nanaba"
In the above, \\1 gets replaced with ba in
the replace string, which corresponds to (.*) (zero or
more non-new line characters) in the regular expression. \\2 gets
replaced with nana, which corresponds to (nana) in
the regular expression.

We can use the same principle to create our hyperlinks. Let's begin
with a simple form of link, where the text of the link is the same as the
URL. We want to support this syntax:
Visit [L]http://www.php.net/[EL].
The corresponding HTML code, which we want to output, is as follows:
Visit <a href="http://www.php.net/">http://www.php.net/</a>.
First, we need a regular expression that will match links of this form.
The regular expression is as follows:
\[L][-_./a-zA-Z0-9!&%#?+,'=:~]+\[EL]
Again, we've placed backslashes in front of the opening square brackets
in [L] and [EL] to indicate that they
are to be taken literally. We then use square
brackets to list all the characters we wish to accept as part of the URL[4]. We place a + after the square brackets to
indicate that the URL will be composed of one or more characters taken from
this list.

To output our link, we'll need to capture the URL and output it both
as the href attribute of the a tag,
and as the text of the link. To capture the URL, we surround the corresponding
portion of our regular expression with parentheses:
\[L]([-_./a-zA-Z0-9!&%#?+,'=:~]+)\[EL]
So we convert the link with the following code:
$joketext = ereg_replace(
'\[L]([-_./a-zA-Z0-9!&%#?+,\'=:~]+)\[EL]',
'<a href="\\1">\\1</a>', $joketext);
Note that we had to escape the quote (') in the regular
expression with a backslash (\') to prevent PHP from thinking
it indicated the end of the regular expression string. Meanwhile, \\1 in
the replacement string gets replaced by the URL for the link, and the output
is as expected!

We'd also like to support hyperlinks whose link text differs from the
URL. Let's say the form of our link is as follows:
Check out [L=http://www.php.net/]PHP[EL].
Here's our regular expression (wrapped to fit on the page):
\[L=([-_./a-zA-Z0-9!&%#?+,'=:~]+)]
([-_./a-zA-Z0-9 !&%#?+$,'"=:;~]+)\[EL]
Quite a mess, isn't it? Squint at it for a little while, and you'll
see it achieves exactly what we need it to, capturing both the URL (\\1)
and the text (\\2) for the link. The PHP code that performs
the substitution is as follows:
$joketext = ereg_replace(
'\[L=([-_./a-zA-Z0-9!&%#?+,\'=:~]+)]'.
'([-_./a-zA-Z0-9 !&%#?+$,\'"=:;~]+)\[EL]',
'<a href="\\1">\\2</a>', $joketext);
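For readers on PHP 7 or later, where ereg_replace is no longer available, here is a rough sketch of both hyperlink forms using preg_replace instead. The function name convertLinks and the variables $url and $label are hypothetical. The @ pattern delimiter is an arbitrary choice, picked because @ does not appear in either character list. In preg_replace, $1 and $2 in the replacement string play the role of \\1 and \\2.

```php
<?php
// Hypothetical helper: both hyperlink forms, with preg_replace (PCRE)
// assumed as a substitute for ereg_replace.
function convertLinks($text)
{
    // The same character lists as in the expressions above.
    $url = '[-_./a-zA-Z0-9!&%#?+,\'=:~]+';
    $label = '[-_./a-zA-Z0-9 !&%#?+$,\'"=:;~]+';

    // [L]url[EL]: the URL doubles as the link text.
    $text = preg_replace(
        '@\[L](' . $url . ')\[EL]@',
        '<a href="$1">$1</a>', $text);
    // [L=url]text[EL]: the URL and the link text are captured separately.
    $text = preg_replace(
        '@\[L=(' . $url . ')](' . $label . ')\[EL]@',
        '<a href="$1">$2</a>', $text);
    return $text;
}

echo convertLinks('Check out [L=http://www.php.net/]PHP[EL].');
```

Run as is, this should print Check out <a href="http://www.php.net/">PHP</a>.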
Matching Tags
A nice side-effect of the regular expressions we developed to read hyperlinks
is that they will only find matched pairs of [L] and [EL] tags.
A [L] tag missing its [EL] or vice versa
will not be detected, and will appear unchanged in the finished document,
allowing the person updating the site to spot the error and fix it.

In contrast, the PHP code we developed for boldface and italic text
in "Boldface and Italic Text" will convert unmatched [B] and [I] tags
into unmatched HTML tags! This can easily lead to ugly situations like the
entire text of a joke starting from an unmatched tag being displayed in bold—possibly
even spilling into subsequent content on the page.

We can rewrite our code for bold/italic text in the same style as we
used for hyperlinks to solve this problem by only processing matched pairs
of tags:
$joketext = eregi_replace(
'\[b]([-_./a-zA-Z0-9 !&%#?+$,\'"=:;~]+)\[eb]',
'<strong>\\1</strong>',$joketext);
$joketext = eregi_replace(
'\[i]([-_./a-zA-Z0-9 !&%#?+$,\'"=:;~]+)\[ei]',
'<em>\\1</em>',$joketext);
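To see the matched-pair behavior in action on PHP 7 or later, where eregi_replace is no longer available, the following sketch assumes preg_replace with the i modifier as a stand-in. The function name convertMatchedTags and the variable $content are hypothetical.

```php
<?php
// Hypothetical helper: matched-pair bold/italic conversion, with
// preg_replace (PCRE) assumed as a substitute for eregi_replace.
function convertMatchedTags($text)
{
    // The same list of acceptable characters as in the expressions above.
    $content = '[-_./a-zA-Z0-9 !&%#?+$,\'"=:;~]+';
    // Only an opening tag followed by content and its closing tag matches.
    $text = preg_replace(
        '@\[b](' . $content . ')\[eb]@i',
        '<strong>$1</strong>', $text);
    $text = preg_replace(
        '@\[i](' . $content . ')\[ei]@i',
        '<em>$1</em>', $text);
    return $text;
}

echo convertMatchedTags('[B]Knock knock[EB] ... [b]who is there?');
```

Run as is, this should print <strong>Knock knock</strong> ... [b]who is there? with the unmatched [b] left in place for the author to spot.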
If unmatched tags aren't much of a concern for you,
however, you can actually simplify your code by not using regular expressions
at all! PHP's str_replace function works a lot like ereg_replace,
except that it only searches for strings—not patterns.
$newstr = str_replace(searchfor, replacewith, oldstr);
We can therefore rewrite our bold/italic code as follows:
$joketext = str_replace('[b]','<strong>',$joketext);
$joketext = str_replace('[eb]','</strong>',$joketext);
$joketext = str_replace('[i]','<em>',$joketext);
$joketext = str_replace('[ei]','</em>',$joketext);
One difference remains between this and our regular expression code.
We used eregi_replace in our previous code to match both
lowercase [b] and uppercase [B] tags,
as that function was case-insensitive. str_replace is
case sensitive, so we need to make a further modification to allow uppercase
tags:
$joketext = str_replace(
array('[b]','[B]'),'<strong>',$joketext);
$joketext = str_replace(
array('[eb]','[EB]'),'</strong>',$joketext);
$joketext = str_replace(
array('[i]','[I]'),'<em>',$joketext);
$joketext = str_replace(
array('[ei]','[EI]'),'</em>',$joketext);
str_replace lets you give an array for the search
string, so the above code will replace either [b] or [B] with <strong>, [eb] or [EB] with </strong>, and so
on. For more information about the intricacies of str_replace, refer to the PHP manual.

While this code looks more complicated than the original version with eregi_replace, str_replace is
a lot more efficient because it doesn't need to interpret your search string
for regular expression codes. Whenever str_replace can
do the job, you should always use it instead of ereg_replace or eregi_replace.

The joke.php file included in the code archive
makes use of str_replace; feel free to replace it with
the regular expression code above if you are worried about unmatched tags.

[2] You may be more accustomed to using <b> and <i> tags;
however, I have chosen to respect the most recent HTML standards, which recommend
replacing these with <strong> and <em>,
respectively.

[3] Experienced developers may object to this use of regular expressions.
Yes, regular expressions are not required for this simple example, and yes,
a single regular expression for both tags would be more appropriate than two
separate expressions. I'll address both of these issues later in this chapter.

[4] I have not included a space in the list of characters I want to allow
in a link URL. Although Microsoft Internet Explorer supports such URLs, spaces
in the path or file name portions of a URL should be replaced with the code %20,
and spaces in the query string should be replaced by +.
If you want to allow spaces in your URLs, feel free to add a space to the
list of characters in square brackets.