Build Your Own Database Driven Website Using PHP & MySQL [Electronic resources]

Kevin Yank

String Replacement with Regular Expressions

Using ereg or eregi with the regular expression syntax we've just learned, we can easily detect the presence of tags in a given text string. However, what we need to do is pinpoint those tags, and replace them with appropriate HTML tags. To achieve this, we need to look at a couple more regular expression functions offered by PHP: ereg_replace and eregi_replace.

ereg_replace, like ereg, accepts a regular expression and a string of text, and attempts to match the regular expression in the string. In addition, ereg_replace takes a second string of text, and replaces every match of the regular expression with that string.

The syntax for ereg_replace is as follows:

$newstr = ereg_replace(regexp, replacewith, oldstr);

Here, regexp is the regular expression, and replacewith is the string that will replace matches to regexp in oldstr. The function returns the new string that's the outcome of the replacement operation. In the above, this newly-generated string is stored in $newstr.

eregi_replace, as you might expect, is identical to ereg_replace, except that the case of letters is not considered when searching for matches.
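To see the argument order in action, here is a minimal sketch. It assumes you are running PHP 7 or later, where the ereg family of functions is no longer available, so it substitutes preg_replace: the pattern is wrapped in / delimiters, and the i flag plays the role of eregi_replace. The sample sentence is made up for illustration.

```php
<?php
// Assumption: PHP 7+, so preg_replace stands in for ereg_replace.
$oldstr = 'The cat sat on the Cat flap.';

// Case sensitive, like ereg_replace: only the lowercase "cat" matches.
$newstr = preg_replace('/cat/', 'dog', $oldstr);
echo $newstr, "\n";   // The dog sat on the Cat flap.

// Case insensitive, like eregi_replace: the i flag ignores letter case.
$newstr2 = preg_replace('/cat/i', 'dog', $oldstr);
echo $newstr2, "\n";  // The dog sat on the dog flap.
```

In both calls the order of arguments is the same as described above: pattern, replacement, then the original string.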

We're now ready to build our custom markup language.

Boldface and Italic Text

Let's start by implementing tags that create boldface and italic text. Let's say we want [B] to begin bold text and [EB] to end bold text. Obviously, we must replace [B] with <strong> and [EB] with </strong>^[2]. Achieving this is a simple application of eregi_replace^[3]:

$joketext = eregi_replace('\[b]','<strong>',$joketext);
$joketext = eregi_replace('\[eb]','</strong>',$joketext);

Notice that, because [ normally indicates the start of a set of acceptable characters in a regular expression, we put a backslash before it in order to remove its special meaning. Without a matching [, the ] loses its special meaning, so it doesn't need a backslash, although you could put a backslash in front of it as well if you wanted to be thorough.

Also notice that, as we're using eregi_replace, which is case insensitive, both [B] and [b] will work as tags in our custom markup language.

Italic text can be done the same way:

$joketext = eregi_replace('\[i]','<em>',$joketext);
$joketext = eregi_replace('\[ei]','</em>',$joketext);

Paragraphs

While we could create tags for paragraphs just as we did for boldface and italicized text above, a simpler approach makes even more sense. As users will type the content into a form field that allows them to format text using the enter key, we shall take a single new line (\n) to indicate a line break (<br />) and a double new line (\n\n) to indicate a new paragraph (</p><p>). Of course, because Windows computers represent an end-of-line as a carriage return/new line pair (\r\n), and browsers generally submit form text with the same pairs, we must strip out carriage returns first. The code for all this is as follows:

// Strip out carriage returns
$joketext = ereg_replace("\r",'',$joketext);
// Handle paragraphs
$joketext = ereg_replace("\n\n",'</p><p>',$joketext);
// Handle line breaks
$joketext = ereg_replace("\n",'<br />',$joketext);

That's it! The text will now appear in the paragraphs expected by the user, who hasn't had to learn any custom tags to format content into paragraphs.
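Here is a rough worked example of those three steps on a made-up joke. Because the search strings are fixed, the sketch uses str_replace, which also keeps it runnable on PHP versions that no longer include ereg_replace. Wrapping the result in an opening and closing paragraph tag is an assumption about how the page template outputs the text.

```php
<?php
// Hypothetical input, as a browser might submit it from a text area.
$joketext = "Knock knock.\r\nWho's there?\r\n\r\nBoo.";

// Strip out carriage returns
$joketext = str_replace("\r", '', $joketext);
// Handle paragraphs first, so that double new lines are not
// mistaken for two separate line breaks
$joketext = str_replace("\n\n", '</p><p>', $joketext);
// Handle the remaining single new lines as line breaks
$joketext = str_replace("\n", '<br />', $joketext);

// Assumption: the template supplies the outermost paragraph tags.
$html = '<p>' . $joketext . '</p>';
echo $html;
// <p>Knock knock.<br />Who's there?</p><p>Boo.</p>
```

Note that the order matters: if line breaks were handled before paragraphs, no double new lines would be left to convert.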

Hyperlinks

While supporting the inclusion of hyperlinks in the text of jokes may seem unnecessary, this feature makes plenty of sense in other applications. Hyperlinks are a little more complicated than the simple conversion of a fixed code fragment into an HTML tag. We need to be able to output a URL, as well as the text that should appear as the link.

Another feature of ereg_replace and eregi_replace comes into play here. If you surround a portion of the regular expression with parentheses, you can capture the corresponding portion of the matched text, and use it in the replace string. To do this, you'll use the code \\n, where n is 1 for the first parenthesized portion of the regular expression, 2 for the second, up to 9 for the 9th. Consider this example:

$text = 'banana';
$text = eregi_replace('(.*)(nana)', '\\2\\1', $text);
echo($text); // outputs "nanaba"

In the above, \\1 gets replaced with ba in the replace string, which corresponds to (.*) (zero or more non-new line characters) in the regular expression. \\2 gets replaced with nana, which corresponds to (nana) in the regular expression.
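If you'd like to try captured portions on a current PHP installation, the following sketch may help. It assumes PHP 7 or later and therefore uses preg_replace in place of eregi_replace; the \\1 and \\2 codes work the same way there. The second example, which swaps a surname and first name, is purely illustrative.

```php
<?php
// Assumption: PHP 7+, so preg_replace stands in for eregi_replace.
$text = 'banana';
// (.*) captures "ba" as \\1 and (nana) captures "nana" as \\2
$swapped = preg_replace('/(.*)(nana)/i', '\\2\\1', $text);
echo $swapped, "\n"; // nanaba

// Illustrative only: reorder "Surname, Firstname"
$name = 'Yank, Kevin';
$reordered = preg_replace('/(\w+), (\w+)/', '\\2 \\1', $name);
echo $reordered, "\n"; // Kevin Yank
```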

We can use the same principle to create our hyperlinks. Let's begin with a simple form of link, where the text of the link is the same as the URL. We want to support this syntax:

Visit [L]http://www.php.net/[EL].

The corresponding HTML code, which we want to output, is as follows:

Visit <a href="http://www.php.net/">http://www.php.net/</a>.

First, we need a regular expression that will match links of this form. The regular expression is as follows:

\[L][-_./a-zA-Z0-9!&%#?+,'=:~]+\[EL]

Again, we've placed backslashes in front of the opening square brackets in [L] and [EL] to indicate that they are to be taken literally. We then use square brackets to list all the characters we wish to accept as part of the URL^[4]. We place a + after the square brackets to indicate that the URL will be composed of one or more characters taken from this list.

To output our link, we'll need to capture the URL and output it both as the href attribute of the a tag, and as the text of the link. To capture the URL, we surround the corresponding portion of our regular expression with parentheses:

\[L]([-_./a-zA-Z0-9!&%#?+,'=:~]+)\[EL]

So we convert the link with the following code:

$joketext = ereg_replace(
'\[L]([-_./a-zA-Z0-9!&%#?+,\'=:~]+)\[EL]',
'<a href="\\1">\\1</a>', $joketext);

Note that we had to escape the quote (') in the regular expression with a backslash (\') to prevent PHP from thinking it indicated the end of the regular expression string. Meanwhile, \\1 in the replacement string gets replaced by the URL for the link, and the output is as expected!
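As a quick check of this conversion, here is a sketch you can run on PHP 7 or later, where ereg_replace is not available. It assumes preg_replace as a substitute and uses @ as the pattern delimiter, since the more usual / and # both appear in our list of URL characters.

```php
<?php
// Assumption: PHP 7+, so preg_replace stands in for ereg_replace.
$joketext = 'Visit [L]http://www.php.net/[EL].';

$linked = preg_replace(
    // @ delimits the pattern because / and # occur inside it
    '@\[L]([-_./a-zA-Z0-9!&%#?+,\'=:~]+)\[EL]@',
    // \\1 is the captured URL, used as both href and link text
    '<a href="\\1">\\1</a>',
    $joketext
);
echo $linked;
// Visit <a href="http://www.php.net/">http://www.php.net/</a>.
```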

We'd also like to support hyperlinks whose link text differs from the URL. Let's say the form of our link is as follows:

Check out [L=http://www.php.net/]PHP[EL].

Here's our regular expression (wrapped to fit on the page):

\[L=([-_./a-zA-Z0-9!&%#?+,'=:~]+)]
([-_./a-zA-Z0-9 !&%#?+$,'"=:;~]+)\[EL]

Quite a mess, isn't it? Squint at it for a little while, and you'll see it achieves exactly what we need it to, capturing both the URL (\\1) and the text (\\2) for the link. The PHP code that performs the substitution is as follows:

$joketext = ereg_replace(
'\[L=([-_./a-zA-Z0-9!&%#?+,\'=:~]+)]'.
'([-_./a-zA-Z0-9 !&%#?+$,\'"=:;~]+)\[EL]',
'<a href="\\1">\\2</a>', $joketext);
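The same kind of check works for this second form of link. Again, this is only a sketch: it assumes PHP 7 or later and swaps in preg_replace with an @ delimiter, while the character lists are the ones given above.

```php
<?php
// Assumption: PHP 7+, so preg_replace stands in for ereg_replace.
$joketext = 'Check out [L=http://www.php.net/]PHP[EL].';

$linked = preg_replace(
    '@\[L=([-_./a-zA-Z0-9!&%#?+,\'=:~]+)]' .          // \\1: the URL
    '([-_./a-zA-Z0-9 !&%#?+$,\'"=:;~]+)\[EL]@',       // \\2: the link text
    '<a href="\\1">\\2</a>',
    $joketext
);
echo $linked;
// Check out <a href="http://www.php.net/">PHP</a>.
```

Because the closing square bracket is not in the list of URL characters, the first captured portion stops at the end of the URL, and the link text is captured separately.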

Matching Tags

A nice side-effect of the regular expressions we developed to read hyperlinks is that they will only find matched pairs of [L] and [EL] tags. A [L] tag missing its [EL] or vice versa will not be detected, and will appear unchanged in the finished document, allowing the person updating the site to spot the error and fix it.

In contrast, the PHP code we developed for boldface and italic text in "Boldface and Italic Text" will convert unmatched [B] and [I] tags into unmatched HTML tags! This can easily lead to ugly situations like the entire text of a joke starting from an unmatched tag being displayed in bold—possibly even spilling into subsequent content on the page.

We can rewrite our code for bold/italic text in the same style as we used for hyperlinks to solve this problem by only processing matched pairs of tags:

$joketext = eregi_replace(
'\[b]([-_./a-zA-Z0-9 !&%#?+$,\'"=:;~]+)\[eb]',
'<strong>\\1</strong>',$joketext);
$joketext = eregi_replace(
'\[i]([-_./a-zA-Z0-9 !&%#?+$,\'"=:;~]+)\[ei]',
'<em>\\1</em>',$joketext);
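To see how matched pairs are handled differently from stray tags, consider this sketch. It assumes PHP 7 or later and uses preg_replace with the i flag in place of eregi_replace; the sample text is invented.

```php
<?php
// Assumption: PHP 7+, so preg_replace with the i flag stands in
// for eregi_replace.
$joketext = 'A [B]bold[EB] word and a stray [B] tag.';

$converted = preg_replace(
    '@\[b]([-_./a-zA-Z0-9 !&%#?+$,\'"=:;~]+)\[eb]@i',
    '<strong>\\1</strong>',
    $joketext
);
echo $converted;
// A <strong>bold</strong> word and a stray [B] tag.
```

The matched pair is converted, while the stray [B] has no closing [EB] after it and so is left in the text as typed.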

If unmatched tags aren't much of a concern for you, however, you can actually simplify your code by not using regular expressions at all! PHP's str_replace function works a lot like ereg_replace, except that it only searches for strings—not patterns.

$newstr = str_replace(searchfor, replacewith, oldstr);

We can therefore rewrite our bold/italic code as follows:

$joketext = str_replace('[b]','<strong>',$joketext);
$joketext = str_replace('[eb]','</strong>',$joketext);
$joketext = str_replace('[i]','<em>',$joketext);
$joketext = str_replace('[ei]','</em>',$joketext);

One difference remains between this and our regular expression code. We used eregi_replace in our previous code to match both lowercase [b] and uppercase [B] tags, as that function was case-insensitive. str_replace is case sensitive, so we need to make a further modification to allow uppercase tags:

$joketext = str_replace(
array('[b]','[B]'),'<strong>',$joketext);
$joketext = str_replace(
array('[eb]','[EB]'),'</strong>',$joketext);
$joketext = str_replace(
array('[i]','[I]'),'<em>',$joketext);
$joketext = str_replace(
array('[ei]','[EI]'),'</em>',$joketext);

str_replace lets you give an array for the search string, so the above code will replace either [b] or [B] with <strong>, [eb] or [EB] with </strong>, and so on. For more information about the intricacies of str_replace, refer to the PHP manual.

While this code looks more complicated than the original version with eregi_replace, str_replace is a lot more efficient because it doesn't need to interpret your search string for regular expression codes. Whenever str_replace can do the job, you should always use it instead of ereg_replace or eregi_replace.
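If your server runs PHP 5 or later, another option is str_ireplace, the case-insensitive sibling of str_replace. The following sketch is an alternative to the array-of-cases approach, not the code used in joke.php; it also passes arrays for both the search and replacement strings so that all four tags are handled in one call.

```php
<?php
// Assumption: PHP 5+, where str_ireplace is available.
$joketext = 'Some [B]bold[eb] and [i]italic[EI] text.';

// Each search string is paired with the replacement at the same
// position, and letter case in the tags is ignored.
$converted = str_ireplace(
    array('[b]', '[eb]', '[i]', '[ei]'),
    array('<strong>', '</strong>', '<em>', '</em>'),
    $joketext
);
echo $converted;
// Some <strong>bold</strong> and <em>italic</em> text.
```

This also catches mixed-case tags such as [Eb], which the arrays of lowercase and uppercase tags would miss.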

The joke.php file included in the code archive makes use of str_replace; feel free to replace it with the regular expression code above if you are worried about unmatched tags.

^[2]You may be more accustomed to using <b> and <i> tags; however, I have chosen to respect the most recent HTML standards, which recommend replacing these with <strong> and <em>, respectively.

^[3]Experienced developers may object to this use of regular expressions. Yes, regular expressions are not required for this simple example, and yes, a single regular expression for both tags would be more appropriate than two separate expressions. I'll address both of these issues later in this chapter.

^[4]I have not included a space in the list of characters I want to allow in a link URL. Although Microsoft Internet Explorer supports such URLs, spaces in the path or file name portions of a URL should be replaced with the code %20, and spaces in the query string should be replaced by +. If you want to allow spaces in your URLs, feel free to add a space to the list of characters in square brackets.