Mastering Regular Expressions (2nd Edition)

3.2 Care and Handling of Regular Expressions

The second concern outlined at the start of the chapter is the syntactic packaging
that tells an application "Hey, here's a regex, and this is what I want you to do
with it." egrep is a simple example because the regular expression is expected as
an argument on the command line. Any extra syntactic sugar, such as the single
quotes I used throughout the first chapter, is needed only to satisfy the command
shell, not egrep. Complex systems, such as regular expressions in programming
languages, require more complex packaging to inform the system exactly what the
regex is and how it should be used.

The next step, then, is to look at what you can do with the results of a match.
Again, egrep is simple in that it pretty much always does the same thing (displays lines that contain a match), but as the previous chapter began to show, the real
power is in doing much more interesting things. The two basic actions behind
those interesting things are match (to check if a regex matches in a string, and to
perhaps pluck information from the string), and search-and-replace, to modify a
string based upon a match. There are many variations of these actions, and many
variations on how individual languages let you perform them.
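As a rough illustration of these two basic actions, here is a minimal sketch in Python (one of the languages discussed later in this section). The sample header line and the variable names are made up for illustration.

```python
import re

line = "Subject: Care and handling"   # hypothetical sample input

# Match: check whether the regex matches, and pluck out part of the string.
m = re.search(r"^Subject: (.*)", line, re.IGNORECASE)
subject = m.group(1) if m else None

# Search-and-replace: build a modified copy of the string based on a match.
modified = re.sub(r"^Subject: ", "Re: ", line, flags=re.IGNORECASE)

print(subject)
print(modified)
```

The match leaves the original string alone and hands back the captured text; the search-and-replace returns a new string and leaves `line` unchanged.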

In general, a programming language can take one of three approaches to regular
expressions: integrated, procedural, and object-oriented. With the first, regular
expression operators are built directly into the language, as with Perl. In the other
two, regular expressions are not part of the low-level syntax of the language.
Rather, normal strings are passed as arguments to normal functions, which then
interpret the strings as regular expressions. Depending on the function, one or
more regex-related actions are then performed. One derivative or another of this
style is used by most (non-Perl) languages, including Java, the .NET languages, Tcl,
Python, PHP, Emacs lisp, and Ruby.

3.2.1 Integrated Handling

We've already seen a bit of Perl's integrated approach, such as this example from
Section 2.3.4:

if ($line =~ m/^Subject: (.*)/i) {
    $subject = $1;
}

Here, for clarity, variable names I've chosen are in italic, while the regex-related
items are bold, and the regular expression itself is underlined. We know that Perl
applies the regular expression ^Subject: (.*) to the text held in $line, and if a match is found, executes the block of code that follows. In that block, the variable $1 represents the text matched within the regular expression's parentheses, and this gets assigned to the variable $subject.

Another example of an integrated approach is when regular expressions are part
of a configuration file, such as for procmail (a Unix mail-processing utility). In the
configuration file, regular expressions are used to route mail messages to the sections
that actually process them. It's even simpler than with Perl, since the
operands (the mail messages) are implicit.

What goes on behind the scenes is quite a bit more complex than these examples
show. An integrated approach simplifies things for the programmer because it hides
in the background some of the mechanics of preparing the regular expression, setting
up for the match, applying the regular expression, and deriving results from that
application. Hiding these steps makes the normal case very easy to work with, but
as we'll see later, it can make some cases less efficient or clumsier to work with.

But, before getting into those details, let's uncover the hidden steps by looking at
the other methods.

3.2.2 Procedural and Object-Oriented Handling

Procedural and object-oriented handling are fairly similar. In either case, regex
functionality is provided not by built-in regular-expression operators, but by normal
functions (procedural) or constructors and methods (object-oriented). In this
case, there are no true regular-expression operands, but rather normal string arguments
that the functions, constructors, or methods choose to interpret as regular
expressions.
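The distinction can be sketched with Python's re module, which offers both styles. This is only an illustration: the sample line and the variable names are assumptions, not part of the examples that follow.

```python
import re

line = "subject: lunch on Friday"    # hypothetical sample input

# Procedural style: a plain function takes the regex as an ordinary string argument.
m1 = re.search("^Subject: (.*)", line, re.IGNORECASE)

# Object-oriented style: build a pattern object once, then call its methods.
pattern = re.compile("^Subject: (.*)", re.IGNORECASE)
m2 = pattern.search(line)

# Both routes interpret the same string as the same regular expression.
print(m1.group(1), "|", m2.group(1))
```

In both cases the regex arrives as a normal string; what differs is whether the compiled form is something the programmer holds on to.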

The next sections show examples in Java, VB.NET, and Python.

3.2.2.1 Regex handling in Java

Let's look at the equivalent of the "Subject" example in Java, using Sun's java.util.regex package. (Java is covered in depth in
Chapter 8.)

   import java.util.regex.*; // Make regex classes easily available
     .
     .
     .
   Pattern r = Pattern.compile("^Subject: (.*)", Pattern.CASE_INSENSITIVE);  // [1]
   Matcher m = r.matcher(line);                                               // [2]
   if (m.find()) {                                                            // [3]
       subject = m.group(1);                                                  // [4]
   }

Variable names I've chosen are again in italic, the regex-related items are bold, and
the regular expression itself is underlined. Well, to be precise, what's underlined is
a normal string literal to be interpreted as a regular expression.

This example shows an object-oriented approach with regex functionality supplied
by two classes in Sun's java.util.regex package: Pattern and Matcher. The
actions performed are:

`[1]`	Inspect the regular expression and compile it into an internal form that matches in a case-insensitive manner, yielding a "`Pattern`" object.
`[2]`	Associate it with some text to be inspected, yielding a "`Matcher`" object.
`[3]`	Actually apply the regex to see if there is a match in the previously-associated text, and let us know the result.
`[4]`	If there is a match, make available the text matched within the first set of capturing parentheses.

Actions similar to these are required, explicitly or implicitly, by any program wishing
to use regular expressions. Perl hides most of these details, and this Java
implementation usually exposes them.

A procedural example. Sun's Java regex package does, however, provide a few
procedural-approach "convenience functions" that hide much of the work. Rather
than require you to first create a regex object, then use that object's methods to
apply it, these static functions create a temporary object for you, throwing it away
once done. Here's an example showing the Pattern.matches(···) function:

   if (! Pattern.matches("\\s*", line))
   {
       // . . . line is not blank . . .
   }

This function wraps an implicit ^···$ around the regex, and returns a Boolean indicating
whether it can match the input string. It's common for a package to provide
both procedural and object-oriented interfaces, just as Sun did here. The differences
between them often involve convenience (a procedural interface can be easier
to work with for simple tasks, but more cumbersome for complex tasks),
functionality (procedural interfaces generally have less functionality and fewer options
than their object-oriented counterparts), and efficiency (in any given situation, one
is likely to be more efficient than the other, a subject covered in detail in
Chapter 6).
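The effect of that implicit anchoring can be sketched in Python, whose re.fullmatch function likewise succeeds only when the whole string matches, while re.search accepts a match anywhere. The helper names below are hypothetical and exist only for this illustration.

```python
import re

def is_blank(line: str) -> bool:
    # fullmatch succeeds only if the entire string matches, much like the
    # implicit ^...$ described above.
    return re.fullmatch(r"\s*", line) is not None

def search_finds_something(line: str) -> bool:
    # search is satisfied by a match anywhere in the string; since \s* can
    # match the empty string, this returns True for every input.
    return re.search(r"\s*", line) is not None

print(is_blank("   \t"), is_blank("  x "))
print(search_finds_something("  x "))
```

Forgetting which kind of function is in use is an easy way to write a "blank line" test that accepts every line.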

There are many regex packages for Java (half a dozen are discussed in Chapter 8),
but Sun is in a position to integrate theirs with the language more than anyone
else. For example, they've integrated it with the String class; the previous example
can actually be written as:

   if (! line.matches("\\s*"))
   {
       // . . . line is not blank . . .
   }

Again, this is not as efficient as a properly-applied object-oriented approach, and
so is not appropriate for use in a time-critical loop, but it's quite convenient for
"casual" use.

3.2.2.2 Regex handling in VB and other .NET languages

Although all regex engines perform essentially the same basic tasks, they differ in
how those tasks and services are exposed to the programmer, even among implementations
sharing the same approach. Here's the "Subject" example in VB.NET
(.NET is covered in detail in Chapter 9):

   Imports System.Text.RegularExpressions ' Make regex classes easily available
     .
     .
     .
   Dim R as Regex = New Regex("^Subject: (.*)", RegexOptions.IgnoreCase)
   Dim M as Match = R.Match(line)
   If M.Success
       subject = M.Groups(1).Value
   End If

Overall, this is generally similar to the Java example, except that .NET combines
steps [2] and [3], and requires an extra Value in [4]. Why the differences? One is not inherently better or worse; each was simply what its developers thought was the best approach at the time. (More on this in a bit.)

.NET also provides a few procedural-approach functions. Here's one to check for a
blank line:

   If Not Regex.IsMatch(Line, "^\s*$") Then
       ' . . . line is not blank . . .
   End If

Unlike Sun's Pattern.matches function, which adds an implicit ^···$ around the
regex, Microsoft chose to offer this more general function. It's just a simple wrapper
around the core objects, but it involves less typing and variable corralling for
the programmer, at only a small efficiency expense.

3.2.2.3 Regex handling in Python

As a final example, let's look at the Subject example in Python:

   import re
     .
     .
     .
   R = re.compile("^Subject: (.*)", re.IGNORECASE)
   M = R.search(line)
   if M:
       subject = M.group(1)

Again, this looks very similar to what we've seen before.

3.2.2.4 Why do approaches differ?

Why does one language do it one way, and another language another? There may
be language-specific reasons, but it mostly depends on the whim and skills of the
engineers who develop each package. In fact, there are many unrelated regular-expression packages for Java (see Chapter 8), each written by someone who
wanted the functionality that Sun didn't originally provide. Each has its own
strengths and weaknesses, but it's interesting to note that they all provide their
functionality in quite different ways from each other, and from what Sun eventually
decided to implement themselves.

3.2.3 A Search-and-Replace Example

The "Subject" example is pretty simple, so the various approaches really don't
have an opportunity to show how different they really are. In this section, we'll
look at a somewhat more complex example, further highlighting the different
designs.

In the previous chapter (see Section 2.3.6.5), we saw this Perl search-and-replace to "linkize" an email address:


   $text =~ s{
      \b
      # Capture the address to $1 . . .
      (
        \w[-.\w]*                            # username
        @
        [-\w]+(\.[-\w]+)*\.(com|edu|info)    # hostname
      )
      \b
   }{<a href="mailto:$1">$1</a>}gix;

Let's see how this is done in other languages.

3.2.3.1 Search-and-replace in Java

Here's the search-and-replace example with Sun's java.util.regex package:

   import java.util.regex.*; // Make regex classes easily available
     .
     .
     .
   Pattern r = Pattern.compile(
      "\\b                                                   \n"+
      "# Capture the address to $1 . . .                     \n"+
      "(                                                     \n"+
      "  \\w[-.\\w]*                            # username   \n"+
      "    @                                                 \n"+
      "  [-\\w]+(\\.[-\\w]+)*\\.(com|edu|info)  # hostname   \n"+
      ")                                                     \n"+
      "\\b                                                   \n",
      Pattern.CASE_INSENSITIVE|Pattern.COMMENTS);
   Matcher m = r.matcher(text);
   String result = m.replaceAll("<a href=\"mailto:$1\">$1</a>");
   System.out.println(result);

There are a number of things to note. Perhaps the most important is that each '\'
wanted in the regular expression requires '\\' in the string literal. Thus, using '\\w'
in the string literal results in '\w' in the regular expression. This is because regular
expressions are provided as normal Java string literals, which as we've seen before
(see Section 2.2.3.1), require special handling. For debugging, it might be useful to use

   System.out.println(r.pattern());

to display the regular expression as the regex function actually received it. One
reason that I include newlines in the regex is so that it displays nicely when
printed this way. Another reason is that each '#' introduces a comment that goes
until the next newline; so, at least some of the newlines are required to restrain
the comments.
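The same doubling issue arises in any language that supplies regexes as ordinary string literals. As a small sketch in Python (which also offers raw string literals as a way around it), the following compares the two spellings and prints the regex as the engine received it, much like the debugging aid just described. The variable names are arbitrary.

```python
import re

# In an ordinary string literal, each backslash meant for the regex is doubled.
doubled = re.compile("\\b\\w+\\b")

# A raw string literal passes backslashes through untouched.
raw = re.compile(r"\b\w+\b")

# .pattern holds the regex text as the engine received it.
print(doubled.pattern)
print(doubled.pattern == raw.pattern)
```

Both literals hand the engine the same seven-character regex; only the source-code spelling differs.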

Perl uses notations like /g, /i, and /x to signify special conditions (these are the modifiers for replace all, case-insensitivity, and free formatting modes; see Section 3.4.4), but java.util.regex uses either different functions (replaceAll
vs. replaceFirst) or flag arguments passed to the function (e.g., Pattern.CASE_INSENSITIVE and Pattern.COMMENTS).

3.2.3.2 Search-and-replace in VB.NET

The general approach in VB.NET is similar:

   Dim R As Regex = New Regex _
   ("\b                                                 " & _
    "(?# Capture the address to $1 . . . )              " & _
    "(                                                  " & _
    "  \w[-.\w]*                          (?# username)  " & _
    "  @                                                " & _
    "  [-\w]+(\.[-\w]+)*\.(com|edu|info)  (?# hostname)  " & _
    ")                                                  " & _
    "\b                                                 ",  _
    RegexOptions.IgnoreCase Or RegexOptions.IgnorePatternWhitespace)
   Dim Copy As String = R.Replace(text, "<a href=""mailto:${1}"">${1}</a>")
   Console.WriteLine(Copy)

Due to the inflexibility of VB.NET string literals (they can't span lines, and it's difficult to get newline characters into them), longer regular expressions are not as
convenient to work with as in some other languages. On the other hand, because
'\' is not a string metacharacter in VB.NET, the expression can be less visually cluttered. A double quote is a metacharacter in VB.NET string literals: to get one double
quote into the string's value, you need two double quotes in the string literal.
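For comparison, here is how the same linkizing substitution might look in Python. This is a sketch rather than an example from the text: the function name linkize, the sample sentence, and the mailto: link target are assumptions made for illustration.

```python
import re

# The address regex, written in free-formatting (verbose) mode.
ADDRESS = re.compile(r"""
    \b
    # Capture the address to group 1 . . .
    (
      \w[-.\w]*                            # username
      @
      [-\w]+(\.[-\w]+)*\.(com|edu|info)    # hostname
    )
    \b
    """, re.IGNORECASE | re.VERBOSE)

def linkize(text: str) -> str:
    # \1 in the replacement refers to the text captured by the first parentheses.
    return ADDRESS.sub(r'<a href="mailto:\1">\1</a>', text)

print(linkize("Write to someone@example.com today."))
```

A triple-quoted raw string sidesteps both problems discussed above: it may span lines, and backslashes need no doubling. The flags play the role of Perl's /i and /x, while replacing every match is simply the default behavior of sub.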

3.2.4 Search and Replace in Other Languages

Let's quickly look at a few examples from other traditional tools and languages.

3.2.4.1 Awk

Awk uses an integrated approach, /regex/, to perform a match on the current input line, and uses "var ~ ···" to perform a match on other data. You can see where Perl got its notation for matching. (Perl's substitution operator, however, is
modeled after sed's.) The early versions of awk didn't support a regex substitution,
but modern versions have the sub(···) operator:

   sub(/mizpel/, "misspell")

This applies the regex mizpel to the current line, replacing the first match with
misspell. Note how this compares to Perl's (and sed's) s/mizpel/misspell/.

To replace all matches within the line, awk does not use any kind of /g modifier,
but a different operator altogether: gsub(/mizpel/, "misspell").
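Other tools draw the first-versus-all line in yet another place. As a sketch for comparison, Python's re.sub replaces every match by default and takes a count argument to limit the number of replacements. The sample line is made up, and unlike awk's sub, which modifies the line in place, re.sub returns a new string.

```python
import re

line = "a mizpel is a mizpel"   # hypothetical sample input

first_only = re.sub(r"mizpel", "misspell", line, count=1)  # like awk's sub
every_one = re.sub(r"mizpel", "misspell", line)            # like awk's gsub

print(first_only)
print(every_one)
```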

3.2.4.2 Tcl

Tcl takes a procedural approach that might look confusing if you're not familiar
with Tcl's quoting conventions. To correct our misspellings with Tcl, we might use:

   regsub mizpel $var misspell newvar

This checks the string in the variable var, and replaces the first match of mizpel
with misspell, putting the now possibly-changed version of the original string
into the variable newvar (which is not written with a dollar sign in this case). Tcl expects the regular expression first, the target string to look at second, the replacement
string third, and the name of the target variable fourth. Tcl also allows
optional flags to its regsub, such as -all to replace all occurrences of the match
instead of just the first:

   regsub -all mizpel $var misspell newvar

Also, the -nocase option causes the regex engine to ignore the difference
between uppercase and lowercase characters (just like egrep's -i flag, or Perl's /i modifier).

3.2.4.3 GNU Emacs

The powerful text editor GNU Emacs (just "Emacs" from here on) supports elisp
(Emacs lisp) as a built-in programming language. It provides a procedural regex
interface with numerous functions providing various services. One of the main
ones is re-search-forward, which accepts a normal string as an argument and
interprets it as a regular expression. It then starts searching the text from the "current
position," stopping at the first match, or aborting if no match is found. (This
function is invoked when one invokes a "regexp search" while using the editor.)

As Table 3-3 shows, Emacs' flavor of regular expressions is heavily laden with backslashes. For example, \<\([a-z]+\)\([\n \t]\|<[^>]+>\)+\1\> is an expression for finding doubled words, similar to the problem in the first chapter. We couldn't use this regex directly, however, because the Emacs regex engine
doesn't understand \t and \n. Emacs double-quoted strings, however, do, and
convert them to the tab and newline values we desire before the regex engine
ever sees them. This is a notable benefit of using normal strings to provide regular
expressions. One drawback, particularly with elisp's regex flavor's propensity for
backslashes, is that regular expressions can end up looking like a row of scattered
toothpicks. Here's a small function for finding the next doubled word:

   (defun FindNextDbl ()
     "move to next doubled word, ignoring <···> tags" (interactive)
     (re-search-forward "\\<\\([a-z]+\\)\\([\n \t]\\|<[^>]+>\\)+\\1\\>")
   )

Combine that with

   (define-key global-map "\C-x\C-d" 'FindNextDbl)

and you can use the "Control-x Control-d" sequence to quickly search for doubled words.
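To see how much the flavor and the string-literal rules affect the look of a regex, here is a rough Python counterpart of the doubled-word search. It is a sketch under stated assumptions: \b stands in for Emacs' word-boundary anchors, the second group is made non-capturing, and the function name and sample text are hypothetical.

```python
import re

# Python's flavor needs far fewer backslashes than the elisp version:
# grouping parentheses and | are unescaped, and a raw string avoids doubling.
DOUBLED = re.compile(r"\b([a-z]+)(?:[\n \t]|<[^>]+>)+\1\b")

def find_next_doubled(text: str, start: int = 0):
    """Return (start, end) of the next doubled word at or after start, or None."""
    m = DOUBLED.search(text, start)
    return m.span() if m else None

sample = "Paris in the <b>the</b> spring"
span = find_next_doubled(sample)
print(sample[span[0]:span[1]])
```

The returned span covers both copies of the word and whatever separates them, which loosely corresponds to the stretch of text re-search-forward moves across in the editor.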

3.2.5 Care and Handling: Summary

As you can see, there's a wide range of functionalities and mechanics for achieving
them. If you are new to these languages, it might be quite confusing at this
point. But, never fear! When trying to learn any one particular tool, it is a simple
matter to learn its mechanisms.