13.2. ORO Regular Expressions
Once regular expressions are understood, the APIs for using them are straightforward. Prior to JDK 1.4 there were no regular expression facilities included with Java, and many third-party packages arose to fill the gap. One of the most comprehensive and widely used is the ORO package, originally from ORO, Inc. and later donated to the Jakarta project.
ORO handles many kinds of regular expressions, most of which follow the conventions of various Unix utilities, including Awk and the "globbing" patterns used by various shells. The most sophisticated regular expressions provided by ORO work like those in version 5 of the Perl language. These patterns may be considered an extension to the regular expressions discussed in the previous section and will be the ones used throughout this book.
Regular expressions are a language, and like most other languages they can be evaluated faster if they are compiled prior to being used. Pattern represents a compiled expression, and classes implementing the PatternCompiler interface are used to build a Pattern from the string representation of a pattern.[3]
[3] Regular expressions are not compiled into Java byte codes but into an internal representation of a finite state machine.
import org.apache.oro.text.regex.*;
Perl5Compiler compiler = new Perl5Compiler();
// compile() throws MalformedPatternException if the pattern is invalid
Pattern p = compiler.compile("^a[^az]z$");
The object returned in this case is a Perl5Pattern, which implements the Pattern interface. In general, when declaring variables it is better to use the more generic type, which in this case means declaring p to be a Pattern instead of a Perl5Pattern. This allows the Java compiler to ensure that p is used in such a way that a different kind of pattern could be easily substituted if the need ever arises.
A number of flags that change the behavior of patterns can be specified when the pattern is compiled. A full list is available in the Javadoc for the Perl5Compiler class, but two particularly useful ones are CASE_INSENSITIVE_MASK, which causes inputs to match regardless of case, and SINGLELINE_MASK, which lets "." match newline characters so that a pattern can span multiple lines.
Perl5Compiler compiler = new Perl5Compiler();
Pattern p = compiler.compile("Hello.*world",
                             Perl5Compiler.CASE_INSENSITIVE_MASK |
                             Perl5Compiler.SINGLELINE_MASK);
Once a pattern has been compiled, it is usually used as an argument to a PatternMatcher along with the input to be checked.
Perl5Matcher m = new Perl5Matcher();
if (m.matches("Hello, you great big world", p)) {
    System.out.println("It matches");
}
In addition to determining whether a string matches a pattern, it is also possible to determine whether a string contains a pattern with the contains() method. Note that asking whether an input contains a pattern p is equivalent to asking whether it matches the pattern .*p.*.
Following a call to matches() or contains(), more information on the match is available through a MatchResult object.
if (m.contains(input, pattern)) {
    MatchResult r = m.getMatch();
    System.out.println("The portion that matches is: " +
                       r.toString());
    // begin() and end() are offsets relative to the start of the match;
    // beginOffset() and endOffset() give positions within the input
    System.out.println("The matching portion begins at " +
                       r.begin(0));
    System.out.println("The matching portion ends at " +
                       r.end(0));
}
The meaning of the 0s used as arguments to begin() and end() will be made clear shortly.
It is possible for an input to contain multiple instances of a pattern. The methods that have just been used can be used to access each match in turn, but an auxiliary class is needed to hold information such as how far into the input the matcher has reached. This auxiliary class is PatternMatcherInput, and it is constructed from the input to be checked. Subsequent calls to contains() will determine whether the pattern is matched again starting at a point beyond where the previous match was found. This makes it easy to iterate over all matches.
Perl5Compiler compiler = new Perl5Compiler();
Pattern pattern = compiler.compile(thePattern);
// The PatternMatcherInput remembers where the previous match ended
PatternMatcherInput input = new PatternMatcherInput(theInput);
Perl5Matcher m = new Perl5Matcher();
while (m.contains(input, pattern)) {
    MatchResult r = m.getMatch();
    System.out.println("The portion that matches is: " +
                       r.toString());
}
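The difference between matching a whole input, finding a pattern inside it, and iterating over every occurrence can also be tried without ORO on the classpath. What follows is only a sketch, and it assumes the java.util.regex package (bundled since JDK 1.4) as a stand-in: Matcher.matches() and Matcher.find() play roughly the roles of ORO's matches() and contains(). The class and method names are invented for illustration and are not part of either library.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative sketch using java.util.regex, NOT the ORO API.
class MatchVersusContains {

    // Analogue of matches(): the whole input must fit the pattern.
    static boolean matchesWhole(String regex, String input) {
        return Pattern.compile(regex).matcher(input).matches();
    }

    // Analogue of contains(): some portion of the input must fit the pattern.
    static boolean containsMatch(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    // Analogue of looping over contains() with a PatternMatcherInput:
    // each call to find() resumes after the end of the previous match.
    static List<String> allMatches(String regex, String input) {
        List<String> found = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        String input = "say hello to the world";
        System.out.println(matchesWhole("hello", input));     // false
        System.out.println(containsMatch("hello", input));    // true
        System.out.println(matchesWhole(".*hello.*", input)); // true
        System.out.println(allMatches("[a-z]+", "one two three"));
    }
}
```

The third call illustrates the equivalence noted earlier: an input contains p exactly when it matches .*p.* (for single-line input, since "." does not match a newline by default).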
13.2.1. Subpatterns
Following a match it is often useful to determine which characters of the input correspond to a specific portion of the pattern. For example, the pattern "Hello, [A-Z][a-z]*" could be used to match against the first line of a letter. The name of the person being addressed will always correspond to the portion of the pattern consisting of an uppercase letter and an arbitrary number of lowercase letters. The ability to extract this information would make it easy to obtain the name for further processing.
A mechanism called subpatterns is available to do this. This consists of surrounding the portion or portions of interest of the pattern with parentheses, such as "Hello, ([A-Z][a-z]*)". At first, this may seem confusing because parentheses were also used to group subpatterns together, so a modifier such as "*" or "+" could act on all of them. There is no conflict, however: in Perl5 patterns every pair of parentheses both groups and captures, whether or not a modifier follows. The two uses can be combined: "aa((ba)*)" matches strings starting with two 'a's and followed by an arbitrary number of occurrences of "ba", capturing all of those occurrences as the first subpattern (the inner parentheses form a second subpattern that holds only the last "ba" matched).
After matching against a pattern containing subpatterns, the MatchResult object will have information about each submatch. Specifically, the begin() and end() methods will be able to identify the start and end of each submatch, and a method called group() will return the entire submatched string. Group zero is taken to be the entire pattern, which is why zero was used as an argument in previous uses of begin() and end(). The first submatch is then group one, the second is group two, and so on.
Perl5Compiler compiler = new Perl5Compiler();
Pattern pattern =
    compiler.compile("Hello ([A-Z][a-z]*)");
PatternMatcherInput input =
    new PatternMatcherInput("Hello Leela, how are you?");
Perl5Matcher m = new Perl5Matcher();
if (m.contains(input, pattern)) {
    MatchResult r = m.getMatch();
    System.out.println("The complete match is: " +
                       r.group(0));
    System.out.println("The letter was addressed to " +
                       r.group(1));
}
As a more realistic example, consider the task of parsing XML. An XML element begins with an opening tag, which consists of an opening angle bracket, followed by a word, followed by an arbitrary number of name/value pairs, and a closing angle bracket. Then there may be a body, which for simplicity will be assumed to have no nested tags. Finally, there is a closing tag.
The opening tag without attributes may be represented by the pattern <\w+>. This introduces a new convenience available with Perl5 patterns. The string \w represents any character that may appear within a word; it is equivalent to "[a-zA-Z0-9_]". Similarly, \W represents any nonword character, including punctuation and white space.
The attributes are represented by the complex-looking pattern (\W*\w+=\w+)*. This means "any number of iterations of any amount of white space, followed by a word, followed by an equals sign, followed by another word."
The body portion is refreshingly simple: [^<]*, meaning any number of characters except an opening bracket.
There is also a simple pattern that will match the closing tag: </\w+>. This will match any closing tag, not just the one for the tag that is currently being parsed. Using generic regular expressions, this is the best that could be hoped for because, as noted previously, regular expressions have no memory and so cannot remember what the opening tag was when they come to the closing tag. However, Perl5 regular expressions can do better because they can include back references. A back reference in a pattern matches the exact sequence of characters captured by a subpattern and is denoted by a backslash followed by a number indicating which subpattern to use. For example, the pattern ([a-z]*)\1 will match any string made of a sequence of lowercase letters repeated twice, such as "tartar".
Therefore, the pattern for the closing tag will be </\1>, and writing a simple XML parser is now a matter of wrapping each portion of the pattern in parentheses and proceeding as previously:
Perl5Compiler compiler = new Perl5Compiler();
Pattern pattern =
    compiler.compile(
        "<(\\w*)(\\W*\\w*=\"\\w*\")*>([^<]*) </\\1>",
        Perl5Compiler.MULTILINE_MASK);
PatternMatcherInput input =
    new PatternMatcherInput("<h1 color=\"red\">Hello! </h1>");
Perl5Matcher m = new Perl5Matcher();
if (m.contains(input, pattern)) {
    MatchResult r = m.getMatch();
    System.out.println("Opening tag: " + r.group(1));
    // A repeated group keeps only its last iteration,
    // so group 2 holds the final attribute matched
    System.out.println("Attributes: " + r.group(2));
    System.out.println("Body: " + r.group(3));
}
Note that MULTILINE_MASK is set. This flag affects only how ^ and $ treat line boundaries; the XML can span multiple lines in any case because \W and [^<] both match newline characters. Also note that every backslash in the regular expression needs to be preceded by an additional backslash in order to satisfy the rules for Java strings. Consequently the expression is rather complicated, but it can be understood by considering each small piece individually.
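The group numbering and back-reference behavior described in this section can be checked with a small experiment. The sketch below is an assumption-laden stand-in: it uses java.util.regex rather than ORO, on the understanding that both follow Perl 5 conventions for parentheses and \1, and its class and method names are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative sketch using java.util.regex, NOT the ORO API.
class GroupsAndBackReferences {

    // Returns the text captured by the given group for the first match
    // found in the input, or null if the pattern does not occur at all.
    static String capture(String regex, String input, int group) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group(group) : null;
    }

    // True if the whole input is some run of lowercase letters
    // immediately followed by an exact copy of itself.
    static boolean isDoubled(String input) {
        return Pattern.compile("([a-z]*)\\1").matcher(input).matches();
    }

    public static void main(String[] args) {
        String letter = "Hello Leela, how are you?";
        System.out.println(capture("Hello ([A-Z][a-z]*)", letter, 0)); // Hello Leela
        System.out.println(capture("Hello ([A-Z][a-z]*)", letter, 1)); // Leela

        // Nested parentheses: groups are numbered by their opening parenthesis.
        System.out.println(capture("aa((ba)*)", "aabababa", 1)); // bababa
        System.out.println(capture("aa((ba)*)", "aabababa", 2)); // ba

        System.out.println(isDoubled("tartar")); // true
        System.out.println(isDoubled("tartan")); // false
    }
}
```

Group 2 in the nested example holds only the last "ba" because a repeated group is overwritten on each iteration, which is also why a single group cannot return every attribute of a tag.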
13.2.2. Greediness and Reluctance
At this point it is natural to wonder if the restriction that the body must contain only text without nested tags could be removed. Because the pattern guarantees that the closing tag will match the opening tag, it would seem that arbitrary XML, such as
<tag1>
  some text
  <tag2>other text</tag2>
</tag1>
would match correctly. While the pattern works correctly in this case, it will not work in general. Consider the XML
<tag1>
  some text
  <tag2>other text</tag2>
</tag1>
<tag1>Yet more text </tag1>
Modifiers such as "*" and "+" are greedy, meaning they will consume as much of the input as possible without causing the match to fail. In the preceding example the closing pattern could match either of the </tag1> tags, and greediness will cause it to pick the second. This means the body will be parsed as
some text<tag2>other text</tag2></tag1><tag1>Yet more text
which is not correct.
Modifiers can be made reluctant by following them with a question mark, causing them to match the minimum number of occurrences of the pattern. Given the input "aaaabbbb", the pattern (a+)([ab]*) will set group one to "aaaa" and group two to "bbbb". The pattern (a+?)([ab]*) would cause group one to be "a" and group two to be "aaabbbb".
In the XML example a reluctant pattern could be used for the body, which fixes the case of two tags appearing consecutively. This introduces a new problem with nested tags: the input
<tag1>
  some text
  <tag1>Yet more text</tag1>
</tag1>
matched with a reluctant pattern would determine that the body is
some text<tag1> Yet more text
which is incorrect. The regrettable conclusion is that regular expressions are not quite powerful enough to parse full XML; more sophisticated techniques must be used.
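The greedy and reluctant behavior discussed in this section is easy to observe directly. The following is a minimal sketch, assuming java.util.regex as a stand-in for ORO (both use the Perl 5 "?" suffix for reluctant modifiers); the class name, method names, and the simplified single-line XML strings are invented for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative sketch using java.util.regex, NOT the ORO API.
class GreedyVersusReluctant {

    // Matches the whole input and returns groups one and two,
    // or null if the input does not match.
    static String[] split(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        if (!m.matches()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2) };
    }

    // Returns group one of the first match found, or null if there is none.
    static String firstBody(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String[] greedy = split("(a+)([ab]*)", "aaaabbbb");
        String[] reluctant = split("(a+?)([ab]*)", "aaaabbbb");
        System.out.println(greedy[0] + " / " + greedy[1]);       // aaaa / bbbb
        System.out.println(reluctant[0] + " / " + reluctant[1]); // a / aaabbbb

        // Two consecutive elements: the greedy body swallows the first closing tag.
        String consecutive = "<tag1>a</tag1><tag1>b</tag1>";
        System.out.println(firstBody("<tag1>(.*)</tag1>", consecutive));  // a</tag1><tag1>b
        System.out.println(firstBody("<tag1>(.*?)</tag1>", consecutive)); // a

        // Nested elements: the reluctant body stops at the inner closing tag.
        String nested = "<tag1>a<tag1>b</tag1></tag1>";
        System.out.println(firstBody("<tag1>(.*?)</tag1>", nested));      // a<tag1>b
    }
}
```

Neither choice of modifier handles both inputs, which is the limitation described above.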
13.2.3. Substitutions
Another common programming task involves replacing the portion of an input that matches a pattern with new text. For example, consider a story about some blueberries, a bluebird, and Bluebeard the pirate, and imagine that for some unknown reason the author wished to change it to blackberries, a blackbird, and Blackbeard. It would not be correct to just change all occurrences of "blue" to "black", as this would also inadvertently change "the sky was very blue" to "the sky was very black." What is needed is some way to capture strings matching the pattern "blue[a-z]+" and only change the "blue" part.
This could be done with the APIs already discussed by finding the start and end of the appropriate subpatterns and manually splicing in the new text. This is such a common situation that a special substitute() method has been provided in the Util class. This method uses a Perl5Substitution, which implements the more general Substitution interface and contains the text to substitute. Within the substitution, numbers preceded by dollar signs refer to group numbers: $1 means group 1, and so on.
String text = "Bluebeard the pirate and his bluebird " +
              "ate blueberries " +
              "beneath this sky so blue.";
Perl5Compiler c = new Perl5Compiler();
Pattern pattern =
    c.compile("blue([a-z]+)",
              Perl5Compiler.CASE_INSENSITIVE_MASK);
Perl5Matcher m = new Perl5Matcher();
// $1 in the substitution refers to the letters captured after "blue"
Substitution s =
    new Perl5Substitution("black$1");
String newText =
    Util.substitute(m, pattern, s, text, Util.SUBSTITUTE_ALL);
System.out.println(newText);
This would result in "blackbeard the pirate and his blackbird ate blackberries beneath this sky so blue." Note that the CASE_INSENSITIVE_MASK flag still causes the occurrence starting with a capital B to match, but the substitution has no way to determine that the first letter of "black" should be correspondingly capitalized. This could be remedied by using the pattern ([bB])lue([a-z]+) and the replacement string $1lack$2.
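The capitalization-preserving variant mentioned in this section, with the pattern ([bB])lue([a-z]+) and the replacement $1lack$2, can be tried out as follows. This is only a sketch and it assumes java.util.regex in place of ORO's Util.substitute(); Matcher.replaceAll() happens to accept the same $1 and $2 notation. The class and method names are hypothetical.

```java
import java.util.regex.Pattern;

// Illustrative sketch using java.util.regex, NOT the ORO API.
class PreserveCapitalization {

    // Group 1 keeps the original "b" or "B"; group 2 keeps the rest of the word.
    private static final Pattern BLUE_WORD = Pattern.compile("([bB])lue([a-z]+)");

    // Replaces "blue" with "black" only where more letters follow,
    // reusing whichever case the first letter had.
    static String blueToBlack(String text) {
        return BLUE_WORD.matcher(text).replaceAll("$1lack$2");
    }

    public static void main(String[] args) {
        String text = "Bluebeard the pirate and his bluebird "
                    + "ate blueberries "
                    + "beneath this sky so blue.";
        System.out.println(blueToBlack(text));
        // Blackbeard the pirate and his blackbird ate blackberries beneath this sky so blue.
    }
}
```

The final "blue" is left alone because [a-z]+ requires at least one more letter after it.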