13.3. Jakarta Regexp
Although ORO is quite comprehensive, it is not the only available regular expression package. It is not even the only one available from Jakarta! Another tool called Regexp was donated to the Jakarta project before ORO, and it is still available on the principle that different people have different needs and choice is always a good thing.
Regexp's regular expression syntax is largely similar to ORO's. All the standard features introduced in the first section of this chapter are supported, as are backreferences and greedy/reluctant modifiers.
Regexp also supports patterns representing sets of characters as defined in the POSIX standard for regular expressions.[4] These include "[:alpha:]" for any alphabetic character, "[:alnum:]" for alphanumerics, "[:space:]" for whitespace, and others.
[4] POSIX is a standard developed by The Portable Application Standards Committee, which is meant to ensure a level of interoperability between applications, APIs, and operating systems.
One distinguishing feature of Regexp is that it offers a more streamlined API than ORO. The class encapsulating a regular expression is called RE, and its constructor does the compilation automatically.
import org.apache.regexp.*;
RE pattern = new RE("([A-Z])([a-z ]*)");
A set of flags can be passed in as the second argument to affect how matches are done, including MATCH_CASEINDEPENDENT, which makes matching case-independent, and MATCH_SINGLELINE, which causes all input to be treated as a single line.
RE pattern = new RE("[a-z]*", RE.MATCH_CASEINDEPENDENT);
This pattern is equivalent to "[a-zA-Z]*" (which is also equivalent to "[:alpha:]*"). An instance of RE can be used to check for a match directly.
if (pattern.match("Hello world")) {
    System.out.println("It matches!");
}
Following a successful call to match(), any subpatterns will be available. RE calls these parens, and as with ORO the zeroth paren is the entire match. For example, using ([A-Z])([a-z ]*) as the pattern and Hello world as the input, pattern.getParen(0) would return Hello world, pattern.getParen(1) would return H, and pattern.getParen(2) would return ello world.
The match() method works like Perl5's pattern matching (see footnote 2) and hence will consider there to be a match if the pattern matches a substring of the input. However, there is no method that directly corresponds to the matches() method in ORO. The equivalent effect can be achieved by adding a caret before the pattern, which matches the start of the input, and a dollar sign at the end, which matches the end of the input.
Regexp also provides a number of utility methods that simplify common tasks. First, there is a method to split an input into pieces delineated by occurrences of a pattern.
RE pattern = new RE("[:digit:]+");
String[] pieces = pattern.split("octopus8thing726moose000");
for (int i = 0; i < pieces.length; i++) {
    System.out.println(pieces[i]);
}
This will print "octopus," "thing," and "moose." Note that if [:digit:]* were used instead, each individual letter would be printed because any two of the letters are separated by at least zero digits.
Second, Regexp provides a substitution facility much like ORO's.
RE pattern = new RE("([^l])oose");
System.out.println(
    pattern.subst("The moose and goose are on the loose",
                  "$1eese",
                  RE.REPLACE_ALL |
                  RE.REPLACE_BACKREFERENCES));
This prints "The meese and geese are on the loose," which is the correct output despite the fact that the plural of "moose" is not "meese."
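For comparison, the same backreference substitution can be sketched with the JDK's own java.util.regex package, assuming J2SE 1.4 or later is available. This is not Regexp's API: the class name PluralizeOose and the method pluralize() are made up for this illustration, and Matcher.replaceAll() stands in for RE.subst() with REPLACE_ALL and REPLACE_BACKREFERENCES.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of Regexp or ORO.
class PluralizeOose {

    // Replaces "Xoose" with "Xeese" for every X other than 'l',
    // carrying X over through the $1 backreference.
    static String pluralize(String input) {
        Pattern pattern = Pattern.compile("([^l])oose");
        Matcher matcher = pattern.matcher(input);
        return matcher.replaceAll("$1eese");
    }

    public static void main(String[] args) {
        System.out.println(pluralize("The moose and goose are on the loose"));
        // prints: The meese and geese are on the loose
    }
}
```

As in the Regexp version, "loose" is left alone because the character class [^l] refuses to match the letter in front of "oose".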
Errors to Watch For
At the time of this writing, substitutions with backreferences do not work, and calling the subst() method with them throws an exception.
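The anchoring technique described earlier for match(), adding a caret and a dollar sign to force a whole-input match, can be illustrated with a small sketch. To keep it runnable without extra jars it uses java.util.regex rather than Regexp, with Matcher.find() playing the role of RE.match() because both accept a match on any substring. The class AnchoredMatch and its two methods are hypothetical names chosen for this example.

```java
import java.util.regex.Pattern;

// Hypothetical helper class, not part of Regexp or ORO.
class AnchoredMatch {

    // Substring semantics: true if the pattern matches anywhere in the input.
    static boolean containsMatch(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    // Whole-input semantics obtained purely by anchoring:
    // ^ pins the match to the start and $ pins it to the end.
    // The non-capturing group keeps the anchors from binding to
    // only one branch if the pattern contains alternation.
    static boolean wholeMatch(String regex, String input) {
        return Pattern.compile("^(?:" + regex + ")$").matcher(input).find();
    }

    public static void main(String[] args) {
        System.out.println(containsMatch("[A-Z][a-z]*", "say Hello")); // true
        System.out.println(wholeMatch("[A-Z][a-z]*", "say Hello"));    // false
        System.out.println(wholeMatch("[A-Z][a-z]*", "Hello"));        // true
    }
}
```

The same idea should carry over to Regexp by passing the anchored pattern string to the RE constructor and calling match() as before.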