13.4. The JDK1.4 Pattern Matching Classes
Finally, it is worth mentioning that as of JDK1.4, regular expression facilities are available as part of the standard Java libraries by way of the java.util.regex package. This package provides yet another API, which some developers may prefer.
The regular expressions used by this package are nearly equivalent to those used in other packages. POSIX sets are specified with braces instead of brackets and preceded by \p; for example, \p{Lower} matches all lowercase letters. Subgroups (called capturing groups) and backreferences are also supported.
The fundamental regex class is Pattern, and it contains a static method that does the necessary compilation:
import java.util.regex.*;
Pattern p = Pattern.compile("m([aeiou])\\1se");
Once a pattern has been compiled, it is typically used to obtain another object called a Matcher, which will be used for subsequent activities.
Matcher m = p.matcher("moose");
Note that a Matcher matches against a particular input; if there are several inputs that must be checked against a pattern, a Matcher must be obtained for each (or an existing one pointed at the new input with reset()).
The most obvious thing a Matcher can do is determine whether the pattern matches the input.
System.out.println(m.matches());
will print true for the preceding input and pattern. Note again that this does a complete match, not a check for containment, and so would return false for the input "here is a moose."
The Pattern class has a shortcut if all that is needed is to check one input against one pattern.
Pattern.matches(pattern,input);
is exactly equivalent to
Pattern.compile(pattern).matcher(input).matches()
The bulk of the methods in Matcher deal with the case where a pattern occurs several times in an input. Normal usage is to obtain sequential occurrences of a match with find() and then use either the group() method or the start() and end() methods to locate the substring that matched.
String input = "The moose and goose are on the loose";
Pattern p = Pattern.compile(".oose");
Matcher m = p.matcher(input);
while(m.find()) {
    System.out.println(input.substring(m.start(),m.end()));
}
prints, as expected, moose, goose, and loose. Printing m.group() would have exactly the same result.
There is also an easy way to do the converse: find all strings outside matching regions. This can be done directly through the Pattern without using a Matcher by using the split() method, as with Regexp:
String input = "The moose and goose are on the loose";
Pattern p = Pattern.compile(".oose");
String strings[] = p.split(input);
for(int i=0;i<strings.length;i++) {
    System.out.println(strings[i]);
}
prints the strings "The ", " and ", and " are on the ". This can be thought of as a regex-enhanced version of the StringTokenizer class.
The part of a match that falls within a particular subpattern can be retrieved by passing the subpattern's number to group(), and its position by passing the same number to start() and end(). It is also possible to do substitutions based on subpatterns, which is provided by the replaceAll() and replaceFirst() methods in Matcher. These calls take a string that should replace the matching group, and within that string numbers preceded by a dollar sign refer to the contents of a subpattern, just as with the other packages discussed. Given the pattern "(a*)b(c*)", the input string "aaabcc", and the replacement string "$1 $2", the result would be "aaa cc".
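The calls described in this section can be tried together in one small program. The following is only a minimal sketch: the class name MatcherTour and its helper methods are made up for illustration, while the patterns and inputs are the ones used in the text.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MatcherTour {
    // The pattern from the text: 'm', a doubled vowel, then "se"
    static final Pattern MOOSE = Pattern.compile("m([aeiou])\\1se");

    // Complete match: the whole input must fit the pattern
    static boolean wholeMatch(String input) {
        return MOOSE.matcher(input).matches();
    }

    // Containment: the pattern may occur anywhere in the input
    static boolean contains(String input) {
        return MOOSE.matcher(input).find();
    }

    // Contents of subpattern 1 in the first match, or null if none
    static String doubledVowel(String input) {
        Matcher m = MOOSE.matcher(input);
        return m.find() ? m.group(1) : null;
    }

    // Substitution in which $1 and $2 refer to the subpatterns
    static String rearrange(String input) {
        Pattern p = Pattern.compile("(a*)b(c*)");
        return p.matcher(input).replaceAll("$1 $2");
    }

    public static void main(String[] args) {
        System.out.println(wholeMatch("moose"));              // true
        System.out.println(wholeMatch("here is a moose."));   // false
        System.out.println(contains("here is a moose."));     // true
        System.out.println(doubledVowel("here is a moose.")); // o
        System.out.println(rearrange("aaabcc"));              // aaa cc
    }
}
```

Compiling the pattern once into a static field, as done here, avoids recompiling it on every call.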
13.4.1. Two Longer Examples
Most variations of Unix come equipped with an incredibly useful utility called "grep"[5] that searches through files for lines matching a given regular expression. There are many options to grep, including the ability to print only the names of matching files or the matching lines, with or without line numbers. Grep can also be case sensitive or insensitive. It is possible to implement grep in Java using any of the tools discussed in this chapter. Listing 13.1 is such an implementation, using the built-in Java API, although it does not handle all of grep's features. In addition, the regular expression syntax provided by JDK1.4 more closely matches the syntax used by an extended version of grep called egrep, so egrep has been used as the name of the class.
[5] The name grep comes from the sequence of commands "g/regular expression/p", which is a command in the "ed" editor that finds every line matching a regular expression and prints it.
Listing 13.1. A Java implementation of egrep
package com.awl.toolbook.chapter13;

import java.util.List;
import java.util.regex.*;
import java.util.Iterator;

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.InputStreamReader;

import org.apache.commons.cli.*;

public class Egrep {
    // Output options
    private boolean showFilename = false;
    private boolean showLineNum  = false;
    private boolean showLine     = true;

    // Match options
    private boolean showMatching = true;
    private boolean ignoreCase   = false;
    // Set by -L: report only files in which no line matches
    private boolean withoutMatch = false;

    private Pattern thePattern = null;
    private int count = 0;

    private static Options makeOptions() {
        Options options = new Options();

        options.addOption(OptionBuilder
            .withDescription(
                "print line number with output lines")
            .withLongOpt("line-number")
            .create('n'));

        options.addOption(OptionBuilder
            .withDescription("ignore case distinctions")
            .withLongOpt("ignore-case")
            .create('i'));

        options.addOption(OptionBuilder
            .withDescription(
                "print the filename for each match")
            .withLongOpt("with-filename")
            .create('H'));

        options.addOption(OptionBuilder
            .withDescription("suppress the prefixing " +
                             "filename on output")
            .withLongOpt("no-filename")
            .create('h'));

        options.addOption(OptionBuilder
            .withDescription("select non-matching lines")
            .withLongOpt("invert-match")
            .create('v'));

        options.addOption(OptionBuilder
            .withDescription(
                "only print FILE names containing " +
                "matches")
            .withLongOpt("files-with-matches")
            .create('l'));

        options.addOption(OptionBuilder
            .withDescription(
                "only print FILE names containing " +
                "no match")
            .withLongOpt("files-without-match")
            .create('L'));

        return options;
    }

    public static void usage(Options options) {
        HelpFormatter formatter = new HelpFormatter();
        formatter.printHelp(
            "egrep [OPTION] ... PATTERN [FILE] ...",
            options,
            true);
    }

    private void handleFile(String name) {
        BufferedReader in = null;

        try {
            in = new BufferedReader(
                     new FileReader(name));
        } catch (Exception e) {
            System.err.println("Unable to open " + name);
            return;
        }

        handleInput(in,name);

        try {
            in.close();
        } catch (Exception e) {}
    }

    // Searches one input, a line at a time
    private void handleInput(BufferedReader in,String name) {
        String line;
        boolean matched = false;

        // Line numbers start at 1 and restart for each input
        count = 0;

        try {
            boolean done = false;
            while(!done && (line = in.readLine()) != null)
            {
                count++;
                Matcher m = thePattern.matcher(line);

                // With -v, showMatching is false and the
                // lines that do not match are selected
                if(m.find() == showMatching) {
                    matched = true;
                    if(withoutMatch) {
                        done = true;
                    } else {
                        output(name,line,count);
                        done = !showLine && !showLineNum;
                    }
                }
            }
        } catch (Exception e) {
            System.err.println("Error reading " + name);
            return;
        }

        if(withoutMatch && !matched) {
            System.out.println(name);
        }
    }

    private void output(String fname,String line,int num) {
        // Print a separator only between fields that are shown
        String sep = "";

        if(showFilename) {
            System.out.print(fname);
            sep = ":";
        }

        if(showLineNum) {
            System.out.print(sep);
            System.out.print(num);
            sep = ":";
        }

        if(showLine) {
            System.out.print(sep);
            System.out.print(line);
        }

        System.out.println();
    }

    public static void main(String args[]) {
        Egrep e = new Egrep();
        e.processArgs(args);
    }

    public Egrep() {}

    public Egrep(String args[]) {
        processArgs(args);
    }

    public void processArgs(String args[]) {
        CommandLine cmd          = null;
        Options options          = makeOptions();
        CommandLineParser parser = new BasicParser();

        // Parse the arguments, if there's an error
        // report usage
        try {
            cmd = parser.parse(options,args);
        } catch (ParseException e) {
            System.err.println(e.getMessage());
            usage(options);
            System.exit(-1);
        }

        List others = cmd.getArgList();
        Iterator it = others.iterator();

        // The first other arg should be the pattern
        // if it is not present, abort
        if(!it.hasNext()) {
            usage(options);
            System.exit(-1);
        }

        ignoreCase   = cmd.hasOption('i');
        showLineNum  = cmd.hasOption('n');
        showMatching = !cmd.hasOption('v');

        // As in grep, file names are shown by default only
        // when several files are searched (the pattern plus
        // at least two names); -H and -h override this
        showFilename = cmd.hasOption('H') ||
            (others.size() > 2 && !cmd.hasOption('h'));

        if(cmd.hasOption('l')) {
            showFilename = true;
            showLine     = false;
            showLineNum  = false;
            showMatching = true;
        }

        if(cmd.hasOption('L')) {
            showFilename = true;
            showLine     = false;
            showLineNum  = false;
            showMatching = true;
            withoutMatch = true;
        }

        String regex = it.next().toString();
        if(ignoreCase) {
            thePattern = Pattern.compile(
                             regex,
                             Pattern.CASE_INSENSITIVE);
        } else {
            thePattern = Pattern.compile(regex);
        }

        // With no FILE arguments, read standard input so
        // that Egrep can be used in a pipeline
        if(!it.hasNext()) {
            handleInput(
                new BufferedReader(
                    new InputStreamReader(System.in)),
                "(standard input)");
        }

        // The remaining arguments are the files to search
        while(it.hasNext()) {
            handleFile(it.next().toString());
        }

        System.exit(0);
    }
}
Command-line options are handled by the Commons CLI classes imported at the top of the listing (see Chapter 12). After parsing the options, the first remaining argument is taken to be the pattern, and the remaining arguments are file names; if no file names are given, standard input is read instead. Each file is processed a line at a time, looking for matches. When one is found, the output() method displays it according to the options. If the user has elected to display only the file name, processing of the file stops after the first match is found; otherwise it continues.
One mildly entertaining way to waste a couple of minutes during a lengthy compilation is to make up regular expressions and find English words that match them. This can be done by running Egrep over a comprehensive dictionary containing one word per line. Such a dictionary is provided with the Unix utility ispell, and it is also included on the companion CD-ROM for those without access to this utility. An example earlier in this chapter was designed to find all words with both an 'x' and a 'z'; Egrep can run that search directly.
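The pattern for words containing both an 'x' and a 'z' can also be tried without a dictionary file. The following is a minimal sketch, not part of Egrep: the class name XzWords and the short word list are made up for illustration, and find() is used because the pattern only needs to occur somewhere within each word.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

public class XzWords {
    // Either a 'z' followed somewhere by an 'x', or the reverse
    static final Pattern BOTH = Pattern.compile("z.*x|x.*z");

    // Returns the words in which the pattern occurs, in order
    static List<String> select(List<String> words) {
        List<String> result = new ArrayList<String>();
        for (String word : words) {
            if (BOTH.matcher(word).find()) {
                result.add(word);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Illustrative stand-in for a one-word-per-line dictionary
        List<String> words = Arrays.asList(
            "moose", "oxidize", "zebra", "zax", "box", "extemporize");
        System.out.println(select(words)); // [oxidize, zax, extemporize]
    }
}
```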
The words containing both an 'x' and a 'z' can be found with
java com.awl.toolbook.chapter13.Egrep 'z.*x|x.*z' words
It is also possible to use Egrep to find all words that contain each vowel exactly once. It was mentioned earlier that this cannot be done with a single regular expression because it would require state to be maintained. A way around this problem is to invoke Egrep multiple times: once to select all words that have exactly one 'a', once to select from that list words that have exactly one 'e', and so on. It is slightly more efficient to extract the words containing 'u' first because 'u' is much less common in English than 'a' or 'e', although the order does not change the result. In alphabetical order, the whole command would be
java com.awl.toolbook.chapter13.Egrep '^[^a]*a[^a]*$' words | java com.awl.toolbook.chapter13.Egrep '^[^e]*e[^e]*$' | java com.awl.toolbook.chapter13.Egrep '^[^i]*i[^i]*$' | java com.awl.toolbook.chapter13.Egrep '^[^o]*o[^o]*$' | java com.awl.toolbook.chapter13.Egrep '^[^u]*u[^u]*$'
There are more matches than one might expect, 1556, although many of these are not in common use.
Another possible use of regexps is in conjunction with databases. SQL has a simple expression language that can be used with the like keyword. The main wildcard in standard SQL is %, which means the same thing as the regular expression ".*" (the other, _, matches any single character), so
select * from table where name like '%a%'
would select all rows where the value of name contains an 'a'.
It would be useful to be able to select data based on more sophisticated criteria. While this isn't possible in every database, it is easy with hsqldb, covered in Chapter 10. Recall from that chapter that hsqldb can invoke any static Java method as a stored procedure. So by creating the following class
package com.awl.toolbook.chapter13;

import org.apache.oro.text.regex.*;

public class RegexpLike {
    public static boolean like(String pattern,
                               String value)
    {
        try {
            Perl5Compiler compiler = new Perl5Compiler();
            Pattern p = compiler.compile(pattern);
            Perl5Matcher m = new Perl5Matcher();
            return m.matches(value,p);
        } catch (MalformedPatternException e) {
            // An invalid pattern matches nothing
            return false;
        }
    }
}
regular expressions could be used in SQL as easily as
create alias RE_LIKE for
    "com.awl.toolbook.chapter13.RegexpLike.like";
select * from table where RE_LIKE('[bB]lue.*',name);
While this will work, it will be very inefficient. The like() method recompiles the pattern every time it is called, which in this case means once for every row. A hashtable could be used to get around this problem by mapping the string representation of a regular expression to the compiled form. This will solve the performance problem but at the cost of introducing potential space issues as the number of stored patterns increases. To fix this, a thread could be used that periodically clears out old patterns. A version of RegexpLike with these enhancements is shown in Listing 13.2.
Listing 13.2. An efficient implementation of RegexpLike
package com.awl.toolbook.chapter13;

import java.util.Hashtable;
import java.util.Iterator;

import org.apache.oro.text.regex.*;

public class RegexpLike implements Runnable {
    private static Hashtable regexps = new Hashtable();
    private static Hashtable times   = new Hashtable();
    private static Thread runner     = null;

    public static boolean like(String pattern,
                               String value)
    {
        Pattern p;

        // Both tables are guarded by the same lock so that
        // the cleanup thread always sees them in step
        synchronized(regexps) {
            if(runner == null) {
                runner = new Thread(new RegexpLike());
                // A daemon thread will not keep the JVM alive
                runner.setDaemon(true);
                runner.start();
            }

            p = (Pattern) regexps.get(pattern);
            if(p == null) {
                Perl5Compiler compiler = new Perl5Compiler();
                try {
                    p = compiler.compile(pattern);
                    regexps.put(pattern,p);
                    times.put(pattern,
                       new Long(System.currentTimeMillis()));
                } catch (Exception e) {}
            }
        }

        if(p != null) {
            Perl5Matcher m = new Perl5Matcher();
            return m.matches(value,p);
        }

        return false;
    }

    private static Long fiveMinutes =
        new Long(1000*60*5);

    public void run() {
        while(true) {
            try {Thread.sleep(fiveMinutes.longValue());}
            catch (Exception e) {}

            long now = System.currentTimeMillis();

            synchronized(regexps) {
                // Walk the keys (the pattern strings) and
                // remove through the iterator, so the table is
                // never modified behind the iterator's back
                Iterator i = regexps.keySet().iterator();
                while(i.hasNext()) {
                    Object key = i.next();
                    Long l = (Long) times.get(key);
                    if(l == null ||
                       fiveMinutes.longValue() <
                       (now - l.longValue()))
                    {
                        i.remove();
                        times.remove(key);
                    }
                }
            }
        }
    }
}
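The space problem can also be approached without a cleanup thread, by capping the number of cached patterns. The following is a sketch of that alternative rather than the approach taken in Listing 13.2: it assumes the java.util.regex classes in place of ORO, and the class name BoundedRegexpLike and the limit of 100 patterns are made up for illustration. It relies on LinkedHashMap, whose removeEldestEntry() hook discards the least recently used entry once the limit is exceeded.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

public class BoundedRegexpLike {
    // Hypothetical limit on the number of cached patterns
    static final int MAX_PATTERNS = 100;

    // Access-ordered map: the eldest entry is the least recently used.
    // The synchronized wrapper makes individual get/put calls thread-safe.
    private static final Map<String, Pattern> CACHE =
        Collections.synchronizedMap(
            new LinkedHashMap<String, Pattern>(16, 0.75f, true) {
                protected boolean removeEldestEntry(
                        Map.Entry<String, Pattern> eldest) {
                    return size() > MAX_PATTERNS;
                }
            });

    public static boolean like(String pattern, String value) {
        Pattern p = CACHE.get(pattern);
        if (p == null) {
            try {
                p = Pattern.compile(pattern);
            } catch (PatternSyntaxException e) {
                // An invalid pattern matches nothing
                return false;
            }
            CACHE.put(pattern, p);
        }
        return p.matcher(value).matches();
    }

    // Exposed only so the size limit can be observed
    static int cacheSize() {
        return CACHE.size();
    }

    public static void main(String[] args) {
        System.out.println(like("[bB]lue.*", "Blueberry")); // true
        System.out.println(like("[bB]lue.*", "green"));     // false
    }
}
```

The trade-off is that a pattern is evicted because the cache is full rather than because it is old, so a rarely used pattern can stay cached indefinitely as long as fewer than MAX_PATTERNS distinct patterns are in use.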