Mastering Regular Expressions (2nd Edition) [Electronic resources] نسخه متنی

8.3 Packages, Packages, Packages

There are many regex packages for Java; the list that follows has a few words
about those that I investigated while researching this book. (See this book's web
page, regex.info/, for links). The table below gives a superficial overview of some of the differences among their flavors.

Sun

java.util.regex Sun's own regex package, finally standard as of Java 1.4.
It's a solid, actively maintained package that provides a rich Perl-like flavor. It
has the best Unicode support of these packages. It provides all the basic functionality
you might need, but has only minimal convenience functions. It
matches against CharSequence objects, and so is extremely flexible in that
respect. Its documentation is clear and complete. It is the all-around fastest of
the engines listed here. This package is described in detail later in this chapter.
Version Tested: 1.4.0.

License: comes as part of Sun's JRE. Source code is available under SCSL (Sun
Community Source Licensing)

IBM

com.ibm.regex This is IBM's commercial regex package (although it's said to
be similar to the org.apache.xerces.utils.regex package, which I did not
investigate). It's actively maintained, and provides a rich Perl-like flavor,
although is somewhat buggy in certain areas. It has very good Unicode support.
It can match against char[], CharacterIterator, and String. Overall,
not quite as fast as Sun's package, but the only other package that's in the
same class.

Version Tested: 1.0.0.

License: commercial product

Table 1. Superficial Overview of Some Java Package Flavor Differences
Feature	Sun	IBM	ORO	JRegex	Pat	GNU	Regexp
Basic Functionality
Engine type	`NFA`	`NFA`	`NFA`	`NFA`	`NFA`	`POSIX NFA`	`NFA`
Deeply-nested parens
dot doesn't match:	`various`	`various`	`\n`	`\n,\r`	`\n`	`\r\n`	`\n`
`\s` includes `[• \t\r\n\f]`
`\w` includes underscore
Class set operators
POSIX `[[:···:]]`

Metacharacter Support
`\A,\z,\Z`	`\A,\Z`	`\A,\z,\Z`	`\A,\z,\Z`	`\A,\z,\Z`	`\A,\Z`	`\A,\Z`
`\G`
`(?#···)`
Octal escapes
2-, 4-, 6-digit hex escapes	2, 4	2, 4, 6	2	2, 4, 6	2		2, 4
Lazy quantifiers
Atomic grouping
Possessive quantifiers
Word boundaries	`\b`	`\b`	`\b`	`\< \b \>`	`\b`	`\< \>`
Non-word boundaries
`\Q···\E`
`(if then\|else)` conditional
Non-capturing parens
Lookahead
Lookbehind
`(? mod )`
`(? -mod :··· )`
`(? mod :··· )`

Unicode-Aware Metacharacters
Unicode properties
Unicode blocks
dot, `^`, `$`
`\w`
`\d`
`\s`			`partial`			`partial`	`partial`
Word boundaries
-supported - partial support - supported, but buggy (Version info see Section 8.3)

ORO

org.apache.oro.text.regex The Apache Jakarta project has two unrelated
regex packages, one of which is "Jakarta-ORO." It actually contains multiple
regex engines, each targeting a different application. I looked at one engine,
the very popular Perl5Compiler matcher. It's actively maintained, and solid,
although its version of a Perl-like flavor is much less rich than either the Sun or
the IBM packages. It has minimal Unicode support. Overall, the regex engine is
notably slower than most other packages. Its \G is broken. It can match against
char[] and String.

One of its strongest points is that it has a vast, modular structure that exposes
almost all of the mechanics that surround the engine (the transmission, searchand-
replace mechanics, etc.) so advanced users can tune it to suit their needs,
but it also comes replete with a fantastic set of convenience functions that
makes it one of the easiest packages to work with, particularly for those coming
from a Perl background (or for those having read Chapter 2 of this book).
This is discussed in more detail later in this chapter.

Version Tested: 2.0.6.

License: ASL (Apache Software License)

JRegex

jregex Has the same object model as Sun's package, with a fairly rich Perllike
feature set. It has good Unicode support. Its speed places it is in the middle
of the pack.

Version Tested: v1.01

License: GNU-like

Pat

com.stevesoft.pat It has a fairly rich Perl-like flavor, but no Unicode support.
Very haphazard interface. It has provisionh
on the fly. Its speed puts it on the high end of the middle of the pack.

Version Tested: 1.5.3

License: GNU LGPL (GNU Lesser General Public License)

GNU

gnu.regexp The more advanced of the two "GNU regex packages" for Java.
(The other, gnu.rex, is a very small package providing only the most barebones
regex flavor and support, and is not covered in this book.) It has some
Perl-like features, and minimal Unicode support. It's very slow. It's the only
package with a POSIX NFA (although its POSIXness is a bit buggy at times).

Version Tested: 1.1.4

License: GNU LGPL (GNU Lesser General Public License)

Regexp

org.apache.regexp This is the other regex package under the umbrella of
the Apache Jakarta project. It's somewhat popular, but quite buggy. It has the
fewest features of the packages listed here. Its overall speed is on par with
ORO. Not actively maintained. Minimal Unicode support.

Version Tested: 1.2

License: ASL (Apache Software License)

8.3.1 Why So Many "Perl5" Flavors?

The list mentions "Perl-like" fairly often; the packages themselves advertise "Perl5
support." When version 5 of Perl was released in 1994 (see Section 3.1.1.7), it introduced a new
level of regular-expression innovation that others, including Java regex developers,
could well appreciate. Perl's regex flavor is powerful, and its adoption by a wide
variety of packages and languages has made it somewhat of a de facto standard.

However, of the many packages, programs, and languages that claim to be "Perl5
compliant," none truly are. Even Perl itself differs from version to version as new
features are added and bugs are fixed. Some of the innovations new with early 5.x
versions of Perl were non-capturing parentheses, lazy quantifiers, lookahead,
inline mode modifiers like (?i), and the /x free-spacing mode (all discussed in Chapter 3). Packages supporting only these features claim a "Perl5" flavor, but miss out on later innovations, such as lookbehind, atomic grouping, and conditionals.

There are also times when a package doesn't limit itself to only "Perl5" enhancements.
Sun's package, for example, supports possessive quantifiers, and both Sun
and IBM support character class set operations. Pat offers an innovative way to do
lookbehind, and a way to allow matching of simple arbitrarily nested constructs.

8.3.2 Lies, Damn Lies, and Benchmarks

It's probably a common twist on Sam Clemens' famous "lies, damn lies, and statistics"
quote, but when I saw its use with "benchmarks" in a paper from Sun while
doing research for this chapter, I knew it was an appropriate introduction for this
section. In researching these seven packages, I've run literally thousands of benchmarks,
but the only fact that's clearly emerged is that there are no clear
conclusions.

There are several things that cloud regex benchmarking with Java. First, there are
language issues. Recall the benchmarking discussion from Chapter 6 (see Section 6.3.2), and
the special issues that make benchmarking Java a slippery science at best (primarily,
the effects of the Just-In-Time or Better-Late-Than-Never compiler). In doing
these benchmarks, I've made sure to use a server VM that was "warmed up" for
the benchmark (see "BLTN" Section 6.3.2), to show the truest results.

Then there are regex issues. Due to the complex interactions of the myriad of optimizations
like those discussed in Chapter 6, a seemingly inconsequential change
while trying to test one feature might tickle the optimization of an unrelated featur
e, anonymously skewing the results one way or the other. I did many (many!)
very specific tests, usually approaching an issue from multiple directions, and so I
believe I've been able to get meaningful results . . . but one never truly knows.

8.3.2.1 Warning: Benchmark results can cause drowsiness!

Just to show how slippery this all can be, recall that I judged the two Jakarta packages
(ORO and Regexp) to be roughly comparable in speed. Indeed, they finished
equally in some of the many benchmarks I ran, but for the most part, one generally
ran at least twice the speed of the other (sometimes 10x or 20x the speed).
But which was "one" and which "the other" changed depending upon the test.

For example, I targeted the speed of greedy and lazy quantifiers by applying ^.*:
and ^.*?: to a very long string like '···xxx:x'. I expected the greedy one to be
faster than the lazy one with this type of string, and indeed, it's that way for every
package, program, and language I tested . . . except one. For whatever reason,
Jakarta's Regexp's ^.*: performed 70% slower than its ^.*?:. I then applied the
same expressions to a similarly long string, but this time one like 'x:xxx···' where
the ':' is near the beginning. This should give the lazy quantifier an edge, and
indeed, with Regexp, the expression with the lazy quantifier finished 670x faster
than the greedy. To gain more insight, I applied ^[^:]*: to each string. This
should be in the same ballpark, I thought, as the lazy version, but highly contingent
upon certain optimizations that may or may not be included in the engine.
With Regexp, it finished the test a bit slower than the lazy version, for both strings.

Does the previous paragraph make your eyes glaze over a bit? Well, it discusses
just six tests, and for only one regex package we haven't even started to compar
e these Regexp results against ORO or any of the other packages. When compar
ed against ORO, it turns out that Regexp is about 10x slower with four of the
tests, but about 20x faster with the other two! It's faster with ^.*?: and ^[^:]*: applied to the long string with ':' at the front, so it seems that Regexp does poorly
(or ORO does well) when the engine must walk through a lot of string, and that
the speeds are reversed when the match is found quickly.

Are you eyes completely glazed over yet? Let's try the same set of six tests, but this
time on short strings instead of very long ones. It turns out that Regexp is faster
three to ten times faster than ORO for all of them. Okay, so what does this tell
us? Perhaps that ORO has a lot of clunky overhead that overshadows the actual
match time when the matches are found quickly. Or perhaps it means that Regexp
is generally much faster, but has an inefficient mechanism for accessing the target
string. Or perhaps it's something else altogether. I don't know.

Another test involved an "exponential match" (see Section 6.1.4) on a short string, which
tests the basic churning of an engine as it tracks and backtracks. These tests took a
long time, yet Regexp tended to finish in half the time of ORO. There just seems to
be no rhyme nor reason to the results. Such is often the case when benchmarking
something as complex as a regex engine.

8.3.2.2 And the winner is . . .

The mind-numbing statistics just discussed take into account only a small fraction
of the many, varied tests I did. In looking at them all for Regexp and ORO, one
package does not stand out as being faster overall. Rather, the good points and
bad points seem to be distributed fairly evenly between the two, so I (perhaps
somewhat arbitrarily) judge them to be about equal.

Adding the benchmarks from the five other packages into the mix results in a lot
of drowsiness for your author, and no obviously clear winner, but overall, Sun's
package seems to be the fastest, followed closely by IBM's. Following in a group
somewhat behind are Pat, Jregex, Regexp, and ORO. The GNU package is clearly
the slowest.

The overall difference between Sun and IBM is not so obviously clear that another
equally comprehensive benchmark suite wouldn't show the opposite order if the
suite happened to be tweaked slightly differently than mine. Or, for that matter, it's
entirely possible that someone looking at all my benchmark data would reach a
different conclusion. And, of course, the results could change drastically with the
next release of any of the packages or virtual machines (and may well have, by
the time you read this). It's a slippery science.

In general, Sun did most things very well, but it's missing a few key optimizations,
and some constructs (such as character classes) are much slower than one would
expect. Over time, these will likely be addressed by Sun (and in fact, the slowness
of character classes is slated to be fixed in Java 1.4.2). The source code is available
if you'd like to hack on it as well; I'm sure Sun would appreciate ideas and
patches that improve it.

8.3.3 Recommendations

There are many reasons one might choose one package over another, but Sun's
java.util.regex packagewith its high quality, speed, good Unicode support,
advanced features, and future ubiquityis a good recommendation. It comes integrated
as part of Java 1.4: String.matches(), for example, checks to see whether
the string can be completely matched by a given regex.

java.util.regex's strengths lie in its core engine, but it doesn't have a good set
of "convenience functions," a layer that hides much of the drudgery of bit-shuffling
behind the scenes. ORO, on the other hand, while its core engine isn't as strong,
does have a strong support layer. It provides a very convenient set of functions for
casual use, as well as the core interface for specialized needs. ORO is designed to
allow multiple regex core engines to be plugged in, so the combination of
java.util.regex with ORO sounds very appealing. I've talked to the ORO developer,
and it seems likely that this will happen, so the rest of this chapter looks at
Sun's java.util.regex and ORO's interface.

Mastering Regular Expressions (2nd Edition) [Electronic resources] نسخه متنی

فارسی

کردی

العربیه

اردو

Türkçe

Русский

English

Français

کانال فیلم من

تبیان من

فایلهای من

کتابخانه من

پنل پیامکی

وبلاگ من

اینجــــا یک کتابخانه دیجیتالی است

با بیش از 100000 منبع الکترونیکی رایگان به زبان فارسی ، عربی و انگلیسی