Mastering Regular Expressions (2nd Edition) [Electronic resources] نسخه متنی

اینجــــا یک کتابخانه دیجیتالی است

با بیش از 100000 منبع الکترونیکی رایگان به زبان فارسی ، عربی و انگلیسی

Mastering Regular Expressions (2nd Edition) [Electronic resources] - نسخه متنی

Jeffrey E. F. Friedl

| نمايش فراداده ، افزودن یک نقد و بررسی
افزودن به کتابخانه شخصی
ارسال به دوستان
جستجو در متن کتاب
بیشتر
تنظیمات قلم

فونت

اندازه قلم

+ - پیش فرض

حالت نمایش

روز نیمروز شب
جستجو در لغت نامه
بیشتر
لیست موضوعات
توضیحات
افزودن یادداشت جدید












6.3 Benchmarking



Because this chapter talks a lot about speed and efficiency, and I often mention
benchmarks I've done, I'd like to mention a few principles of benchmarking. I'll
also show simple ways to benchmark in a few languages.


Basic benchmarking is simply timing how long it takes to do some work. To do
the timing, get the system time, do the work, get the system time again, and report
the difference between the times as the time it took to do the work. As an example,
let's compare
^(a|b|c|d|e|f|g)+$
with
^[a-g]+$
. We'll first look at
benchmarking in Perl, but will see it in other languages in a bit. Here's a simple
(but as we'll see, somewhat lacking) Perl script:


     use Time::HiRes 'time'; # So time() gives a high-resolution value.
$StartTime = time();
"abababdedfg" =~ m/^(a|b|c|d|e|f|g)+$/;
$EndTime = time();
printf("Alternation takes %.3f seconds.\n", $EndTime - $StartTime);
$StartTime = time();
"abababdedfg" =~ m/^[a-g]+$/;
$EndTime = time();
printf("Character class takes %.3f seconds.\n", $EndTime - $StartTime);


It looks (and is) simple, but there are some important points to keep in mind
while constructing a test for benchmarking:



Time only "interesting" work Time as much of the "work" as possible, but as
little "non-work" as possible. If there is some initialization or other setup that
must done, do it before the starting time is taken. If there's cleanup, do it after
the ending time is taken.



Do "enough" work Often, the time it takes to do what you want to test is
very short, and a computer's clock doesn't have enough granularity to give
meaning to the timing.


When I run the simple Perl test on my system, I get


     Alternation takes 0.000 seconds.


Character class takes 0.000 seconds.


which really doesn't tell me anything other than both are faster than the shortest
time that can be measured. So, if something is fast, do it twice, or 10 times,
or even 10,000,000 timeswhatever is required to make "enough" work. What
is "enough" depends on the granularity of the system clock, but most systems
now have clocks accurate down to 1/100th of a second, and in such cases, timing
even half a second of work may be sufficient for meaningful results.



Do the "right" work Doing a very fast operation ten million times involves
the overhead of ten million updates of a counter variable in the block being
timed. If possible, it's best to increase the amount of real work being done in
a way that doesn't increase the overhead work. In our Perl example, the regular
expressions are being applied to fairly short strings: if applied to much
longer strings, they'd do more "real" work each time.




So, taking these into account, here's another version:


     use Time::HiRes 'time'; # So time() gives a high-resolution value.
$TimesToDo = 1000; # Simple setup
$TestString = "abababdedfg" x 1000; # Makes a huge string

$Count = $TimesToDo;
$StartTime = time();
while (--$Count > 0) {
$TestString =~ m/^(a|b|c|d|e|f|g)+$/;
}
$EndTime = time();
printf("Alternation takes %.3f seconds.\n", $EndTime - $StartTime);
$Count = $TimesToDo;
$StartTime = time();
while (--$Count > 0) {
$TestString =~ m/^[a-g]+$/;
}
$EndTime = time();
printf("Character class takes %.3f seconds.\n", $EndTime - $StartTime);


Notice how the $TestString and $Count are initialized before the timing starts?
($TestString is initialized with Perl's convenient x operator, which replicates the
string on its left as many times as the number on its right.) On my system, with
Perl 5.8, this prints:


     Alternation takes 7.276 seconds.


Character class takes 0.333 seconds.



So, with this test case, one is about 11x faster than the other. The benchmark
should be executed a few times, with the fastest times taken, to lessen the impact
of sporadic background system activity.



6.3.1 Know What You're Measuring



It might be interesting to see what happens when the initialization is changed to:


     my $TimesToDo = 1000000;
my $TestString = "abababdedfg";


Now, the test string is 1,000 times shorter, but the test is done 1,000x more times.
Conceptually, the amount of "work" should be the same, but the results are quite
different:


     Alternation takes 18.167 seconds.


Character class takes 5.231 seconds.



Both are slower than before, but the difference between the two is now a factor of
only 3.5x. Why the relative difference ? After the change, there's more "non-work"
overhead than before. That is, the update and testing of $Count, and the setup of
the regex engine, now happen 1,000x more than before.


These tests are about 5.3 and 4.1 seconds slower than the first tests. Most of the
extra time is probably the overhead of working with $Count. The fact that

^(a|b|c|d|e|f|g)+$
is hit relatively harder (5.3 seconds slower than the first
time, rather than 4.1 seconds slower) may reflect additional pre-match (or earlymatch)
setup by the regex engine before getting into the main part of the match.


In any case, the point of this change is to illustrate that the results are strongly
influenced by how much real work vs. non-work overtime is part of the timing.



6.3.2 Benchmarking with Java



Benchmarking Java can be a slippery science, for a number of reasons. Let's first
look at a somewhat naïve example, and then look at why it's naïve, and at what
can be done to make it less so. The listing below shows the benchmark
example with Java, using Sun's java.util.regex.


Notice how the regular expressions are compiled in the initialization part of the
program? We want to benchmark the matching speed, not the compile speed.


Speed is dependent upon which virtual machine (VM) is used. Sun standard JRE
[4]
comes with two virtual machines, a client VM optimized for fast startup, and a
server VM optimized for heavy-duty long-haul work.



[4] I had to use a shorter string for this test to run on my Linux system, as a longer string somehow tickles
a problem with the VM, causing the test to abort. Engineers at Sun tell me it's due to an unexpected
interaction between the aggressively optimizing C compiler used to build the VM (gcc), and
an overly conservative use of Linux's stack-monitoring hooks. It may be fixed as early as Java 1.4.1.
To compensate for the shortened string in the current test, I've increased the number of times the
loop executes the match, so these results should be comparable to the original.



Benchmarking with java.util.regex


     import java.util.regex.*;
public class JavaBenchmark {
public static void main(String [] args)
{
Matcher regex1 = Pattern.compile("^(a|b|c|d|e|f|g)+$").matcher(");
Matcher regex2 = Pattern.compile("^[a-g]+$").matcher(");
long timesToDo = 4000;
StringBuffer temp = new StringBuffer();
for (int i = 250; i > 0; i--)
temp.append("abababdedfg");
String testString = temp.toString();
// Time first one . . .
long count = timesToDo;
long startTime = System.currentTimeMillis();
while (--count > 0)
regex1.reset(testString).find();
double seconds = (System.currentTimeMillis() - startTime)/1000.0;
System.out.println("Alternation takes " + seconds + " seconds");
// Time second one . . .
count = timesToDo;
startTime = System.currentTimeMillis();
while (--count > 0)
regex2.reset(testString).find();
seconds = (System.currentTimeMillis() - startTime)/1000.0;
System.out.println("Character Class takes " + seconds + " seconds");
}
}


On my system, running the benchmark on the client VM produces:


     Alternation takes 19.318 seconds


Character Class takes 1.685 seconds



while the server VM yields:


     Alternation takes 12.106 seconds


Character Class takes 0.657 seconds



What makes benchmarking slippery, and this example somewhat naïve, is that the
timing can be highly dependent on how well the automatic pre-execution compiler
works, or how the run-time compiler interacts with the code being tested.
Some VM have a JIT (Just-In-Time compiler), which compiles code on the fly, just
before it's needed.


Sun's Java 1.4 has what I call a BLTN (Better-Late-Than-Never) compiler, which
kicks in during execution, compiling and optimizing heavily-used code on the fly.
The nature of a BLTN is that it doesn't "kick in" until it senses that some code is
"hot" (being used a lot). A VM that's been running for a while, such as in a server
environment, will be "warmed up," while our simple test ensures a "cd" server
(nothing yet optimized by the BLTN).


One way to see "warmed up" times is to run the benchmarked parts in a loop:


     // Time first one . . .
for (int i = 4; i > 0; i--)
{
long count = timesToDo;
long startTime = System.currentTimeMillis();
while (--count > 0)
regex1.reset(testString).find();
double seconds = (System.currentTimeMillis() - startTime)/1000.0;
System.out.println("Alternation takes " + seconds + " seconds");
}


If the extra loop runs enough times (say, for 10 seconds), the BLTN will have optimized
the hot code, leaving the last times reported as representative of a warmedup
system. Testing again with the server VM, these times are indeed a bit faster by
about 8% and 25%:


     Alternation takes 11.151 seconds


Character Class takes 0.483 seconds



Another issue that makes Java benchmarking slippery is the unpredictable nature
of thread scheduling and garbage collection. Again, running the test long enough
helps amortize their unpredictable influence.



6.3.3 Benchmarking with VB.NET



The benchmark example in VB.NET is shown in the listing below. On
my system, it produces:


     Alternation takes 13.311 seconds


Character Class takes 1.680 seconds



The .NET Framework allows a regex to be compiled to an even more efficient
form, by providing RegexOptions.Compiled as a second argument to each
Regex constructor (see Section 9.1.1.4). Doing that results in:


     Alternation takes 5.499 seconds


Character Class takes 1.157 seconds



Both tests are faster using the Compiled option, but alternation sees a greater relative
benefit (its almost 3x faster when Compiled, but the class version is only
about 1.5x faster), so it seems that alternation benefits from the more efficient
compilation relatively more than a character class does.


Benchmarking with VB.NET


     Option Explicit On
Option Strict On
Imports System.Text.RegularExpressions
Module Benchmark
Sub Main()
Dim Regex1 as Regex = New Regex("^(a|b|c|d|e|f|g)+$")
Dim Regex2 as Regex = New Regex("^[a-g]+$")
Dim TimesToDo as Integer = 1000
Dim TestString as String = "
Dim I as Integer
For I = 1 to 1000
TestString = TestString & "abababdedfg"
Next
Dim StartTime as Double = Timer()
For I = 1 to TimesToDo
Regex1.Match(TestString)
Next
Dim Seconds as Double = Math.Round(Timer() - StartTime, 3)
Console.WriteLine("Alternation takes " & Seconds & " seconds")
StartTime = Timer()
For I = 1 to TimesToDo
Regex2.Match(TestString)
Next
Seconds = Math.Round(Timer() - StartTime, 3)
Console.WriteLine("Character Class takes " & Seconds & " seconds")
End Sub
End Module


6.3.4 Benchmarking with Python



The benchmark example in Python is shown in the listing below.


For Python's regex engine, I had to cut the size of the string a bit because the
original causes an internal error ("maximum recursion limit exceeded") within the
regex engine. To compensate, I increased the number of times the test is done by
a proportional amount.


On my system, the benchmark produces:


     Alternation takes 10.357 seconds


Character Class takes 0.769 seconds


Benchmarking with Python


     import re
import time
import fpformat
Regex1 = re.compile("^(a|b|c|d|e|f|g)+$")
Regex2 = re.compile("^[a-g]+$")
TimesToDo = 1250;
TestString = "
for i in range(800):
TestString += "abababdedfg"
StartTime = time.time()
for i in range(TimesToDo):
Regex1.search(TestString)
Seconds = time.time() - StartTime
print "Alternation takes " + fpformat.fix(Seconds,3) + " seconds"
StartTime = time.time()
for i in range(TimesToDo):
Regex2.search(TestString)
Seconds = time.time() - StartTime
print "Character Class takes " + fpformat.fix(Seconds,3) + " seconds"


6.3.5 Benchmarking with Ruby



Here's the benchmark example in Ruby:


     TimesToDo=1000
testString="
for i in 1..1000
testString += "abababdedfg"
end
Regex1 = Regexp::new("^(a|b|c|d|e|f|g)+$");
Regex2 = Regexp::new("^[a-g]+$");
startTime = Time.new.to_f
for i in 1..TimesToDo
Regex1.match(testString)
end
print "Alternation takes .3f seconds\n" %
(Time.new.to_f - startTime);
startTime = Time.new.to_f
for i in 1..TimesToDo
Regex2.match(testString)
end
print "Character Class takes .3f seconds\n" %
(Time.new.to_f - startTime);


On my system, it produces:


     Alternation takes 16.311 seconds


Alternation takes 16.311 seconds



6.3.6 Benchmarking with Tcl



Here's the benchmark example in Tcl:


     set TimesToDo 1000
set TestString "
for {set i 1000} {$i > 0} {incr i -1} {
append TestString "abababdedfg"
}
set Count $TimesToDo
set StartTime [clock clicks -milliseconds]
for {} {$Count > 0} {incr Count -1} {
regexp {^(a|b|c|d|e|f|g)+$} $TestString
}
set EndTime [clock clicks -milliseconds]
set Seconds [expr ($EndTime - $StartTime)/1000.0]
puts [format "Alternation takes .3f seconds" $Seconds]
set Count $TimesToDo
set StartTime [clock clicks -milliseconds]
for {} {$Count > 0} {incr Count -1} {
regexp {^[a-g]+$} $TestString
}
set EndTime [clock clicks -milliseconds]
set Seconds [expr ($EndTime - $StartTime)/1000.0]
puts [format "Character class takes .3f seconds" $Seconds]


On my system, this benchmark produces:


     Alternation takes 0.362 seconds


Character class takes 0.352 seconds



Wow, they're both about the same speed! Well, recall from the table on
Section 4.1.3 that Tcl has a hybrid NFA/DFA engine, and these regular expressions are
exactly the same to a DFA engine. Most of what this chapter talks about simply
does not apply to Tcl. See the sidebar in Section 6.4.4.1.1 for more.



/ 83