9.6 Advanced .NET
The following sections cover a few features that haven't fit into the discussion so far:
building a regex library with regex assemblies, using an interesting .NET-only
regex feature for matching nested constructs, and a discussion of the Capture
object.
9.6.1 Regex Assemblies
.NET allows you to encapsulate Regex objects into an assembly, which is useful in
creating a regex library. The example in the sidebar below shows
how to build one.
When the sidebar example executes, it creates the file JfriedlsRegexLibrary.DLL in
the project's bin directory.
I can then use that assembly in another project, after first adding it as a reference
via Visual Studio .NET's Project > Add Reference dialog.
To make the classes in the assembly available, I first import them:
    Imports jfriedl
I can then use them just like any other class, as in this example:
    Dim FieldRegex as CSV.GetField = New CSV.GetField 'This makes a new Regex object
      .
      .
      .
    Dim FieldMatch as Match = FieldRegex.Match(Line) 'Apply the regex to a string . . .
    While FieldMatch.Success
        Dim Field as String
        If FieldMatch.Groups(1).Success Then
            Field = FieldMatch.Groups("QuotedField").Value
            Field = Regex.Replace(Field, """""", """") 'replace two double quotes with one
        Else
            Field = FieldMatch.Groups("UnquotedField").Value
        End If
        Console.WriteLine("[" & Field & "]")
        'Can now work with 'Field'....
        FieldMatch = FieldMatch.NextMatch
    End While
In this example, I chose to import only from the jfriedl namespace, but could
have just as easily imported from the jfriedl.CSV namespace, which then would
allow the Regex object to be created with:
    Dim FieldRegex as GetField = New GetField 'This makes a new Regex object
The difference is mostly a matter of style. You can also choose to not import anything,
but rather use them directly:
    Dim FieldRegex as jfriedl.CSV.GetField = New jfriedl.CSV.GetField
This is a bit more cumbersome, but documents clearly where exactly the object is
coming from. Again, it's a matter of style.
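To try the field-extraction loop above without first building the assembly, here is a minimal sketch that constructs the same CSV pattern inline with New Regex. The module name CsvFieldDemo, the helper SplitCsvLine, and the sample line are assumptions made for illustration; they are not part of the library built in the sidebar.

```vb
Imports System
Imports System.Collections.Generic
Imports System.Text.RegularExpressions

Module CsvFieldDemo
    ' The same pattern the sidebar compiles into jfriedl.CSV.GetField,
    ' built inline here (an assumption, so no assembly is needed).
    Private ReadOnly FieldRegex As New Regex( _
        "\G(?:^|,)                                          " & _
        "(?:                                                " & _
        "   ""  (?<QuotedField> (?> [^""]+ | """" )* )  ""  " & _
        " |                                                 " & _
        "   (?<UnquotedField> [^"",]* )                     " & _
        ")", RegexOptions.IgnorePatternWhitespace)

    ' Hypothetical helper: returns the fields of one CSV line, in order.
    Public Function SplitCsvLine(ByVal Line As String) As List(Of String)
        Dim Fields As New List(Of String)
        Dim FieldMatch As Match = FieldRegex.Match(Line)
        While FieldMatch.Success
            If FieldMatch.Groups("QuotedField").Success Then
                ' Collapse each doubled double quote into a single one.
                Fields.Add(FieldMatch.Groups("QuotedField").Value.Replace("""""", """"))
            Else
                Fields.Add(FieldMatch.Groups("UnquotedField").Value)
            End If
            FieldMatch = FieldMatch.NextMatch()
        End While
        Return Fields
    End Function

    Sub Main()
        ' Sample line: one,"two, too","say ""hi""",,last
        For Each Field As String In SplitCsvLine("one,""two, too"",""say """"hi"""""",,last")
            Console.WriteLine("[" & Field & "]")
        Next
    End Sub
End Module
```

For the sample line, the loop should report five fields: one, then two, too (with its embedded comma), then say "hi" (with single quotes restored), then an empty field, then last.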
Creating Your Own Regex Library With an Assembly
This example builds a small regex library. This complete program builds an
assembly (DLL) that holds three pre-built Regex constructors I've named
jfriedl.Mail.Subject, jfriedl.Mail.From, and jfriedl.CSV.GetField.
The first two are simple examples just to show how it's done, but the complexity
of the final one really shows the promise of building your own
library. Note that you don't have to give the RegexOptions.Compiled flag,
as that's implied by the process of building an assembly.
See the text (in Section 9.6.1) for how to use the assembly after it's built.
    Option Explicit On
    Option Strict On
    Imports System.Text.RegularExpressions
    Imports System.Reflection
    Module BuildMyLibrary
    Sub Main()
        'The calls to RegexCompilationInfo below provide the pattern, regex options,
        'class name, namespace, and a Boolean indicating whether the new class is
        'public. The first class, for example, will be available to programs that
        'use this assembly as "jfriedl.Mail.Subject", a Regex constructor.
        Dim RCInfo() as RegexCompilationInfo = { _
            New RegexCompilationInfo( _
                "^Subject:\s*(.*)", RegexOptions.IgnoreCase, _
                "Subject", "jfriedl.Mail", true), _
            New RegexCompilationInfo( _
                "^From:\s*(.*)", RegexOptions.IgnoreCase, _
                "From", "jfriedl.Mail", true), _
            New RegexCompilationInfo( _
                "\G(?:^|,)                                   " & _
                "(?:                                         " & _
                "   (?# Either a double-quoted field... )    " & _
                "   "" (?# field's opening quote )           " & _
                "    (?<QuotedField> (?> [^""]+ | """" )* )  " & _
                "   "" (?# field's closing quote )           " & _
                " (?# ...or... )                             " & _
                " |                                          " & _
                "   (?# ...some non-quote/non-comma text... )" & _
                "   (?<UnquotedField> [^"",]*)               " & _
                " )", _
                RegexOptions.IgnorePatternWhitespace, _
                "GetField", "jfriedl.CSV", true) _
        }
        'Now do the heavy lifting to build and write out the whole thing . . .
        Dim AN as AssemblyName = new AssemblyName()
        AN.Name = "JfriedlsRegexLibrary" 'This will be the DLL's filename
        AN.Version = New Version("1.0.0.0")
        Regex.CompileToAssembly(RCInfo, AN) 'Build everything
    End Sub
    End Module
9.6.2 Matching Nested Constructs
Microsoft has included an interesting innovation for matching balanced constructs
(historically, something not possible with a regular expression). It's not particularly
easy to understand: this section is short, but be warned, it is very dense.
It's easiest to understand with an example, so I'll start with one:
    Dim R As Regex = New Regex(" \(                    " & _
                               "   (?>                 " & _
                               "       [^()]+          " & _
                               "     |                 " & _
                               "       \( (?<DEPTH>)   " & _
                               "     |                 " & _
                               "       \) (?<-DEPTH>)  " & _
                               "   )*                  " & _
                               "   (?(DEPTH)(?!))      " & _
                               " \)                    ", _
                               RegexOptions.IgnorePatternWhitespace)
This matches the first properly-paired nested set of parentheses, such as the
'(yes (here) okay)' portion of 'before (nope (yes (here) okay) after'. The first
parenthesis isn't matched because it has no associated closing parenthesis.
Here's the super-short overview of how it works:
1. With each '(' matched, (?<DEPTH>) adds one to the regex's idea of how
   deep the parentheses are currently nested (at least, nested beyond the initial
   \( at the start of the regex).
2. With each ')' matched, (?<-DEPTH>) subtracts one from that depth.
3. (?(DEPTH)(?!)) ensures that the depth is zero before allowing the final literal
   \) to match.
This works because the engine's backtracking stack keeps track of successfully matched
groupings. (?<DEPTH>) is just a named-capture version of (), which is
always successful. Since it has been placed immediately after \(, its success
(which remains on the stack until removed) is used as a marker for counting
opening parentheses.
Thus, the number of successful 'DEPTH' groupings matched so far is maintained on
the backtracking stack. We want to subtract from that whenever a closing parenthesis
is found. That's accomplished by .NET's special (?<-DEPTH>) construct,
which removes the most recent "successful DEPTH" notation from the stack. If it
turns out that there aren't any, the (?<-DEPTH>) itself fails, thereby disallowing
the regex from over-matching an extra closing parenthesis.
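One way to watch this counting at work is to run only the counting part of the regex and then ask the DEPTH group how many captures it still holds. The following is a minimal sketch under stated assumptions: the module DepthStackDemo and the helpers OpenCount and Consumed are hypothetical names, and the pattern is anchored with ^ purely so the examples are easy to follow.

```vb
Imports System
Imports System.Text.RegularExpressions

Module DepthStackDemo
    ' Only the counting part of the nested-parentheses regex, anchored at the
    ' start of the string so the leftover DEPTH entries can be inspected.
    Private ReadOnly Counter As New Regex( _
        "^(?> [^()]+ | \( (?<DEPTH>) | \) (?<-DEPTH>) )*", _
        RegexOptions.IgnorePatternWhitespace)

    ' Hypothetical helper: how many DEPTH markers remain after the match.
    Public Function OpenCount(ByVal Text As String) As Integer
        Return Counter.Match(Text).Groups("DEPTH").Captures.Count
    End Function

    ' Hypothetical helper: the text consumed before the regex had to stop.
    Public Function Consumed(ByVal Text As String) As String
        Return Counter.Match(Text).Value
    End Function

    Sub Main()
        Console.WriteLine(OpenCount("a(b(c)d"))  ' 1: one "(" was never closed
        Console.WriteLine(OpenCount("a(b)c"))    ' 0: every "(" was balanced
        Console.WriteLine(Consumed("a(b))c"))    ' a(b): stops before the extra ")"
    End Sub
End Module
```

The last call illustrates the failure case described above: with nothing left to remove, (?<-DEPTH>) fails, so the extra closing parenthesis is not consumed.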
Finally, (?(DEPTH)(?!)) is a normal conditional that applies (?!) if the 'DEPTH'
grouping is currently successful. If it's still successful by the time we get here,
there was an unpaired opening parenthesis whose success had never been subtracted
by a balancing (?<-DEPTH>). If that's the case, we want to exit the
match (we don't want to match an unbalanced sequence), so we apply (?!),
which is normal negative lookahead of an empty subexpression, and guaranteed
to fail.
Phew! That's how to match nested constructs with .NET regular expressions.
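To see the whole regex in action, the sketch below wraps it in a small helper and applies it to the sample string from this section plus two other inputs. The module NestedParensDemo and the helper FirstBalanced are hypothetical names chosen for illustration, as are the extra test strings.

```vb
Imports System
Imports System.Text.RegularExpressions

Module NestedParensDemo
    ' The nested-parentheses regex from this section, unchanged.
    Private ReadOnly Balanced As New Regex( _
        " \(                     " & _
        "   (?>                  " & _
        "       [^()]+           " & _
        "     |                  " & _
        "       \( (?<DEPTH>)    " & _
        "     |                  " & _
        "       \) (?<-DEPTH>)   " & _
        "   )*                   " & _
        "   (?(DEPTH)(?!))       " & _
        " \)                     ", _
        RegexOptions.IgnorePatternWhitespace)

    ' Hypothetical helper: the first balanced (...) run, or "" if there is none.
    Public Function FirstBalanced(ByVal Text As String) As String
        Dim M As Match = Balanced.Match(Text)
        If M.Success Then Return M.Value
        Return ""
    End Function

    Sub Main()
        ' The unpaired first "(" is skipped; prints (yes (here) okay)
        Console.WriteLine(FirstBalanced("before (nope (yes (here) okay) after"))
        ' Prints (a(b)c)
        Console.WriteLine(FirstBalanced("f(a(b)c)"))
        ' No closing parenthesis at all, so nothing matches; prints 0
        Console.WriteLine(FirstBalanced("no ( close").Length)
    End Sub
End Module
```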
9.6.3 Capture Objects
There's an additional component to .NET's object model, the Capture object,
which I haven't discussed yet. Depending on your point of view, it either adds an
interesting new dimension to the match results, or adds confusion and bloat.
A Capture object is almost identical to a Group object in that it represents the text
matched within a set of capturing parentheses. Like the Group object, it has properties
for Value (the text matched), Length (the length of the text matched), and
Index (the zero-based offset into the target string at which the match
was found).
The main difference between a Group object and a Capture object is that each
Group object contains a collection of Captures representing all the intermediary
matches by the group during the match, as well as the final text matched by the
group.
Here's an example with ^(..)+ applied to 'abcdefghijk':
    Dim M as Match = Regex.Match("abcdefghijk", "^(..)+")
The regex matches five sets of (..), which is most of the string:
the 'abcdefghij' portion of 'abcdefghijk'.
Since the plus is outside of the parentheses, they recapture with each iteration of
the plus, and are left with only 'ij' (that is, M.Groups(1).Value is 'ij'). However,
M.Groups(1) also contains a collection of Captures representing the
complete 'ab', 'cd', 'ef', 'gh', and 'ij' that (..) walked through during the match:
    M.Groups(1).Captures(0).Value is 'ab'
    M.Groups(1).Captures(1).Value is 'cd'
    M.Groups(1).Captures(2).Value is 'ef'
    M.Groups(1).Captures(3).Value is 'gh'
    M.Groups(1).Captures(4).Value is 'ij'
    M.Groups(1).Captures.Count is 5.
You'll notice that the last capture has the same 'ij' value as the group's overall
value, M.Groups(1).Value. It turns out that the Value of a Group is really just a
shorthand notation for the group's final capture. M.Groups(1).Value is really:
    M.Groups(1).Captures( M.Groups(1).Captures.Count - 1 ).Value
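The walk through the captures can be reproduced with a short loop. This is a minimal sketch; the module CaptureListDemo and the helper PairCaptures are hypothetical names, and the "value@index" output format is just a convenience for display.

```vb
Imports System
Imports System.Collections.Generic
Imports System.Text.RegularExpressions

Module CaptureListDemo
    ' Hypothetical helper: every intermediate capture of group 1 for ^(..)+,
    ' formatted as value@index.
    Public Function PairCaptures(ByVal Text As String) As List(Of String)
        Dim Result As New List(Of String)
        Dim M As Match = Regex.Match(Text, "^(..)+")
        For Each C As Capture In M.Groups(1).Captures
            Result.Add(C.Value & "@" & C.Index.ToString())
        Next
        Return Result
    End Function

    Sub Main()
        ' Prints: ab@0 cd@2 ef@4 gh@6 ij@8
        Console.WriteLine(String.Join(" ", PairCaptures("abcdefghijk").ToArray()))

        ' The group's Value is the same text as its final capture; prints True
        Dim G As Group = Regex.Match("abcdefghijk", "^(..)+").Groups(1)
        Console.WriteLine(G.Value = G.Captures(G.Captures.Count - 1).Value)
    End Sub
End Module
```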
Here are some additional points about captures:
• M.Groups(1).Captures is a CaptureCollection, which, like any collection,
  has Item and Count properties. However, it's common to forgo the Item
  property and index directly through the collection to its individual items, as
  with M.Groups(1).Captures(3) (M.Groups[1].Captures[3] in C#).
• A Capture object does not have a Success property; check the Group's
  Success instead.
• So far, we've seen that Capture objects are available from a Group object.
  A Match object also has a Captures property. M.Captures gives direct access
  to the Captures property of the zeroth group (that is, M.Captures is the same
  as M.Groups(0).Captures). Since the zeroth group represents the entire match,
  there are no iterations of it "walking through" a match, so the zeroth
  CaptureCollection always has only one Capture. Since they contain exactly the
  same information as the zeroth Group, both M.Captures and
  M.Groups(0).Captures are not particularly useful.
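The last point is easy to confirm with the same ^(..)+ example. This is a minimal sketch; the module MatchCapturesDemo and the helper WholeMatchCaptureCount are hypothetical names introduced for illustration.

```vb
Imports System
Imports System.Text.RegularExpressions

Module MatchCapturesDemo
    ' Hypothetical helper: number of captures recorded for the match as a whole.
    Public Function WholeMatchCaptureCount(ByVal Text As String, ByVal Pattern As String) As Integer
        Return Regex.Match(Text, Pattern).Captures.Count
    End Function

    Sub Main()
        Dim M As Match = Regex.Match("abcdefghijk", "^(..)+")
        Console.WriteLine(M.Captures.Count)                         ' 1
        Console.WriteLine(M.Captures(0).Value)                      ' abcdefghij
        Console.WriteLine(M.Groups(0).Captures(0).Value = M.Value)  ' True
    End Sub
End Module
```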
.NET's Capture object is an interesting innovation that appears somewhat more
complex and confusing than it really is by the way it's been "overly integrated"
into the object model. After getting past the .NET documentation and actually
understanding what these objects add, I've got mixed feelings about them. On one
hand, it's an interesting innovation that I'd like to get to know. Uses for it don't
immediately jump to mind, but that's likely because I've not had the same years of
experience with it as I have with traditional regex features.
On the other hand, the construction of all these extra capture groups during a
match, and then their encapsulation into objects after the match, seems an
efficiency burden that I wouldn't want to pay unless I'd requested the extra information.
The extra Capture groups won't be used in the vast majority of matches, but
as it is, all Group and Capture objects (and their associated GroupCollection
and CaptureCollection objects) are built when the Match object is built. So,
you've got them whether you need them or not; if you can find a use for the
Capture objects, by all means, use them.