The following sections cover a few features that haven't fit into the discussion so far: building a regex library with regex assemblies, using an interesting .NET-only regex feature for matching nested constructs, and a discussion of the Capture object.
.NET allows you to encapsulate Regex objects into an assembly, which is useful in creating a regex library. The example in the sidebar below shows how to build one.
When the sidebar example executes, it creates the file JfriedlsRegexLibrary.DLL in the project's bin directory.
I can then use that assembly in another project, after first adding it as a reference via Visual Studio .NET's Project > Add Reference dialog.
To make the classes in the assembly available, I first import them:
Imports jfriedl
I can then use them just like any other class, as in this example:
    Dim FieldRegex as CSV.GetField = New CSV.GetField 'This makes a new Regex object . . .
    Dim FieldMatch as Match = FieldRegex.Match(Line)  'Apply the regex to a string . . .
    While FieldMatch.Success
        Dim Field as String
        If FieldMatch.Groups(1).Success
            Field = FieldMatch.Groups("QuotedField").Value
            Field = Regex.Replace(Field, """""", """") 'replace two double quotes with one
        Else
            Field = FieldMatch.Groups("UnquotedField").Value
        End If
        Console.WriteLine("[" & Field & "]") 'Can now work with 'Field'....
        FieldMatch = FieldMatch.NextMatch
    End While
In this example, I chose to import only from the jfriedl namespace, but could have just as easily imported from the jfriedl.CSV namespace, which then would allow the Regex object to be created with:
    Dim FieldRegex as GetField = New GetField 'This makes a new Regex object
The difference is mostly a matter of style. You can also choose to not import anything, but rather use them directly:
    Dim FieldRegex as jfriedl.CSV.GetField = New jfriedl.CSV.GetField
This is a bit more cumbersome, but documents clearly where exactly the object is coming from. Again, it's a matter of style.
This example builds a small regex library. This complete program builds an assembly (DLL) that holds three pre-built Regex constructors I've named jfriedl.Mail.Subject, jfriedl.Mail.From, and jfriedl.CSV.GetField.
The first two are simple examples just to show how it's done, but the complexity of the final one really shows the promise of building your own library. Note that you don't have to give the RegexOptions.Compiled flag, as that's implied by the process of building an assembly.
See the text (in Section 9.6.1) for how to use the assembly after it's built.
    Option Explicit On
    Option Strict On

    Imports System.Text.RegularExpressions
    Imports System.Reflection

    Module BuildMyLibrary
    Sub Main()
        'The calls to RegexCompilationInfo below provide the pattern, regex options, name within the class,
        'class name, and a Boolean indicating whether the new class is public. The first class, for example,
        'will be available to programs that use this assembly as "jfriedl.Mail.Subject", a Regex constructor.
        Dim RCInfo() as RegexCompilationInfo = { _
            New RegexCompilationInfo( _
                "^Subject:\s*(.*)", RegexOptions.IgnoreCase, _
                "Subject", "jfriedl.Mail", true), _
            New RegexCompilationInfo( _
                "^From:\s*(.*)", RegexOptions.IgnoreCase, _
                "From", "jfriedl.Mail", true), _
            New RegexCompilationInfo( _
                "\G(?:^|,)                                    " & _
                "(?:                                          " & _
                "   (?# Either a double-quoted field... )     " & _
                "   "" (?# field's opening quote )            " & _
                "    (?<QuotedField> (?> [^""]+ | """" )* )   " & _
                "   "" (?# field's closing quote )            " & _
                "  (?# ...or... )                             " & _
                " |                                           " & _
                "  (?# ...some non-quote/non-comma text... )  " & _
                "   (?<UnquotedField> [^"",]*)                " & _
                " )", _
                RegexOptions.IgnorePatternWhitespace, _
                "GetField", "jfriedl.CSV", true) _
        }
        'Now do the heavy lifting to build and write out the whole thing . . .
        Dim AN as AssemblyName = new AssemblyName()
        AN.Name = "JfriedlsRegexLibrary" 'This will be the DLL's filename
        AN.Version = New Version("1.0.0.0")
        Regex.CompileToAssembly(RCInfo, AN) 'Build everything
    End Sub
    End Module
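To experiment with the GetField pattern before building any assembly, the same regex can be handed to an ordinary Regex constructor. What follows is only a minimal sketch under that assumption: the module name CsvFieldDemo and the helper SplitCsvLine are hypothetical names made up for illustration, and String.Replace stands in for the Regex.Replace call used in the main text.

```vbnet
Option Explicit On
Option Strict On

Imports System
Imports System.Collections.Generic
Imports System.Text.RegularExpressions

Module CsvFieldDemo
    ' Same pattern as jfriedl.CSV.GetField, but built directly rather than
    ' loaded from the compiled assembly (an assumption of this sketch).
    Private ReadOnly FieldRegex As New Regex( _
        "\G(?:^|,)                                       " & _
        "(?:                                             " & _
        "   "" (?<QuotedField> (?> [^""]+ | """" )* ) "" " & _
        " |                                              " & _
        "   (?<UnquotedField> [^"",]* )                  " & _
        ")", RegexOptions.IgnorePatternWhitespace)

    ' Hypothetical helper: returns the fields of one CSV line, in order.
    Function SplitCsvLine(ByVal Line As String) As List(Of String)
        Dim Fields As New List(Of String)
        Dim FieldMatch As Match = FieldRegex.Match(Line)
        While FieldMatch.Success
            If FieldMatch.Groups("QuotedField").Success Then
                ' Collapse each doubled double quote into a single one.
                Fields.Add(FieldMatch.Groups("QuotedField").Value.Replace("""""", """"))
            Else
                Fields.Add(FieldMatch.Groups("UnquotedField").Value)
            End If
            FieldMatch = FieldMatch.NextMatch()
        End While
        Return Fields
    End Function

    Sub Main()
        ' The input line is: Ten,"say ""hi""",,last
        For Each Field As String In SplitCsvLine("Ten,""say """"hi"""""",,last")
            Console.WriteLine("[" & Field & "]")
        Next
    End Sub
End Module
```

Run as a console program, this should print [Ten], [say "hi"], [], and [last] on separate lines. Because the pattern is anchored with \G, each NextMatch picks up exactly where the previous field ended.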
Microsoft has included an interesting innovation for matching balanced constructs (historically, something not possible with a regular expression). It's not particularly easy to understand; this section is short, but be warned, it is very dense.
It's easiest to understand with an example, so I'll start with one:
    Dim R As Regex = New Regex(" \(                   " & _
                               "   (?>                " & _
                               "       [^()]+         " & _
                               "     |                " & _
                               "       \( (?<DEPTH>)  " & _
                               "     |                " & _
                               "       \) (?<-DEPTH>) " & _
                               "   )*                 " & _
                               "   (?(DEPTH)(?!))     " & _
                               " \)                   ", _
                               RegexOptions.IgnorePatternWhitespace)
This matches the first properly-paired nested set of parentheses, such as the '(yes (here) okay)' portion of 'before (nope (yes (here) okay) after'. The first parenthesis isn't matched because it has no associated closing parenthesis.
Here's the super-short overview of how it works:
With each '(' matched, (?<DEPTH>) adds one to the regex's idea of how deep the parentheses are currently nested (at least, nested beyond the initial \( at the start of the regex).
With each ')' matched, (?<-DEPTH>) subtracts one from that depth.
(?(DEPTH)(?!)) ensures that the depth is zero before allowing the final literal \) to match.
This works because the engine's backtracking stack keeps track of successfully matched groupings. (?<DEPTH>) is just a named-capture version of (), which is always successful. Since it has been placed immediately after \(, its success (which remains on the stack until removed) is used as a marker for counting opening parentheses.
Thus, the number of successful 'DEPTH' groupings matched so far is maintained on the backtracking stack. We want to subtract from that whenever a closing parenthesis is found. That's accomplished by .NET's special (?<-DEPTH>) construct, which removes the most recent "successful DEPTH" notation from the stack. If it turns out that there aren't any, the (?<-DEPTH>) itself fails, thereby disallowing the regex from over-matching an extra closing parenthesis.
Finally, (?(DEPTH)(?!)) is a normal conditional that applies (?!) if the 'DEPTH' grouping is currently successful. If it's still successful by the time we get here, there was an unpaired opening parenthesis whose success had never been subtracted by a balancing (?<-DEPTH>). If that's the case, we want to exit the match (we don't want to match an unbalanced sequence), so we apply (?!), which is a normal negative lookahead of an empty subexpression, and is guaranteed to fail.
Phew! That's how to match nested constructs with .NET regular expressions.
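To see the counting behave as described, here is a small sketch that wraps the regex from this section in a function and tries it on a few strings. The module name BalancedParensDemo and the helper FirstBalanced are hypothetical, introduced only for this illustration.

```vbnet
Option Explicit On
Option Strict On

Imports System
Imports System.Text.RegularExpressions

Module BalancedParensDemo
    ' The nested-parentheses regex from the text, unchanged.
    Private ReadOnly Balanced As New Regex( _
        " \(                    " & _
        "   (?>                 " & _
        "       [^()]+          " & _
        "     |                 " & _
        "       \( (?<DEPTH>)   " & _
        "     |                 " & _
        "       \) (?<-DEPTH>)  " & _
        "   )*                  " & _
        "   (?(DEPTH)(?!))      " & _
        " \)                    ", _
        RegexOptions.IgnorePatternWhitespace)

    ' Hypothetical helper: the first balanced parenthesized run in Text,
    ' or Nothing when there is none.
    Function FirstBalanced(ByVal Text As String) As String
        Dim M As Match = Balanced.Match(Text)
        If M.Success Then Return M.Value
        Return Nothing
    End Function

    Sub Main()
        ' The unpaired "(nope" is skipped; the match starts at "(yes".
        Console.WriteLine(FirstBalanced("before (nope (yes (here) okay) after"))
        ' An opening parenthesis that is never closed yields no match at all.
        Console.WriteLine(FirstBalanced("no closing (paren here") Is Nothing)
    End Sub
End Module
```

The first call should print (yes (here) okay) and the second should print True. Removing the (?(DEPTH)(?!)) line from the pattern is an easy way to watch the regex start accepting unbalanced text such as '((a)'.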
There's an additional component to .NET's object model, the Capture object, which I haven't discussed yet. Depending on your point of view, it either adds an interesting new dimension to the match results, or adds confusion and bloat.
A Capture object is almost identical to a Group object in that it represents the text matched within a set of capturing parentheses. Like the Group object, it has properties for Value (the text matched), Length (the length of the text matched), and Index (the zero-based offset into the target string at which the match was found).
The main difference between a Group object and a Capture object is that each Group object contains a collection of Captures representing all the intermediary matches by the group during the match, as well as the final text matched by the group.
Here's an example with ^(..)+ applied to 'abcdefghijk':
Dim M as Match = Regex.Match("abcdefghijk", "^(..)+")
The regex matches five sets of (..), which is most of the string: the 'abcdefghij' of 'abcdefghijk'. Since the plus is outside of the parentheses, they recapture with each iteration of the plus, and are left with only 'ij' (that is, M.Groups(1).Value is 'ij'). However, that M.Groups(1) also contains a collection of Captures representing the complete 'ab', 'cd', 'ef', 'gh', and 'ij' that (..) walked through during the match:
    M.Groups(1).Captures(0).Value is 'ab'
    M.Groups(1).Captures(1).Value is 'cd'
    M.Groups(1).Captures(2).Value is 'ef'
    M.Groups(1).Captures(3).Value is 'gh'
    M.Groups(1).Captures(4).Value is 'ij'
    M.Groups(1).Captures.Count is 5.
You'll notice that the last capture has the same 'ij' value as the group's overall value, M.Groups(1).Value. It turns out that the Value of a Group is really just a shorthand notation for the group's final capture. M.Groups(1).Value is really:
M.Groups(1).Captures( M.Groups(1).Captures.Count - 1 ).Value
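The relationship between a group's Value and its captures can be checked with a short program. This is just an illustrative sketch; the module name CaptureWalkDemo and the helper CapturesOfGroup1 are hypothetical names, not part of .NET.

```vbnet
Option Explicit On
Option Strict On

Imports System
Imports System.Text.RegularExpressions

Module CaptureWalkDemo
    ' Hypothetical helper: joins every capture of group 1 with "|",
    ' in the order the group captured them.
    Function CapturesOfGroup1(ByVal Text As String, ByVal Pattern As String) As String
        Dim M As Match = Regex.Match(Text, Pattern)
        Dim Parts(M.Groups(1).Captures.Count - 1) As String
        For I As Integer = 0 To Parts.Length - 1
            Parts(I) = M.Groups(1).Captures(I).Value
        Next
        Return String.Join("|", Parts)
    End Function

    Sub Main()
        Dim M As Match = Regex.Match("abcdefghijk", "^(..)+")
        Dim G As Group = M.Groups(1)
        ' The group's Value and its final capture are the same text.
        Console.WriteLine(G.Value)
        Console.WriteLine(G.Captures(G.Captures.Count - 1).Value)
        ' Every intermediate capture, first to last.
        Console.WriteLine(CapturesOfGroup1("abcdefghijk", "^(..)+"))
    End Sub
End Module
```

This should print ij twice, followed by ab|cd|ef|gh|ij.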
Here are some additional points about captures:
M.Groups(1).Captures is a CaptureCollection, which, like any collection, has Item and Count properties. However, it's common to forgo the Item property and index directly through the collection to its individual items, as with M.Groups(1).Captures(3) (M.Groups[1].Captures[3] in C#).
A Capture object does not have a Success property; check the Group's Success instead.
So far, we've seen that Capture objects are available from a Group object. Although it's not particularly useful, a Match object also has a Captures property. M.Captures gives direct access to the Captures property of the zeroth group (that is, M.Captures is the same as M.Groups(0).Captures). Since the zeroth group represents the entire match, there are no iterations of it "walking through" a match, so the zeroth capture collection always has only one Capture. Since they contain exactly the same information as the zeroth Group, both M.Captures and M.Groups(0).Captures are not particularly useful.
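The last point is easy to confirm. In this brief sketch, the module name MatchCapturesDemo and the helper WholeMatchCaptureCount are hypothetical names used only for illustration.

```vbnet
Option Explicit On
Option Strict On

Imports System
Imports System.Text.RegularExpressions

Module MatchCapturesDemo
    ' Hypothetical helper: how many captures the match as a whole records.
    Function WholeMatchCaptureCount(ByVal Text As String, ByVal Pattern As String) As Integer
        Return Regex.Match(Text, Pattern).Captures.Count
    End Function

    Sub Main()
        Dim M As Match = Regex.Match("abcdefghijk", "^(..)+")
        ' Group 1 iterated five times, but the overall match has one capture.
        Console.WriteLine(M.Groups(1).Captures.Count)
        Console.WriteLine(WholeMatchCaptureCount("abcdefghijk", "^(..)+"))
        ' That single capture carries the same text as the match itself.
        Console.WriteLine(M.Captures(0).Value = M.Value)
        Console.WriteLine(M.Groups(0).Captures(0).Value = M.Value)
    End Sub
End Module
```

This should print 5, then 1, then True twice.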
.NET's Capture object is an interesting innovation that appears somewhat more complex and confusing than it really is by the way it's been "overly integrated" into the object model. After getting past the .NET documentation and actually understanding what these objects add, I've got mixed feelings about them. On one hand, it's an interesting innovation that I'd like to get to know. Uses for it don't immediately jump to mind, but that's likely because I've not had the same years of experience with it as I have with traditional regex features.
On the other hand, the construction of all these extra capture groups during a match, and then their encapsulation into objects after the match, seems an efficiency burden that I wouldn't want to pay unless I'd requested the extra information. The extra Capture groups won't be used in the vast majority of matches, but as it is, all Group and Capture objects (and their associated GroupCollection and CaptureCollection objects) are built when the Match object is built. So, you've got them whether you need them or not; if you can find a use for the Capture objects, by all means, use them.