Mastering Regular Expressions (2nd Edition) [Electronic resources] (text version)


Jeffrey E. F. Friedl

9.2 Using .NET Regular Expressions



.NET regular expressions are powerful, clean, and provided through a complete
and easy-to-use class interface. But as wonderful a job as Microsoft did building
the package, the documentation is just the opposite: it's horrifically bad. It's woefully
incomplete, poorly written, disorganized, and sometimes even wrong. It took
me quite a while to figure the package out, so it's my hope that the presentation
in this chapter makes the use of .NET regular expressions clear for you.



9.2.1 Regex Quickstart



You can get quite a bit of use out of the .NET regex package without even knowing
the details of its regex class model. Knowing the details lets you get more
information more efficiently, but the following are examples of how to do simple
operations without explicitly creating any classes. These are just examples; all the
details follow shortly.


Any program that uses the regex library must have the line


     Imports System.Text.RegularExpressions


at the beginning of the file (see Section 9.2.2), so these examples assume that's there.


The following examples all work with the text in the String variable
TestStr. As with all examples in this chapter, names I've chosen are in italic.



9.2.1.1 Quickstart: Checking a string for match



This example simply checks to see whether a regex matches a string:


     If Regex.IsMatch(TestStr, "^\s*$")
         Console.WriteLine("line is empty")
     Else
         Console.WriteLine("line is not empty")
     End If


This example uses a match option:


     If Regex.IsMatch(TestStr, "^subject:", RegexOptions.IgnoreCase)
         Console.WriteLine("line is a subject line")
     Else
         Console.WriteLine("line is not a subject line")
     End If


9.2.1.2 Quickstart: Matching and getting the text matched



This example identifies the text actually matched by the regex. If there's no match,
TheNum is set to an empty string.


     Dim TheNum as String = Regex.Match(TestStr, "\d+").Value
     If TheNum <> ""
         Console.WriteLine("Number is: " & TheNum)
     End If


This example uses a match option:



     Dim ImgTag as String = Regex.Match(TestStr, "<img\b[^>]*>", _
                                        RegexOptions.IgnoreCase).Value
     If ImgTag <> ""
         Console.WriteLine("Image tag: " & ImgTag)
     End If


9.2.1.3 Quickstart: Matching and getting captured text



This example gets the first captured group (e.g., $1) as a string:


     Dim Subject as String = _
         Regex.Match(TestStr, "^Subject: (.*)").Groups(1).Value
     If Subject <> ""
         Console.WriteLine("Subject is: " & Subject)
     End If


Note that C# uses Groups[1] instead of Groups(1).


Here's the same thing, using a match option:


     Dim Subject as String = _
         Regex.Match(TestStr, "^subject: (.*)", _
                     RegexOptions.IgnoreCase).Groups(1).Value
     If Subject <> ""
         Console.WriteLine("Subject is: " & Subject)
     End If



This example is the same as the previous, but using named capture:


     Dim Subject as String = _
         Regex.Match(TestStr, "^subject: (?<Subj>.*)", _
                     RegexOptions.IgnoreCase).Groups("Subj").Value
     If Subject <> ""
         Console.WriteLine("Subject is: " & Subject)
     End If



9.2.1.4 Quickstart: Search and replace



This example makes our test string "safe" to include within HTML, converting characters
special to HTML into HTML entities:



     TestStr = Regex.Replace(TestStr, "&", "&amp;")
     TestStr = Regex.Replace(TestStr, "<", "&lt;")
     TestStr = Regex.Replace(TestStr, ">", "&gt;")
     Console.WriteLine("Now safe in HTML: " & TestStr)


The replacement string (the third argument) is interpreted specially, as described
in the sidebar in Section 9.3.2. For example, within the replacement string, '$&' is
replaced by the text actually matched by the regex. Here's an example that wraps
<B>···</B> around capitalized words:



     TestStr = Regex.Replace(TestStr, "\b[A-Z]\w*", "<B>$&</B>")
     Console.WriteLine("Modified string: " & TestStr)


This example replaces <B>···</B> (in a case-insensitive manner) with <I>···</I>:



     TestStr = Regex.Replace(TestStr, "<b>(.*?)</b>", "<I>$1</I>", _
                             RegexOptions.IgnoreCase)
     Console.WriteLine("Modified string: " & TestStr)



9.2.2 Package Overview



You can get the most out of .NET regular expressions by working with its rich and
convenient class structure. To give us an overview, here's a complete console
application that shows a simple match using explicit objects:


     Option Explicit On ' These are not specifically required to use regexes,
     Option Strict On   ' but their use is good general practice.

     ' Make regex-related classes easily available.
     Imports System.Text.RegularExpressions

     Module SimpleTest
     Sub Main()
         Dim SampleText as String = "this is the 1st test string"
         Dim R as Regex = New Regex("\d+\w+") 'Compile the pattern.
         Dim M as Match = R.Match(SampleText) 'Check against a string.
         If not M.Success
             Console.WriteLine("no match")
         Else
             Dim MatchedText as String = M.Value 'Query the results . . .
             Dim MatchedFrom as Integer = M.Index
             Dim MatchedLen as Integer = M.Length
             Console.WriteLine("matched [" & MatchedText & "]" & _
                               " from char#" & MatchedFrom.ToString() & _
                               " for " & MatchedLen.ToString() & " chars.")
         End If
     End Sub
     End Module



When executed from a command prompt, it applies \d+\w+ to the sample text
and displays:


     matched [1st] from char#12 for 3 chars.


9.2.2.1 Importing the regex namespace



Notice the Imports System.Text.RegularExpressions line near the top of the
program? That's required in any VB program that wishes to access the .NET regex
objects, to make them available to the compiler.


The analogous statement in C# is:


     using System.Text.RegularExpressions; // This is for C#


The example shows the use of the underlying raw regex objects. The two main
action lines:


     Dim R as Regex = New Regex("\d+\w+") 'Compile the pattern.
     Dim M as Match = R.Match(SampleText) 'Check against a string.


can also be combined, as:


     Dim M as Match = Regex.Match(SampleText, "\d+\w+") 'Check pattern against string.


The combined version is easier to work with, as there's less for the programmer to
type, and fewer objects to keep track of. It does, however, come with a slight
efficiency penalty (see Section 9.4.1). Over the coming sections, we'll first look at the raw objects,
and then at the "convenience" functions like the Regex.Match static function, and
when it makes sense to use them.
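

As a rough sketch of the raw-object style (the sample strings and the module name
ReuseSketch are mine, made up purely for illustration), here is one Regex object
being built once and then applied to several strings:


```vb
Option Explicit On
Option Strict On
Imports System.Text.RegularExpressions

Module ReuseSketch
    Sub Main()
        ' Hypothetical sample data, chosen only for this illustration.
        Dim Samples() As String = {"this is the 1st test string", _
                                   "no digits here", _
                                   "and the 2nd one"}
        Dim R As Regex = New Regex("\d+\w+") ' Built once, outside the loop.
        For Each Sample As String In Samples
            Dim M As Match = R.Match(Sample) ' Same object applied each time.
            If M.Success Then
                Console.WriteLine("matched [" & M.Value & "]")
            Else
                Console.WriteLine("no match")
            End If
        Next
    End Sub
End Module
```


Calling Regex.Match(Sample, "\d+\w+") inside the loop should give the same results;
whether the difference in cost matters for your program depends on the details
covered in Section 9.4.1.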


For brevity's sake, I'll generally not repeat the following lines in examples that are
not complete programs:


     Option Explicit On
Option Strict On
Imports System.Text.RegularExpressions


It may also be helpful to look back at some of the VB examples earlier in the book,
in Sections 3.2.2.2, 3.2.4, 5.3.4, 5.4.2.2, and 6.3.3.



9.2.3 Core Object Overview



Before getting into the details, let's first take a step back and look at the .NET regex
object model. An object model is the set of class structures through which regex
functionality is provided. .NET regex functionality is provided through seven
highly interwoven classes, but in practice, you'll generally need to understand only
the three shown visually in Figure 9-1, which depicts the
repeated application of \s+(\d+) to the string 'May•16,•1998'.



9.2.3.1 Regex objects



The first step is to create a Regex object, as with:


     Dim R as Regex = New Regex("\s+(\d+)")


Here, we've made a regex object representing \s+(\d+) and stored it in the R
variable. Once you've got a Regex object, you can apply it to text with its
Match(text) method, which returns information on the first match found:


     Dim M as Match = R.Match("May 16, 1998")


Figure 9-1. .NET's Regex-related object model





9.2.3.2 Match objects



A Regex object's Match(···) method provides information about a match result by
creating and returning a Match object. A Match object has a number of properties,
including Success (a Boolean value indicating whether the match was successful)
and Value (a copy of the text actually matched, if the match was successful). We'll
look at the full list of Match properties later.


Among the details you can get about a match from a Match object is information
about the text matched within capturing parentheses. The Perl examples in earlier
chapters used Perl's $1 variable to get the text matched within the first set of capturing
parentheses. .NET offers two methods to retrieve this data: to get the raw
text, you can index into a Match object's Groups property, such as with
Groups(1).Value to get the equivalent of Perl's $1. (Note: C# requires a different
syntax, Groups[1].Value, instead.) Another approach is to use the Result
method, which is discussed starting in Section 9.3.3.
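

To make the "repeated application" idea concrete, here is a minimal sketch that
walks through every match of \s+(\d+) in the sample string. It relies on the Match
object's NextMatch method, which is part of the .NET regex library but has not been
introduced in this section yet; the module name MatchLoopSketch is just my own
placeholder.


```vb
Option Explicit On
Option Strict On
Imports System.Text.RegularExpressions

Module MatchLoopSketch
    Sub Main()
        Dim R As Regex = New Regex("\s+(\d+)")
        Dim M As Match = R.Match("May 16, 1998") ' First match, if any.
        While M.Success
            ' Value is the whole match; Groups(1).Value is the equivalent of $1.
            Console.WriteLine("matched [" & M.Value & "], $1 is [" & _
                              M.Groups(1).Value & "]")
            M = M.NextMatch() ' Continue from where this match ended.
        End While
    End Sub
End Module
```


With this input, the loop should report two matches, with $1 being 16 and
then 1998.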



9.2.3.3 Group objects



The Groups(1) part in the previous paragraph actually references a Group object,
and the subsequent .Value references its Value property (the text associated
with the group). There is a Group object for each set of capturing parentheses,
and a "virtual group," numbered zero, which holds the information about the overall
match.


Thus, MatchObj.Value and MatchObj.Groups(0).Value are the same: a copy
of the entire text matched. It's more concise and convenient to use the first,
shorter approach, but it's important to know about the zeroth group because
MatchObj.Groups.Count (the number of groups known to the Match object)
includes it. The MatchObj.Groups.Count resulting from a successful match with
\s+(\d+) is two (the whole-match "zeroth" group, and the $1 group).
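

A short sketch of the zeroth group in action, reusing the regex and sample string
from above (the module name GroupCountSketch is my own, used only for illustration):


```vb
Option Explicit On
Option Strict On
Imports System.Text.RegularExpressions

Module GroupCountSketch
    Sub Main()
        Dim R As Regex = New Regex("\s+(\d+)")
        Dim M As Match = R.Match("May 16, 1998")
        If M.Success Then
            ' Group 0 is the whole match; group 1 is the one set of parentheses.
            Console.WriteLine("Groups.Count      = " & M.Groups.Count.ToString())
            Console.WriteLine("M.Value           = [" & M.Value & "]")
            Console.WriteLine("M.Groups(0).Value = [" & M.Groups(0).Value & "]")
            Console.WriteLine("M.Groups(1).Value = [" & M.Groups(1).Value & "]")
        End If
    End Sub
End Module
```


The count should come out as 2, and the second and third lines should show the
same text (the whole match, including its leading space), while the last line
shows only the digits captured by the parentheses.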



9.2.3.4 Capture objects



There is also a Capture object. It's not used often, but it's discussed starting in
Section 9.6.3.



9.2.3.4.1 All results are computed at match time



When a regex is applied to a string, resulting in a Match object, all the results
(where it matched, what each capturing group matched, etc.) are calculated and
encapsulated into the Match object. Accessing properties and methods of the
Match object, including its Group objects (and their properties and methods),
merely fetches the results that have already been computed.


