Professional ASP.NET 1.1 [Electronic resources] (text version)

Chapter 5 discussed validation controls, including the regular expression validator, which uses a regular expression to check the value of an email entry field. This works very well for validating entry fields, but there are times when you need to process text outside of the validators, typically when writing custom text-processing or screen-scraping applications.

For example, without regular expressions, how easy would it be to extract all the links from the HTML of a web page? You could search for the "href" string, but then you would have to be flexible about the contents of the attribute string. Regular expressions allow this flexibility, by way of pattern matching.

Pattern Matching

Regular expressions allow you to search, extract, or replace substrings based on an expression, or a pattern. These expressions are where the power of regular expressions lies. The patterns available in regular expressions use special characters and sequences to identify what is being searched for. The following table lists some of the main pattern elements:

Element	Description
*	Quantifier: zero or more matches of the preceding expression
+	Quantifier: one or more matches of the preceding expression
()	Captures the matched substring into the next available capture group (a capture group can hold zero, one, or more strings)
(?<name>)	Captures the matched substring into the capture group identified by name
\n	Backreference: matches the text captured by the nth group (for example, \1)
|	Either of the expressions separated by the | character
.	Any character (except newline)
[]	Any single character within the brackets
[^]	Any single character not within the brackets
\s	Any whitespace character
\S	Any non-whitespace character
\d	Any digit character
\D	Any non-digit character

The following table shows some examples of regular expressions, and content those expressions match:

Example	Matches
abc*	ab followed by zero or more 'c' characters
abc+	ab followed by one or more 'c' characters
abc(def)ghi	abcdefghi, and places def in the first capture group
ab(cd)ef(gh)i	abcdefghi, places cd into capture group 1, and gh into capture group 2
hello|goodbye	Either hello or goodbye
[abcdef]	Any of the characters abcdef
[a-f]	Any of the characters abcdef
[^a-f]	Any character other than abcdef
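The entries in the two tables can be checked against sample input with a few lines of code. The following is a minimal sketch, written as a console module rather than a page so that it can be run on its own; the module name PatternTableDemo, the FirstMatch helper, and the sample strings are illustrative assumptions, not part of the chapter's code.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module PatternTableDemo

    ' Hypothetical helper: returns the first match of pattern in source,
    ' or an empty string when nothing matches.
    Function FirstMatch(ByVal pattern As String, ByVal source As String) As String
        Dim mt As Match = Regex.Match(source, pattern)
        If mt.Success Then
            Return mt.Value
        End If
        Return String.Empty
    End Function

    Sub Main()
        Console.WriteLine(FirstMatch("abc*", "xxabcccyy"))            ' abccc
        Console.WriteLine(FirstMatch("abc*", "xxabyy"))               ' ab
        Console.WriteLine(FirstMatch("abc+", "xxabyy"))               ' (empty: no c present)
        Console.WriteLine(FirstMatch("hello|goodbye", "say goodbye")) ' goodbye
        Console.WriteLine(FirstMatch("[^a-f]", "abcxyz"))             ' x
        Console.WriteLine(FirstMatch("\d+", "order 1234 shipped"))    ' 1234
    End Sub

End Module
```

Note that the quantifier applies only to the element immediately before it, which is why abc* is satisfied by ab alone.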

Pattern Ordering and Length

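There are two important points to note about searching for patterns. First, the search scans from the left and returns the first match it finds. Second, quantifiers such as * are greedy, so the match found will be the largest available from that starting point, which may not be what is expected. For example, consider the following string: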

Alex Homer is an author. Despite his years, he's not the Homer that wrote Greek epics.

Let's say we use the following expression:

Homer(.*)

This expression looks for the word Homer, and places any characters found after it in a capture group. The thing to watch for is that the first occurrence found in the search string is used. So, what's captured is the following:

is an author. Despite his years, he's not the Homer that wrote Greek epics.

There are two instances of Homer, and it's the first one that is matched. This rule changes when the search expression is widened to include any characters at the start of the search string. If you use the following expression:

.*Homer(.*)

This looks for any characters, followed by Homer, and places any characters found after it in a capture group. However, the leading .* is greedy: it consumes as much of the string as it can, so the Homer in the expression now lines up with the last occurrence in the string. The overall match is larger, but the group now contains fewer characters. In this case, the following is captured:

that wrote Greek epics.

The rules for these matches are entirely consistent, and they mean that you have to be careful in selecting match strings.
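The two cases above can be reproduced with a short sketch. It is a console module with illustrative names (GreedyDemo, FirstGroup), and the third call uses a lazy quantifier (.*?), which this section does not cover; it is included only as one possible way to narrow a match. The square brackets in the output make the leading space in each capture visible.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module GreedyDemo

    ' The sample sentence from the text, on a single line.
    Public Const Sample As String = _
        "Alex Homer is an author. Despite his years, he's not the Homer that wrote Greek epics."

    ' Hypothetical helper: returns what the first capture group holds
    ' after the first match of pattern in source.
    Function FirstGroup(ByVal pattern As String, ByVal source As String) As String
        Return Regex.Match(source, pattern).Groups(1).Value
    End Function

    Sub Main()
        ' Matching starts at the first Homer; the greedy .* runs to the end.
        ' [ is an author. Despite his years, he's not the Homer that wrote Greek epics.]
        Console.WriteLine("[" & FirstGroup("Homer(.*)", Sample) & "]")

        ' The leading greedy .* pushes the match on to the last Homer.
        ' [ that wrote Greek epics.]
        Console.WriteLine("[" & FirstGroup(".*Homer(.*)", Sample) & "]")

        ' A lazy .*? stops at the first full stop after the first Homer.
        ' [ is an author]
        Console.WriteLine("[" & FirstGroup("Homer(.*?)\.", Sample) & "]")
    End Sub

End Module
```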

Text Replacement

If you are using patterns to search and replace within a string, remember that the replacement text may invalidate the expression that was used to perform the search. You should therefore be careful with search patterns that pick the widest match. It's nearly always best to be as explicit as possible, by using narrow patterns.
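As a rough illustration of why narrow patterns are safer for replacement, the sketch below removes bold tags from a fragment of markup using the static Regex.Replace method, first with a wide pattern and then with a narrow one. The module name, the StripBold helper, and the sample markup are assumptions made up for this example.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module ReplaceDemo

    ' Hypothetical sample markup containing two bold elements.
    Public Const Sample As String = "<b>one</b> and <b>two</b>"

    ' Replaces each match of pattern with the contents of its first group.
    Function StripBold(ByVal pattern As String) As String
        Return Regex.Replace(Sample, pattern, "$1")
    End Function

    Sub Main()
        ' Wide: .* runs from the first <b> to the last </b>, so only the
        ' outermost pair of tags is removed.
        Console.WriteLine(StripBold("<b>(.*)</b>"))     ' one</b> and <b>two

        ' Narrow: [^<]* cannot cross a tag, so each element is handled
        ' separately.
        Console.WriteLine(StripBold("<b>([^<]*)</b>"))  ' one and two
    End Sub

End Module
```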

Pattern Example

You've seen how to use the network classes to retrieve a web page from Amazon.com and extract the sales ranking for a book. Let's take a look at part of the HTML that the Amazon.com web page uses:

Amazon.com Sales Rank: 52,504

Notice that this is all on one line, so you need to extract the rank from the middle of the text, rather than from a line on its own. Here's the search expression, this time only using one group, since you really only require the sales rank:

Amazon.com Sales Rank: (?<rank>.*)

There are several parts to this, some of which aren't directly relevant to the ranking. However, let's take the whole expression so you can see exactly what it's built from. The leading text is matched literally. It is followed by a group (the part contained within parentheses), which is given a name. The name is defined by use of the ? character followed by a name contained within angle brackets, so here the group is called rank. Grouping doesn't affect how the expression is matched; it just allows easy access to parts of the match once matching has taken place.

It's clear which characters you need to match: those after the : and before the closing font tag. These are extracted by the group labeled rank.
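A minimal sketch of reading the named group follows. It works only with the visible text of the sample line, and it assumes that the rank is the last thing on the line and uses a comma as the thousands separator; the module name RankDemo and the ExtractRank helper are illustrative, not the chapter's code.

```vbnet
Imports System
Imports System.Globalization
Imports System.Text.RegularExpressions

Module RankDemo

    ' Hypothetical helper: returns the sales rank found in source,
    ' or -1 when the expression does not match.
    Function ExtractRank(ByVal source As String) As Integer
        ' The expression from the text. The unescaped . in Amazon.com
        ' matches any character; Amazon\.com would be stricter.
        Dim re As New Regex("Amazon.com Sales Rank: (?<rank>.*)")
        Dim mt As Match = re.Match(source)
        If Not mt.Success Then
            Return -1
        End If

        ' Named groups are read from the Groups collection by name.
        Dim rankText As String = mt.Groups("rank").Value.Trim()

        ' Assumption: the captured text is a number such as 52,504.
        Return Integer.Parse(rankText, NumberStyles.AllowThousands, _
                             CultureInfo.InvariantCulture)
    End Function

    Sub Main()
        Console.WriteLine(ExtractRank("Amazon.com Sales Rank: 52,504"))  ' 52504
        Console.WriteLine(ExtractRank("No rank on this line"))           ' -1
    End Sub

End Module
```

Because the group uses .*, anything else that follows the number on the same line is captured too, and Integer.Parse then throws an exception. A narrower group such as (?<rank>[\d,]+) would avoid that.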

The Regular Expression Classes

The System.Text.RegularExpressions namespace contains eight classes for the manipulation of regular expressions. These are:

Class	Represents
Regex	A regular expression
Match	The results from a single expression match
MatchCollection	A collection of results from iteratively applied matches
Group	The results from a single captured group
GroupCollection	A collection of captured groups
Capture	The results from a single sub-expression capture
CaptureCollection	A collection of captured sub-expressions
RegexCompilationInfo	Information about the compilation of expressions

As with pattern matching, we're not going to cover an exhaustive list of all the classes, properties, and methods. Instead we'll concentrate on the most useful scenarios.

The Regex Class

Regex is the root class for regular expressions, and represents an individual regular expression. It contains a number of methods to allow the creation and matching of expressions. For example:

Dim expr As String = "hello"
Dim re As New Regex(expr)
re.Match("Hello everyone, hello one and all.")

This creates an expression and then uses the Match method to match the expression with the supplied string. In this case, there would only be one match (the second hello), since the matching is, by default, case-sensitive.

The Regex class constructor is overloaded, to allow options to be specified. For example:

Dim expr As String = "hello"
Dim re As New Regex(expr, RegexOptions.IgnoreCase)
re.Match("Hello everyone, hello one and all.")

Now both Hello and hello match, since case is being ignored. The Match method returns the first of them; the Matches method returns both.
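To see the difference the option makes, the following sketch counts the matches with the Matches method, which returns a MatchCollection. The module name CaseDemo and the CountMatches helper are illustrative assumptions.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module CaseDemo

    Public Const Sample As String = "Hello everyone, hello one and all."

    ' Hypothetical helper: counts every match of pattern in the sample text.
    Function CountMatches(ByVal pattern As String, ByVal options As RegexOptions) As Integer
        Dim re As New Regex(pattern, options)
        Return re.Matches(Sample).Count
    End Function

    Sub Main()
        ' Case-sensitive: only the lower-case hello qualifies.
        Console.WriteLine(CountMatches("hello", RegexOptions.None))        ' 1

        ' Case-insensitive: Hello and hello both qualify.
        Console.WriteLine(CountMatches("hello", RegexOptions.IgnoreCase))  ' 2
    End Sub

End Module
```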

The options can be any of the RegexOptions values shown in the following table, combined with Or where more than one is needed (the options in force can be read back from the Options property of the class):

RegexOption	Description
Compiled	Specifies that the expression should be compiled to MSIL
ECMAScript	Enables ECMAScript-compliant behavior for the expression
ExplicitCapture	Only captures explicitly named or numbered groups, allowing unnamed parentheses to act as non-capturing groups
IgnoreCase	Case-insensitive match
IgnorePatternWhitespace	Ignores unescaped whitespace in the pattern
Multiline	Makes ^ and $ match the beginning and end of any line, rather than of the entire string
None	No options are set
RightToLeft	Searches from right to left. This is reflected in the RightToLeft property of the class
Singleline	Treats the search string as a single line (the . character matches every character, including newline)
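The effect of Multiline and Singleline is easiest to see on text that contains a line break. The sketch below is a console module with illustrative names (OptionsDemo, CountMatches) and a made-up two-line sample; it also shows two options being combined with Or.

```vbnet
Imports System
Imports Microsoft.VisualBasic
Imports System.Text.RegularExpressions

Module OptionsDemo

    ' Hypothetical sample: two lines of text separated by a line feed.
    Public ReadOnly Sample As String = "first line" & vbLf & "second line"

    ' Counts every match of pattern in the sample under the given options.
    Function CountMatches(ByVal pattern As String, ByVal options As RegexOptions) As Integer
        Dim re As New Regex(pattern, options)
        Return re.Matches(Sample).Count
    End Function

    Sub Main()
        ' By default ^ matches only at the start of the whole string.
        Console.WriteLine(CountMatches("^\w+", RegexOptions.None))        ' 1
        ' With Multiline, ^ also matches at the start of each line.
        Console.WriteLine(CountMatches("^\w+", RegexOptions.Multiline))   ' 2

        ' By default . does not match the line feed.
        Console.WriteLine(CountMatches("first.*second", RegexOptions.None))        ' 0
        ' With Singleline, . matches the line feed as well.
        Console.WriteLine(CountMatches("first.*second", RegexOptions.Singleline))  ' 1

        ' Options are flags, so they can be combined with Or.
        Console.WriteLine(CountMatches("^SECOND", _
            RegexOptions.IgnoreCase Or RegexOptions.Multiline))           ' 1
    End Sub

End Module
```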

The Match Class

The Match class contains the details of a single expression match, as returned by the Match method of the Regex class. For example:

Dim mt As Match
Dim expr As String = "hello"
Dim re As New Regex(expr, RegexOptions.IgnoreCase)
mt = re.Match("Hello everyone, hello one and all.")

You can then use the Success property to determine whether a match was made, and examine the Groups and Captures collections to identify what was matched.
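A common way of using the Success property is to step through a string one match at a time with the NextMatch method of the Match class. The following is a small sketch of that loop; the module name MatchDemo and the MatchPositions helper are illustrative assumptions.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module MatchDemo

    ' Hypothetical helper: returns the starting index of every
    ' case-insensitive match of pattern in source, separated by commas.
    Function MatchPositions(ByVal pattern As String, ByVal source As String) As String
        Dim re As New Regex(pattern, RegexOptions.IgnoreCase)
        Dim mt As Match = re.Match(source)
        Dim result As String = ""

        ' Success is False once there are no more matches.
        While mt.Success
            If result.Length > 0 Then result &= ","
            result &= mt.Index.ToString()
            mt = mt.NextMatch()
        End While

        Return result
    End Function

    Sub Main()
        Console.WriteLine(MatchPositions("hello", "Hello everyone, hello one and all."))  ' 0,16
    End Sub

End Module
```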

The Group Class

The Group class identifies a single captured group. Since an expression can contain multiple groups, the Match class has a Groups collection that contains a Group object for each group matched. For example, consider the match expression:

(he(ll)o)

This contains two explicit groups. One is for the entire word hello, and the other for the two l characters. There is also an implicit group, numbered 0, which holds the entire match. So, as far as matching is concerned, this expression is equivalent to:

he(ll)o

The only difference is the number of groups created.

Unlike the sales-ranking example, these groups aren't explicitly named, so they are given names equivalent to their position in the collection (1, 2, and so on). You can access the groups directly, or by enumerating the collection. For simple expressions, it's marginally quicker to let the class name the groups, but for more complex expressions, explicit names make it clear exactly which groups correspond to which part of the match expression.

For example, consider the following expression:

(l)+

This expression matches one or more occurrences of the l character.

The following example demonstrates simple grouping in use:

<%@ Page Language="VB" %>
<%@ Import Namespace="System.Text.RegularExpressions" %>
<%
  Dim mt As Match
  Dim gp As Group
  Dim expr As String = "h(e(ll)o)"
  Dim re As New Regex(expr, RegexOptions.IgnoreCase)

  mt = re.Match("Hello everyone, hello one and all.")

  ' Write each group of the first match on its own line
  For Each gp In mt.Groups
    Response.Write("<br />")
    Response.Write(gp.Value)
  Next
%>

This returns the following:

Hello
ello
ll

There are three groups. The first is the entire match (the first Hello in the string, since case is being ignored), the second corresponds to the group within the first set of parentheses, and the third is the group within the second set of parentheses.

The Group class also includes Index and Length properties, which indicate the position of the match within the search string, and the length of the string that is matched.
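As a brief sketch of these two properties, the following console module reports the position and length of each group for the h(e(ll)o) expression used above. The module name GroupInfoDemo and the Describe helper are illustrative assumptions.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module GroupInfoDemo

    ' Hypothetical helper: formats the value, index, and length of a group.
    Function Describe(ByVal number As Integer, ByVal gp As Group) As String
        Return "Group " & number.ToString() & ": " & gp.Value & _
               " (index " & gp.Index.ToString() & _
               ", length " & gp.Length.ToString() & ")"
    End Function

    Sub Main()
        Dim re As New Regex("h(e(ll)o)", RegexOptions.IgnoreCase)
        Dim mt As Match = re.Match("Hello everyone, hello one and all.")
        Dim i As Integer

        ' Group 0 is the whole match; 1 and 2 are the parenthesized groups.
        For i = 0 To mt.Groups.Count - 1
            Console.WriteLine(Describe(i, mt.Groups(i)))
        Next
        ' Group 0: Hello (index 0, length 5)
        ' Group 1: ello (index 1, length 4)
        ' Group 2: ll (index 2, length 2)
    End Sub

End Module
```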

The Capture Class

The Capture class represents a single sub-expression capture. Each Group can have multiple captures. The Capture class really comes into its own when quantifiers are used within expressions. Quantifiers add an optional quantity to finding patterns. Examples of quantifiers are * for zero or more occurrences and + for one or more occurrences. For example, consider the following expression, which searches for the first occurrence of one or more l characters:

(l)+

Putting this into a full example, you have:

<%@ Page Language="VB" %>
<%@ Import Namespace="System.Text.RegularExpressions" %>
<%
  Dim mt As Match
  Dim gp As Group
  Dim cp As Capture
  Dim expr As String = "(l)+"
  Dim re As New Regex(expr, RegexOptions.IgnoreCase)

  mt = re.Match("Hello everyone, hello one and all.")

  ' List each group of the first match, followed by its captures
  For Each gp In mt.Groups
    Response.Write("Group: " & gp.Value)
    Response.Write("<br />")
    For Each cp In gp.Captures
      Response.Write(" Capture: " & cp.Value)
      Response.Write("<br />")
    Next
  Next
%>

This gives the following result:

Group: ll
Capture: ll
Group: l
Capture: l
Capture: l

Both a single l and multiple l characters are matched, because the + quantifier specifies one or more. The first group listed is the whole match, the ll in the first Hello, which is captured once. The second group is the parenthesized l, which is captured twice, once for each l character; its value is the last of those captures. This becomes clearer with another example. Let's consider the following:

(abc)+

This matches one or more occurrences of the string abc. When matched against QQQabcabcabcWWWEEEabcab you get the following output:

Group: abcabcabc
Capture: abcabcabc
Group: abc
Capture: abc
Capture: abc
Capture: abc

The first group matches the widest expression, and there is only one occurrence of this. The second group matches the explicit group, and there are three occurrences of this.
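Each Capture also records where it was found, which makes the three occurrences easier to tell apart. The sketch below lists the captures of the explicit group together with their Index values; the module name CaptureDemo and the CapturePositions helper are illustrative assumptions.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module CaptureDemo

    ' Hypothetical helper: lists every capture of group 1 in the first
    ' match of pattern, in the form value@index.
    Function CapturePositions(ByVal pattern As String, ByVal source As String) As String
        Dim mt As Match = Regex.Match(source, pattern)
        Dim result As String = ""
        Dim cp As Capture

        For Each cp In mt.Groups(1).Captures
            If result.Length > 0 Then result &= ", "
            result &= cp.Value & "@" & cp.Index.ToString()
        Next

        Return result
    End Function

    Sub Main()
        ' Only the first run of abc is part of the first match; the abc
        ' near the end of the string belongs to a later match.
        Console.WriteLine(CapturePositions("(abc)+", "QQQabcabcabcWWWEEEabcab"))
        ' abc@3, abc@6, abc@9
    End Sub

End Module
```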

Substitutions

When using groups in expressions, you can reuse the group without having to retype it. This is known as substitution (within the pattern itself, a reference such as \1 is more precisely called a backreference). For example, consider the expression:

(abc)def

This matches abcdef but places abc into the first group. Then, to match abcdefabc, you'd use:

(abc)def\1
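The following sketch checks this expression and then shows the related $1 form, which refers to the same group from the replacement text of Regex.Replace. The module name BackrefDemo, the CollapseRepeats helper, and the repeated-word sample are assumptions made up for illustration.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module BackrefDemo

    ' True when source contains abc, then def, then the same text that
    ' group 1 captured.
    Function MatchesExample(ByVal source As String) As Boolean
        Return Regex.IsMatch(source, "(abc)def\1")
    End Function

    ' Hypothetical helper: collapses a word that is immediately repeated
    ' into a single copy, using \1 in the pattern and $1 in the replacement.
    Function CollapseRepeats(ByVal source As String) As String
        Return Regex.Replace(source, "(\w+) \1", "$1")
    End Function

    Sub Main()
        Console.WriteLine(MatchesExample("abcdefabc"))        ' True
        Console.WriteLine(MatchesExample("abcdefxyz"))        ' False
        Console.WriteLine(CollapseRepeats("it is is done"))   ' it is done
    End Sub

End Module
```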
