Mastering Regular Expressions (2nd Edition)

Jeffrey E. F. Friedl


5.3 HTML-Related Examples



In Chapter 2, we saw an extended example that converted raw text to HTML
(see Section 2.3.6), including regular expressions to pluck out email addresses and http URLs
from the text. In this section, we'll do a few other HTML-related tasks.



5.3.1 Matching an HTML Tag



It's common to see <[^>]+> used to match an HTML tag. It usually works fine,
such as in this snippet of Perl that strips tags:


     $html =~ s/<[^>]+>//g;


However, it matches improperly if the tag has '>' within it, as with this perfectly
valid HTML: <input name=dir value=">">. Although it's not common or recommended,
HTML allows a raw '<' and '>' to appear within a quoted tag attribute.
Our simple <[^>]+> doesn't allow for that, so we must make it smarter.


Allowed within the '<···>' are quoted sequences, and "other stuff" characters that
may appear unquoted. This includes everything except '>' and quotes. HTML
allows both single- and double-quoted strings. It doesn't allow embedded quotes
to be escaped, which allows us to use the simple regexes "[^"]*" and '[^']*' to
match them.


Putting these together with the "other stuff" regex [^'">], we get:


     <("[^"]*"|'[^']*'|[^'">])*>


That may be a bit confusing, so how about the same thing shown with comments
in a free-spacing mode:


     <                    # Opening "<"
     (                    # Any amount of . . .
        "[^"]*"           #    double-quoted string,
          |               #      or . . .
        '[^']*'           #    single-quoted string,
          |               #      or . . .
        [^'">]            #    "other stuff"
     )*                   #
     >                    # Closing ">"


The overall approach is quite elegant, as it treats each quoted part as a unit, and
clearly indicates what is allowed at any point in the match. Nothing can be
matched by more than one part of the regex, so there's no ambiguity, and hence
no worry about unintended matches "sneaking in," as with some earlier examples.
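
For instance, here's a minimal sketch (the sample text is my own) of the smarter
tag-stripper in Perl. Unlike the simple <[^>]+>, it treats a quoted attribute value
as a unit, so a '>' inside the quotes no longer ends the match prematurely:


     my $html = 'before <input name=dir value=">"> after';
     $html =~ s/<("[^"]*"|'[^']*'|[^'">])*>//g;
     print $html, "\n";    # prints: before  after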


Notice that * rather than + is used within the quotes of the first two alternatives?
A quoted string may be empty (e.g., 'alt=""'), so * is used within each pair of
quotes to reflect that. But don't use * or + in the third alternative, as the [^'">]
is already directly subject to a quantifier via the wrapping (···)*. Adding another
quantifier, yielding an effective ([^'">]+)*, could cause a very rude surprise that I
don't expect you to understand at this point; it's discussed in great detail in the
next chapter (see Section 6.1.4).


One thought about efficiency when used with an NFA engine: since we don't use
the text captured by the parentheses, we can change them to non-capturing parentheses
(see Section 3.4.5.2). And since there is indeed no ambiguity among the alternatives, if it
turns out that the final '>' can't match when it's tried, there's no benefit in going back
and trying the remaining alternatives. Where one of the alternatives matched
before, no other alternative can match now from the same spot. So, it's okay to
throw away any saved states, and doing so affords a faster failure when no match
can be had. This can be done by using (?>···) atomic grouping instead of the
non-capturing parentheses (or a possessive star to quantify whichever parentheses
are used).
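
To make that concrete, here's a sketch of the tag-stripping substitution with those
tweaks applied (the second form assumes a flavor that supports possessive quantifiers,
such as Perl 5.10 or later, java.util.regex, or PCRE):


     # Atomic grouping in place of the non-capturing parentheses . . .
     $html =~ s/<(?>"[^"]*"|'[^']*'|[^'">])*>//g;

     # . . . or a possessive star quantifying non-capturing parentheses.
     $html =~ s/<(?:"[^"]*"|'[^']*'|[^'">])*+>//g;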



5.3.2 Matching an HTML Link



Let's say that now we want to match sets of URL and link text from a document,
such as pulling the URL and link text from:


     ···<a href="http://www.oreilly.com">O'Reilly And Associates</a>···


Because the contents of an <A> tag can be fairly complex, I would approach this
task in two parts. The first is to pluck out the "guts" of the <A> tag, along with the
link text, and then pluck the URL itself from those <A> guts.


A simplistic approach to the first part is a case-insensitive, dot-matches-all application
of <a\b([^>]+)>(.*?)</a>, which features the lazy star quantifier. This puts
the <A> guts into $1 and the link text into $2. Of course, as earlier, instead of
[^>]+ I should use what we developed in the previous section. Having said that,
I'll continue with this simpler version, for the sake of keeping that part of the
regex shorter and cleaner for the discussion.


Once we have the <A> guts in a string, we can inspect them with a separate regex.
In them, the URL is the value of the href=value attribute. HTML allows spaces on
either side of the equal sign, and the value can be quoted or not, as described in
the previous section. A solution is shown as part of this Perl snippet to report on
links in the variable $Html:


     #  Note: the regex in the while(...) is overly simplistic -- see text for discussion
     while ($Html =~ m{<a\b([^>]+)>(.*?)</a>}ig)
     {
       my $Guts = $1;  # Save results from the match above, to their own . . .
       my $Link = $2;  #   . . . named variables, for clarity below.

       if ($Guts =~ m{
                       \b HREF         # "href" attribute
                       \s* = \s*       # "=" may have whitespace on either side
                       (?:             # Value is . . .
                          "([^"]*)"    #   double-quoted string,
                        |              #     or . . .
                          '([^']*)'    #   single-quoted string,
                        |              #     or . . .
                          ([^'">\s]+)  #   "other stuff"
                       )               #
                     }xi)
       {
           my $Url = $+;  # Gives the highest-numbered actually-filled $1, $2, etc.
           print "$Url with link text: $Link\n";
       }
     }


Some notes about this:



This time, I added parentheses to each value-matching alternative, to capture
the exact value matched.



Because I'm using some of the parentheses to capture, I've used non-capturing
parentheses where I don't need to capture, both for clarity and efficiency.



This time, the "other stuff" component excludes whitespace in addition to
quotes and '>', as whitespace separates "attribute=value" pairs.



This time, I do use + in the "other stuff" alternative, as it's needed to capture
the whole href value. Does this cause the same "rude surprise" as using + in
the "other stuff" alternative of the previous section? No, because there's no outer
quantifier that directly influences the class being repeated. Again, this is covered
in detail in the next chapter.




Depending on the text, the actual URL may end up in $1, $2, or $3. The others
will be empty or undefined. Perl happens to support a special variable $+, which is
the value of the highest-numbered $1, $2, etc. that actually captured text. In this
case, that's exactly what we want as our URL.


Using $+ is convenient in Perl, but other languages offer other ways to isolate the
captured URL. Normal programming constructs can always be used to inspect the
captured groups, using the one that has a value. If supported, named capturing
(see Section 3.4.5.3) is perfect for this, as shown in the VB.NET example in Section 5.3.4. (It's
good that .NET offers named capture, because its $+ is broken; see Section 9.3.2.1.)
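
As a sketch of that "normal programming constructs" idea (shown here in Perl,
which of course also has $+), one could simply test which capture group actually
received a value after the href match:


     my $Url;
     if    (defined $1) { $Url = $1 }   # value came from the double-quoted alternative
     elsif (defined $2) { $Url = $2 }   # value came from the single-quoted alternative
     else               { $Url = $3 }   # value came from the "other stuff" alternative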



5.3.3 Examining an HTTP URL



Now that we've got a URL, let's see if it's an http URL, and if so, pluck it apart into
its hostname and path components. Since we know we have something intended
to be a URL, our task is made much simpler than if we had to identify a URL from
among random text. That much more difficult task is investigated a bit later in this
chapter.


So, given a URL, we merely need to be able to recognize the parts. The hostname
is everything after ^http:// but before the next slash (if there is another slash),
and the path is everything else: ^http://([^/]+)(/.*)?$


Actually, a URL may have an optional port number between the hostname and the
path, with a leading colon: ^http://([^/:]+)(:(\d+))?(/.*)?$

Here's a Perl snippet to report about a URL:


     if ($url =~ m{^http://([^/:]+)(:(\d+))?(/.*)?$}i)
     {
         my $host = $1;
         my $port = $3 || 80;    # Use $3 if it exists; otherwise default to 80.
         my $path = $4 || "/";   # Use $4 if it exists; otherwise default to "/".
         print "host: $host\n";
         print "port: $port\n";
         print "path: $path\n";
     } else {
         print "not an http url\n";
     }
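
If this kind of check is needed in more than one place, here's a minimal sketch
(the routine name is my own) wrapping the same regex into a reusable routine
that returns the pieces, or nothing for a non-http URL:


     sub parse_http_url {
         my ($url) = @_;
         if ($url =~ m{^http://([^/:]+)(:(\d+))?(/.*)?$}i) {
             return ($1, $3 || 80, $4 || "/");   # (host, port, path)
         }
         return;   # not an http URL
     }

     my ($host, $port, $path) = parse_http_url("http://www.oreilly.com/catalog/regex/");
     # $host is "www.oreilly.com", $port is 80, and $path is "/catalog/regex/"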


5.3.4 Validating a Hostname



In the previous example, we used [^/:]+ to match a hostname. Yet, in Chapter 2
(see Section 2.3.6.7), we used the more complex [-a-z]+(\.[-a-z]+)*\.(com|edu|···|info).
Why the difference in complexity for finding ostensibly the same thing?


Well, even though both are used to "match a hostname," they're used quite differently.
It's one thing to pluck out something from a known quantity (e.g., from
something you know to be a URL), but it's quite another to accurately and unambiguously
pluck out that same type of something from among random text. Specifically,
in the previous example, we made the assumption that what comes after the
'http://' is a hostname, so the use of [^/:]+ merely to fetch it is reasonable. But
in the Chapter 2 example, we use a regex to find a hostname in random text, so it
must be much more specific.


Now, for a third angle on matching a hostname, we can consider validating hostnames
with regular expressions. In this case, we want to check whether a string is
a well-formed, syntactically correct hostname. Officially, a hostname is made up of
dot-separated parts, where each part can have ASCII letters, digits, and hyphens,
but a part can't begin or end with a hyphen. Thus, one part can be matched with a
case-insensitive application of [a-z0-9]|[a-z0-9][-a-z0-9]*[a-z0-9]. The
final suffix part ('com', 'edu', 'uk', etc.) has a limited set of possibilities, mentioned
in passing in the Chapter 2 example. Using that here, we're left with the following
regex (shown after the sidebar) to match a syntactically valid hostname:



Link Checker in VB.NET



This program reports on links within the HTML in the variable Html:


     Imports System.Text.RegularExpressions
          ···
     ' Set up the regular expressions we'll use in the loop
     Dim A_Regex as Regex = New Regex( _
             "<a\b(?<guts>[^>]+)>(?<Link>.*?)</a>", _
             RegexOptions.IgnoreCase)
     Dim GutsRegex as Regex = New Regex( _
             "\b HREF                 (?# 'href' attribute             )" & _
             "\s* = \s*               (?# '=' with optional whitespace )" & _
             "(?:                     (?# Value is . . .               )" & _
             "   ""(?<url>[^""]*)""   (?#   double-quoted string,      )" & _
             " |                      (?#     or . . .                 )" & _
             "   '(?<url>[^']*)'      (?#   single-quoted string,      )" & _
             " |                      (?#     or . . .                 )" & _
             "   (?<url>[^'"">\s]+)   (?#   'other stuff'              )" & _
             ")                       (?#                              )", _
             RegexOptions.IgnoreCase Or RegexOptions.IgnorePatternWhitespace)
     ' Now check the 'Html' variable . . .
     Dim CheckA as Match = A_Regex.Match(Html)
     ' For each match within . . .
     While CheckA.Success
         ' We matched an <a> tag, so now check for the URL.
         Dim UrlCheck as Match = _
                 GutsRegex.Match(CheckA.Groups("guts").Value)
         If UrlCheck.Success Then
             ' We've got a match, so have a URL/link pair
             Console.WriteLine("Url " & UrlCheck.Groups("url").Value & _
                               " WITH LINK " & CheckA.Groups("Link").Value)
         End If
         CheckA = CheckA.NextMatch
     End While
_____________________________________________________________________


A few things to notice:



VB.NET programs using regular expressions require that first Imports
line to tell the compiler what object libraries to use.



I've used (?#···) style comments because it's inconvenient to get a newline
into a VB.NET string, and normal '#' comments carry on until the
next newline or the end of the string (which means that the first one
would make the entire rest of the regex a comment). To use normal
#··· comments, add & chr(10) at the end of each line (see Section 9.3.1.2).



Each double quote in the regex requires '""' in the literal string (see Section 3.3.1.1).



Named capturing is used in both expressions, allowing the more descriptive
Groups("url") instead of Groups(1), Groups(2), etc.




     ^
     (?i)                 # apply this regex in a case-insensitive manner.
     # One or more dot-separated parts···
     (?: [a-z0-9]\. | [a-z0-9][-a-z0-9]*[a-z0-9]\. )+
     # Followed by the final suffix part···
     (?: com|edu|gov|int|mil|net|org|biz|
         info|name|museum|coop|aero|[a-z][a-z] )
     $


Something matching this regex isn't necessarily valid quite yet, as there's a length
limitation: individual parts may be no longer than 63 characters. That means that
the [-a-z0-9]* in there should be [-a-z0-9]{0,61}.


There's one final change, just to be official. Officially, a name consisting of only
one of the suffixes (e.g., 'com', 'edu', etc.) is also syntactically valid. Current practice
seems to be that these "names" don't actually have a computer answer to
them, but that doesn't always seem to be the case for the two-letter country suffixes.
For example, Anguilla's top-level domain 'ai' has a web server: http://ai/
shows a page. A few others like this that I've seen include cc, co, dk, mm, ph, tj,
tv, and tw.


So, if you wish to allow for these special cases, change the central (?:···)+
to (?:···)*. These changes leave us with:


     ^
     (?i)                 # apply this regex in a case-insensitive manner.
     # One or more dot-separated parts···
     (?: [a-z0-9]\. | [a-z0-9][-a-z0-9]{0,61}[a-z0-9]\. )*
     # Followed by the final suffix part···
     (?: com|edu|gov|int|
         mil|net|org|biz|info|name|museum|coop|aero|[a-z][a-z] )
     $
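
Dropped into Perl, the whole thing might look like this quick sketch (the test
strings are my own):


     my $HostnameCheck = qr{
         ^
         (?i)            # apply this regex in a case-insensitive manner.
         # Zero or more dot-separated parts···
         (?: [a-z0-9]\. | [a-z0-9][-a-z0-9]{0,61}[a-z0-9]\. )*
         # Followed by the final suffix part···
         (?: com|edu|gov|int|
             mil|net|org|biz|info|name|museum|coop|aero|[a-z][a-z] )
         $
     }x;

     foreach my $name ("www.oreilly.com", "ai", "not_a_hostname!") {
         print "$name: ", ($name =~ $HostnameCheck ? "valid" : "invalid"), "\n";
     }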


This now works just dandy to validate a string containing a hostname. Since this is
the most specific of the three hostname-related regexes we've developed, you
might think that if you remove the anchors, it could be better than the regex we
came up with earlier for plucking out hostnames from random text. That's not the
case. This regex matches any two-letter word, which is why the less-specific regex
from Chapter 2 is better in practice. But, it still might not be good enough for
some purposes, as the next section shows.



5.3.5 Plucking Out a URL in the Real World



Working for Yahoo! Finance, I write programs that process incoming financial
news and data feeds. News articles are usually provided to us in raw text, and my
programs convert them to HTML for a pleasing presentation. (Read financial news
at http://finance.yahoo.com and see how I've done.) It's often a daunting task
due to the random "formatting" (or lack thereof) of the data we receive, and
because it's much more difficult to recognize things like hostnames and URLs in
raw text than it is to validate them once you've got them. The previous section
alluded to this; in this section, I'll show you code we actually use at Yahoo! to
solve the issues we've faced.


We look for several types of URLs to pluck from the text: mailto, http, https,
and ftp URLs. If we find 'http://' in the text, we're pretty certain that's the start
of a URL, so we can use something simple like http://[-\w]+(\.\w[-\w]*)+ to
match up through the hostname part. We're using the knowledge of the text (raw
English text provided as ASCII) to realize that it's probably okay to use [-\w] instead
of [-a-z0-9]. \w also matches an underscore, and in some systems also matches
the whole of Unicode letters, but we know that neither of these really matters to us
in this particular situation.


However, often, a URL is given without the http:// or mailto: prefix, such as:


     ···visit us at www.oreilly.com or mail to orders@oreilly.com.


In this case, we need to be much more careful. What we use is quite similar to the
regex from the previous section, but it differs in a few ways:


     (?i: [a-z0-9] (?:[-a-z0-9]*[a-z0-9])? \. )+   # sub domains
     # Now ending .com, etc. For these, we require lowercase
     (?-i: com\b
         | edu\b
         | biz\b
         | org\b
         | gov\b
         | in(?:t|fo)\b    # .int or .info
         | mil\b
         | net\b
         | name\b
         | museum\b
         | coop\b
         | aero\b
         | [a-z][a-z]\b    # two-letter country codes
     )


In this regex, (?i:···) and (?-i:···) are used to explicitly enable and disable case-insensitivity
for specific parts of the regex (see Section 3.4.4.2). We want to match a URL like
'www.OReilly.com', but not a stock symbol like 'NT.TO' (the stock symbol for
Nortel Networks on the Toronto Stock Exchange; remember, we process financial
news and data, which has a lot of stock symbols). Officially, the ending part of a
URL (e.g., '.com') may be uppercase, but we simply won't recognize those. That's
the balance we've struck among matching what we want (pretty much every URL
we're likely to see), not matching what we don't want (stock symbols), and simplicity.
I suppose we could move the (?-i:···) to wrap only the country codes
part, but in practice, we just don't get uppercased URLs, so we've left this as it is.
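
As a quick check of that split (the test strings are my own, and I've anchored the
expression just for this standalone test), the mixed-case hostname matches while
the uppercase stock symbol does not:


     foreach my $candidate ("www.OReilly.com", "NT.TO") {
         if ($candidate =~ m{
                 ^
                 (?i: [a-z0-9] (?:[-a-z0-9]*[a-z0-9])? \. )+
                 (?-i: com\b | edu\b | biz\b | org\b | gov\b | in(?:t|fo)\b | mil\b
                     | net\b | name\b | museum\b | coop\b | aero\b | [a-z][a-z]\b )
                 $
             }x)
         {
             print "$candidate looks like a URL hostname\n";
         } else {
             print "$candidate does not\n";   # NT.TO is rejected
         }
     }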


Here's a framework for finding URLs in raw text, into which we can insert the
subexpression to match a hostname:


     \b
     # Match the leading part (proto://hostname, or just hostname)
     (
        # ftp://, http://, or https:// leading part
        (ftp|https?)://[-\w]+(\.\w[-\w]*)+
      |
        # or, try to find a hostname with our more specific sub-expression
        full-hostname-regex
     )
     # Allow an optional port number
     ( : \d+ )?
     # The rest of the URL is optional, and begins with / . . .
     (
        /path-part
     )?

I haven't talked yet about the path part of the regex, which comes after the hostname
(e.g., the /catalog/regex/ part of http://www.oreilly.com/catalog/regex/).
The path part turns out to be the most difficult text to match properly, as it
requires some guessing to do a good job. As discussed in Chapter 2, what often
comes after a URL in the text is also allowed as part of a URL. For example, with


     Read his comments at http://www.oreilly.com/ask_tim/index.html. He ...


we can look and realize that the period after 'index.html' is English punctuation
and should not be considered part of the URL, yet the period within 'index.html'
is part of the URL.


Although it's easy for us humans to differentiate between the two, it's quite difficult
for a program, so we've got to come up with some heuristics that get the job
done as best we can. The approach taken with the Chapter 2 example is to use
negative lookbehind to ensure that a URL can't end with sentence-ending punctuation
characters. What we've been using at Yahoo! Finance was originally written
before negative lookbehind was available, and so is more complex than the Chapter 2
approach, but in the end it has the same effect. It's shown in the listing below.
The approach taken for the path part is different in a number of
respects, and the comparison with the Chapter 2 example in Section 2.3.6.6 should be
interesting. In particular, the Java version of this regex in the sidebar in Section 5.4.1
provides some insight as to how it was built.


In practice, I doubt I'd actually write out a full monster like this, but instead I'd
build up a "library" of regular expressions and use them as needed. A simple
example of this is shown with the use of $HostnameRegex in Section 2.3.6.7, and also
in the sidebar in Section 5.4.1.


Regex to pluck a URL from financial news


     \b
     # Match the leading part (proto://hostname, or just hostname)
     (
        # ftp://, http://, or https:// leading part
        (ftp|https?)://[-\w]+(\.\w[-\w]*)+
      |
        # or, try to find a hostname with our more specific sub-expression
        (?i: [a-z0-9] (?:[-a-z0-9]*[a-z0-9])? \. )+   # sub domains
        # Now ending .com, etc. For these, require lowercase
        (?-i: com\b
            | edu\b
            | biz\b
            | gov\b
            | in(?:t|fo)\b    # .int or .info
            | mil\b
            | net\b
            | org\b
            | [a-z][a-z]\b    # two-letter country codes
        )
     )
     # Allow an optional port number
     ( : \d+ )?
     # The rest of the URL is optional, and begins with / . . .
     (
        /
        # The rest are heuristics for what seems to work well
        [^;"'<>()\[\]{}\s\x7F-\xFF]*
        (?:
           [.,?]+ [^;"'<>()\[\]{}\s\x7F-\xFF]+
        )*
     )?
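
As a rough sketch of that "library" approach (the variable names here are my own),
the pieces can be built once with Perl's qr//, then composed into the full regex
above:


     my $HostnameRegex = qr{
         (?i: [a-z0-9] (?:[-a-z0-9]*[a-z0-9])? \. )+           # sub domains
         (?-i: com\b | edu\b | biz\b | gov\b | in(?:t|fo)\b    # final suffix part
             | mil\b | net\b | org\b | [a-z][a-z]\b )
     }x;

     my $UrlRegex = qr{
         \b
         # Match the leading part (proto://hostname, or just hostname)
         (?: (?:ftp|https?)://[-\w]+(?:\.\w[-\w]*)+
           | $HostnameRegex
         )
         # Allow an optional port number
         (?: : \d+ )?
         # The rest of the URL is optional, and begins with / . . .
         (?: / [^;"'<>()\[\]{}\s\x7F-\xFF]*
             (?: [.,?]+ [^;"'<>()\[\]{}\s\x7F-\xFF]+ )*
         )?
     }x;

     # $RawText is assumed to hold the raw article text
     while ($RawText =~ m/$UrlRegex/g) {
         print "found URL: $&\n";
     }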

