Apache Jakarta and Beyond: A Java Programmer's Introduction [Electronic resources]


Larne Pekowsky


13.1. Regular Expressions


Before looking at the tools themselves, it is necessary to describe how the structure will be specified. There are many ways of defining a structure, including XML DTDs, which may be thought of as a set of rules that a block of text must obey to be considered valid.

One common tool used to define structure is called regular expressions, or regexes, or sometimes regexps for short. Regular expressions comprise a language for specifying patterns of text. This is not a full-fledged language like Java; regexps are both simpler and less powerful,[1] but they turn out to be powerful enough for a wide variety of applications. In addition, the simplicity of regular expressions makes it possible for them to be implemented very efficiently. In what follows a regular expression will be called a pattern, and the text to be tested will be called the input.

[1] The notion of the "power" of a language has a precise definition in computer science. Although it is beyond the scope of this book, full languages like Java are equivalent to an abstract computing model called "Turing machines," whereas regular expressions are equivalent to another model called "finite state machines" or "finite automata."


Before describing the regular expression syntax in detail, a few preliminary words on how regexps are evaluated are in order. Patterns are matched left to right, and in general any pattern can be evaluated by examining each character of the input only once. Regular expressions have no notion of memory in the general sense. At each step of processing, a regular expression knows only what action to take based on the next character. This means regular expressions cannot be used to describe structures such as palindromes, strings that read the same backward as forward. This is because to check for palindromes it would be necessary either to remember what had been seen in the first half of the string when evaluating the second half or to move back and forth across the string to compare characters at each end. Basic regular expressions also have no convenient way to check for things like the presence of every vowel in any order. A pattern cannot record which vowels it has already seen as it moves through the string, so every possible ordering would have to be written out explicitly.
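The single-pass, memory-free evaluation described above can be sketched by hand. The following is a minimal illustration, not how any particular regexp library is implemented: a tiny finite state machine that accepts an 'a', followed by any number of 'b's, followed by a 'c'. The class name SinglePassMatcher and its state names are hypothetical, chosen only for this example.

```java
// Hypothetical illustration: a hand-coded finite state machine that accepts
// an 'a', then zero or more 'b's, then a 'c', and nothing else.
class SinglePassMatcher {
    // The only "memory" is the current state; nothing else is remembered.
    private enum State { START, SEEN_A, ACCEPT, FAIL }

    /** Returns true if the whole input is an 'a', any number of 'b's, then a 'c'. */
    static boolean matches(String input) {
        State state = State.START;
        // Each character is examined exactly once, left to right.
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            switch (state) {
                case START:
                    state = (c == 'a') ? State.SEEN_A : State.FAIL;
                    break;
                case SEEN_A:
                    if (c == 'b') {
                        state = State.SEEN_A;   // stay here for every extra 'b'
                    } else if (c == 'c') {
                        state = State.ACCEPT;
                    } else {
                        state = State.FAIL;
                    }
                    break;
                default:
                    // ACCEPT followed by more input, or FAIL: the match is lost.
                    state = State.FAIL;
                    break;
            }
        }
        return state == State.ACCEPT;
    }

    public static void main(String[] args) {
        System.out.println(matches("ac"));     // true
        System.out.println(matches("abbbc"));  // true
        System.out.println(matches("abca"));   // false
        System.out.println(matches("bc"));     // false
    }
}
```

Because the next state depends only on the current state and the next character, there is no place to store the first half of a string, which is why a machine of this kind cannot recognize palindromes of arbitrary length.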

Keeping these rules in mind, here are the basic elements of regular expressions:

  • Single characters match themselves. The pattern 'a' matches the input "a," the pattern 'q' matches "q," and the Unicode character '\u00FC' matches "ü." However, 'a' would not match "apple" even though "apple" contains the letter 'a.'[2]

    [2] The regular expression packages in many languages, such as Perl, do allow substrings or subpatterns to match against the entire input, so "a" would match "apple," as would "le." However, the usual formal definition of regular expressions agrees with the rule given here in stating that single characters match only themselves. Except where otherwise noted, the regular expression libraries considered in this chapter use the formal definition.

  • It follows immediately from the left-to-right processing rule that strings of characters match the corresponding literal strings, so the pattern "apple" matches the input "apple" and "Cr\u00FCxshadows" matches "Crüxshadows." "Hello" will not match "Hello, world."

  • A single dot '.' matches any single character, so the pattern '.' will match "a" or "b" or "ö" and so on. By the left-to-right rule this means that the pattern "c.t" will match "cat," "cot," "cut," or even "czt". The pattern ".." will match any two characters, "e.." will match any three characters beginning with 'e,' and so on. Note that this is not quite the same thing as saying "all three-letter words" because "e.." will also match strings with spaces such as "e t".

  • An asterisk following a pattern means that the immediately preceding pattern may be repeated zero or more times. The pattern "a*" will match any number of instances of the letter 'a' including "" (the empty string), "aa," "aaaaa," and so on.

  • Following a dot with an asterisk means any single character will match any number of times, which is the same thing as matching any input. ".*" will match anything at all; "a.*" will match any input beginning with 'a'; ".*ly" will match any input ending in "ly," and ".*qq.*" will match any input that contains "qq" anywhere within it. Note that this is equivalent to checking whether an input contains "qq," which is also equivalent to checking that string.indexOf("qq") != -1.

  • A plus sign ('+') in a pattern indicates that the preceding pattern must match one or more times. The pattern "a+" will match the strings "a" and "aaa" but not the empty string. Note that if p represents a pattern, then "p+" is equivalent to "pp*".

  • A question mark ('?') indicates that the preceding pattern must match either zero times or exactly once. "z?" matches the strings "z" and "".

  • A pattern may be followed by a number or pair of numbers enclosed in braces ({}) to specify the number of times, or a range of times, a pattern should occur. ".{4}" matches all inputs of four characters and is equivalent to "....". "a{2,}" will match all inputs consisting of the character 'a' at least twice, which is equivalent to the pattern "aaa*". The pattern "a{2,4}" will match inputs with at least two but no more than four occurrences of the letter 'a' (that is, "aa," "aaa," and "aaaa"). This rule can be considered a generalization of the previous three: "p*" is "p{0,}", "p+" is "p{1,}", and "p?" is "p{0,1}".

  • Patterns consisting of lists of characters enclosed in brackets ([ ]) match any of the enclosed characters. The pattern "[ab]" matches the strings "a" and "b". "[ab]*" matches inputs with any combination of 'a's and 'b's, such as "a," "b," "aabb," "baabaab," and so on.

  • As a shortcut, ranges of characters can be expressed with a dash, so "[d-g]" means the same thing as "[defg]". Ranges can be listed sequentially: "[a-dx-z]" means the same thing as "[abcdxyz]". Note that this is not the same as "[a-d][x-z]," which matches two-character inputs such as "az" and "dx". Combined with the brace rule, "[a-zA-Z]{3}" matches all three-letter words.

  • Starting a range or set of characters with a caret (^) negates the list. "[^a]" matches any character except 'a'; "[^a-d]" matches any character except 'a,' 'b,' 'c,' or 'd.'

  • A caret outside brackets means the start of a line and so should only appear at the beginning of a pattern. Likewise a dollar sign ($) represents the end of a line.

  • A vertical bar (|) separating two patterns matches if the input matches the pattern on the left or the right. "a|b" means the same as "[ab]," which is not that interesting. "a|zz" matches either a single 'a' or two 'z's.

  • So far there has been a slight ambiguity in the regular expression syntax; "abc{3}" could mean either "abcabcabc" or "abccc." Pattern modifiers affect only the immediately preceding pattern, so "abc{3}" is equivalent to "abccc." To make a modifier affect a string of characters, the characters must be enclosed in parentheses, so "(abc){3}" matches "abcabcabc." Any pattern can be enclosed in parentheses in order to apply a modifier to it. "(a[bc]){3}" matches strings consisting of three iterations of the pattern "a[bc]," such as "ababab," "acabab," and so on.
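The rules in this list can be tried out directly. The sketch below assumes the standard java.util.regex package, whose Pattern.matches method requires the whole input to match and so follows the formal definition used here; other libraries may differ in detail. The class name RegexBasics and its helper method are hypothetical.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; java.util.regex is assumed as the regexp library.
class RegexBasics {
    /** Whole-input match: true only if the pattern accounts for every character. */
    static boolean matches(String pattern, String input) {
        return Pattern.matches(pattern, input);
    }

    public static void main(String[] args) {
        // Literals must cover the entire input.
        System.out.println(matches("a", "apple"));               // false
        System.out.println(matches("Hello", "Hello, world"));    // false

        // Dot, asterisk, plus, and question mark.
        System.out.println(matches("c.t", "czt"));               // true
        System.out.println(matches("e..", "e t"));               // true
        System.out.println(matches("a*", ""));                   // true
        System.out.println(matches("a+", ""));                   // false
        System.out.println(matches("z?", ""));                   // true
        System.out.println(matches(".*qq.*", "xqqy"));           // true

        // Braces: no space between the pattern and the opening brace.
        System.out.println(matches("a{2,4}", "aaa"));            // true
        System.out.println(matches("a{2,4}", "aaaaa"));          // false

        // Character sets, ranges, and negation.
        System.out.println(matches("[a-dx-z]", "y"));            // true
        System.out.println(matches("[a-d][x-z]", "az"));         // true
        System.out.println(matches("[^a-d]", "c"));              // false

        // Alternation and grouping.
        System.out.println(matches("a|zz", "zz"));               // true
        System.out.println(matches("abc{3}", "abccc"));          // true
        System.out.println(matches("(abc){3}", "abcabcabc"));    // true
        System.out.println(matches("(a[bc]){3}", "acabab"));     // true
    }
}
```

Note that whitespace inside a pattern is significant in this library by default, so "a {2,4}" with a space would repeat the space rather than the 'a'.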


There are other options, and all three of the regular expression packages that will be discussed add their own twists. The syntax covered is enough for common use and allows many interesting possibilities. Here are a few:

  • .*q[^u].* finds all words containing a 'q' that is not followed by a 'u.'

  • .*z.*x.*|.*x.*z.* finds all words with both a 'z' and an 'x'. Note the use of an "or" pattern to represent the two cases, the 'z' before the 'x' and the 'x' before the 'z'. In principle this same approach could be used to find words where every vowel appears once, but because there are 120 ways to order five vowels, this is impractical.

  • ^([aeiou][^aeiou]*){11}$ matches all words that start with a vowel and contain exactly eleven vowels. [aeiou][^aeiou]* matches a vowel followed by any number of nonvowels, and the {11} modifier repeats this eleven times.
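As a rough check of these three patterns, the following sketch runs them against a few sample words. It again assumes java.util.regex; the class name WordPatterns, the constant names, and the sample words are illustrative choices, not taken from the text.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; java.util.regex is assumed as the regexp library.
class WordPatterns {
    static final Pattern Q_WITHOUT_U = Pattern.compile(".*q[^u].*");
    static final Pattern Z_AND_X = Pattern.compile(".*z.*x.*|.*x.*z.*");
    static final Pattern ELEVEN_VOWELS = Pattern.compile("^([aeiou][^aeiou]*){11}$");

    /** Whole-word match against a precompiled pattern. */
    static boolean matches(Pattern pattern, String word) {
        return pattern.matcher(word).matches();
    }

    public static void main(String[] args) {
        // 'q' followed by some character other than 'u'.
        System.out.println(matches(Q_WITHOUT_U, "qanat"));   // true
        System.out.println(matches(Q_WITHOUT_U, "queen"));   // false

        // Both 'z' and 'x', in either order.
        System.out.println(matches(Z_AND_X, "oxidize"));     // true
        System.out.println(matches(Z_AND_X, "zebra"));       // false

        // Starts with a vowel and has exactly eleven vowels.
        System.out.println(matches(ELEVEN_VOWELS, "overintellectualization")); // true
        System.out.println(matches(ELEVEN_VOWELS, "education"));               // false
    }
}
```

One detail worth noticing: because [^u] must consume a character, the first pattern does not match a word that ends in 'q', since nothing follows the 'q' at all.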


