Table 16-3. Java regular expression quick reference
Syntax | Matches |
---|---|
Single characters |
x | The character x, as long as x is not a punctuation character with special meaning in the regular expression syntax. |
\p | The punctuation character p. |
\\ | The backslash character. |
\n | Newline character \u000A. |
\t | Tab character \u0009. |
\r | Carriage return character \u000D. |
\f | Form feed character \u000C. |
\e | Escape character \u001B. |
\a | Bell (alert) character \u0007. |
\uxxxx | Unicode character with hexadecimal code xxxx. |
\xxx | Character with the two-digit hexadecimal code xx. |
\0n | Character with octal code n. |
\0nn | Character with octal code nn. |
\0nnn | Character with octal code nnn, where nnn <= 377. |
\cx | The control character ^x. |
Character classes |
[...] | One of the characters between the brackets. Characters may be specified literally, and the syntax also allows the specification of character ranges, with intersection, union, and subtraction operators. See specific examples below. |
[^...] | Any one character not between the brackets. |
[a-z0-9] | Character range: a character between (inclusive) a and z or 0 and 9. |
[0-9[a-fA-F]] | Union of classes: same as [0-9a-fA-F]. |
[a-z&&[aeiou]] | Intersection of classes: same as [aeiou]. |
[a-z&&[^aeiou]] | Subtraction: the characters a through z except for the vowels. |
. | Any character except a line terminator. If the DOTALL flag is set, then it matches any character including line terminators. |
\d | ASCII digit: [0-9]. |
\D | Anything but an ASCII digit: [^\d]. |
\s | ASCII whitespace: [ \t\n\f\r\x0B]. |
\S | Anything but ASCII whitespace: [^\s]. |
\w | ASCII word character: [a-zA-Z0-9_]. |
\W | Anything but ASCII word characters: [^\w]. |
\p{group} | Any character in the named group. See group names below. Many of the group names are from POSIX, which is why p is used for this character class. |
\P{group} | Any character not in the named group. |
\p{Lower} | ASCII lowercase letter: [a-z]. |
\p{Upper} | ASCII uppercase: [A-Z]. |
\p{ASCII} | Any ASCII character: [\x00-\x7f]. |
\p{Alpha} | ASCII letter: [a-zA-Z]. |
\p{Digit} | ASCII digit: [0-9]. |
\p{XDigit} | Hexadecimal digit: [0-9a-fA-F]. |
\p{Alnum} | ASCII letter or digit: [\p{Alpha}\p{Digit}]. |
\p{Punct} | ASCII punctuation: one of !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~. |
\p{Graph} | Visible ASCII character: [\p{Alnum}\p{Punct}]. |
\p{Print} | Printable ASCII character: the visible characters of \p{Graph} plus, in current Java releases, the space character. |
\p{Blank} | ASCII space or tab: [ \t]. |
\p{Space} | ASCII whitespace: [ \t\n\f\r\x0b]. |
\p{Cntrl} | ASCII control character: [\x00-\x1f\x7f]. |
\p{category} | Any character in the named Unicode category. Category names are one- or two-letter codes defined by the Unicode standard. One-letter codes include L for letter, N for number, S for symbol, Z for separator, and P for punctuation. Two-letter codes represent subcategories, such as Lu for uppercase letter, Nd for decimal digit, Sc for currency symbol, Sm for math symbol, and Zs for space separator. See java.lang.Character for a set of constants that correspond to these subcategories; however, note that the full set of one- and two-letter codes is not documented in this book. |
\p{block} | Any character in the named Unicode block. In Java regular expressions, block names begin with "In", followed by mixed-case capitalization of the Unicode block name, without spaces or underscores. For example: \p{InOgham} or \p{InMathematicalOperators}. See java.lang.Character.UnicodeBlock for a list of Unicode block names. |
Sequences, alternatives, groups, and references |
xy | Match x followed by y. |
x|y | Match x or y. |
(...) | Grouping. Group subexpression within parentheses into a single unit that can be used with *, +, ?, |, and so on. Also "capture" the characters that match this group for use later. |
(?:...) | Grouping only. Group subexpression as with ( ), but do not capture the text that matched. |
\n | Match the same characters that capturing group number n matched earlier in the expression. Be careful when n is followed by another digit: the largest number that is a valid group number will be used. |
Repetition[1] |
x? | Zero or one occurrence of x; i.e., x is optional. |
x* | Zero or more occurrences of x. |
x+ | One or more occurrences of x. |
x{n} | Exactly n occurrences of x. |
x{n,} | n or more occurrences of x. |
x{n,m} | At least n, and at most m, occurrences of x. |
Anchors[2] |
^ | The beginning of the input string, or if the MULTILINE flag is specified, the beginning of the string or of any new line. |
$ | The end of the input string, or if the MULTILINE flag is specified, the end of the string or of any line within the string. |
\b | A word boundary: a position in the string between a word and a nonword character. |
\B | A position in the string that is not a word boundary. |
\A | The beginning of the input string. Like ^, but never matches the beginning of a new line, regardless of what flags are set. |
\Z | The end of the input string, ignoring any trailing line terminator. |
\z | The end of the input string, including any line terminator. |
\G | The end of the previous match. |
(?=x) | A positive look-ahead assertion. Require that the following characters match x, but do not include those characters in the match. |
(?!x) | A negative look-ahead assertion. Require that the following characters do not match the pattern x. |
(?<=x) | A positive look-behind assertion. Require that the characters immediately before the position match x, but do not include those characters in the match. x must be a pattern whose maximum match length is bounded; unbounded repetition such as * or + is not allowed. |
(?<!x) | A negative look-behind assertion. Require that the characters immediately before the position do not match x. x must be a pattern whose maximum match length is bounded; unbounded repetition such as * or + is not allowed. |
Miscellaneous |
(?>x) | Match x independently of the rest of the expression, without considering whether the match causes the rest of the expression to fail to match. Useful to optimize certain complex regular expressions. A group of this form does not capture the matched text. |
(?onflags-offflags) | Don't match anything, but turn on the flags specified by onflags, and turn off the flags specified by offflags. These two strings are combinations in any order of the following letters and correspond to the following Pattern constants: i (CASE_INSENSITIVE), d (UNIX_LINES), m (MULTILINE), s (DOTALL), u (UNICODE_CASE), and x (COMMENTS). Flag settings specified in this way take effect at the point that they appear in the expression and persist until the end of the expression, or until the end of the parenthesized group of which they are a part, or until overridden by another flag setting expression. |
(?onflags-offflags:x) | Match x, applying the specified flags to this subexpression only. This is a noncapturing group, like (?:...), with the addition of flags. |
\Q | Don't match anything, but quote all subsequent pattern text until \E. All characters within such a quoted section are interpreted as literal characters to match, and none (except \E) have special meanings. |
\E | Don't match anything; terminate a quote started with \Q. |
#comment | If the COMMENTS flag is set, pattern text between a # and the end of the line is considered a comment and is ignored. |
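
A few of the entries in the table are easier to follow when seen in running code. The following is a minimal sketch, not a definitive treatment: the class name RegexQuickRefDemo and its helper methods are hypothetical names chosen for illustration, and the sample strings are invented. It exercises class subtraction, a back-reference, a look-ahead assertion, \Q...\E quoting, an inline flag, and the difference between \Z and \z. Note that every backslash in the table must be doubled inside a Java string literal.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; the names below are illustrative only.
public class RegexQuickRefDemo {

    // Subtraction [a-z&&[^aeiou]]: delete lowercase consonants, keeping vowels.
    static String dropConsonants(String s) {
        return s.replaceAll("[a-z&&[^aeiou]]", "");
    }

    // Back-reference \1: find a word that is immediately repeated.
    static String firstDoubledWord(String s) {
        Matcher m = Pattern.compile("\\b(\\w+)\\s+\\1\\b").matcher(s);
        return m.find() ? m.group(1) : null;
    }

    // Positive look-ahead (?=x): match digits only when " dollars" follows,
    // without including " dollars" in the match itself.
    static String dollarAmount(String s) {
        Matcher m = Pattern.compile("\\d+(?= dollars)").matcher(s);
        return m.find() ? m.group() : null;
    }

    // \Q...\E: treat every character of "1+1=2" literally.
    static boolean matchesQuoted(String s) {
        return Pattern.matches("\\Q1+1=2\\E", s);
    }

    // Inline flag (?i): case-insensitive matching from this point on.
    static boolean isHelloIgnoringCase(String s) {
        return s.matches("(?i)hello");
    }

    // \Z ignores one trailing line terminator; \z does not.
    static boolean endsWithEndBigZ(String s) {
        return Pattern.compile("end\\Z").matcher(s).find();
    }

    static boolean endsWithEndSmallZ(String s) {
        return Pattern.compile("end\\z").matcher(s).find();
    }

    public static void main(String[] args) {
        System.out.println(dropConsonants("regular"));           // eua
        System.out.println(firstDoubledWord("it is is fine"));   // is
        System.out.println(dollarAmount("pay 100 dollars"));     // 100
        System.out.println(matchesQuoted("1+1=2"));              // true
        System.out.println(isHelloIgnoringCase("HELLO"));        // true
        System.out.println(endsWithEndBigZ("the end\n"));        // true
        System.out.println(endsWithEndSmallZ("the end\n"));      // false
    }
}
```

Without the \Q...\E quoting, the pattern 1+1=2 would treat + as a repetition operator and would not match the literal text "1+1=2".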