6.2. Matching Letters
6.2.1. Problem
You want to see whether a string contains only alphabetic characters.
6.2.2. Solution
The obvious character class for matching regular letters isn't good
enough in the general case:
if ($var =~ /^[A-Za-z]+$/) {
    # it is purely alphabetic
}
because it doesn't pay attention to letters with diacritics or
characters from other writing systems. The best solution is to use
Unicode properties:
if ($var =~ /^\p{Alphabetic}+$/) { # or just /^\pL+$/
    print "var is purely alphabetic\n";
}
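As a rough illustration of the difference between the two tests, the sketch below runs the same words through both patterns. The helper names is_ascii_alpha and is_unicode_alpha are made up for this example, and it assumes the script itself is saved as UTF-8 so that use utf8 can decode the literals.

```perl
use strict;
use warnings;
use utf8;                                # the literals below contain non-ASCII letters
binmode(STDOUT, ":encoding(UTF-8)");     # so the words print correctly

# Hypothetical helpers, named only for this illustration.
sub is_ascii_alpha   { return $_[0] =~ /^[A-Za-z]+$/ ? 1 : 0 }
sub is_unicode_alpha { return $_[0] =~ /^\pL+$/      ? 1 : 0 }

for my $word ("silly", "façade", "Москва", "abc123") {
    printf "%-8s ascii=%d unicode=%d\n",
        $word, is_ascii_alpha($word), is_unicode_alpha($word);
}
```

Only the property-based test should accept the accented and Cyrillic words; both should reject the string containing digits.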
On older releases of Perl that don't support Unicode, your only real
option is to use either a negated character class:
if ($var =~ /^[^\W\d_]+$/) {
    print "var is purely alphabetic\n";
}
or, if supported, POSIX character classes:
if ($var =~ /^[[:alpha:]]+$/) {
    print "var is purely alphabetic\n";
}
But these don't work for non-ASCII letters unless you use locale and
the system you're running on actually supports POSIX locales.
6.2.3. Discussion
Apart from Unicode properties or POSIX character classes, Perl can't
directly express "something alphabetic" independent of locale, so we
have to be more clever. The \w regular expression
notation matches one alphabetic, numeric, or underscore
character—hereafter known as an "alphanumunder" for short.
Therefore, \W is one character that is not one of
those. The negated character class [^\W\d_]
specifies a character that must be neither a non-alphanumunder, a
digit, nor an underscore. That leaves nothing but alphabetics, which
is what we were looking for.
Here's how you'd use this in a program:
use locale;
use POSIX 'locale_h';
# the following locale string might be different on your system
unless (setlocale(LC_ALL, "fr_CA.ISO8859-1")) {
    die "couldn't set locale to French Canadian\n";
}
while (<DATA>) {
    chomp;
    if (/^[^\W\d_]+$/) {
        print "$_: alphabetic\n";
    } else {
        print "$_: line noise\n";
    }
}
__END__
silly
façade
coöperate
niño
Renée
Molière
hæmoglobin
naïve
tschüß
random!stuff#here
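To see why the double negative works, it can help to peel the class apart one exclusion at a time. The following is only a sketch restricted to ASCII input, so it needs no locale; the classify helper is hypothetical and simply names which part of [^\W\d_] rejects a character.

```perl
use strict;
use warnings;

# Hypothetical helper: report which exclusion in [^\W\d_] applies to
# a single character, or "letter" when none of them does.
sub classify {
    my ($ch) = @_;
    return "non-alphanumunder" if $ch =~ /^\W$/;
    return "digit"             if $ch =~ /^\d$/;
    return "underscore"        if $ch eq "_";
    return "letter";           # all that [^\W\d_] still admits
}

for my $ch ("a", "Z", "7", "_", "!") {
    my $verdict = $ch =~ /^[^\W\d_]$/ ? "matches" : "does not match";
    printf "%-3s %-18s %s\n", "'$ch'", classify($ch), $verdict;
}
```

Only the characters that fall through every exclusion, the letters, should be reported as matching.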
POSIX character classes help a little here; available ones are alpha,
alnum, ascii, blank, cntrl, digit, graph, lower, print, punct, space,
upper, word, and xdigit. These are valid only within a
square-bracketed character class specification:
$phone =~ /\b[:digit:]{3}[[:space:][:punct:]]?[:digit:]{4}\b/;         # WRONG
$phone =~ /\b[[:digit:]]{3}[[:space:][:punct:]]?[[:digit:]]{4}\b/;     # RIGHT
It would be easier to use properties instead, because they don't have
to occur only within other square brackets:
$phone =~ /\b\p{Number}{3}[\p{Space}\p{Punctuation}]?\p{Number}{4}\b/;
$phone =~ /\b\pN{3}[\s\pP]?\pN{4}\b/; # abbreviated form; \pS would mean Symbol, so \s stands in for \p{Space}
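The two spellings of the phone pattern are close but not interchangeable, which a small comparison can make visible. This is a sketch only: the helper names posix_phone and prop_phone and the sample strings are invented for the illustration.

```perl
use strict;
use warnings;

# Hypothetical helpers wrapping the POSIX-class and property versions
# of the seven-digit phone pattern.
sub posix_phone {
    return $_[0] =~ /\b[[:digit:]]{3}[[:space:][:punct:]]?[[:digit:]]{4}\b/ ? 1 : 0;
}
sub prop_phone {
    return $_[0] =~ /\b\p{Number}{3}[\p{Space}\p{Punctuation}]?\p{Number}{4}\b/ ? 1 : 0;
}

for my $candidate ("555-1212", "555 1212", "5551212", "555+1212", "55-51212") {
    printf "%-9s posix=%d property=%d\n",
        $candidate, posix_phone($candidate), prop_phone($candidate);
}
```

The two should disagree on "555+1212": [[:punct:]] accepts the plus sign, whereas Unicode classifies it as a symbol rather than punctuation, so \p{Punctuation} does not.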
Match any one character with Unicode property prop using \p{prop}; to
match any character lacking that property, use \P{prop} or
[^\p{prop}]. The relevant property when looking for alphabetics is
Alphabetic. The closely related general category Letter, which can be
abbreviated as just L, is slightly narrower: Alphabetic also takes in
a few non-letter characters, such as letter numbers. Other relevant
properties include UppercaseLetter, LowercaseLetter, and
TitlecaseLetter; their short forms are Lu, Ll, and Lt, respectively.
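A short sketch of these properties at work follows. The letter_case helper is hypothetical, and the sample characters are given as code points so the script does not depend on its own file encoding; U+01C5 is one of the few titlecase letters.

```perl
use strict;
use warnings;

# Hypothetical helper: name the case category of a single character
# using the \p{...} and \P{...} forms described above.
sub letter_case {
    my ($ch) = @_;
    return "not a letter" if $ch =~ /^\PL$/;     # lacks the Letter category
    return "uppercase"    if $ch =~ /^\p{Lu}$/;
    return "lowercase"    if $ch =~ /^\p{Ll}$/;
    return "titlecase"    if $ch =~ /^\p{Lt}$/;
    return "other letter";                       # modifier or uncased letters
}

# A, a, E-acute, e-acute, the titlecase digraph U+01C5, and a digit
for my $ch ("A", "a", "\x{C9}", "\x{E9}", "\x{1C5}", "7") {
    printf "U+%04X %s\n", ord($ch), letter_case($ch);
}
```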
6.2.4. See Also
The treatment of locales in Perl in
perllocale(1); your system's
locale(3) manpage; we discuss locales in greater
depth in Recipe 6.12; the "Perl and the POSIX
Locale" section of Chapter 7 of Mastering Regular
Expressions; also much of that book's Chapter 3