6.12. Honoring Locale Settings in Regular Expressions
6.12.1. Problem
You want to convert case according to the rules of a
different locale, or you want to make \w match
letters with diacritics, such as José
or déjà vu.
For example, let's say you're given half a gigabyte of text written
in German and told to index it. You want to extract words (with
\w+) and convert them to lowercase (with
lc or \L), but the normal
versions of \w and lc neither
match the German words nor change the case of accented letters.
6.12.2. Solution
Perl's
regular-expression and text-manipulation routines have hooks to the
POSIX locale setting. Under the use
locale pragma, accented characters are handled
correctly, assuming a reasonable LC_CTYPE
specification and system support for the same.
use locale;
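As a minimal sketch of how the pragma might be used for the indexing task described in the Problem, the following extracts words and lowercases them under whatever LC_CTYPE the environment specifies. The helper name lc_words is hypothetical, and the explicit setlocale call is an assumption about how you might want failures reported; Perl usually picks up the environment's locale at startup on its own.

```perl
#!/usr/bin/perl
# lcwords - hypothetical sketch: lowercase words under the environment's locale
use strict;
use warnings;
use locale;                 # lexically scoped: affects \w, lc, uc, and friends
use POSIX qw(locale_h);     # imports setlocale() and the LC_* constants

# An empty string asks setlocale() to consult the environment
# (LC_ALL, LC_CTYPE, LANG). It returns false if that locale is unavailable.
setlocale(LC_CTYPE, "")
    or warn "Could not set LC_CTYPE from the environment\n";

# lc_words() is an illustrative helper, not part of any module.
sub lc_words {
    my ($text) = @_;
    my @words;
    while ($text =~ /(\w+)/g) {
        push @words, lc $1;     # lc follows the current LC_CTYPE rules
    }
    return @words;
}

print join(" ", lc_words("Hello World")), "\n";
```

Which non-ASCII bytes count as letters here depends entirely on the locale in effect, so the same script can give different results on different systems.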
6.12.3. Discussion
By default, \w+ and the case-mapping functions recognize only
ASCII upper- and lowercase letters, digits, and underscores. This works
only for the simplest of English words, failing even on many common
imports. The use locale
directive redefines what a "word character" means.
In Example 6-7 you see the difference in output
between having selected the English ("en") locale and the German
("de") one.
Example 6-7. localeg
#!/usr/bin/perl -w
# localeg - demonstrate locale effects
use locale;
use POSIX 'locale_h';
$name = "andreas k\xF6nig";
@locale{qw(German English)} = qw(de_DE.ISO_8859-1 us-ascii);
setlocale(LC_CTYPE, $locale{English})
or die "Invalid locale $locale{English}";
@english_names = ( );
while ($name =~ /\b(\w+)\b/g) {
push(@english_names, ucfirst($1));
}
setlocale(LC_CTYPE, $locale{German})
or die "Invalid locale $locale{German}";
@german_names = ( );
while ($name =~ /\b(\w+)\b/g) {
push(@german_names, ucfirst($1));
}
print "English names: @english_names\n";
print "German names: @german_names\n";
The program prints:
English names: Andreas K Nig
German names: Andreas König
This approach relies on POSIX locale support for 8-bit character
encodings, which your system may or may not provide. Even if your
system does claim to provide POSIX locale support, the standard does
not specify the locale names. As you might guess, portability of this
approach is not assured. If your data is already in Unicode, you
don't need POSIX locales for this to work.
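To illustrate the Unicode route, here is a hedged sketch that runs the same word extraction with no locale at all. It assumes the input arrives as ISO-8859-1 bytes and decodes it into a Perl character string with the core Encode module; the helper name capitalize_words is made up for this example.

```perl
#!/usr/bin/perl
# uniwords - hypothetical sketch: the localeg loop on decoded Unicode text
use strict;
use warnings;
use Encode qw(decode);

# capitalize_words() is an illustrative helper, not part of any module.
# It expects a decoded character string, not raw bytes.
sub capitalize_words {
    my ($text) = @_;
    my @names;
    while ($text =~ /\b(\w+)\b/g) {
        push @names, ucfirst($1);
    }
    return @names;
}

# Assumption: the raw data is encoded in ISO-8859-1.
my $bytes = "andreas k\xF6nig";
my $name  = decode("ISO-8859-1", $bytes);    # bytes -> characters

binmode(STDOUT, ":encoding(UTF-8)");         # encode characters on output
print join(" ", capitalize_words($name)), "\n";   # Andreas König
```

Because the decoded string carries Unicode semantics, \w matches the ö without any setlocale call, so the result does not depend on which locale names the system happens to provide.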
6.12.4. See Also
The treatment of \b, \w, and
\s in perlre(1) and in the
"Classic Perl Character Class Shortcuts" section of Chapter 5 of
Programming Perl; the treatment of locales in
Perl in perllocale(1); your system's
locale(3) manpage; we discuss locales in greater
depth in Recipe 6.2; the "POSIX—An
Attempt at Standardization" section of Chapter 3 of
Mastering Regular Expressions