Perl Cd Bookshelf [Electronic resources] (text version)

6.12. Honoring Locale Settings in Regular Expressions

6.12.1. Problem

You want to translate case when in a
different locale, or you want to make \w match
letters with diacritics, such as José
or déjà vu.

For example, let's say you're given half a gigabyte of text written
in German and told to index it. You want to extract words (with
\w+) and convert them to lowercase (with
lc or \L), but the normal
versions of \w and lc neither
match complete German words nor change the case of accented letters.

6.12.2. Solution

Perl's
regular-expression and text-manipulation routines have hooks to the
POSIX locale setting. Under the use
locale pragma, accented characters are handled
correctly, assuming a reasonable LC_CTYPE
specification and system support for the same.

use locale;

6.12.3. Discussion

By default, \w+ matches only unaccented upper- and lowercase
letters, digits, and underscores, and the case-mapping functions
change only those same unaccented letters. This works
only for the simplest of English words, failing even on many common
imports. The use locale
directive redefines what a "word character" means, taking the
definition from the current LC_CTYPE setting.
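As a rough illustration of the indexing task described in the Problem, here is a minimal sketch that extracts words with \w+ and lowercases them under use locale. The helper name index_words and the sample text are assumptions made for this illustration, and the result for accented input depends on which locale your environment selects.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use locale;                 # \w and lc now consult LC_CTYPE
use POSIX qw(locale_h);     # setlocale and the LC_* constants

# Adopt whatever locale the environment requests (LANG, LC_ALL, ...).
# An undefined return value means the system rejected that locale.
my $current = setlocale(LC_CTYPE, "");
warn "could not set LC_CTYPE from the environment\n"
    unless defined $current;

# index_words is a hypothetical helper: it extracts each run of
# word characters and lowercases it using the active locale.
sub index_words {
    my ($text) = @_;
    my @words;
    while ($text =~ /(\w+)/g) {
        push @words, lc $1;
    }
    return @words;
}

print join(" ", index_words("Der Hund HAT Drei_Beine")), "\n";
```

With plain ASCII input like this, the locale makes no visible difference. The pragma matters only once the text contains 8-bit accented letters and LC_CTYPE names a locale that knows about them.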

In Example 6-7 you see the difference in output
between having selected the English ("en") locale and the German
("de") one.

Example 6-7. localeg

  #!/usr/bin/perl -w
  # localeg - demonstrate locale effects
  use locale;
  use POSIX 'locale_h';

  $name = "andreas k\xF6nig";     # \xF6 is o-umlaut in ISO 8859-1
  @locale{qw(German English)} = qw(de_DE.ISO_8859-1 us-ascii);

  # capitalize each word under the English locale
  setlocale(LC_CTYPE, $locale{English})
      or die "Invalid locale $locale{English}";
  @english_names = ( );
  while ($name =~ /\b(\w+)\b/g) {
      push(@english_names, ucfirst($1));
  }

  # repeat the same loop under the German locale
  setlocale(LC_CTYPE, $locale{German})
      or die "Invalid locale $locale{German}";
  @german_names = ( );
  while ($name =~ /\b(\w+)\b/g) {
      push(@german_names, ucfirst($1));
  }

  print "English names: @english_names\n";
  print "German names:  @german_names\n";

  English names: Andreas K Nig
  German names:  Andreas König
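The example dies if either hard-coded locale name is unknown to the system. One possible way to soften that, sketched here under assumptions, is to try several spellings and keep the first one setlocale accepts. The helper name first_available_locale and the candidate list are hypothetical. The names your system really provides may differ (on many Unix systems, locale -a lists them).

```perl
#!/usr/bin/perl
use strict;
use warnings;
use POSIX qw(locale_h);     # setlocale and the LC_* constants

# first_available_locale is a hypothetical helper. It tries each
# candidate name in turn and returns the first one that setlocale
# accepts, or undef if none works.
# Side effect: LC_CTYPE stays set to the locale that was accepted.
sub first_available_locale {
    my @candidates = @_;
    for my $name (@candidates) {
        return $name if defined(setlocale(LC_CTYPE, $name));
    }
    return;
}

# Assumed spellings of a German Latin-1 locale, not an authoritative list.
my @german = qw(de_DE.ISO8859-1 de_DE.ISO_8859-1 de_DE.iso88591 de_DE);

my $found = first_available_locale(@german);
print defined $found ? "Using locale $found\n"
                     : "No German locale available\n";
```

Some C libraries accept almost any locale name, so a defined return value is a hint rather than proof that the locale behaves as expected.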

This approach relies on POSIX locale support for 8-bit character
encodings, which your system may or may not provide. Even if your
system does claim to provide POSIX locale support, the standard does
not specify the locale names. As you might guess, portability of this
approach is not assured. If your data is already in Unicode, you
don't need POSIX locales for this to work.
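To illustrate the Unicode route, here is a minimal sketch of the same capitalization loop applied to a decoded character string, with no use locale and no setlocale call. It assumes the input arrives as UTF-8 encoded bytes and uses the core Encode module to convert them. The helper name capitalize_words is hypothetical.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode encode);

# capitalize_words is a hypothetical helper. It expects a decoded
# (character) string, so \w and ucfirst follow Unicode rules and
# no locale is consulted.
sub capitalize_words {
    my ($text) = @_;
    my @names;
    while ($text =~ /\b(\w+)\b/g) {
        push @names, ucfirst $1;
    }
    return @names;
}

# Assumption: the data is UTF-8 encoded; \xC3\xB6 is o-umlaut in UTF-8.
my $bytes = "andreas k\xC3\xB6nig";
my $name  = decode("UTF-8", $bytes);

my @names = capitalize_words($name);

# Encode back to UTF-8 bytes for output.
print encode("UTF-8", "@names"), "\n";
```

Run as written, this should print Andreas König with the umlaut intact, provided the terminal expects UTF-8. The key step is decoding the bytes first. Applied to the raw byte string, the loop would split the name at the umlaut.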

6.12.4. See Also

The treatment of \b, \w, and
\s in perlre(1) and in the
"Classic Perl Character Class Shortcuts" section of Chapter 5 of
Programming Perl; the treatment of locales in
Perl in perllocale(1); your system's
locale(3) manpage; we discuss locales in greater
depth in Recipe 6.2; the "POSIX—An
Attempt at Standardization" section of Chapter 3 of
Mastering Regular Expressions
