Perl Cd Bookshelf [Electronic resources] (text version)


1.8. Treating Unicode Combined Characters as Single Characters


1.8.1. Problem



You have a Unicode string
that contains combining characters, and you'd like to treat each of
these sequences as a single logical character.

1.8.2. Solution


Process them using \X in a regular expression.

$string = "fac\x{0327}ade"; # "façade"
$string =~ /fa.ade/; # fails
$string =~ /fa\Xade/; # succeeds
@chars = split(//, $string); # 7 letters in @chars
@chars = $string =~ /(.)/g; # same thing
@chars = $string =~ /(\X)/g; # 6 "letters" in @chars

1.8.3. Discussion


In Unicode, you can combine a base character with one or more
non-spacing characters following it; these are usually diacritics,
such as accent marks, cedillas, and tildes. Because Unicode also
includes precombined characters, mostly to accommodate legacy
character sets, there can be two or more ways of writing the same
thing.

For example, the word "façade" can be written with one
character between the two a's, "\x{E7}", a
character right out of Latin1 (ISO 8859-1). This character might be
encoded as a two-byte sequence under the UTF-8 encoding that Perl
uses internally, but those two bytes still count as only one
character. That works just fine.
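As a minimal sketch of that character-versus-byte distinction, the following uses the core Encode module to produce an explicit UTF-8 byte string and compares the two lengths. The variable names are illustrative only and are not part of the recipe.

```perl
use strict;
use warnings;
use Encode qw(encode);

my $word  = "fa\x{E7}ade";             # precomposed U+00E7 between the a's
my $bytes = encode("UTF-8", $word);    # explicit UTF-8 byte string

my $char_count = length $word;         # counts characters (code points)
my $byte_count = length $bytes;        # counts bytes after encoding

print "characters: $char_count, UTF-8 bytes: $byte_count\n";
```

The character count stays at 6 even though the encoded form needs 7 bytes.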

There's a thornier issue. Another way to write U+00E7 is with two
different code points: a regular "c" followed by
"\x{0327}". Code point U+0327 is a non-spacing
combining character that means to go back and put a cedilla
underneath the preceding base character.
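To make the two spellings concrete, here is a small sketch that compares them and looks up the official names of the code points involved. It assumes the standard charnames module and its viacode function; the variable names are made up for the example.

```perl
use strict;
use warnings;
use charnames ();

my $precomposed = "fa\x{E7}ade";       # one code point for the c-cedilla
my $decomposed  = "fac\x{0327}ade";    # "c" followed by U+0327

# A plain string comparison sees different code point sequences.
my $same = ($precomposed eq $decomposed) ? "equal" : "different";

# charnames::viacode maps a code point to its Unicode name.
my $name_e7  = charnames::viacode(0xE7);
my $name_327 = charnames::viacode(0x0327);

print "The two spellings compare as $same\n";
print "U+00E7 is $name_e7\n";
print "U+0327 is $name_327\n";
```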

There are times when you want Perl to treat each combined character
sequence as one logical character. But because they're distinct code
points, Perl's character-related operations, including
substr, length, and regular
expression metacharacters such as /./ or
/[^abc]/, treat non-spacing combining characters as separate
characters.
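The effect on length and substr can be sketched as follows. This is only an illustration under the assumption that \X is used to split the string into logical characters first; the variable names are hypothetical.

```perl
use strict;
use warnings;

my $string = "fac\x{0327}ade";               # "c" plus combining cedilla

my $code_points = length $string;            # counts every code point
my @clusters    = $string =~ /\X/g;          # one element per logical character
my $logical     = scalar @clusters;          # counts logical characters

# substr counts code points, so it can cut a mark off its base character.
my $naive = substr($string, 0, 3);           # "fac", cedilla left behind
my $safe  = join "", @clusters[0 .. 2];      # "fac" with its cedilla attached

printf "code points: %d, logical characters: %d\n", $code_points, $logical;
printf "naive prefix has %d code points, safe prefix has %d\n",
    length $naive, length $safe;
```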

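In a regular expression, the
\X metacharacter matches an extended Unicode
combining character sequence. In the Perl releases this recipe was
written for, it is exactly equivalent to
(?:\PM\pM*); since Perl 5.12, \X matches an extended grapheme
cluster, which gives the same result for a simple base character
followed by marks. In long-hand: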

(?x:        # begin non-capturing group
    \PM     # one character without the M (mark) property,
            #   such as a letter
    \pM     # one character that does have the M (mark) property,
            #   such as an accent mark
    *       # and you can have as many marks as you want
)
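A quick way to check the correspondence between \X and the long-hand pattern is to split a few strings both ways and compare the pieces. This sketch only uses the simple base-plus-mark strings from this recipe, so it says nothing about more exotic sequences; the names are illustrative.

```perl
use strict;
use warnings;

my @samples = ("fac\x{0327}ade", "anne\x{301}e", "nin\x{303}o");

my $mismatches = 0;
for my $s (@samples) {
    my @by_X    = $s =~ /\X/g;              # built-in metacharacter
    my @by_hand = $s =~ /(?:\PM\pM*)/g;     # long-hand equivalent
    # Joining with a space makes differing piece boundaries visible.
    $mismatches++ unless "@by_X" eq "@by_hand";
}

print "mismatches: $mismatches\n";
```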

Otherwise simple operations become tricky if these beasties are in
your string. Consider the approaches for reversing a word by
character from the previous recipe. Written with combining
characters, "année" and
"niño" can be expressed in Perl as
"anne\x{301}e" and
"nin\x{303}o".

binmode(STDOUT, ":utf8");    # avoid "Wide character" warnings on output
for $word ("anne\x{301}e", "nin\x{303}o") {
    printf "%s simple reversed to %s\n", $word,
        scalar reverse $word;
    printf "%s better reversed to %s\n", $word,
        join("", reverse $word =~ /\X/g);
}

That produces:

année simple reversed to éenna
année better reversed to eénna
niño simple reversed to õnin
niño better reversed to oñin

In the reversals labeled "simple reversed", the diacritical marking
jumped from one base character to the other one. That's because a
combining character always follows its base character, and you've
reversed the whole string. Grabbing entire sequences of a base
character plus any combining characters that follow, then reversing
that list, avoids the problem.
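The sequence-aware reversal can be wrapped in a small helper. The sketch below is one possible packaging; reverse_by_sequence is a hypothetical name, not a built-in or a function from the recipe.

```perl
use strict;
use warnings;

binmode(STDOUT, ":utf8");    # the strings contain wide characters

# Hypothetical helper: reverse a string one combining character
# sequence at a time, so each mark stays with its base character.
sub reverse_by_sequence {
    my ($text) = @_;
    return join "", reverse $text =~ /\X/g;
}

my $word   = "nin\x{303}o";
my $plain  = scalar reverse $word;          # tilde ends up after the "o"
my $better = reverse_by_sequence($word);    # tilde stays with its "n"

print "plain:  $plain\n";
print "better: $better\n";
```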

1.8.4. See Also


The perlre(1) and perluniintro(1) manpages; Chapter 15 of Programming Perl; Recipe 1.9







Copyright © 2003 O'Reilly & Associates. All rights reserved.
