
1.8. Treating Unicode Combined Characters as Single Characters
1.8.1. Problem
You have a Unicode string
that contains combining characters, and you'd like to treat each of
these sequences as a single logical character.
1.8.2. Solution
Process them using \X in a regular expression.

$string = "fac\x{0327}ade"; # "façade"
$string =~ /fa.ade/; # fails
$string =~ /fa\Xade/; # succeeds
@chars = split(//, $string); # 7 letters in @chars
@chars = $string =~ /(.)/g; # same thing
@chars = $string =~ /(\X)/g; # 6 "letters" in @chars
1.8.3. Discussion
In Unicode, you can combine a base character with one or more
non-spacing characters following it; these are usually diacritics,
such as accent marks, cedillas, and tildes. Because Unicode also
includes precombined characters, mostly to accommodate legacy
character systems, there can be two or more ways of writing the same
thing.

For example, the word "façade" can be written with one
character between the two a's, "\x{E7}", a
character right out of Latin1 (ISO 8859-1). These characters might be
encoded into a two-byte sequence under the UTF-8 encoding that Perl
uses internally, but those two bytes still only count as one single
character. That works just fine.

There's a thornier issue. Another way to write U+00E7 is with two
different code points: a regular "c" followed by
"\x{0327}". Code point U+0327 is a non-spacing
combining character that means to go back and put a cedilla
underneath the preceding base character.

There are times when you want Perl to treat each combined character
sequence as one logical character. But because they're distinct code
points, Perl's character-related operations, including
substr, length, and regular
expression metacharacters such as /./ or
/[^abc]/, treat non-spacing combining characters as separate
characters. In a regular expression, the
\X metacharacter matches an extended Unicode
combining character sequence. Before Perl 5.12, it is exactly
equivalent to (?:\PM\pM*); from 5.12 on, \X matches a Unicode
extended grapheme cluster, a somewhat broader notion. In long-hand,
the older definition is:

(?x: # begin non-capturing group
\PM # one character without the M (mark) property,
# such as a letter
\pM # one character that does have the M (mark) property,
# such as an accent mark
* # and you can have as many marks as you want
)
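As a minimal sketch of the difference, the following compares the two spellings of "façade" under code-point operations and under \X. The variable names are ours, chosen for illustration, and the counts assume that \X keeps a base character and its marks together as described above.

```perl
use strict;
use warnings;

my $precomposed = "fa\x{E7}ade";     # U+00E7, a single code point
my $decomposed  = "fac\x{0327}ade";  # "c" followed by combining cedilla

# length counts code points, so the two spellings disagree
my $len_pre = length $precomposed;   # 6
my $len_dec = length $decomposed;    # 7

# counting \X matches counts logical characters, so they agree
my $logical_pre = () = $precomposed =~ /\X/g;   # 6
my $logical_dec = () = $decomposed  =~ /\X/g;   # 6

# substr works by code point and returns the bare "c"
my $third_code_point = substr($decomposed, 2, 1);

# the third \X match carries the cedilla along: "c\x{0327}"
my $third_logical = ($decomposed =~ /\X/g)[2];

# a pattern written with \X accepts either spelling
my $both_match = ($precomposed =~ /fa\Xade/ && $decomposed =~ /fa\Xade/) ? 1 : 0;
```

Note that the two strings still compare as unequal with eq, because they hold different code points; \X changes only how the pattern counts characters.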
Otherwise simple operations become tricky if these beasties are in
your string. Consider the approaches for reversing a word by
character from the previous recipe. Written with combining
characters, "année" and
"niño" can be expressed in Perl as
"anne\x{301}e" and
"nin\x{303}o".

for $word ("anne\x{301}e", "nin\x{303}o") {
printf "%s simple reversed to %s\n", $word,
scalar reverse $word;
printf "%s better reversed to %s\n", $word,
join("", reverse $word =~ /\X/g);
}
That produces:

année simple reversed to éenna
année better reversed to eénna
niño simple reversed to õnin
niño better reversed to oñin
In the reversals marked "simple reversed", the diacritical marking
jumped from one base character to the other one. That's because a
combining character always follows its base character, and you've
reversed the whole string, so each mark now follows a different base
character. By grabbing entire sequences of a base character plus any
combining characters that follow, then reversing that list, you avoid
this problem.
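The same split-on-\X idea extends to other character-level operations. The sketch below wraps it in two helpers; reverse_logical and logical_substr are hypothetical names introduced here for illustration, not built-ins or functions from any module, and logical_substr handles only non-negative offsets.

```perl
use strict;
use warnings;

# Hypothetical helper: reverse a string by logical character,
# keeping each base character together with its combining marks.
sub reverse_logical {
    my ($string) = @_;
    return join("", reverse $string =~ /\X/g);
}

# Hypothetical helper: like substr, but the offset and count are
# measured in logical characters rather than code points.
sub logical_substr {
    my ($string, $offset, $count) = @_;
    my @chars = $string =~ /\X/g;
    return "" if $offset > $#chars;
    my $last = $offset + $count - 1;
    $last = $#chars if $last > $#chars;    # clamp to end of string
    return join("", @chars[$offset .. $last]);
}

my $reversed = reverse_logical("nin\x{303}o");         # "on\x{303}in"
my $middle   = logical_substr("anne\x{301}e", 3, 2);   # "e\x{301}e"
```

Splitting the whole string on every call is simple but costs time proportional to the string's length; for repeated access to the same string, splitting once and keeping the resulting list is likely the better choice.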
1.8.4. See Also
The perlre(1) and
perluniintro(1) manpages; Chapter 15 of
Programming Perl; Recipe 1.9

Copyright © 2003 O'Reilly & Associates. All rights reserved.