Perl CD Bookshelf [Electronic resources] (text version)

8.20. Reading or Writing Unicode from a Filehandle


8.20.1. Problem


You have a file containing text in a particular encoding, and when you
read data from it into a Perl string, Perl treats it as a series of
8-bit bytes. You'd like to work with characters instead of bytes
because characters in your encoding can take more than one byte. Also,
if Perl doesn't know about your encoding, it may fail to identify
certain characters as letters. Similarly, you may want to output text
in a particular encoding.

8.20.2. Solution


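Use I/O layers to tell Perl that data read from or written to a
filehandle is in a particular encoding:

open(my $ifh, "<:encoding(ENCODING_NAME)", $filename)
    or die "can't open $filename for reading: $!";
open(my $ofh, ">:encoding(ENCODING_NAME)", $filename)
    or die "can't open $filename for writing: $!";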
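As a minimal sketch of the two opens working together, the following converts a small file from one encoding to another. The file names and the choice of ISO-8859-1 and UTF-8 are assumptions made for illustration only, and a temporary directory is used so that the example is self-contained.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Hypothetical file names in a scratch directory (not from the recipe).
my $dir    = tempdir(CLEANUP => 1);
my $latin1 = "$dir/in.latin1";
my $utf8   = "$dir/out.utf8";

# Create sample input: "Ärger" as raw ISO-8859-1 bytes (0xC4 is Ä).
open(my $raw, ">:raw", $latin1) or die "can't create $latin1: $!";
print {$raw} "\xC4rger\n";
close($raw) or die "can't close $latin1: $!";

# Read characters in one encoding and write them out in another.
open(my $ifh, "<:encoding(ISO-8859-1)", $latin1)
    or die "can't open $latin1: $!";
open(my $ofh, ">:encoding(UTF-8)", $utf8)
    or die "can't open $utf8: $!";
while (my $line = <$ifh>) {
    print {$ofh} $line;
}
close($ifh);
close($ofh) or die "can't close $utf8: $!";

# The output is one byte larger because Ä takes two bytes in UTF-8.
my $in_size  = -s $latin1;
my $out_size = -s $utf8;
print "input: $in_size bytes, output: $out_size bytes\n";
```

Between the two handles the program only ever sees characters; the layers do the byte-level conversion on the way in and on the way out.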

8.20.3. Discussion


Perl's text manipulation functions handle UTF-8 strings just as well
as they do 8-bit data; they just need to know what type of data
they're working with. Each string in Perl is internally marked as
either UTF-8 or 8-bit data. The encoding(...) layer converts data
between various external encodings and the internal UTF-8 within
Perl. This is done by way of the Encode module.

In the section "Unicode Support in Perl" in the Introduction to
Chapter 1, we explained how under Unicode, every different character
has a different code point (i.e., a different number) associated with
it. Assigning all characters unique code points solves many problems.
No longer does the same number, like 0xC4, represent one character
under one character repertoire (e.g., a LATIN CAPITAL LETTER A WITH
DIAERESIS under ISO-8859-1) and a different character in another
repertoire (e.g., a GREEK CAPITAL LETTER DELTA under ISO-8859-7).
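The ambiguity of a bare byte value can be seen directly with the Encode module. This is a small sketch, not part of the original recipe: it decodes the single byte 0xC4 under the two repertoires named above and looks up the Unicode name of the resulting code point with the standard charnames module. The hash %name_for is a name chosen here for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode);
use charnames ();

# Decode the same byte under two legacy repertoires and record
# the Unicode name of the character each one yields.
my %name_for;
for my $encoding ("ISO-8859-1", "ISO-8859-7") {
    my $bytes = "\xC4";
    my $char  = decode($encoding, $bytes);
    $name_for{$encoding} = charnames::viacode(ord $char);
    printf "%-10s 0xC4 -> U+%04X %s\n",
        $encoding, ord($char), $name_for{$encoding};
}
```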

Unique code points still leave one important issue open: the precise
format used in memory or on disk for each code point. If most code
points fit in 8 bits, it would seem wasteful to use, say, a full 32
bits for each character. But if every character is the same size as
every other character, the code is easier to write and may be faster
to execute.

This has given rise to different encoding systems for storing
Unicode, each offering distinct advantages. Fixed-width encodings fit
every code point into the same number of bits, which simplifies
programming but at the expense of some wasted space. Variable-width
encodings use only as much space as each code point requires, which
saves space but complicates programming.

One further complication is combined characters, which may look like
single letters on paper but in code require multiple code points.
When you see a capital A with two dots above it (a diaeresis) on your
screen, it may not even be character U+00C4. As explained in Recipe
1.8, Unicode supports the idea of combining characters, where you
start with a base character and add non-spacing marks to it. U+0308
is a "COMBINING DIAERESIS", so you could use a capital A (U+0041)
followed by U+0308, written "A\x{308}" in a Perl string, to produce
the same output.
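The practical consequence is that two strings that display identically may differ in length and compare unequal. The sketch below, which is an illustration and not code from the recipe, compares the precomposed and combined forms and then uses the standard Unicode::Normalize module to bring both to the same normalization form before comparing. The variable names are chosen here for illustration.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC);

my $precomposed = "\x{C4}";     # U+00C4 alone
my $combined    = "A\x{308}";   # U+0041 followed by U+0308

# The two forms differ in length and are not equal as stored.
my $len_precomposed = length $precomposed;
my $len_combined    = length $combined;
my $equal_raw       = $precomposed eq $combined ? 1 : 0;

# After normalizing both to composed form (NFC), they compare equal.
my $equal_nfc = NFC($precomposed) eq NFC($combined) ? 1 : 0;

print "lengths: $len_precomposed and $len_combined\n";
print "equal as stored: $equal_raw, equal after NFC: $equal_nfc\n";
```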

The following table shows the old ISO 8859-1 way of writing a capital
A with a diaeresis, in which the logical character code and the
physical byte layout enjoyed an identical representation, and the new
way under Unicode. We'll include both ways of writing that character:
one precomposed in one code point and the other using two code points
to create a combined character.

                      Old way     New way
                      Ä           A        Ä          Ä
Character(s)          0xC4        U+0041   U+00C4     U+0041 U+0308
Character repertoire  ISO 8859-1  Unicode  Unicode    Unicode
Character code(s)     0xC4        0x0041   0x00C4     0x0041 0x0308
Encoding              -           UTF-8    UTF-8      UTF-8
Byte(s)               0xC4        0x41     0xC3 0x84  0x41 0xCC 0x88


The internal format used by Perl is UTF-8, a variable-width encoding
system. One reason for this choice is that legacy ASCII requires no
conversion for UTF-8, looking in memory exactly as it did before: just
one byte per character. Character U+0041 is just 0x41 in memory.
Legacy ASCII data sets don't increase in size, and even those using
Western character sets like ISO 8859-n grow only slightly, since in
practice you still have a favorable ratio of regular ASCII characters
to 8-bit accented characters.
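To make the size argument concrete, this short sketch counts the bytes that Encode produces for a pure ASCII string and for a string with a couple of accented letters. The sample strings are assumptions picked for illustration.

```perl
use strict;
use warnings;
use Encode qw(encode);

my $ascii  = "plain ASCII text";       # 16 characters, all ASCII
my $accent = "na\x{EF}ve caf\x{E9}";   # "naïve café": 10 characters, 2 accented

# ASCII text occupies the same number of bytes in UTF-8.
my $ascii_bytes = length encode("UTF-8", $ascii);

# Accented Latin-1 text grows by one byte per accented character.
my $latin1_bytes = length encode("ISO-8859-1", $accent);
my $utf8_bytes   = length encode("UTF-8", $accent);

print "ASCII as UTF-8:  $ascii_bytes bytes\n";
print "accented Latin-1: $latin1_bytes bytes, as UTF-8: $utf8_bytes bytes\n";
```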

Just because Perl uses UTF-8 internally doesn't preclude using other
formats externally. Perl automatically converts all data between
UTF-8 and whatever encoding you've specified for that handle. The
Encode module is used implicitly when you specify an I/O layer of the
form ":encoding(...)". For example:

binmode(FH, ":encoding(UTF-16BE)")
    or die "can't binmode to utf-16be: $!";

or directly in the open:

open(FH, "< :encoding(UTF-32)", $pathname)
    or die "can't open $pathname: $!";
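One way to see what such a layer actually writes is to print a character through it and then read the file back as raw bytes. The following sketch assumes a temporary file and uses a lexical filehandle rather than the bareword FH shown above; both choices are for illustration only.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Write one character, U+00C4, through a UTF-16BE layer.
my ($fh, $pathname) = tempfile(UNLINK => 1);
binmode($fh, ":encoding(UTF-16BE)")
    or die "can't binmode to utf-16be: $!";
print {$fh} "\x{C4}";
close($fh) or die "can't close $pathname: $!";

# Read the file back with no decoding to inspect the bytes on disk.
open(my $raw, "<:raw", $pathname) or die "can't open $pathname: $!";
my $bytes = do { local $/; <$raw> };
close($raw);

my $hex = join " ", map { sprintf "%02x", ord($_) } split //, $bytes;
print "$hex\n";   # prints "00 c4"
```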

Here's a comparison of actual byte layouts of those two sequences,
both representing a capital A with diaeresis, under several other
popular formats:

          U+00C4                   U+0041 U+0308
UTF-8     c3 84                    41 cc 88
UTF-16BE  00 c4                    00 41 03 08
UTF-16LE  c4 00                    41 00 08 03
UTF-16    fe ff 00 c4              fe ff 00 41 03 08
UTF-32LE  c4 00 00 00              41 00 00 00 08 03 00 00
UTF-32BE  00 00 00 c4              00 00 00 41 00 00 03 08
UTF-32    00 00 fe ff 00 00 00 c4  00 00 fe ff 00 00 00 41 00 00 03 08
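
These byte layouts can be regenerated with the Encode module, which is a convenient way to check them or to add further encodings. This is a sketch for illustration; hex_bytes is a hypothetical helper name, not something provided by Encode.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: encode a string and return its bytes
# as space-separated lowercase hex pairs.
sub hex_bytes {
    my ($encoding, $string) = @_;
    my $octets = encode($encoding, $string);
    return join " ", map { sprintf "%02x", ord($_) } split //, $octets;
}

my @encodings = qw(UTF-8 UTF-16BE UTF-16LE UTF-16 UTF-32LE UTF-32BE UTF-32);
for my $encoding (@encodings) {
    printf "%-9s %-24s %s\n", $encoding,
        hex_bytes($encoding, "\x{C4}"),      # precomposed form
        hex_bytes($encoding, "A\x{308}");    # base letter plus combining mark
}
```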


The wider encodings can chew up memory quickly. Matters are also
complicated by the fact that some computers are big-endian, others
little-endian. So encoding formats that don't specify their
endianness require a special byte-ordering sequence, the byte-order
mark U+FEFF, which appears as "FE FF" in big-endian UTF-16 and as
"FF FE" in little-endian UTF-16. It is usually needed only at the
start of the stream.
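The effect of the byte-ordering sequence can be sketched with Encode's decode function: given the endian-neutral name UTF-16, it reads the leading mark to decide how to interpret the rest. The two byte strings below are assumptions constructed for illustration, each holding a mark followed by U+00C4.

```perl
use strict;
use warnings;
use Encode qw(decode);

# The same character, U+00C4, in both byte orders, each with its mark.
my $big_endian    = "\xFE\xFF\x00\xC4";
my $little_endian = "\xFF\xFE\xC4\x00";

# Decoding as plain "UTF-16" lets the mark select the byte order;
# the mark itself is not part of the resulting string.
my $from_be = decode("UTF-16", $big_endian);
my $from_le = decode("UTF-16", $little_endian);

printf "U+%04X U+%04X\n", ord($from_be), ord($from_le);
```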

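If you're reading or writing UTF-8 data, use the :utf8 layer. Because
Perl natively uses UTF-8, the :utf8 layer bypasses the Encode module
for performance. The price of that shortcut is that :utf8 does not
check that input is well-formed UTF-8; for data you don't trust,
:encoding(UTF-8) is the stricter choice.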
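
The difference the layer makes can be seen by reading the same bytes twice, once without a layer and once with :utf8. This sketch uses an in-memory filehandle so that it needs no file on disk; the sample text and variable names are assumptions for illustration.

```perl
use strict;
use warnings;

my $octets = "caf\xC3\xA9\n";   # "café" as UTF-8 bytes, plus a newline

# Without a layer, Perl hands back the raw bytes.
open(my $as_bytes, "<", \$octets) or die "can't open in-memory file: $!";
my $byte_line = <$as_bytes>;
close($as_bytes);

# With :utf8, the same bytes arrive as characters.
open(my $as_chars, "<:utf8", \$octets) or die "can't open in-memory file: $!";
my $char_line = <$as_chars>;
close($as_chars);

chomp($byte_line, $char_line);
my $byte_count = length $byte_line;   # 5: the é occupies two bytes
my $char_count = length $char_line;   # 4: the é is one character

# Only the character string is recognized as consisting of letters.
my $bytes_are_word = $byte_line =~ /^\w+$/ ? 1 : 0;
my $chars_are_word = $char_line =~ /^\w+$/ ? 1 : 0;

print "bytes: $byte_count, characters: $char_count\n";
print "word match on bytes: $bytes_are_word, on characters: $chars_are_word\n";
```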

The Encode module understands many aliases for encodings, so ascii,
US-ascii, and ISO-646-US are synonymous. Read the Encode::Supported
manpage for a list of available encodings. Perl supports not only
standard names but vendor-specific encodings, too; for example, the
usual counterpart of iso-8859-1 is cp850 on DOS, cp1252 on Windows,
MacRoman on a Mac, and hp-roman8 on HP-UX. The Encode module
recognizes all of these names, but they are related encodings rather
than aliases for one another: they agree on ASCII and differ in where
they place accented and special characters.
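
To check how Encode resolves a given name, you can ask it for the encoding object and print its canonical name. The sketch below does this for the three ASCII aliases mentioned above; the hash %canonical is a name chosen here for illustration.

```perl
use strict;
use warnings;
use Encode qw(find_encoding);

# Look up each alias and record the canonical name Encode reports.
my %canonical;
for my $alias (qw(ascii US-ascii ISO-646-US)) {
    my $encoding = find_encoding($alias)
        or die "unknown encoding name: $alias";
    $canonical{$alias} = $encoding->name;
    print "$alias -> $canonical{$alias}\n";
}
```

Because find_encoding returns undef for a name it doesn't know, the same lookup doubles as a check before passing a user-supplied name to an :encoding(...) layer.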

8.20.4. See Also


The documentation for the standard Encode module; the
Encode::Supported manpage; Recipe 8.12 and
Recipe 8.19
