Perl CD Bookshelf [Electronic resources], text version

8.21. Converting Microsoft Text Files into Unicode


8.21.1. Problem


You have a text file written on a Microsoft computer that looks like
garbage when displayed. How do you fix this?

8.21.2. Solution


Set the encoding layer appropriately when reading to convert this
into Unicode:

binmode(IFH, ":encoding(cp1252)")
    || die "can't binmode to cp1252 encoding: $!";

8.21.3. Discussion


Suppose someone sends you a file in cp1252 format, Microsoft's
default in-house 8-bit character set. Files in this format can be
annoying to read—while they might claim to be Latin1, they are
not, and if you look at them with Latin1 fonts loaded, you'll get
garbage on your screen. A simple solution is as follows:

open(MSMESS, "< :crlf :encoding(cp1252)", $inputfile)
    || die "can't open $inputfile: $!";

Now data read from that handle will be automatically converted into
Unicode when you read it in. It will also be processed in CRLF mode,
which is needed on systems that don't use that sequence to indicate
end of line.
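
As a rough, self-contained illustration of that open call, the sketch below writes a few raw cp1252 bytes to a temporary file and reads them back through the same layers. The temporary file, the sample bytes, and the variable names are assumptions made for the demonstration, and a lexical filehandle stands in for MSMESS.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Write raw cp1252 bytes: curly quotes (0x93, 0x94), a euro sign (0x80),
# and a CRLF line ending. binmode with no layers writes the bytes as-is.
my ($tmp, $inputfile) = tempfile(UNLINK => 1);
binmode($tmp);
print {$tmp} "\x93Hi\x94 \x{80}5\x0D\x0A";
close($tmp) || die "can't close $inputfile: $!";

# Same layers as in the recipe: CRLF translation plus cp1252 decoding.
open(my $msmess, "< :crlf :encoding(cp1252)", $inputfile)
    || die "can't open $inputfile: $!";
my $line = <$msmess>;
close($msmess);

chomp($line);    # the CRLF has already been turned into a plain newline

# Show the decoded characters as code points (ASCII-only output).
my @codepoints = map { sprintf "U+%04X", ord } split //, $line;
print "@codepoints\n";
```

The curly quotes and the euro sign come back as code points above 255, which is why the text can no longer be treated as an 8-bit string.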

You probably won't be able to write out this text as Latin1. That's
because cp1252 includes characters that don't exist in Latin1. You'll
have to leave it in Unicode, and displaying Unicode properly may not
be as easy as you wish, because finding tools to work with Unicode is
something of a quest in its own right. Most web browsers support
ISO 10646 fonts; that is, Unicode fonts (see
http://www.cl.cam.ac.uk/~mgk25/ucs-fonts.html). Whether your text
editor does is a different matter, although both emacs and vi
(actually, vim, not nvi) have mechanisms for handling Unicode. The
authors used the following xterm(1) command to look at text:

xterm -n unicode -u8 -fn -misc-fixed-medium-r-normal--20-200-75-75-c-100-iso10646-1

But many open questions still exist, such as cutting and pasting of
Unicode data between windows.

The www.unicode.org site has help
for finding and installing suitable tools for a variety of platforms,
including both Unix and Microsoft systems.
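
To make the earlier point about Latin1 concrete, here is a small sketch using the Encode module. It decodes two cp1252 curly quotes and then tries to encode the result as ISO 8859-1 with strict checking, which is expected to fail because those characters have no Latin1 equivalent. The sample string and variable names are assumptions for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# cp1252 bytes 0x93 and 0x94 are curly double quotes.
my $bytes = "\x93quoted\x94";
my $text  = decode("cp1252", $bytes);

# Encode a copy, because a CHECK argument lets Encode modify its input.
# FB_CROAK makes encode die instead of silently substituting characters.
my $copy   = $text;
my $latin1 = eval { encode("iso-8859-1", $copy, Encode::FB_CROAK) };
my $latin1_failed = !defined $latin1;
print "Latin1 encoding failed: $@" if $latin1_failed;

# UTF-8 can represent every character, so this encode succeeds.
my $utf8 = encode("UTF-8", $text);
printf "UTF-8 form is %d bytes long\n", length $utf8;
```

Without a CHECK argument, encode would substitute a replacement character rather than fail, so the strict mode is what exposes the loss.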

You'll also need to tell Perl it's all right to emit Unicode. If you
don't, you'll get a warning about a "Wide character in print" every
time you try. Assuming you're running in an xterm like the one shown
previously (or its equivalent for your system) that has Unicode fonts
available, you could just do this:

binmode(STDOUT, ":utf8");

But that requires the rest of your program to emit Unicode, which
might not be convenient. When writing new programs specifically
designed for this, though, it might not be too much trouble.
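
The effect of the layer can be observed without a Unicode terminal. The sketch below prints the same string to two in-memory filehandles, which stand in for STDOUT here (an assumption made so the result can be inspected), and collects any warnings with a __WARN__ handler. Only the handle without the :utf8 layer should trigger the warning.

```perl
use strict;
use warnings;

my $text = "\x{201C}hello\x{201D}";    # curly quotes, code points above 255

# Collect warnings instead of letting them go to STDERR.
my @warnings;
local $SIG{__WARN__} = sub { push @warnings, @_ };

# No encoding layer: print warns about the wide characters.
open(my $plain, ">", \my $plain_buf) || die "can't open in-memory file: $!";
print {$plain} $text;
close($plain);
my $warned_without_layer = scalar @warnings;
my $first_warning        = $warnings[0];

# With the :utf8 layer: the same print is silent.
@warnings = ();
open(my $utf8, ">", \my $utf8_buf) || die "can't open in-memory file: $!";
binmode($utf8, ":utf8") || die "can't binmode to utf8: $!";
print {$utf8} $text;
close($utf8);
my $warned_with_layer = scalar @warnings;

print "warnings without layer: $warned_without_layer\n";
print "warnings with layer:    $warned_with_layer\n";
```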

As of v5.8.1, Perl offers a couple of other means of getting this
effect. The -C command-line switch
controls some Unicode features related to your runtime environment.
This way you can set those features on a per-command basis without
having to edit the source code.

The -C switch can be followed by either a number or a list of option
letters. Some available letters, their numeric values, and effects
are as follows:

    Letter  Number  Meaning
    I       1       STDIN is assumed to be in UTF-8
    O       2       STDOUT will be in UTF-8
    E       4       STDERR will be in UTF-8
    S       7       I + O + E
    i       8       UTF-8 is the default PerlIO layer for input streams
    o       16      UTF-8 is the default PerlIO layer for output streams
    D       24      i + o
    A       32      the @ARGV elements are expected to be strings
                    encoded in UTF-8

You may use letters or numbers. If you use numbers, you have to add
them up. For example, -COE and -C6 are synonyms: both select UTF-8 on
STDOUT and STDERR.
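
The arithmetic can be sketched in a few lines of Perl. The letter values are copied from the table above; the unicode_flags helper is hypothetical and exists only for this illustration. It combines values with bitwise OR rather than plain addition, so that the composite letters S and D do not count a bit twice when mixed with their components.

```perl
use strict;
use warnings;

# Letter values as listed in the table above (not the complete set).
my %bit_for = (
    I => 1, O => 2,  E => 4,  S => 7,
    i => 8, o => 16, D => 24, A => 32,
);

# Hypothetical helper: turn a string of -C option letters into a number.
sub unicode_flags {
    my ($letters) = @_;
    my $flags = 0;
    for my $letter (split //, $letters) {
        exists $bit_for{$letter}
            or die "unknown -C letter: $letter";
        $flags |= $bit_for{$letter};    # OR keeps overlapping bits single
    }
    return $flags;
}

printf "-C%s is the same as -C%d\n", $_, unicode_flags($_)
    for qw(OE S SD);
```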

One last approach is to use the PERL_UNICODE environment variable. If
set, it contains the same value as you would use with -C. For example,
with the xterm that has Unicode fonts loaded, you could do this in a
POSIX shell:

sh% export PERL_UNICODE=6

or this in the csh:

csh% setenv PERL_UNICODE 6

The advantage of using the environment variable is that you don't
have to edit the source code as a binmode call would require, and you
don't even need to change the command invocation as setting -C would
require.
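
One way to check that the setting reaches a program is to start a child perl with the variable set and ask it what it sees. The sketch below assumes a perl recent enough to provide the ${^UNICODE} variable, which reports the flags in effect; the one-line child program and the variable names are assumptions for illustration.

```perl
use strict;
use warnings;

# Set the variable for child processes started from this point on.
local $ENV{PERL_UNICODE} = 6;

# $^X is the perl running this script. The list form of a pipe open
# avoids shell quoting problems with the one-line child program.
open(my $child, "-|", $^X, "-e", 'print ${^UNICODE}')
    || die "can't run $^X: $!";
my $reported = <$child>;
close($child) || die "child perl failed: status $?";

print "child perl reports Unicode flags: $reported\n";
```

The child was started with no -C switch and no change to its source, yet it reports the flags taken from the environment.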

8.21.4. See Also


The perlrun(1), encoding(3), PerlIO(3), and Encode(3) manpages

