Python Cookbook, 2nd Edition [Electronic resources] - Text Version

David Ascher, Alex Martelli, Anna Ravenscroft


Recipe 1.21. Converting Between Unicode and Plain Strings


Credit: David Ascher, Paul Prescod


Problem




You need to deal with textual data
that doesn't necessarily fit in the ASCII character
set.


Solution


Unicode strings can be encoded in plain strings in a variety of ways,
according to whichever encoding you choose:

unicodestring = u"Hello world"
# Convert Unicode to plain Python string: "encode"
utf8string = unicodestring.encode("utf-8")
asciistring = unicodestring.encode("ascii")
isostring = unicodestring.encode("ISO-8859-1")
utf16string = unicodestring.encode("utf-16")
# Convert plain Python string to Unicode: "decode"
plainstring1 = unicode(utf8string, "utf-8")
plainstring2 = unicode(asciistring, "ascii")
plainstring3 = unicode(isostring, "ISO-8859-1")
plainstring4 = unicode(utf16string, "utf-16")
assert plainstring1 == plainstring2 == plainstring3 == plainstring4


Discussion


If you find yourself dealing with text that contains non-ASCII
characters, you have to learn about Unicode: what it is, how it
works, and how Python uses it. The preceding Recipe 1.20 offers minimal but crucial
practical tips, and this recipe tries to offer more perspective.

You don't need to know everything about Unicode to
be able to solve real-world problems with it, but a few basic tidbits
of knowledge are indispensable. First, you must understand the
difference between bytes and characters. In older, ASCII-centric
languages and environments, bytes and characters are treated as if
they were the same thing. A byte can hold up to 256 different values,
so these environments are limited to dealing with no more than 256
distinct characters. Unicode, on the other hand, has tens of
thousands of characters, which means that each Unicode character
takes more than one byte; thus you need to make the distinction
between characters and bytes.

Standard Python strings are really bytestrings, and a Python
character, being such a string of length 1, is really a byte. Other
terms for an instance of the standard Python string type are
8-bit string and plain string. In this recipe we call such instances
bytestrings, to remind you of their byte orientation.
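To make the byte/character distinction concrete, here is a small
illustration in the same Python 2 style as the recipe's Solution (the
sample word is an arbitrary choice): once encoded, the text occupies
more bytes than it has characters:

# Bytes versus characters (Python 2 style, as in the recipe's Solution).
# The sample word "café" is just an illustrative choice.
unicodeword = u"caf\u00e9"                 # four Unicode characters
bytestring = unicodeword.encode("utf-8")   # its UTF-8 bytestring
assert len(unicodeword) == 4               # length in characters
assert len(bytestring) == 5                # length in bytes: e-acute needs two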

A Python Unicode character is an abstract object big enough to hold
any character, analogous to Python's long integers.
You don't have to worry about the internal
representation; the representation of Unicode characters becomes an
issue only when you are trying to send them to some byte-oriented
function, such as the write method of files or the
send method of network sockets. At that point, you
must choose how to represent the characters as bytes. Converting from
Unicode to a bytestring is called encoding the
string. Similarly, when you load Unicode strings from a file, socket,
or other byte-oriented object, you need to
decode the strings from bytes to characters.
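As a minimal sketch of that round trip (assuming an illustrative
scratch file named sample.txt), you encode just before handing
characters to a byte-oriented write, and decode right after a
byte-oriented read; the codecs module can also wrap the file so the
conversion happens for you:

# Encode on the way out, decode on the way in (Python 2 style, as in the
# recipe); 'sample.txt' is an arbitrary scratch filename for illustration.
import codecs

unicodestring = u"Hello world"

outfile = open("sample.txt", "wb")
outfile.write(unicodestring.encode("utf-8"))    # characters -> bytes
outfile.close( )

infile = open("sample.txt", "rb")
roundtripped = unicode(infile.read( ), "utf-8") # bytes -> characters
infile.close( )
assert roundtripped == unicodestring

# codecs.open wraps the file so that read and write deal in Unicode directly
wrapped = codecs.open("sample.txt", "r", "utf-8")
assert wrapped.read( ) == unicodestring
wrapped.close( )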

Converting Unicode objects to bytestrings can be achieved in many
ways, each of which is called an encoding. For a
variety of historical, political, and technical reasons, there is no
one "right" encoding. Every
encoding has a case-insensitive name, and that name is passed to the
encode and decode methods as a
parameter. Here are a few encodings you should know about:

  • The UTF-8 encoding can handle any Unicode
    character. It is also backwards compatible with ASCII, so that a pure
    ASCII file can also be considered a UTF-8 file, and a UTF-8 file that
    happens to use only ASCII characters is identical to an ASCII file
    with the same characters. This property makes UTF-8 very
    backwards-compatible, especially with older Unix tools. UTF-8 is by
    far the dominant encoding on Unix, as well as the default encoding
    for XML documents. UTF-8's primary weakness is that
    it is fairly inefficient for eastern-language texts.

  • The UTF-16 encoding is favored by Microsoft operating systems and the
    Java environment. It is less efficient for western languages but more
    efficient for eastern ones. A variant of UTF-16 is sometimes known as
    UCS-2.

  • The ISO-8859 series of encodings are supersets of ASCII, each able to
    deal with 256 distinct characters. These encodings cannot support all
    of the Unicode characters; they support only some particular language
    or family of languages. ISO-8859-1, also known as
    "Latin-1", covers most western
    European and African languages, but not Arabic. ISO-8859-2, also
    known as "Latin-2", covers many
    eastern European languages such as Hungarian and Polish. ISO-8859-15,
    very popular in Europe these days, is basically the same as
    ISO-8859-1 with the addition of the Euro currency symbol as a
    character.
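As a small illustration of how the choice of encoding matters (the
Euro sign is used here only because it neatly separates these
encodings), the same single character takes a different number of
bytes in each encoding, and an encoding that lacks the character
simply refuses to encode it; the codec names themselves are matched
case-insensitively, with hyphens and underscores interchangeable:

# How the choice of encoding affects the bytes you get (Python 2 style).
euro = u"\u20ac"                                # the Euro currency symbol

assert len(euro.encode("utf-8")) == 3           # three bytes in UTF-8
assert len(euro.encode("utf-16-le")) == 2       # two bytes in UTF-16 (no BOM)
assert euro.encode("iso-8859-15") == "\xa4"     # ISO-8859-15 includes the Euro

try:                                            # ISO-8859-1 predates the Euro,
    euro.encode("iso-8859-1")                   # so this raises an error
except UnicodeError:
    pass

# Codec names are case-insensitive; '-' and '_' are interchangeable
assert euro.encode("UTF-8") == euro.encode("utf_8")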


If you want to be able to encode all Unicode characters,
you'll probably want to use UTF-8. You will need to
deal with the other encodings only when you are handed data in those
encodings created by some other application or input device, or vice
versa, when you need to prepare data in a specified encoding to
accommodate another application downstream of yours, or an output
device. In particular, Recipe 1.22 shows how to handle the case
in which the downstream application or device is driven from your
program's standard output stream.
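For instance, here is a minimal sketch of such "transcoding", where
data arrives as ISO-8859-1 bytes from some upstream source (the
sample bytes are purely illustrative) and must be handed downstream
as UTF-8; the key point is that you always pass through Unicode in
the middle:

# Transcoding: decode from the input encoding, then encode to the output
# encoding (Python 2 style; the Latin-1 input bytes are illustrative).
latin1_bytes = "caf\xe9"                        # 'café' as ISO-8859-1 bytes
text = unicode(latin1_bytes, "ISO-8859-1")      # first, decode to Unicode
utf8_bytes = text.encode("utf-8")               # then, encode for the consumer
assert utf8_bytes == "caf\xc3\xa9"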


See Also


Unicode is a huge topic, but a recommended book is Tony Graham,
Unicode: A Primer (Hungry Minds); details
are available at http://www.menteith.com/unicode/primer/. A
short but complete article by Joel Spolsky,
"The Absolute Minimum Every Software Developer
Absolutely, Positively Must Know About Unicode and Character Sets (No
Excuses)!" is located at http://www.joelonsoftware.com/articles/Unicode.html.
See also the Library Reference and
Python in a Nutshell documentation about the
built-in str and unicode types,
and modules unicodedata and codecs;
also, Recipe 1.20 and
Recipe 1.22.

