Recipe 1.21. Converting Between Unicode and Plain Strings
Credit: David Ascher, Paul Prescod
Problem
You need to deal with textual data
that doesn't necessarily fit in the ASCII character
set.
Solution
Unicode strings can be encoded in plain strings in a variety of ways,
according to whichever encoding you choose:
unicodestring = u"Hello world"
# Convert Unicode to plain Python string: "encode"
utf8string = unicodestring.encode("utf-8")
asciistring = unicodestring.encode("ascii")
isostring = unicodestring.encode("ISO-8859-1")
utf16string = unicodestring.encode("utf-16")
# Convert plain Python string to Unicode: "decode"
# (each result is a Unicode object, equal to the original unicodestring)
decoded1 = unicode(utf8string, "utf-8")
decoded2 = unicode(asciistring, "ascii")
decoded3 = unicode(isostring, "ISO-8859-1")
decoded4 = unicode(utf16string, "utf-16")
assert decoded1 == decoded2 == decoded3 == decoded4 == unicodestring
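The sample string above is pure ASCII, so every encoding succeeds and the byte counts look much alike. The following is a minimal sketch of what changes once the text contains a non-ASCII character. The sample text and variable names are illustrative assumptions, and the code sticks to the encode and decode methods so that it runs on both Python 2 and Python 3.

```python
# Illustrative only: the sample text and names are assumptions.
text = u"caf\u00e9"   # 4 characters; the last one is e-acute (U+00E9)

utf8_bytes = text.encode("utf-8")          # e-acute becomes 2 bytes
latin1_bytes = text.encode("ISO-8859-1")   # e-acute becomes 1 byte

# The number of characters and the number of bytes no longer agree.
assert len(text) == 4
assert len(utf8_bytes) == 5
assert len(latin1_bytes) == 4

# ASCII has no e-acute, so a strict encode raises UnicodeEncodeError.
try:
    text.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

# Decoding with the matching encoding restores the original text.
assert utf8_bytes.decode("utf-8") == text
assert latin1_bytes.decode("ISO-8859-1") == text
```

The same four characters therefore occupy four or five bytes depending on the encoding chosen, and cannot be represented in ASCII at all.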
Discussion
If you find yourself dealing with text that contains non-ASCII
characters, you have to learn about Unicode: what it is, how it
works, and how Python uses it. The preceding Recipe 1.20 offers minimal but crucial
practical tips, and this recipe tries to offer more perspective.
You don't need to know everything about Unicode to
be able to solve real-world problems with it, but a few basic tidbits
of knowledge are indispensable. First, you must understand the
difference between bytes and characters. In older, ASCII-centric
languages and environments, bytes and characters are treated as if
they were the same thing. A byte can hold up to 256 different values,
so these environments are limited to dealing with no more than 256
distinct characters. Unicode, on the other hand, has tens of
thousands of characters, far more than one byte can distinguish, so a
Unicode character generally takes more than one byte; thus you need to
make the distinction between characters and bytes.
Standard Python strings are really bytestrings, and a Python
character, being such a string of length 1, is really a byte. Other
terms for an instance of the standard Python string type are
8-bit string and plain
string. In this recipe we call such instances bytestrings,
to remind you of their byte orientation.
A Python Unicode character is an abstract object big enough to hold
any character, analogous to Python's long integers.
You don't have to worry about the internal
representation; the representation of Unicode characters becomes an
issue only when you are trying to send them to some byte-oriented
function, such as the write method of files or the
send method of network sockets. At that point, you
must choose how to represent the characters as bytes. Converting from
Unicode to a bytestring is called encoding the
string. Similarly, when you load Unicode strings from a file, socket,
or other byte-oriented object, you need to
decode the strings from bytes to characters.
Converting Unicode objects to bytestrings can be achieved in many
ways, each of which is called an encoding. For a
variety of historical, political, and technical reasons, there is no
one "right" encoding. Every
encoding has a case-insensitive name, and that name is passed to the
encode and decode methods as a
parameter. Here are a few encodings you should know about:
- The UTF-8 encoding can handle any Unicode
character. It is also backwards compatible with ASCII, so that a pure
ASCII file can also be considered a UTF-8 file, and a UTF-8 file that
happens to use only ASCII characters is identical to an ASCII file
with the same characters. This property makes UTF-8 very
backwards-compatible, especially with older Unix tools. UTF-8 is by
far the dominant encoding on Unix, as well as the default encoding
for XML documents. UTF-8's primary weakness is that
it is fairly inefficient for eastern-language texts.
- The UTF-16 encoding is favored by Microsoft operating systems and the
Java environment. It is less efficient for western languages but more
efficient for eastern ones. A closely related, older fixed-width
encoding, which covers only the first 65,536 code points, is known as
UCS-2.
- The ISO-8859 series of encodings are supersets of ASCII, each able to
deal with 256 distinct characters. These encodings cannot support all
of the Unicode characters; they support only some particular language
or family of languages. ISO-8859-1, also known as
"Latin-1", covers most western
European and African languages, but not Arabic. ISO-8859-2, also
known as "Latin-2", covers many
eastern European languages such as Hungarian and Polish. ISO-8859-15,
very popular in Europe these days, is basically the same as
ISO-8859-1 with the addition of the Euro currency symbol as a
character.
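The trade-offs in the list above can be checked directly by encoding a few short strings and counting bytes. This is only a sketch: the helper name encoded_sizes and the sample strings are hypothetical, and "utf-16-le" is used instead of plain "utf-16" so that the 2-byte byte-order mark Python prepends for "utf-16" does not distort the counts.

```python
def encoded_sizes(text, encodings=("utf-8", "utf-16-le", "iso-8859-1")):
    """Map each encoding name to the encoded byte length of text,
    or to None when the encoding cannot represent the text."""
    sizes = {}
    for name in encodings:
        try:
            sizes[name] = len(text.encode(name))
        except UnicodeEncodeError:
            sizes[name] = None
    return sizes

western = u"Hello"                  # 5 ASCII characters
eastern = u"\u65e5\u672c\u8a9e"     # 3 CJK characters
euro = u"\u20ac"                    # the Euro currency symbol

western_sizes = encoded_sizes(western)
eastern_sizes = encoded_sizes(eastern)
euro_sizes = encoded_sizes(euro, ("iso-8859-1", "iso-8859-15"))

# UTF-8 output for pure ASCII text is byte-for-byte identical to ASCII.
assert western.encode("utf-8") == western.encode("ascii")

# Encoding names are case-insensitive.
assert eastern.encode("UTF-8") == eastern.encode("utf-8")
```

With these samples, the western string takes 5 bytes in UTF-8 and 10 in UTF-16, while the eastern string takes 9 bytes in UTF-8 and 6 in UTF-16 and cannot be encoded in ISO-8859-1 at all. The Euro symbol fails in ISO-8859-1 but fits in a single byte in ISO-8859-15.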
If you want to be able to encode all Unicode characters,
you'll probably want to use UTF-8. You will need to
deal with the other encodings only when you are handed data in those
encodings created by some other application or input device, or vice
versa, when you need to prepare data in a specified encoding to
accommodate another application downstream of yours, or an output
device. In particular, Recipe 1.22 shows how to handle the case
in which the downstream application or device is driven from your
program's standard output stream.
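When data arrives in one encoding and must leave in another, a common pattern is to decode at the input boundary, work with Unicode in between, and encode at the output boundary. The sketch below assumes a hypothetical helper named transcode and a made-up Latin-1 input; it is one way to arrange the two steps, not a prescribed one.

```python
def transcode(data, source_encoding, target_encoding="utf-8"):
    """Re-encode a bytestring from source_encoding to target_encoding.

    Hypothetical helper for illustration: decode first, then encode.
    """
    text = data.decode(source_encoding)    # bytes -> Unicode characters
    return text.encode(target_encoding)    # Unicode characters -> bytes

# "naive" spelled with i-diaeresis (U+00EF), as handed over in Latin-1.
latin1_input = b"na\xefve"
utf8_output = transcode(latin1_input, "ISO-8859-1")

# The single Latin-1 byte 0xEF becomes the two UTF-8 bytes 0xC3 0xAF.
assert utf8_output == b"na\xc3\xafve"
```

Any processing of the text itself belongs between the decode and encode calls, where the data is a Unicode object rather than a bytestring.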
See Also
Unicode is a huge topic, but a recommended book is Tony Graham,
Unicode: A Primer (Hungry Minds), with details
available at http://www.menteith.com/unicode/primer/; and
a short but complete article from Joel Spolsky,
"The Absolute Minimum Every Software Developer
Absolutely, Positively Must Know About Unicode and Character Sets (No
Excuses)!" is located at http://www.joelonsoftware.com/articles/Unicodel.
See also the Library Reference and
Python in a Nutshell documentation about the
built-in str and unicode types,
and modules unidata and codecs;
also, Recipe 1.20 and
Recipe 1.22.