Recipe 1.20. Handling International Text with Unicode
Credit: Holger Krekel
Problem
You need to deal with text
strings that include non-ASCII characters.
Solution
Python has a first-class unicode type that you can
use in place of the plain bytestring str type.
It's easy, once you accept the need to explicitly
convert between a bytestring and a Unicode string:
>>> german_ae = unicode('\xc3\xa4', 'utf8')
Here, german_ae is a unicode string representing the German lowercase a
with umlaut (i.e., diaeresis) character "ä". It has been constructed by
interpreting the bytestring '\xc3\xa4' according to the specified UTF-8
encoding. There are many encodings, but UTF-8 is often used because it is
universal (UTF-8 can encode any Unicode string) and yet fully compatible
with the 7-bit ASCII set (any ASCII bytestring is a correct UTF-8-encoded
string).
Once you cross this barrier, life is easy! You can manipulate this
Unicode string in practically the same way as a plain str string:
>>> sentence = "This is a " + german_ae
>>> sentence2 = "Easy!"
>>> para = ". ".join([sentence, sentence2])
Note that para is a Unicode string, because operations between a
unicode string and a bytestring always result in a unicode string,
unless they fail and raise an exception:
>>> bytestring = '\xc3\xa4'   # Uuh, some non-ASCII bytestring!
>>> german_ae += bytestring
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in
position 0: ordinal not in range(128)
The byte 0xc3 is not a valid character in the 7-bit ASCII encoding, and
Python refuses to guess an encoding. So, being explicit about encodings
is the crucial point for successfully using Unicode strings with Python.
Discussion
Unicode is easy to handle in Python, if you respect a few guidelines
and learn to deal with common problems. This is not to say that an
efficient implementation of Unicode is an easy task. Luckily, as with
other hard problems, you don't have to care much:
you can just use the efficient implementation of Unicode that Python
provides.
The most important issue is to fully accept the distinction between a
bytestring and a unicode string. As exemplified in
this recipe's solution, you often need to explicitly
construct a unicode string by providing a
bytestring and an encoding. Without an encoding, a bytestring is
basically meaningless, unless you happen to be lucky and can just
assume that the bytestring is text in ASCII.
The most common problem with using Unicode in Python arises when you
are doing some text manipulation where only some of your strings are
unicode objects and others are bytestrings. Python
makes a shallow attempt to implicitly convert your bytestrings to
Unicode. It usually assumes an ASCII encoding, though, which gives
you UnicodeDecodeError exceptions if you actually
have non-ASCII bytes somewhere. UnicodeDecodeError
tells you that you mixed Unicode and bytestrings in such a way that
Python cannot (doesn't even try to) guess the text
your bytestring might represent.
Developers from many big Python projects have come up with simple
rules of thumb to prevent such runtime
UnicodeDecodeErrors, and the rules may be
summarized into one sentence: always do the conversion at IO
barriers. To express this same concept a bit more extensively:
- Whenever your program receives text data "from the
outside" (from the network, from a file, from user
input, etc.), construct unicode objects
immediately. Find out the appropriate encoding, for example, from an
HTTP header, or look for an appropriate convention to determine the
encoding to use.
- Whenever your program sends text data "to the
outside" (to the network, to some file, to the user,
etc.), determine the correct encoding, and convert your text to a
bytestring with that encoding. (Otherwise, Python attempts to convert
Unicode to an ASCII bytestring, likely producing
UnicodeEncodeErrors, which are just the converse
of the UnicodeDecodeErrors previously mentioned.) Both rules are
illustrated in the sketch after this list.
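Here is a minimal sketch of both rules, using the standard codecs
module; the filenames and the choice of UTF-8 are just assumptions made
for the example:
import codecs

# Rule 1: decode at the input barrier -- bytes coming from the outside
# become unicode objects as soon as they enter the program.
infile = codecs.open('input.txt', 'r', 'utf-8')    # hypothetical file
text = infile.read()           # text is a unicode object
infile.close()

# ... all internal text manipulation works on unicode objects ...
text = text.upper()

# Rule 2: encode at the output barrier -- unicode becomes bytes, in an
# explicitly chosen encoding, only when it leaves the program.
outfile = codecs.open('output.txt', 'w', 'utf-8')  # hypothetical file
outfile.write(text)            # codecs encodes the unicode text for us
outfile.close()
Inside the program, text stays a unicode object; bytes and their
encodings appear only at the two I/O barriers.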
With these two rules, you will solve most Unicode problems. If you
still get UnicodeErrors of either kind, look for
where you forgot to properly construct a unicode
object, forgot to properly convert back to an encoded bytestring, or
ended up using an inappropriate encoding due to some mistake. (It is
quite possible that such encoding mistakes are due to the user, or
some other program that is interacting with yours, not following the
proper encoding rules or conventions.)
In order to convert a Unicode string back to an encoded bytestring,
you usually do something like:
>>> bytestring = german_ae.encode('latin1')
>>> bytestring
'\xe4'
Now bytestring is a German "ä" character in the
'latin1' encoding. Note how '\xe4' (in Latin-1) and the previously shown
'\xc3\xa4' (in UTF-8) represent the same German character, but in
different encodings.
By now, you can probably imagine why Python refuses to guess among
the hundreds of possible encodings. It's a crucial
design choice, based on one of the Zen of Python
principles: "In the face of ambiguity, resist the
temptation to guess." At any interactive Python
shell prompt, enter the statement import this to
read all of the important principles that make up the Zen
of Python.
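To see why guessing would be dangerous, consider that the very same
bytes decode without error under more than one encoding, yet yield
different text; a small illustrative sketch:
>>> raw = '\xc3\xa4'          # the same two bytes as before
>>> raw.decode('utf8')        # as UTF-8, they are one character...
u'\xe4'
>>> raw.decode('latin1')      # ...as Latin-1, two different characters
u'\xc3\xa4'
Both calls succeed, so only you, knowing where the bytes came from, can
say which result is the intended text.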
See Also
Unicode is a huge topic, but a recommended book is
Unicode: A Primer, by Tony Graham (Hungry
Minds, Inc.); details are available at http://www.menteith.com/unicode/primer/; and
a short but complete article from Joel Spolsky, "The
Absolute Minimum Every Software Developer Absolutely, Positively Must
Know About Unicode and Character Sets (No
Excuses)!," located at http://www.joelonsoftware.com/articles/Unicodel.
See also the Library Reference and
Python in a Nutshell documentation about the
built-in str and unicode types
and modules unicodedata and codecs;
also, Recipe 1.21 and
Recipe 1.22.