Credit: David Goodger, Peter Cogolo
You want to encode Unicode text for output in HTML, or some other XML application, using a limited but popular encoding such as ASCII or Latin-1.
Python provides an encoding error handler named xmlcharrefreplace, which replaces all characters outside of the chosen encoding with XML numeric character references:
def encode_for_xml(unicode_data, encoding='ascii'):
    return unicode_data.encode(encoding, 'xmlcharrefreplace')
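As a quick illustration of what this handler produces, here is a minimal sketch. It assumes Python 3 (the recipe's own code targets Python 2), and the sample string is made up for the example.

```python
# Python 3 sketch: str.encode returns a bytes object.
sample = "caf\xe9 costs \u20ac3"          # e-acute and the Euro sign
encoded = sample.encode("ascii", "xmlcharrefreplace")
print(encoded)  # b'caf&#233; costs &#8364;3'
```

Every character that ASCII cannot represent is replaced by its decimal code point wrapped in `&#...;`, and the rest of the text passes through unchanged.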
You could use this approach for HTML output, too, but you might
prefer to use HTML's symbolic
entity references instead. For this purpose, you need to define and
register a customized encoding error handler. Implementing that
handler is made easier by the fact that the Python Standard Library
includes a module named htmlentitydefs, whose codepoint2name
dictionary maps code points to entity names. After registering this
error handler, you can optionally write a function,
encode_for_html, to wrap its use.

As with any good Python module, this module would normally proceed
with an example of its use, guarded by an if __name__ ==
'__main__' test. If you run this module as a main script, you will
then see output from function encode_for_xml, as well as output
from function encode_for_html.

There is clearly a niche for each case, since
encode_for_xml is more general (you can use it for
any XML application, not just HTML), but
encode_for_html may produce output
that's easier to read, should you ever need to
look at it directly, edit it further, and so on. If you feed either
form to a browser, you should view it in exactly the same way. To
visualize both forms of encoding in a browser, run this
recipe's module as a main script, redirect the
output to a disk file, and use a text editor to separate the two
halves before you view them with a browser. (Alternatively, run the
script twice, once commenting out the call to
encode_for_xml, and once commenting out the call to
encode_for_html.)

Remember that Unicode data must always be encoded before being
printed or written out to a file. UTF-8 is an ideal encoding, since
it can handle any Unicode character. But for many users and
applications, ASCII or Latin-1 encodings are often preferred over
UTF-8. When the Unicode data contains characters that are outside of
the given encoding (e.g., accented characters and most symbols are
not encodable in ASCII, and the
"infinity" symbol is not encodable
in Latin-1), these encodings cannot handle the data on their own.
Python supports a built-in encoding error handler called
xmlcharrefreplace, which replaces unencodable
characters with XML numeric character references, such as
&#8734; for the
"infinity" symbol. This recipe
shows how to write and register another similar error handler,
html_replace, which uses HTML's symbolic entity references instead.

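To make the difference between these encodings concrete, the following sketch encodes one made-up sample string three ways. It assumes Python 3, whereas the recipe itself is written for Python 2.

```python
sample = "\xa3 and \u221e"   # British pound, then infinity

# Strict encoding (the default) raises as soon as a character does not fit.
try:
    sample.encode("ascii")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# Latin-1 can represent the pound sign as a single byte, but not infinity.
as_latin1 = sample.encode("latin-1", "xmlcharrefreplace")
# ASCII can represent neither, so both become numeric references.
as_ascii = sample.encode("ascii", "xmlcharrefreplace")

print(strict_ok)   # False
print(as_latin1)   # b'\xa3 and &#8734;'
print(as_ascii)    # b'&#163; and &#8734;'
```

The error handler is consulted only for characters the chosen encoding cannot represent, so the same text yields different output under Latin-1 and ASCII.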
Neither of these error handlers makes sense for output that is
neither HTML nor some other form of XML. For example, TeX and other
markup languages do not recognize XML numeric character references.
However, if you know how to build an arbitrary character reference
for such a markup language, you may modify the example error handler
html_replace to produce it.

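As a rough illustration of adapting the handler to another markup language, the sketch below emits a TeX-style reference of the form `\symbol{"XXXX}`. It assumes Python 3. The handler name tex_replace and the choice of notation are assumptions made for this example; whether a given TeX engine and font setup accepts that notation depends on your toolchain.

```python
import codecs

def tex_replace(exc):
    # Hypothetical handler: one \symbol{"XXXX} reference (hex code point)
    # per unencodable character. The notation is an assumption, not a
    # statement about what any particular TeX engine requires.
    if isinstance(exc, UnicodeEncodeError):
        bad = exc.object[exc.start:exc.end]
        refs = ['\\symbol{"%04X}' % ord(c) for c in bad]
        return "".join(refs), exc.end
    raise TypeError("can't handle %s" % exc.__class__.__name__)

codecs.register_error("tex_replace", tex_replace)

encoded = "x \u2192 \u221e".encode("ascii", "tex_replace")
print(encoded)
```

Only the line that builds each reference is specific to the target markup; the registration and the (replacement, resume position) return value follow the same pattern as any other encoding error handler.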
An alternative (and very effective!) way to perform encoding of
Unicode data into a file, with a given encoding and error handler of
your choice, is offered by the codecs module in
Python's standard library, through its codecs.open function.

Once you have opened outfile this way, you can use
outfile.write(unicode_data) for
any arbitrary Unicode string unicode_data,
and all the encoding and error handling will be taken care of
transparently. When your output is finished, of course, you should
call outfile.close().

See also the Library Reference and Python in a
Nutshell docs for modules codecs and htmlentitydefs.
import codecs
# Python 2 module; in Python 3 the same table lives in html.entities
from htmlentitydefs import codepoint2name

def html_replace(exc):
    if isinstance(exc, (UnicodeEncodeError, UnicodeTranslateError)):
        # Replace each unencodable character with its named entity
        s = [ u'&%s;' % codepoint2name[ord(c)]
              for c in exc.object[exc.start:exc.end] ]
        return ''.join(s), exc.end
    else:
        raise TypeError("can't handle %s" % exc.__class__.__name__)

codecs.register_error('html_replace', html_replace)

def encode_for_html(unicode_data, encoding='ascii'):
    return unicode_data.encode(encoding, 'html_replace')
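The handler above looks up codepoint2name[ord(c)] directly, so a character that has no named entity raises KeyError. The following sketch shows one possible way around that, falling back to a numeric character reference. It assumes Python 3, where the entity table is imported from html.entities, and the name html_replace_with_fallback is hypothetical, not part of the recipe.

```python
import codecs
from html.entities import codepoint2name  # Python 3 home of htmlentitydefs

def html_replace_with_fallback(exc):
    # Hypothetical variant: use a named entity when one exists,
    # otherwise fall back to a numeric character reference.
    if isinstance(exc, (UnicodeEncodeError, UnicodeTranslateError)):
        parts = []
        for c in exc.object[exc.start:exc.end]:
            code = ord(c)
            if code in codepoint2name:
                parts.append("&%s;" % codepoint2name[code])
            else:
                parts.append("&#%d;" % code)
        return "".join(parts), exc.end
    raise TypeError("can't handle %s" % exc.__class__.__name__)

codecs.register_error("html_replace_with_fallback", html_replace_with_fallback)

# e-acute and infinity have named entities; the snowman (U+2603) does not.
result = "\xe9 \u221e \u2603".encode("ascii", "html_replace_with_fallback")
print(result)  # b'&eacute; &infin; &#9731;'
```

Whether a fallback is desirable is a design choice: failing loudly, as the recipe's handler does, makes an unexpected character visible at once, while the fallback always produces output.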
Discussion
if __name__ == '__main__':
    # demo
    data = u'''<html>
<head>
<title>Encoding Test</title>
</head>
<body>
<p>accented characters:
<ul>
<li>\xe0 (a + grave)
<li>\xe7 (c + cedilla)
<li>\xe9 (e + acute)
</ul>
<p>symbols:
<ul>
<li>\xa3 (British pound)
<li>\u20ac (Euro)
<li>\u221e (infinity)
</ul>
</body></html>
'''
    print encode_for_xml(data)
    print encode_for_html(data)
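Output from encode_for_xml:
<li>&#224; (a + grave)
<li>&#231; (c + cedilla)
<li>&#233; (e + acute)
...
<li>&#163; (British pound)
<li>&#8364; (Euro)
<li>&#8734; (infinity)

Output from encode_for_html:
<li>&agrave; (a + grave)
<li>&ccedil; (c + cedilla)
<li>&eacute; (e + acute)
...
<li>&pound; (British pound)
<li>&euro; (Euro)
<li>&infin; (infinity)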
outfile = codecs.open('out.html', mode='w', encoding='ascii',
                      errors='html_replace')
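For comparison, here is a sketch of the same idea assuming Python 3, where the built-in open function accepts encoding and errors arguments directly. It uses the built-in xmlcharrefreplace handler so that it runs without registering anything, and the temporary directory and sample text are placeholders.

```python
import os
import tempfile

text = "caf\xe9 \u221e"
# Throwaway location for the demonstration file.
path = os.path.join(tempfile.mkdtemp(), "out.html")

# Encoding and error handling happen transparently on each write.
with open(path, mode="w", encoding="ascii", errors="xmlcharrefreplace") as outfile:
    outfile.write(text)

# Read the raw bytes back to see what actually reached the file.
with open(path, "rb") as infile:
    raw = infile.read()
print(raw)  # b'caf&#233; &#8734;'
```

Using a with statement closes the file automatically, which takes the place of the explicit outfile.close() call.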
See Also