Python Cookbook 2Nd Edition Jun 1002005 [Electronic resources]

David Ascher, Alex Martelli, Anna Ravenscroft

نسخه متنی -صفحه : 394/ 40

Recipe 1.23. Encoding Unicode Data for XML and HTML

Credit: David Goodger, Peter Cogolo

Problem

You want to encode Unicode text for output in HTML, or some other XML application, using a limited but popular encoding such as ASCII or Latin-1.

Solution

Python provides an encoding error handler named xmlcharrefreplace, which replaces all characters outside of the chosen encoding with XML numeric character references:

def encode_for_xml(unicode_data, encoding='ascii'):
return unicode_data.encode(encoding, 'xmlcharrefreplace')

You could use this approach for HTML output, too, but you might prefer to use HTML's symbolic entity references instead. For this purpose, you need to define and register a customized encoding error handler. Implementing that handler is made easier by the fact that the Python Standard Library includes a module named that holds HTML entity definitions:

import codecs
fromlentitydefs import codepoint2name
defl_replace(exc):
if isinstance(exc, (UnicodeEncodeError, UnicodeTranslateError)):
s = [ u'&%s;' % codepoint2name[ord(c)]
for c in exc.object[exc.start:exc.end] ]
return ''.join(s), exc.end
else:
raise TypeError("can't handle %s" % exc._ _name_ _) 
codecs.register_error(l_replace',l_replace)

After registering this error handler, you can optionally write a function to wrap its use:

def encode_forl(unicode_data, encoding='ascii'):
return unicode_data.encode(encoding, l_replace')

Discussion

As with any good Python module, this module would normally proceed with an example of its use, guarded by an if _ _name_ _ == '_ _main_ _' test:

if _ _name_ _ == '_ _main_ _':
# demo
data = u'''<l>
<head>
<title>Encoding Test</title>
</head>
<body>
<p>accented characters:
<ul>
<li>\xe0 (a + grave)
<li>\xe7 (c + cedilla)
<li>\xe9 (e + acute)
</ul>
<p>symbols:
<ul>
<li>\xa3 (British pound)
<li>\u20ac (Euro)
<li>\u221e (infinity)
</ul>
</body><l>
'''
print encode_for_xml(data)
print encode_forl(data)

If you run this module as a main script, you will then see such output as (from function encode_for_xml):

<li>&#224; (a + grave)
<li>&#231; (c + cedilla)
<li>&#233; (e + acute)
...
<li>&#163; (British pound)
<li>&#8364; (Euro)
<li>&#8734; (infinity)

as well as (from function encode_forl):

<li>&agrave; (a + grave)
<li>&ccedil; (c + cedilla)
<li>&eacute; (e + acute)
...
<li>&pound; (British pound)
<li>&euro; (Euro)
<li>&infin; (infinity)

There is clearly a niche for each case, since encode_for_xml is more general (you can use it for any XML application, not just HTML), but encode_forl may produce output that's easier to readshould you ever need to look at it directly, edit it further, and so on. If you feed either form to a browser, you should view it in exactly the same way. To visualize both forms of encoding in a browser, run this recipe's module as a main script, redirect the output to a disk file, and use a text editor to separate the two halves before you view them with a browser. (Alternatively, run the script twice, once commenting out the call to encode_for_xml, and once commenting out the call to encode_forl.)

Remember that Unicode data must always be encoded before being printed or written out to a file. UTF-8 is an ideal encoding, since it can handle any Unicode character. But for many users and applications, ASCII or Latin-1 encodings are often preferred over UTF-8. When the Unicode data contains characters that are outside of the given encoding (e.g., accented characters and most symbols are not encodable in ASCII, and the "infinity" symbol is not encodable in Latin-1), these encodings cannot handle the data on their own. Python supports a built-in encoding error handler called xmlcharrefreplace, which replaces unencodable characters with XML numeric character references, such as ∞ for the "infinity" symbol. This recipe shows how to write and register another similar error handler, , specifically for producing HTML output. replaces unencodable characters with more readable HTML symbolic entity references, such as ∞ for the "infinity" symbol. is less general than xmlcharrefreplace, since it does not support all Unicode characters and cannot be used with non-HTML applications; however, it can still be useful if you want HTML output that is as readable as possible in a "view page source" context.

Neither of these error handlers makes sense for output that is neither HTML nor some other form of XML. For example, TeX and other markup languages do not recognize XML numeric character references. However, if you know how to build an arbitrary character reference for such a markup language, you may modify the example error handler shown in this recipe's Solution to code and register your own encoding error handler.

An alternative (and very effective!) way to perform encoding of Unicode data into a file, with a given encoding and error handler of your choice, is offered by the codecs module in Python's standard library:

outfile = codecs.open('outl', mode='w', encoding='ascii',
errors=l_replace')

You can now use outfile.write(unicode_data) for any arbitrary Unicode string unicode_data, and all the encoding and error handling will be taken care of transparently. When your output is finished, of course, you should call outfile.close( ).