Python Cookbook 2Nd Edition Jun 1002005 [Electronic resources] نسخه متنی

با بیش از 100000 منبع الکترونیکی رایگان به زبان فارسی ، عربی و انگلیسی

Recipe 1.23. Encoding Unicode Data for XML and HTML

Credit: David Goodger, Peter Cogolo

Problem

You want to encode Unicode text for output
in HTML, or some other XML application, using a limited but popular
encoding such as ASCII or Latin-1.

Solution

Python provides an encoding error handler named
xmlcharrefreplace, which replaces all characters
outside of the chosen encoding with XML numeric character references:

def encode_for_xml(unicode_data, encoding='ascii'):
return unicode_data.encode(encoding, 'xmlcharrefreplace')

You could use this approach for HTML output, too, but you might
prefer to use HTML's symbolic
entity references instead. For this purpose, you need to define and
register a customized encoding error handler. Implementing that
handler is made easier by the fact that the Python Standard Library
includes a module named htmlentitydefs that holds
HTML entity definitions:

import codecs
from htmlentitydefs import codepoint2name
def html_replace(exc):
if isinstance(exc, (UnicodeEncodeError, UnicodeTranslateError)):
s = [ u'&%s;' % codepoint2name[ord(c)]
for c in exc.object[exc.start:exc.end] ]
return ''.join(s), exc.end
else:
raise TypeError("can't handle %s" % exc._ _name_ _) 
codecs.register_error('html_replace', html_replace)

After registering this error handler, you can optionally write a
function to wrap its use:

def encode_for_html(unicode_data, encoding='ascii'):
return unicode_data.encode(encoding, 'html_replace')

Discussion

As with any good Python module, this module would normally proceed
with an example of its use, guarded by an if _ _name_ _ == '_ _main_ _' test:

if _ _name_ _ == '_ _main_ _':
# demo
data = u'''<html>
<head>
<title>Encoding Test</title>
</head>
<body>
<p>accented characters:
<ul>
<li>\xe0 (a + grave)
<li>\xe7 (c + cedilla)
<li>\xe9 (e + acute)
</ul>
<p>symbols:
<ul>
<li>\xa3 (British pound)
<li>\u20ac (Euro)
<li>\u221e (infinity)
</ul>
</body></html>
'''
print encode_for_xml(data)
print encode_for_html(data)

If you run this module as a main script, you will then see such
output as (from function encode_for_xml):

<li>&#224; (a + grave)
<li>&#231; (c + cedilla)
<li>&#233; (e + acute)
...
<li>&#163; (British pound)
<li>&#8364; (Euro)
<li>&#8734; (infinity)

as well as (from function encode_for_html):

<li>&agrave; (a + grave)
<li>&ccedil; (c + cedilla)
<li>&eacute; (e + acute)
...
<li>&pound; (British pound)
<li>&euro; (Euro)
<li>&infin; (infinity)

There is clearly a niche for each case, since
encode_for_xml is more general (you can use it for
any XML application, not just HTML), but
encode_for_html may produce output
that's easier to readshould you ever need to
look at it directly, edit it further, and so on. If you feed either
form to a browser, you should view it in exactly the same way. To
visualize both forms of encoding in a browser, run this
recipe's module as a main script, redirect the
output to a disk file, and use a text editor to separate the two
halves before you view them with a browser. (Alternatively, run the
script twice, once commenting out the call to
encode_for_xml, and once commenting out the call to
encode_for_html.)

Remember that Unicode data must always be encoded before being
printed or written out to a file. UTF-8 is an ideal encoding, since
it can handle any Unicode character. But for many users and
applications, ASCII or Latin-1 encodings are often preferred over
UTF-8. When the Unicode data contains characters that are outside of
the given encoding (e.g., accented characters and most symbols are
not encodable in ASCII, and the
"infinity" symbol is not encodable
in Latin-1), these encodings cannot handle the data on their own.
Python supports a built-in encoding error handler called
xmlcharrefreplace, which replaces unencodable
characters with XML numeric character references, such as
∞ for the
"infinity" symbol. This recipe
shows how to write and register another similar error handler,
html_replace, specifically for producing HTML
output. html_replace replaces unencodable characters
with more readable HTML symbolic entity references, such as
∞ for the
"infinity" symbol.
html_replace is less general than
xmlcharrefreplace, since it does not support
all Unicode characters and cannot be used with
non-HTML applications; however, it can still be useful if you want
HTML output that is as readable as possible in a
"view page source" context.

Neither of these error handlers makes sense for output that is
neither HTML nor some other form of XML. For example, TeX and other
markup languages do not recognize XML numeric character references.
However, if you know how to build an arbitrary character reference
for such a markup language, you may modify the example error handler
html_replace shown in this recipe's
Solution to code and register your own encoding error handler.

An alternative (and very effective!) way to perform encoding of
Unicode data into a file, with a given encoding and error handler of
your choice, is offered by the codecs module in
Python's standard library:

outfile = codecs.open('outl', mode='w', encoding='ascii',
errors='html_replace')

You can now use outfile.write(unicode_data) for
any arbitrary Unicode string unicode_data,
and all the encoding and error handling will be taken care of
transparently. When your output is finished, of course, you should
call outfile.close( ).

Python Cookbook 2Nd Edition Jun 1002005 [Electronic resources] نسخه متنی

فارسی

کردی

العربیه

اردو

Türkçe

Русский

English

Français

کانال فیلم من

تبیان من

فایلهای من

کتابخانه من

پنل پیامکی

وبلاگ من

اینجــــا یک کتابخانه دیجیتالی است

با بیش از 100000 منبع الکترونیکی رایگان به زبان فارسی ، عربی و انگلیسی