Recipe 12.4. Autodetecting XML Encoding
Credit: Paul Prescod
Problem
You have XML documents that may
use a large variety of Unicode encodings, and you need to find out
which encoding each document is using.
Solution
This task is one that we need to code ourselves, rather than getting
an existing package to perform it, if we want complete generality:
import codecs, encodings

""" Caller will hand this library a buffer string, and ask us to convert
the buffer, or autodetect what codec the buffer probably uses. """

# 'None' stands for a potentially variable byte ("##" in the XML spec)
autodetect_dict = {     # bytepattern           : encoding name
    (0x00, 0x00, 0xFE, 0xFF) : "ucs4_be",
    (0xFF, 0xFE, 0x00, 0x00) : "ucs4_le",
    (0xFE, 0xFF, None, None) : "utf_16_be",
    (0xFF, 0xFE, None, None) : "utf_16_le",
    (0x00, 0x3C, 0x00, 0x3F) : "utf_16_be",
    (0x3C, 0x00, 0x3F, 0x00) : "utf_16_le",
    (0x3C, 0x3F, 0x78, 0x6D) : "utf_8",
    (0x4C, 0x6F, 0xA7, 0x94) : "EBCDIC",
}

def autoDetectXMLEncoding(buffer):
    """ buffer -> encoding_name
    The buffer string should be at least four bytes long.
    Returns None if encoding cannot be detected.
    Note that encoding_name might not have an installed
    decoder (e.g., EBCDIC)
    """
    # A more efficient implementation would not decode the whole
    # buffer at once, but then we'd have to decode a character at
    # a time looking for the quote character, and that's a pain
    encoding = "utf_8"    # According to the XML spec, this is the default
    # This code successively tries to refine the default:
    # whenever it fails to refine, it falls back to
    # the last place encoding was set
    bytes = byte1, byte2, byte3, byte4 = tuple(map(ord, buffer[0:4]))
    enc_info = autodetect_dict.get(bytes, None)
    if not enc_info:    # Try autodetection again, ignoring the potentially
                        # variable bytes
        bytes = byte1, byte2, None, None
        enc_info = autodetect_dict.get(bytes)
    if enc_info:
        encoding = enc_info    # We have a guess...these are
                               # the new defaults
        # Try to find a more precise encoding using the XML declaration
        secret_decoder_ring = codecs.lookup(encoding)[1]
        decoded, length = secret_decoder_ring(buffer)
        first_line = decoded.split("\n", 1)[0]
        if first_line and first_line.startswith(u"<?xml"):
            encoding_pos = first_line.find(u"encoding")
            if encoding_pos != -1:
                # Look for double quotes
                quote_pos = first_line.find('"', encoding_pos)
                if quote_pos == -1:    # Look for single quote
                    quote_pos = first_line.find("'", encoding_pos)
                if quote_pos > -1:
                    quote_char = first_line[quote_pos]
                    rest = first_line[quote_pos+1:]
                    encoding = rest[:rest.find(quote_char)]
    return encoding
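The recipe above is written for Python 2, where a str is a byte buffer. The same byte-pattern idea can be sketched in modern Python 3, where the document arrives as a bytes object; the helper below is a simplified illustration covering only part of the recipe's table (the function name and pattern subset are this sketch's own, not the recipe's):

import codecs

# Map the first bytes of a document to a likely encoding name.
# Order matters: the 4-byte UTF-32 BOMs must be tried before the
# 2-byte UTF-16 BOMs they begin with.
_PATTERNS = [
    (codecs.BOM_UTF32_BE, "utf_32_be"),
    (codecs.BOM_UTF32_LE, "utf_32_le"),
    (codecs.BOM_UTF16_BE, "utf_16_be"),
    (codecs.BOM_UTF16_LE, "utf_16_le"),
    (b"\x00<\x00?", "utf_16_be"),   # '<?' with no BOM, big-endian
    (b"<\x00?\x00", "utf_16_le"),   # '<?' with no BOM, little-endian
    (b"<?xm", "utf_8"),             # ASCII-compatible single-byte start
]

def guess_xml_encoding(data):
    """bytes -> encoding name, defaulting to utf_8 as the XML spec requires."""
    for prefix, name in _PATTERNS:
        if data.startswith(prefix):
            return name
    return "utf_8"

print(guess_xml_encoding(b'<?xml version="1.0"?>'))           # utf_8
print(guess_xml_encoding('<?xml?>'.encode("utf-16-le")))      # utf_16_le

As in the recipe, the result of this first stage is only a guess good enough to decode the XML declaration, which may then name the precise encoding.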
Discussion
The XML specification describes the outline of an algorithm for
detecting the Unicode encoding that an XML document uses. This recipe
implements that algorithm and helps your XML-processing programs
determine which encoding is being used by a specific document.

The default encoding (unless we can determine another one
specifically) must be UTF-8, as it is part of the specifications that
define XML. Certain byte patterns in the first four, or sometimes
even just the first two, bytes of the text can identify a different
encoding. For example, if the text starts with the two bytes
0xFF, 0xFE we can be certain that these bytes are
a byte-order mark that identifies the encoding type as little-endian
(low byte before high byte in each character) and the encoding itself
as UTF-16 (or the 32-bits-per-character UCS-4, if the next two bytes
in the text are 0, 0).

If we get as far as this, we must also examine the first line of the
text. For this purpose, we decode the text from a bytestring into
Unicode, with the encoding determined so far, and find the first
line-end '\n' character. If the first line begins
with u'<?xml', it's an XML
declaration and may explicitly specify an encoding by using the
keyword encoding as an attribute. The nested
if statements in the recipe check for that case,
and, if they find an encoding thus specified, the recipe returns the
encoding thus found as the encoding the recipe has determined. This
step is absolutely crucial, since any text starting with the
single-byte ASCII-like representation of the XML declaration,
<?xml, would be otherwise erroneously
identified as encoded in UTF-8, while its explicit encoding attribute
may specify it as being, for example, one of the ISO-8859 standard
encodings.This recipe makes the assumption that, as the XML specs require, the
XML declaration, if any, is terminated by an end-of-line character.
If you need to deal with almost-XML documents that are malformed in
this very specific way (i.e., an incorrect XML declaration that is
not terminated by an end-of-line character), you may need to apply
some heuristic adjustments, for example, through regular expressions.
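For instance, one such heuristic might look like the sketch below, which scans the raw start of an (assumed ASCII-compatible) document for an encoding pseudo-attribute without requiring a newline after the declaration; the pattern and helper name are illustrative, not part of the recipe, and the code is written for Python 3 where the document is a bytes object:

import re

# Hypothetical fallback: find encoding="..." or encoding='...' inside an
# XML declaration, even when the declaration is not newline-terminated.
_ENC_RE = re.compile(rb'<\?xml[^>]*?encoding=["\']([A-Za-z0-9._-]+)["\']')

def sniff_declared_encoding(data, default="utf_8"):
    """Return the encoding named in the XML declaration, else the default."""
    match = _ENC_RE.search(data[:200])  # the declaration must come first
    if match:
        return match.group(1).decode("ascii")
    return default

print(sniff_declared_encoding(b"<?xml version='1.0' encoding='iso-8859-1'?><doc/>"))
# iso-8859-1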
However, it's impossible to offer precise
suggestions, since malformedness may come in such a wide variety of
errant forms.

This code detects a variety of encodings, including some that are not
yet supported by Python's Unicode decoders. So, the
fact that you can decipher the encoding does not guarantee that you
can then decipher the document itself!
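Before trusting a detected name, you can therefore ask Python whether a codec for it is actually installed; a minimal check (the helper name is this sketch's own):

import codecs

def has_decoder(encoding_name):
    """Return True if Python ships a codec for the given encoding name."""
    try:
        codecs.lookup(encoding_name)
        return True
    except LookupError:
        return False

# A name such as the recipe's "EBCDIC" may be detectable from byte
# patterns yet still not resolve to an installed codec on your platform.
print(has_decoder("utf_16_be"))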
See Also
Unicode is a huge topic, but a recommended book is
Unicode: A Primer, by Tony Graham (Hungry
Minds, Inc.); details are available at http://www.menteith.com/unicode/primer/;
Library Reference and Python in a
Nutshell document the built-in str and
unicode types, and modules
unicodedata and codecs; Recipe 1.21 and Recipe 1.22.