Introduction
Credit: Paul Prescod, co-author of XML Handbook
(Prentice-Hall)

XML has become a central technology for all kinds of information
exchange. Today, most new file formats that are invented are based on
XML. Most new protocols are based upon XML. It simply
isn't possible to work with the emerging Internet
infrastructure without supporting XML. Luckily, Python has supported XML
for many versions, and that support has kept growing and maturing year
after year.

Python and XML are perfect complements. XML is an open-standards way
of exchanging information. Python is an open source language that
processes the information. Python excels at text processing and at
handling complicated data structures. XML is text based and is, above
all, a way of exchanging complicated data structures.

That said, working with XML is not so seamless that it requires no
effort. There is always something of a mismatch between the needs of a
particular programming language and a language-independent
information representation. So there is often a requirement to write
code that reads (i.e., deserializes or parses) and writes (i.e.,
serializes) XML.

Parsing XML can be done with code written purely in Python, or with a module
that is a C/Python mix. Python comes with the fast Expat parser
written in C. Many XML applications use the Expat parser, and one of
these recipes accesses Expat directly to build its own concept of an
ideal in-memory Python representation of an XML document as a tree of
"element" objects (an alternative
to the standard DOM approach, which I will mention later in this
introduction).

However, although Expat is ubiquitous in the XML world, it is far
from being the only parser available, or necessarily the best one for
any given application. A standard API called SAX allows any XML
parser to be plugged into a Python program. The SAX API is demonstrated in
several recipes that perform typical tasks such as checking that an
XML document is well formed, extracting text from a document, or
counting the tags in a document. These recipes should give you a good
understanding of how SAX works. One more advanced recipe shows how to
use one of SAX's several auxiliary features, "filtering", to normalize
"text events" that might otherwise get "fragmented" (a parser may
deliver one run of text as several separate events).

XML-RPC is a
protocol built on top of XML for sending data structures from one
program to another, typically across the Internet. XML-RPC allows
programmers to completely hide the implementation languages of the
two communicating components. Two components running on different
operating systems, written in different languages, can still
communicate easily. XML-RPC is built into Python. This chapter does
not deal with XML-RPC, because, together with other alternatives for
distributed programming, XML-RPC is covered in Chapter 15.

Other recipes in this chapter are a little more eclectic, dealing
with issues that range from interfacing with proprietary XML parsers
and document formats to representing an entire XML document in
memory as a Python object. One, in particular, shows how to
auto-detect the Unicode encoding that an XML document uses without
parsing the document. Unicode is central to the definition of XML, so
it's important to understand
Python's Unicode support if you will be doing any
sophisticated work with XML.
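
The encoding auto-detection idea mentioned above can be roughly sketched as follows. This is only an illustration, not the recipe itself: the function name `sniff_xml_encoding` and its `default` parameter are assumptions made here, and the sketch handles only byte-order marks and XML declarations written in an ASCII-compatible encoding (it ignores cases such as UTF-16 without a byte-order mark).

```python
import codecs
import re

# Byte-order marks, with UTF-32 listed before UTF-16 because the
# UTF-32-LE mark begins with the same two bytes as the UTF-16-LE mark.
_BOMS = [
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
]

# Matches encoding="..." inside an XML declaration at the very start
# of the data (only works when the declaration is ASCII-compatible).
_DECL = re.compile(
    rb"""^<\?xml[^>]*?encoding\s*=\s*["']([A-Za-z][A-Za-z0-9._-]*)["']"""
)


def sniff_xml_encoding(data, default="utf-8"):
    """Guess the encoding of XML bytes without parsing the document.

    Hypothetical helper: checks for a byte-order mark first, then for
    an encoding declaration, and otherwise falls back to `default`.
    """
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    match = _DECL.match(data[:200])
    if match:
        return match.group(1).decode("ascii").lower()
    return default
```

A caller might pass the first few hundred bytes of a file to this function and then use the returned name to decode the whole document. Note that a byte-order mark, if present, is still part of the data and has to be skipped separately.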
The PyXML extension package supplies a
variety of useful tools for working with XML. PyXML offers a full
implementation of the Document Object Model (DOM), as opposed to
the subset bundled with Python itself, and a validating XML
parser written entirely in Python. The DOM is a standard API that
loads an entire XML document into memory. This can make XML
processing easier for complicated structures in which there are many
references from one part of the document to another, or when you need
to correlate (i.e., compare) more than one XML document. One recipe
shows how to use PyXML's validating parser to
validate and process an XML document, and another shows how to remove
whitespace-only text nodes from an XML document's
DOM. You'll find many other examples in the
documentation of the PyXML package (http://pyxml.sourceforge.net/).

Other advanced tools that you can find in PyXML or, in some cases, in
Fourthought's open source 4Suite package
(http://www.4suite.org/), from
which much of PyXML derives, include implementations of a variety of
XML-related standards, such as XPath, XSLT, XLink, XPointer, and RDF.
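
To give a feel for what an XPath-style query looks like, here is a small sketch. It uses the limited XPath subset built into xml.etree.ElementTree, a module that joined the standard library in Python 2.5, and not the fuller XPath implementations in PyXML or 4Suite. The sample document and its element and attribute names are made up for illustration.

```python
import xml.etree.ElementTree as ET

# A made-up sample document, used only for illustration.
doc = ET.fromstring(
    "<library>"
    "<book lang='en'><title>First</title></book>"
    "<book lang='fr'><title>Second</title></book>"
    "</library>"
)

# ElementTree understands a small subset of XPath: child steps,
# the // descendant shortcut, and simple [@attr='value'] predicates.
english_titles = [t.text for t in doc.findall("./book[@lang='en']/title")]
all_titles = [t.text for t in doc.findall(".//title")]

print(english_titles)
print(all_titles)
```

Queries that go beyond this subset (XPath functions, axes, and so on) need a fuller XPath engine.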
If PyXML is already an excellent resource for XML power users in
Python, 4Suite is even richer and more powerful.

XML has become so pervasive that, inevitably, you will also find
XML-related recipes in other chapters of this book. Recipe 2.26 strips XML markup in a very
rough-and-ready way. Recipe 1.23 shows how to insert XML
character references while encoding Unicode text. Recipe 10.17 parses a Mac OS X
pinfo-format XML stream to get detailed system
information. Recipe 11.10
uses Tkinter to display an XML DOM as a GUI Tree widget. Recipe 14.11 deals with two XML file
formats related to RSS[1] feeds, fetching and parsing a
FOAF[2]-format input to produce an OPML[3]-format
result, quite a typical XML-related task in
today's programming, and a good general example of
how Python can help you with such tasks.
[1] RSS (Really Simple Syndication)
[2] FOAF (Friend of a Friend)
[3] OPML (Outline Processor Markup Language)
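
As a small illustration of one cross-referenced idea, inserting XML character references while encoding Unicode text, the following sketch relies on Python's built-in `xmlcharrefreplace` error handler. It is not necessarily how Recipe 1.23 is written, and the sample string is made up.

```python
# Characters that the target encoding cannot represent become
# numeric XML character references instead of raising an error.
text = "caf\u00e9 \u20ac5"  # made-up sample: e-acute and a euro sign
encoded = text.encode("ascii", "xmlcharrefreplace")
print(encoded)
```

This handler only replaces unencodable characters. Markup characters such as `&` and `<` are left untouched and must be escaped separately.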
For more information on using Python and XML together, see
Python and XML by Christopher A. Jones and
Fred L. Drake, Jr. (O'Reilly).