Recipe 12.1. Checking XML Well-Formedness
Credit: Paul Prescod, Farhad Fouladi
Problem
You need to check
whether an XML document is well formed (not
whether it conforms to a given DTD or schema), and you need to do
this check quickly.
Solution
SAX (presumably using a fast parser such as Expat underneath) offers
a fast, simple way to perform this task. Here is a script to check
well-formedness on every file you mention on the
script's command line:
from xml.sax.handler import ContentHandler
from xml.sax import make_parser
from glob import glob
import sys

def parsefile(filename):
    # A do-nothing ContentHandler: we only care whether parsing succeeds
    parser = make_parser()
    parser.setContentHandler(ContentHandler())
    parser.parse(filename)

for arg in sys.argv[1:]:
    for filename in glob(arg):
        try:
            parsefile(filename)
            print("%s is well-formed" % filename)
        except Exception as e:
            print("%s is NOT well-formed! %s" % (filename, e))
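The script treats every exception as a sign of ill-formedness, including a file that simply cannot be opened. As a minimal sketch, the two cases can be told apart by catching xml.sax.SAXParseException separately from I/O errors. The name check_file and its (ok, message) return shape are illustrative assumptions, not part of the recipe:

```python
import xml.sax
from xml.sax.handler import ContentHandler


def check_file(filename):
    """Return (True, None) if filename parses, else (False, message).

    check_file is a hypothetical helper, not part of the original recipe.
    """
    try:
        # Open the file ourselves so that a missing or unreadable file
        # surfaces as an ordinary I/O error rather than a parser error.
        with open(filename, "rb") as f:
            xml.sax.parse(f, ContentHandler())
    except xml.sax.SAXParseException as e:
        # The document was read but breaks an XML syntax rule.
        return False, "line %d, column %d: %s" % (
            e.getLineNumber(), e.getColumnNumber(), e.getMessage())
    except (IOError, OSError) as e:
        # The file could not be read at all; this says nothing about XML.
        return False, "cannot read file: %s" % e
    return True, None
```

A caller can then print a different message for unreadable files than for files that are readable but not well formed.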
Discussion
A text is a well-formed XML document if it adheres to all the basic
syntax rules for XML documents. In other words, its XML
declaration (if it has one) is correct, it has a single root element,
all tags are properly nested, attribute values are quoted, and so on.

This recipe uses the SAX API with a dummy
ContentHandler that does nothing. Generally, when
we parse an XML document with SAX, we use a
ContentHandler instance to process the
document's contents. But in this case, we only want
to know whether the document meets the most fundamental syntax
constraints of XML; therefore, we need not do any processing, and the
do-nothing handler suffices.

The parsefile function parses the whole document and
raises an exception if an error is found. The
recipe's main code catches any such exception and
prints it out like this:
$ python wellformed.py test.xml
test.xml is NOT well-formed! test.xml:1002:2: mismatched tag
This means that character 2 on line 1,002 has a mismatched tag.

This recipe
does not check adherence to a DTD or schema, which is a separate
procedure called validation. The performance
of the script should be quite good, precisely because it focuses on
performing a minimal irreducible core task. However, sometimes you
need to squeeze out the last drop of performance because
you're checking the well-formedness of truly huge
files. If you know for sure that you do have Expat, specifically,
installed on your system, you may alternatively choose to use Expat
directly instead of SAX. To try this approach, you can change
function parsefile to the following code:
import xml.parsers.expat
def parsefile(filename):
    parser = xml.parsers.expat.ParserCreate()
    # Binary mode lets Expat handle the document's encoding itself
    with open(filename, "rb") as f:
        parser.ParseFile(f)

Don't expect all that much of an improvement in
performance when using Expat directly instead of SAX. However, you
might gain a little bit.
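
The syntax rules listed at the start of this discussion are easy to try out on small in-memory documents. The following is a sketch only, using Expat directly; the helper name first_error and the sample strings are made up for illustration. It returns None for a well-formed document and otherwise reports where Expat stopped:

```python
import xml.parsers.expat


def first_error(data):
    """Return None if data is well-formed XML, else a short error string.

    first_error is a hypothetical helper, not part of the original recipe.
    """
    parser = xml.parsers.expat.ParserCreate()
    try:
        # The second argument tells Expat that this is the final chunk.
        parser.Parse(data, True)
    except xml.parsers.expat.ExpatError as e:
        return "line %d, column %d: %s" % (
            e.lineno, e.offset, xml.parsers.expat.ErrorString(e.code))
    return None


samples = [
    "<doc><item id='1'>ok</item></doc>",  # well formed
    "<doc><item></doc>",                   # tags not properly nested
    "<doc><item id=1/></doc>",             # attribute value not quoted
    "<doc/><doc/>",                        # more than one root element
]
for text in samples:
    print("%-36s %s" % (text, first_error(text) or "well-formed"))
```

Only the first sample passes. Expat stops at the first violation it meets, so a document with several problems has to be fixed and rechecked one error at a time.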
See Also
Recipe 12.2 and Recipe 12.3, for other uses of
SAX; the PyXML package (http://pyxml.sourceforge.net/) includes the
pure-Python validating parser xmlproc, which
checks the conformance of XML documents to specific DTDs; the PyRXP
package from ReportLab is a wrapper around the fast validating parser
RXP (http://www.reportlab.com/xml/pyrxpl),
which is available under the GPL license.