Recipe 12.2. Counting Tags in a Document
Credit: Paul Prescod
Problem
You want to get a sense of how often
particular elements occur in an XML document, and the relevant counts
must be extracted rapidly.
Solution
You can subclass SAX's
ContentHandler to make your own specialized
classes for any kind of task, including the collection of such
statistics:
from xml.sax.handler import ContentHandler
import xml.sax

class countHandler(ContentHandler):
    def __init__(self):
        ContentHandler.__init__(self)
        # maps each tag name to the number of times it has been seen
        self.tags = {}
    def startElement(self, name, attr):
        # the parser calls this at the start of every element
        self.tags[name] = 1 + self.tags.get(name, 0)

parser = xml.sax.make_parser()
handler = countHandler()
parser.setContentHandler(handler)
parser.parse("test.xml")
# emit the counts in alphabetical order of tag name
for tag in sorted(handler.tags):
    print(tag, handler.tags[tag])
Discussion
When I start working with a new XML content set, I like to get a
sense of which elements are in it and how often they occur. For this
purpose, I use several small variants of this recipe. I could also
collect attributes just as easily, as you can see, since attributes
are also passed to the startElement method that
I'm overriding. If you add a stack, you can also
keep track of which elements occur within other elements (for this,
of course, you also have to override the
endElement method so you can pop the stack).

This recipe also works well as a simple example of a SAX application,
usable as the basis for any SAX application. Alternatives to SAX
include pulldom and minidom.
For any simple processing (including this example), these
alternatives would be overkill, particularly if the document you are
processing is very large. DOM approaches are generally justified only
when you need to perform complicated editing and alteration on an XML
document, when the document itself is made complicated by references
that go back and forth inside it, or when you need to correlate
(i.e., compare) multiple documents.

ContentHandler subclasses offer many other
options, and the online Python documentation does a pretty good job
of explaining them. This recipe's
countHandler class overrides
ContentHandler's
startElement method, which the parser calls at the
start of each element, passing as arguments the
element's tag name as a Unicode string and the
collection of attributes. Our override of this method counts the
number of times each tag name occurs. In the end, we extract the
dictionary used for counting and emit it (in alphabetical order,
which we easily obtain by sorting the keys).
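
The variants mentioned at the start of this discussion (collecting attributes, and adding a stack to see which elements occur within which) might look roughly like the following sketch. It is written for Python 3 and is only one possible arrangement: the class name NestingCountHandler, the attrs and pairs dictionaries, and the sample document are all made up for illustration and are not part of the recipe.

```python
import xml.sax
from xml.sax.handler import ContentHandler


class NestingCountHandler(ContentHandler):
    """Hypothetical variant: counts attributes and parent/child pairs."""

    def __init__(self):
        super().__init__()
        self.attrs = {}   # (tag, attribute name) -> occurrences
        self.pairs = {}   # (parent tag, child tag) -> occurrences
        self.stack = []   # tags of the elements that are currently open

    def startElement(self, name, attrs):
        # attrs is the attribute collection that the parser passes in
        for attr_name in attrs.getNames():
            key = (name, attr_name)
            self.attrs[key] = 1 + self.attrs.get(key, 0)
        # the top of the stack, if any, is the enclosing element;
        # None stands for "no parent" (the document element)
        parent = self.stack[-1] if self.stack else None
        pair = (parent, name)
        self.pairs[pair] = 1 + self.pairs.get(pair, 0)
        self.stack.append(name)

    def endElement(self, name):
        # the element is closed, so it no longer encloses anything
        self.stack.pop()


# made-up sample document, used only to exercise the handler
SAMPLE = b'<doc><item id="1"/><item id="2" kind="x"/><note><item/></note></doc>'

handler = NestingCountHandler()
xml.sax.parseString(SAMPLE, handler)

for (tag, attr_name), count in handler.attrs.items():
    print(tag, attr_name, count)
for (parent, child), count in handler.pairs.items():
    print(parent, child, count)
```

Parsing from an in-memory string with xml.sax.parseString keeps the sketch self-contained. For a file on disk, the make_parser and parse calls from the Solution work the same way with this handler.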
See Also
Recipe 12.3 for other uses
of SAX.