Recipe 12.3. Extracting Text from an XML Document
Credit: Paul Prescod
Problem
You need to extract only the text from an
XML document, not the tags.
Solution
Once again, subclassing SAX's
ContentHandler makes this task quite easy:
from xml.sax.handler import ContentHandler
import xml.sax
import sys

class textHandler(ContentHandler):
    def characters(self, ch):
        # ch is a Unicode string; encode it before writing (Python 2)
        sys.stdout.write(ch.encode("Latin-1"))

parser = xml.sax.make_parser()
handler = textHandler()
parser.setContentHandler(handler)
parser.parse("test.xml")
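The handler in the Solution writes each piece of text as soon as the parser delivers it. If you would rather get the text back as one string, the same idea can be adapted to collect the pieces first. What follows is only a minimal sketch, assuming Python 3 (where text is already Unicode, so nothing is encoded until output); the names TextCollector and extract_text are hypothetical and not part of the original recipe:

```python
import xml.sax
from xml.sax.handler import ContentHandler


class TextCollector(ContentHandler):
    """Accumulate character data instead of writing it out immediately."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def characters(self, content):
        # The parser may deliver one run of text in several calls,
        # so keep the pieces and join them at the end.
        self.chunks.append(content)

    def text(self):
        return "".join(self.chunks)


def extract_text(xml_bytes):
    """Return all character data in an XML document given as bytes."""
    handler = TextCollector()
    xml.sax.parseString(xml_bytes, handler)
    return handler.text()
```

With this sketch, extract_text(open("test.xml", "rb").read()) would return the document's text, leaving the choice of output encoding to the caller.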
Discussion
Sometimes you want to get rid of XML tags, for example, to
re-key a document or to spell-check it. This recipe performs this
task and works with any well-formed XML document. It is quite
efficient.
In this recipe's textHandler class, we override ContentHandler's
characters method, which the parser calls for each
string of text in the XML document (excluding tags, XML comments, and
processing instructions), passing as the only argument the piece of
text as a Unicode string. We have to encode this
Unicode before we can emit it to standard output. (See Recipe 1.22 for more information about
emitting Unicode to standard output.) In this recipe,
we're using the Latin-1 (also known as ISO-8859-1)
encoding, which covers all western European alphabets and is
supported by many popular output devices (e.g., printers and
terminal-emulation windows). However, you should use whatever
encoding is most appropriate for the documents
you're handling, as long, of course, as that
encoding is supported by the devices you need to use. The
configuration of your devices may depend on your operating
system's concepts of locale and code page.
Unfortunately, these issues vary too much between operating systems
for me to go into further detail.
A simple alternative, if you know that handling Unicode is not going
to be a problem, is to use sgmllib.
It's not quite as fast but somewhat more robust
against XML of dubious well-formedness:
from sgmllib import SGMLParser

class XMLJustText(SGMLParser):
    def handle_data(self, data):
        print data

XMLJustText().feed(open('text.xml').read())
An even simpler and rougher way to extract text from an XML document
is shown in Recipe 2.26.
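Returning to the encoding question discussed earlier: when the text contains a character that the chosen encoding cannot represent, a strict encode raises UnicodeEncodeError. One way to keep going is to pass an error handler to encode. The sketch below assumes Python 3, and the helper name encode_for_output is hypothetical; it is meant only to illustrate the trade-off, not to prescribe a policy:

```python
def encode_for_output(text, encoding="latin-1", errors="replace"):
    """Encode text for a device that only understands `encoding`.

    With errors="replace", characters the encoding cannot represent
    become "?"; with errors="xmlcharrefreplace", they become numeric
    character references such as &#8364; instead.
    """
    return text.encode(encoding, errors)
```

Under Python 3, the resulting bytes can be written with sys.stdout.buffer.write. Whether substituting "?" or a character reference is acceptable depends on what the extracted text is for; for spell-checking, for instance, losing an occasional symbol may not matter.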
See Also
Recipe 12.1 and Recipe 12.2 for other uses of SAX.