Python Cookbook 2nd Edition Jun 1002005 [Electronic resources] (text version)

David Ascher, Alex Martelli, Anna Ravenscroft

Recipe 12.3. Extracting Text from an XML Document


Credit: Paul Prescod


Problem


You need to extract only the text from an
XML document, not the tags.


Solution


Once again, subclassing SAX's
ContentHandler makes this task quite easy:

from xml.sax.handler import ContentHandler
import xml.sax
import sys

class textHandler(ContentHandler):
    def characters(self, ch):
        # ch is a Unicode string; encode it before writing to stdout
        sys.stdout.write(ch.encode("Latin-1"))

parser = xml.sax.make_parser()
handler = textHandler()
parser.setContentHandler(handler)
parser.parse("test.xml")


Discussion


Sometimes you want to get rid of XML tags, for example, to
re-key a document or to spell-check it. This recipe performs this
task and works with any well-formed XML document. It is quite
efficient.


In this recipe's textHandler class, we override
ContentHandler's characters method, which the parser calls for each
string of text in the XML document (excluding tags, XML comments, and
processing instructions), passing as the only argument the piece of
text as a Unicode string. A parser may deliver a single run of text
in several consecutive calls. We have to encode this
Unicode before we can emit it to standard output. (See Recipe 1.22 for more information about
emitting Unicode to standard output.) In this recipe,
we're using the Latin-1 (also known as ISO-8859-1)
encoding, which covers most western European alphabets and is
supported by many popular output devices (e.g., printers and
terminal-emulation windows). However, you should use whatever
encoding is most appropriate for the documents
you're handling, as long, of course, as that
encoding is supported by the devices you need to use. The
configuration of your devices may depend on your operating
system's concepts of locale and code page.
Unfortunately, these issues vary too much between operating systems
for me to go into further detail.
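
The recipe's code is written for Python 2. As a rough sketch of the same idea under Python 3 (an assumption, not part of the original recipe), the handler below collects the text pieces instead of writing them immediately, and a separate helper does the explicit encoding step. The names TextCollector, extract_text, and encode_for_output are hypothetical, chosen only for this illustration.

```python
import xml.sax
from xml.sax.handler import ContentHandler


class TextCollector(ContentHandler):
    """Collect character data; tags, comments, and processing
    instructions never reach characters()."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def characters(self, content):
        # The parser may split one run of text across several calls,
        # so accumulate the pieces and join them at the end.
        self.chunks.append(content)


def extract_text(xml_bytes):
    """Parse an XML document given as bytes and return its text as str."""
    handler = TextCollector()
    xml.sax.parseString(xml_bytes, handler)
    return "".join(handler.chunks)


def encode_for_output(text, encoding="latin-1"):
    """Encode explicitly; characters the encoding lacks become '?'."""
    return text.encode(encoding, errors="replace")


# UTF-8 input (the XML default when no declaration is present).
sample = b"<doc><!-- note --><p>caf\xc3\xa9 <b>au</b> lait</p></doc>"
text = extract_text(sample)
encoded = encode_for_output(text)
```

To emit the result as bytes in the chosen encoding, one option in Python 3 is sys.stdout.buffer.write(encoded), which bypasses the text layer's own encoding. Passing errors="replace" is a design choice for this sketch: it keeps output flowing when the document contains characters outside Latin-1, at the cost of silently substituting them.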

A simple alternative, if you know that handling Unicode is not going
to be a problem, is to use sgmllib.
It's not quite as fast but somewhat more robust
against XML of dubious well-formedness:

from sgmllib import SGMLParser

class XMLJustText(SGMLParser):
    def handle_data(self, data):
        print data

XMLJustText().feed(open('text.xml').read())
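
The sgmllib module exists only in Python 2; it was removed in Python 3. A comparable lenient approach in Python 3, sketched here as a substitute rather than as the recipe's own method, is html.parser.HTMLParser from the standard library. It is an HTML parser, not an XML one (for instance, it lowercases tag names and treats script and style elements specially), so this assumes such differences do not matter when only the text is wanted. The names JustText and lenient_text are hypothetical.

```python
from html.parser import HTMLParser


class JustText(HTMLParser):
    """Collect text data and ignore all markup."""

    def __init__(self):
        # convert_charrefs defaults to True, so references such as
        # &amp; arrive in handle_data already converted.
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def lenient_text(markup):
    """Return the text of markup that need not be well-formed."""
    parser = JustText()
    parser.feed(markup)
    parser.close()  # flush any text still buffered
    return "".join(parser.chunks)


# The closing </b> and </doc> tags are missing on purpose.
sloppy = "<doc><p>Fish &amp; <b>chips</p>"
result = lenient_text(sloppy)
```

A strict XML parser such as the SAX one in the Solution would reject this input with a parse error, while the lenient parser simply returns whatever text it finds.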

An even simpler and rougher way to extract text from an XML document
is shown in Recipe 2.26.


See Also


Recipe 12.1 and Recipe 12.2 for other uses of SAX.

