Python Cookbook 2nd Edition Jun 1002005 [Electronic resources] text version

Recipe 12.3. Extracting Text from an XML Document

Credit: Paul Prescod

Problem

You need to extract only the text from an
XML document, not the tags.

Solution

Once again, subclassing SAX's
ContentHandler makes this task quite easy:

from xml.sax.handler import ContentHandler
import xml.sax
import sys
class textHandler(ContentHandler):
    def characters(self, ch):
        # ch is a Unicode string; encode it to bytes before writing
        sys.stdout.write(ch.encode("Latin-1"))
parser = xml.sax.make_parser()
handler = textHandler()
parser.setContentHandler(handler)
# Parse the file; the handler emits each piece of text as it is found
parser.parse("test.xml")

Discussion

Sometimes you want to get rid of XML tags, for example, to
re-key a document or to spell-check it. This recipe performs this
task and works with any well-formed XML document. It is quite
efficient.

In this recipe's textHandler class, we override
ContentHandler's characters method, which the parser calls for each
string of text in the XML document (excluding tags, XML comments, and
processing instructions), passing as the only argument the piece of
text as a Unicode string. We have to encode this Unicode string
before we can emit it to standard output. (See Recipe 1.22 for more
information about emitting Unicode to standard output.) In this recipe,
we're using the Latin-1 (also known as ISO-8859-1)
encoding, which covers most western European alphabets and is
supported by many popular output devices (e.g., printers and
terminal-emulation windows). However, you should use whatever
encoding is most appropriate for the documents
you're handling, as long as that
encoding is supported by the devices you need to use. The
configuration of your devices may depend on your operating
system's concepts of locale and code page.
Unfortunately, these issues vary too much between operating systems
for me to go into further detail.
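
As a minimal sketch of the same idea under Python 3 (an assumption; the recipe's own code targets Python 2), the handler below collects the pieces passed to characters instead of writing each one immediately, so that encoding happens at a single, explicit point. The names TextCollector and extract_text are hypothetical, chosen only for illustration.

```python
import xml.sax
from xml.sax.handler import ContentHandler


class TextCollector(ContentHandler):
    """Hypothetical handler that accumulates character data."""

    def __init__(self):
        super().__init__()
        self.pieces = []

    def characters(self, content):
        # The parser may deliver one run of text in several calls,
        # so append each piece and join them at the end.
        self.pieces.append(content)


def extract_text(xml_bytes):
    """Return all character data found in xml_bytes as one str."""
    handler = TextCollector()
    xml.sax.parseString(xml_bytes, handler)
    return "".join(handler.pieces)


if __name__ == "__main__":
    sample = b"<doc><!-- note --><p>Hello, <b>world</b>!</p></doc>"
    text = extract_text(sample)
    # Encode explicitly for the target device; errors="replace" keeps
    # the program running if a character is not representable in Latin-1.
    print(text.encode("latin-1", errors="replace"))
```

Collecting the text first also makes it possible to choose the encoding, or the error-handling policy, after parsing rather than inside the handler.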

A simple alternative, if you know that handling Unicode is not going
to be a problem, is to use sgmllib.
It's not quite as fast but somewhat more robust
against XML of dubious well-formedness:

from sgmllib import SGMLParser
class XMLJustText(SGMLParser):
    def handle_data(self, data):
        # Called for each run of text between tags
        print data
XMLJustText().feed(open('text.xml').read())
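
The sgmllib module exists only in Python 2. Under Python 3 (an assumption, as above), a rough analog of this lenient approach is the standard library's html.parser.HTMLParser, which also offers a handle_data hook and tolerates unbalanced tags. The sketch below is not the recipe's method: HTMLParser applies HTML rules rather than XML rules, so it is suitable only for rough text extraction. The names JustText and lenient_text are hypothetical.

```python
from html.parser import HTMLParser


class JustText(HTMLParser):
    """Hypothetical lenient extractor built on HTMLParser."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.pieces = []

    def handle_data(self, data):
        # Called for each run of text between tags.
        self.pieces.append(data)


def lenient_text(markup):
    """Return the text of markup, even if its tags are not balanced."""
    parser = JustText()
    parser.feed(markup)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.pieces)


if __name__ == "__main__":
    # The <b> element is never closed, so a SAX parser would reject this.
    print(lenient_text("<doc><p>Hello, <b>world</p></doc>"))
```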

An even simpler and rougher way to extract text from an XML document
is shown in Recipe 2.26.
