Python Cookbook 2nd Edition Jun 1002005 [Electronic resources] (text version)


Recipe 12.10. Merging Continuous Text Events with a SAX Filter

Credit: Uche Ogbuji, James Kew, Peter Cogolo

Problem

A SAX parser can report contiguous text using multiple
characters events (in practice, multiple calls to the
characters method), and this multiplicity of events for a
single text string can cause problems for SAX handlers. You want to
insert a filter into the SAX handler chain to ensure that each text
node in the document is reported as a single SAX characters event
(in practice, that the characters method is called just once).

Solution

Module xml.sax.saxutils in the standard Python library
includes a class XMLFilterBase that we can
subclass to implement any XML filter we may need:

from xml.sax.saxutils import XMLFilterBase
class text_normalize_filter(XMLFilterBase):
    """ SAX filter to ensure that contiguous text nodes are merged into one
    """
    def __init__(self, upstream, downstream):
        XMLFilterBase.__init__(self, upstream)
        self._downstream = downstream
        self._accumulator = []
    def _complete_text_node(self):
        # flush any accumulated text downstream as a single characters event
        if self._accumulator:
            self._downstream.characters(''.join(self._accumulator))
            self._accumulator = []
    def characters(self, text):
        self._accumulator.append(text)
    def ignorableWhitespace(self, ws):
        self._accumulator.append(ws)
def _wrap_complete(method_name):
    def method(self, *a, **k):
        self._complete_text_node()
        getattr(self._downstream, method_name)(*a, **k)
    # 2.4 only: method.__name__ = method_name
    setattr(text_normalize_filter, method_name, method)
for n in '''startElement startElementNS endElement endElementNS
            processingInstruction comment'''.split():
    _wrap_complete(n)
if __name__ == "__main__":
    import sys
    from xml import sax
    from xml.sax.saxutils import XMLGenerator
    parser = sax.make_parser()
    # XMLGenerator is a special predefined SAX handler that merely writes
    # SAX events back into an XML document
    downstream_handler = XMLGenerator()
    # upstream, the parser; downstream, the next handler in the chain
    filter_handler = text_normalize_filter(parser, downstream_handler)
    # The SAX filter base is designed so that the filter takes on much of the
    # interface of the parser itself, including the "parse" method
    filter_handler.parse(sys.argv[1])

Discussion

A SAX parser can report contiguous text using multiple characters
events (in practice, multiple calls to the characters method of the
downstream handler). In other words, given an XML document whose
content is 'abc', the text could technically be reported as
up to three characters events: one for the 'a', one for the 'b',
and a third for the 'c'. Such an extreme case of "fragmentation" of
a text string into multiple events is unlikely in real life, but it
is not impossible.
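To observe how text is delivered, the following minimal sketch feeds a document to a parser in two pieces and records every characters call. It assumes the default parser returned by xml.sax.make_parser supports incremental feeding (the expat-based one does), and ChunkRecorder is a hypothetical helper that is not part of the recipe. How many chunks arrive depends on the parser and its version; only their concatenation is predictable.

```python
from xml import sax
from xml.sax.handler import ContentHandler

class ChunkRecorder(ContentHandler):
    """Hypothetical helper: remembers each characters() call separately."""
    def __init__(self):
        ContentHandler.__init__(self)
        self.chunks = []
    def characters(self, content):
        self.chunks.append(content)

recorder = ChunkRecorder()
parser = sax.make_parser()
parser.setContentHandler(recorder)
# Feed the document in two pieces, splitting the text node 'abc' in the middle
parser.feed(b'<doc>a')
parser.feed(b'bc</doc>')
parser.close()

# The number of chunks is parser-dependent; their concatenation is not
print(recorder.chunks)
whole_text = ''.join(recorder.chunks)
```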

A typical reason that might cause a parser to report text nodes a bit
at a time would be buffering of the XML input source. Most low-level
parsers use a buffer of a certain number of characters that are read
and parsed at a time. If a text node straddles such a buffer
boundary, many parsers will just wrap up the current text event and
start a new one to send characters from the next buffer. If you
don't account for this behavior in your SAX
handlers, you may run into very obscure and hard-to-reproduce bugs.
Even if the parser you usually use does combine text nodes for you,
you never know when you may want to run your code in a situation
where a different parser is selected. Without a filter, you'd need to
write logic in every handler to accommodate the possibility, which can
be rather cumbersome when mixed into typical SAX-style state machine logic.
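As an illustration of that per-handler logic, here is a sketch of a handler that must buffer text by itself. TitleCollector and the title element are hypothetical names chosen for the example, and the events are sent by hand to simulate a parser that splits one text node in two.

```python
from xml.sax.handler import ContentHandler

class TitleCollector(ContentHandler):
    """Hypothetical handler: collects the text of every <title> element."""
    def __init__(self):
        ContentHandler.__init__(self)
        self.titles = []
        self._in_title = False
        self._buffer = []
    def startElement(self, name, attrs):
        if name == 'title':
            self._in_title = True
            self._buffer = []
    def characters(self, content):
        # May be called several times for one text node, so only accumulate
        if self._in_title:
            self._buffer.append(content)
    def endElement(self, name):
        if name == 'title':
            # Only now is the text node known to be complete
            self.titles.append(''.join(self._buffer))
            self._in_title = False

# Simulate a parser that splits the text 'SAX filters' across two events
collector = TitleCollector()
collector.startElement('title', {})
collector.characters('SAX fil')
collector.characters('ters')
collector.endElement('title')
print(collector.titles)
```

The extra flag and buffer exist only to cope with fragmentation; a normalizing filter placed upstream lets a handler treat each characters call as a whole text node.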

The class text_normalize_filter presented in this
recipe ensures that all text events are reported to downstream SAX
handlers in the contiguous manner that most developers would expect.
In this recipe's example case, the filter would
consolidate the three characters events into a single one for the
entire text node 'abc'.
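The consolidation can be checked without a real parser. The sketch below uses a condensed copy of the filter (element events only, written out explicitly instead of generated by a factory) and a hypothetical EventLog handler; upstream is None because the three characters events are sent by hand.

```python
from xml.sax.handler import ContentHandler
from xml.sax.saxutils import XMLFilterBase

class text_normalize_filter(XMLFilterBase):
    """Condensed version of the filter (element events only)."""
    def __init__(self, upstream, downstream):
        XMLFilterBase.__init__(self, upstream)
        self._downstream = downstream
        self._accumulator = []
    def _complete_text_node(self):
        if self._accumulator:
            self._downstream.characters(''.join(self._accumulator))
            self._accumulator = []
    def characters(self, text):
        self._accumulator.append(text)
    def startElement(self, name, attrs):
        self._complete_text_node()
        self._downstream.startElement(name, attrs)
    def endElement(self, name):
        self._complete_text_node()
        self._downstream.endElement(name)

class EventLog(ContentHandler):
    """Hypothetical downstream handler that records the events it receives."""
    def __init__(self):
        ContentHandler.__init__(self)
        self.events = []
    def startElement(self, name, attrs):
        self.events.append(('start', name))
    def characters(self, content):
        self.events.append(('text', content))
    def endElement(self, name):
        self.events.append(('end', name))

log = EventLog()
# No real parser here: upstream is None and the events are sent by hand
flt = text_normalize_filter(None, log)
flt.startElement('doc', {})
for piece in 'abc':          # three separate characters events
    flt.characters(piece)
flt.endElement('doc')
print(log.events)
```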

For more information on SAX filters in general, see my article
"Tip: SAX filters for flexible
processing," http://www-106.ibm.com/developerworks/xml/library/x-tipsaxflexl.

Python's XMLGenerator does not do
anything with processing instructions, so, if you run the main code
presented in this recipe on an XML document that uses them,
you'll have a gap in the output, along with other
minor deviations between input and output. Comments are similar but
worse, because XMLFilterBase does not even filter
them; if you do need to get comments, your
text_normalize_filter class must multiply inherit
from xml.sax.saxlib.LexicalHandler (part of the PyXML package; recent
versions of the standard library offer xml.sax.handler.LexicalHandler
instead), as well as from xml.sax.saxutils.XMLFilterBase, and it must
override the parse method as follows:

    def parse(self, source):
        # force connection of self as the lexical handler
        self._parent.setProperty(property_lexical_handler, self)
        # Delegate to XMLFilterBase for the rest
        XMLFilterBase.parse(self, source)

This code is hairy enough, using the
"internal" attribute
self._parent, and the need to deal properly with
XML comments is rare enough, to make this addition somewhat doubtful,
which is why it is not part of this recipe's
Solution.
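For readers who do need comments, the property mechanism that such a parse override relies on can be tried in isolation. This is a sketch under assumptions: it uses the standard library's expat-based parser, takes property_lexical_handler from xml.sax.handler, and registers a hypothetical CommentLog object directly on the parser rather than going through the filter.

```python
import io
from xml import sax
from xml.sax.handler import ContentHandler, property_lexical_handler

class CommentLog(object):
    """Hypothetical lexical handler: records comments, ignores the rest."""
    def __init__(self):
        self.comments = []
    def comment(self, content):
        self.comments.append(content)
    # The remaining lexical events are accepted but ignored
    def startDTD(self, name, public_id, system_id):
        pass
    def endDTD(self):
        pass
    def startCDATA(self):
        pass
    def endCDATA(self):
        pass

comment_log = CommentLog()
parser = sax.make_parser()
parser.setContentHandler(ContentHandler())
# Comments travel through a parser property, not through the content handler
parser.setProperty(property_lexical_handler, comment_log)
parser.parse(io.BytesIO(b'<doc><!-- a note -->abc</doc>'))
print(comment_log.comments)
```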

If you need ease of chaining to other filters, you may prefer not to
take both upstream and downstream parameters in __init__. In this case,
keep the same signature as XMLFilterBase.__init__:

    def __init__(self, parent):
        XMLFilterBase.__init__(self, parent)
        self._accumulator = []

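and change the _wrap_complete factory function so
that the wrapper, rather than calling methods on the downstream
handler directly, delegates to the default implementations in
XMLFilterBase, which in turn call out to handlers
that have been set on the filter with such methods as
setContentHandler and the like. Since self._downstream no longer
exists in this variant, _complete_text_node must likewise emit the
merged text through XMLFilterBase.characters(self, ...) rather than
through self._downstream.characters: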

def _wrap_complete(method_name):
    def method(self, *a, **k):
        self._complete_text_node()
        getattr(XMLFilterBase, method_name)(self, *a, **k)
    # 2.4 only: method.__name__ = method_name
    setattr(text_normalize_filter, method_name, method)

This is slightly less convenient for the typical simple case, but it
pays back this inconvenience by letting you easily chain filters:

parser = sax.make_parser()
filtered_parser = text_normalize_filter(some_other_filter(parser))

as well as letting you use a filter in contexts that call the
parse method on your behalf:

doc = xml.dom.minidom.parse(input_file, parser=filtered_parser)
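Putting the chainable variant together, the following sketch runs a real parse through two filters. fragmenting_filter is a hypothetical upstream filter invented for the demonstration (it deliberately re-emits text one character at a time), TextLog is a hypothetical recording handler, and routing the merged text through XMLFilterBase.characters is an assumption about how _complete_text_node should be adapted when there is no downstream attribute.

```python
import io
from xml import sax
from xml.sax.handler import ContentHandler
from xml.sax.saxutils import XMLFilterBase

class fragmenting_filter(XMLFilterBase):
    """Hypothetical upstream filter: re-emits text one character at a time."""
    def characters(self, content):
        for ch in content:
            XMLFilterBase.characters(self, ch)

class text_normalize_filter(XMLFilterBase):
    """Chainable variant: same constructor signature as XMLFilterBase."""
    def __init__(self, parent=None):
        XMLFilterBase.__init__(self, parent)
        self._accumulator = []
    def _complete_text_node(self):
        if self._accumulator:
            # Assumption: forward through XMLFilterBase to the content handler
            XMLFilterBase.characters(self, ''.join(self._accumulator))
            self._accumulator = []
    def characters(self, text):
        self._accumulator.append(text)
    def ignorableWhitespace(self, ws):
        self._accumulator.append(ws)

def _wrap_complete(method_name):
    def method(self, *a, **k):
        self._complete_text_node()
        getattr(XMLFilterBase, method_name)(self, *a, **k)
    method.__name__ = method_name
    setattr(text_normalize_filter, method_name, method)

for n in ('startElement', 'startElementNS', 'endElement',
          'endElementNS', 'processingInstruction'):
    _wrap_complete(n)

class TextLog(ContentHandler):
    """Hypothetical final handler: records each characters() call."""
    def __init__(self):
        ContentHandler.__init__(self)
        self.chunks = []
    def characters(self, content):
        self.chunks.append(content)

recorder = TextLog()
parser = sax.make_parser()
filtered_parser = text_normalize_filter(fragmenting_filter(parser))
filtered_parser.setContentHandler(recorder)
filtered_parser.parse(io.BytesIO(b'<doc>abc</doc>'))
print(recorder.chunks)
```

Because each filter exposes the parser interface, the handler at the end of the chain is attached with setContentHandler, exactly as it would be on a bare parser.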
