Recipe 12.6. Removing Whitespace-only Text Nodes from an XML DOM Node's Subtree
Credit: Brian Quinlan, David Wilson
Problem
You want to remove, from the DOM representation of an XML document,
all the text nodes within a subtree that contain only whitespace.
Solution
XML parsers consider several complex conditions when deciding which
whitespace-only text nodes to preserve during DOM construction.
Unfortunately, the result is often not what you want, so
it's helpful to have a function to remove all
whitespace-only text nodes from among a given node's
descendants:
from xml import dom

def remove_whitespace_nodes(node):
    """ Removes all of the whitespace-only text descendants of a DOM node. """
    # prepare the list of text nodes to remove (and recurse when needed)
    remove_list = []
    for child in node.childNodes:
        if child.nodeType == dom.Node.TEXT_NODE and not child.data.strip():
            # add this text node to the to-be-removed list
            remove_list.append(child)
        elif child.hasChildNodes():
            # recurse, it's the simplest way to deal with the subtree
            remove_whitespace_nodes(child)
    # perform the removals only after the loop over childNodes is done
    for text_node in remove_list:
        text_node.parentNode.removeChild(text_node)
        # release the removed node's references to the rest of the tree
        text_node.unlink()
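As a quick illustration, the following sketch parses a small, made-up document with xml.dom.minidom, counts the whitespace-only text nodes the parser kept, and then applies the recipe's function. The sample document and the helper name count_blank_text are assumptions introduced for this example; the function body is repeated only so that the snippet runs on its own.

```python
from xml import dom
from xml.dom import minidom

def remove_whitespace_nodes(node):
    """Remove all whitespace-only text descendants of a DOM node."""
    remove_list = []
    for child in node.childNodes:
        if child.nodeType == dom.Node.TEXT_NODE and not child.data.strip():
            remove_list.append(child)
        elif child.hasChildNodes():
            remove_whitespace_nodes(child)
    for text_node in remove_list:
        text_node.parentNode.removeChild(text_node)
        text_node.unlink()

def count_blank_text(node):
    """Hypothetical helper: count whitespace-only text nodes below node."""
    total = 0
    for child in node.childNodes:
        if child.nodeType == dom.Node.TEXT_NODE and not child.data.strip():
            total += 1
        total += count_blank_text(child)
    return total

# Made-up sample: the indentation between tags becomes text nodes in the DOM.
SAMPLE = """<recipes>
  <recipe id="12.6">
    <title>Removing whitespace</title>
  </recipe>
</recipes>"""

doc = minidom.parseString(SAMPLE)
before = count_blank_text(doc)
remove_whitespace_nodes(doc)
after = count_blank_text(doc)
result = doc.documentElement.toxml()

print("whitespace-only text nodes before:", before)
print("whitespace-only text nodes after: ", after)
print(result)
```

Text nodes that contain anything besides whitespace, such as the title's content here, are left alone, because the test not child.data.strip() is true only for whitespace-only data.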
Discussion
This recipe's code works with any correctly
implemented Python XML DOM, including the
xml.dom.minidom that is part of the Python
Standard Library and the more complete DOM implementation that comes
with PyXML.

The implementation of function
remove_whitespace_nodes is quite simple but rather
instructive: in the first for loop we build a list
of all child nodes to remove, and then in a second, separate loop we
do the removal. This precaution is a good example of a general rule
in Python: do not alter the very container you're
looping on. Sometimes you can get away with it, but it is unwise
to count on it in the general case. On the other hand, the function
can perfectly well call itself recursively within its first
for loop because such a call does
not alter the very list
node.childNodes on which the loop is iterating (it
may alter some items in that list, but it does
not alter the list object itself).
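To see why the two-loop precaution matters, here is a minimal sketch that contrasts a one-loop variant, which removes children while iterating over node.childNodes, with a variant that iterates over a copy of the list. The names naive_remove, snapshot_remove, and is_blank_text are hypothetical and not part of the recipe, and the behavior shown assumes xml.dom.minidom, where childNodes is a list subclass; other DOM implementations may behave differently.

```python
from xml.dom import Node, minidom

def is_blank_text(node):
    """True for a text node whose data is empty or only whitespace."""
    return node.nodeType == Node.TEXT_NODE and not node.data.strip()

def naive_remove(node):
    """Unsafe variant: mutates node.childNodes while looping on it."""
    for child in node.childNodes:
        if is_blank_text(child):
            # Each removal shifts the remaining children left by one, so
            # the loop silently skips the sibling that follows.
            node.removeChild(child)
        elif child.hasChildNodes():
            naive_remove(child)

def snapshot_remove(node):
    """Safe variant: loops on a copy, so removals cannot disturb the loop."""
    for child in list(node.childNodes):
        if is_blank_text(child):
            node.removeChild(child)
            child.unlink()
        elif child.hasChildNodes():
            snapshot_remove(child)

SOURCE = "<a> <b> </b> </a>"

naive_doc = minidom.parseString(SOURCE)
naive_remove(naive_doc.documentElement)
naive_result = naive_doc.documentElement.toxml()

safe_doc = minidom.parseString(SOURCE)
snapshot_remove(safe_doc.documentElement)
safe_result = safe_doc.documentElement.toxml()

print("naive:   ", naive_result)
print("snapshot:", safe_result)
```

In the naive variant, removing the first whitespace node makes the loop skip the b element, so the whitespace inside b is never visited. Iterating over list(node.childNodes) is an alternative to the recipe's separate removal list; both keep the container being looped on unchanged during the loop.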
See Also
Library Reference and Python in a
Nutshell document the built-in XML support in the Python
Standard Library.