Python Cookbook 2Nd Edition Jun 1002005 [Electronic resources] نسخه متنی

با بیش از 100000 منبع الکترونیکی رایگان به زبان فارسی ، عربی و انگلیسی

Recipe 12.7. Parsing Microsoft Excel's XML

Credit: Thomas Guettler

Problem

You have Microsoft Excel spreadsheets saved in
XML form, and want to parse them into memory as Python nested lists.

Solution

The XML form of Excel spreadsheets is quite simple: all text is in
Cell tags, which are nested in
Row tags nested in Table tags.
SAX makes it quite simple to parse this kind of XML into memory:

import sys
from xml.sax import saxutils, parse
class ExcelHandler(saxutils.DefaultHandler):
def _ _init_ _(self):
self.chars = [  ]
self.cells = [  ]
self.rows = [  ]
self.tables = [  ]
def characters(self, content):
self.chars.append(content)
def startElement(self, name, atts):
if name=="Cell":
self.chars = [  ]
elif name=="Row":
self.cells=[  ]
elif name=="Table":
self.rows = [  ]
def endElement(self, name):
if name=="Cell":
self.cells.append(''.join(self.chars))
elif name=="Row":
self.rows.append(self.cells)
elif name=="Table":
self.tables.append(self.rows)
if _ _name_ _ == '_ _main_ _':
excelHandler = ExcelHandler( )
parse(sys.argv[1], excelHandler)
print excelHandler.tables

Discussion

The structure of the parser presented in this recipe is pleasingly
simple: at each of three logical nesting levels of data, we collect
content into a list. Each time a tag of a given level begins, we
start with an empty list for it; each time the tag ends, we append
the tag's contents to the list of the next upper
level. The net result is that the top-level list, the one named
tables, accumulates all of the
spreadsheet's contents with the proper structure (a
triply nested list). At the lowest level, of course, we join all the
text strings that are reported as being within the same cell into a
single cell content text string, when we accumulate, because the
division between the various strings is just an artefact of the XML
parsing process.

For example, consider a tiny spreadsheet with one column and three
rows, where the first two rows each hold the number
2 and the third one holds the number
4 obtained by summing the numbers in the first two
rows with an Excel formula. The relevant snippet of the Excel XML
output (XMLSS format, as Microsoft calls it) is then:

<Table ss:ExpandedColumnCount="1" ss:ExpandedRowCount="3"
x:FullColumns="1" x:FullRows="1">
<Row>
<Cell><Data ss:Type="Number">2</Data></Cell>
</Row>
<Row>
<Cell><Data ss:Type="Number">2</Data></Cell>
</Row>
<Row>
<Cell ss:Formula="=SUM(R[-2]C, R[-1]C)">
<Data ss:Type="Number">4</Data></Cell>
</Row>
</Table>

and running the script in this recipe over this file emits:

[[[u'2'], [u'2'], [u'4']]]

As you can see, the XMLSS file also contains a lot of supplementary
information that this recipe is not collectingthe attributes
hold information about the type of data (number or string), the
formula used for the computation (if any), and so on. If you need any
or all of this supplemental information, it's not
hard to enrich this recipe to record and use it.

Python Cookbook 2Nd Edition Jun 1002005 [Electronic resources] نسخه متنی

فارسی

کردی

العربیه

اردو

Türkçe

Русский

English

Français

کانال فیلم من

تبیان من

فایلهای من

کتابخانه من

پنل پیامکی

وبلاگ من

اینجــــا یک کتابخانه دیجیتالی است

با بیش از 100000 منبع الکترونیکی رایگان به زبان فارسی ، عربی و انگلیسی