Python Cookbook 2Nd Edition Jun 1002005 [Electronic resources] نسخه متنی

اینجــــا یک کتابخانه دیجیتالی است

با بیش از 100000 منبع الکترونیکی رایگان به زبان فارسی ، عربی و انگلیسی

Python Cookbook 2Nd Edition Jun 1002005 [Electronic resources] - نسخه متنی

David Ascher, Alex Martelli, Anna Ravenscroft

| نمايش فراداده ، افزودن یک نقد و بررسی
افزودن به کتابخانه شخصی
ارسال به دوستان
جستجو در متن کتاب
بیشتر
تنظیمات قلم

فونت

اندازه قلم

+ - پیش فرض

حالت نمایش

روز نیمروز شب
جستجو در لغت نامه
بیشتر
لیست موضوعات
توضیحات
افزودن یادداشت جدید







Recipe 12.7. Parsing Microsoft Excel's XML


Credit: Thomas Guettler


Problem




You have Microsoft Excel spreadsheets saved in
XML form, and want to parse them into memory as Python nested lists.


Solution


The XML form of Excel spreadsheets is quite simple: all text is in
Cell tags, which are nested in
Row tags nested in Table tags.
SAX makes it quite simple to parse this kind of XML into memory:

import sys
from xml.sax import saxutils, parse
class ExcelHandler(saxutils.DefaultHandler):
def _ _init_ _(self):
self.chars = [ ]
self.cells = [ ]
self.rows = [ ]
self.tables = [ ]
def characters(self, content):
self.chars.append(content)
def startElement(self, name, atts):
if name=="Cell":
self.chars = [ ]
elif name=="Row":
self.cells=[ ]
elif name=="Table":
self.rows = [ ]
def endElement(self, name):
if name=="Cell":
self.cells.append(''.join(self.chars))
elif name=="Row":
self.rows.append(self.cells)
elif name=="Table":
self.tables.append(self.rows)
if _ _name_ _ == '_ _main_ _':
excelHandler = ExcelHandler( )
parse(sys.argv[1], excelHandler)
print excelHandler.tables


Discussion


The structure of the parser presented in this recipe is pleasingly
simple: at each of three logical nesting levels of data, we collect
content into a list. Each time a tag of a given level begins, we
start with an empty list for it; each time the tag ends, we append
the tag's contents to the list of the next upper
level. The net result is that the top-level list, the one named
tables, accumulates all of the
spreadsheet's contents with the proper structure (a
triply nested list). At the lowest level, of course, we join all the
text strings that are reported as being within the same cell into a
single cell content text string, when we accumulate, because the
division between the various strings is just an artefact of the XML
parsing process.

For example, consider a tiny spreadsheet with one column and three
rows, where the first two rows each hold the number
2 and the third one holds the number
4 obtained by summing the numbers in the first two
rows with an Excel formula. The relevant snippet of the Excel XML
output (XMLSS format, as Microsoft calls it) is then:

<Table ss:ExpandedColumnCount="1" ss:ExpandedRowCount="3"
x:FullColumns="1" x:FullRows="1">
<Row>
<Cell><Data ss:Type="Number">2</Data></Cell>
</Row>
<Row>
<Cell><Data ss:Type="Number">2</Data></Cell>
</Row>
<Row>
<Cell ss:Formula="=SUM(R[-2]C, R[-1]C)">
<Data ss:Type="Number">4</Data></Cell>
</Row>
</Table>

and running the script in this recipe over this file emits:

[[[u'2'], [u'2'], [u'4']]]

As you can see, the XMLSS file also contains a lot of supplementary
information that this recipe is not collectingthe attributes
hold information about the type of data (number or string), the
formula used for the computation (if any), and so on. If you need any
or all of this supplemental information, it's not
hard to enrich this recipe to record and use it.


See Also


Library Reference and Python in a
Nutshell
document the built-in XML support in the Python
Standard Library and SAX in particular.


/ 394