Python Cookbook 2nd Edition Jun 1002005 [Electronic resources], text version


David Ascher, Alex Martelli, Anna Ravenscroft


Recipe 2.26. Extracting Text from OpenOffice.org Documents


Credit: Dirk Holtwick



Problem


You need to extract the text content (with or without the attending
XML markup) from an OpenOffice.org document.


Solution


An OpenOffice.org document is just a
zip file that aggregates XML documents according
to a well-documented standard. To access our precious data, we
don't even need to have
OpenOffice.org installed:

import zipfile, re

# Matches any XML tag, opening or closing, from "<" to the next ">".
rx_stripxml = re.compile("<[^>]*?>", re.DOTALL|re.MULTILINE)

def convert_OO(filename, want_text=True):
    """ Convert an OpenOffice.org document to XML or text. """
    zf = zipfile.ZipFile(filename, "r")
    # Under Python 3, read returns bytes; decoding them as UTF-8 is an
    # assumption (UTF-8 is the default encoding for XML).
    data = zf.read("content.xml").decode("utf-8")
    zf.close()
    if want_text:
        # Replace each tag with a space, then collapse whitespace runs.
        data = " ".join(rx_stripxml.sub(" ", data).split())
    return data

if __name__ == "__main__":
    import sys
    if len(sys.argv) > 1:
        for docname in sys.argv[1:]:
            print('Text of', docname, ':')
            print(convert_OO(docname))
            print('XML of', docname, ':')
            print(convert_OO(docname, want_text=False))
    else:
        print('Call with paths to OO.o doc files to see Text and XML forms.')
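
Because the document is an ordinary zip archive, we can also look inside it before extracting anything. The following is a minimal sketch, not part of the original recipe: the helper names list_members and make_stand_in_document are hypothetical, and the in-memory archive is only a stand-in with made-up contents so that the example runs without a real document at hand.

```python
import io
import zipfile

def list_members(path_or_file):
    """Return the names of all files stored in a zip-based document."""
    with zipfile.ZipFile(path_or_file, "r") as zf:
        return zf.namelist()

def make_stand_in_document():
    """Build a tiny in-memory zip standing in for a real document.

    The member contents are invented for illustration only.
    """
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("content.xml", "<doc><p>Hello</p></doc>")
        zf.writestr("styles.xml", "<styles/>")
    buf.seek(0)
    return buf

members = list_members(make_stand_in_document())
print(members)
print("content.xml" in members)
```

With a real document, passing its path to list_members in place of the stand-in should show content.xml among the other members.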


Discussion


OpenOffice.org documents are
zip files, and in addition to other contents,
they always contain the file content.xml. This
recipe's job, therefore, essentially boils down to
just extracting this file. By default, the recipe then throws away
XML tags with a simple regular expression, splits the result by
whitespace, and joins it up again with a single blank to save space.
Of course, we could use an XML parser to get information in a vastly
richer and more structured way, but if all we need is the rough
textual content, this fast, rough-and-ready approach may suffice.
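
For comparison, here is one possible sketch of the parser-based route, using the standard library module xml.etree.ElementTree. It is not the recipe's method: the function name text_via_parser and the sample archive are assumptions made for illustration. One practical difference is that a parser decodes entity references such as &amp;, which the regular expression approach leaves untouched.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

def text_via_parser(path_or_file):
    """Extract the text of content.xml with an XML parser, not a regex."""
    with zipfile.ZipFile(path_or_file, "r") as zf:
        root = ET.fromstring(zf.read("content.xml"))
    # itertext() yields every text node in document order; the parser
    # has already decoded entity references at this point.
    pieces = " ".join(root.itertext())
    # Same whitespace normalization as in the recipe: split and rejoin.
    return " ".join(pieces.split())

def make_sample():
    """Build an in-memory stand-in document (contents are invented)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("content.xml",
                    "<doc><p>Fish &amp; chips</p><p>second   para</p></doc>")
    buf.seek(0)
    return buf

print(text_via_parser(make_sample()))  # Fish & chips second para
```

From here, a parser also makes it possible to walk individual elements rather than flattening everything into one string, which is where the richer structure mentioned above comes from.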

Specifically, the regular expression rx_stripxml
matches any XML tag (opening or closing) from the leading
< to the terminating >.
Inside function convert_OO, in the statements
guarded by if want_text, we use that regular
expression to change every XML tag into a space, then normalize
whitespace by splitting (i.e., calling the string method
split, which splits on any sequence of
whitespace), and rejoining (with " ".join, to use a single blank
character as the
joiner). Essentially, this split-and-rejoin process changes any
sequence of whitespace into a single blank character. More advanced
ways to extract all text from an XML document are shown in Recipe 12.3.
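
The two steps just described can be tried on a small string without any document at all. This is a sketch for illustration: the helper name strip_and_normalize and the sample markup are assumptions, while the regular expression is the one from the Solution.

```python
import re

# Same pattern as in the Solution: any tag from "<" to the next ">".
rx_stripxml = re.compile("<[^>]*?>", re.DOTALL | re.MULTILINE)

def strip_and_normalize(xml_text):
    """Turn each tag into a space, then collapse all whitespace runs."""
    spaced = rx_stripxml.sub(" ", xml_text)
    # split() with no argument splits on any run of whitespace, so
    # rejoining with a single blank normalizes newlines and tabs too.
    return " ".join(spaced.split())

sample = "<p>Hello,\n   <b>wide</b>\t world</p>"
print(strip_and_normalize(sample))  # Hello, wide world
```

Note that the pattern stops at the first > it meets, so a tag whose attribute value contains a literal > would be cut short and leave debris in the output; this is part of the price of the rough-and-ready approach.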


See Also


Library Reference docs on modules
zipfile and re;
OpenOffice.org's web site, http://www.openoffice.org/; Recipe 12.3.

