Python Cookbook 2Nd Edition Jun 1002005 [Electronic resources] (text version)

Recipe 2.26. Extracting Text from OpenOffice.org Documents

Credit: Dirk Holtwick

Problem

You need to extract the text content (with or without the attending
XML markup) from an OpenOffice.org document.

Solution

An OpenOffice.org document is just a zip file that aggregates XML
documents according to a well-documented standard. To access our
precious data, we don't even need to have OpenOffice.org installed:

import zipfile, re

# Matches any XML tag, opening or closing, from "<" to the next ">".
rx_stripxml = re.compile("<[^>]*?>", re.DOTALL|re.MULTILINE)

def convert_OO(filename, want_text=True):
    """ Convert an OpenOffice.org document to XML or text. """
    # ZipFile.read returns bytes in Python 3; decode them as UTF-8,
    # the encoding OpenOffice.org is assumed to use for content.xml.
    with zipfile.ZipFile(filename, "r") as zf:
        data = zf.read("content.xml").decode("utf-8")
    if want_text:
        # Turn each tag into a space, then collapse runs of whitespace.
        data = " ".join(rx_stripxml.sub(" ", data).split())
    return data

if __name__ == "__main__":
    import sys
    if len(sys.argv) > 1:
        for docname in sys.argv[1:]:
            print('Text of', docname, ':')
            print(convert_OO(docname))
            print('XML of', docname, ':')
            print(convert_OO(docname, want_text=False))
    else:
        print('Call with paths to OO.o doc files to see Text and XML forms.')
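
To check the claim that such a document is an ordinary zip archive without having a real one at hand, a sketch along the following lines may help. It builds a small stand-in archive in memory and lists its members. The helper names make_fake_document and list_members are hypothetical, and the stand-in holds only a toy content.xml and meta.xml, far less than a real document contains.

```python
import io
import zipfile

def make_fake_document():
    """Build an in-memory zip that mimics a document's layout (toy data only)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("content.xml", "<doc><p>Hello</p><p>world</p></doc>")
        zf.writestr("meta.xml", "<meta/>")
    buf.seek(0)
    return buf

def list_members(file_or_path):
    """Return the names of the files stored inside the archive."""
    with zipfile.ZipFile(file_or_path, "r") as zf:
        return zf.namelist()

if __name__ == "__main__":
    print(list_members(make_fake_document()))
```

Because zipfile.ZipFile accepts a file-like object as well as a path, the same stand-in can be passed wherever a filename is expected, which makes it convenient for quick experiments.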

Discussion

OpenOffice.org documents are zip files, and in addition to other
contents, they always contain the file content.xml. This recipe's job,
therefore, essentially boils down to just extracting this file. By
default, the recipe then throws away XML tags with a simple regular
expression, splits the result by whitespace, and joins it up again
with a single blank to save space. Of course, we could use an XML
parser to get information in a vastly richer and more structured way,
but if all we need is the rough textual content, this fast,
rough-and-ready approach may suffice.
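
As an illustration of the XML-parser alternative mentioned above, here is a minimal sketch that uses the standard library's xml.etree.ElementTree to return the text paragraph by paragraph. It is not part of the recipe. The function name paragraphs_from_content is hypothetical, and the sketch assumes that paragraphs and headings are elements whose local names are p and h, matching on the local name only so that it does not depend on a particular namespace URI. The namespace URIs in the demo data are made up.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

def paragraphs_from_content(file_or_path):
    """Return the text of each paragraph-like element found in content.xml."""
    with zipfile.ZipFile(file_or_path, "r") as zf:
        root = ET.fromstring(zf.read("content.xml"))
    paragraphs = []
    for elem in root.iter():
        # Namespaced tags look like "{uri}p"; keep only the local name.
        local_name = elem.tag.rsplit("}", 1)[-1]
        if local_name in ("p", "h"):  # assumption: paragraph and heading tags
            paragraphs.append("".join(elem.itertext()))
    return paragraphs

if __name__ == "__main__":
    # Toy stand-in document; the namespace URIs are invented for the demo.
    xml = ('<office:document xmlns:office="urn:example:office" '
           'xmlns:text="urn:example:text">'
           '<text:h>Title</text:h>'
           '<text:p>Hello <text:span>big</text:span> world</text:p>'
           '</office:document>')
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("content.xml", xml)
    buf.seek(0)
    print(paragraphs_from_content(buf))
```

Unlike the regular-expression approach, this keeps paragraph boundaries and decodes entity references, at the cost of parsing the whole document. If paragraph elements were ever nested inside one another, their text would appear more than once in the result.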

Specifically, the regular expression rx_stripxml matches any XML tag
(opening or closing) from the leading < to the terminating >. Inside
function convert_OO, in the statements guarded by if want_text, we use
that regular expression to change every XML tag into a space, then
normalize whitespace by splitting (i.e., calling the string method
split, which splits on any sequence of whitespace), and rejoining (with
" ".join, to use a single blank character as the joiner). Essentially,
this split-and-rejoin process changes any sequence of whitespace into a
single blank character. More advanced ways to extract all text from an
XML document are shown in Recipe 12.3.
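
The tag-stripping and whitespace-normalizing steps described above can be tried in isolation on a small string. The sketch below reuses the recipe's rx_stripxml pattern. The wrapper name strip_tags and the sample markup are hypothetical, introduced only for the demonstration.

```python
import re

# Same pattern as in the recipe: any tag from "<" to the next ">".
rx_stripxml = re.compile("<[^>]*?>", re.DOTALL | re.MULTILINE)

def strip_tags(xml_text):
    """Replace every tag with a space, then collapse whitespace runs."""
    spaced = rx_stripxml.sub(" ", xml_text)
    # split() with no argument splits on any run of whitespace,
    # so joining with a single blank normalizes the spacing.
    return " ".join(spaced.split())

if __name__ == "__main__":
    sample = "<doc>\n  <p>Hello,   <b>brave</b>\nnew</p><p>world</p>\n</doc>"
    print(strip_tags(sample))  # prints: Hello, brave new world
```

Note that the pattern only removes tags: entity references such as &amp; are left in the output exactly as they appear in the XML.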
