Python Cookbook 2nd Edition Jun 1002005 [Electronic resources] (text version)


David Ascher, Alex Martelli, Anna Ravenscroft


Recipe 2.27. Extracting Text from Microsoft Word Documents


Credit: Simon Brunning, Pavel Kosina


Problem




You want to extract the text
content from each Microsoft Word document in a directory tree on
Windows into a corresponding text file.


Solution


With the PyWin32 extension, we can access Word itself, through COM,
to perform the conversion:

import fnmatch, os, sys, win32com.client
wordapp = win32com.client.gencache.EnsureDispatch("Word.Application")
try:
    for path, dirs, files in os.walk(sys.argv[1]):
        for filename in files:
            if not fnmatch.fnmatch(filename, '*.doc'): continue
            doc = os.path.abspath(os.path.join(path, filename))
            print("processing %s" % doc)
            wordapp.Documents.Open(doc)
            # same path, with the 'doc' extension replaced by 'txt'
            docastxt = doc[:-3] + 'txt'
            wordapp.ActiveDocument.SaveAs(docastxt,
                FileFormat=win32com.client.constants.wdFormatText)
            wordapp.ActiveDocument.Close()
finally:
    # ensure Word is properly shut down even if we get an exception
    wordapp.Quit()


Discussion


A useful aspect of most Windows applications is that you can script
them via COM, and the PyWin32 extension makes it fairly easy to
perform COM scripting from Python. The extension enables you to write
Python scripts to perform many kinds of Windows tasks. The script in
this recipe's Solution drives Microsoft Word to extract the text from
every .doc file in a directory tree into a corresponding .txt text
file. Using the os.walk function, we can access every subdirectory
in a tree with a simple for statement, without recursion. With the
fnmatch.fnmatch function, we can check a filename to determine whether
it matches an appropriate wildcard, here '*.doc'. Once we have
determined the name of a Word document file, we process that name with
functions from os.path to turn it into a complete absolute path, and
have Word open it, save it as text, and close it again.
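
The directory-walking part of the script can be tried on its own,
without Word or PyWin32. The following is a minimal sketch, not part
of the recipe: the helper names find_word_docs and text_path_for are
hypothetical, and os.path.splitext is swapped in for the Solution's
doc[:-3] slicing so that the extension is replaced whatever its length.

```python
import fnmatch
import os


def find_word_docs(root, pattern='*.doc'):
    """Yield the absolute path of every file under root matching pattern."""
    # os.walk visits root and each of its subdirectories in turn
    for path, dirs, files in os.walk(root):
        for filename in files:
            if fnmatch.fnmatch(filename, pattern):
                yield os.path.abspath(os.path.join(path, filename))


def text_path_for(doc):
    """Return the path of the .txt file corresponding to doc."""
    # hypothetical helper: drop the existing extension, then add '.txt'
    base, ext = os.path.splitext(doc)
    return base + '.txt'
```

Looping over find_word_docs(sys.argv[1]) and passing each result to
text_path_for gives the same pairs of input and output paths that the
Solution computes inline. Note that fnmatch.fnmatch follows the
platform's case rules, so on Windows '*.doc' also matches names
ending in .DOC.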

If you don't have Word, you may need to take a
completely different approach. One possibility is to use
OpenOffice.org, which is able to load Word documents. Another is to
use a program specifically designed to read Word documents, such as
Antiword, found at http://www.winfield.demon.nl/. However, we
have not explored these alternative options.
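
As a possible starting point for the Antiword route, here is a rough
sketch. It rests on assumptions that are not part of the recipe: that
the antiword command-line program is installed and on the PATH, and
that running it with a single document path writes the extracted text
to standard output. The helper names antiword_command and doc_to_text
are hypothetical.

```python
import subprocess


def antiword_command(doc, program='antiword'):
    """Build the argument list for running Antiword on one document."""
    # assumption: the program takes the document path as its only argument
    return [program, doc]


def doc_to_text(doc, program='antiword'):
    """Run the program on doc and return its standard output as bytes."""
    # check_output raises CalledProcessError if the program exits with an error
    return subprocess.check_output(antiword_command(doc, program))
```

The returned bytes could then be written to the corresponding .txt
file, in place of the SaveAs call that the Solution makes through COM.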


See Also


Mark Hammond, Andy Robinson, Python Programming on Win32 (O'Reilly),
for documentation on PyWin32; http://msdn.microsoft.com, for
Microsoft's documentation of the object model of Microsoft Word;
Library Reference and Python in a Nutshell sections on modules
fnmatch and os.path, and function os.walk.

