Recipe 2.27. Extracting Text from Microsoft Word Documents
Credit: Simon Brunning, Pavel Kosina
Problem
You want to extract the text
content from each Microsoft Word document in a directory tree on
Windows into a corresponding text file.
Solution
With the PyWin32 extension, we can access Word itself, through COM,
to perform the conversion:
import fnmatch, os, sys, win32com.client
wordapp = win32com.client.gencache.EnsureDispatch("Word.Application")
try:
    for path, dirs, files in os.walk(sys.argv[1]):
        for filename in files:
            if not fnmatch.fnmatch(filename, '*.doc'): continue
            doc = os.path.abspath(os.path.join(path, filename))
            print("processing %s" % doc)
            wordapp.Documents.Open(doc)
            # replace the trailing 'doc' of the extension with 'txt'
            docastxt = doc[:-3] + 'txt'
            wordapp.ActiveDocument.SaveAs(docastxt,
                FileFormat=win32com.client.constants.wdFormatText)
            wordapp.ActiveDocument.Close()
finally:
    # ensure Word is properly shut down even if we get an exception
    wordapp.Quit()
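The file-selection part of the script above can be tried on its own, on a machine without Word or PyWin32. The following is a minimal sketch, not part of the original recipe: the function name find_word_documents is hypothetical, and it uses os.path.splitext rather than slicing to build the output name, so the result does not depend on the extension being exactly three characters long.

```python
import fnmatch
import os


def find_word_documents(root, pattern='*.doc'):
    """Yield (source, target) absolute path pairs for matching files under root.

    Hypothetical helper for illustration: it only computes paths and never
    opens or converts any document.
    """
    for path, dirs, files in os.walk(root):
        for filename in files:
            if not fnmatch.fnmatch(filename, pattern):
                continue
            source = os.path.abspath(os.path.join(path, filename))
            # splitext drops the extension whatever its length
            target = os.path.splitext(source)[0] + '.txt'
            yield source, target
```

Looping over find_word_documents(sys.argv[1]) and printing each pair shows which files the Word-driven script would open and where it would save their text.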
Discussion
A useful aspect of most Windows applications is that you can script
them via COM, and the PyWin32 extension makes it fairly easy to
perform COM scripting from Python. The extension enables you to write
Python scripts to perform many kinds of Windows tasks. The script in
this recipe's Solution drives Microsoft Word to
extract the text from every .doc file in a
directory tree into a
corresponding .txt text file. Using the
os.walk function, we can access every subdirectory
in a tree with a simple for statement, without
recursion. With the fnmatch.fnmatch function, we
can check a filename to determine whether it matches an appropriate
wildcard, here '*.doc'. Once we have determined
the name of a Word document file, we process that name with functions
from os.path to turn it into a complete absolute
path, and have Word open it, save it as text, and close it again.

If you don't have Word, you may need to take a
completely different approach. One possibility is to use
OpenOffice.org, which is able to load Word documents. Another is to
use a program specifically designed to read Word documents, such as
Antiword, found at http://www.winfield.demon.nl/. However, we
have not explored these alternative options.
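As a rough illustration of the Antiword route, the sketch below runs the program as a subprocess and captures what it prints. It rests on assumptions that have not been verified here: that an executable named antiword is on the PATH, and that running it with a document's path as its only argument writes the document's text to standard output. Both function names are hypothetical.

```python
import subprocess


def antiword_command(doc_path, executable='antiword'):
    """Build the argument list for running Antiword on one document.

    Assumption: the executable takes the document path as its only argument.
    """
    return [executable, doc_path]


def extract_text_with_antiword(doc_path, txt_path):
    """Run Antiword on doc_path and save whatever it prints to txt_path."""
    # Assumption: Antiword prints the extracted text to standard output.
    # check_call raises an exception if the program is missing or fails.
    with open(txt_path, 'wb') as out:
        subprocess.check_call(antiword_command(doc_path), stdout=out)
```

Calling extract_text_with_antiword for each .doc file found by an os.walk loop would take the place of the Open, SaveAs, and Close calls made through COM.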
See Also
Mark Hammond, Andy Robinson, Python Programming on
Win32 (O'Reilly), for documentation on
PyWin32; http://msdn.microsoft.com, for
Microsoft's documentation of the object model of
Microsoft Word; Library Reference and
Python in a Nutshell sections on modules
fnmatch and os.path, and
function os.walk.