Introduction
Credit: Mark Lutz, author of Programming Python and Python Quick Reference, co-author of Learning Python

Behold the file: one of the first things that any reasonably
pragmatic programmer reaches for in a programming
language's toolbox. Because processing external
files is a very real, tangible task, the quality of file-processing
interfaces is a good way to assess the practicality of a programming
tool.

As the recipes in this chapter attest, Python shines in this task.
Files in Python are supported in a variety of layers: from the
built-in open function (a synonym for the standard
file object type), to specialized tools in
standard library modules such as os, to
third-party utilities available on the Web. All told,
Python's arsenal of file tools provides several
powerful ways to access files in your scripts.
File Basics
In Python, a file object is an instance of built-in type
file. The built-in function
open creates and returns a file object. The first
argument, a string, specifies the file's path (i.e.,
the filename preceded by an optional directory path). The second
argument to open, also a string, specifies the
mode in which to open the file. For example:

input = open('data', 'r')
output = open('/tmp/spam', 'w')

open accepts a file path in which
directories and files are separated by slash characters
(/), regardless of the proclivities of the
underlying operating system. On systems that don't
use slashes, you can use a backslash character (\)
instead, but there's no real reason to do so.
Backslashes are harder to fit nicely in string literals, since you
have to double them up or use "raw"
strings.
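For instance, a quick sketch of the alternatives, using a purely hypothetical Windows path:

f1 = open('C:/new/text.dat', 'r')     # forward slashes work on any platform
f2 = open('C:\\new\\text.dat', 'r')   # backslashes must be doubled in a normal literal
f3 = open(r'C:\new\text.dat', 'r')    # or left as is in a "raw" string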
If the file path argument does not include the file's directory name, the file is assumed to reside
in the current working directory (which is a disjoint concept from
the Python module search path).

For the mode argument, use 'r' to read the file in
text mode; this is the default value and is commonly omitted, so that
open is called with just one argument. Other
common modes are 'rb' to read the file in binary
mode, 'w' to create and write to the file in text
mode, and 'wb' to create and write to the file in
binary mode. A variant of 'r' that is sometimes
precious is 'rU', which tells Python to read the
file in text mode with "universal
newlines": mode 'rU' can read
text files independently of the line-termination convention the files
are using, be it the Unix way, the Windows way, or even the (old) Mac
way. (Mac OS X today is a Unix for all intents and purposes, but
releases of Mac OS 9 and earlier, just a few years ago, were quite
different.) The distinction between text mode and
binary mode is important on non-Unix-like platforms because of the
line-termination characters used on these systems. When you open a
file in binary mode, Python knows that it doesn't
need to worry about line-termination characters; it just moves bytes
between the file and in-memory strings without any kind of
translation. When you open a file in text mode on a non-Unix-like
system, however, Python knows it must translate between the
'\n' line-termination characters used in strings
and whatever the current platform uses in the file itself. All of
your Python code can always rely on '\n' as the
line-termination character, as long as you properly indicate text or
binary mode when you open the file.
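As an illustration of these mode strings, here is a brief sketch; the filenames are hypothetical:

text_in    = open('data.txt', 'r')    # text mode (the default): lines end in '\n' in memory
text_any   = open('data.txt', 'rU')   # text mode with universal newlines
binary_in  = open('data.bin', 'rb')   # binary mode: bytes move untranslated
text_out   = open('copy.txt', 'w')    # create or overwrite a file for text-mode writing
binary_out = open('copy.bin', 'wb')   # create or overwrite a file for binary-mode writing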
Once you have a file object, you perform all file I/O by calling methods of this object, as we'll discuss in a
moment. When you're done with the file, you should
finish by calling the close method on the object,
to close the connection to the file:

input.close()

In short scripts, people often omit this step, as Python
automatically closes the file when a file object is reclaimed during
garbage collection (which in mainstream Python means the file is
closed almost immediately, although other important Python
implementations, such as Jython and IronPython, have other, more
relaxed garbage-collection strategies). Nevertheless, it is good
programming practice to close your files as soon as possible, and it
is especially a good idea in larger programs, which otherwise may be
at more risk of having excessive numbers of uselessly open files
lying about. Note that
try/finally is particularly
well suited to ensuring that a file gets closed, even when a function
terminates due to an uncaught exception.
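For example, a minimal sketch of that pattern (the filename is hypothetical, and process stands in for any work that might raise an exception):

input = open('data.txt', 'r')
try:
    process(input.read())    # work that may raise an exception
finally:
    input.close()            # runs whether or not an exception occurred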
To write to a file, use the write method:

output.write(s)

where s is a string. Think of s
as a string of characters if output is open for
text-mode writing, and as a string of bytes if
output is open for binary-mode writing. Files have
other writing-related methods, such as flush, to
send any data being buffered, and writelines, to
write a sequence of strings in a single call. However,
write is by far the most commonly used
method.
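To make those writing methods concrete, here is a short sketch, reusing the output file opened earlier in this Introduction:

output = open('/tmp/spam', 'w')
output.write('one line of text\n')         # write a single string
output.writelines(['two\n', 'three\n'])    # write each string in a sequence
output.flush()                             # push any buffered data out to the file
output.close()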
Reading from a file is more common than writing to a file, and more issues are involved, so file objects have more reading methods than
writing ones. The readline method reads and
returns the next line from a text file. Consider the following loop:

while True:
    line = input.readline()
    if not line: break
    process(line)

This was once idiomatic Python but it is no longer the best way to
read and process all of the lines from a file. Another dated
alternative is to use the readlines method, which
reads the whole file and returns a list of
lines:

for line in input.readlines():
    process(line)

readlines is useful only for files that fit
comfortably in physical memory. If the file is truly huge,
readlines can fail or at least slow things down
quite drastically (virtual memory fills up and the operating system
has to start copying parts of physical memory to disk). In
today's Python, just loop on the file object itself
to get a line at a time with excellent memory and performance
characteristics:

for line in input:
    process(line)

Of course, you
don't always want to read a file line by line. You
may instead want to read some or all of the bytes in the file,
particularly if you've opened the file for
binary-mode reading, where lines are unlikely to be an applicable
concept. In this case, you can use the read
method. When called without arguments, read reads
and returns all the remaining bytes from the file. When
read is called with an integer argument
N, it reads and returns the next
N bytes (or all the remaining bytes, if
less than N bytes remain).
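A common hedged sketch of that second form, processing a (hypothetical) binary file in fixed-size chunks rather than all at once:

f = open('data.bin', 'rb')
while True:
    chunk = f.read(4096)     # up to 4096 bytes per call
    if not chunk: break      # an empty string signals end-of-file
    process(chunk)           # process is a placeholder, as above
f.close()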
Other methods worth mentioning are seek and
tell, which support random access to files. These
methods are normally used with binary files made up of fixed-length
records.
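For example, here is a sketch of random access to such a file; the record size and filename are assumptions made for the illustration:

RECSIZE = 32                     # hypothetical fixed record length, in bytes
f = open('records.bin', 'rb')
f.seek(5 * RECSIZE)              # jump directly to the sixth record
record = f.read(RECSIZE)         # read exactly one record
print f.tell()                   # byte offset is now 192, if the file is that long
f.close()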
Portability and Flexibility
On the surface, Python's
file support is straightforward. However, before you peruse the code
in this chapter, I want to underscore two aspects of
Python's file support: code portability and
interface flexibility.

Keep in mind that most file interfaces in Python are fully portable
across platform boundaries. It would be difficult to overstate the
importance of this feature. A Python script that searches all files
in a "directory" tree for a bit of
text, for example, can be freely moved from platform to platform
without source-code changes: just copy the script's
source file to the new target machine. I do it all the time, so
much so that I can happily stay out of operating system wars. With
Python's portability, the underlying platform is
almost irrelevant.
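To make that claim concrete, here is a hedged sketch of such a tree-searching script; the search root and target string are hypothetical, and the same code runs unchanged on Unix, Windows, and other platforms:

import os

for dirpath, dirnames, filenames in os.walk('.'):    # walk the tree below the current directory
    for name in filenames:
        path = os.path.join(dirpath, name)
        if 'needle' in open(path, 'rb').read():       # 'needle' stands in for the text sought
            print path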
Also, it has always struck me that Python's file-processing interfaces are not restricted to real, physical
files. In fact, most file tools work with any kind of object that
exposes the same interface as a real file object. Thus, a file reader
cares only about read methods, and a file writer cares only about
write methods. As long as the target object implements the expected
protocol, all goes well.

For example, suppose you have written a general file-processing
function such as the following, meant to apply a passed-in function
to each line of an input file:

def scanner(fileobject, linehandler):
    for line in fileobject:
        linehandler(line)

If you code this function in a module file and drop that file into a
"directory" that's
on your Python search path (sys.path), you can use
it any time you need to scan a text file line by line, now or in the
future. To illustrate, here is a client script that simply prints the
first word of each line:

from myutils import scanner

def firstword(line):
    print line.split()[0]

file = open('data')
scanner(file, firstword)

So far, so good; we've just coded a small, reusable
software component. But notice that there are no type declarations in
the scanner function, only an interface
constraint: any object that is iterable line by line will do.
For instance, suppose you later want to provide canned test input
from a string object, instead of using a real, physical file. The
standard StringIO module, and the equivalent but
faster cStringIO, provide the appropriate wrapping
and interface forgery:

from cStringIO import StringIO
from myutils import scanner

def firstword(line): print line.split()[0]

string = StringIO('one\ntwo xxx\nthree\n')
scanner(string, firstword)

StringIO objects are plug-and-play compatible with
file objects, so scanner takes its three lines of
text from an in-memory string object, rather than a true external
file. You don't need to change the scanner to make
this work; just pass it the right kind of object. For more
generality, you can even use a class to implement the expected
interface instead:

class MyStream(object):
    def __iter__(self):
        # grab and return text from wherever
        return iter(['a\n', 'b c d\n'])

from myutils import scanner

def firstword(line):
    print line.split()[0]

object = MyStream()
scanner(object, firstword)

This time, as scanner attempts to read the file, it
really calls out to the __iter__ method
you've coded in your class. In practice, such a
method might use other Python standard tools to grab text from a
variety of sources: an interactive user, a popup GUI input box, a
shelve object, an SQL database, an XML or HTML
page, a network socket, and so on. The point is that
scanner doesn't know or care what
type of object is implementing the interface it expects, or what that
interface actually does.
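As one more illustration, here is a hedged sketch of a stream class that feeds scanner from an interactive user instead of a file; the class name and prompt are invented, and firstword is the handler defined in the earlier snippets:

class PromptStream(object):
    def __iter__(self):
        # yield lines typed by the user until an empty line is entered
        while True:
            line = raw_input('text> ')
            if not line: break
            yield line + '\n'

from myutils import scanner
scanner(PromptStream(), firstword)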
Object-oriented programmers know this deliberate naiveté as
polymorphism. The type of the object being
processed determines what an operation, such as the
for-loop iteration in scanner,
actually does. Everywhere in Python, object interfaces, rather than
specific data types, are the unit of coupling. The practical effect
is that functions are often applicable to a much broader range of
problems than you might expect. This is especially true if you have a
background in statically typed languages such as C or C++. It is
almost as if we get C++ templates for free in Python. Code has an
innate flexibility that is a by-product of Python's
strong but dynamic typing.

Of course, code portability and flexibility run rampant in Python
development and are not really confined to file interfaces. Both are
features of the language that are simply inherited by file-processing
scripts. Other Python benefits, such as its easy scriptability and
code readability, are also key assets when it comes time to change
file-processing programs. But rather than extolling all of
Python's virtues here, I'll simply
defer to the wonderful recipes in this chapter and this book at large
for more details. Enjoy!