Recipe 2.1. Reading from a File
Credit: Luther Blissett
Problem
You want to read text or data from a file.
Solution
Here's the most convenient way to read all of the file's contents at
once into one long string:

all_the_text = open('thefile.txt').read()      # all text from a text file
all_the_data = open('abinfile', 'rb').read()   # all data from a binary file

However, it is safer to bind the file object to a name, so that you
can call close on it as soon as you're done, to avoid ending up with
open files hanging around. For example, for a text file:
file_object = open('thefile.txt')
try:
    all_the_text = file_object.read()
finally:
    file_object.close()

You don't necessarily have to use the try/finally statement here, but
it's a good idea to use it, because it ensures the file gets closed
even when an error occurs during reading.

The simplest, fastest, and most Pythonic way to read a text file's
contents at once as a list of strings, one per line, is:
list_of_all_the_lines = file_object.readlines()

This leaves a '\n' at the end of each line; if you don't want that,
you have alternatives, such as:

list_of_all_the_lines = file_object.read().splitlines()
list_of_all_the_lines = file_object.read().split('\n')
list_of_all_the_lines = [L.rstrip('\n') for L in file_object]

The simplest and fastest way to process a text file one line at a
time is simply to loop on the file object with a for statement:
for line in file_object:
    process(line)

This approach also leaves a '\n' at the end of each line; you may
remove it by starting the for loop's body with:

line = line.rstrip('\n')

or even, when you're OK with getting rid of trailing whitespace from
each line (not just a trailing '\n'), the generally handier:

line = line.rstrip()
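
To make the differences among these alternatives concrete, here is a small self-contained sketch. It assumes a current Python 3 interpreter and writes a throwaway file with the standard tempfile module; the file contents and the read_with helper are made up for illustration and are not part of the recipe.

```python
import os
import tempfile

# Create a small throwaway text file (hypothetical contents, for illustration).
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, 'thefile.txt')
file_object = open(path, 'w')
try:
    file_object.write('first line\nsecond line  \n')
finally:
    file_object.close()


def read_with(reader):
    """Open the sample file, apply reader to the file object, always close it."""
    file_object = open(path)
    try:
        return reader(file_object)
    finally:
        file_object.close()


# Each alternative from the solution, applied to a freshly opened file.
with_newlines = read_with(lambda f: f.readlines())
via_splitlines = read_with(lambda f: f.read().splitlines())
via_split = read_with(lambda f: f.read().split('\n'))
via_rstrip_newline = read_with(lambda f: [L.rstrip('\n') for L in f])
via_rstrip_all = read_with(lambda f: [L.rstrip() for L in f])

for name, value in [('readlines', with_newlines),
                    ('splitlines', via_splitlines),
                    ("split('\\n')", via_split),
                    ("rstrip('\\n')", via_rstrip_newline),
                    ('rstrip()', via_rstrip_all)]:
    print(name, value)

# Remove the throwaway file and directory.
os.remove(path)
os.rmdir(tmp_dir)
```

One detail worth noticing in the output: when the text ends with a newline, split('\n') yields a trailing empty string, whereas splitlines does not.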
Discussion
Unless the file you're
reading is truly huge, slurping it all into memory in one gulp is
often fastest and most convenient for any further processing. The
built-in function open creates a Python file
object (alternatively, you can equivalently call the built-in type
file). You call the read method
on that object to get all of the contents (whether text or binary) as
a single long string. If the contents are text, you may choose to
immediately split that string into a list of lines with the
split method or the specialized
splitlines method. Since splitting into lines is
frequently needed, you may also call readlines
directly on the file object for faster, more convenient
operation.

You can also loop directly on the file object, or pass it to
callables that require an iterable, such as list or max. When thus
treated as an iterable, a file object open for reading has the file's
text lines as the iteration items (therefore, this should be done for
text files only). This kind of line-by-line iteration is cheap in
terms of memory consumption and fairly speedy too.

On Unix and Unix-like systems, such as Linux, Mac OS X, and other BSD
variants, there is no real distinction between text files and binary
data files. On Windows and very old Macintosh systems, however, line
terminators in text files are encoded, not with the standard
'\n' separator, but with '\r\n'
and '\r', respectively. Python translates these
line-termination characters into '\n' on your
behalf. This means that you need to tell Python when you open a
binary file, so that it won't perform such
translation. To do so, use 'rb' as the second
argument to open. This is innocuous even on
Unix-like platforms, and it's a good habit to
distinguish binary files from text files even there, although
it's not mandatory in that case. Such good habits
will make your programs more immediately understandable, as well as
more compatible with different
platforms.
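
As a rough illustration of the difference between the two modes, the sketch below writes a few bytes with Windows-style line terminators and reads them back both ways. It assumes a current Python 3 interpreter, where binary mode returns bytes objects and text mode, by default, translates line terminators on every platform; the file name and contents are hypothetical.

```python
import os
import tempfile

tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, 'abinfile')

# Write raw bytes, including Windows-style '\r\n' terminators, in binary mode.
file_object = open(path, 'wb')
try:
    file_object.write(b'one\r\ntwo\r\n')
finally:
    file_object.close()

# Binary mode: the data comes back exactly as stored.
file_object = open(path, 'rb')
try:
    all_the_data = file_object.read()
finally:
    file_object.close()

# Text mode: line terminators are translated to '\n' while reading.
file_object = open(path, 'r', encoding='ascii')
try:
    all_the_text = file_object.read()
finally:
    file_object.close()

print(repr(all_the_data))
print(repr(all_the_text))

# Remove the throwaway file and directory.
os.remove(path)
os.rmdir(tmp_dir)
```

If the same bytes were image or archive data rather than text, the translation performed in text mode would silently corrupt them, which is why binary files should always be opened with 'rb'.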
If you're unsure about which line-termination
convention a certain text file might be using, use
'rU' as the second argument to
open, requesting universal endline translation.
This lets you freely interchange text files among Windows, Unix
(including Mac OS X), and old Macintosh systems, without worries: all
kinds of line-ending conventions get mapped to
'\n', whatever platform your code is running on.

You can call methods such as read directly on the
file object produced by the open function, as
shown in the first snippet of the solution. When you do so, you no
longer have a reference to the file object as soon as the reading
operation finishes. In practice, Python notices the lack of a
reference at once, and immediately closes the file. However, it is
better to bind a name to the result of open, so
that you can call close yourself explicitly when
you are done with the file. This ensures that the file stays open for
as short a time as possible, even on platforms such as Jython,
IronPython, and other hypothetical future versions of Python, on
which more advanced garbage-collection mechanisms might delay the
automatic closing that the current version of C-based Python performs
at once. To ensure that a file object is closed even if errors happen
during its processing, the most solid and prudent approach is to use
the try/finally statement:
file_object = open('thefile.txt')
try:
    for line in file_object:
        process(line)
finally:
    file_object.close()

Be careful not to place the call to open inside the try clause of
this try/finally statement (a rather common error among beginners).
If an error occurs during the opening, there is nothing to close, and
besides, nothing gets bound to name file_object, so you definitely
don't want to call file_object.close()!

If you choose to read the file a little at a time, rather than all at
once, the idioms are different. Here's one way to read a binary file
100 bytes at a time, until you reach the end of the file:
file_object = open('abinfile', 'rb')
try:
    while True:
        chunk = file_object.read(100)
        if not chunk:
            break
        do_something_with(chunk)
finally:
    file_object.close()

Passing an argument N to the read method ensures that read will read
only the next N bytes (or fewer, if the file is closer to the end).
read returns the empty string when it reaches the end of the file.
Complicated loops are best encapsulated as reusable generators. In
this case, we can encapsulate the logic only partially, because a
generator's yield keyword is not allowed in the try clause of a
try/finally statement. Giving up on the assurance of file closing
afforded by try/finally, we can therefore settle for:
def read_file_by_chunks(filename, chunksize=100):
    file_object = open(filename, 'rb')
    while True:
        chunk = file_object.read(chunksize)
        if not chunk:
            break
        yield chunk
    file_object.close()

Once this read_file_by_chunks generator is available, your
application code to read and process a binary file by fixed-size
chunks becomes extremely simple:
for chunk in read_file_by_chunks('abinfile'):
    do_something_with(chunk)

Reading a text file one line at a time is a frequent task. Just loop
on the file object, as in:
for line in open('thefile.txt', 'rU'):
    do_something_with(line)

Here, too, in order to be 100% certain that no uselessly open file
object will ever be left just hanging around, you may want to code
this snippet in a more rigorously correct and prudent way:
file_object = open('thefile.txt', 'rU')
try:
    for line in file_object:
        do_something_with(line)
finally:
    file_object.close()
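
The closing guarantee of this idiom can be checked with a small experiment. The sketch below is only an illustration: it assumes a current Python 3 interpreter, opens the file in plain text mode rather than 'rU', and uses a made-up do_something_with that deliberately fails on one line of a throwaway file.

```python
import os
import tempfile

# Create a throwaway text file (hypothetical contents, for illustration).
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, 'thefile.txt')
file_object = open(path, 'w')
try:
    file_object.write('good\nbad\nnever reached\n')
finally:
    file_object.close()

processed = []


def do_something_with(line):
    """Hypothetical processing step that fails on one particular line."""
    if line.rstrip('\n') == 'bad':
        raise ValueError('cannot process %r' % line)
    processed.append(line.rstrip('\n'))


file_object = open(path)        # opened OUTSIDE the try clause
error_seen = None
try:
    try:
        for line in file_object:
            do_something_with(line)
    finally:
        file_object.close()     # runs whether or not an error occurred
except ValueError as error:
    # The outer handler exists only so this demonstration can continue.
    error_seen = error

print(processed)
print(file_object.closed)
print(error_seen)

# Remove the throwaway file and directory.
os.remove(path)
os.rmdir(tmp_dir)
```

The error still propagates out of the try/finally statement; the finally clause only makes sure the file is closed on the way out.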
See Also
Recipe 2.2; documentation
for the open built-in function and
file objects in the Library
Reference and Python in a
Nutshell.