Recipe 2.8. Updating a Random-Access File
Credit: Luther Blissett
Problem
You want to read a binary record from
somewhere inside a large file of fixed-length records, change some or
all of the values of the record's fields, and write
the record back.
Solution
Read the record, unpack it, perform whatever computations you need
for the update, pack the fields back into the record, seek to the
start of the record again, and write it back. Phew. Faster to code than
to say:
import struct
format_string = '8l'  # e.g., 8 C longs per record (4 or 8 bytes each, depending on platform)
record_number = 0     # example value: zero-based index of the record to update
thefile = open('somebinfile', 'r+b')
record_size = struct.calcsize(format_string)
thefile.seek(record_size * record_number)
buffer = thefile.read(record_size)
fields = list(struct.unpack(format_string, buffer))
# Perform computations, suitably modifying fields, then:
buffer = struct.pack(format_string, *fields)
thefile.seek(record_size * record_number)
thefile.write(buffer)
thefile.close()
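To try the technique end to end, the following sketch first creates a small file of fixed-length records and then updates one of them in place. The helper names update_record and read_record, the temporary file location, and the '<8l' format (little-endian, standard four-byte integers) are assumptions made for this illustration, not part of the recipe itself.

```python
import os
import struct
import tempfile

FORMAT = '<8l'  # assumption: 8 little-endian 4-byte signed integers
RECORD_SIZE = struct.calcsize(FORMAT)  # 32 bytes with this format

def update_record(path, record_number, change):
    """Read one record, let change(fields) mutate it, write it back."""
    with open(path, 'r+b') as thefile:
        offset = RECORD_SIZE * record_number
        thefile.seek(offset)
        fields = list(struct.unpack(FORMAT, thefile.read(RECORD_SIZE)))
        change(fields)                 # must keep the list length at 8
        thefile.seek(offset)
        thefile.write(struct.pack(FORMAT, *fields))

def read_record(path, record_number):
    """Return the fields of one record as a tuple."""
    with open(path, 'rb') as thefile:
        thefile.seek(RECORD_SIZE * record_number)
        return struct.unpack(FORMAT, thefile.read(RECORD_SIZE))

# Build a hypothetical sample file holding three records.
path = os.path.join(tempfile.mkdtemp(), 'somebinfile')
with open(path, 'wb') as f:
    for n in range(3):
        f.write(struct.pack(FORMAT, *([n] * 8)))

def bump_first(fields):
    fields[0] = fields[0] * 2 + 10

update_record(path, 1, bump_first)
```

Using a with statement closes the file even if the update raises an exception, which serves the same purpose as the explicit close call in the recipe.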
Discussion
This approach works only on files (generally binary ones) defined in
terms of records that are all the same, fixed size; it
doesn't work on normal text files. Furthermore, the
size of each record must be that defined by a
struct format string, as shown in the
recipe's code. A typical format string, for example,
might be '8l', to specify that each record is made
up of eight signed long integers, each unpacked into a Python int.
With this native format, each integer takes four or eight bytes,
depending on the platform's C long; a format such as '=8l' or '<8l'
fixes the size at four bytes everywhere. In this
case, the fields variable in the recipe
would be bound to a list of eight ints. Note that
struct.unpack returns a tuple. Because tuples are
immutable, the computation would have to rebind the entire
fields variable. A list is mutable, so
each field can be rebound as needed. Thus, for convenience, we
explicitly ask for a list when we bind
fields. Make sure, however, not to alter
the length of the list. In this case, it needs to remain composed of
exactly eight integers, or the struct.pack call
will raise an exception when we call it with a
format_string of '8l'.
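The length constraint is easy to check in a short experiment. This sketch assumes the '<8l' format (a fixed little-endian layout chosen only for the demonstration): rebinding one item of the list packs fine, while appending a ninth item makes struct.pack raise struct.error.

```python
import struct

format_string = '<8l'  # assumption: fixed little-endian layout for the demo
buffer = struct.pack(format_string, *range(8))

fields = list(struct.unpack(format_string, buffer))
fields[3] = 42                      # rebinding one item is fine
repacked = struct.pack(format_string, *fields)

fields.append(99)                   # now nine items: one too many
try:
    struct.pack(format_string, *fields)
    pack_failed = False
except struct.error:
    pack_failed = True              # struct.pack rejects the wrong item count
```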
Also, this recipe is not suitable when working with records that are
not all of the same, unchanging length.
To seek back to the start of the record,
instead of using the record_size*record_number
offset again, you may choose to do a relative seek:
thefile.seek(-record_size, 1)
The second argument to the seek method
(1) tells the file object to seek relative to the
current position (here, so many bytes back, because we used a
negative number as the first argument).
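The relative seek can be watched in action with tell. In this sketch an io.BytesIO object stands in for a file opened with 'r+b', and the two-field '<2l' record layout is an assumption chosen to keep the numbers small; io.SEEK_CUR is the named constant for the value 1.

```python
import io
import struct

format_string = '<2l'  # assumption: tiny 2-field records for the demo
record_size = struct.calcsize(format_string)  # 8 bytes

# io.BytesIO stands in for a binary file opened for updating.
thefile = io.BytesIO(
    b''.join(struct.pack(format_string, n, n) for n in range(4)))

record_number = 2
thefile.seek(record_size * record_number)     # absolute seek from the start
fields = list(struct.unpack(format_string, thefile.read(record_size)))
position_after_read = thefile.tell()          # one record past the start

fields[1] = -7
thefile.seek(-record_size, io.SEEK_CUR)       # same as seek(-record_size, 1)
position_after_seek = thefile.tell()          # back at the record's start
thefile.write(struct.pack(format_string, *fields))
```

The relative form relies on the file position still being exactly at the end of the record just read, so it suits code that does nothing else with the file between the read and the write.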
seek's default is to seek to an
absolute offset within the file (i.e., from the start of the file).
You can also explicitly request this default behavior by calling
seek with a second argument of
0.
You don't need to open the file just before you do
the first seek, nor do you need to close it right
after the write. Once you have a file object that
is correctly opened (i.e., for updating and as a binary rather than a
text file), you can perform as many updates on the file as you want
before closing the file again. These calls are shown here to
emphasize the proper technique for opening a file for random-access
updates and the importance of closing a file when you are done with
it.
The file needs to be opened for updating (i.e., to allow both reading
and writing). That's what the
'r+b' argument to open means:
open for reading and writing, but do not implicitly perform any
transformations on the file's contents because the
file is a binary one. (In Python 2, the 'b' part is unnecessary
but still recommended for clarity on Unix and Unix-like systems,
and absolutely crucial on other platforms, such as Windows. In
Python 3, it is required on every platform, because text mode
returns str objects rather than the bytes that struct needs.)
If you're creating the binary file
from scratch, but you still want to be able to go back, reread, and
update some records without closing and reopening the file, you can
use a second argument of 'w+b' instead. However, I
have never witnessed this strange combination of requirements; binary
files are normally first created (by opening them with
'wb', writing data, and closing the file) and
later reopened for updating with 'r+b'.
While this approach is normally useful only on a file whose records
are all the same size, another, more advanced possibility exists: a
separate "index file" that provides
the offset and length of each record inside the
"data file". Such indexed
sequential access approaches aren't much in fashion
any more, but they used to be very important. Nowadays, one mostly meets
text files (of many kinds, more and more often XML
ones), databases, and occasional binary files with fixed-length
records. Still, if you do need to access an indexed sequential binary
file, the code is quite similar to that shown in this recipe, except
that you must obtain the record_size and the offset
argument to pass to thefile.seek by reading them
from the index file, rather than computing them yourself as shown in
this recipe's Solution.
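The indexed variant might be sketched as follows. The index layout used here (one '<2l' entry per record, holding offset and length) and the helper names locate and update_indexed are assumptions for illustration only, since real index files come in many formats. io.BytesIO objects stand in for the two files.

```python
import io
import struct

INDEX_FORMAT = '<2l'  # assumption: each index entry is (offset, length)
INDEX_SIZE = struct.calcsize(INDEX_FORMAT)

def locate(indexfile, record_number):
    """Return (offset, length) of a record, read from the index file."""
    indexfile.seek(INDEX_SIZE * record_number)
    return struct.unpack(INDEX_FORMAT, indexfile.read(INDEX_SIZE))

def update_indexed(datafile, indexfile, record_number, change):
    """Rewrite one record in place; change() must keep the byte length."""
    offset, length = locate(indexfile, record_number)
    datafile.seek(offset)
    record = datafile.read(length)
    new_record = change(record)
    if len(new_record) != length:
        raise ValueError('in-place update must not change the record length')
    datafile.seek(offset)
    datafile.write(new_record)

# Hypothetical sample data: three records of different lengths.
records = [b'ab', b'hello', b'xyz']
datafile, indexfile = io.BytesIO(), io.BytesIO()
for rec in records:
    indexfile.write(struct.pack(INDEX_FORMAT, datafile.tell(), len(rec)))
    datafile.write(rec)

update_indexed(datafile, indexfile, 1, bytes.upper)
```

The length check matters because writing a longer record in place would overwrite the start of the next one, and a shorter one would leave stale bytes behind.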
See Also
The sections of the Library Reference and
Python in a Nutshell on file objects and the
struct module; Perl
Cookbook recipe 8.13.