Recipe 7.3. Using Compression with Pickling
Credit: Bill McNeill, Andrew Dalke
Problem
You want to pickle generic Python objects
to and from disk in a compressed form.
Solution
Standard library modules cPickle and
gzip offer the needed functionality; you just need
to glue them together appropriately:
import cPickle, gzip

def save(filename, *objects):
    ''' save objects into a compressed diskfile '''
    fil = gzip.open(filename, 'wb')
    for obj in objects:
        # 2 is the pickle protocol, passed positionally
        cPickle.dump(obj, fil, 2)
    fil.close()

def load(filename):
    ''' reload objects from a compressed diskfile '''
    fil = gzip.open(filename, 'rb')
    while True:
        try:
            yield cPickle.load(fil)
        except EOFError:
            break
    fil.close()
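The recipe targets Python 2, where cPickle is a separate module. On Python 3, cPickle was folded into the pickle module, so a minimal sketch of the same idea might look like the following. The with statements, the temporary directory, and the sample values are assumptions added for illustration; they are not part of the original recipe.

```python
import gzip
import os
import pickle
import tempfile

def save(filename, *objects):
    """Save objects into a gzip-compressed file, one pickle after another."""
    with gzip.open(filename, 'wb') as fil:
        for obj in objects:
            pickle.dump(obj, fil, protocol=2)

def load(filename):
    """Yield the objects stored in a gzip-compressed pickle file, in order."""
    with gzip.open(filename, 'rb') as fil:
        while True:
            try:
                yield pickle.load(fil)
            except EOFError:
                # No more pickles in the file.
                break

# Round trip through a temporary file (the file name is arbitrary).
with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, 'somefile.gz')
    save(path, {'a': 1}, [1, 2, 3], 'text')
    recovered = list(load(path))

print(recovered)
```

Wrapping the file in a with statement means the file is closed once the generator runs to completion, which list() guarantees here.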
Discussion
Persistence and compression,
as a general rule, go well together. cPickle
protocol 2 saves Python objects quite compactly, but the resulting
files can still compress quite well. For example, on my Linux box,
open('/usr/share/dict/words').readlines()
produces a list of over 45,000 strings. Pickling that list with the
default protocol 0 makes a disk file of 972 KB, while protocol 2
takes only 716 KB. However, using both gzip and
protocol 2, as shown in this recipe, requires only 268 KB, saving a
significant amount of space. As it happens, protocol 0 produces a
more compressible file in this case, so that using
gzip and protocol 0 would save even more space,
taking only 252 KB on disk. However, the difference between 268 KB and
252 KB isn't all that meaningful, and protocol 2 has
other advantages, particularly when used on instances of new-style
classes, so I recommend the mix I use in the functions shown in this
recipe.

Whatever protocol you
choose to save your data, you don't need to worry
about it when you're reloading the data. The
protocol is recorded in the file together with the data, so
cPickle.load can figure out by itself all it
needs. Just pass it a file or pseudo-file object with read and
readline methods: each call to cPickle.load returns the next
object that was pickled to the file, and the call raises
EOFError when the file's done. In
this recipe, we wrap a generator around
cPickle.load, so you can simply loop over all
recovered objects with a for statement, or,
depending on what you need, you can use some call such as
list(load('somefile.gz')) to get a list with all
recovered objects as its items.
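To try the points above on your own data, here is a rough sketch, written for Python 3 (where cPickle has become pickle). It compares raw and gzipped pickle sizes for protocols 0 and 2, then reads back a compressed stream that mixes both protocols without telling the loader which one was used. The generated word list is a hypothetical stand-in for a real dictionary file, and load_all is an illustrative helper name, so the sizes you see will not match the figures quoted in the discussion.

```python
import gzip
import io
import pickle

# Hypothetical sample data standing in for a real word list.
words = ['word%05d\n' % i for i in range(45000)]

# Raw and gzipped pickle sizes, in bytes, for each protocol.
sizes = {}
for protocol in (0, 2):
    raw = pickle.dumps(words, protocol)
    sizes[protocol] = (len(raw), len(gzip.compress(raw)))

def load_all(fileobj):
    """Yield every pickled object from fileobj until EOFError ends the stream."""
    while True:
        try:
            yield pickle.load(fileobj)
        except EOFError:
            break

# Write two pickles with different protocols into one compressed stream.
buf = io.BytesIO()
with gzip.GzipFile(fileobj=buf, mode='wb') as fil:
    pickle.dump(words[:3], fil, 0)
    pickle.dump({'count': len(words)}, fil, 2)

# Read them back; the loader detects each protocol from the data itself.
buf.seek(0)
with gzip.GzipFile(fileobj=buf, mode='rb') as fil:
    recovered = list(load_all(fil))

for protocol, (plain, packed) in sorted(sizes.items()):
    print('protocol %d: %d bytes raw, %d bytes gzipped' % (protocol, plain, packed))
```

Measuring on data that resembles what you will actually store is the only reliable way to pick between the protocols on size alone.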
See Also
Modules gzip and cPickle in the
Library Reference.