Recipe 7.2. Serializing Data Using the pickle and cPickle Modules
Credit: Luther Blissett
Problem
You want to serialize and reconstruct,
at a reasonable speed, a Python data structure, which may include
fundamental Python objects as well as classes and instances.
Solution
If you don't want to assume that your data is
composed only of fundamental Python objects, or you need portability
across versions of Python, or you need to transmit the serialized
form as text, the best way of serializing your data is with the
cPickle module. (The pickle
module is a pure-Python equivalent and totally interchangeable, but
it's slower and not worth using unless
you're missing cPickle.) For
example, say you have:

    data = {12:'twelve', 'feep':list('ciao'), 1.23:4+5j, (1,2,3):u'wer'}

You can serialize data to a text string:

    import cPickle
    text = cPickle.dumps(data)

or to a binary string, a choice that is faster and takes up less
space:

    bytes = cPickle.dumps(data, 2)

You can now sling text or bytes
around as you wish (e.g., send across a network, include as a BLOB in
a database; see Recipe 7.10, Recipe 7.11, and Recipe 7.12) as long as you keep
text or bytes intact. In the
case of bytes, it means keeping the arbitrary
binary bytes intact. In the case of text, it means
keeping its textual structure intact, including newline characters.
Then you can reconstruct the data at any time, regardless of machine
architecture or Python release:

    redata1 = cPickle.loads(text)
    redata2 = cPickle.loads(bytes)

Either call reconstructs a data structure that compares equal to
data. In particular, the order of keys in
dictionaries is arbitrary in both the original and reconstructed data
structures, but order in any kind of sequence is meaningful, and thus
it is preserved. You don't need to tell
cPickle.loads whether the original
dumps used text mode (the default, also readable
by some very old versions of Python) or binary (faster and more
compact); loads figures it out by examining
its argument's contents.

When you specifically want to write the data to a file, you can also
use the dump function of the
cPickle module, which lets you dump several data
structures to the same file one after the other:

    ouf = open('datafile.txt', 'w')
    cPickle.dump(data, ouf)
    cPickle.dump('some string', ouf)
    cPickle.dump(range(19), ouf)
    ouf.close()

Once you have done this, you can recover from
datafile.txt the same data structures you dumped
into it, one after the other, in the same order:

    inf = open('datafile.txt')
    a = cPickle.load(inf)
    b = cPickle.load(inf)
    c = cPickle.load(inf)
    inf.close()

You can also pass cPickle.dump a third argument
with a value of 2 to tell
cPickle.dump to serialize the data in binary form
(faster and more compact), but the data file must then be opened for
binary I/O, not in the default text mode, both when you originally
dump to the file and when you later load from the file.
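
As a minimal sketch of the binary variant just described (not part of the original recipe), the following writes three pickles to one file with protocol 2 and reads them back in the same order. The temporary file path is hypothetical, and the try/except import is an assumption added so the sketch also runs where cPickle is unavailable (for example, on Python 3, where only pickle exists):

```python
import os
import tempfile

try:
    import cPickle as pickle   # the fast C implementation, where available
except ImportError:
    import pickle              # assumption: fall back to the pickle module

data = {12: 'twelve', 'feep': list('ciao'), 1.23: 4+5j, (1, 2, 3): u'wer'}

# Hypothetical file name inside a fresh temporary directory.
path = os.path.join(tempfile.mkdtemp(), 'datafile.bin')

# Binary pickles require a file opened for binary writing.
ouf = open(path, 'wb')
try:
    pickle.dump(data, ouf, 2)
    pickle.dump('some string', ouf, 2)
    pickle.dump(list(range(19)), ouf, 2)
finally:
    ouf.close()

# Likewise, open for binary reading; each load returns the next pickle.
inf = open(path, 'rb')
try:
    a = pickle.load(inf)
    b = pickle.load(inf)
    c = pickle.load(inf)
finally:
    inf.close()
```

The only differences from the text-mode listings are the 'wb' and 'rb' open modes and the third argument to dump; load needs no hint about the format.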
Discussion
Python offers several ways to serialize data (i.e., make the data
into a string of bytes that you can save on disk, save in a database,
send across the network, etc.) and corresponding ways to reconstruct
the data from such serialized forms. Typically, the best approach is
to use the cPickle module. A pure-Python
equivalent, called pickle (the
cPickle module is coded in C as a Python
extension), is substantially slower, and the only reason to use it is
if you don't have cPickle (e.g.,
with a Python port onto a mobile phone with tiny storage space, where
you saved every byte you possibly could by installing only an
indispensable subset of Python's large standard
library). However, in cases where you do need to
use pickle, rest assured that it is completely
interchangeable with cPickle: you can pickle with
either module and unpickle with the other one, without any problems
whatsoever.

cPickle supports most elementary data types (e.g.,
dictionaries, lists, tuples, numbers, strings) and combinations
thereof, as well as classes and instances. Pickling classes and
instances saves only the data involved, not the code. (Code objects
are not even among the types that cPickle knows
how to serialize, basically because there would be no way to
guarantee their portability across disparate versions of Python. See
Recipe 7.6 for a way to
serialize code objects, as long as you don't need
the cross-version guarantee.) See Recipe 7.4 for more about pickling
classes and instances.

cPickle guarantees compatibility from one Python
release to another, as well as independence from a specific
machine's architecture. Data serialized with
cPickle will still be readable if you upgrade your
Python release, and pickling is also guaranteed to work if
you're sending serialized data between different
machines.

The dumps function of cPickle
accepts any Python data structure and returns a text string
representing it. If you call dumps with a second
argument of 2, dumps returns an
arbitrary bytestring instead: the operation is faster, and the
resulting string takes up less space. You can pass either the text or
the bytestring to the loads function, which will
return another Python data structure that compares equal
(==) to the one you originally dumped. In between
the dumps and loads calls, you
can subject the text or bytestring to any procedure you wish, such as
sending it over the network, storing it in a database and retrieving
it, or encrypting and decrypting it. As long as the
string's textual or binary structure is correctly
restored, loads will work fine on it (even across
platforms and in future releases). If you need to produce data
readable by old (pre-2.3) versions of Python, consider using 1 as the
second argument: the operation will be slower, and the resulting strings
will not be as compact as those obtained by using 2, but the strings
will be unpicklable by old Python versions as well as current and
future ones.

When you specifically need to save the data into a file, you can also
use cPickle's
dump function, which takes two arguments: the data
structure you're dumping and the open file or
file-like object. If the file is opened for binary I/O, rather than
the default (text I/O), then by giving dump a
third argument of 2, you can ask for binary format, which is faster
and takes up less space (again, you can also use 1 in this position
to get a binary format that's neither as compact nor
as fast, but is understood by old, pre-2.3 Python versions too). The
advantage of dump over dumps is
that, with dump, you can perform several calls,
one after the other, with various data structures and the same open
file object. Each data structure is dumped as a self-delimiting pickle,
so a reader can tell where it ends. Consequently, when you later
open the file for reading (binary reading, if you asked for binary
format) and then repeatedly call cPickle.load,
passing the file as the argument, each data structure previously
dumped is reloaded sequentially, one after the other. The return
value of load, like that of
loads, is a new data structure that compares equal
to the one you originally dumped.

Those accustomed to other languages and libraries offering
"serialization" facilities may be
wondering whether pickle imposes substantial
practical limits on the size of objects you can
serialize or deserialize. Answer: Nope. Your
machine's memory might, but as long as everything
fits comfortably in memory, pickle practically
imposes no further limit.
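
To make the points above about formats and sizes concrete, here is a rough sketch (not from the original recipe) that pickles a moderately large list in both the default text format and binary protocol 2, checks that loads rebuilds an equal list from either string without being told the format, and records the length of each string. The name big, the list size, and the fallback import are illustrative assumptions:

```python
try:
    import cPickle as pickle   # the fast C implementation, where available
except ImportError:
    import pickle              # assumption: fall back to the pickle module

# A hypothetical, moderately large data structure.
big = list(range(100000))

sizes = {}
for protocol in (0, 2):        # 0 is the text format, 2 is binary
    blob = pickle.dumps(big, protocol)
    # loads detects the format by examining the string itself.
    assert pickle.loads(blob) == big
    sizes[protocol] = len(blob)

# For this data, the binary string is the shorter of the two.
binary_is_smaller = sizes[2] < sizes[0]
```

Exact sizes depend on the data and the Python release, so the sketch only compares the two lengths rather than relying on particular numbers.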
See Also
Recipe 7.2 and Recipe 7.4; documentation for the
standard library module cPickle in the
Library Reference and Python in a
Nutshell.