Introduction
Credit: Aaron Watters, Software ConsultantThere are three kinds of people in this world: those who can count
and those who can''t.However, there are
only two kinds of computer programs: toy programs and programs that
interact with some kind of persistent databases. That is to say, most
real computer programs must retrieve stored information and record
information for future use. These days, this description applies to
almost every computer game, which can typically save and restore the
state of the game at any time. So when I refer to toy
programs, I mean programs written as exercises, or for the
fun of programming. Nearly all real programs (such as programs that
people get paid to write) have some persistent database
storage/retrieval component.When
I was a Fortran programmer in the 1980s, I noticed that although
almost every program had to retrieve and store information, they
almost always did it using home-grown methods. Furthermore, since the
storage and retrieval parts of the program were the least interesting
components from the programmer''s point of view,
these parts of the program were frequently implemented very sloppily
and were hideous sources of intractable bugs. This repeated
observation convinced me that the study and implementation of
database systems sat at the core of programming pragmatics, and that
the state of the art as I saw it then required much improvement.
Later, in graduate school, I was delighted
to find an impressive and sophisticated body of work relating to the
implementation of database systems. The literature of database
systems covered issues of concurrency, fault tolerance, distribution,
query optimization, database design, and transaction semantics, among
others. In typical academic fashion, many of the concepts had been
elaborated to the point of absurdity (such as the silly notion of
conditional multivalued dependencies), but much of the work was
directly related to the practical implementation of reliable and
efficient storage and retrieval systems. The starting point for much
of this work was E.F. Codd''s seminal paper,
"A Relational Model of Data for Large Shared Data
Banks."[1]
[1] E.F. Codd,
"A Relational Model of Data for Large Shared Data
Banks," Communications of the
ACM, 13, no. 6 (1970), pp. 377-87, http://www.acm.org/classics/nov95/2.
Among my fellow graduate students, and even among most of the
faculty, the same body of knowledge was either disregarded or
regarded with some scorn. Everyone recognized that knowledge of
conventional relational technology could be lucrative, but they
generally considered such knowledge to be on the same level as
knowing how to write (or more importantly, maintain) COBOL programs.
This situation was not helped by the fact that the emerging database
interface standard, SQL (which is now very well established), looked
like an extension of COBOL and bore little obvious relationship to
any modern programming
language.
More than a decade later, there is little indication that anything
will soon overtake SQL-based relational technology for the majority
of data-based applications. In fact, relational-database technology
seems more pervasive than ever. The largest software
vendorsIBM, Microsoft, and Oracleall provide various
relational-database implementations as crucial components of their
core offerings. Other large software firms, such as SAP and
PeopleSoft, essentially provide layers of software built on top of a
relational-database core.Generally, relational
databases have been augmented rather than replaced. Enterprise
software-engineering dogma frequently espouses three-tier systems, in
which the bottom tier is a carefully designed relational database,
the middle tier defines a view of the database as business objects,
and the top tier consists of applications or transactions that
manipulate the business objects, with effects that ultimately
translate to changes in the underlying relational tables. Microsoft''s
Open Database Connectivity (ODBC) standard provides a common
programming API for SQL-based relational databases that permits
programs to interact with many different database engines with no or
few changes. For example, a Python program could be first implemented
using Microsoft Jet[2] as a
backend database for testing and debugging purposes. Once the program
is stable, it can be put into production use, remotely accessing,
say, a backend DB2 database on an IBM mainframe residing on another
continent, by changing (at most) one line of code.
[2]
Microsoft Jet is
commonly but erroneously known as the "Microsoft
Access database." Access is a product that Microsoft
sells for designing and implementing database frontends; Jet is a
backend that you may download for free from
Microsoft''s web site.
Relational databases are not appropriate
for all applications. In particular, a computer game or engineering
design tool that must save and restore sessions should probably use a
more direct method of persisting the logical objects of the program
than the flat tabular representation encouraged in
relational-database design. However, even in domains such as
engineering or scientific information, a hybrid approach that uses
some relational methods is often advisable. For example, I have seen
a complex relational-database schema for archiving genetic-sequencing
informationin which the sequences show up as binary large
objects (BLOBs)but a tremendous amount of important ancillary
information can fit nicely into relational tables. But as the reader
has probably surmised, I fear, I speak as a relational
zealot.
Within the Python world there are many ways
of providing persistence and database functionality. My personal
favorite is Gadfly, http://gadfly.sourceforge.net/, a simple and
minimal SQL implementation that works primarily with in-memory
databases. It is my favorite for no other reason than because it is
mine, and its biggest advantage is that, if it becomes unworkable for
you, it is easy to switch over to another, industrial-strength SQL
engine. Many Gadfly users have started an application with Gadfly
(because it was easy to use) and switched later (because they needed
more).
However, many people may prefer to
start by using other SQL implementations such as MySQL, Microsoft
Access, Oracle, Sybase, Microsoft SQL Server, SQLite, or others that
provide the advantages of an ODBC interface (which Gadfly does not
do).
Python
provides a standard interface for accessing relational databases: the
Python DB Application Programming Interface (Py-DBAPI), originally
designed by Greg Stein. Each underlying database API requires a
wrapper implementation of the Py-DBAPI, and implementations are
available for just about all underlying database interfaces, notably
Oracle and ODBC. When the relational
approach is overkill, Python provides built-in facilities for storing
and retrieving data. At the most basic level, the programmer can
manipulate files directly, as covered in Chapter 2. A step up from files, the
marshal module allows programs to serialize data
structures constructed from simple Python types (not including, e.g.,
classes or class instances). marshal has the
advantage of being able to retrieve large data structures with
blinding speed. The pickle and
cPickle modules allow general storage of objects,
including classes, class instances, and circular structures.
cPickle is so named because it is implemented in C
and is consequently quite fast, but it remains slower than
marshal. For access to structured data in a
somewhat human-readable form, it is also worth considering storing
and retrieving data in XML format (taking advantage of
Python''s several XML parsing and generation
modules), covered in Chapter 12but this
option works best for write once, read
many-type applications. Serialized data or XML
representations may be stored in SQL databases to create a hybrid
approach as well.
While marshal and
pickle provide basic serialization and
deserialization of structures, the application programmer will
frequently desire more functionality, such as transaction support and
concurrency control. When the relational model
doesn''t fit the application, a direct object
database implementation such as the Z-Object Database (ZODB) might be
appropriatesee http://zope.org/Products/ZODB3.2.I must conclude with a plea to those who are dismissive of
relational-database technology. Remember that it is successful for
good reasons, and it might be worth considering. To paraphrase
Churchill:
text = "" Indeed, it has been said that democracy is the worst form of
government, except for all those others that have been tried
from time to time. ""
import string
for a, b in [("democracy", "SQL"), ("government", "database")]:
text = string.replace(text, a, b)
print text