Recipe 17.7. Accessing a Python Sequence Item-by-Item with the Iterator Protocol
Credit: Luther Blissett
Problem
You want to write a
Python-callable C extension that takes as an argument a Python
sequence (or other iterable) and accesses it sequentially, one item
at a time, requiring no extra storage.
Solution
If you can afford to access the sequence item-by-item, without
knowing in advance the number of items it has, you can often save
memory by using PyObject_GetIter instead of
PySequence_Fast:
#include <Python.h>
static PyObject *totalIter(PyObject *self, PyObject *args)
{
PyObject* seq;
PyObject* item;
double result;
/* get one argument as an iterator */
if(!PyArg_ParseTuple(args, "O", &seq))
return 0;
seq = PyObject_GetIter(seq);
if(!seq)
return 0;
/* process data sequentially */
result = 0.0;
while((item=PyIter_Next(seq))) {
PyObject *fitem;
fitem = PyNumber_Float(item);
if(!fitem) {
Py_DECREF(seq);
Py_DECREF(item);
PyErr_SetString(PyExc_TypeError, "all items must be numbers");
return 0;
}
result += PyFloat_AS_DOUBLE(fitem);
Py_DECREF(fitem);
Py_DECREF(item);
}
/* clean up and return result */
Py_DECREF(seq);
return Py_BuildValue("d", result);
}
static PyMethodDef totitMethods[ ] = {
{"totit", totalIter, METH_VARARGS, "Sum a sequence of numbers."},
{0} /* sentinel */
};
void
inittotit(void)
{
(void) Py_InitModule("totit", totitMethods);
}
Discussion
PyObject_GetIter is appropriate only when
it's OK for the rest of your C code to get the items
one at a time, without knowing in advance the number of items in
total. When this condition is met,
PyObject_GetIter gives you roughly the same
performance as PySequence_Fast (if the input
argument is a list or tuple), but it can save memory allocation, and
therefore can run faster, if the input argument is an iterator or
another kind of sequence or iterable. In this
recipe's function, since we are just summing the
items, it is indeed perfectly OK to get them one at a time, and we
don't need to know in advance the total number;
therefore, using PyObject_GetIter is preferable.
(In the real world, you would use Python's built-in
function sum for this specific functionality,
rather than coding a dedicated C function, but then, this
is meant to be just an example!)PyObject_GetIter takes one argument: a Python
object from which an iterator is desired (much like
Python's iter built-in function).
It either returns 0, indicating an error, or an
iterator object, on which you can repeatedly call
PyIter_Next to get the next item (or
0, NULL, which does not
indicate an error, but rather indicates the end of the iteration).
Both PyObject_GetIter and
PyIter_Next return new references, so we must use
Py_DECREF when we're done with
the respective objects.As usual, the best way to build this extension (assuming that
you've saved it as a file named
totit.c) is with the
distutils package. Place in the same directory as
the C source a file named setup.py such as:
from distutils.core import setup, Extensionthen build and install by running:
setup(name="totit", maintainer="Luther Blissett", maintainer_email=
"situ@tioni.st", ext_modules=[Extension('totit', sources=['totit.c'])]
)
<m>$ python setup.py install</m>Part of the appeal of this approach is that it works on any platform,
assuming that you have access to the same C compiler used to build
your version of Python, and permission to write on the
site-packages directory where the resulting
dynamically loaded library gets installed.Since Python extensions are often coded in C to maximize performance,
it's interesting to measure performance compared to
pure Python code dealing with the same task. A typical measurement
setup might be a script such as the following
timon.py:
import timeit, operatorThis script uses the timeit module of the Python
from total import total
from totit import totit
def timo(fn, sq, init):
T = timeit.Timer('timon.%s(%s)'%(fn,sq), 'import timon\n'+init)
print ' %5.5s: %5.2f' % (fn, T.timeit(40000))
def totpy(x):
result = 0.0
for item in x: result += item
return result
def totre(x):
return reduce(operator.add, x, 0.0)
def totsu(x):
return sum(x, 0.0)
if _ _name_ _ == '_ _main_ _':
print 'on lists:'
for f in 'totre totpy total totit totsu'.split( ):
timo(f, 'seq', 'seq=range(2000)')
print 'on iters:'
for f in 'totre totpy total totit totsu'.split( ):
timo(f, 'g( )', 'def g( ):\n for x in range(2000): yield x')
Standard Library to measure accurately 40,000 calls to each function
on 2,000-item lists and 2,000-item generators. The
timeit.Timer constructor takes two string
arguments: first the statement we're timing, then
the setup statements that run before timing begins. Here, the
statement we're timing calls functions in this
module; therefore, the setup statements must import this
modulewhich is why we add the import timon
at the beginning of the setup statement string. I have also taken
care to make all these functions strictly comparable, by having them
all sum floats (not just ints).
This purpose is the reason that I provide the explicit
0.0 initial arguments to built-in functions
reduce and sum.On my machine, running with the command-line switch
-O so that Python can optimize operations a little
bit, the timing results on Python 2.3 are:
<m>$ python -O timon.py</m>As you can see, the most important optimization is to avoid the
on lists:
totre: 136.04
totpy: 118.18
total: 56.61
totit: 59.66
totsu: 74.11
on iters:
totre: 220.86
totpy: 198.98
total: 199.72
totit: 201.70
totsu: 157.44
"attractive nuisance" of the
reduce built-in function: even a pure Python loop
is faster! When we're dealing with lists, the
special-purpose C-coded extensions presented in the last two recipes
are fastest; but when we're dealing with generators,
the fastest solution is provided by the built-in function
sum. In practice, one would always use
sum for this functionality, rather than bothering
to code or expose special-purpose C functions.
See Also
The Extending and Embedding manual is
available as part of the standard Python documentation set at
http://www.python.org/doc/current/ext/extl;
documentation on the Python C API is at http://www.python.org/doc/current/api/apil;
the section "Distributing Python
Modules" in the standard Python documentation set is
still incomplete but is a good source of information on the
distutils package: Chapter 19
of this book covers iterators and generators in pure Python terms;
Python in a Nutshell covers the essentials of
extending and embedding, of the Python C API, of the
distutils package, and of iterators;
Python's Library Reference
covers the timeit module.