Introduction
Credit: Raymond Hettinger

Lather, Rinse, Repeat
   --Docs for my bottle of shampoo
The Iterator Protocol
After
namespaces, iterators and generators emerged as the next
"honking great ideas" in Python.
Since their introduction in Python 2.2, they have come to pervade and
unify the language. They encourage a loosely coupled programming
style that is simple to write, easy to read, flexible, and
extendable.

Simply put, the iterator protocol has two halves, a producer and a
consumer. An iterable object says, "I know how to
supply data one element at a time," and the consumer
says "please give me data one element at a time and
say Stop when you're done."

The producer/consumer connection can appear in a number of guises.
The simplest is where a function or constructor wraps around an
iterable object. For example,
sorted(set('simsalabim')) has the set constructor
looping over the elements of the iterable string and a
sorted function wrapping around the resulting
iterable set object.
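
At the interactive prompt, that chain of producers and consumers looks
like this: the set constructor consumes the string one character at a
time, and sorted then consumes the resulting set the same way.

>>> sorted(set('simsalabim'))
['a', 'b', 'i', 'l', 'm', 's']
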
In addition to functions and constructors, regular Python statements
can use the in operator to loop over iterable
objects. for line in myfile: print line loops over
lines of an iterable file object. Likewise,
if token in sequence loops over elements of a
sequence until it finds a match (or until it reaches the end with no
matches).

Both guises of the consumer side of the iterator protocol use the
protocol implicitly. In addition, an explicit form is more flexible
but used less often. The iterator object is saved as a variable,
it = iter(mystring). Then, the
iterator's next method is called
to retrieve a data element, elem = it.next( ).
Such calls are usually wrapped in
try/except statements to catch
the StopIteration exception that the iterator
raises when the data stream is exhausted.

All of these approaches provide the full range of iterator benefits,
including loose coupling and memory friendliness. The loose coupling
is evident in the first example, where the independently created and
maintained sorted function, set
data type, and string objects were readily
combined. The memory friendliness derives from the one-at-a-time
structure of the iterator protocol. Programs using iterators are
likely to be less resource intensive and more scalable than their
list-based counterparts.
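
To make the explicit form concrete, here is a minimal sketch that
walks a short string by hand, written with the chapter's Python 2
spelling of the iterator method (it.next rather than next(it)):

it = iter('abc')
while True:
    try:
        elem = it.next()         # ask the iterator for the next data element
    except StopIteration:        # the producer signals that the stream is exhausted
        break
    print elem                   # consume the element; prints a, b, c in turn

This is exactly the work that a for statement performs behind the
scenes.
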
Iterators and Generators
An object that wants to be iterable should implement an __iter__
method, which returns an iterator object. Ideally,
the iterator should be a distinct object from the iterable, to make
it possible to have multiple iterators over the same iterable
container. There are exceptions to this general recommendation: for
example, a sequential file object does not readily
lend itself to multiple iterations; therefore, it is more appropriate
in this case to have the file object be its own iterator rather than
return a separate iterator object; indeed, given any file
instance f, iter(f) is f.

Any iterator object must implement a next method and an __iter__
method. The
next method should raise
StopIteration when the iteration is complete. Care
should be taken that subsequent calls to next
continue to raise StopIteration (once stopped, it
stays stopped). The __iter__ method of an iterator object should
always return the iterator itself (__iter__ is idempotent on
iterators). This simplifies
client code by allowing it to treat iterators and iterables the same
way (i.e., both return an iterator in response to the
iter function).
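
As a sketch of those responsibilities (Countdown is a hypothetical
example class, not one of this chapter's recipes), a container can
hand out fresh iterator objects, each of which honors the
once-stopped-stays-stopped rule:

class Countdown(object):
    "Iterable: each call to iter() hands back a fresh, independent iterator."
    def __init__(self, start):
        self.start = start
    def __iter__(self):
        return CountdownIterator(self.start)

class CountdownIterator(object):
    "Iterator: next() produces values, then raises StopIteration forever after."
    def __init__(self, remaining):
        self.remaining = remaining
    def __iter__(self):
        return self              # idempotent: an iterator is its own iterator
    def next(self):              # Python 2 spelling; __next__ in Python 3
        if self.remaining <= 0:
            raise StopIteration  # once stopped, it stays stopped
        self.remaining -= 1
        return self.remaining + 1

print list(Countdown(3))         # prints [3, 2, 1]

Two iterators obtained from the same Countdown instance advance
independently of each other.
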
To be useful, most iterators have a stored state that enables them to
return a new data element on each call. The previously described
responsibilities and the need to have a stored state both suggest
that classes are the way to create iterable objects. That approach is
obvious, explicit, and rarely used (only two recipes in this chapter
make any use of classes).

Instead of writing classes, two alternate approaches dominate.
Starting with the observation that many functions and types both
accept iterable inputs and return iterable outputs, an obvious
approach is to link them together in a "pipes and
filters" style to create new tools. For example,
def uniq(seq): return sorted(set(seq)) is a way to
create a new tool directly from existing functions and types. Like
functional programming, the resulting code is terse, readable,
trivial to debug, and often runs at the speed of compiled C code. The
economy of this approach motivated the creation of an entire module
of iterator building blocks, the itertools module.
Indeed, many of the brilliant, effective recipes in this chapter make
frequent use of itertools components.
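
For instance, a lazy pipeline that squares the even numbers can be
assembled entirely from such building blocks; take is an illustrative
helper here, not part of itertools itself:

import itertools

def take(n, iterable):
    "Illustrative helper: return the first n items of any iterable as a list."
    return list(itertools.islice(iterable, n))

evens   = itertools.ifilter(lambda x: x % 2 == 0, itertools.count())   # infinite stream
squares = itertools.imap(lambda x: x * x, evens)                       # still lazy
print take(5, squares)                                                 # [0, 4, 16, 36, 64]
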
If no combination of building blocks solves the problem, the next
best approach is to write a generator. Recipe 19.1
shows how trivially easy it
is to write a generator. By introducing a yield
keyword, the responsibilities of creating an iterator are handled
automatically. The iterator objects obtained by calling a generator
are distinct, save their state, have an idempotent __iter__
method, and have a next method that
raises StopIteration when complete and stays
stopped if called again afterwards. Python internally takes care of
all of these details. Because of generators'
compelling simplicity, most of the recipes in this chapter make use
of generators.
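
For instance, an order-preserving variant of the uniq tool above
takes only a few lines as a generator (uniq_ordered is an
illustrative name, not a recipe from this chapter):

def uniq_ordered(iterable):
    "Yield each element the first time it is seen, preserving original order."
    seen = set()
    for item in iterable:
        if item not in seen:
            seen.add(item)
            yield item           # execution suspends here until the next element is requested

print list(uniq_ordered('simsalabim'))    # prints ['s', 'i', 'm', 'a', 'l', 'b']
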
Starting with version 2.4, Python continued its evolution toward
using iterators everywhere by introducing generator expressions
(genexps for short). Genexps can be likened to
a memory-efficient, scalable form of list comprehensions. Simply by
replacing brackets with parentheses, an expression will yield one
element at a time rather than filling memory all at once. Used
correctly (i.e., in a context where they are consumed immediately,
one item at a time), genexps can offer remarkable clarity and
economy: sum(x*x for x in xrange(10)) is a great
way to express the sum of the squares of the first ten natural
numbers.
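
The distinction is easy to see side by side:

squares_list = [x*x for x in xrange(10)]    # builds a ten-element list in memory
squares_gen  = (x*x for x in xrange(10))    # builds only a small generator object
print sum(squares_list), sum(squares_gen)   # 285 285
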
Thinking Out of the Box
Paradoxically, the simpler and more general an idea, the more likely
that people will find extraordinary and unexpected ways of using it.
Here is a brief sampling of the ways that iterators and generators
have been pushed to their outer limits.

Observing that the
yield keyword has the unique capability of
stopping execution, saving state, and later resuming, it is not
surprising that techniques have been discovered for using generators
to simulate co-routines and continuations. The core idea is to
implement each routine as a generator and have a
dispatch function launch the routines in
orderly succession. Whenever a task switch is needed, the routines
yield back to the dispatcher, which then launches or resumes the next
routine by calling its next method. Small
complications are involved for startup, termination, and data
sharing, but they each are solvable without much ado and present
fewer challenges than equivalent thread-based solutions. See
Recipe 9.8 for an example.
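
A bare-bones sketch of the dispatching idea (hypothetical task
functions, simple round-robin scheduling, no data sharing) might look
like this:

def task(name, count):
    "A routine written as a generator; each yield hands control back to the dispatcher."
    for i in range(count):
        print '%s step %d' % (name, i)
        yield None               # task switch: suspend here, resume at the next next() call

def dispatch(routines):
    "Launch the routines in round-robin succession until all are exhausted."
    routines = list(routines)
    while routines:
        for routine in list(routines):
            try:
                routine.next()   # resume the routine until it yields again
            except StopIteration:
                routines.remove(routine)

dispatch([task('A', 3), task('B', 2)])
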
Observing that some tools can be both producers and consumers, it is
natural to want to stack them together like pipes and filters. While
that analogy can lead to useful decoupling, be aware that the underlying
models are different. Iterators do not run independently from start
to finish; instead, an outermost layer is always in control,
requesting data elements one at a time, so that nothing runs until
the outer layer starts making requests.

When stacking tools together (as in the first example with
sorted, set, and a string), the
code takes on the appearance of a functional programming language.
The resemblance is not shallow: iterators do fulfill some of the
promise of lazy languages. So, it is natural to borrow some of the
most successful techniques from those languages, such as Haskell and
SML.

One such technique is to write innermost iterators to yield infinite
streams and concentrate the control logic in an outermost driver
function. For instance, in numerical programming, write a generator
that yields successively better approximations to a desired result
and call it from a function that stops whenever two successive
approximations fall within a tolerance value. Separating the control
logic from the calculation decouples the two, making them easier to
write, test, and debug, and makes them more reusable in other
contexts.
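
A sketch of that pattern, using Newton's method for square roots
(the function names are illustrative):

def sqrt_approximations(x):
    "Innermost iterator: yield an endless stream of ever-better approximations to sqrt(x)."
    guess = float(x)
    while True:
        yield guess
        guess = (guess + x / guess) / 2.0    # Newton's method refinement step

def within(tolerance, approximations):
    "Outermost driver: stop when two successive approximations agree closely enough."
    approximations = iter(approximations)
    previous = approximations.next()
    for current in approximations:
        if abs(current - previous) < tolerance:
            return current
        previous = current

print within(1e-9, sqrt_approximations(2))   # 1.41421356237...

The generator knows nothing about tolerances, and the driver knows
nothing about square roots, which is exactly the decoupling described
above.
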
Odds and Ends
Here are some instructive snippets. Consider each of them carefully,
study how they work, and you'll be well on your way
towards understanding how best to link iterators together to solve
practical problems. Each of the following lines is independent from
the others:
result = dict(enumerate(myseq))
result = set(word for line in page for word in line.split( ))
def dotproduct(v1, v2):
    return sum(itertools.imap(operator.mul, v1, v2))
def dotproduct(v1, v2):
    return sum(x*y for x,y in itertools.izip(v1, v2))
randgen = itertools.starmap(random.random, itertools.repeat(( )))
randgen = iter(random.random, -1.0)

The idea for restartable iterators surfaces every so often and then
drowns in quicksand. sys.stdin is a plain example
of an iterable that cannot logically be restarted unless an entire
session is saved in memory. A craving for restartability should be
taken as a cue that a list might well be a more
appropriate data structure.

Just because iterators cannot be restarted doesn't
mean they cannot be abandoned in mid-stream. The lazy, just-in-time
style of production is a key feature of iterators. Take advantage of
it. That's why the for statement
supports a break keyword, after all.

The core itertools and their derivatives (see the
recipes in the itertools docs that are part of the
Python Library Reference) all run at nearly
the speed of compiled code. When Python 2.4 introduced a native
set data type, I timed it against the pure-Python
version, sets.py, and learned that some of the
set logic (intersection, union, etc.) achieved only a two to one
increase in speed. The reason was that sets.py
used itertools, and itertools
performance was exceptional. So, when performance is an issue,
consider an itertools solution before turning to
more labor-intensive optimizations or native language extensions.