Recipe 5.6. Processing All of a List's Items in Random Order
Credit: Iuri Wickert, Duncan Grisby, T. Warner, Steve
Holden, Alex Martelli
Problem
You need to process, in random
order, all of the items of a long list.
Solution
As usual in Python, the best approach is the simplest one. If we are
allowed to change the order of items in the input list, then the
following function is simplest and fastest:
import random
def process_all_in_random_order(data, process):
    # first, put the whole list into random order
    random.shuffle(data)
    # next, just walk over the list linearly
    for elem in data: process(elem)
If we must preserve the input list intact, or if the input data may be some iterable that is not a list, just insert as the first statement of the function the assignment data = list(data).
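For instance, a non-destructive variant along those lines might look like the following sketch (the name process_all_preserving_input is ours, for illustration only):
import random
def process_all_preserving_input(data, process):
    # copy the input into a private list, so the caller's data
    # (which may be any iterable) is left untouched
    data = list(data)
    random.shuffle(data)
    for elem in data: process(elem)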
Discussion
While it's a common
mistake to be overly concerned with speed, don't
make the opposite mistake of ignoring the different performances of
various algorithms. Suppose we must process all of the items in a
long list in random order, without repetition (assume that
we're allowed to mangle or destroy the input list).
The first idea to suggest itself might be to repeatedly pick an item
at random (with function random.choice), removing
each picked item from the list to avoid future repetitions:
import random
def process_random_removing(data, process):
    while data:
        elem = random.choice(data)
        data.remove(elem)
        process(elem)
However, this function is painfully slow, even for input lists of just a few hundred elements. Each call to data.remove must linearly search through the list to find the element to delete. Since the cost of each of n steps is O(n), the whole process is O(n²): time proportional to the square of the length of the list (and with a large multiplicative constant, too).
Minor improvements to this first idea could focus on obtaining random
indices, using the pop method of the list to get
and remove an item at the same time, low-level fiddling with indices
to avoid the costly removal in favor of swapping the picked item with
the last yet-unpicked one towards the end, or using dictionaries or
sets instead of lists. This latest idea might be based on a hope of
using a dict's
popitem method (or the equivalent method
pop of class sets.Set and
Python 2.4's built-in type set),
which may look like it's designed exactly to pick
and remove a random item, but, beware!
dict.popitem is documented to return and remove an
arbitrary item of the dictionary, and
that's a far cry from a random
item. Check it out:
>>> d=dict(enumerate('ciao'))
>>> while d: print d.popitem( )
It may surprise you, but in most Python implementations this snippet will print d's items in a far from random order, typically (0, 'c') then (1, 'i') and so forth. In short, if you need pseudo-random behavior in Python, you need standard library module random; popitem is not an alternative.
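For the record, here is a minimal sketch of the index-swapping idea mentioned above (the name process_random_swapping is ours, for illustration only): instead of removing the picked item, it swaps it with the last yet-unpicked one, so each pick costs O(1) and the whole walk stays O(n). It is essentially an inline form of the classic Fisher-Yates shuffling technique.
import random
def process_random_swapping(data, process):
    n = len(data)
    while n:
        i = random.randrange(n)    # pick among the yet-unpicked items data[0:n]
        elem = data[i]
        n -= 1
        data[i] = data[n]          # move the last yet-unpicked item into slot i
        data[n] = elem             # picked items accumulate at the list's tail
        process(elem)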
If you thought about using a dictionary rather than a list, you are definitely on your way to "thinking Pythonically", even though it turns out that
dictionaries wouldn't provide a substantial
performance boost for this specific problem. However, an approach
that is even more Pythonic than choosing the right data structure is
best summarized as: let the standard library do it! The Python
Standard Library is large, rich, and chock full of useful, robust,
fast functions and classes for a wide variety of tasks. In this case,
the key intuition is realizing that, to walk over a sequence in a
random order, the simplest approach is to first
put that sequence into random order (known as
shuffling the sequence, an analogy with
shuffling a deck of cards) and then walk over the shuffled sequence
linearly. Function random.shuffle performs the
shuffling, and the function shown in this recipe's
Solution just uses it.
Performance should always be
measured, never guessed at, and that's what standard
library module timeit is for. Using a null
process function and a list of length 1,000 as
data, process_all_in_random_order
is almost 10 times faster than
process_random_removing; with a list of length
2,000, the performance ratio grows to almost 20. While an improvement
of, say, 25%, or even a constant factor of 2, usually can be
neglected without really affecting the performance of your program as
a whole, the same does not apply to an algorithm that is 10 or 20
times as slow as it could be. Such terrible performance is likely to
make that program fragment a bottleneck, all by itself. Moreover,
this risk increases when we're talking about
O(n²) versus O(n)
behavior: with such differences in big-O behavior, the performance
ratio between bad and good algorithms keeps increasing without bounds
as the size of the input data grows.
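As a concrete illustration of the kind of measurement described above, a timeit-based sketch along the following lines could be used; it assumes that both functions are defined in the script being run (hence the import from __main__), that null_process is a do-nothing helper defined only for the measurement, and the exact numbers will of course vary with machine and Python version:
import timeit
setup = '''
from __main__ import process_all_in_random_order, process_random_removing
def null_process(elem): pass
data = range(1000)
'''
for name in ('process_all_in_random_order', 'process_random_removing'):
    # pass a fresh copy on every run, since both functions mangle their argument
    timer = timeit.Timer('%s(list(data), null_process)' % name, setup)
    print name, min(timer.repeat(repeat=3, number=10))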
See Also
The documentation for the random and
timeit modules in the Library
Reference and Python in a
Nutshell.