Recipe 18.4. Generating Random Samples Without Replacement
Credit: Raymond Hettinger
Problem
You need to generate random samples without replacement out of a
"population" (the integers between
included and some n excluded), and you
want better memory consumption characteristics than
random.sample provides.
Solution
A generator for this purpose requires only constant memory and makes
a small number of calls to random.random:
import random
def sample(n, r):
" Generate r randomly chosen, sorted integers from [0,n) "
rand = random.random
pop = n
for samp in xrange(r, 0, -1):
cumprob = 1.0
x = rand( )
while x < cumprob:
cumprob -= cumprob * samp / pop
pop -= 1
yield n-pop-1
Discussion
random.sample(xrange(10), 3) produces output
statistically equal to list(sample(10,
3)) using this recipe's
sample. Differently from
random.sample(xrange(n),
r), this
recipe's
sample(n,
r) requires a
bounded amount of memory (which does not grow with either
r or n) and is
guaranteed to make only r calls to
random.random. Moreover, this
recipe's sample yields the
r numbers of the sample in sorted order,
while random.sample returns them in random
orderwhich may be insignificant or a crucial advantage one way
or the other, depending on your application's needs.
A definite advantage of random.sample is that its
running time is
O(r),
while this recipe's sample
function's running time is
O(n).This recipe was inspired by Knuth's
Algorithm S in Donald E. Knuth, The
Art of Computer Programming, Volume 3,
Seminumerical Algorithms, in
section 3.4.2. However, this recipe has one improvement over
Knuth's algorithm: by tracking a cumulative
probability for each selection, this recipe eliminates the need to
make n calls to
random.random.A potential major improvement would be to find a direct formula
giving the same result as the inner loop: given
x, samp, and
pop, compute the index of the first
sample. Finding this formula would reduce the running time to
O(r).
See Also
Library Reference and Python in a
Nutshell docs about random.