Python Cookbook 2Nd Edition Jun 1002005 [Electronic resources]

David Ascher, Alex Martelli, Anna Ravenscroft

نسخه متنی -صفحه : 394/ 338
نمايش فراداده

Recipe 18.4. Generating Random Samples Without Replacement

Credit: Raymond Hettinger

Problem

You need to generate random samples without replacement out of a "population" (the integers between included and some n excluded), and you want better memory consumption characteristics than random.sample provides.

Solution

A generator for this purpose requires only constant memory and makes a small number of calls to random.random:

import random
def sample(n, r):
" Generate r randomly chosen, sorted integers from [0,n) "
rand = random.random
pop = n
for samp in xrange(r, 0, -1):
cumprob = 1.0
x = rand( )
while x < cumprob:
cumprob -= cumprob * samp / pop
pop -= 1
yield n-pop-1

Discussion

random.sample(xrange(10), 3) produces output statistically equal to list(sample(10, 3)) using this recipe's sample. Differently from random.sample(xrange(n), r), this recipe's sample(n, r) requires a bounded amount of memory (which does not grow with either r or n) and is guaranteed to make only r calls to random.random. Moreover, this recipe's sample yields the r numbers of the sample in sorted order, while random.sample returns them in random orderwhich may be insignificant or a crucial advantage one way or the other, depending on your application's needs. A definite advantage of random.sample is that its running time is O(r), while this recipe's sample function's running time is O(n).

This recipe was inspired by Knuth's Algorithm S in Donald E. Knuth, The Art of Computer Programming, Volume 3, Seminumerical Algorithms, in section 3.4.2. However, this recipe has one improvement over Knuth's algorithm: by tracking a cumulative probability for each selection, this recipe eliminates the need to make n calls to random.random.

A potential major improvement would be to find a direct formula giving the same result as the inner loop: given x, samp, and pop, compute the index of the first sample. Finding this formula would reduce the running time to O(r).

See Also

Library Reference and Python in a Nutshell docs about random.