Python Cookbook 2Nd Edition Jun 1002005 [Electronic resources] نسخه متنی

اینجــــا یک کتابخانه دیجیتالی است

با بیش از 100000 منبع الکترونیکی رایگان به زبان فارسی ، عربی و انگلیسی

Python Cookbook 2Nd Edition Jun 1002005 [Electronic resources] - نسخه متنی

David Ascher, Alex Martelli, Anna Ravenscroft

| نمايش فراداده ، افزودن یک نقد و بررسی
افزودن به کتابخانه شخصی
ارسال به دوستان
جستجو در متن کتاب
بیشتر
تنظیمات قلم

فونت

اندازه قلم

+ - پیش فرض

حالت نمایش

روز نیمروز شب
جستجو در لغت نامه
بیشتر
لیست موضوعات
توضیحات
افزودن یادداشت جدید







Recipe 19.14. Merging Sorted Sequences


Credit: Sébastien Keim, Raymond Hettinger, Danny
Yoo


Problem


You have several sorted sequences (iterables) and need to iterate on
the overall sorted sequence that results from
"merging" these sequences.


Solution


A generator is clearly the right tool for the job, in the general
case (i.e., when you might not have enough memory to comfortably hold
all the sequences). Implementing the generator is made easy by the
standard library module heapq, which supplies
functions to implement the "heap"
approach to priority queues:

import heapq
def merge(*subsequences):
# prepare a priority queue whose items are pairs of the form
# (current-value, iterator), one each per (non-empty) subsequence
heap = [ ]
for subseq in subsequences:
iterator = iter(subseq)
for current_value in iterator:
# subseq is not empty, therefore add this subseq's pair
# (current-value, iterator) to the list
heap.append((current_value, iterator))
break
# make the priority queue into a heap
heapq.heapify(heap)
while heap:
# get and yield lowest current value (and corresponding iterator)
current_value, iterator = heap[0]
yield current_value
for current_value in iterator:
# subseq is not finished, therefore add this subseq's pair
# (current-value, iterator) back into the priority queue
heapq.heapreplace(heap, (current_value, iterator))
break
else:
# subseq has been exhausted, therefore remove it from the queue
heapq.heappop(heap)


Discussion


The need for "merging" sorted
subsequences into a larger sorted sequence is reasonably frequent. If
the amount of data is small enough to fit entirely in memory without
problems, then the best approach is to build a list by concatenating
all subsequences, then sort the list:

def smallmerge(*subsequences):
result = [ ]
for subseq in subsequences: result.extend(subseq)
result.sort( )
return result

The sort method of list objects is based on a
sophisticated natural merge algorithm, able to
take advantage of existing sorted subsequences in the list
you're sorting; therefore, this approach is quite
fast, as well as simple (and general, since this
approach's correctness does not
depend on all subsequences being already sorted). If you can choose
this approach, it has many other advantages. For example,
smallmerge works fine even if one of the
subsequences isn't perfectly sorted
to start with; and in Python 2.4, you may add a generic keywords
argument **kwds to smallmerge and
pass it right along to the result.sort( ) step, to
achieve the flexibility afforded in that version by the
cmp=, key=, and
reverse= arguments to list's
sort method.

However, you sometimes deal with large sequences, which might not
comfortably fit in memory all at the same time (e.g., your sequences
might come from files on disk, or be computed on the fly, item by
item, by other generators). When this happens, this
recipe's generator will enable you to perform your
sequence merging while consuming a very moderate amount of extra
memory (dependent only on the number of subsequences, not on the
number of items in the subsequences).

The recipe's implementation uses a classic
sequence-merging algorithm based on a priority queue, which, in turn,
lets it take advantage of the useful heapq module
in the Python Standard Library. heapq offers
functions to implement a priority queue through the data structure
known as a heap.

A heap is any list
H such that, for any valid index
0<=i<len(H),
H[i]<=H[2*i+1], and
H[i]<=H[2*i+2] (if 2*i+1 and
2*i+2 are also valid indices into
H). This heap
property
is fast to establish on an arbitrary list
(function heapify does that) and very fast to
re-establish after altering or removing the smallest item (and
functions heapreplace and
heappop do that). The smallest item is always
H[0]
(it's easy to see that the
"heap" property implies this), and
being able to find the smallest item instantly makes heaps an
excellent implementation of priority queues.

In this recipe, we use as items in the
"heap" a
"pair" (i.e., two-items tuple) for
each subsequence that is not yet exhausted (i.e., each subsequence
through which we have not yet fully iterated). As its first item,
each pair has the "current item" in
the corresponding subsequence and, as its second item, an iterator
over that subsequence. At each iteration step, we yield the smallest
"current item", then we advance the
corresponding iterator and re-establish the
"heap" property; when an iterator
is exhausted, we remove the corresponding pair from the
"heap" (so that, clearly,
we're finished when the
"heap" is emptied). Note the idiom
that we use to advance an iterator by one step, dealing with the
possibility that the iterator is exhausted:

for current_value in iterator:
# if we get here the iterator was not empty, current_value was
# its first value, and the iterator has been advanced one step
...use pair (current_value, iterator)...
# we break at once as we only wanted the first item of iterator
break
else:
# if we get here the break did not execute, so the iterator
# was empty (exhausted)
# deal with the case of iterator being exhausted...

We use this idiom twice in the recipe, although in the first of the
two uses we do not need the else clause since we
can simply ignore iterators that are immediately exhausted (they
correspond to empty subsequences, which can be ignored for merging
purposes).

If you find this idiom confusing or tricky (because it uses a
for statement whose body immediately
breaksi.e., a statement that looks like a
loop but is not really a loop because it never executes more than
once!), you may prefer a different approach:

try:
current_value = iterator.next( )
except StopIteration:
# if we get here the iterator was empty (exhausted)
# deal with the case of iterator being exhausted...
else:
# if we get here the iterator was not empty, current_value was
# its first value, and the iterator has been advanced one step
# use pair (current_value, iterator)...

I slightly prefer the idiom using for; in my view,
it gains in clarity by putting the normal case (i.e., an unexhausted
iterator) first and the rare case (an exhausted iterator) later. A
variant of the try/except idiom
that has the same property is:

try:
current_value = iterator.next( )
# if we get here the iterator was not empty, current_value was
# its first value, and the iterator has been advanced one step
# use pair (current_value, iterator)...
except StopIteration:
# if we get here the iterator was empty (exhausted)
# deal with the case of iterator being exhausted...

However, I somewhat dislike this variant (even though
it's quite OK for the two specific uses of this
recipe) because it crucially depends on the code indicated as
"use
pair
"
never raising a
StopIteration exception. As a general principle,
it's best to use a TRy
clause's body that is as small as
possiblejust the smallest fragment that you
do expect to possibly raise the exception
you're catching in the following handlers
(except clauses), not the
follow-on code that must execute only if the exception was not
raised. The follow-on code goes in the else clause
of the try statement, in properly defensive
Pythonic coding style. In any case, as long as you are fully aware of
the tradeoffs in clarity and defensiveness between these three
roughly equivalent idioms, you're welcome to develop
your own distinctive Pythonic style and, in particular, to choose
freely among them!

If you do choose either of the idioms that explicitly call
iterator.next( ), a further
"refinement" (i.e., a tiny
optimization) is to keep as the second item of each pair, rather than
the iterator object, the bound-method
iterator.next directly, ready for you to call.
This optimization is not really tricky at all (it
is quite common in Python to stash away bound
methods and other such callables), but it may nevertheless result in
code of somewhat lower readability. Once again, the choice is up to
you!


See Also


Chapter 5 for general issues about sorting and
Recipe 5.7 and Recipe 5.8 about
heapq specifically; Library
Reference
and Python in a Nutshell
documentation on module heapq and
lists' sort method; Robert
Sedgewick, Algorithms (Addison-Wesley) (heaps
are covered starting on p. 178 in the 2d edition);
heapq.py in the Python sources contains an
interesting discussion of heaps.


/ 394