Recipe 5.7. Keeping a Sequence Ordered as Items Are Added
Credit: John Nielsen
Problem
You want to
maintain a sequence, to which items are added, in a sorted state, so
that at any time, you can easily examine or remove the smallest item
currently present in the sequence.
Solution
Say you start with an unordered list, such as:
the_list = [903, 10, 35, 69, 933, 485, 519, 379, 102, 402, 883, 1]You could call the_list.sort( ) to make the list
sorted and then result=the_list.pop(0) to get and
remove the smallest item. But then, every time you add an item (say
with the_list.append(0)), you need to call
the_list.sort( ) again to keep the list sorted.Alternatively, you can use the heapq module of the
Python Standard Library:
import heapqNow the list
heapq.heapify(the_list)
is not necessarily fully sorted, but it does satisfy the
heap property (meaning if all indices involved
are valid, the_list[i]<=the_list[2*i+1] and
the_list[i]<=the_list[2*i+2])so, in
particular, the_list[0] is the smallest item. To
keep the heap property valid, use
result=heapq.heappop(the_list) to get and remove
the smallest item and heapq.heappush(the_list,
newitem) to add a new item. When you need to do
bothadd a new item while getting and removing the previously
smallest itemyou can use
result=heapq.heapreplace(the_list,
newitem).
Discussion
When you need to retrieve data in an ordered way (at each retrieval
getting the smallest item among those you currently have at hand),
you can pay the runtime cost for the sorting when you retrieve the
data, or you can pay for it when you add the data. One approach is to
collect your data into a list and sort the list. Now
it's easy to get your data in order, smallest to
largest. However, you have to keep calling sort
each time you add new data during the retrieval, to make sure you can
later keep retrieving from the smallest current item after each
addition. The method sort of Python lists is
implemented with a little-known algorithm called Natural
Mergesort, which minimizes the runtime cost of this
approach. Yet the approach can still be burdensome: each addition
(and sorting) and each retrieval (and removal, via
pop) takes time proportional to the number of
current items in the list (O(N), in common
parlance).An alternative
approach is to use a data organization known as a
heap, a type of binary tree implemented
compactly, yet ensuring that each
"parent" is always less than its
"children". The best way to
maintain a heap in Python is to use a list and
have it managed by the heapq library module, as
shown in this recipe's Solution. The list does not
get fully sorted, yet you can be sure that, whenever you
heappop an item from the list, you always get the
lowest item currently present, and all others will be adjusted to
ensure the heap property is still valid. Each addition with
heappush, and each removal with
heappop, takes a short time proportional to the
logarithm of the current length of the list (O(log
N), in common parlance). You pay as you go, a little at a
time (and not too much in total, either.)A good occasion to use this heap approach, for example, is when you
have a long-running queue with new data periodically arriving, and
you always want to be able to get the most important item off the
queue without having to constantly re-sort your data or perform full
searches. This concept is known as a priority
queue, and a heap is an excellent way to implement it.
Note that, intrinsically, the heapq module
supplies you with the smallest item at each
heappop, so make sure to arrange the way you
encode your items' priority values to reflect this.
For example, say that you receive incoming items each accompanied by
a cost, and the most important item at any time is the one with the
highest cost that is currently on the queue; moreover, among items of
equal cost, the most important one is the one that arrived earliest.
Here's a way to build a "priority
queue" class respecting these specs and based on
functions of module
heapq:
class prioq(object):The main idea in this snippet is to push on the heap tuples whose
def _ _init_ _(self):
self.q = [ ]
self.i = 0
def push(self, item, cost):
heapq.heappush(self.q, (-cost, self.i, item))
self.i += 1
def pop(self):
return heapq.heappop(self.q)[-1]
first item is the cost with changed sign, so
that higher costs result in
smaller tuples (by Python's
natural comparison); right after the cost, we put a progressive
index, so that, among items with equal cost, the one arriving
earliest will be in a smaller tuple.In Python 2.4, module heapq has been reimplemented
and optimized; see Recipe 5.8 for more information about
heapq.
See Also
Docs for module heapq in the Library
Reference and Python in a
Nutshell; heapq.py in the Python
sources contains a very interesting discussion of heaps; Recipe 5.8 for more information about
heapq; Recipe 19.14 for merging sorted sequences
using heapq.