Recipe 5.12. Performing Frequent Membership Tests on a Sequence
Credit: Alex Martelli
Problem
You need to
perform frequent tests for membership in a sequence. The
O(n) behavior of repeated in
operators hurts performance, but you can't switch to
using just a dictionary or set instead of the
sequence, because you also need to keep the
sequence's order.
Solution
Say you need to append items to a list only if
they're not already in the list. One sound approach
to this task is the following function:
def addUnique(baseList, otherList):If your code has to run only under Python 2.4, you can get exactly
auxDict = dict.fromkeys(baseList)
for item in otherList:
if item not in auxDict:
baseList.append(item)
auxDict[item] = None
the same effect with an auxiliary set rather than an auxiliary
dictionary.
Discussion
A simple (naive?) approach to this recipe's task
looks good:
def addUnique_simple(baseList, otherList):and it may be sort of OK, if the lists are very
for item in otherList:
if item not in baseList:
baseList.append(item)
small.However, the simple approach can be quite slow if the lists are not
small. When you check if item not in baseList,
Python can implement the in operator in only one
way: an internal loop over the elements of baseList,
ending with a result of true as soon as an element
compares equal to item, with a result of
False if the loop terminates without having found
any equality. On average, executing the
in-operator takes time proportional to
len(baseList). addUnique_simple
executes the in-operator
len(otherList) times, so, in all, it takes time
proportional to the product of the lengths of
the two lists.In the addUnique function shown in the
"Solution", we first build the
auxiliary dictionary auxDict, a step that takes time
proportional to len(baseList). Then, the
in-operator inside the loop checks for membership
in a dicta step that makes all the
difference because checking for membership in a
dict takes roughly constant time, independent of
the number of items in the dict! So, the
for loop takes time proportional to
len(otherList), and the entire function takes time
proportional to the sum of the lengths of the
two lists.The analysis of the running times should in fact go quite a bit
deeper, because the length of baseList is not
constant in addUnique_simple; baseList grows each
time an item is processed that was not already
there. But the gist of the (surprisingly complicated) analysis is not
very different from what this simplified version indicates. We can
check this by measuring. When each list holds 10 integers, with an
overlap of 50%, the simple version is about 30% slower than the one
shown in the "Solution", the kind
of slowdown that can normally be ignored. But with lists of 100
integers each, again with 50% overlap, the simple version is
twelve times slower than the one shown in the
"Solution"a level of
slowdown that can never be ignored, and it only gets worse if the
lists get really substantial.Sometimes, you could obtain even better overall performance for your
program by permanently placing the auxiliary dict
alongside the sequence, encapsulating both into one object. However,
in this case, you must maintain the dict as the
sequence gets modified, to ensure it stays in sync with the
sequence's current membership. This maintenance task
is not trivial, and it can be architected in many different ways.
Here is one such way, which does the syncing "just
in time," rebuilding the auxiliary
dict when a membership test is required and the
dictionary is possibly out of sync with the list's
contents. Since it costs very little, the following class optimizes
the index method, as well as membership tests:
class list_with_aux_dict(list):The list_with_aux_dict class extends
def _ _init_ _(self, iterable=( )):
list._ _init_ _(self, iterable)
self._dict_ok = False
def _rebuild_dict(self):
self._dict = { }
for i, item in enumerate(self):
if item not in self._dict:
self._dict[item] = i
self._dict_ok = True
def _ _contains_ _(self, item):
if not self._dict_ok:
self._rebuild_dict( )
return item in self._dict
def index(self, item):
if not self._dict_ok:
self._rebuild_dict( )
try: return self._dict[item]
except KeyError: raise ValueError
def _wrapMutatorMethod(methname):
_method = getattr(list, methname)
def wrapper(self, *args):
# Reset 'dictionary OK' flag, then delegate to the real mutator method
self._dict_ok = False
return _method(self, *args)
# in Python 2.4, only: wrapper._ _name_ _ = _method._ _name_ _
setattr(list_with_aux_dict, methname, wrapper)
for meth in 'setitem delitem setslice delslice iadd'.split( ):
_wrapMutatorMethod('_ _%s_ _' % meth)
for meth in 'append insert pop remove extend'.split( ):
_wrapMutatorMethod(meth)
del _wrapMethod # remove auxiliary function, not needed any more
list and delegates to it every method, except
_ _contains_ _ and index. Every
method that can modify list membership is wrapped in a closure that
resets a flag asserting that the auxiliary dictionary is OK.
Python's in-operator calls the
_ _contains_ _ method.
list_with_aux_dict's _
_contains_ _ method rebuilds the auxiliary dictionary,
unless the flag is set (when the flag is set, rebuilding is
unnecessary); the index method works the same way.Instead of building and installing wrapping closures for all the
mutating methods of the list into the
list_with_aux_dict class with a helper function, as
the recipe does, we could write all the def
statements for the wrapper methods in the body of
list_with_aux_dict. However, the code for the class
as presented has the important advantage of minimizing boilerplate
(repetitious plumbing code that is boring and voluminous, and thus a
likely home for bugs). Python's strengths at
introspection and dynamic modification give you a choice: you can
build method wrappers, as this recipe does, in a smart and concise
way; or, you can choose to code the boilerplate anyway, if you prefer
to avoid what some would call the black magic of introspection and
dynamic modification of class objects.The architecture of class list_with_aux_dict caters
well to a rather common pattern of use, where sequence-modifying
operations happen in bunches, followed by a period of time in which
the sequence is not modified, but several membership tests may be
performed. However, the addUnique_simple function
shown earlier would not get any performance benefit if argument
baseList was an instance of this
recipe's list_with_aux_dict rather
than a plain list: the function interleaves
membership tests and sequence modifications. Therefore, too many
rebuilds of the auxiliary dictionary for
list_with_aux_dict would impede the
function's performance. (Unless a typical case was
for a vast majority of the items of
otherList to be already contained in
baseList, so that very few modifications
occurred compared to the number of membership tests.)An important requisite for any of these membership-test optimizations
is that the values in the sequence must be hashable (otherwise, of
course, they cannot be keys in a dict, nor items
in a set). For example, a list of tuples might be
subjected to this recipe's treatment, but for a list
of lists, the recipe as it stands is just not applicable.
See Also
The Library Reference and Python in
a Nutshell sections on sequence types and mapping
types.