Recipe 18.2. Removing Duplicates from a Sequence While Maintaining Sequence Order
Credit: Alex Martelli
Problem
You have a sequence that may include duplicates, and you want to
remove the duplicates in the fastest possible way. Moreover, the
output sequence must respect the item ordering of the input sequence.
Solution
The need to respect the item ordering of the input sequence means
that picking unique items becomes a problem quite different from that
explored previously in Recipe 18.1. This requirement often
arises in conjunction with a function f
that defines an equivalence relation among items:
x is equivalent to
y if and only if
f(x)==f(y).
In this case, the task of removing duplicates may often be better
described as picking the first representative of each resulting
equivalence class. Here is a function to perform this task:
# support 2.3 as well as 2.4
try: set
except NameError: from sets import Set as set

# f defines an equivalence relation among items of sequence seq, and
# f(x) must be hashable for each item x of seq
def uniquer(seq, f=None):
    """ Keeps earliest occurring item of each f-defined equivalence class """
    if f is None:        # f's default is the identity function f(x) -> x
        def f(x): return x
    already_seen = set( )
    result = [ ]
    for item in seq:
        marker = f(item)
        if marker not in already_seen:
            already_seen.add(marker)
            result.append(item)
    return result
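As a quick illustration, here is the same function rendered in Python 3 (where the sets-module fallback is no longer needed), exercised on a couple of invented sample sequences:

```python
# Python 3 rendition of uniquer, plus two sample calls.
def uniquer(seq, f=None):
    """ Keeps earliest occurring item of each f-defined equivalence class. """
    if f is None:                      # default: identity, f(x) -> x
        def f(x): return x
    already_seen = set()
    result = []
    for item in seq:
        marker = f(item)
        if marker not in already_seen:
            already_seen.add(marker)
            result.append(item)
    return result

print(uniquer([1, 2, 1, 3, 2, 4]))
# emits: [1, 2, 3, 4]
print(uniquer(['Rome', 'rome', 'Paris', 'ROME'], str.lower))
# emits: ['Rome', 'Paris']
```

Because already_seen is a real set, each membership test takes expected constant time, so the whole pass runs in roughly O(n) time, versus the O(n^2) that a naive list-based membership test would cost.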
Discussion
The previous Recipe 18.1 is
applicable only if you are not concerned about item ordering or, in
other words, if the sequences involved are meaningful only as the
sets of their items, which is often the case. When sequential order
is significant, a different approach is needed. If the items are hashable, it's not hard to maintain
sequence order, keeping only the first occurrence of each value. More
generally, we may want uniqueness within equivalence classes, as
shown in this recipe's Solution: the
uniquer function accepts as an argument a function
f that must return hashable objects, such
that
f(x)==f(y)
if and only if items x and
y are equivalent. Identity (in the
mathematical sense, not in the Python sense) is used as the default
when no argument f is supplied. The added
generality of allowing an f different from
the identity function costs no added complication whatsoever. If you need to keep the last occurrence, rather than the earliest
occurrence, of an item in each equivalence class, the simplest
approach is to reverse the input sequence (or,
rather, a copy thereof into a local list, since the input might be
immutable or at any rate not support reversing), then, after
processing with uniquer, reverse
the resulting list:
def uniquer_last(seq, f=None):
    seq = list(seq)
    seq.reverse( )
    result = uniquer(seq, f)
    result.reverse( )
    return result
In Python 2.4, instead of the first three statements of this version
of uniquer_last, you could use the single statement:
result = uniquer(reversed(seq), f)
exploiting the new built-in reversed. However,
this Python 2.4 version, while marginally faster, is less general,
because it does require seq to be really a sequence,
while the previously shown version (and the
uniquer function in the
"Solution") work with any
iterable seq. For example:
somelines = uniquer_last(open('my.txt'), str.lower)
binds name somelines to the list of unique
lines from text file my.txt, considering two
lines equivalent if they're equal aside from
uppercase and lowercase distinctions, picking the last occurring one
from each set of equivalent lines, and preserving the order of the
lines in the file (phew). If you used Python 2.4's
built-in reversed, this latest snippet would not
work, due to reversed's
prerequisites. If you must deal with nonhashable items, the simplest fallback
approach is to use a set-like container that supports the
add method and membership testing without
requiring items to be hashable. Unfortunately, performance will be
much worse than with a real
set. Here's the simplest fallback
implementation, which demands of items nothing but support for
equality testing:
def uniquer_with_simplest_fallback(seq, f=None):
    if f is None:
        def f(x): return x
    already_seen = set( )
    result = [ ]
    for item in seq:
        marker = f(item)
        try:
            new_marker = marker not in already_seen
        except TypeError:
            class TotallyFakeSet(list):
                add = list.append
            already_seen = TotallyFakeSet(already_seen)
            new_marker = marker not in already_seen
        if new_marker:
            already_seen.add(marker)
            result.append(item)
    return result
A more refined approach would be to use two levels of fallback, the
intermediate one based on sorting, as shown previously in Recipe 18.1: membership testing in a sorted
list can be performed efficiently by using the Python Standard
Library module bisect. However, remember that you can often use an
f that gives you hashable markers for
nonhashable items. The built-in function repr can
often be useful for this purpose. For example:
lol = [ [1, 2], [ ], [1, 2], [3], [ ], [3, 4], [1, 2], [ ], [2, 1] ]
print uniquer(lol, repr)
# emits: [[1, 2], [ ], [3], [3, 4], [2, 1]]
While the items of lol are lists, and thus
are not hashable, the built-in function repr
produces representations of each of the items as a string, which
is hashable. This enables use of the fast
function uniquer. Unfortunately,
repr is not useful for nonhashable items of other
types, including dict and set.
Because of the workings of hash-collision resolution,
it's quite possible to have
d1==d2
and yet
repr(d1)!=repr(d2)
for two dictionaries d1 and
d2, depending on the exact sequences of
adds that built each dict. Still, you may be able to
build your own repr-like function to work around
these issues, depending on the exact nature of your data. Whether
repr can help for instances of a certain
user-defined type depends on how accurately and usefully that
specific type defines special method __repr__,
which repr calls. The task of picking one representative item, out of all of those
belonging to each equivalence class, can be generalized. Instead of
the simple ideas of implicitly picking the first such item, or the
last such item, we can choose among multiple items in the same
equivalence class via an arbitrary picking
function p that considers both the actual
items and their indexes of occurrence. As long as function
p can operate pairwise, the key idea is
just to keep a dictionary that maps the marker of each equivalence
class to the index and item of the representative so far picked for
that class. At the end, we reconstruct sequence order by sorting on
the indices:
def fancy_unique(seq, f, p):
    """ Keeps "best" item of each f-defined equivalence class, with
        picking function p doing pairwise choice of (index, item) """
    representative = { }
    for index, item in enumerate(seq):
        marker = f(item)
        if marker in representative:
            # It's NOT a problem to rebind index and item within the
            # for loop: the next leg of the loop does not use their binding
            index, item = p((index, item), representative[marker])
        representative[marker] = index, item
    # reconstruct sequence order by sorting on indices
    auxlist = representative.values( )
    auxlist.sort( )
    return [item for index, item in auxlist]
It's possible that the picking function cannot
operate pairwise, but rather must be presented with the whole bunch
of (index,
item) pairs
for each equivalence class in order to pick the best representative
of that class (e.g., it may have to get the
median of the items in each class as being the
best representative of that class). Then we need one pass over the
sequence to collect the bunches, followed by a pass over the bunches,
to pick the representatives:
def fancier_uniquer(seq, f, p):
    """ Keeps "best" item of each f-defined equivalence class, with
        picking function p choosing appropriate (index, item) for each
        equivalence class from the list of all (index, item) pairs in
        that class """
    bunches = { }
    for index, item in enumerate(seq):
        marker = f(item)
        bunches.setdefault(marker, [ ]).append((index, item))
    auxlist = [p(candidates) for candidates in bunches.values( )]
    auxlist.sort( )
    return [item for index, item in auxlist]
These fancy approaches that rely on a picking function are useful
only for substantial equivalence functions, not for identity, so I
removed f's default value
from these versions. An example of use for fancy_unique may help. Say
we're given a list of words, and we need to get a
sublist from it, respecting order, such that no two words on the
sublist begin with the same letter. Out of all the words in the
original list that begin with
each given letter, we need to keep the longest word and, in case of
equal lengths, the word appearing later on the list. This sounds
complicated, but with fancy_unique to help us,
it's really not that bad:
def complicated_choice(words):
    def first_letter(aword):
        return aword[0].lower( )
    def prefer((indx1, word1), (indx2, word2)):
        if len(word2) > len(word1):
            return indx2, word2
        return indx1, word1
    return fancy_unique(words, first_letter, prefer)
The prefer nested function within
complicated_choice is simplified because it knows
fancy_unique always calls it with
indx2<indx1. So, the older indx2,
word2 pair must be returned only when
word2 is longer than
word1; otherwise, indx1,
word1 is always the proper result. The automatic tuple
unpacking in prefer's signature is
debatable, stylewise, but I like it (it reminds me of SML or
Haskell). Out of all the general programming techniques presented in the
various functions of this recipe, the idea of writing higher-order
functions, which organize a computation and appropriately call back
to functions that they receive as arguments, is easily the most
precious and widely applicable concept. This idea is well worth
keeping in mind in several circumstances, not just for old
Haskell-heads, because it works just as well in Python.
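As a closing sketch, the picking-function machinery can be exercised end to end. Here is a Python 3 rendition of fancy_unique and complicated_choice (Python 3 no longer accepts tuple parameters in a def, so prefer unpacks its pairs in the body; the word list is invented for illustration):

```python
def fancy_unique(seq, f, p):
    """ Keeps "best" item of each f-defined equivalence class, with
        picking function p doing pairwise choice of (index, item). """
    representative = {}
    for index, item in enumerate(seq):
        marker = f(item)
        if marker in representative:
            index, item = p((index, item), representative[marker])
        representative[marker] = index, item
    # reconstruct sequence order by sorting on indices
    return [item for index, item in sorted(representative.values())]

def complicated_choice(words):
    def first_letter(aword):
        return aword[0].lower()
    def prefer(pair1, pair2):          # Python 3: unpack inside the body
        indx1, word1 = pair1
        indx2, word2 = pair2
        if len(word2) > len(word1):
            return indx2, word2
        return indx1, word1
    return fancy_unique(words, first_letter, prefer)

print(complicated_choice(['and', 'ant', 'bee', 'cat', 'Able', 'cow']))
# emits: ['bee', 'Able', 'cow']
```

Note that prefer still relies on fancy_unique passing the newer (index, item) pair as its first argument, so ties on length correctly keep the later word ('Able' wins the a-class over 'and' and 'ant'; 'cow' wins the c-class tie over 'cat').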
See Also
Recipe 18.1.