Python Cookbook, 2nd Edition (Jun 10, 2005) [Electronic resources] - text version

David Ascher, Alex Martelli, Anna Ravenscroft








Recipe 18.1. Removing Duplicates from a Sequence


Credit: Tim Peters


Problem


You have a sequence that
may include duplicates, and you want to remove the duplicates in the
fastest possible way, without knowing much about the properties of
the items in the sequence. You do not care about the
order of items in the resulting
sequence.


Solution


The key is to try several approaches, fastest first, and use
try/except to handle the
failing cases of the faster approaches by falling back to slower
approaches. Here's a function that implements
exactly this strategy:

# support 2.3 as well as 2.4
try: set
except NameError: from sets import Set as set

def unique(s):
    """ Return a list of the elements in s in arbitrary order, but without
    duplicates. """
    # Try using a set first, because it's the fastest and will usually work
    try:
        return list(set(s))
    except TypeError:
        pass                      # Move on to the next method
    # Since you can't hash all elements, try sorting, to bring equal items
    # together and then weed them out in a single pass
    t = list(s)
    try:
        t.sort()
    except TypeError:
        del t                     # Move on to the next method
    else:
        # the sort worked, so we're fine -- do the weeding
        return [x for i, x in enumerate(t) if not i or x != t[i-1]]
    # Brute force is all that's left
    u = []
    for x in s:
        if x not in u:
            u.append(x)
    return u


Discussion


The purpose of this recipe's unique
function is to take a sequence s as an
argument and return a list of the items in s in
arbitrary order, but without duplicates. For example, calling
unique([1, 2, 3, 1, 2, 3]) returns an arbitrary
permutation of [1, 2, 3],
calling unique('abcabc') returns an arbitrary
permutation of ['a', 'b', 'c'], and calling
unique(([1, 2], [2, 3], [1, 2])) returns an
arbitrary permutation of [[2, 3], [1, 2]].
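A quick sketch of these three calls (Python 3; the unique function is reproduced from the Solution above, with results sorted only so they can be compared deterministically):

```python
def unique(s):
    """Return a list of the elements in s in arbitrary order, but
    without duplicates (copied from the recipe's Solution)."""
    # Fastest path: hash the elements into a set
    try:
        return list(set(s))
    except TypeError:
        pass
    # Next best: sort to bring duplicates together, then weed
    t = list(s)
    try:
        t.sort()
    except TypeError:
        del t
    else:
        return [x for i, x in enumerate(t) if not i or x != t[i-1]]
    # Last resort: brute-force equality scan
    u = []
    for x in s:
        if x not in u:
            u.append(x)
    return u

print(sorted(unique([1, 2, 3, 1, 2, 3])))        # [1, 2, 3]
print(sorted(unique('abcabc')))                  # ['a', 'b', 'c']
print(sorted(unique(([1, 2], [2, 3], [1, 2]))))  # [[1, 2], [2, 3]]
```

The third call exercises the sorting path, because lists are unhashable; the first two take the fast set-based path.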

The fastest way to remove duplicates from a sequence depends on
fairly subtle properties of the sequence elements, such as whether
they're hashable and whether they support full
comparisons. The unique function shown in this
recipe tries three methods, from fastest to slowest, letting runtime
exceptions pick the best method for the sequence at hand.

For fastest speed, all sequence elements must be hashable. When they
are, the unique function will usually work in linear
time (i.e.,
O(n),
or directly proportional to the number of elements in the input,
which is good and highly scalable performance behavior).
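The fast path, in isolation, is just two linear passes (a sketch, not part of the recipe itself):

```python
# Hashable elements take the fast path: one pass to build a set,
# one pass to turn it back into a list -- O(n) overall.
items = [1, 2, 3, 1, 2, 3]
deduped = list(set(items))
print(sorted(deduped))  # [1, 2, 3] (the set itself has arbitrary order)
```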

If it turns out that hashing the elements (e.g., using them as
dictionary keys, or, as in this case, set
elements) is not possible, the next best situation is when the
elements enjoy a total ordering, meaning that each element can be
compared to each other element with the <
operator. If
list(s).sort() doesn't raise a
TypeError, we can assume that
the elements of s can be sorted
and therefore enjoy a total ordering. Then unique
will usually work in
O(n
log(n)) time.
Python lists' sort method is
particularly efficient in the presence of partially ordered data
(including, e.g., data with many duplicates), so the sorting approach
may be more effective in Python than elsewhere.
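The sorting path can be sketched on its own with unhashable but comparable elements (lists, as in the earlier example):

```python
# Lists are unhashable but comparable, so only the sorting path applies:
# sort to bring duplicates next to each other, then keep each item only
# when it differs from its predecessor.
pairs = [[1, 2], [2, 3], [1, 2]]
t = sorted(pairs)  # [[1, 2], [1, 2], [2, 3]]
weeded = [x for i, x in enumerate(t) if not i or x != t[i - 1]]
print(weeded)      # [[1, 2], [2, 3]]
```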

If sorting also turns out to be impossible, the sequence items must
at least support equality testing, or else the very concept of
duplicates can't really be meaningful for them. In
this case, unique works in quadratic time, that
is,
O(n^2),
meaning time proportional to the square of the number of elements in
the input: not very scalable, but the least of all evils, given the
sequence items' obviously peculiar nature (assuming
we get all the way to this subcase).
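To see when this last subcase actually triggers, here is a hypothetical class (not from the recipe) whose instances support equality but are deliberately unhashable and unorderable, so only the brute-force scan can deduplicate them:

```python
class Tag:
    """Hypothetical example: equality-only elements."""
    def __init__(self, name):
        self.name = name
    def __eq__(self, other):
        return isinstance(other, Tag) and self.name == other.name
    # In Python 3, defining __eq__ suppresses the inherited hash;
    # stating it explicitly makes the intent clear.
    __hash__ = None

tags = [Tag('a'), Tag('b'), Tag('a')]
# set(tags) would raise TypeError (unhashable); sorted(tags) would
# raise TypeError (unorderable); the O(n^2) scan still works:
u = []
for x in tags:
    if x not in u:   # 'in' tests membership via __eq__
        u.append(x)
print([t.name for t in u])  # ['a', 'b']
```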

This recipe is a pure example of how algorithm efficiency depends on
the strength of the assumptions you can make about the data. You
could split this recipe's function into three
distinct functions and directly call the one that best meets your
needs. In practice, however, the brute-force method is so slow for
large sequences that nothing measurable is lost by simply letting the
function as written try the faster methods first.

If you need to preserve the same order of items in the output
sequence as in the input sequence, see Recipe 18.2.


See Also


Recipe 18.2.

