Python Cookbook 2Nd Edition Jun 1002005 [Electronic resources] نسخه متنی

اینجــــا یک کتابخانه دیجیتالی است

با بیش از 100000 منبع الکترونیکی رایگان به زبان فارسی ، عربی و انگلیسی

Python Cookbook 2Nd Edition Jun 1002005 [Electronic resources] - نسخه متنی

David Ascher, Alex Martelli, Anna Ravenscroft

| نمايش فراداده ، افزودن یک نقد و بررسی
افزودن به کتابخانه شخصی
ارسال به دوستان
جستجو در متن کتاب
بیشتر
تنظیمات قلم

فونت

اندازه قلم

+ - پیش فرض

حالت نمایش

روز نیمروز شب
جستجو در لغت نامه
بیشتر
لیست موضوعات
توضیحات
افزودن یادداشت جدید







Recipe 18.8. Implementing a Bag (Multiset) Collection Type


Credit: Raymond Hettinger, Alex Martelli, Matt R


Problem


You need a
set-like collection type that lets each element be
in the set a number of times. In other words, you need a collection
type of the kind that is known as multiset in C++
and SQL, and bag in Smalltalk, Objective C, and
Haskell's Edison module.


Solution


We can implement bag as a class. We could restrict
the implementation to language constructs that are present in Python
2.3 or are easy to emulate; however, such restrictions would give
substantial inefficiencies or complications with comparison to a pure
Python 2.4 implementation. So, here is a Python 2.4 implementation,
with no attempt to support Python 2.3:

from operator import itemgetter
from heapq import nlargest
class bag(object):
def _ _init_ _(self, iterable=( )):
# start empty, then add the `iterable' if any
self._data = { }
self.update(iterable)
def update(self, iterable):
# update from an element->count mapping, or from any iterable
if isinstance(iterable, dict):
for elem, n in iterable.iteritems( ):
self[elem] += n
else:
for elem in iterable:
self[elem] += 1
def _ _contains_ _(self, elem):
# delegate membership test
return elem in self._data
def _ _getitem_ _(self, elem):
# default all missing items to a count of 0
return self._data.get(elem, 0)
def _ _setitem_ _(self, elem, n):
# setting an item to a count of 0 removes the item
self._data[elem] = n
if n == 0:
del self._data[elem]
def _ _delitem_ _(self, elem):
# delegate to _ _setitem_ _ to allow deleting missing items
self[elem] = 0
def _ _len_ _(self):
# length is computed on-the-fly
return sum(self._data.itervalues( ))
def _ _nonzero_ _(self):
# avoid truth tests using _ _len_ _, as it's relatively slow
return bool(self._data)
def _ _eq_ _(self, other):
# a bag can only equal another bag
if not isinstance(other, bag):
return False
return self._data == other._data
def _ _ne_ _(self, other):
# a bag always differs from any non-bag
return not (self == other)
def _ _hash_ _(self):
# a bag can't be a dict key nor an element in a set
raise TypeError
def _ _repr_ _(self):
# typical string-representation
return '%s(%r)' % (self._ _class_ _._ _name_ _, self._data)
def copy(self):
# make and return a shallow copy
return self._ _class_ _(self._data)
_ _copy_ _ = copy # For the copy module
def clear(self):
# remove all items
self._data.clear( )
def _ _iter_ _(self):
# yield each element the # of times given by its count
for elem, cnt in self._data.iteritems( ):
for i in xrange(cnt):
yield elem
def iterunique(self):
# yield each element only once
return self._data.iterkeys( )
def itercounts(self):
# yield element-count pairs
return self._data.iteritems( )
def mostcommon(self, n=None):
# return the n (default: all) most common elements, each as an
# element-count pair, as a list sorted by descending counts
if n is None:
return sorted(self.itercounts( ), key=itemgetter(1), reverse=True)
it = enumerate(self.itercounts( ))
nl = nlargest(n, ((cnt, i, elem) for (i, (elem, cnt)) in it))
return [(elem, cnt) for cnt, i, elem in nl]


Discussion


Python offers several useful container classes, including built-in
tuples, lists and
dicts, sets (in Python 2.4,
sets are built-in; in Python 2.3,
they're in module
sets)which, unlike bags,
can be seen as "holding only one
instance" of each of their elementsand
double-ended queues, deques (in Python 2.4 only,
in module collections). This abundance of
container classes doesn't mean there is no use for
yet more. The bag, (i.e., multiset), presented in
this recipe, is widely useful, since counting the numbers of
occurrences of different objects is a frequent task useful in many
applications. Rather than coding a bag each time
you need such a container (generally based on a dictionary mapping
items to counts), it's better to design and code it
once, put it among one's utilities, and lobby for it
to become part of the standard library for a future Python, such as
2.5 (which can be expected sometime in 2006 and will focus on
enhancements to the standard library rather than to the core
language).

The API offered by the bag class presented in this
recipe is largely based on indexing, due to the strong analogy
between a bag and a mapping of items to counts.
For example:

>>> b = bag('banana')
>>> b['a']
3
>>> b['a'] += 1
>>> b['a']
4
>>> del b['a'] # removes all 'a's from the bag
>>> b['a']
0

Items that are not in the bag can also be used as
indices, giving a value (i.e., count) of 0; a lot of
bag's convenience comes from this
default. A bag also offers several ways to iterate
on it (by unique elements; by elements, each repeated as many times
as its count; by
(element,
count)
pairs); and also supplies a handy method mostcommon
to return
(element,
count) pairs
sorted by descending count (all such pairs, or just the top
n). An example use of
mostcommon:

>>> bag(word for line in open('somefile.txt')
... for word in line.split( )).mostcommon(5)
[('to', 83), ('for', 71), ('the', 61), ('of', 53), ('and', 52)]

All design choices are tradeoffs. For some applications, it might be
more convenient to have bag's API
closer to set's rather than to
dict's, with an
add method, and binary operators, for example, to
join two bags returning a new one (as
list does with operator + and
set does with the
"or", vertical-bar operator
|). In most cases, this would be overkill. After
all, "a designer knows he has achieved perfection,
not when there is nothing left to add, but when there is nothing left
to take away" (Antoine de Saint-Exupéry).
So, for example, to join two bags, getting a new
one, without altering either input bag, code a
little function using roughly the same
update-based approach you would use with
dicts, as follows:

def bagjoin(b1, b2):
b = bag(b1)
b.update(b2)
return b

Just as would be the case for an analogous function joining
dicts, this works, not only when
b1 and b2 are
bags, but also when they are other kinds of
objects that can be passed to bag and
bag.updateobjects such as arbitrary
iterables or mappings (generally dictionaries) from items to counts.
Such polymorphism comes at negligible cost, and it's
well worth preserving.

Although the crucial design choices in this recipe are those about
bag's API, some implementation
choices must be made as well. In this recipe's code,
implementation choices are oriented towards simplicity. In
particular, there is no attempt to allow this code to run on Python
2.3. This recipe is optimized for Python 2.4 because it is
Python's current release and is likely to be used
soon in lieu of Python 2.3, particularly among Pythonistas who are
sensitive to performance issues, given the amount of highly
successful effort that was devoted to optimizing version
2.4's performance. If Python 2.3 support was deemed
necessary, it would be best implemented separately, rather than
hobbling the primary 2.4 implementation with inefficiencies or
complications.


See Also


Smalltalk's Bag class at
http://www.gnu.org/software/smalltalk/gst-manual/gst_49l;
C++'s std::multiset template class at http://gcc.gnu.org/onlinedocs/libstdc++/latest-doxygen/classstd_1_1multisetl.


/ 394