Recipe 1.8. Checking Whether a String Contains a Set of Characters
Credit: Jürgen Hermann, Horst
Hansen
Problem
You need to check for the occurrence of any of a set of characters in
a string.
Solution
The simplest approach is clear, fast, and general (it works for any
sequence, not just strings, and for any container on which you can
test for membership, not just sets):
def containsAny(seq, aset):You can gain a little speed by moving to a higher-level, more
"" Check whether sequence seq contains ANY of the items in aset. ""
for c in seq:
if c in aset: return True
return False
sophisticated approach, based on the itertools
standard library module, essentially expressing the same approach in
a different way:
import itertools
def containsAny(seq, aset):
for item in itertools.ifilter(aset._ _contains_ _, seq):
return True
return False
Discussion
Most problems related to sets are best handled by using the
set built-in type introduced in Python 2.4 (if
you're using Python 2.3, you can use the equivalent
sets.Set type from the Python Standard Library).
However, there are exceptions. Here, for example, a pure set-based
approach would be something like:
def containsAny(seq, aset):However,
return bool(set(aset).intersection(seq))
with this approach, every item in seq inevitably has
to be examined. The functions in this recipe's
Solution, on the other hand,
"short-circuit": they return as
soon as they know the answer. They must still check every item in
seq when the answer is
Falsewe could never affirm that no item in
seq is a member of aset without
examining all the items, of course. But when the answer is
true, we often learn about that very soon, namely
as soon as we examine one item that is a member
of aset. Whether this matters at all is very
data-dependent, of course. It will make no practical difference when
seq is short, or when the answer is typically
False, but it may be extremely important for a
very long seq (when the answer can typically be soon
determined to be true).
The first
version of containsAny presented in the recipe has
the advantage of simplicity and clarity: it expresses the fundamental
idea with total transparency. The second version may appear to be
"clever", and that is not a
complimentary adjective in the Python world, where simplicity and
clarity are core values. However, the second version is well worth
considering, because it shows a higher-level approach, based on the
itertools module of the standard library.
Higher-level approaches are most often preferable to lower-level ones
(although the issue is moot in this particular case).
itertools.ifilter takes a predicate and an
iterable, and yields the items in that iterable that satisfy the
"predicate". Here, as the
"predicate", we use
anyset._ _contains_ _, the bound method that is
internally called when we code in anyset for
membership testing. So, if ifilter yields anything
at all, it yields an item of seq that is also a
member of anyset, so we can return
True as soon as this happens. If we get to the statement
following the for, it must mean the
return True never executed, because no items of
seq are members of anyset, so we
can return False.
containsAny to check whether a string (or other
sequence) contains any members of a set, you may also need such
variants as:
def containsOnly(seq, aset):containsOnly is the same function as
"" Check whether sequence seq contains ONLY items in aset. ""
for c in seq:
if c not in aset: return False
return True
containsAny, but with the logic turned upside-down.
Other apparently similar tasks don't lend themselves
to short-circuiting (they intrinsically need to examine all items)
and so are best tackled by using the built-in type
set (in Python 2.4; in 2.3, you can use
sets.Set in the same way):
def containsAll(seq, aset):If
"" Check whether sequence seq contains ALL the items in aset. ""
return not set(aset).difference(seq)
you're not accustomed to using the
set (or sets.Set) method
difference, be aware of its semantics: for any
set a,
a.difference(b) (just like
a-set(b)) returns the set of all elements of
a that are not in
b. For example:
>>> L1 = [1, 2, 3, 3]which hopefully helps explain why:
>>> L2 = [1, 2, 3, 4]
>>> set(L1).difference(L2)
set([ ])
>>> set(L2).difference(L1)
set([4])
>>> containsAll(L1, L2)(In other words, don't confuse
False
>>> containsAll(L2, L1)
True
difference with another method of
set, symmetric_difference,
which returns the set of all items that are in
either argument and not in the other.)When you're dealing specifically with (plain,
not Unicode) strings for both
seq and aset, you may not need the
full generality of the functions presented in this recipe, and may
want to try the more specialized approach explained in Recipe 1.10 based on
strings' method TRanslate and the
string.maketrans function from the Python Standard
Library. For example:
import string
notrans = string.maketrans('', '') # identity "translation"
def containsAny(astr, strset):
return len(strset) != len(strset.translate(notrans, astr))
def containsAll(astr, strset):
return not strset.translate(notrans, astr)
This
somewhat tricky approach relies on strset.translate(notrans,
astr) being the subsequence of strset that
is made of characters not in astr. When that
subsequence has the same length as strset, no
characters have been removed by strset.translate,
therefore no characters of strset are in
astr. Conversely, when the subsequence is empty,
all characters have been removed, so all characters of
strset are in astr. The
translate method keeps coming up naturally when
one wants to treat strings as sets of characters, because
it's speedy as well as handy and flexible; see
Recipe 1.10 for more
details.These two sets of approaches to the recipe's tasks
have very different levels of generality. The earlier approaches are
very general: not at all limited to string processing, they make
rather minimal demands on the objects you apply them to. The approach
based on the translate method, on the other hand,
works only when both astr and
strset are strings, or very
closely mimic plain strings' functionality. Not even
Unicode strings suffice, because the TRanslate
method of Unicode strings has a signature that is different from that
of plain stringsa single argument (a dict
mapping code numbers to Unicode strings or None)
instead of two (both strings).
See Also
Recipe 1.10; documentation
for the translate method of strings and Unicode
objects, and maketrans function in the
string module, in the Library
Reference and Python in a
Nutshell; ditto for documentation of built-in
set (Python 2.4 only), modules
sets and itertools, and the
special method _ _contains_ _.