Recipe 1.10. Filtering a String for a Set of Characters
Credit: Jürgen Hermann, Nick Perkins, Peter
Cogolo
Problem
Given a set of characters to keep, you need to build a filtering
function that, applied to any string s,
returns a copy of s that contains only
characters in the set.
Solution
The translate method of string objects is fast and
handy for all tasks of this ilk. However, to call
translate effectively to solve this
recipe's task, we must do some advance preparation.
The first argument to translate is a translation
table: in this recipe, we do not want to do any translation, so we
must prepare a first argument that specifies "no
translation". The second argument to
translate specifies which characters we want to
delete: since the task here says that
we're given, instead, a set of characters to
keep (i.e., to not delete),
we must prepare a second argument that gives the set
complement, deleting all characters we must not
keep. A closure is the best way to do this advance preparation just
once, obtaining a fast filtering function tailored to our exact
needs:
import string
# Make a reusable string of all characters, which does double duty
# as a translation table specifying "no translation whatsoever"
allchars = string.maketrans('', '')
def makefilter(keep):
    """ Return a function that takes a string and returns a partial copy
        of that string consisting of only the characters in 'keep'.
        Note that `keep' must be a plain string.
    """
    # Make a string of all characters that are not in 'keep': the "set
    # complement" of keep, meaning the string of characters we must delete
    delchars = allchars.translate(allchars, keep)
    # Make and return the desired filtering function (as a closure)
    def thefilter(s):
        return s.translate(allchars, delchars)
    return thefilter
if __name__ == '__main__':
    just_vowels = makefilter('aeiouy')
    print just_vowels('four score and seven years ago')
    # emits: ouoeaeeyeaao
    print just_vowels('tiger, tiger burning bright')
    # emits: ieieuii
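The roles of the two translate arguments described above can be seen in isolation in the following minimal sketch. It assumes Python 3, where bytes objects are the closest analog of the plain strings this recipe works on: bytes.maketrans and the two-argument bytes.translate stand in for string.maketrans and the plain-string translate method. The names text, unchanged, without_vowels, and only_vowels are illustrative only and are not part of the recipe.

```python
# Python 3 sketch (assumption): bytes objects stand in for the recipe's
# plain 8-bit strings.

# Identity table: 256 bytes where position i holds byte value i,
# so translating through it changes nothing.
allchars = bytes.maketrans(b'', b'')

text = b'tiger, tiger burning bright'

# First argument only: every byte maps to itself.
unchanged = text.translate(allchars)

# The second argument lists the bytes to DELETE, so this removes the vowels.
without_vowels = text.translate(allchars, b'aeiouy')

# To KEEP the vowels instead, delete the complement of the keep set:
# start from all 256 bytes and delete the ones we want to keep.
delchars = allchars.translate(allchars, b'aeiouy')
only_vowels = text.translate(allchars, delchars)

print(unchanged)       # b'tiger, tiger burning bright'
print(without_vowels)  # b'tgr, tgr brnng brght'
print(only_vowels)     # b'ieieuii'
```

The last two steps are what makefilter does: delchars is computed once, and each later call needs only a single translate.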
Discussion
The key to understanding this recipe
lies in the definitions of the maketrans function
in the string module of the Python Standard
Library and in the translate method of string
objects. translate returns a copy of the string
you call it on, replacing each character in it with the corresponding
character in the translation table passed in as the first argument
and deleting the characters specified in the second argument.
maketrans is a utility function to create
translation tables. (A translation table is a string
t of exactly 256 characters: when you pass
t as the first argument of a
translate method, each character
c of the string on which you call the
method is translated in the resulting string into the character
t[ord(c)].)

In this recipe, efficiency is maximized by splitting the filtering
task into preparation and execution phases. The string of all
characters is clearly reusable, so we build it once and for all as a
global variable when this module is imported. That way, we ensure
that each filtering function uses the same string-of-all-characters
object, not wasting any memory. The string of characters to delete,
which we need to pass as the second argument to the
translate method, depends on the set of characters
to keep, because it must be built as the "set
complement" of the latter: we must tell
translate to delete every character that we do not
want to keep. So, we build the delete-these-characters string in the
makefilter factory function. This building is done
quite rapidly by using the translate method to
delete the "characters to keep"
from the string of all characters. The translate
method is very fast, as are the construction and execution of these
useful little resulting functions. The test code that executes when
this recipe runs as a main script shows how to build a filtering
function by calling makefilter, bind a name to the
filtering function (by simply assigning the result of calling
makefilter to a name), then call the filtering
function on some strings and print the results.

Incidentally, calling a filtering function with
allchars as the argument puts the set of characters
being kept into a canonic string form, alphabetically sorted and
without duplicates. You can use this idea to code a very simple
function to return the canonic form of any set of characters
presented as an arbitrary string:
def canonicform(s):
    """ Given a string s, return s's characters as a canonic-form string:
        alphabetized and without duplicates. """
    return makefilter(s)(allchars)

The Solution uses a def statement to make the
nested function (closure) it returns, because def
is the most normal, general, and clear way to make functions. If you
prefer, you could use lambda instead, changing the
def and return statements in
function makefilter into just one return
lambda statement:
    return lambda s: s.translate(allchars, delchars)

Most Pythonistas, but not all, consider using def
clearer and more readable than using lambda.

Since this recipe deals with strings seen as sets of characters, you
could alternatively use the sets.Set type (or, in
Python 2.4, the new built-in set type) to perform
the same tasks. Thanks to the translate
method's power and speed, it's
often faster to work directly on strings, rather than go through
sets, for tasks of this ilk. However, just as noted in Recipe 1.8, the functions in this
recipe only work for normal strings, not for
Unicode strings.

To solve this recipe's task for Unicode strings, we
must do some very different preparation. A Unicode
string's translate method takes
only one argument: a mapping or sequence, which is indexed with the
code number of each character in the string. Characters whose codes
are not keys in the mapping (or indices in the sequence) are just
copied over to the output string. Otherwise, the value corresponding
to each character's code must be either a Unicode
string (which is substituted for the character) or
None (in which case the character is deleted). A
very nice and powerful arrangement, but unfortunately not one
that's identical to the way plain strings work, so
we must recode.

Normally, we use either a dict or a
list as the argument to a Unicode
string's translate method to
translate some characters and/or delete some. But for the specific
task of this recipe (i.e., keep just some
characters, delete all others), we might need an inordinately large
dict or list, just mapping
all other characters to None.
It's better to code, instead, a little class that
appropriately implements a __getitem__ method
(the special method that gets called in indexing operations). Once
we're going to the (slight) trouble of coding a
little class, we might as well make its instances callable and have
makefilter be just a synonym for the class itself:
import sets
class Keeper(object):
    def __init__(self, keep):
        self.keep = sets.Set(map(ord, keep))
    def __getitem__(self, n):
        if n not in self.keep:
            return None
        return unichr(n)
    def __call__(self, s):
        return unicode(s).translate(self)
makefilter = Keeper
if __name__ == '__main__':
    just_vowels = makefilter('aeiouy')
    print just_vowels(u'four score and seven years ago')
    # emits: ouoeaeeyeaao
    print just_vowels(u'tiger, tiger burning bright')
    # emits: ieieuii

We might name the class itself makefilter, but, by
convention, one normally names classes with an uppercase initial;
there is essentially no cost in following that convention here, too,
so we did.
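The mapping rules for the Unicode translate method described above, and the Keeper idea built on them, can be tried out with the following sketch. It assumes Python 3, where every str is a Unicode string, chr takes the place of unichr, and set is a built-in type; it is not the recipe's own code. The table and demo names are illustrative only.

```python
# Python 3 sketch (assumption): str is Unicode, chr() replaces unichr(),
# and the built-in set replaces sets.Set.

# Rule check with an ordinary dict whose keys are character codes.
table = {ord('l'): None,   # None: the character is deleted
         ord('o'): '0'}    # a string: substituted for the character
# Characters whose codes are not keys in the mapping are copied unchanged.
demo = 'hello world'.translate(table)

class Keeper:
    """Callable translation table that keeps only the characters in 'keep'."""
    def __init__(self, keep):
        self.keep = set(map(ord, keep))
    def __getitem__(self, n):
        # translate() indexes the table with each character's code number.
        if n not in self.keep:
            return None      # delete everything outside the keep set
        return chr(n)        # keep the character as it is
    def __call__(self, s):
        return s.translate(self)

just_vowels = Keeper('aeiouy')
print(demo)                                            # he0 w0rd
print(just_vowels('four score and seven years ago'))   # ouoeaeeyeaao
print(just_vowels('tiger, tiger burning bright'))      # ieieuii
```

Because __getitem__ answers for every possible code number on demand, no large dict or list mapping all other characters to None ever has to be built.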
See Also
Recipe 1.8; documentation
for the translate method of strings and Unicode
objects, and the maketrans function in the
string module, in the Library
Reference and Python in a
Nutshell.