Python Cookbook 2nd Edition Jun 1002005 [Electronic resources] (text version)

David Ascher, Alex Martelli, Anna Ravenscroft

Recipe 1.13. Accessing Substrings


Credit: Alex Martelli


Problem


You want to access portions of a string. For example, you've read a
fixed-width record and want to extract the record's fields.


Solution


Slicing is great, but it only does one field at a time:

afield = theline[3:8]

If you need to think in terms of field lengths,
struct.unpack may be appropriate. For example:

import struct
# Get a 5-byte string, skip 3, get two 8-byte strings, then all the rest:
baseformat = "5s 3x 8s 8s"
# by how many bytes does theline exceed the length implied by this
# base-format (24 bytes in this case, but struct.calcsize is general)
numremain = len(theline) - struct.calcsize(baseformat)
# complete the format with the appropriate 's' field, then unpack
format = "%s %ds" % (baseformat, numremain)
l, s1, s2, t = struct.unpack(format, theline)
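
As a concrete illustration, here is a minimal sketch of the same steps applied to a made-up 30-byte record. It assumes Python 3, where struct.unpack works on bytes objects rather than text strings; the sample data and the names record, fmt, first, and rest are hypothetical.

```python
import struct

# Hypothetical fixed-width record: 5 + 3 + 8 + 8 + 6 = 30 bytes
record = b"ABCDE---12345678abcdefghTAIL!!"

baseformat = "5s 3x 8s 8s"
# struct.calcsize gives the number of bytes the base format consumes (24)
numremain = len(record) - struct.calcsize(baseformat)
# complete the format with an 's' field that takes whatever is left
fmt = "%s %ds" % (baseformat, numremain)
first, s1, s2, rest = struct.unpack(fmt, record)

print(first, s1, s2, rest)
```

The three bytes covered by 3x are skipped, so only four values come back.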

If you want to skip rather than get "all the rest", then just unpack
the initial part of theline with the right length:

l, s1, s2 = struct.unpack(baseformat,
                          theline[:struct.calcsize(baseformat)])

If you need to split at five-byte boundaries, you can easily code a
list comprehension (LC) of slices:

fivers = [theline[k:k+5] for k in xrange(0, len(theline), 5)]
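
To see what this comprehension produces, here is a small sketch on a hypothetical 12-character string. It assumes Python 3, where xrange is spelled range. Note that the final piece is shorter than five characters whenever the length is not a multiple of five.

```python
theline = "abcdefghijkl"  # hypothetical sample, 12 characters

# slice starting at offsets 0, 5, 10
fivers = [theline[k:k+5] for k in range(0, len(theline), 5)]
print(fivers)  # ['abcde', 'fghij', 'kl']
```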

Chopping a string into individual characters is of course
easier:

chars = list(theline)

If you prefer to think of your data as being cut up at specific
columns, slicing with LCs is generally handier:

cuts = [8, 14, 20, 26, 30]
pieces = [ theline[i:j] for i, j in zip([0]+cuts, cuts+[None]) ]

The call to zip in this LC returns a list of pairs
of the form (cuts[k], cuts[k+1]), except that the
first pair is (0, cuts[0]), and the last one is
(cuts[len(cuts)-1], None). In
other words, each pair gives the right (i, j) for
slicing between each cut and the next, except that the first one is
for the slice before the first cut, and the last one is for the slice
from the last cut to the end of the string. The rest of the LC just
uses these pairs to cut up the appropriate slices of
theline.
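
The pairing described above can be made visible with a short sketch. It assumes Python 3, where zip returns an iterator, so list is called on it to display the pairs; the sample theline is made up so that each field is easy to recognize.

```python
# Hypothetical 34-character line: fields of width 8, 6, 6, 6, 4, then a tail
theline = "x" * 8 + "y" * 6 + "z" * 6 + "w" * 6 + "v" * 4 + "tail"
cuts = [8, 14, 20, 26, 30]

# each pair is (start, stop) for one slice; None means "to the end"
pairs = list(zip([0] + cuts, cuts + [None]))
print(pairs)
# [(0, 8), (8, 14), (14, 20), (20, 26), (26, 30), (30, None)]

pieces = [theline[i:j] for i, j in pairs]
print(pieces)
# ['xxxxxxxx', 'yyyyyy', 'zzzzzz', 'wwwwww', 'vvvv', 'tail']
```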


Discussion


This recipe was inspired by recipe 1.1 in the Perl Cookbook. Python's
slicing takes the place of Perl's substr. Perl's built-in unpack and
Python's struct.unpack are similar. Perl's is slightly richer, since it
accepts a field length of * for the last field to mean all the rest. In
Python, we have to compute and insert the exact length for either
extraction or skipping. This isn't a major issue because such
extraction tasks will usually be encapsulated into small functions.
Memoizing, also known as automatic caching, may help with performance
if the function is called repeatedly, since it allows you to avoid
redoing the preparation of the format for the struct unpacking. See
Recipe 18.5 for details about memoizing.

In a purely Python context, the point of this recipe is to remind you
that struct.unpack is often viable, and sometimes
preferable, as an alternative to string slicing (not quite as often
as unpack versus substr in
Perl, given the lack of a *-valued field length,
but often enough to be worth keeping in mind).

Each of these snippets is, of course, best encapsulated in a function.
Among other advantages, encapsulation ensures we don't have to work out
the computation of the last field's length on each and every use. This
function is the equivalent of the first snippet using struct.unpack in
the "Solution":

def fields(baseformat, theline, lastfield=False):
    # by how many bytes does theline exceed the length implied by
    # base-format (struct.calcsize computes exactly that length)
    numremain = len(theline)-struct.calcsize(baseformat)
    # complete the format with the appropriate 's' or 'x' field, then unpack
    format = "%s %d%s" % (baseformat, numremain, lastfield and "s" or "x")
    return struct.unpack(format, theline)

A design decision worth noticing (and, perhaps, worth criticizing) is
that of having a lastfield=False optional parameter. This reflects the
observation that, while we often want to skip the last, unknown-length
subfield, sometimes we want to retain it instead. The use of lastfield
in the expression lastfield and "s" or "x" (equivalent to C's ternary
operator lastfield?"s":"x") saves an if/else, but it's unclear whether
the saving is worth the obscurity. See Recipe 18.9 for more about
simulating ternary operators in Python.
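
For readers on Python 2.5 or later, the language's own conditional expression (X if C else Y) says the same thing more directly. The sketch below is only an illustration, not part of the recipe: the helper names last_code and and_or are hypothetical. It also shows the well-known pitfall of the and/or idiom, which picks the wrong branch when the "true" value is itself false.

```python
def last_code(lastfield):
    # conditional expression: "s" when lastfield is true, otherwise "x"
    return "s" if lastfield else "x"

def and_or(cond, a, b):
    # the and/or idiom; only safe when a can never be a false value
    return cond and a or b

print(last_code(True), last_code(False))   # s x
print(repr(and_or(True, "", "x")))         # 'x', although '' was intended
print(repr("" if True else "x"))           # ''
```

In fields the idiom is safe, because "s" is a nonempty string and therefore always true.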

If function fields is called in a loop, memoizing (caching) with a key
that is the tuple (baseformat, len(theline), lastfield) may offer
faster performance. Here's a version of fields with memoizing:

def fields(baseformat, theline, lastfield=False, _cache={}):
    # build the key and try getting the cached format string
    key = baseformat, len(theline), lastfield
    format = _cache.get(key)
    if format is None:
        # no format string was cached, build and cache it
        numremain = len(theline)-struct.calcsize(baseformat)
        _cache[key] = format = "%s %d%s" % (
            baseformat, numremain, lastfield and "s" or "x")
    return struct.unpack(format, theline)

The idea behind this memoizing is to perform the somewhat costly
preparation of format only once for each set of
arguments requiring that preparation, thereafter storing it in the
_cache dictionary. Of course, like all
optimizations, memoizing needs to be validated by measuring
performance to check that each given optimization does actually speed
things up. In this case, I measure an increase in speed of
approximately 30% to 40% for the memoized version, meaning that the
optimization is probably not worth the bother unless the function is
part of a performance bottleneck for your program.

The function equivalent of the next LC snippet in the solution is:

def split_by(theline, n, lastfield=False):
    # cut up all the needed pieces
    pieces = [theline[k:k+n] for k in xrange(0, len(theline), n)]
    # drop the last piece if too short and not required
    # (the check on pieces avoids an IndexError when theline is empty)
    if not lastfield and pieces and len(pieces[-1]) < n:
        pieces.pop()
    return pieces

And for the last snippet:

def split_at(theline, cuts, lastfield=False):
    # cut up all the needed pieces
    pieces = [ theline[i:j] for i, j in zip([0]+cuts, cuts+[None]) ]
    # drop the last piece if not required
    if not lastfield:
        pieces.pop()
    return pieces

In both of these cases, a list comprehension doing slicing turns out
to be slightly preferable to the use of
struct.unpack.

A completely different approach is to use generators, such as:

def split_at(the_line, cuts, lastfield=False):
    last = 0
    for cut in cuts:
        yield the_line[last:cut]
        last = cut
    if lastfield:
        yield the_line[last:]

def split_by(the_line, n, lastfield=False):
    return split_at(the_line, xrange(n, len(the_line), n), lastfield)

Generator-based approaches are particularly appropriate when all you
need to do on the sequence of resulting fields is loop over it,
either explicitly, or implicitly by calling on it some
"accumulator" callable such as
''.join. If you do need to materialize a list of
the fields, and what you have available is a generator instead, you
only need to call the built-in list on the
generator, as in:

list_of_fields = list(split_by(the_line, 5))
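
The sketch below exercises both ways of consuming the generators. The two functions are restated with range in place of xrange so that the example runs on Python 3, which is an assumption of this sketch; the sample strings are made up.

```python
def split_at(the_line, cuts, lastfield=False):
    last = 0
    for cut in cuts:
        yield the_line[last:cut]
        last = cut
    if lastfield:
        yield the_line[last:]

def split_by(the_line, n, lastfield=False):
    return split_at(the_line, range(n, len(the_line), n), lastfield)

the_line = "abcdefghijkl"  # hypothetical sample, 12 characters

# implicit loop: an accumulator consumes the generator directly
joined = "-".join(split_by(the_line, 5, lastfield=True))
print(joined)        # abcde-fghij-kl

# materializing a list when one is really needed
with_tail = list(split_by(the_line, 5, lastfield=True))
without_tail = list(split_by(the_line, 5))
print(with_tail)     # ['abcde', 'fghij', 'kl']
print(without_tail)  # ['abcde', 'fghij']
```

One thing to watch: in this generator version the final piece is yielded only when lastfield is true, even if that piece is a full n characters long. So list(split_by("abcdefghij", 5)) gives ['abcde'], whereas the list-based split_by shown earlier would keep 'fghij' as well.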


See Also


Recipe 18.9 and Recipe 18.5; Perl Cookbook recipe 1.1.