Python Cookbook 2Nd Edition Jun 1002005 [Electronic resources] نسخه متنی

اینجــــا یک کتابخانه دیجیتالی است

با بیش از 100000 منبع الکترونیکی رایگان به زبان فارسی ، عربی و انگلیسی

Python Cookbook 2Nd Edition Jun 1002005 [Electronic resources] - نسخه متنی

David Ascher, Alex Martelli, Anna Ravenscroft

| نمايش فراداده ، افزودن یک نقد و بررسی
افزودن به کتابخانه شخصی
ارسال به دوستان
جستجو در متن کتاب
بیشتر
تنظیمات قلم

فونت

اندازه قلم

+ - پیش فرض

حالت نمایش

روز نیمروز شب
جستجو در لغت نامه
بیشتر
لیست موضوعات
توضیحات
افزودن یادداشت جدید







Recipe 5.13. Finding Subsequences


Credit: David Eppstein, Alexander Semenov


Problem


You need to find occurrences
of a subsequence in a larger sequence.


Solution


If the sequences are strings (plain or
Unicode), Python strings' find
method and the standard library's
re module are the best approach. Otherwise, use
the Knuth-Morris-Pratt algorithm (KMP):

def KnuthMorrisPratt(text, pattern):
''' Yields all starting positions of copies of subsequence 'pattern'
in sequence 'text' -- each argument can be any iterable.
At the time of each yield, 'text' has been read exactly up to and
including the match with 'pattern' that is causing the yield. '''
# ensure we can index into pattern, and also make a copy to protect
# against changes to 'pattern' while we're suspended by `yield'
pattern = list(pattern)
length = len(pattern)
# build the KMP "table of shift amounts" and name it 'shifts'
shifts = [1] * (length + 1)
shift = 1
for pos, pat in enumerate(pattern):
while shift <= pos and pat != pattern[pos-shift]:
shift += shifts[pos-shift]
shifts[pos+1] = shift
# perform the actual search
startPos = 0
matchLen = 0
for c in text:
while matchLen == length or matchLen >= 0 and pattern[matchLen] != c:
startPos += shifts[matchLen]
matchLen -= shifts[matchLen]
matchLen += 1
if matchLen == length: yield startPos


Discussion


This recipe implements the Knuth-Morris-Pratt algorithm for finding
copies of a given pattern as a contiguous subsequence of a larger
text. Since KMP accesses the text sequentially, it is natural to
implement it in a way that allows the text to be an arbitrary
iterator. After a preprocessing stage that builds a table of shift
amounts and takes time that's directly proportional
to the length of the pattern, each text symbol is processed in
constant amortized time. Explanations and demonstrations of how KMP
works can be found in all good elementary texts about algorithms. (A
recommendation is provided in See Also.)

If text and
pattern are both Python strings, you can
get a faster solution by suitably applying Python built-in search
methods:

def finditer(text, pattern):
pos = -1
while True:
pos = text.find(pattern, pos+1)
if pos < 0: break
yield pos

For example, using an alphabet of length 4 ('ACGU'
. . .), finding all occurrences of a pattern of length 8 in
a text of length 100000, on my machine, takes about 4.3 milliseconds
with finditer, but the same task takes about 540
milliseconds with KnuthMorrisPratt
(that's with Python 2.3; KMP is faster with Python
2.4, taking about 480 milliseconds, but that's still
over 100 times slower than finditer). So remember:
this recipe is useful for searches on generic
sequences, including ones that you cannot keep in memory all at once,
but if you're searching on strings,
Python's built-in searching methods rule.


See Also


Many excellent books cover the fundamentals of algorithms; among such
books, a widely admired one is Thomas H. Cormen, Charles E.
Leiserson, Ronald L. Rivest, Clifford Stein, Introduction
to Algorithms
, 2d ed. (MIT Press).


/ 394