Python Cookbook 2Nd Edition Jun 1002005 [Electronic resources] نسخه متنی

با بیش از 100000 منبع الکترونیکی رایگان به زبان فارسی ، عربی و انگلیسی

Recipe 5.13. Finding Subsequences

Credit: David Eppstein, Alexander Semenov

Problem

You need to find occurrences
of a subsequence in a larger sequence.

Solution

If the sequences are strings (plain or
Unicode), Python strings' find
method and the standard library's
re module are the best approach. Otherwise, use
the Knuth-Morris-Pratt algorithm (KMP):

def KnuthMorrisPratt(text, pattern):
''' Yields all starting positions of copies of subsequence 'pattern'
in sequence 'text' -- each argument can be any iterable.
At the time of each yield, 'text' has been read exactly up to and
including the match with 'pattern' that is causing the yield. '''
# ensure we can index into pattern, and also make a copy to protect
# against changes to 'pattern' while we're suspended by `yield'
pattern = list(pattern)
length = len(pattern)
# build the KMP "table of shift amounts" and name it 'shifts'
shifts = [1] * (length + 1)
shift = 1
for pos, pat in enumerate(pattern):
while shift <= pos and pat != pattern[pos-shift]:
shift += shifts[pos-shift]
shifts[pos+1] = shift
# perform the actual search
startPos = 0
matchLen = 0
for c in text:
while matchLen == length or matchLen >= 0 and pattern[matchLen] != c:
startPos += shifts[matchLen]
matchLen -= shifts[matchLen]
matchLen += 1
if matchLen == length: yield startPos

Discussion

This recipe implements the Knuth-Morris-Pratt algorithm for finding
copies of a given pattern as a contiguous subsequence of a larger
text. Since KMP accesses the text sequentially, it is natural to
implement it in a way that allows the text to be an arbitrary
iterator. After a preprocessing stage that builds a table of shift
amounts and takes time that's directly proportional
to the length of the pattern, each text symbol is processed in
constant amortized time. Explanations and demonstrations of how KMP
works can be found in all good elementary texts about algorithms. (A
recommendation is provided in See Also.)

If text and
pattern are both Python strings, you can
get a faster solution by suitably applying Python built-in search
methods:

def finditer(text, pattern):
pos = -1
while True:
pos = text.find(pattern, pos+1)
if pos < 0: break
yield pos

For example, using an alphabet of length 4 ('ACGU'
. . .), finding all occurrences of a pattern of length 8 in
a text of length 100000, on my machine, takes about 4.3 milliseconds
with finditer, but the same task takes about 540
milliseconds with KnuthMorrisPratt
(that's with Python 2.3; KMP is faster with Python
2.4, taking about 480 milliseconds, but that's still
over 100 times slower than finditer). So remember:
this recipe is useful for searches on generic
sequences, including ones that you cannot keep in memory all at once,
but if you're searching on strings,
Python's built-in searching methods rule.

Python Cookbook 2Nd Edition Jun 1002005 [Electronic resources] نسخه متنی

فارسی

کردی

العربیه

اردو

Türkçe

Русский

English

Français

کانال فیلم من

تبیان من

فایلهای من

کتابخانه من

پنل پیامکی

وبلاگ من

اینجــــا یک کتابخانه دیجیتالی است

با بیش از 100000 منبع الکترونیکی رایگان به زبان فارسی ، عربی و انگلیسی