Recipe 5.13. Finding Subsequences
Credit: David Eppstein, Alexander Semenov
Problem
You need to find occurrences
of a subsequence in a larger sequence.
Solution
If the sequences are strings (plain or
Unicode), Python strings' find
method and the standard library's
re module are the best approach. Otherwise, use
the Knuth-Morris-Pratt algorithm (KMP):
def KnuthMorrisPratt(text, pattern):
''' Yields all starting positions of copies of subsequence 'pattern'
in sequence 'text' -- each argument can be any iterable.
At the time of each yield, 'text' has been read exactly up to and
including the match with 'pattern' that is causing the yield. '''
# ensure we can index into pattern, and also make a copy to protect
# against changes to 'pattern' while we're suspended by `yield'
pattern = list(pattern)
length = len(pattern)
# build the KMP "table of shift amounts" and name it 'shifts'
shifts = [1] * (length + 1)
shift = 1
for pos, pat in enumerate(pattern):
while shift <= pos and pat != pattern[pos-shift]:
shift += shifts[pos-shift]
shifts[pos+1] = shift
# perform the actual search
startPos = 0
matchLen = 0
for c in text:
while matchLen == length or matchLen >= 0 and pattern[matchLen] != c:
startPos += shifts[matchLen]
matchLen -= shifts[matchLen]
matchLen += 1
if matchLen == length: yield startPos
Discussion
This recipe implements the Knuth-Morris-Pratt algorithm for finding
copies of a given pattern as a contiguous subsequence of a larger
text. Since KMP accesses the text sequentially, it is natural to
implement it in a way that allows the text to be an arbitrary
iterator. After a preprocessing stage that builds a table of shift
amounts and takes time that's directly proportional
to the length of the pattern, each text symbol is processed in
constant amortized time. Explanations and demonstrations of how KMP
works can be found in all good elementary texts about algorithms. (A
recommendation is provided in See Also.)If text and
pattern are both Python strings, you can
get a faster solution by suitably applying Python built-in search
methods:
def finditer(text, pattern):For example, using an alphabet of length 4 ('ACGU'
pos = -1
while True:
pos = text.find(pattern, pos+1)
if pos < 0: break
yield pos
. . .), finding all occurrences of a pattern of length 8 in
a text of length 100000, on my machine, takes about 4.3 milliseconds
with finditer, but the same task takes about 540
milliseconds with KnuthMorrisPratt
(that's with Python 2.3; KMP is faster with Python
2.4, taking about 480 milliseconds, but that's still
over 100 times slower than finditer). So remember:
this recipe is useful for searches on generic
sequences, including ones that you cannot keep in memory all at once,
but if you're searching on strings,
Python's built-in searching methods rule.
See Also
Many excellent books cover the fundamentals of algorithms; among such
books, a widely admired one is Thomas H. Cormen, Charles E.
Leiserson, Ronald L. Rivest, Clifford Stein, Introduction
to Algorithms, 2d ed. (MIT Press).