Implementing the KMP algorithm in Python

A thorough understanding

Among string matching algorithms, KMP achieves close to O(N) complexity. The key is that the pointer into the main text never backtracks, which saves a lot of time.

For example, to find abce in abcdabce, the brute-force algorithm proceeds as in the table below: each attempt compares up to 4 characters, and 5 attempts are needed in total.

	a	b	c	d	a	b	c	e
1	a	b	c	e
2		a	b	c	e
3			a	b	c	e
4				a	b	c	e
√					a	b	c	e
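
The brute-force procedure in the table can be sketched in a few lines of Python. This is only an illustration: the function name brute_force_search and its return convention (match index plus number of attempts) are assumptions made here, not part of the original article.

```python
def brute_force_search(txt, pattern):
    """Return (index of the first match or -1, number of alignments tried)."""
    attempts = 0
    # try every alignment of pattern against txt, left to right
    for start in range(len(txt) - len(pattern) + 1):
        attempts += 1
        if txt[start:start + len(pattern)] == pattern:
            return start, attempts
    return -1, attempts


print(brute_force_search("abcdabce", "abce"))  # (4, 5)
```

The pattern is found at index 4 on the fifth attempt, which matches the five rows of the table.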

However, we can see at a glance that d does not occur in abce at all, so if we store some extra information, we may be able to skip past this d in one step.

	a	b	c	d	a	b	c	e
1	a	b	c	e
√					a	b	c	e

But sometimes you cannot jump that far. For example, to find cab in abcabcabc, a better solution looks roughly like this:

	a	b	c	a	b	c	a	b	c
1	c	a	b
√			c	a	b

So the crux of the question is: why can we jump straight past the d in the first case, while in the second case we may shift by exactly two positions, no more and no less?

To sum up, two rules emerge. Let txt be a long text in which we want to find a string str. If the character of txt currently being compared is ch, two simple rules apply.


Condition	Action
ch is not in str	str skips past this ch
ch happens to be str[0]	str is moved to the position of this ch

So what should happen if ch is in str but is not str[0]?

Of course, it cannot simply be skipped, because str may contain repeated sequences. For example, when finding ababc in abababc, the best solution is

	a	b	a	b	a	b	c
1	a	b	a	b	c
√			a	b	a	b	c

That is, for a string such as ababc, because the a's sit at different positions, receiving a new a leads to a different decision depending on how much has already been matched.

In the figure below, a circle represents the current match, an arrow represents a new character, and the arrow points to the position to jump to next. An exclamation mark means "any character other than the given one".

Does this look familiar? This is the so-called state machine, and its construction has nothing to do with txt. In other words, by matching str against itself before matching txt, we obtain a state machine like this.

A state in the state machine is really the pointer position in the string to be matched. Returning to the beginning (waiting for the first a) means the pointer is 0; each step forward increases the pointer by 1. When abab receives an a and falls back to aba, the pointer goes to 3 instead of advancing to 5, as shown in the table below.

	abab	ababc
a	3
b	0
c	0	success
other	0
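
The transitions out of the state abab can be checked directly from the definition: the next state is the length of the longest prefix of the pattern that is also a suffix of what has been matched so far plus the new character. The helper below is a minimal sketch under that definition; the name next_state is hypothetical and is not used elsewhere in the article.

```python
def next_state(pattern, state, ch):
    """Length of the longest prefix of pattern that is a suffix of pattern[:state] + ch."""
    seen = pattern[:state] + ch
    # try the longest candidate prefix first
    for k in range(min(len(seen), len(pattern)), 0, -1):
        if seen.endswith(pattern[:k]):
            return k
    return 0


# state 4 means that "abab" has been matched so far
for ch in "abcx":
    print(ch, next_state("ababc", 4, ch))
# a 3
# b 0
# c 5
# x 0
```

Here 5 equals the length of ababc, so receiving c in state abab corresponds to a successful match.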

You may have noticed that this so-called state machine is really just a matrix.

The next thing we have to do is generate this state matrix.

Rough implementation

Again consider finding str in txt. The first step is to build the state matrix of str.

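test = "ababcdabadc"    # str is a built-in name in Python, so use a different variable name
length = len(test)
# dictionary that stores the states: one row per character, one column per pointer position
status = {s: [0 for _ in range(length)] for s in set(test)}
for ch in status:
    for i in range(length):
        for j in range(i+1):
            # the j characters before position i, followed by ch, equal the first j+1 characters of test
            if test[i-j:i]+ch == test[0:j+1]:
                status[ch][i] = j+1

for ch, row in status.items():
    print(ch, row)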

This gives the following (the row order may vary between runs, because the keys come from a set):

b [0, 2, 0, 4, 0, 0, 0, 8, 0, 4, 0]
d [0, 0, 0, 0, 0, 6, 0, 0, 0, 10, 0]
c [0, 0, 0, 0, 5, 0, 0, 0, 0, 0, 11]
a [1, 1, 3, 1, 3, 1, 7, 1, 9, 1, 1]

For the string ababcdabadc, the initial pointer is 0. If an a arrives, the pointer jumps to a[0]=1, meaning the current state is a; if a b arrives next, the pointer jumps to b[1]=2, meaning the current state is ab. And so on.
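
The walk through the matrix described above can be traced step by step. The sketch below rebuilds the matrix with the same brute-force construction and records the pointer after each character; the names build_status and trace are hypothetical, and the input is assumed to be short enough that the full-match state is never reached.

```python
def build_status(pattern):
    """Brute-force state matrix: status[ch][i] is the pointer after reading ch in state i."""
    length = len(pattern)
    status = {s: [0] * length for s in set(pattern)}
    for ch in status:
        for i in range(length):
            for j in range(i + 1):
                if pattern[i - j:i] + ch == pattern[0:j + 1]:
                    status[ch][i] = j + 1
    return status


def trace(pattern, txt):
    """Return the pointer value after each character of txt."""
    status = build_status(pattern)
    state, states = 0, []
    for ch in txt:
        # characters that never occur in the pattern reset the pointer to 0
        state = status[ch][state] if ch in status else 0
        states.append(state)
    return states


print(trace("ababcdabadc", "ababab"))  # [1, 2, 3, 4, 3, 4]
```

The fifth character a arrives in state abab and sends the pointer back to 3 (a[4]=3), after which b moves it forward to 4 again.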

Once the state matrix of str has been built, the string comparison itself is straightforward.

# usage; KMP and setStatus are defined below
txt = "ababcdabasdcdasdfababcdabadc"
test = "ababcdabadc"
print(KMP(txt,test))    # [17]

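def KMP(txt,test):
    status = setStatus(test)
    length = len(test)
    # status has no column for the state `length`, so after a full match we
    # resume from the longest proper prefix of test that is also a suffix of test
    restart = max((k for k in range(length) if test[:k] == test[length-k:]), default=0)
    match = []
    j = 0
    for i in range(len(txt)):
        s = txt[i]
        # characters that do not occur in test reset the pointer to 0
        j = status[s][j] if s in status else 0
        if j==length:
            match.append(i-length+1)    # start index of this match
            j = restart
    return match

def setStatus(test):
    length = len(test)
    # dictionary that stores the states
    status = {s: [0 for _ in range(length)] for s in set(test)}
    for ch in status:
        for i in range(length):
            for j in range(i+1):
                if test[i-j:i]+ch == test[0:j+1]:
                    status[ch][i] = j+1
    return status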
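
One detail worth checking is what happens after a full match, when the pointer equals len(test) and the matrix has no column for that state. A possible alternative, sketched here under the assumption that one extra column is acceptable, gives the table len(pattern)+1 states so that matching simply continues and overlapping matches are found. The names build_table, kmp_all and naive_all are hypothetical, and the pattern is assumed to be non-empty.

```python
def build_table(pattern):
    """Transition table with len(pattern) + 1 states; the last one is 'fully matched'."""
    n = len(pattern)
    table = {ch: [0] * (n + 1) for ch in set(pattern)}
    for ch in table:
        for state in range(n + 1):
            seen = pattern[:state] + ch
            # longest prefix of pattern that is a suffix of what has been seen
            for k in range(min(len(seen), n), 0, -1):
                if seen.endswith(pattern[:k]):
                    table[ch][state] = k
                    break
    return table


def kmp_all(txt, pattern):
    """Start indices of all (possibly overlapping) occurrences of pattern in txt."""
    table = build_table(pattern)
    n = len(pattern)
    state, found = 0, []
    for i, ch in enumerate(txt):
        state = table[ch][state] if ch in table else 0
        if state == n:
            found.append(i - n + 1)
    return found


def naive_all(txt, pattern):
    """Brute-force reference used only for comparison."""
    n = len(pattern)
    return [i for i in range(len(txt) - n + 1) if txt[i:i + n] == pattern]


print(kmp_all("aaaa", "aa"))  # [0, 1, 2]
print(kmp_all("ababcdabasdcdasdfababcdabadc", "ababcdabadc"))  # [17]
```

Comparing kmp_all with naive_all on a handful of inputs is a cheap way to gain confidence in any table construction.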

Optimization of the matching matrix

In general, apart from pathological strings such as aaaaaaa, the matching matrix is sparse, which means we do a lot of useless comparisons. The construction of the matching matrix can therefore be optimized; at the very least, the outermost loop can be removed.

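def setStatus(test):
    length = len(test)
    # dictionary that stores the states
    status = {s: [0 for _ in range(length)] for s in set(test)}
    for i in range(length):
        for j in range(i+1):
            # if the j characters before position i equal the first j characters of test,
            # then the only character that extends this match is test[j]
            if test[i-j:i] == test[0:j]:
                ch = test[j]
                status[ch][i] = j+1
    return status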
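
A quick way to check that removing the outer loop does not change the result is to run both constructions on a few patterns and compare the matrices. The sketch below copies the two versions under the hypothetical names set_status_triple and set_status_double so that it runs on its own.

```python
def set_status_triple(test):
    """Construction with an outer loop over characters (three nested loops)."""
    length = len(test)
    status = {s: [0] * length for s in set(test)}
    for ch in status:
        for i in range(length):
            for j in range(i + 1):
                if test[i - j:i] + ch == test[0:j + 1]:
                    status[ch][i] = j + 1
    return status


def set_status_double(test):
    """Construction without the loop over characters (two nested loops)."""
    length = len(test)
    status = {s: [0] * length for s in set(test)}
    for i in range(length):
        for j in range(i + 1):
            if test[i - j:i] == test[0:j]:
                status[test[j]][i] = j + 1
    return status


for pattern in ["ababcdabadc", "aaaaaaa", "abcabc"]:
    assert set_status_triple(pattern) == set_status_double(pattern)
print("both constructions agree")
```

Both versions keep the largest j that satisfies the condition, because j increases and later assignments overwrite earlier ones.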

Now only two nested loops remain, which looks cleaner, but a casual onlooker would still object to the $O(N^2)$ complexity. To do better, let us examine the properties of the matching matrix.

b [0, 2, 0, 4, 0, 0, 0, 8, 0, 4, 0]
d [0, 0, 0, 0, 0, 6, 0, 0, 0, 10, 0]
c [0, 0, 0, 0, 5, 0, 0, 0, 0, 0, 11]
a [1, 1, 3, 1, 3, 1, 7, 1, 9, 1, 1]

First, most non-zero values are increasing: for example, the non-zero values of d are 6 and 10, and those of c are 5 and 11. And whenever a smaller value appears, that value must have appeared before; for example b[9]=4, and this 4 already occurred at b[3].

If we take "whenever the value decreases, it must repeat an earlier value" as a principle, then 0 and 1 obviously fit this principle as well: the 0s in the middle of b, c and d already appeared at the first position, and a has no 0 in the middle because a[0]=1, so its minimum value can only be 1.

Put differently, an entry such as a[i] is either i+1 or a value that has already occurred earlier in the same row.
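
This observation is what a well-known linear-time construction of the matrix relies on: each new column is copied from an earlier "restart" column, and only the entry of the matching character is overwritten with i+1. The sketch below is a standard textbook-style technique swapped in for illustration, not necessarily the approach the article goes on to develop; the names build_status_linear and restart are assumptions.

```python
def build_status_linear(pattern):
    """Build the state matrix in O(len(pattern) * alphabet size) steps."""
    length = len(pattern)
    status = {ch: [0] * length for ch in set(pattern)}
    if length == 0:
        return status
    status[pattern[0]][0] = 1
    restart = 0  # state the machine would be in after reading pattern[1:i]
    for i in range(1, length):
        for ch in status:
            # mismatch: behave exactly like the earlier restart state
            status[ch][i] = status[ch][restart]
        # match: move one step forward
        status[pattern[i]][i] = i + 1
        restart = status[pattern[i]][restart]
    return status


table = build_status_linear("ababcdabadc")
for ch in "abcd":
    print(ch, table[ch])
# a [1, 1, 3, 1, 3, 1, 7, 1, 9, 1, 1]
# b [0, 2, 0, 4, 0, 0, 0, 8, 0, 4, 0]
# c [0, 0, 0, 0, 5, 0, 0, 0, 0, 0, 11]
# d [0, 0, 0, 0, 0, 6, 0, 0, 0, 10, 0]
```

The output agrees with the matrix shown above, and every entry is indeed either i+1 or a copy of an entry from an earlier column.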

In this case, for a string str of length N, we can take its first M characters and match them against the rest of str. For example, for "ababcdabadc", first match a to get the set of positions where it matches; then match ab at those positions, and so on, until the only remaining matching position is position 0.
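
One possible reading of this narrowing procedure is sketched below: start with every position of the pattern as a candidate, then keep only the positions where the prefix of length M still matches, increasing M until only position 0 is left. The function name narrow_prefix_matches and the shape of its return value are assumptions made for illustration.

```python
def narrow_prefix_matches(pattern):
    """For M = 1, 2, ... list the start positions where pattern[:M] occurs in pattern."""
    steps = []
    candidates = list(range(len(pattern)))
    for m in range(1, len(pattern) + 1):
        # keep a start position p only if the m-th character still matches there
        candidates = [p for p in candidates
                      if p + m <= len(pattern) and pattern[p + m - 1] == pattern[m - 1]]
        steps.append((pattern[:m], candidates))
        if candidates == [0]:
            break
    return steps


for prefix, positions in narrow_prefix_matches("ababcdabadc"):
    print(prefix, positions)
# a [0, 2, 6, 8]
# ab [0, 2, 6]
# aba [0, 6]
# abab [0]
```

Each surviving position other than 0 marks a place where a prefix of the pattern reappears inside the pattern, which is exactly where the matrix holds a fallback value other than 0 or 1.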