"Algorithm"-string pattern matching algorithm 2 [KMP algorithm]

Introduction

The previous article covered the BF (brute-force) algorithm. Why is its time performance relatively poor?

  • Every failed match causes a lot of backtracking, and the partial match already obtained is thrown away. On a mismatch, the main string pointer must go back to the character after the previous starting position, the pattern string pointer must go back to 0, and matching restarts character by character.
  • Can the main string pointer stay where it is while only the pattern string moves?
  • This is exactly what the KMP algorithm does.
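
For reference, the backtracking described above looks roughly like this in a brute-force matcher. This is a minimal sketch and not the code from the previous article; the function name `matching_BF` is chosen here for illustration only.

```python
def matching_BF(S, T):
    """Brute-force matching: return the first index of T in S, or -1."""
    i, j = 0, 0
    n, m = len(S), len(T)
    while i < n and j < m:
        if S[i] == T[j]:
            i, j = i + 1, j + 1   # characters match: advance both pointers
        else:
            i = i - j + 1         # main string backtracks to the next start position
            j = 0                 # pattern string backtracks to its first character
    return i - j if j == m else -1


print(matching_BF("abcabaabcabac", "baab"))  # 4
```

The two assignments in the `else` branch are the backtracking that KMP avoids.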

1. Overview

The KMP algorithm is an improved string matching algorithm proposed by D. E. Knuth, J. H. Morris and V. R. Pratt, hence the name Knuth-Morris-Pratt algorithm (KMP algorithm for short).

Improvement ideas:

  • Use the partial match already obtained to slide the pattern string further in a single step, so that the pointer i of the main string S never needs to backtrack. The running time improves to O(n+m), where n is the length of the main string and m is the length of the pattern string.
  • The new comparison position k to which the pattern string T slides depends only on the pattern string T itself, not on the main string.

2. The basic idea

  • See this article
  • Basic concepts
    : Longest prefix suffix: the length of the longest proper prefix of a string that is also a suffix of that string
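
The concept can be checked with a direct, unoptimized search. The sketch below is only an illustration; the helper name `longest_prefix_suffix` is an assumption made here and does not appear in the original article.

```python
def longest_prefix_suffix(s):
    """Length of the longest proper prefix of s that is also a suffix of s."""
    # Try the longest candidate first; a proper prefix is shorter than s itself.
    for k in range(len(s) - 1, 0, -1):
        if s[:k] == s[-k:]:
            return k
    return 0


for s in ["abcab", "aaaa", "abcd"]:
    print(s, longest_prefix_suffix(s))
# abcab 2
# aaaa 3
# abcd 0
```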

3. KMP algorithm: the next array

1. Purpose: next[j] gives, when the j-th character of the pattern mismatches the corresponding character of the main string, the position in the pattern of the character that should be compared with that main-string character next.

2. Method of calculating next[j]

The value of the next array depends only on the pattern string itself.

  • When j=0, next[j]=-1, meaning that no character of the pattern is left to compare with the current main-string character, so the main string pointer moves forward by one.
  • When j>0, next[j] is the length of the longest equal proper prefix and suffix of the substring formed by positions 0 to j-1 of the pattern string.
  • If that substring has no equal prefix and suffix, next[j]=0, meaning that comparison restarts from the head of the pattern string.


  • Example: the next array of the pattern string p="abcabcmn" is next[0]=-1 (no characters precede position 0, so it is handled as a special case), next[1]=0, next[2]=0, next[3]=0, next[4]=1, next[5]=2, next[6]=3, next[7]=0.
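
The example values can be reproduced straight from the definition. The following is a deliberately slow sketch meant only for checking small cases; the name `next_by_definition` is hypothetical and is not part of the original article.

```python
def next_by_definition(T):
    """Build the next array directly from its definition (O(m^3), for checking only)."""
    m = len(T)
    nxt = [-1] * m                     # next[0] = -1 by convention
    for j in range(1, m):
        sub = T[:j]                    # characters at positions 0 .. j-1
        nxt[j] = 0                     # default: no equal prefix and suffix
        for k in range(j - 1, 0, -1):  # longest proper prefix/suffix first
            if sub[:k] == sub[j - k:]:
                nxt[j] = k
                break
    return nxt


print(next_by_definition("abcabcmn"))  # [-1, 0, 0, 0, 1, 2, 3, 0]
```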

3. Pseudo code


  • Suppose we compute the next array from left to right and have already obtained next[0]~next[i]; now we need next[i+1]. Let j=next[i]. By the definition of next[i], T[0..j-1]=T[i-j..i-1]. Now compare T[j] with T[i]: if they are equal, the matching prefix and suffix both grow by one character, so next[i+1]=j+1. If they differ, fall back to the next shorter candidate by setting j=next[j] and compare again; once j reaches -1, no candidate is left and next[i+1]=0.
  • The specific steps are as follows:

1. Initialize the prefix pointer j=-1, the suffix pointer i=0, and next=[-1]*m, where m is the length of the pattern string
2. While the suffix pointer i<m-1, repeat the following:

  • 2.1 If j==-1 or T[i]==T[j]: move both pointers i and j forward by one position, then set next[i]=j
  • 2.2 Otherwise, set j=next[j]

4. Python implementation

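def gen_next(T):
    j = -1  # prefix pointer
    i = 0   # suffix pointer
    m = len(T)
    next = [-1] * m
    # i is the index of the current element in the pattern; j is the index of
    # the character just after the longest common prefix of the substring
    # that precedes the current element
    while i < m - 1:
        if j == -1 or T[i] == T[j]:
            j += 1
            i += 1
            next[i] = j
        else:
            j = next[j]
    return next
if __name__ == "__main__":
    s = "abcdabd"
    b = gen_next(s)
    print(b)  # [-1, 0, 0, 0, 0, 1, 2]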
    


4. KMP algorithm

  • The key question is: when a character of the pattern string mismatches a character of the text string, to which position should the pattern string jump? If the character at j in the pattern string mismatches the character at i in the text string, the next step compares the character at next[j] with the character at i, which is equivalent to moving the pattern string to the right by j-next[j] positions.
  • The KMP algorithm differs from the BF algorithm in how i and j backtrack: i never moves back, and j falls back to next[j] instead of 0.
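
The shift of j-next[j] positions can be made visible with a small alignment printout. This is an illustrative sketch: the main string `S` is made up here so that the first mismatch falls at j=6, and the next array is the one from the "abcabcmn" example earlier in the article.

```python
S = "abcabcabcmn"                  # hypothetical main string for illustration
T = "abcabcmn"
nxt = [-1, 0, 0, 0, 1, 2, 3, 0]    # next array of T from the earlier example

i = j = 6                          # first mismatch: S[6] is 'a' but T[6] is 'm'
assert S[:6] == T[:6] and S[i] != T[j]

new_j = nxt[j]                     # pattern position compared with S[i] next
shift = j - new_j                  # how far the pattern slides to the right

print(S)
print(" " * (i - j) + T, "  <- before: mismatch at j =", j)
print(" " * (i - new_j) + T, "  <- after: shifted right by", shift)

# The characters to the left of S[i] still line up after the shift,
# so i does not have to move back.
assert S[i - new_j:i] == T[:new_j]
```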

1. Pseudo code

1. Initialize the subscripts i=0 and j=0 into the main string S and the pattern string T
2. Calculate the next array of the pattern string T
3. Loop while neither S nor T has been fully compared:

  • 3.1 If S[i]==T[j], add one to both i and j to compare the next pair of characters of S and T
  • 3.2 Otherwise, slide the pattern string to the right by setting j=next[j]
  • 3.3 If j==-1 (no position of the pattern is left to try), add one to both i and j to prepare for the next comparison

4. If all characters of T have been compared, the match succeeds and the starting index i-j of the match is returned; otherwise, the match fails and -1 is returned.

2. Python implementation

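def gen_next(T):
    j = -1  # prefix pointer
    i = 0   # suffix pointer
    m = len(T)
    next = [-1] * m
    # i is the index of the current element in the pattern; j is the index of
    # the character just after the longest common prefix of the substring
    # that precedes the current element
    while i < m - 1:
        if j == -1 or T[i] == T[j]:
            j += 1
            i += 1
            next[i] = j
        else:
            j = next[j]
    return next
def matching_KMP(S, T):
    """KMP string matching, main function."""
    i, j = 0, 0
    next = gen_next(T)
    n, m = len(S), len(T)
    while i < n and j < m:
        if j == -1 or S[i] == T[j]:
            i, j = i + 1, j + 1
        else:
            j = next[j]  # i stays where it is; only the pattern slides
    if j == m:
        return i - j     # starting index of the match
    else:
        return -1
if __name__ == "__main__":
    s = "abcabaabcabac"
    t = "baab"
    b = matching_KMP(s, t)
    print(b)  # 4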

  • Time complexity: O(n+m)
  • Space complexity: O(m)
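
One way to get a feel for the linear bound is to count character comparisons during matching. The sketch below is an instrumented variant written for this purpose; the function name `count_kmp_comparisons` and the test strings are assumptions made here, not part of the original article. Every comparison either advances i or lowers j, and j can only be lowered about as often as it was raised, so the count stays within roughly 2n.

```python
def gen_next(T):
    """Same construction as the gen_next listing in the article."""
    m = len(T)
    nxt = [-1] * m
    i, j = 0, -1
    while i < m - 1:
        if j == -1 or T[i] == T[j]:
            i, j = i + 1, j + 1
            nxt[i] = j
        else:
            j = nxt[j]
    return nxt


def count_kmp_comparisons(S, T):
    """Return (match index or -1, number of S[i] == T[j] comparisons)."""
    nxt = gen_next(T)
    i = j = comparisons = 0
    n, m = len(S), len(T)
    while i < n and j < m:
        if j == -1:
            i, j = i + 1, 0      # restart the pattern at the next character of S
            continue
        comparisons += 1
        if S[i] == T[j]:
            i, j = i + 1, j + 1
        else:
            j = nxt[j]
    return (i - j if j == m else -1), comparisons


S = "a" * 20 + "b"               # a hard case for brute-force matching
T = "aaaab"
index, comparisons = count_kmp_comparisons(S, T)
print(index, comparisons, 2 * len(S) + 1)
```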

5. Improvement of KMP algorithm

The nextval array

The KMP algorithm can be improved further; the following example illustrates how:

  • Main string s="aaaaabaaaaac"
  • Pattern string t="aaaaac"
  • In this example, when 'b' (s[5]) and 'c' (t[5]) mismatch, KMP next compares 'b' with the 'a' just before 'c' (t[4], because next[5]=4), which obviously mismatches as well. The character that this 'a' falls back to is again 'a'.
  • Clearly there is no need to compare 'b' with each of these 'a' characters in turn.
  • Because the character after falling back is the same as the original character, and the original character did not match, the character after falling back cannot match either. The KMP algorithm nevertheless compares 'b' with each 'a' it falls back to. This is what we can improve.
  • We call the improved next array the nextval array. The improvement can be summarized as follows: if the character at position a equals the character at position b=next[a], then nextval[a]=nextval[b]; otherwise nextval[a]=next[a]. For example, the next array and nextval array of the string "ababaaab" are:

    j         0   1   2   3   4   5   6   7
    T[j]      a   b   a   b   a   a   a   b
    next     -1   0   0   1   2   3   1   1
    nextval  -1   0  -1   0  -1   3   1   0
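
The rule stated above can also be applied literally to an existing next array. The following is a small sketch under that reading; the helper name `nextval_from_next` is hypothetical, and the next array of "ababaaab" in it was worked out by hand from the definition.

```python
def nextval_from_next(T, nxt):
    """Derive nextval from next using the rule stated above."""
    nextval = list(nxt)              # start from next; position 0 stays -1
    for a in range(1, len(T)):
        b = nxt[a]                   # position that a falls back to
        if T[a] == T[b]:
            # Same character: a mismatch at a implies a mismatch at b,
            # so reuse the fallback already computed for b (b < a).
            nextval[a] = nextval[b]
    return nextval


T = "ababaaab"
nxt = [-1, 0, 0, 1, 2, 3, 1, 1]      # next array of T, computed by hand
print(nextval_from_next(T, nxt))     # [-1, 0, -1, 0, -1, 3, 1, 0]
```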

1. Python implementation of nextval array

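def next_val(T):
    j = -1  # prefix pointer
    i = 0   # suffix pointer
    m = len(T)
    next_val = [-1] * m
    # i is the index of the current element in the pattern; j is the index of
    # the character just after the longest common prefix of the substring
    # that precedes the current element
    while i < m - 1:
        if j == -1 or T[i] == T[j]:
            j += 1
            i += 1
            if T[i] != T[j]:
                next_val[i] = j
            else:
                # same character at i and j: skip straight to j's fallback
                next_val[i] = next_val[j]
        else:
            j = next_val[j]
    return next_val

if __name__ == "__main__":
    t = "ababaaab"
    b = next_val(t)
    print(b)
# Result: [-1, 0, -1, 0, -1, 3, 1, 0]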

2. Implementation of KMP improved algorithm

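  • Just replace the previous next array with next_val

def next_val(T):
    j = -1  # prefix pointer
    i = 0   # suffix pointer
    m = len(T)
    next_val = [-1] * m
    # i is the index of the current element in the pattern; j is the index of
    # the character just after the longest common prefix of the substring
    # that precedes the current element
    while i < m - 1:
        if j == -1 or T[i] == T[j]:
            j += 1
            i += 1
            if T[i] != T[j]:
                next_val[i] = j
            else:
                # same character at i and j: skip straight to j's fallback
                next_val[i] = next_val[j]
        else:
            j = next_val[j]
    return next_val
def matching_KMP(S, T):
    """KMP string matching, main function."""
    i, j = 0, 0
    next = next_val(T)
    n, m = len(S), len(T)
    while i < n and j < m:
        if j == -1 or S[i] == T[j]:
            i, j = i + 1, j + 1
        else:
            j = next[j]
    if j == m:
        return i - j
    else:
        return -1
if __name__ == "__main__":
    s = "abcabaabcabac"
    t = "baab"
    b = matching_KMP(s, t)
    print(b)  # 4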
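
To see what the nextval table saves on the motivating example (main string "aaaaabaaaaac", pattern "aaaaac"), the matching loop can be instrumented to count comparisons. This is a sketch for illustration; the names `gen_nextval` and `match_counting` are chosen here and are not from the original listings, although the table constructions follow them.

```python
def gen_next(T):
    """Plain next table, built as in the earlier listing."""
    m = len(T)
    nxt = [-1] * m
    i, j = 0, -1
    while i < m - 1:
        if j == -1 or T[i] == T[j]:
            i, j = i + 1, j + 1
            nxt[i] = j
        else:
            j = nxt[j]
    return nxt


def gen_nextval(T):
    """Improved table: skip fallback positions that hold the same character."""
    m = len(T)
    nextval = [-1] * m
    i, j = 0, -1
    while i < m - 1:
        if j == -1 or T[i] == T[j]:
            i, j = i + 1, j + 1
            nextval[i] = j if T[i] != T[j] else nextval[j]
        else:
            j = nextval[j]
    return nextval


def match_counting(S, T, table):
    """KMP matching with a given fallback table; returns (index, comparisons)."""
    i = j = comparisons = 0
    n, m = len(S), len(T)
    while i < n and j < m:
        if j == -1:
            i, j = i + 1, 0
            continue
        comparisons += 1
        if S[i] == T[j]:
            i, j = i + 1, j + 1
        else:
            j = table[j]
    return (i - j if j == m else -1), comparisons


S, T = "aaaaabaaaaac", "aaaaac"
print(match_counting(S, T, gen_next(T)))     # (6, 17)
print(match_counting(S, T, gen_nextval(T)))  # (6, 13)
```

Both tables find the match at index 6; with nextval, the run of comparisons between 'b' and the repeated 'a' characters is cut short.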
    


Origin blog.csdn.net/weixin_45666566/article/details/112444013