Implementing the KMP algorithm in Python [notes on the KMP algorithm, updated regularly]

KMP algorithm introduction

The KMP algorithm is an improved string matching algorithm proposed by D. E. Knuth, J. H. Morris and V. R. Pratt, hence the name Knuth-Morris-Pratt algorithm (KMP algorithm for short). Its core idea is to use the information gained from a failed match to reduce the number of comparisons between the pattern string and the main string, and so match quickly. This is implemented with a next array (often written as a next() function), which contains local matching information about the pattern string itself. The time complexity of the KMP algorithm is O(m+n), where n is the length of the main string and m is the length of the pattern string.

 

Brute-force solution

Target string: ABACABAB

Pattern string: ABAB  

A table illustrates the idea of the brute-force solution. Each time, one character of the target string is compared with one character of the pattern string; as long as they are equal, the next pair is compared.

If they are not equal, start again from the next position of the target string.

 

Target string   A B A C A B A B
First attempt   A B A B           ×
Second attempt    A B A B         ×
Third attempt       A B A B       ×
Fourth attempt        A B A B     ×
Fifth attempt           A B A B   match

 

The code is as follows; it is the easiest idea to understand, but its time complexity is O(m*n) in the worst case.

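def Brute_Force(target_str, model_str):
    """
    :param target_str: target string
    :param model_str: pattern string
    :return: index of the first match, or -1 if there is none
    """

    target_len = len(target_str)
    model_len = len(model_str)

    # Try every start position: target length - pattern length + 1 of them
    for i in range(target_len - model_len + 1):
        # k walks through the pattern string
        k = 0
        for j in range(i, i + model_len):
            # Mismatch: give up on this start position
            if target_str[j] != model_str[k]:
                break
            else:
                k += 1
        else:
            # The inner loop finished without a break: full match at i
            return i
    # No match anywhere (-1, the same convention the KMP function uses)
    return -1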

A modified version that compares a whole slice at once, so only one explicit loop is needed (the slice comparison still examines the characters one by one internally).

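def Brute_Force_upgrade(target_str, model_str):
    target_len = len(target_str)
    model_len = len(model_str)
    # Same outer loop as in the code above
    for i in range(target_len - model_len + 1):
        # Take a slice of the target with the same length as the pattern
        # and return the first position where it equals the pattern
        if target_str[i:i + model_len] == model_str:
            return i
    # No match anywhere
    return -1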

 

The brute-force method is easy to understand, but it makes a relatively large number of comparisons: after each failure it restarts from the next position of the target string and compares from scratch, which wastes many comparisons.

Target string: ABABBBAAABABABBA

Pattern string: ABABABB

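Position       1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16  T/F
Main string    A  B  A  B  B  B  A  A  A  B  A  B  A  B  B  A
1              A  B  A  B  A                                    ×
2                 A                                             ×
3                    A  B  A                                    ×
4                             A                                 ×
5                                A  B                           ×
6                                        A  B  A  B  A  B  B      ✓

Each row shows the pattern characters compared in that attempt, up to and including the one that fails. Only six alignments are listed; the alignments starting at positions 4, 5 and 8 also fail and are not shown.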

 

Analyze this target string and pattern string:

First comparison: compared up to position 5 of the target string, mismatch

Second comparison: compared at position 2, mismatch

Third comparison: compared up to position 5 again, mismatch

......

Sixth comparison: successful match

 

Look closely at the first four characters of the target string, ABAB, the part before the fifth character: its beginning and its end overlap (AB). The first and the third comparisons both fail at the fifth position of the target string.

So the comparisons in between re-examine characters that have already been seen; they are wasted comparisons.

We therefore want to reduce the number of comparisons.
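To make the wasted work concrete, here is a minimal sketch that counts the character comparisons made by the brute-force approach. The function name brute_force_count and the returned tuple are choices made for this illustration, not part of the code above; indices are 0-based, so index 8 corresponds to position 9 in the table.

```python
def brute_force_count(target, pattern):
    """Return (index of first match or -1, number of character comparisons)."""
    comparisons = 0
    for start in range(len(target) - len(pattern) + 1):
        for k in range(len(pattern)):
            comparisons += 1  # one character of target against one of pattern
            if target[start + k] != pattern[k]:
                break  # mismatch: move on to the next start position
        else:
            return start, comparisons  # every pattern character matched
    return -1, comparisons


print(brute_force_count("ABABBBAAABABABBA", "ABABABB"))  # (8, 23)
```

For this example the match is found after 23 character comparisons on a 16-character target, because several target characters are compared more than once.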

This is where the next array comes in.

 

Next, the KMP algorithm explained

The key to the KMP algorithm is computing the next array.

Computing the next array by hand

The next array is computed from the pattern string alone. The next array of the pattern string ABABABB is given first; the solution process is then described in detail.

Subscript        1  2  3  4  5  6  7
Pattern string   A  B  A  B  A  B  B
next[j]          0  1  1  2  3  4  5

 

The next array is introduced so that the target string pointer never has to move backwards.

For each position of the pattern string, the next array records how much of the pattern before that position repeats, that is, appears both at its start and at its end.

Rather than explaining at length why the next array works, the simplest approach is to learn how to compute it.

 

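For each subscript, take the substring in front of it, split it into a left part and a right part, and find their longest overlap, that is, the longest string that begins the left part and also ends the right part.

(The left part, written L, is the substring without its last character; the right part, written R, is the substring without its first character.)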

 

When subscript = 1, initialize the next array: next[1] = 0.

When subscript = 2, the corresponding substring is A, so L is empty and R is empty. There is no overlap (the substring itself does not count), so the overlap length is 0 and next[2] = 0 + 1 = 1.

When subscript = 3, the corresponding substring is AB, L = A, R = B. There is no overlap, so the overlap length is 0 and next[3] = 0 + 1 = 1.

When subscript = 4, the corresponding substring is ABA, L = AB, R = BA. The overlap is A, so the overlap length is 1 and next[4] = 1 + 1 = 2.

When subscript = 5, the corresponding substring is ABAB, L = ABA, R = BAB. The overlap is AB, so the overlap length is 2 and next[5] = 2 + 1 = 3.

When subscript = 6, the corresponding substring is ABABA, L = ABAB, R = BABA. The overlap is ABA, so the overlap length is 3 and next[6] = 3 + 1 = 4.

When subscript = 7, the corresponding substring is ABABAB, L = ABABA, R = BABAB. The overlap is ABAB (again, the whole substring ABABAB does not count), so the overlap length is 4 and next[7] = 4 + 1 = 5.

 

Here is an example to deepen understanding. The solution is the same as above.

Pattern string: ABABBBAA

Subscript        1  2  3  4  5  6  7  8
Pattern string   A  B  A  B  B  B  A  A
next[j]          0  1  1  2  3  1  1  2
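The manual procedure can also be written down directly as code. The following is a minimal sketch that applies the definition literally; next_by_definition is a hypothetical helper introduced here, and it is deliberately slow, which makes it suitable only for checking hand-computed tables.

```python
def next_by_definition(pattern):
    """Next values in the 1-based convention of the tables, as a Python list.

    result[j - 1] holds next[j] for the 1-based subscript j.
    """
    result = []
    for subscript in range(1, len(pattern) + 1):
        if subscript == 1:
            result.append(0)  # next[1] is 0 by definition
            continue
        sub = pattern[:subscript - 1]  # the characters in front of this subscript
        overlap = 0
        # Try the longest candidate first; the whole substring does not count.
        for k in range(len(sub) - 1, 0, -1):
            if sub[:k] == sub[-k:]:  # start of L equals end of R
                overlap = k
                break
        result.append(overlap + 1)
    return result


print(next_by_definition("ABABABB"))   # [0, 1, 1, 2, 3, 4, 5]
print(next_by_definition("ABABBBAA"))  # [0, 1, 1, 2, 3, 1, 1, 2]
```

Both results agree with the two tables worked out by hand.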

 

Computing the next array in code

Following the manual method, each step compares the character at the current subscript j with the character at position t.

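def getNext(substrT):
    # Values follow the 1-based convention of the tables,
    # stored in a 0-based Python list: next_list[j - 1] holds next[j]
    next_list = [0 for i in range(len(substrT))]
    # j is the 1-based subscript of the pattern position being processed
    # t is the 1-based position it is compared with; it supplies the next value
    j = 1
    t = 0
    while j < len(substrT):
        # When t == 0 there is nothing to compare: assign directly.
        # Otherwise compare the two characters; Python indices start at 0,
        # hence the -1 on both subscripts.
        if t == 0 or substrT[j - 1] == substrT[t - 1]:
            # The overlap grows by one character
            next_list[j] = t + 1
            j += 1
            t += 1
        else:
            # Fall back to a shorter overlap; the -1 is the same index shift as above
            t = next_list[t - 1]
    return next_list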

 

The best way to understand the code is to execute this process manually.
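One way to follow the process step by step is to log every iteration. The sketch below is an instrumented variant of the getNext loop; trace_next and the wording of the log messages are choices made for this illustration.

```python
def trace_next(pattern):
    """Same loop as getNext, but also returns a log of every step."""
    next_list = [0] * len(pattern)
    steps = []
    j, t = 1, 0
    while j < len(pattern):
        if t == 0 or pattern[j - 1] == pattern[t - 1]:
            next_list[j] = t + 1
            steps.append(f"j={j} t={t}: extend, next_list[{j}] = {t + 1}")
            j += 1
            t += 1
        else:
            steps.append(f"j={j} t={t}: mismatch, fall back to t = {next_list[t - 1]}")
            t = next_list[t - 1]
    return next_list, steps


values, log = trace_next("ABABABB")
for line in log:
    print(line)
print(values)  # [0, 1, 1, 2, 3, 4, 5]
```

Comparing the printed log with a trace written by hand shows quickly where the two diverge.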

 

With the next array, the core of the KMP algorithm, solved, all that remains is to optimize the brute-force solution written earlier so that it makes fewer comparisons.

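def KMP(target_str, model_str):
    # Called next_list rather than next so Python's built-in next() is not shadowed
    next_list = getNext(model_str)
    # Index into the main (target) string
    i = 0
    # Index into the pattern string
    j = 0
    while i < len(target_str) and j < len(model_str):
        if target_str[i] == model_str[j]:
            i += 1
            j += 1
        elif j != 0:
            # Mismatch at 0-based j, i.e. 1-based position j + 1, whose next
            # value is stored in next_list[j]; subtract 1 to get a 0-based index.
            # Note: next_list[j - 1] would be wrong here and can report false
            # matches, e.g. for target "ABBC" and pattern "ABC".
            j = next_list[j] - 1
        else:
            # Mismatch at the first pattern character: advance the target pointer
            i += 1
    if j == len(model_str):
        return i - len(model_str)
    else:
        return -1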
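Index errors are easy to make when 1-based next values meet 0-based Python indexing, so a sanity check against Python's built-in str.find is worthwhile. The sketch below restates the matcher compactly so that it runs on its own; build_next, kmp_find and cross_check are names chosen for this illustration. The fallback step resumes at next_list[j] - 1, the stored 1-based position converted to a 0-based index.

```python
from itertools import product


def build_next(pattern):
    """1-based next values stored in a 0-based list."""
    next_list = [0] * len(pattern)
    j, t = 1, 0
    while j < len(pattern):
        if t == 0 or pattern[j - 1] == pattern[t - 1]:
            next_list[j] = t + 1
            j += 1
            t += 1
        else:
            t = next_list[t - 1]
    return next_list


def kmp_find(target, pattern):
    """Index of the first occurrence of pattern in target, or -1."""
    next_list = build_next(pattern)
    i = j = 0
    while i < len(target) and j < len(pattern):
        if target[i] == pattern[j]:
            i += 1
            j += 1
        elif j != 0:
            j = next_list[j] - 1  # 1-based fallback position -> 0-based index
        else:
            i += 1
    return i - len(pattern) if j == len(pattern) else -1


def cross_check(alphabet="AB", max_target=8, max_pattern=4):
    """Compare kmp_find with str.find on all short strings; return disagreements."""
    mismatches = []
    for n in range(max_target + 1):
        for m in range(1, max_pattern + 1):
            for t in product(alphabet, repeat=n):
                for p in product(alphabet, repeat=m):
                    target, pattern = "".join(t), "".join(p)
                    if kmp_find(target, pattern) != target.find(pattern):
                        mismatches.append((target, pattern))
    return mismatches


print(kmp_find("ABABBBAAABABABBA", "ABABABB"))  # 8
print(cross_check())                            # []
```

An exhaustive check over short strings is cheap and tends to expose off-by-one mistakes in the fallback step that a single hand-picked example can miss.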

 

The key is computing the next array. Stepping through the code by hand and writing out several next arrays will certainly deepen your understanding of the KMP algorithm.

 

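The next array computed this way still has a shortcoming, which will be covered in a later update. With the pattern string below, for example, a mismatch at one of the later A characters falls back to an earlier A, which is bound to mismatch again.

Pattern string: AAAAAAB

Subscript        1  2  3  4  5  6  7
Pattern string   A  A  A  A  A  A  B
next[j]          0  1  2  3  4  5  6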

 

 

Video links for the KMP algorithm:

Baidu cloud link: https://pan.baidu.com/s/1VhmuuORruu-OSwEfkulX_A 
Extraction code: nj6f

Bilibili link: there are five videos in total; only the first one is linked here.

https://www.bilibili.com/video/BV1Vf4y1X7Zc/

 

 

 


Origin blog.csdn.net/qq_41292236/article/details/108221990