String pattern matching algorithms (idea + Python implementation)

Naive pattern matching algorithm

An example:

To find the position of the substring T = "job" in the main string S = "goodjob", we usually go through the following steps.

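1. Align the substring "job" with the main string S at the first position, i.e. check whether 'g' == 'j';
2. The result is no, so move on and check whether 'o' == 'j';
3. Continue until the first character matches: 'j' == 'j';
4. Then match the rest of the substring: S[4:7] == 'job' == T.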

This is the simplest matching algorithm. It is easy to implement by brute force, but it performs many redundant steps.
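The steps above can be sketched as follows. This is only a minimal illustration: the function name `naive_match` and the convention of returning a 0-based index (or -1 when there is no match) are assumptions, not part of the original article.

```python
def naive_match(S, T):
    """Return the 0-based index of the first occurrence of T in S, or -1."""
    n, m = len(S), len(T)
    # Try every possible alignment of T against S, left to right.
    for start in range(n - m + 1):
        j = 0
        # Compare character by character until a mismatch or the end of T.
        while j < m and S[start + j] == T[j]:
            j += 1
        if j == m:
            return start  # every character of T matched
    return -1


print(naive_match("goodjob", "job"))  # 4, i.e. S[4:7] == "job"
```

Each failed alignment throws away everything learned from the characters already compared, which is where the redundant steps come from.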

KMP pattern matching algorithm

The general idea is to use information already known about the string to skip the steps that naive pattern matching performs unnecessarily.

The detailed process is not repeated here; only a brief description is given.

Let the subscript of the main string be i and the subscript of the substring be j. The whole idea can be summed up in two short phrases: "i does not backtrack, j follows next".

i does not backtrack: during the matching process, the subscript of the main string only increases or stays the same; it never returns to an earlier position of the main string.

j follows next: after a mismatch, the subscript of the substring is reset according to the values of the next array.

Next we explain in detail what the values of the next array are.

Derivation of the next array values

First, the mathematical definition:

$$
next[j] =
\begin{cases}
0, & \text{when } j = 1 \\[2ex]
\max\{\, k \mid 1 < k < j \text{ and } p_1 \dots p_{k-1} = p_{j-k+1} \dots p_{j-1} \,\}, & \text{when this set is not empty} \\[2ex]
1, & \text{other cases}
\end{cases}
$$
e.g. T = "abcabx"

$$
\begin{array}{c|cccccc}
j & 1 & 2 & 3 & 4 & 5 & 6 \\ \hline
\text{pattern string } T & a & b & c & a & b & x \\
next[j] & 0 & 1 & 1 & 1 & 2 & 3
\end{array}
$$

Take the prefix "abcab" of the pattern (ab c ab): its prefix and suffix have two characters in common, so next[6] = 2 + 1 = 3.
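To make the definition concrete, here is a small sketch that evaluates it literally, trying every candidate k from largest to smallest. The name `next_by_definition` is hypothetical, and this brute-force version is only meant for checking small examples by hand; it is not the linear-time construction that KMP normally uses.

```python
def next_by_definition(T):
    """Return [next[1], ..., next[m]] for pattern T, straight from the definition."""
    m = len(T)
    result = []
    for j in range(1, m + 1):
        if j == 1:
            result.append(0)
            continue
        best = 1  # value used when no valid k exists
        # Candidates satisfy 1 < k < j; take the largest one that works.
        for k in range(j - 1, 1, -1):
            # p_1..p_{k-1} is T[0:k-1]; p_{j-k+1}..p_{j-1} is T[j-k:j-1] (0-based slices).
            if T[:k - 1] == T[j - k:j - 1]:
                best = k
                break
        result.append(best)
    return result


print(next_by_definition("abcabx"))  # [0, 1, 1, 1, 2, 3]
```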

The time complexity of the entire algorithm is O(n + m), which is better than the O((n - m + 1) * m) of the naive pattern matching algorithm (n is the length of S and m is the length of T).
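A minimal sketch of the full search, under the two rules above (i never moves backward; j falls back according to the next array), might look like this. The names `get_next` and `kmp_search`, the '#' padding used to get 1-based indices, and the 0-based return value are all assumptions made for illustration.

```python
def get_next(T):
    """Return the next array of T as a 1-indexed list (index 0 is unused)."""
    P = '#' + T  # pad so that P[1] is the first pattern character
    m = len(T)
    nxt = [0] * (m + 1)
    i, j = 1, 0
    while i < m:
        if j == 0 or P[i] == P[j]:
            i += 1
            j += 1
            nxt[i] = j
        else:
            j = nxt[j]  # fall back within the pattern itself
    return nxt


def kmp_search(S, T):
    """Return the 0-based index of the first occurrence of T in S, or -1."""
    if not T:
        return 0
    n, m = len(S), len(T)
    nxt = get_next(T)
    s, p = '#' + S, '#' + T
    i, j = 1, 1
    while i <= n and j <= m:
        if j == 0 or s[i] == p[j]:
            i += 1  # i only ever moves forward
            j += 1
        else:
            j = nxt[j]  # i stays put; only j is reset
    return i - m - 1 if j > m else -1


print(kmp_search("goodjob", "job"))  # 4
```

Because i never decreases and every fallback of j is paid for by an earlier advance, the main loop runs in time linear in the length of S, and building the next array is linear in the length of T.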

KMP pattern matching algorithm improvement

An example: the main string is S = "aaaabcde" and the substring is T = "aaaaax", whose next array values are 012345. At the start, when i = 5 and j = 5, we find that "b" and "a" are not equal, so j = next[5] = 4. But "b" and the "a" in the fourth position are still not equal, so j = next[4] = 3, and so on. This again produces many redundant steps, so we improve how the next function is computed.

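Add a check each time a next value is assigned: test whether T[i] == T[j]. If the two characters are equal, falling back to j would only repeat the same failed comparison, so the entry takes the fallback value already stored for j; otherwise it takes j as before.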
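One way to see the effect of this check is to apply it to a next array that has already been computed. The sketch below is an illustration only: the name `nextval_from_next` and its interface (the pattern plus its next values for j = 1..m as a plain list) are assumptions.

```python
def nextval_from_next(T, nxt):
    """Derive the improved array from next values given for j = 1..m."""
    P = '#' + T              # 1-indexed pattern
    nx = [0] + list(nxt)     # 1-indexed next values
    m = len(T)
    nv = [0] * (m + 1)       # nv[1] stays 0
    for j in range(2, m + 1):
        k = nx[j]            # where plain KMP would fall back to (k >= 1 here)
        if P[j] != P[k]:
            nv[j] = k        # a different character: the fallback is worth trying
        else:
            nv[j] = nv[k]    # same character would fail again: skip further back
    return nv[1:]


print(nextval_from_next("aaaaax", [0, 1, 2, 3, 4, 5]))  # [0, 0, 0, 0, 0, 5]
```

For T = "aaaaax" the first five entries collapse to 0, so the mismatch at j = 5 in the example above jumps straight to advancing i instead of stepping through 4, 3, 2, 1.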

Finally, a concrete Python program for computing the improved next array is given:

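import numpy as np

def get_nextval(T):
    """Return the improved next array (nextval) of pattern T, one value per character."""
    T = '#' + T  # pad index 0 so the pattern is 1-indexed, as in the text
    i = 1  # position in the pattern; nextval[i + 1] is filled in next
    j = 0  # candidate fallback position; 0 means there is nothing to fall back to
    nextval = np.zeros(len(T), dtype=int)

    while i < len(T) - 1:
        if j == 0 or T[i] == T[j]:
            i = i + 1
            j = j + 1
            if T[i] != T[j]:
                # Different characters: falling back to j is worth trying
                nextval[i] = j
            else:
                # Same character would fail again: reuse the fallback of j
                nextval[i] = nextval[j]
        else:
            # Mismatch inside the pattern: fall back along the values computed so far
            j = nextval[j]
    # Drop the padding entry so the result lines up with the characters of T
    nextval = np.delete(nextval, 0)
    return nextval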

----------------
This article first appeared on zyairelu.cn
Welcome to my website to comment and discuss
Personal email: [email protected]
----------------

Origin blog.csdn.net/weixin_42731543/article/details/103587078