String pattern matching algorithms (idea + Python implementation)

Naive pattern matching algorithm

An example:

To find the position of the substring T = "job" in the main string S = "goodjob", we usually go through the following steps.

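1. Align the substring "job" with the main string S starting at the first position, i.e. check whether 'g' == 'j';
2. The result is false, so move on and check whether 'o' == 'j';
3. Continue until the first character matches: 'j' == 'j';
4. Continue matching the rest of the substring: S[4:7] == 'job' == T.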

This is the simplest matching algorithm and is easy to implement by brute force, but it performs many redundant comparisons.
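The steps above can be sketched in Python roughly as follows. This is a minimal illustration rather than the author's code: the function name `naive_search` and the choice to return -1 when T does not occur are assumptions made for the example.

```python
def naive_search(S, T):
    """Return the 0-based index of the first occurrence of T in S, or -1."""
    n, m = len(S), len(T)
    # Try every alignment of T against S, from left to right.
    for start in range(n - m + 1):
        k = 0
        # Compare character by character until a mismatch or a full match.
        while k < m and S[start + k] == T[k]:
            k += 1
        if k == m:
            return start
    return -1


print(naive_search("goodjob", "job"))  # 4, i.e. S[4:7] == "job"
```

After every mismatch the comparison restarts from the next alignment, which is where the redundant work comes from.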

KMP pattern matching algorithm

The general idea is to use the information already contained in the pattern string to skip the steps that naive pattern matching performs unnecessarily.

The full process is not repeated here; only a brief description is given.

Let $i$ be the index into the main string and $j$ the index into the substring. The idea can be summed up in two short phrases: "$i$ does not backtrack, $j$ follows $next$".

$i$ does not backtrack: during matching, the index into the main string only increases or stays the same; it never returns to an earlier position of the main string.

$j$ follows $next$: after a mismatch, the index into the substring is reset according to the values of the $next$ array.

Next we explain in detail what the values of the $next$ array are.

Derivation of the $next$ array values

First, the mathematical definition:
$next[j] = \begin{cases} 0, & \text{when } j = 1 \\[2ex] \max\{\, k \mid 1 < k < j \text{ and } p_1 \dots p_{k-1} = p_{j-k+1} \dots p_{j-1} \,\}, & \text{when this set is not empty} \\[2ex] 1, & \text{otherwise} \end{cases}$
For example, T = "abcabx":
$\begin{array}{c|cccccc} j & 1 & 2 & 3 & 4 & 5 & 6 \\ \hline \text{pattern string } T & a & b & c & a & b & x \\ next[j] & 0 & 1 & 1 & 1 & 2 & 3 \end{array}$
Take the first five characters, "abcab" (ab c ab): the prefix and the suffix share two equal characters, "ab", so $next[6] = 2 + 1 = 3$.
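The definition can be checked directly with a small brute-force sketch. It is written for clarity, not speed, and follows the formula literally with 1-based positions; the function name `next_from_definition` is an assumption made for this example.

```python
def next_from_definition(T):
    """Return [next[1], ..., next[m]] for pattern T, using 1-based positions."""
    p = "#" + T  # placeholder at index 0 so that p[1] is the first character
    m = len(T)
    result = []
    for j in range(1, m + 1):
        if j == 1:
            result.append(0)
            continue
        # Collect every k with 1 < k < j whose prefix p_1..p_{k-1}
        # equals the suffix p_{j-k+1}..p_{j-1}.
        candidates = [k for k in range(2, j) if p[1:k] == p[j - k + 1:j]]
        result.append(max(candidates) if candidates else 1)
    return result


print(next_from_definition("abcabx"))  # [0, 1, 1, 1, 2, 3]
```

The output agrees with the table for T = "abcabx".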

The time complexity of the whole algorithm is O(n + m), where n is the length of S and m is the length of T. This is better than the O((n - m + 1) * m) worst case of the naive pattern matching algorithm.
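For reference, the matching loop itself might look like the sketch below. It assumes the usual linear-time way of building $next$ and keeps the 1-based positions used in the definition. The names `get_next` and `kmp_search`, and the 0-based return value (-1 when not found), are choices made for this example only.

```python
def get_next(T):
    """Return the next array for T with 1-based positions (index 0 unused)."""
    p = "#" + T
    m = len(T)
    nxt = [0] * (m + 1)
    i, j = 1, 0
    while i < m:
        if j == 0 or p[i] == p[j]:
            i += 1
            j += 1
            nxt[i] = j
        else:
            j = nxt[j]  # fall back inside the pattern
    return nxt


def kmp_search(S, T):
    """Return the 0-based index of the first occurrence of T in S, or -1."""
    n, m = len(S), len(T)
    s, p = "#" + S, "#" + T  # pad so that both strings are 1-based
    nxt = get_next(T)
    i, j = 1, 1
    while i <= n and j <= m:
        if j == 0 or s[i] == p[j]:
            i += 1  # i only ever moves forward
            j += 1
        else:
            j = nxt[j]  # i stays where it is; j follows next
    return i - m - 1 if j > m else -1


print(kmp_search("goodjob", "job"))  # 4
```

Note that `i` is never decreased anywhere in `kmp_search`, which is exactly the "$i$ does not backtrack" rule.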

KMP pattern matching algorithm improvement

An example: with the main string S = "aaaabcde" and the substring T = "aaaaax", the next array values are 012345. At the start, when i = 5 and j = 5, we find that "b" and "a" are not equal, so j = next[5] = 4. At this point "b" and the "a" in the fourth position are still not equal, so j = next[4] = 3, and so on. This also involves many redundant steps, so we improve the way the next function is computed.

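Each time a next value is assigned, add a check of whether $T[i] = T[j]$: if the two characters differ, $j$ is stored as before; if they are equal, the value already stored at position $j$ is reused instead.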
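To see the effect of this check on the example, the sketch below computes the plain next array and then derives the improved values from it. The function name `next_and_nextval` is hypothetical, and deriving the improved array in a second pass is a simplification chosen for illustration; the rule applied is the same check described above.

```python
def next_and_nextval(T):
    """Return (next, nextval) for pattern T as lists for positions 1..m."""
    p = "#" + T  # pad so that pattern positions are 1-based
    m = len(T)

    # Plain next array.
    nxt = [0] * (m + 1)
    i, j = 1, 0
    while i < m:
        if j == 0 or p[i] == p[j]:
            i += 1
            j += 1
            nxt[i] = j
        else:
            j = nxt[j]

    # Improved array: falling back to k is pointless when p[k] == p[j],
    # because the same comparison would fail again, so reuse nextval[k].
    nextval = [0] * (m + 1)
    for j in range(2, m + 1):
        k = nxt[j]
        nextval[j] = nextval[k] if p[j] == p[k] else k

    return nxt[1:], nextval[1:]


print(next_and_nextval("aaaaax"))
# ([0, 1, 2, 3, 4, 5], [0, 0, 0, 0, 0, 5])
```

With the improved values, the mismatch at j = 5 falls back straight to 0 instead of stepping through 4, 3, 2 and 1.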

Finally, here is a concrete Python program that computes the improved next array (nextval):

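import numpy as np

def get_nextval(T):
    """Return the improved next array (nextval) of pattern T as a NumPy array."""
    T = '#' + T  # pad index 0 so that pattern positions are 1-based
    i = 1        # pattern position currently being extended
    j = 0        # current fallback position (0 means no prefix matches)
    nextval = np.zeros(len(T), dtype=int)

    while i < len(T) - 1:
        if j == 0 or T[i] == T[j]:
            i = i + 1
            j = j + 1
            if T[i] != T[j]:
                # Characters differ, so falling back to j is worthwhile
                nextval[i] = j
            else:
                # Same character would mismatch again; reuse nextval[j]
                nextval[i] = nextval[j]
        else:
            # Mismatch inside the pattern: fall back
            j = nextval[j]
    nextval = np.delete(nextval, 0)  # drop the padding entry at index 0
    return nextval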

This article first appeared on zyairelu.cn.