Two pattern string matching method (BF / KMP algorithm)

Foreword

String, also known as the string, it is limited by a sequence of zero or more characters consisting of string, and sequentially stores the same chain stores two ways can be employed to store, locate positioning substring problem (in the main stream mode match) is one of a string of the most important operations, and different algorithms with different efficiencies, we are learning to compare two string pattern matching method today:

Simple pattern matching algorithm (Brute-Force algorithm, referred to as BF algorithm)
KMP pattern matching algorithm

Simple pattern matching algorithm (BF algorithm)

BF algorithm is a conventional pattern matching algorithm, its idea is:

The first round : a first sub-string is compared with the main character in the first character string
- If equal , then the comparison continues with the main string of the second character substring
- If equal , comparing a second round
The second round : the first character string is compared with the second main character sub-string ......
The first round N : compared in turn continues until all matches

Illustration:

first round:

second round:

...... same principle, an intermediate step is omitted

Fifth round:

Sixth round:

Code:

After reading the text and illustrations to explain, we have to achieve such a hands-algorithm

Briefly summarized above steps is:

Each main character strings match with the beginning of the substring matching is more successful under sub-string of a string with the main match, the match fails then the next string comparison with the main sub-strings, it is clear that we can use two pointers point to the main and sub-string of a character string, to implement such an algorithm

Matches, return to the main position of the substring in the first occurrence of the string, the match fails -1, empty string is returned substring 0

int String::bfFind(const String &s, int pos) const {
    //主串和子串的指针，i主串，j子串
    int i, j;
    //主串比子串小，匹配失败，curLenght为串的长度
    if (curLength < s.curLenght)
        return -1;
    
    while (i < curLength && j < s.curLength) {
        //对应字符相等，指针后移
        if (data[i] == s.data[j])
            i+, j++;
        else {  //对应字符不相等
            i = i -j + 1;   //主串指针移动
            j = 0; //子串从头开始
        }
        //返回子串在主串的位置
        if (j >= s.curLength) 
            return (i - s.curLength);
        else return -1;    
    }
}

Note: Code algorithm only reflect ideas, a specific definition is not given

This algorithm is simple and easy to understand, but there is a big drawback, and that is the need to go back several times, inefficient, if the main string is 000000000001 substring is 00001, which means that every one should compare the substring the last character before the match fails, there is no better way to do it? The following KMP pattern matching algorithm is a good solution to this problem

KMP pattern matching algorithm

如果仅仅进行一些少量数据的运算，可能你甚至觉得BF算法也还行，起码是很容易写出来的，毕竟能跑的就是好程序，但是一旦数据量增大，你就会发现有一些 “无用功” 真的会大大的拖慢你的速度

KMP模式配算法是由 D.E.Knuth，J.H.Morris，V.R.Pratt 三位前辈提出的，它是一种对朴素模式匹配算法的改进，核心就是利用匹配失败后的信息，尽量减少子主串的匹配次数，其体现就是主串指针一直往后移动，子串指针回溯

图示说明：

下面所表示的是朴素模式匹配算法的过程，我们看看如果使用KMP算法的思想，哪些步骤是可以省略掉的

① 中前五个元素，均互相匹配，知道第六个元素才匹配失败，按照BF算法来说，就直接进行 ② ③ 操作，但是，我们可以发现，子串中的前三个元素 a b c 均不是相同的，但是在 ① 中已经与主串相匹配，所以子串分别与主串中的第二第三个元素匹配一定是不匹配的，所以图中的 ② ③ 均可以省略

在 ① 中子串中的第一第二个元素 ab 和第四第五个元素 ab 是相同的，且第四第五个元素 ab 已经与主串中的第四第五个元素匹配成功，这意味着，子串中第一第二个元素 ab 一定与主串中第四第五个元素相匹配，所以 ④ ⑤ 步骤可以省略

如果按照这种思路，上面的例子只需要执行 ① 和 ⑥ 就可以了

next 数组值推导

(一) 主串指针是否需要回溯

我们观察上面的两种过程，BF算法-①②③④⑤⑥，KMP算法-①⑥，如果我们现在假定有两个指针，i 和 j，分别指向主串和子串中的所处位置，从上图我们可以知道，主串指针，也就是 i 的值在 ① 的状态下，指针指向6的位置，而在 ②③④⑤ 中却分别指向了2345，而在 ⑥ 中仍指向6的位置

这说明，朴素模式匹配算法，主串的 i 值会不断的进行回溯，但是 KMP模式匹配算法将这种没必要的回溯省略掉了，所以减少了执行次数

(二) 子串指针优化总结

既然主串指针不进行回溯，那么还可以优化的就是子串指针了，一般会遇到两种情况我们举两个例子：

如果子串为 abcdef，主串为abcdexabcdef，当第一轮匹配到第六个字符f和x的时候，匹配失败了，这个时候如果按照朴素模式匹配，就需要拿子串的首元素a去分别和主串的bcde进行比较，但是由于子串f元素前的元素中没有相同的元素，并且与主串匹配，所以a与主串中的2-5号元素即 bcde 都是不可能相匹配的，所有这几部都可以省略，直接让a和主串中的x去匹配
如果子串为abcabx，主串为abcababcax，在第一轮中，前五个元素子主串分别相匹配，第六个元素位置出错，按照朴素模式匹配，我们需要拿子串首元素a，依次与主串中的a后面的元素匹配，但是子串前面三个字符abc是不相等的，按照我们第一种情况的经验，就直接跳过这些步骤了，所有我们直接拿子串a与主串第四个元素a进行比较就可以了，但是我们发现，子串中出错的位置x前的串 abcab 的前缀和后缀都是 ab，既然第一轮的时候，已经匹配成功，那就意味着，子串中的第一第二个元素ab一定与主串中第四第五个元素 ab相等，所以这个步骤也可以省略，也就直接可以拿子串前缀ab后面的c开始于a进行比对，这也就是我们上面图中例子的详细思路

总结：所以我们得出规律，子串指针的值取决于，子串前后缀元素的相似程度

想要应用到具体代码中，我们可以把子串位置变化的 j 值定义成一个next数组，且长度与子串长度相同

$$next[j]=
\begin{cases}
-1 && 当j = 0\
max & {k|0<k<j 且 "T_0 T_1...T_k-_1" = "T_j-_k T_j-_k+_1 ...T_j-_1"} & 当集合不为空时\
0 &&其他情况 \
\end{cases}$$

情况1：当 j = 0 时，next[j] = -1, 表示子串指针指向下标为0的元素的时候匹配失败，子串无法回溯，(j不能赋值-1) ，此时将主串指针后移一位，子串不，进行下一轮比较
情况2：在已经匹配的子串中，存在相同的前缀串 T₀ T₁ ... T_k-1 和后缀串 T_j-k T_j-k+1 ... T_j-1，子串指针则回溯到next[j] = k的位置，然后进行下一趟比较，例如：子串 abcabc 有相同前缀和后缀ab 所以子串指针回溯到 c的位置
情况3：在已经匹配的子串，若不存在相等的前缀和后缀，则主串指针不动，子串指针回溯到 j = 0 的位置，然后进行下一趟比较

例：主串 S = “abc520abc520abcd”, 子串 T = "abc520abcd" ，利用 KMP算法匹配过程

子串 next 数组

j	0 1 2 3 4 5 6 7 8 9
子串	a b c 5 2 0 a b c d
next[j]	-1 0 0 0 0 0 0 1 2 3

可以看到，在指针 i = 9 且 j = 9 的时候，匹配失败，此时 next[9] = 3 ，所以子串指针回溯到下标 j = 3 的位置也就是元素 5 的位置，进行第二轮比较，然后正好全部匹配成功

(三) 求next数组算法实现

void Stirng::getNext(const String &t, int *next) {
    int i = 0, j = -1;
    next[0] = -1;
    while (i < t.curLength - 1) {
        if ((j == -1) || t[i] == t[j]) {
            ++i, ++j;
            next[i] = j;
        }else{
            j = next[j];
        }
    }
}

KMP算法代码实现

有了 next 数组的铺垫，我们就可以来实现KMP算法了

匹配成功返回子串在主串中第一次出现的位置，失败返回-1，子串为空串返回0

int String::kmpFind(const String &t, int pos) {
    //不允许申请大小为0的数组
    if (t,curLength == 0) return 0;
    //如果主串比子串小，匹配失败
    if(t.curLength < t.curLength) return -1;
    //主串指针i，子串指针j
    int i = 0, j = 0;
    int *next = new int[t.curLrngth];
    getNext(t,next);
    while (i < curLength && j < t,curLength) {
        if (j == -1 || data[i] == t.data[j])    //情况12
            i++, j++;
        else    //情况3
            j = next[j];
    }
    delete []next;
    if (j > t.curLength)
        return (i - t.curLength)
    else
        return -1;
}

KMP模式匹配算法改进

有一种特殊情况的出现，使得我们不得不考虑KMP算法的改进

那就是子串中有多个连续重复的元素，例如主串 S=“aaabcde” 子串T=“aaaaax” 在主串指针不动，移动子串指针比较这些值，其实有很多无用功，因为子串中前5个元素都是相同的a，所以我们可以省略掉这些重复的步骤

void String::getNextVal(const String &t, int *nextVal) {
    int i = 0, j = -1;
    nextVal[0] = -1;
    while (i < t.curLength -1) {
        if ((k == -1) || (t[i] == t[j])) {
            ++i, ++j;
            if (t[i] != t[j])
                nextVal[i] = j;
            else
                nextVal[i] = nextVal[j];
        }
        else 
            j = nextVal[j];
    }
}

这种改进的核心就在于增加了对子串中 t[i] 和 t[j] 是否相等的判断，相等则直接将 nextVal[j] 的值赋给 nextVal[i]

总结

在BF算法中，当主串和子串不匹配的时候，主串和子串你的指针都需要回溯，所以导致了该算法时间复杂度比较高为 O(nm) ，空间复杂度为 O(1) 注：虽然其时间复杂度为 O(nm) 但是在一般应用下执行，其执行时间近似 O(n+m) 所以仍被使用

KMP算法，利用子串的结构相似性，设计next数组，在此之上达到了主串不回溯的效果，大大减少了比较次数，但是相对应的却牺牲了存储空间，KMP算法时间复杂度为 O(n+m) 空间复杂度为 O(n)

结尾：

如果文章中有什么不足，或者错误的地方，欢迎大家留言分享想法，感谢朋友们的支持！

如果能帮到你的话，那就来关注我吧！如果您更喜欢微信文章的阅读方式，可以关注我的公众号

在这里的我们素不相识，却都在为了自己的梦而努力 ❤

一个坚持推送原创开发技术文章的公众号：理想二旬不止