Understanding and Implementing the KMP Algorithm

 
 

1. The principle of the KMP algorithm:

  This part is reproduced from: https://www.cnblogs.com/c-cloud/p/3224788.html

  

String matching is one of the fundamental tasks of computers.

For example, given the string "BBC ABCDAB ABCDABCDABDE", I want to know whether it contains another string, "ABCDABD".

Many algorithms can accomplish this task, and the Knuth-Morris-Pratt algorithm (KMP for short) is one of the most commonly used. It is named after its three inventors; the K stands for the famous computer scientist Donald Knuth.

This algorithm is not easy to understand. There are many explanations online, but they are laborious to read. I didn't really understand it until I read Jake Boxer's article. Below, I try to explain the KMP algorithm in my own words, in a way that is easy to follow.

1.

First, the first character of the string "BBC ABCDAB ABCDABCDABDE" is compared with the first character of the search term "ABCDABD". Because B does not match A, the search term is shifted one position to the right.

2.

Because B again does not match A, the search term is shifted one more position to the right.

3.

And so on, until a character in the string matches the first character of the search term.

4.

Then compare the next character of the string with the next character of the search term; they match as well.

5.

Continue until a character in the string does not match the corresponding character of the search term.

6.

At this point, the most natural reaction is to shift the entire search term one position to the right and compare again from its first character. This works, but it is inefficient, because the "search position" has to move back to positions that have already been compared, and those comparisons are repeated.
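The naive approach described in this step can be sketched as follows. This is only an illustration for comparison with KMP; the function name `naiveFind` and the use of `std::string` are assumptions of this sketch, not part of the original article.

```cpp
#include <cstddef>
#include <string>

// Naive search (hypothetical helper): try every alignment of the search term
// and restart the comparison from its first character each time.
// Returns the 0-based index of the first match, or -1 if there is none.
int naiveFind(const std::string& text, const std::string& pattern) {
    const std::size_t n = text.size(), m = pattern.size();
    if (m > n) return -1;
    for (std::size_t start = 0; start + m <= n; ++start) {
        std::size_t k = 0;
        // Characters already compared at earlier alignments are compared again here.
        while (k < m && text[start + k] == pattern[k]) ++k;
        if (k == m) return static_cast<int>(start);
    }
    return -1;
}
```

For the strings in this article, `naiveFind("BBC ABCDAB ABCDABCDABDE", "ABCDABD")` returns 15. In the worst case the inner loop runs up to m times for each of the n alignments, which is the repeated work KMP avoids.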

7.

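A basic fact is that when the space fails to match D, you already know that the previous six characters are "ABCDAB". The idea of the KMP algorithm is to make use of this known information: instead of moving the "search position" back to positions that have already been compared, keep moving it forward, which improves efficiency.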

8.

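How is this done? For the search term, you can compute a Partial Match Table. How this table is generated is explained later; for now it is enough to know how to use it. For "ABCDABD" the values are 0, 0, 0, 0, 1, 2, 0, one per character (they are derived in step 15).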

9.

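When the space fails to match D, the previous six characters "ABCDAB" are known to match. From the table, the "partial match value" for the last matched character, B, is 2, so the number of positions to shift is computed with the following formula:

  shift = number of matched characters - corresponding partial match value

Because 6 - 2 equals 4, the search term is shifted 4 positions to the right.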
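The shift formula can be written as a small function. This is a minimal sketch: the name `shiftAmount` is hypothetical, the table is assumed to be passed in as a `std::vector<int>` with one entry per character of the search term, and the case of zero matched characters is handled by shifting one position, as in steps 1 and 2.

```cpp
#include <vector>

// Hypothetical helper: how far to slide the search term after a mismatch.
// table[i] is the partial match value of the first i + 1 characters of the search term;
// matched is the number of characters that matched before the mismatch.
int shiftAmount(const std::vector<int>& table, int matched) {
    if (matched == 0) return 1;           // nothing matched: move one position
    return matched - table[matched - 1];  // matched characters - partial match value
}
```

With the table {0, 0, 0, 0, 1, 2, 0} for "ABCDABD", `shiftAmount(table, 6)` gives 4 and `shiftAmount(table, 2)` gives 2.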

10.

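Because the space does not match C, the search term must move further. Now the number of matched characters is 2 ("AB"), and the corresponding "partial match value" is 0. So shift = 2 - 0 = 2, and the search term is shifted 2 positions to the right.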

11.

Because the space does not match A, the search term is shifted one more position to the right.

12.

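Compare character by character until C is found not to match D. Then shift = 6 - 2, and the search term is shifted another 4 positions to the right.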

13.

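Compare character by character up to the last character of the search term; a complete match is found, so the search is done. To continue searching (that is, to find all matches), shift = 7 - 0 and move the search term 7 positions to the right; this is not repeated here.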
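The whole walkthrough can be sketched as a loop that slides the search term by the computed shift and keeps the characters already known to match. The function name `searchWithTable` is hypothetical, and the sketch assumes the caller supplies a valid partial match table for the search term.

```cpp
#include <string>
#include <vector>

// Hypothetical helper: returns the 0-based start index of every match.
// table[i] must be the partial match value of the first i + 1 characters of pattern.
std::vector<int> searchWithTable(const std::string& text,
                                 const std::string& pattern,
                                 const std::vector<int>& table) {
    std::vector<int> hits;
    const int n = static_cast<int>(text.size());
    const int m = static_cast<int>(pattern.size());
    if (m == 0) return hits;
    int start = 0;    // current alignment of the search term in the text
    int matched = 0;  // characters already known to match at this alignment
    while (start + m <= n) {
        // Continue comparing from the first character that is not yet known to match.
        while (matched < m && text[start + matched] == pattern[matched]) ++matched;
        if (matched == m) hits.push_back(start);
        if (matched == 0) {
            ++start;  // nothing matched: move one position
        } else {
            const int keep = table[matched - 1];  // partial match value
            start += matched - keep;              // shift = matched - partial match value
            matched = keep;                       // these characters need no re-comparison
        }
    }
    return hits;
}
```

Called with "BBC ABCDAB ABCDABCDABDE", "ABCDABD" and the table {0, 0, 0, 0, 1, 2, 0}, it visits the alignments 4, 8, 10, 11 and 15 described in the steps and reports a single match at index 15. The position in the text never moves backward, which is where the efficiency comes from.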

14.

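Next, how the Partial Match Table is generated.

First, two concepts are needed: "prefix" and "suffix". The "prefixes" of a string are all of its leading substrings that exclude the last character; the "suffixes" are all of its trailing substrings that exclude the first character.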

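15.

The "partial match value" is the length of the longest element shared by the "prefixes" and the "suffixes". Taking "ABCDABD" as an example,

  - "A": the prefixes and suffixes are both empty sets, so the length of the shared element is 0;

  - "AB": the prefixes are [A], the suffixes are [B], so the length of the shared element is 0;

  - "ABC": the prefixes are [A, AB], the suffixes are [BC, C], so the length of the shared element is 0;

  - "ABCD": the prefixes are [A, AB, ABC], the suffixes are [BCD, CD, D], so the length of the shared element is 0;

  - "ABCDA": the prefixes are [A, AB, ABC, ABCD], the suffixes are [BCDA, CDA, DA, A], the shared element is "A", with length 1;

  - "ABCDAB": the prefixes are [A, AB, ABC, ABCD, ABCDA], the suffixes are [BCDAB, CDAB, DAB, AB, B], the shared element is "AB", with length 2;

  - "ABCDABD": the prefixes are [A, AB, ABC, ABCD, ABCDA, ABCDAB], the suffixes are [BCDABD, CDABD, DABD, ABD, BD, D], so the length of the shared element is 0.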
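The definition above translates directly into a brute-force computation. This sketch is written for clarity rather than speed, and the name `partialMatchTable` is hypothetical. For each leading part of the search term it tries the longest candidate first and stops at the first prefix that is also a suffix.

```cpp
#include <string>
#include <vector>

// Hypothetical helper: partial match table computed straight from the definition.
// table[i] is the length of the longest string that is both a prefix and a suffix
// of pattern[0..i], excluding pattern[0..i] itself.
std::vector<int> partialMatchTable(const std::string& pattern) {
    const int m = static_cast<int>(pattern.size());
    std::vector<int> table(m, 0);
    for (int i = 0; i < m; ++i) {
        for (int len = i; len > 0; --len) {
            // Prefix pattern[0..len-1] versus suffix pattern[i-len+1..i].
            if (pattern.compare(0, len, pattern, i - len + 1, len) == 0) {
                table[i] = len;
                break;
            }
        }
    }
    return table;
}
```

For "ABCDABD" this yields {0, 0, 0, 0, 1, 2, 0}, matching the list above. The cost grows roughly with the cube of the pattern length in the worst case, so a real implementation would build the table incrementally, in time linear in the pattern length.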

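16.

The essence of "partial match" is that the head and the tail of a string sometimes repeat. For example, "ABCDAB" contains two "AB"s, so its "partial match value" is 2 (the length of "AB"). When the search term moves, shifting the first "AB" 4 positions to the right (string length - partial match value) brings it to the position of the second "AB".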

Code implementation in C++:
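#include <cstring>
#include <iostream>

using namespace std;
char a[100];   // text to search in
char b[100];   // search term (pattern)
int f[255];    // f[i] = partial match value of the first i characters of the pattern

// Build the failure array f for pattern P.
void getFail(char* P,int* f){
    int m=strlen(P);
    f[0]=0;
    f[1]=0;
    for(int i=1;i<m;i++){
        int j=f[i];
        // Fall back to shorter prefixes until P[i] extends one of them.
        while(j&&P[i]!=P[j]) j=f[j];
        f[i+1]=P[i]==P[j]?j+1:0;
    }
}
// Return the index of the first occurrence of P in T, or -1 if there is none.
int find (char* T,char* P,int* f){
    int n=strlen(T),m=strlen(P);
    getFail(P,f);
    int j=0;   // number of characters of P currently matched
    for(int i=0;i<n;i++){
        // On a mismatch, reuse the longest prefix of P that is still matched.
        while(j&&P[j]!=T[i]) j=f[j];
        if(P[j]==T[i]) j++;
        if(j==m) return i-m+1;
    }
    return -1;
}
int main(){
    // Reads two whitespace-delimited words: the text, then the search term.
    cin>>a>>b;
    cout<<find(a,b,f);
}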
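The array filled by getFail is indexed differently from the partial match table in the explanation: f[k] holds the partial match value of the first k characters, so f[i + 1] corresponds to table entry i. The sketch below repeats the same recurrence with `std::string` and `std::vector` to make this visible; the name `failureArray` is hypothetical.

```cpp
#include <string>
#include <vector>

// Hypothetical helper: same recurrence as getFail, returning the array by value.
// The result has pattern.size() + 1 entries; f[0] and f[1] are always 0.
std::vector<int> failureArray(const std::string& pattern) {
    const int m = static_cast<int>(pattern.size());
    std::vector<int> f(m + 1, 0);
    for (int i = 1; i < m; ++i) {
        int j = f[i];
        // Fall back to shorter prefixes until pattern[i] extends one of them.
        while (j && pattern[i] != pattern[j]) j = f[j];
        f[i + 1] = (pattern[i] == pattern[j]) ? j + 1 : 0;
    }
    return f;
}
```

For "ABCDABD" the result is {0, 0, 0, 0, 0, 1, 2, 0}: the table 0, 0, 0, 0, 1, 2, 0 from the explanation, shifted right by one position.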

