A detailed explanation of the KMP algorithm (data structures), with thousands of words and pictures

The blogger aims to use plain language and pictures to help everyone understand the KMP algorithm. If you have any questions, feel free to contact the blogger at any time. Let's study it together.

Preface

  • Demonstration:
    Given a string s = "abcabcabd" and another string t = "abcabd", determine whether t occurs in s. If it does, return the subscript in s of the character that lines up with t[0]. For this example the answer is 3.
  • Implementation using the naive algorithm:
    [Figure: step-by-step naive matching of t against s]
    This algorithm is also called the BF (brute-force) algorithm. When a match attempt fails, we backtrack in s and try again one position later. The time complexity is O(nm). It is not hard to see, however, that the characters of s are compared over and over. Counting positions from 1, with step k trying t at position k of s, step 1 already tells us that s[2] is b, s[3] is c, and s[4] is a, while the first character of t is a. So t can be moved straight to step 4. Steps 2 and 3 are completely redundant and can be skipped, which leads to today's topic: the KMP algorithm.
    The KMP algorithm is the most difficult and most important algorithm in the string chapter of data structures. It is difficult because the code is elegant and compact, yet it rests on deep ideas. Anyone who truly understands the code can claim a solid understanding of KMP. Many parts of the algorithm are genuinely hard to grasp, and many books scare off countless readers as soon as they lay out the concepts. The blogger will try to present the KMP algorithm and its improvement in as much detail as possible, with pictures. The post opens with a tribute to the three founders of the KMP algorithm; once you understand how it works, you cannot help admiring their ingenuity.
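
The naive (BF) search described above can be sketched roughly as follows. This is a minimal illustration on plain C strings rather than the blogger's own code; the name bf_index is an assumption made for this sketch, and it uses 0-based subscripts.

```c
#include <assert.h>
#include <string.h>

/* Naive (BF) search on plain C strings.
   Returns the 0-based subscript of the first occurrence of t in s,
   or -1 if t does not occur. bf_index is a hypothetical name. */
int bf_index(const char *s, const char *t)
{
    assert(s != NULL && t != NULL);
    int n = (int)strlen(s), m = (int)strlen(t);
    int i = 0, j = 0;
    while (i < n && j < m) {
        if (s[i] == t[j]) {
            i++;
            j++;
        } else {
            i = i - j + 1;   /* backtrack: retry one position after the last start */
            j = 0;
        }
    }
    return (j >= m) ? i - m : -1;
}
```

The line `i = i - j + 1` is the backtracking step that KMP removes: every time it runs, characters of s that were already examined get compared again.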

KMP illustrated

Case where the last character does not match

In this case the main string does not need to backtrack. We only need to check whether the character marked '?' is equal to G. If it is not, we simply do i++, leave j pointing at 1, and then compare the character at j=1 with the character at the new i to continue the search.
Case where the penultimate character does not match:

Here we know that the fifth element cannot be the first element of the string we are looking for, and the same is true for the sixth and seventh. At the eighth element, a successful match becomes possible again.

Other cases

Mismatches at the first and second positions are handled in the same way as above. Listing where j should go back to in each case, we get:

If the current two characters match, then i++, j++.
If a mismatch occurs when j=1, then i++ should be done, and j should still be 1.
If a mismatch occurs when j=2, j should be returned to 1.
If a mismatch occurs when j=3, j should be returned to 1.
If a mismatch occurs when j=4, j should be returned to 1.
If a mismatch occurs when j=5, j should be returned to 2.
If a mismatch occurs when j=6, j should be returned to 1.

We store these values in the next array. Since array subscripts start from 0, every value above is reduced by one: int next[6]={-1,0,0,0,1,0};
Let's look at the specific implementation:

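int Index_KMP(SqString S, SqString T, int next[])
{
	int i = 0, j = 0;   // i indexes the main string S, j indexes the substring T
	while (i < S.length && j < T.length)
	{
		if (j == -1 || S.data[i] == T.data[j])
		{
			// the characters match, or j has fallen back past the front: advance both
			i++;
			j++;
		}
		else
			j = next[j];   // mismatch: i stays where it is, j falls back

	}
	if (j >= T.length)
		return (i - T.length);   // subscript in S where the match starts
	else
		return -1;               // no match
}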
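
To try the hand-derived table without the SqString type, the same loop can be written against plain C strings. This is only a sketch: kmp_with_table and google_next are names assumed here, and the table is the one worked out by hand for "GOOGLE" above, so it is valid for that substring only.

```c
#include <assert.h>
#include <string.h>

/* Search for t in s using a fallback table supplied by the caller.
   next needs one entry per character of t, with next[0] == -1.
   kmp_with_table is a hypothetical name chosen for this sketch. */
int kmp_with_table(const char *s, const char *t, const int next[])
{
    assert(next[0] == -1);
    int n = (int)strlen(s), m = (int)strlen(t);
    int i = 0, j = 0;
    while (i < n && j < m) {
        if (j == -1 || s[i] == t[j]) {
            i++;
            j++;
        } else {
            j = next[j];   /* i never moves backwards */
        }
    }
    return (j >= m) ? i - m : -1;
}

/* The hand-derived table for "GOOGLE" from the text (0-based). */
static const int google_next[6] = { -1, 0, 0, 0, 1, 0 };
```

Calling kmp_with_table("GOOGOOGLE", "GOOGLE", google_next) exercises the j=5 case from the list: after "GOOG" matches and 'L' fails, j falls back to the second character instead of restarting from the first.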

When a match fails, we simply set j=next[j]. Of course, the next array above was worked out only for the specific string discussed above; the next array is not fixed and depends on the substring. Cheer up, and let's continue with the following content:

next array implementation

Now a question arises: how do we search for an arbitrary substring in the main string? After all, we only analyzed the substring "GOOGLE" above. What if the substring changes? How do we get the next array that belongs to it? Next we analyze how to obtain the next array.
First let's look at the case of abcabd:

Here we set next[6]=3 (counting positions from 1).
Now look at another case. We can see that matching can continue after shifting the substring by two positions. However, shifting it by four positions would also line up. Can we shift by four?
We can see that the result we need is already found at the third element. If we shifted by four instead, then regardless of whether a match is found there, it would not be the result we need, because an earlier match would be skipped. So should we look for the longest matching part first, in other words the match that is found earliest? This leads to the following concepts:

Prefix of a string: a substring that contains the first character of the string but does not include the last character.
Suffix of a string: a substring that contains the last character of the string but does not include the first character.

For example:

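String: abcdab
Set of prefixes: {a, ab, abc, abcd, abcda}
Set of suffixes: {b, ab, dab, cdab, bcdab}

The longest string that appears in both sets is ab, so the longest equal prefix and suffix of abcdab has length 2.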
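
The definition above can be checked directly by brute force. The sketch below tries every candidate length from longest to shortest; longest_border is a name assumed for this illustration, and the function is meant for checking small examples by hand, not as the efficient method.

```c
#include <assert.h>
#include <string.h>

/* Length of the longest string that is both a prefix and a suffix of the
   first len characters of t, where neither may be the whole string.
   Brute force; longest_border is a hypothetical name. */
int longest_border(const char *t, int len)
{
    assert(t != NULL && len >= 0 && (size_t)len <= strlen(t));
    for (int k = len - 1; k > 0; k--) {
        /* compare the prefix t[0..k-1] with the suffix t[len-k..len-1] */
        if (memcmp(t, t + (len - k), (size_t)k) == 0)
            return k;
    }
    return 0;
}
```

For "abcdab" the loop rejects lengths 5, 4, and 3 and stops at 2, which corresponds to the shared element ab in the two sets.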

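When the j-th character fails to match, call the first j-1 characters (positions 1 to j-1) S. Then next[j] = (length of the longest equal prefix and suffix of S) + 1, counting positions from 1. With this rule the next array becomes very easy to find, and we can also consider implementing it in code. Note that the code in this post counts subscripts from 0, so the value stored there is one less. The next formula is summarized in the following figure:
[Figure: formula for next]
In the figures below, the long bar represents the substring, the red part represents the longest equal prefix and suffix matched so far, and the blue part represents t.data[j].
[Figures: the substring bar, with the matched prefix and suffix in red and t.data[j] in blue]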
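
Since the formula figure is not reproduced here, the following is a hedged reconstruction from the rule stated above, not the blogger's original image. It assumes positions are counted from 1 and writes the substring as $t_1 t_2 \cdots t_m$; the symbol $k$ is introduced here for illustration.

```latex
\mathrm{next}[j] =
\begin{cases}
0, & j = 1 \\[4pt]
\max\{\, k \mid 1 < k < j,\ t_1 \cdots t_{k-1} = t_{j-k+1} \cdots t_{j-1} \,\}, & \text{if such a } k \text{ exists} \\[4pt]
1, & \text{otherwise}
\end{cases}
```

Here $k-1$ is the length of the longest equal prefix and suffix of the first $j-1$ characters, which matches the "+1" in the rule. For "GOOGLE" this gives 0, 1, 1, 1, 2, 1 for $j = 1, \dots, 6$; subtracting 1 from each value gives the 0-based array {-1,0,0,0,1,0} used in the code.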

Based on the above, we write the following code:

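#define MaxSize 100   // assumed capacity; the original post does not give a value

typedef struct
{
	char data[MaxSize];
	int length;   // length of the string
} SqString;
// typedef gives the struct a type name, so a string can be declared as SqString T.
void GetNext(SqString T, int next[])
{
	int j, k;
	j = 0; k = -1;
	next[0] = -1;   // no characters come before the first character, so assign -1
	// The largest subscript in next is T.length-1, and every assignment to next
	// happens after j++, so on the last pass through the while loop j is T.length-2.
	while (j < T.length - 1)
	{
		if (k == -1 || T.data[j] == T.data[k])   // k is -1, or the compared characters are equal
		{
			j++; k++;
			next[j] = k;
			// When the characters match, j and k move forward together.
			// Trace the next array of "aaaaab" to see what this step means.
		}
		else
		{
			k = next[k];
			// next[k] is the subscript to fall back to when the character at k does not match.
			// After it is assigned to k, the while loop tests again; T.data[k] is now
			// the character just after the longest equal prefix.
		}
	}
}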
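
To check the loop against the tables derived by hand, here is a sketch of the same procedure on a plain C string. The name build_next is an assumption for this illustration; the logic is intended to mirror GetNext above, and the caller must supply an array with one entry per character of t.

```c
#include <assert.h>
#include <string.h>

/* Fill next[0..m-1] for the substring t of length m, using the same loop
   as GetNext but on a plain C string. build_next is a hypothetical name. */
void build_next(const char *t, int next[])
{
    int m = (int)strlen(t);
    assert(m > 0);
    int j = 0, k = -1;
    next[0] = -1;
    while (j < m - 1) {
        if (k == -1 || t[j] == t[k]) {
            j++;
            k++;
            next[j] = k;   /* k = length of the longest equal prefix and suffix of t[0..j-1] */
        } else {
            k = next[k];   /* try the next shorter candidate prefix */
        }
    }
}
```

Running it on "GOOGLE" should reproduce the hand-made table, and running it on "aaaaab" (the string suggested in the comments above) shows k growing by one at every step because each new character extends the previous match.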

Time complexity of the KMP algorithm

Now let's analyze the time complexity of the KMP algorithm. KMP has an extra step, computing the next array, which consumes additional space; KMP is an algorithm that trades space for time. Let the length of the main string S be n and the length of the substring T be m. Computing the next array takes O(m) time. Since the main string never backtracks during matching, the number of comparisons in that phase is O(n). So the total time complexity of the KMP algorithm is O(m+n), and the space complexity is O(m). Compared with the O(m*n) complexity of simple pattern matching (the BF algorithm), the speed-up is very large. Spending a little space in exchange for such a large gain in time is well worth it, and the idea itself is also very important.
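
One way to see the difference in practice is to count character comparisons. The sketch below instruments both searches on plain C strings; count_bf, count_kmp, and MAX_PAT are names assumed for this illustration, and the counts are only meant to show the trend, not to serve as a benchmark.

```c
#include <assert.h>
#include <string.h>

#define MAX_PAT 64   /* assumed upper bound on the substring length for this sketch */

/* Number of character comparisons made by the naive (BF) search. */
long count_bf(const char *s, const char *t)
{
    int n = (int)strlen(s), m = (int)strlen(t);
    int i = 0, j = 0;
    long cmp = 0;
    while (i < n && j < m) {
        cmp++;
        if (s[i] == t[j]) { i++; j++; }
        else { i = i - j + 1; j = 0; }
    }
    return cmp;
}

/* Number of character comparisons made by KMP, including building next. */
long count_kmp(const char *s, const char *t)
{
    int n = (int)strlen(s), m = (int)strlen(t);
    assert(m > 0 && m <= MAX_PAT);
    int next[MAX_PAT];
    long cmp = 0;

    int j = 0, k = -1;
    next[0] = -1;
    while (j < m - 1) {                       /* phase 1: build next, O(m) */
        if (k == -1) { j++; k++; next[j] = k; }
        else {
            cmp++;
            if (t[j] == t[k]) { j++; k++; next[j] = k; }
            else k = next[k];
        }
    }

    int i = 0;
    j = 0;
    while (i < n && j < m) {                  /* phase 2: search, O(n) */
        if (j == -1) { i++; j++; }
        else {
            cmp++;
            if (s[i] == t[j]) { i++; j++; }
            else j = next[j];
        }
    }
    return cmp;
}
```

On an input such as a long run of 'a' followed by 'b', searched for "aaaab", the naive count grows roughly with n times m, while the KMP count stays within 2(n+m).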

KMP code implementation

int Index_KMP(SqString S, SqString T)
{
	int next[MaxSize], i = 0, j = 0;
	GetNext(T, next);
	while (i < S.length && j < T.length)
	{
		if (j == -1 || S.data[i] == T.data[j])
		{
			i++; j++;
		}
		else
			j = next[j];   // i stays where it is; j falls back to the subscript stored in next
	}
	if (j >= T.length)
		return (i - T.length);   // subscript of the first character of the matched substring
	else
		return (-1);             // no-match flag: the search failed
}

However, our KMP can still be optimized, so why does it need to be optimized when it is already so powerful?

nextval array implementation

Take the main string s="aaabaaaaac" and the substring t="aaac" as an example. When 'b' fails to match 'c', KMP falls back and compares 'b' with the 'a' before it, which also fails. After that there is no need to compare 'b' with the earlier 'a' characters: each character we fall back to is the same as the one that just failed, so it cannot match either. However, the KMP algorithm will still compare 'b' with each of those 'a' characters. This is where we can improve.
So what should we do?

In fact, we can briefly describe it as follows: let b be the next value of position a. If the character at position a is equal to the character at position b, then the nextval value of position a is the nextval value of position b. If they are not equal, the nextval value of position a is simply the next value of position a.

This should be considered a relatively simple explanation!
We modify the code for finding the next array to:

void GetNextval(SqString T, int nextval[])
{
	int j = 0, k = -1;
	nextval[0] = -1;
	while (j < T.length - 1)   // stop at T.length-1 so that j++ never moves past the last character
	{
		if (k == -1 || T.data[j] == T.data[k])
		{
			j++; k++;
			if (T.data[j] != T.data[k])
				nextval[j] = k;
			else
				nextval[j] = nextval[k];
			// In the else case, nextval[j] is the nextval of the character that
			// a mismatch at T.data[j] would fall back to.
		}
		else  k = nextval[k];
	}
}
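
The rule stated in prose above can also be applied in a separate pass over an existing next array, which may be easier to follow than the fused loop. This is a sketch under that assumption rather than the blogger's method; nextval_from_next is a name chosen here, and next is expected to have been computed already (for example by GetNext).

```c
#include <assert.h>
#include <string.h>

/* Derive nextval from an already computed next array, following the rule
   in the text. nextval_from_next is a hypothetical name. */
void nextval_from_next(const char *t, const int next[], int nextval[])
{
    int m = (int)strlen(t);
    assert(m > 0 && next[0] == -1);
    nextval[0] = -1;
    for (int j = 1; j < m; j++) {
        int b = next[j];                 /* position that j falls back to */
        if (t[j] == t[b])
            nextval[j] = nextval[b];     /* same character would fail again: skip it */
        else
            nextval[j] = next[j];        /* different character: keep the next value */
    }
}
```

For t="aaac" the next array is {-1,0,1,2}. The first three positions all hold 'a', so their nextval entries collapse to -1, while the last position keeps 2 because 'c' differs from the 'a' it falls back to.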

Summary

The core of the KMP algorithm is computing the next (nextval) array. The blogger also struggled with KMP for a long time before it finally clicked today, and came away with even more admiration for the pioneers of computing. Don't give up on KMP: work through it carefully and you will get there in the end. Keep going!
If this article has helped you, how about giving the blogger a free like? It means a lot to the blogger!

Origin blog.csdn.net/m0_65038072/article/details/127553415