Detailed Explanation of the KMP Algorithm (Data Structures)

Table of contents

1. What is the KMP algorithm?

2. The origin of the KMP algorithm

2.1 Problems to be solved

2.2 The method that came to mind at the beginning

2.3 The KMP algorithm was born

3. Detailed explanation of KMP algorithm

4. Realization of KMP algorithm

5. Improvement of KMP algorithm


1. What is the KMP algorithm?

  • The KMP algorithm is an improved string-matching algorithm, that is, an algorithm that can quickly find a substring (the pattern string) in a main string. It was proposed by D. E. Knuth, J. H. Morris, and V. R. Pratt, so people call it the Knuth-Morris-Pratt algorithm (KMP algorithm for short).
  • The core of the KMP algorithm is to use the information gained from a failed match to reduce the number of comparisons between the pattern string and the main string, and so match quickly. In practice this is done with a next array, which records the partial-match information of the pattern string itself.

2. The origin of the KMP algorithm

2.1 Problems to be solved

This is the problem mentioned in the "string" chapter of a data structures course: finding a substring within a main string.
For example, suppose we have the main string arr1[] = "abababc" and the substring arr2[] = "abc".

2.2 The method that came to mind at the beginning

The first thing that comes to mind is the brute-force solution. We compare the substring with the main string character by character. If the first characters are equal, we go on to compare the second characters, and so on, until the whole substring has been matched, and then we return the position at which the substring starts in the main string. As soon as a pair of characters fails to match, the main string goes back to the character after the one where this attempt started, and the substring goes back to its first character.
As the example above shows, the brute-force solution backtracks many times, and both the main string and the substring have to backtrack. Much of this backtracking is wasted work. Is there a way to go back only to a specific position, without the many unnecessary steps of the brute-force solution? That question leads to the focus of this article: the KMP algorithm.
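
The brute-force procedure described above can be sketched in C as follows. This is only a minimal illustration: the function name BruteForceIndex and the use of plain C strings are assumptions made for the example, not something taken from the article's listings.

```c
#include <assert.h>
#include <string.h>

/* Brute-force substring search (illustrative sketch; the name
 * BruteForceIndex and the C-string interface are assumptions).
 * Returns the index in s where t first occurs, or -1 if it does not. */
int BruteForceIndex(const char *s, const char *t)
{
    assert(s != NULL && t != NULL);
    int n = (int)strlen(s), m = (int)strlen(t);
    int i = 0, j = 0;
    while (i < n && j < m)
    {
        if (s[i] == t[j])
        {
            i++; j++;          /* characters match: advance both */
        }
        else
        {
            i = i - j + 1;     /* main string goes back to the character after this attempt's start */
            j = 0;             /* substring goes back to its first character */
        }
    }
    return (j >= m) ? i - m : -1;
}
```

With the strings from section 2.1, BruteForceIndex("abababc", "abc") returns 4. Note how i is moved backwards on every mismatch; this is the backtracking that KMP avoids.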

2.3 The KMP algorithm was born

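The KMP algorithm is an improved string-matching algorithm that quickly finds a substring in the main string. Unlike the brute-force solution, the main string never backtracks; only the substring falls back, and only as far as it needs to.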

3. Detailed explanation of KMP algorithm

(1) Let's look at a picture first: the first long bar represents the main string, and the second long bar represents the substring. When the characters at the two pointers do not match, the substring pointer goes back to the first character of the substring. This is why the algorithm is inefficient; it is called the naive matching algorithm.

(2) Now let's see what the smarter KMP matching algorithm does: in the figure below, when the characters at the two pointers do not match, the substring moves differently. It slides forward so that its prefix lines up with the equal suffix of the part already matched.

(3) Once this step is understood, the essence of the KMP algorithm is almost mastered. For each character of the substring, the string before it has a longest equal prefix and suffix (both shorter than the string itself), and the length of that longest equal prefix and suffix is the key to how far we shift. So we compute this length for every position of the substring and store the results in a separate array, the next array. The values of the next array depend only on the substring itself.
So next[i]=j means: the longest equal prefix and suffix of the string before subscript i has length j.
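
To make the meaning of next[i] concrete, here is a small sketch that fills the array directly from the definition, by trying every candidate length. It is deliberately slow and is not how KMP computes the array; the name NextByDefinition is an assumption made for this example, and next[0] is set to -1 to match the convention used in this article's code.

```c
#include <assert.h>
#include <string.h>

/* Fill next[] for pattern t directly from the definition (illustrative
 * sketch; the name NextByDefinition is an assumption). next[0] is -1 and,
 * for i >= 1, next[i] is the largest k < i such that the first k characters
 * of t[0..i-1] equal its last k characters. Slow, but easy to check by hand. */
void NextByDefinition(const char *t, int next[])
{
    assert(t != NULL && next != NULL);
    int m = (int)strlen(t);
    if (m == 0) return;
    next[0] = -1;
    for (int i = 1; i < m; i++)
    {
        int k = i - 1;                       /* try the longest candidate first */
        while (k > 0 && memcmp(t, t + i - k, (size_t)k) != 0)
            k--;                             /* shrink until prefix equals suffix */
        next[i] = k;                         /* k == 0: no equal prefix and suffix */
    }
}
```

For the pattern "ababaaab" this fills next with -1, 0, 0, 1, 2, 3, 1, 1. For example, the string before subscript 5 is "ababa", whose longest equal prefix and suffix is "aba", so next[5] is 3.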

4. Realization of KMP algorithm

Before the code, let's analyze the time complexity of the KMP algorithm:
The KMP algorithm has an extra step of computing the next array, which costs a little more space. Let the length of the main string s be n and the length of the substring t be m. Computing the next array takes O(m) time. Because the main string never backtracks during matching, the number of comparisons is proportional to n, so the total time complexity of the KMP algorithm is O(m+n) and the space complexity is O(m). Compared with the O(m*n) worst-case time complexity of naive pattern matching, the speed-up is large. Trading a little space for a large saving in time is well worth it, and the idea is an important one in general.
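
One way to get a feel for the difference is to count character comparisons. The sketch below is an illustration under assumptions: the function names, the C-string interface, and the MAX_PATTERN limit are all made up for this example. The KMP count includes the comparisons spent building the next array.

```c
#include <assert.h>
#include <string.h>

#define MAX_PATTERN 64   /* assumed upper bound on pattern length for this sketch */

/* Count character comparisons made by naive matching of t against s. */
long NaiveComparisons(const char *s, const char *t)
{
    int n = (int)strlen(s), m = (int)strlen(t);
    int i = 0, j = 0;
    long count = 0;
    while (i < n && j < m)
    {
        count++;
        if (s[i] == t[j]) { i++; j++; }
        else { i = i - j + 1; j = 0; }     /* main string backtracks */
    }
    return count;
}

/* Count character comparisons made by KMP, including those spent
 * building the next array. */
long KmpComparisons(const char *s, const char *t)
{
    int n = (int)strlen(s), m = (int)strlen(t);
    assert(m >= 1 && m <= MAX_PATTERN);
    int next[MAX_PATTERN];
    long count = 0;
    int j = 0, k = -1;
    next[0] = -1;
    while (j < m - 1)                      /* build next */
    {
        if (k == -1) { j++; k++; next[j] = k; continue; }
        count++;
        if (t[j] == t[k]) { j++; k++; next[j] = k; }
        else k = next[k];
    }
    int i = 0;
    j = 0;
    while (i < n && j < m)                 /* match: i never moves backwards */
    {
        if (j == -1) { i++; j++; continue; }
        count++;
        if (s[i] == t[j]) { i++; j++; }
        else j = next[j];
    }
    return count;
}
```

On an input such as a long run of 'a' followed by 'b', searched for the substring "aaaab", the naive count grows with both lengths while the KMP count stays within 2*(n+m).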

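  • Let's see how the next array is computed
#define MaxSize 100		// assumed capacity; the original listing does not show this definition
typedef struct
{	
	char data[MaxSize];
	int length;			// length of the string
} SqString;
// SqString is the data structure for a string.
// The typedef names the struct type, so a string can be declared as "SqString t".
void GetNext(SqString t,int next[])		// compute the next values of pattern string t
{
	int j,k;
	j=0;k=-1;
	next[0]=-1;		// no string precedes the first character, so use -1
	while (j<t.length-1) 
	// The largest index in next is t.length-1, and each assignment to next[j] happens after j++,
	// so j is t.length-2 on the last pass through the loop.
	{	
		if (k==-1 || t.data[j]==t.data[k]) 	// k is -1, or the compared characters are equal
		{	
			j++;k++;
			next[j]=k;
			// The characters match, so both indices move forward together.
			// Trace the string "aaaaab" by hand to see what this step does.
			//printf("(1) j=%d,k=%d,next[%d]=%d\n",j,k,j,k);
       	}
       	else
		{
			k=next[k];
			// next[k] is the length of the longest equal prefix and suffix of the string before index k.
			// It is also the index to fall back to when the character at k fails to match.
			// After this assignment the loop tests again, and t.data[k] is now the character
			// just after that longest equal prefix.
			// Why fall back here to compare? The principle is much the same as the KMP matching described above.
			//printf("(2) k=%d\n",k);
		}
	}
}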
  • KMP algorithm code explanation
int KMPIndex(SqString s,SqString t)  // KMP algorithm
{
	int next[MaxSize],i=0,j=0;
	GetNext(t,next);
	while (i<s.length && j<t.length) 
	{
		if (j==-1 || s.data[i]==t.data[j]) 
		{
			i++;j++;  			// advance both i and j
		}
		else j=next[j]; 		// i stays put and j falls back; this is why the substring falls back this way
    }
    if (j>=t.length)
		return(i-t.length);  	// index in s of the first character of the match
    else  
		return(-1);        		// no match
}
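
The listings do not show how to build an SqString or call KMPIndex, so here is a minimal usage sketch. The helper StrAssign and the function FindExample are hypothetical names introduced only for this example, and the value of MaxSize is an assumption. GetNext and KMPIndex are repeated in compact form so that the sketch compiles on its own.

```c
#include <assert.h>
#include <string.h>

#define MaxSize 100          /* assumed capacity; the article does not show its value */

typedef struct
{
    char data[MaxSize];
    int length;
} SqString;

/* Hypothetical helper, not part of the article's listings:
 * copy a C string into an SqString. */
SqString StrAssign(const char *cstr)
{
    SqString str = { {0}, 0 };
    size_t len = strlen(cstr);
    assert(len <= MaxSize);
    memcpy(str.data, cstr, len);
    str.length = (int)len;
    return str;
}

/* GetNext and KMPIndex as in the article, in compact form. */
void GetNext(SqString t, int next[])
{
    int j = 0, k = -1;
    next[0] = -1;
    while (j < t.length - 1)
    {
        if (k == -1 || t.data[j] == t.data[k]) { j++; k++; next[j] = k; }
        else k = next[k];
    }
}

int KMPIndex(SqString s, SqString t)
{
    int next[MaxSize], i = 0, j = 0;
    GetNext(t, next);
    while (i < s.length && j < t.length)
    {
        if (j == -1 || s.data[i] == t.data[j]) { i++; j++; }
        else j = next[j];
    }
    return (j >= t.length) ? i - t.length : -1;
}

/* Search for the substring from section 2.1 in its main string. */
int FindExample(void)
{
    SqString s = StrAssign("abababc");
    SqString t = StrAssign("abc");
    return KMPIndex(s, t);
}
```

FindExample returns 4, the subscript in "abababc" at which "abc" begins.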

5. Improvement of KMP algorithm

Why does the KMP algorithm need to be improved even though it is so powerful?
Let's take a look at an example:
the main string s="aaaaabaaaaac"
substring t="aaaaac"
In this example, when 'b' and 'c' do not match, 'b' is next compared with the 'a' preceding 'c', which obviously does not match either. After falling back again, the character before that 'a' is still 'a'.
We know there is no need to keep comparing 'b' with 'a': the character we fall back to is the same as the one that just failed, so it cannot match either. The plain KMP algorithm makes these comparisons anyway, and this is what we can improve. The improved next array is called the nextval array. The rule can be stated briefly: if the character at position j equals the character at position next[j], then nextval[j] takes the value nextval[next[j]]; if they are not equal, nextval[j] is simply next[j]. For example, the next array and nextval array of the string "ababaaab" are:
subscript | 0  | 1 | 2  | 3 | 4  | 5 | 6 | 7 |
substring | a  | b | a  | b | a  | a | a | b |
next      | -1 | 0 | 0  | 1 | 2  | 3 | 1 | 1 |
nextval   | -1 | 0 | -1 | 0 | -1 | 3 | 1 | 0 |
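
The rule can be applied mechanically to an existing next array. The sketch below does exactly that; it is an illustration only, the name NextvalFromNext and the C-string interface are assumptions, and it is a different (two-pass) route from the single-pass code that follows in this article.

```c
#include <assert.h>
#include <string.h>

/* Derive nextval[] from an already computed next[] using the rule in the
 * text (illustrative sketch; the name NextvalFromNext is an assumption).
 * Both arrays use the convention next[0] = nextval[0] = -1. */
void NextvalFromNext(const char *t, const int next[], int nextval[])
{
    assert(t != NULL && next != NULL && nextval != NULL);
    int m = (int)strlen(t);
    if (m == 0) return;
    nextval[0] = -1;
    for (int j = 1; j < m; j++)
    {
        int k = next[j];                 /* fallback position for j; k >= 0 when j >= 1 */
        if (t[j] == t[k])
            nextval[j] = nextval[k];     /* same character would fail again: skip it */
        else
            nextval[j] = next[j];        /* different character: keep the plain next value */
    }
}
```

Feeding it the next row of the table for "ababaaab" reproduces the nextval row: -1, 0, -1, 0, -1, 3, 1, 0.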

Let's analyze the code:

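void GetNextval(SqString t,int nextval[])  
// compute the nextval values of pattern string t
{
	int j=0,k=-1;
	nextval[0]=-1;
   	while (j<t.length-1) 	// stop at t.length-1 so that t.data[j] and nextval[j] stay in range after j++
	{
       	if (k==-1 || t.data[j]==t.data[k]) 
		{	
			j++;k++;
			// t.data[k] is the character that t.data[j] would fall back to on a mismatch.
			// Why? Without the test below, this line would be next[j]=k, and next[j] is exactly
			// the position to fall back to when t.data[j] fails to match.
			if (t.data[j]!=t.data[k]) 
				nextval[j]=k;
           	else  
				// The fallback character equals t.data[j], so it would fail as well.
				// Take the nextval of the fallback position instead, in effect falling back two levels.
				nextval[j]=nextval[k];
       	}
       	else  k=nextval[k];    	
	}
}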
int KMPIndex1(SqString s,SqString t)    
// improved KMP algorithm
// the only change is that next is replaced by nextval
{
	int nextval[MaxSize],i=0,j=0;
	GetNextval(t,nextval);
	while (i<s.length && j<t.length) 
	{
		if (j==-1 || s.data[i]==t.data[j]) 
		{	
			i++;j++;	
		}
		else j=nextval[j];
	}
	if (j>=t.length)  
		return(i-t.length);
	else
		return(-1);
}
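
To see what the improvement buys on the example from this section, the following sketch runs the same matching loop with either table and counts comparisons against the main string. It is an illustration under assumptions: BuildTable, KmpSearchCounted, the improved flag, and MAX_PATTERN are names invented for this example, and plain C strings replace SqString to keep it short.

```c
#include <assert.h>
#include <string.h>

#define MAX_PATTERN 64   /* assumed upper bound on pattern length for this sketch */

/* Build next (improved == 0) or nextval (improved != 0) for pattern t. */
static void BuildTable(const char *t, int m, int table[], int improved)
{
    int j = 0, k = -1;
    table[0] = -1;
    while (j < m - 1)
    {
        if (k == -1 || t[j] == t[k])
        {
            j++; k++;
            /* nextval skips a fallback character equal to t[j] */
            table[j] = (improved && t[j] == t[k]) ? table[k] : k;
        }
        else k = table[k];
    }
}

/* Run KMP matching and return the match index; *comparisons receives the
 * number of character comparisons made against the main string. */
int KmpSearchCounted(const char *s, const char *t, int improved, long *comparisons)
{
    int n = (int)strlen(s), m = (int)strlen(t);
    assert(m >= 1 && m <= MAX_PATTERN && comparisons != NULL);
    int table[MAX_PATTERN];
    BuildTable(t, m, table, improved);
    int i = 0, j = 0;
    *comparisons = 0;
    while (i < n && j < m)
    {
        if (j == -1) { i++; j++; continue; }   /* fell off the pattern: no comparison */
        (*comparisons)++;
        if (s[i] == t[j]) { i++; j++; }
        else j = table[j];
    }
    return (j >= m) ? i - m : -1;
}
```

For s="aaaaabaaaaac" and t="aaaaac", both variants return the same position, 6, but the nextval variant makes fewer comparisons because it skips the repeated checks of 'b' against 'a'.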

Origin blog.csdn.net/weixin_43313333/article/details/131362556