118- Implementation of the KMP Algorithm for String Matching


#include<stdio.h>
#include<stdlib.h>
#include<string.h> 
#include<assert.h>

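void GetNext(const char *p, int *next) // compute the k values, i.e. fill the next array for pattern p
{
	int lenp = (int)strlen(p);
	if (lenp == 0) return; // nothing to fill
	next[0] = -1;
	if (lenp == 1) return; // next[1] would be out of bounds
	next[1] = 0;
	int i = 1;
	int k = 0;
	while (i + 1 < lenp)
	{
		if (k == -1 || p[i] == p[k])
		{
			next[++i] = ++k; // same as: next[i + 1] = k + 1; i++; k++;
		}
		else
		{
			k = next[k]; // fall back; once k reaches -1 the branch above sets next[i + 1] = 0
		}
	}
}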

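// Time complexity O(n+m), space complexity O(m)
int KMP(const char *s, const char *p, int pos) // KMP search for p in s, starting at index pos
{
	if (s == NULL || p == NULL)  return -1; // check for NULL before calling strlen
	int lens = (int)strlen(s);
	int lenp = (int)strlen(p);
	// an empty pattern or an out-of-range pos is treated as "not found"
	if (lenp == 0 || lenp > lens || pos < 0 || pos > lens)  return -1;
	int i = pos;
	int j = 0;
	int *next = (int *)malloc(sizeof(int) * lenp);
	assert(next != NULL);
	GetNext(p, next);
	while (i < lens && j < lenp)
	{
		if (j == -1 || s[i] == p[j])
		{
			i++;
			j++;
		}
		else
		{
			j = next[j]; // i never moves backward; if j becomes -1, the next pass advances both i and j
		}
	}
	free(next);
	if (j == lenp)
	{
		return i - j;
	}

	return -1;
}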

int main()
{
	const char *s = "ababcabcdabcde";
	const char *p = "abcd";

	printf("%d\n", KMP(s, p, 6)); // search starts at index 6; prints 9
	return 0;
}

Running the program prints 9: the search starts at index 6, and the first occurrence of "abcd" from there begins at index 9.

BF method
The brute-force (BF) method starts from the first character of the main string s and of the substring t and compares the two strings character by character. If a character does not match, the main string backs up to its second character, the substring backs up to its first character, and the comparison starts again. If a character again does not match, the main string backs up to its third character and the substring to its first, and so on, until every character of the substring has been matched.
In the original figures, a vertical line marks equal characters and a lightning-shaped line marks unequal ones.
Let n be the length of the main string and m the length of the substring. In the best case the substring matches the first m characters of the main string and the time complexity is O(m); in the worst case it is O(m*n). The space complexity is O(1): the algorithm uses no extra space and pays for that in time.
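
The BF procedure described above can be sketched in C roughly as follows. The function name bf_index is made up for this illustration and does not come from the original post; it takes plain C strings and returns the index of the first match or -1.

```c
#include <assert.h>
#include <string.h>

/* Brute-force search (illustrative helper, not from the original post).
   Returns the index of the first occurrence of t in s, or -1. */
int bf_index(const char *s, const char *t)
{
    assert(s != NULL && t != NULL);
    int n = (int)strlen(s);
    int m = (int)strlen(t);
    int i = 0, j = 0;
    while (i < n && j < m)
    {
        if (s[i] == t[j])
        {
            i++;
            j++;
        }
        else
        {
            i = i - j + 1; /* main string backs up to the next start position */
            j = 0;         /* substring backs up to its first character */
        }
    }
    return (j == m) ? i - m : -1;
}
```

For example, bf_index("ababcabcdabcde", "abcd") returns 5. The line i = i - j + 1 is the backtracking of the main string that KMP avoids.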

The main idea of KMP is "space for time".
It relies on one concept: the longest equal prefix and suffix of a string (the whole string itself is not counted).
For the string abcdab:
The set of prefixes: {a,ab,abc,abcd,abcda}
The set of suffixes: {b,ab,dab,cdab,bcdab}
So the longest equal prefix and suffix is ab.
A small exercise:
What is the longest equal prefix and suffix of the string abcabfabcab?
Answer: abcab
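
To check answers like these by machine, a naive sketch is enough: try every candidate length, longest first, and compare the first k characters with the last k. The helper name longest_equal_prefix_suffix is an assumption made for this example. It is deliberately the slow, obvious method, not the way KMP itself computes the value.

```c
#include <assert.h>
#include <string.h>

/* Length of the longest proper prefix of str that is also a suffix of str.
   Illustrative helper: checks every candidate length, longest first. */
int longest_equal_prefix_suffix(const char *str)
{
    assert(str != NULL);
    int len = (int)strlen(str);
    for (int k = len - 1; k > 0; k--)
    {
        /* compare the first k characters with the last k characters */
        if (strncmp(str, str + len - k, (size_t)k) == 0)
            return k;
    }
    return 0; /* no non-empty equal prefix and suffix */
}
```

It returns 2 for "abcdab" (the prefix/suffix ab) and 5 for "abcabfabcab" (abcab).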

KMP illustrated

In the figure, the first bar represents the main string and the second bar represents the substring. The red part is the portion of the two strings that has already matched, and the green and blue parts are the mismatched characters in the main string and the substring, respectively.
To be more specific: the figure shows the main string "abcabeabcabcmn" and the substring "abcabcmn". The red part is "abcab", and the mismatch occurs at index 5 ('e' in the main string against 'c' in the substring).
Now that a mismatch has been found, the KMP idea is to shift the substring to the right, and the question is how far. This is where the longest equal prefix and suffix comes in, because the red (matched) part has one too.
In the figure, the gray part is the longest equal prefix and suffix of the red part, here "ab". The shift aligns the longest equal prefix of the red part of the substring with the longest equal suffix of the red part of the main string.
For every character of the substring, the string before it has a longest equal prefix and suffix, and its length determines the shift. We therefore store these lengths in an array called next. The values of next depend only on the substring itself.
So next[i] = j means: the longest equal prefix and suffix of the string before index i has length j.
For the substring t = "abcabcmn" the next array is:
next[0]=-1 (no string precedes index 0, so it is handled as a special case); next[1]=0; next[2]=0; next[3]=0; next[4]=1; next[5]=2; next[6]=3; next[7]=0
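
The table above can be reproduced with a short sketch. The name build_next is chosen for this example only; the loop has the same shape as the GetNext function at the top of the post, written so that next[0] = -1 is the only special case.

```c
#include <assert.h>
#include <string.h>

/* Fills next[0..len-1] for pattern t (illustrative helper).
   The caller must provide room for strlen(t) ints. */
void build_next(const char *t, int *next)
{
    assert(t != NULL && next != NULL);
    int len = (int)strlen(t);
    if (len == 0) return;
    next[0] = -1;
    int j = 0, k = -1;
    while (j < len - 1)
    {
        if (k == -1 || t[j] == t[k])
        {
            j++;
            k++;
            next[j] = k; /* the string before index j has an equal prefix/suffix of length k */
        }
        else
        {
            k = next[k]; /* fall back to the next shorter candidate */
        }
    }
}
```

Called on "abcabcmn" with an array of 8 ints, it fills in -1, 0, 0, 0, 1, 2, 3, 0.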
The length of the gray part is exactly the value stored in next[5], the entry for the mismatched character, and it is also the index in the substring to fall back to. So j = next[5] = 2, and the next step is to compare s[5] with t[next[5]]. This is the most elegant part, and it is why the code of the KMP algorithm can be so concise.
Compared with BF, the KMP algorithm has the extra step of building the next array, which costs a little more space. Let the length of the main string s be n and the length of the substring t be m. Building the next array takes O(m) time. Because the main string never backtracks during matching, the matching phase takes O(n) comparisons, so the total time complexity of KMP is O(m+n) and the space complexity is O(m). Compared with the O(m*n) of naive pattern matching, the speedup is substantial. Trading a little space for a large gain in time is well worth it, and the idea is an important one in general.
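
One way to see the difference between O(m*n) and O(m+n) is to count character comparisons. The sketch below is an illustration only: bf_search_count and kmp_search_count are hypothetical names, and both report the number of comparisons through an output parameter. A main string made of many 'a' characters followed by one 'b', searched for a pattern such as "aaaab", is a typical bad case for BF.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Brute-force search; *cmp receives the number of character comparisons. */
int bf_search_count(const char *s, const char *t, long *cmp)
{
    assert(s != NULL && t != NULL && cmp != NULL);
    int n = (int)strlen(s), m = (int)strlen(t);
    int i = 0, j = 0;
    *cmp = 0;
    while (i < n && j < m)
    {
        (*cmp)++;
        if (s[i] == t[j]) { i++; j++; }
        else { i = i - j + 1; j = 0; } /* main string backtracks */
    }
    return (j == m) ? i - m : -1;
}

/* KMP search; *cmp receives the number of character comparisons
   made in the matching phase (building next is not counted). */
int kmp_search_count(const char *s, const char *t, long *cmp)
{
    assert(s != NULL && t != NULL && cmp != NULL);
    int n = (int)strlen(s), m = (int)strlen(t);
    *cmp = 0;
    if (m == 0) return 0; /* empty pattern: treated here as matching at index 0 */

    int *next = malloc(sizeof(int) * (size_t)m);
    assert(next != NULL);
    next[0] = -1;
    int j = 0, k = -1;
    while (j < m - 1) /* build the next array */
    {
        if (k == -1 || t[j] == t[k]) { j++; k++; next[j] = k; }
        else k = next[k];
    }

    int i = 0;
    j = 0;
    while (i < n && j < m)
    {
        if (j == -1) { i++; j++; continue; } /* nothing to compare, just advance */
        (*cmp)++;
        if (s[i] == t[j]) { i++; j++; }
        else j = next[j];                    /* only the pattern index falls back */
    }
    free(next);
    return (j == m) ? i - m : -1;
}
```

Both functions return the same index. On inputs like the one described, the KMP count stays within 2n, while the BF count grows with the product of the two lengths.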

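Backtracking while building the next array
In the figures, the bar represents the substring, the red part represents the longest equal prefix and suffix matched so far (its length is k), and the blue part represents t.data[j].
If t.data[j] equals t.data[k], the equal prefix and suffix grows by one character. If not, k backtracks to next[k], the length of the longest equal prefix and suffix of the red part itself, and the comparison is repeated. This is the same fallback step that the matching phase uses.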

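#define MaxSize 100 // assumed capacity; the original post does not show this definition
typedef struct      // assumed sequential-string type, inferred from the t.data / t.length usage below
{
	char data[MaxSize];
	int length;
} SqString;

void GetNextval(SqString t,int nextval[])
// compute the nextval values from pattern string t
{
	int j=0,k=-1;
	nextval[0]=-1;
	while (j<t.length-1) // stop at length-1 so that t.data[j] and nextval[j] stay in range after j++
	{
		if (k==-1 || t.data[j]==t.data[k])
		{
			j++;k++;
			// t.data[k] is the character we would fall back to if t.data[j] mismatches.
			// Why? Without this if test, the code here would simply be next[j]=k,
			// and next[j] is exactly the position to fall back to when t.data[j] mismatches.
			if (t.data[j]!=t.data[k])
				nextval[j]=k;
			else
				// Falling back to an identical character would fail again, so take
				// the nextval of that character instead. Put crudely: the index
				// reached after falling back two levels on a mismatch.
				nextval[j]=nextval[k];
		}
		else  k=nextval[k];
	}
}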
int KMPIndex1(SqString s,SqString t)
// improved KMP algorithm
// identical to the basic version except that next is replaced by nextval
{
	int nextval[MaxSize],i=0,j=0;
	GetNextval(t,nextval);
	while (i<s.length && j<t.length)
	{
		if (j==-1 || s.data[i]==t.data[j])
		{
			i++;j++;
		}
		else j=nextval[j];
	}
	if (j>=t.length)
		return(i-t.length);
	else
		return(-1);
}
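
Since the listing above depends on the SqString type, here is a self-contained sketch of the same nextval idea on plain C strings. The names build_nextval and kmp_index_nextval are invented for this example, and returning 0 for an empty pattern is a choice made here, not something the post specifies.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Fills nextval[0..len-1] for pattern t; the caller provides strlen(t) ints. */
void build_nextval(const char *t, int *nextval)
{
    assert(t != NULL && nextval != NULL);
    int len = (int)strlen(t);
    if (len == 0) return;
    int j = 0, k = -1;
    nextval[0] = -1;
    while (j < len - 1)
    {
        if (k == -1 || t[j] == t[k])
        {
            j++;
            k++;
            if (t[j] != t[k])
                nextval[j] = k;          /* ordinary next value */
            else
                nextval[j] = nextval[k]; /* same character would fail again: skip a level */
        }
        else
        {
            k = nextval[k];
        }
    }
}

/* KMP search using nextval; returns the index of the first match or -1. */
int kmp_index_nextval(const char *s, const char *t)
{
    assert(s != NULL && t != NULL);
    int n = (int)strlen(s), m = (int)strlen(t);
    if (m == 0) return 0; /* assumption: empty pattern matches at index 0 */
    int *nextval = malloc(sizeof(int) * (size_t)m);
    assert(nextval != NULL);
    build_nextval(t, nextval);
    int i = 0, j = 0;
    while (i < n && j < m)
    {
        if (j == -1 || s[i] == t[j]) { i++; j++; }
        else j = nextval[j];
    }
    free(nextval);
    return (j == m) ? i - m : -1;
}
```

For the pattern "aaaab" the plain next array is -1, 0, 1, 2, 3, while nextval is -1, -1, -1, -1, 3. A mismatch on any of the leading 'a' characters therefore jumps straight past them instead of retrying the same character several times.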

Origin blog.csdn.net/LINZEYU666/article/details/111758677