String matching (BF algorithm + KMP algorithm)

Locating a substring within a string is usually called string pattern matching, and it is one of the most important operations in string processing systems. The process takes two strings and decides whether the second string T (called the pattern string) is a substring of the first string S (called the main string); if it is, it returns the subscript in S at which the match begins.

(All of the code below is written in C.)

BF algorithm:

Basic idea:

The BF (Brute Force) algorithm compares the first character of the main string S with the first character of the pattern T. If they are equal, it goes on to compare the following characters of both strings; otherwise it starts again, comparing the second character of S with the first character of T. This repeats until either every character of T has been matched, which means the match succeeded, or the characters of S run out, which means the match failed.

The process is illustrated in the following picture (from Lazy Cat Teacher's "Data Structure" course notes):

 That is, after each failed match, i (the index into the main string S) goes back to one position after where the current attempt started, and j (the index into the pattern string T) returns to the start of T, and the comparison begins again.

Code:

(Here I implement the BF algorithm in two different ways.)

#include <stdio.h>
#include <string.h>
#include <stdlib.h>

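int isSubstring_BF(char*S,char*T){//first implementation of the BF algorithm
	int i,j,temp;
	int lenS=strlen(S),lenT=strlen(T);//compute the lengths once, not on every loop test
	if(lenS<lenT)//the pattern is longer than the main string
	return -1;
	for(i=0;i+lenT<=lenS;i++){//stop once fewer than lenT characters remain in S
		temp=i;
		for(j=0;j<lenT;j++){
			if(S[temp]!=T[j])//mismatch: give up on this starting position
			break;
			else{
				temp++;
			}
		}
		if(j==lenT)
		return i;//start subscript of the match (temp is the position just past its end)
	}
	return -1;
}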

int isSubstring_BF1(char*S,char*T){//second implementation of the BF algorithm
	int i=0,j=0,start=0;
	while(S[i]!='\0'&&T[j]!='\0'){
		if(S[i]==T[j]){
			i++;
			j++;
		}else{
			start++;
			i=start;
			j=0;
		}
	}
	if(T[j]=='\0')
	return start;
	else
	return -1;
}


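int main(void){
	char S[50],T[50];
//	while(1){
	printf("Enter the first string S: ");
	if(scanf("%49s",S)!=1)//%49s leaves room for the terminating '\0'
	return 1;
	printf("Enter the second string T: ");
	if(scanf("%49s",T)!=1)
	return 1;
	if(isSubstring_BF1(S,T)!=-1)
	printf("T is a substring of S!\n");
	else
	printf("T is not a substring of S!\n");
//}
	return 0;
}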

KMP algorithm:

Basic idea:

To reduce the number of comparisons, the KMP algorithm takes a different approach. Put simply, when a match fails, i does not move back; only j moves, and how j moves is explained below. The only case in which i advances without a match is when the character pointed to by i differs even from the first character of T, so that j has nowhere left to fall back to; i then moves forward one position.

Brief description of the process:

As the example below shows, when characters match, both i and j move forward. When a match fails, that is, when the main string S and the pattern string T differ at some subscript, only j is moved.

Looking at the part of the pattern string T that has already been matched, its prefix and suffix share a common part of length 2.

In the main string S, the characters ab (in the yellow box in the figure) have already been verified to equal the suffix ab of the matched part of T, and the prefix of T is that same ab. So the next step can skip over this common prefix in T and continue comparing from the character right after it.

 A note on terms:

Prefix: all substrings containing the first letter and not including the last letter

Suffix: all substrings including the last letter but not the first letter

Unlike the BF algorithm, the KMP algorithm skips the common prefix in T, so the pattern moves forward by:

Shift distance = number of characters of T already matched - length of the longest equal prefix and suffix of that matched part

 

 Of course, if the matched part of the pattern string T has no equal prefix and suffix, j returns to the beginning of T just as in BF; the difference is that i still does not move back.

So how does j know exactly which position in the pattern string T to fall back to? For this a new array is introduced, called the prefix table, next[]:

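Its subscripts correspond to the subscripts of the pattern string T, and each element is the length of the longest matching prefix before that position ("matching" means a prefix that equals a suffix). In the code below, next[0] is set to -1 as a marker meaning that there is nothing left to fall back to.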

The method of calculating the next array:

 Put another way: next[] is built from the pattern string T alone, so it cannot know at which position a mismatch with S will occur. It therefore has to record, for every position, where j should fall back to if the mismatch happens there.

Take the following figure as an example (the underlined parts are the equal prefixes and suffixes):

Next array generation function:

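void getNext(char*T,int*next){//build the prefix table
	int j=-1;//end of the prefix
	int i=0;//end of the suffix
	next[0]=-1;
	int len=strlen(T);
	while(i<len-1){//stop at len-1: the body writes next[i+1], and next[len] would be out of bounds
		if(j==-1||T[i]==T[j]){
			i++;
			j++;//both move forward
			next[i]=j;//length of the longest equal prefix and suffix before position i
		}
		else{
			j=next[j];//the prefix falls back; same principle: find an equal prefix and suffix and continue right after it
		}
	}
	printf("Prefix table: ");
	for(int k=0;k<len;k++)
	printf("%d ",next[k]);
	printf("\n");
}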

The fallback j=next[j] here works on the same principle as the matching described above: the pattern string is matched against itself.

 Full code:

#include <stdio.h>
#include <string.h>
#include <stdlib.h>
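void getNext(char*T,int*next);//declared here because it is defined after its first use

int isSubstring_KMP(char*S,char*T){
	int i=0,j=0;
	int len1=strlen(S);
	int len2=strlen(T);
	if(len2==0)//an empty pattern matches at subscript 0 (and a zero-length next[] would be invalid)
	return 0;
	int next[len2];
	getNext(T,next);//fill in the prefix table
//	printf("%d %d\n",len1,len2);
	while(i<len1&&j<len2){//S[i]!='\0'&&T[j]!='\0' cannot be used here: T[j] would be out of bounds when j==-1
		if(j==-1||S[i]==T[j]){//test j==-1 first so that T[-1] is never read; j==-1 means S[i] differs even from T[0]
			i++;
			j++;
		}else{
			j=next[j];//slide the pattern to the next candidate position given by the prefix table; i does not move
		}
//		printf("i and j are now: %d %d\n",i,j);
	}
	if(j==len2)
	return i-len2;//start subscript of the match
	else
	return -1;
}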

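void getNext(char*T,int*next){//build the prefix table
	int j=-1;//end of the prefix
	int i=0;//end of the suffix
	next[0]=-1;
	int len=strlen(T);
	while(i<len-1){//stop at len-1: the body writes next[i+1], and next[len] would be out of bounds
		if(j==-1||T[i]==T[j]){
			i++;
			j++;
			next[i]=j;//length of the longest equal prefix and suffix before position i
		}
		else{
			j=next[j];//fall back
		}
	}
	printf("Prefix table: ");
	for(int k=0;k<len;k++)
	printf("%d ",next[k]);
	printf("\n");
}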

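int main(void){
	char S[50],T[50];
//	while(1){
	printf("Enter the first string S: ");
	if(scanf("%49s",S)!=1)//%49s leaves room for the terminating '\0'
	return 1;
	printf("Enter the second string T: ");
	if(scanf("%49s",T)!=1)
	return 1;
	if(isSubstring_KMP(S,T)!=-1)
	printf("T is a substring of S!\n");
	else
	printf("T is not a substring of S!\n");
//}
	return 0;
}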

Supplement: The next[] prefix table can be optimized here, for example:

S string: aaabaaaab

T string: aaaab

Prefix table next: -1 0 1 2 3

Optimized nextval: -1 -1 -1 -1 3

With the prefix table produced by the getNext function above, matching T against S hits a mismatch at the fourth position (subscript 3), where S has b and T has a. Following next[j], three more comparisons are then made: i=3,j=2; i=3,j=1; i=3,j=0. These are wasted, because the first three characters of T are all equal to its fourth character, so none of them can match the fourth character of the main string either. The pattern can slide forward 4 characters in one step.

So how can this be done?

Add a check to the prefix table function: after i and j have advanced, if T[i] equals T[j], then falling back to j would fail against the same character of S, so store the value already recorded for position j (nextval[j]) instead of j.

Code:

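void get_nextval(char*T,int*nextval){//build the optimized prefix table
	int j=-1;//end of the prefix
	int i=0;//end of the suffix
	nextval[0]=-1;
	int len=strlen(T);
	while(i<len-1){//stop at len-1 so that nextval[len] is never written
		if(j==-1||T[i]==T[j]){
			i++;
			j++;
			if(T[i]!=T[j])
			nextval[i]=j;//length of the longest equal prefix and suffix before position i
			else{
			nextval[i]=nextval[j];//T[j] would mismatch the same way, so reuse its fallback
			}
		}
		else{
			j=nextval[j];//fall back
		}
	}
	printf("Optimized prefix table: ");
	for(int k=0;k<len;k++)
	printf("%d ",nextval[k]);
	printf("\n");
}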

 I am still a beginner, so corrections are welcome if I have made any mistakes! ~

Origin blog.csdn.net/m0_63223213/article/details/125694656