Data structure: BF algorithm & KMP algorithm

BF algorithm


The BF (Brute Force) algorithm is the straightforward, brute-force approach to string pattern matching.

The idea of the BF algorithm is to compare the first character of the target string S (the main string) with the first character of the pattern string T (the substring). If they are equal, continue by comparing the second character of S with the second character of T. If they are not equal, start over by comparing the second character of S with the first character of T. Continue in this way until the final matching result is obtained.

The BF algorithm is inefficient because after every failed attempt the main string cursor has to fall back to the position right after the one where that attempt started.

BF algorithm: time complexity O(m*n), space complexity O(1), where m is the length of the main string and n is the length of the pattern string.

The BF algorithm diagram is as follows:
[Figure: BF algorithm diagram]

The C code is as follows:

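#include <string.h>

// Search for pattern T in main string S, starting at index pos.
// Returns the index of the first match, or -1 if there is none.
// Time complexity: O(m*n); space complexity: O(1)

int BF(const char* S, const char* T, int pos)  // S: main string, T: pattern string
{
	if (S == NULL || T == NULL || pos < 0 || (size_t)pos >= strlen(S))
	{
		return -1;
	}

	int i = pos;
	int j = 0;
	int len1 = (int)strlen(S);  // length of the main string
	int len2 = (int)strlen(T);  // length of the pattern string

	while (i < len1 && j < len2)
	{
		if (S[i] == T[j])
		{
			i++;
			j++;
		}
		else
		{
			/* i falls back to the position after the start of this attempt (i-j+1); j falls back to 0 */
			i = i - j + 1;
			j = 0;
		}
	}

	if (j >= len2)  // the whole pattern was consumed: match found
		return i - j;
	else
		return -1;
}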

The BF algorithm takes O(m*n) time because each time a match fails, the main string cursor i must fall back to the position after the start of the current attempt, and the substring cursor j must fall back to 0. The KMP algorithm improves how i and j fall back: it avoids the unnecessary backtracking and reduces the time complexity to O(m+n).
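
To make the cost of the fallback concrete, here is a minimal sketch that counts character comparisons. The function name bf_comparisons is hypothetical and not part of the original code; it runs the same loop as BF from position 0 and returns how many times S[i] is compared with T[j].

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not from the original article): same scan as BF(),
   but returns the number of character comparisons instead of the match
   position. */
size_t bf_comparisons(const char *S, const char *T)
{
    assert(S != NULL && T != NULL);
    size_t len1 = strlen(S);
    size_t len2 = strlen(T);
    size_t i = 0, j = 0, count = 0;

    while (i < len1 && j < len2) {
        count++;                 /* one comparison of S[i] with T[j] */
        if (S[i] == T[j]) {
            i++;
            j++;
        } else {
            i = i - j + 1;       /* main string cursor falls back */
            j = 0;               /* pattern cursor restarts */
        }
    }
    return count;
}
```

With S = "aaaaaaab" and T = "aaab", this sketch returns 20: each of the five starting positions costs four comparisons, because i keeps falling back over characters that were already matched.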

KMP algorithm


The KMP algorithm is an improved string matching algorithm proposed by D. E. Knuth, J. H. Morris, and V. R. Pratt; the name KMP comes from the first letters of their last names. The core of the KMP algorithm is to use the information gained from a failed match to minimize the number of comparisons between the pattern string and the main string, so that matching is fast.

KMP algorithm: time complexity O(m+n), space complexity O(n) for the next array, where m is the length of the main string and n is the length of the pattern string.

In the BF algorithm, every time there is a mismatch, the main string cursor i must be rolled back to i-j+1, and the substring cursor j must be rolled back to 0

In fact, when there is a mismatch, i does not need to fall back at all; only j needs to fall back to a suitable position. In the BF diagram below, after the first mismatch i falls back to i-j+1 and j falls back to 0, but in the comparisons that follow, i returns to its earlier position while j ends up at a different position (the red box in the figure).

[Figure: BF algorithm diagram, with a red box marking i back at its earlier position]

The KMP algorithm therefore keeps the main string cursor i in place when there is a mismatch, and uses the information from the failed comparison to move the substring cursor j back to the appropriate position.

So the key is to find the position that the substring cursor j should fall back to after a mismatch. We use a next array to store, for each position of the substring, the position that j falls back to.

Copy the BF algorithm code above and change only how i and j fall back; this gives the framework of the KMP algorithm:

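int KMP(const char* S, const char* T, int pos)  // S: main string, T: pattern string
{
	if (S == NULL || T == NULL || pos < 0 || (size_t)pos >= strlen(S))
	{
		return -1;
	}

	int i = pos;
	int j = 0;
	int len1 = (int)strlen(S);  // length of the main string
	int len2 = (int)strlen(T);  // length of the pattern string

	while (i < len1 && j < len2)
	{
		if (S[i] == T[j])
		{
			i++;
			j++;
		}
		else  // change how i and j fall back
		{
			// i does not fall back
			// j falls back to k (still to be filled in)
		}
	}

	if (j >= len2)  // the whole pattern was consumed: match found
		return i - j;
	else
		return -1;
}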

The next task is to find the position k that j falls back to, that is, to compute the next array.


Find next array

Prerequisite knowledge: prefixes and suffixes of a string

Prefix: a substring starting with the first character.
Proper prefix (also translated as "true prefix"): a prefix other than the original string itself.
Suffix: a substring ending with the last character.
Proper suffix (also translated as "true suffix"): a suffix other than the original string itself.
Example: ababc
Prefixes: a, ab, aba, abab, ababc
Proper prefixes: a, ab, aba, abab
Suffixes: c, bc, abc, babc, ababc
Proper suffixes: c, bc, abc, babc
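
The definitions above can be sketched in C as follows. The helper names proper_prefix and proper_suffix are hypothetical; each copies the prefix or suffix of length k into a caller-supplied buffer, and the restriction k < strlen(s) is what excludes the original string itself.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helpers: copy the prefix / suffix of length k of s into out.
   Requiring 1 <= k < strlen(s) excludes the whole string.
   out must have room for k + 1 bytes. */
void proper_prefix(const char *s, size_t k, char *out)
{
    assert(k >= 1 && k < strlen(s));
    memcpy(out, s, k);               /* the first k characters */
    out[k] = '\0';
}

void proper_suffix(const char *s, size_t k, char *out)
{
    size_t len = strlen(s);
    assert(k >= 1 && k < len);
    memcpy(out, s + (len - k), k);   /* the last k characters */
    out[k] = '\0';
}
```

For "ababc" and k = 3, the first helper produces "aba" and the second produces "abc", matching the lists in the example.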

In the following example, a mismatch occurs when i is 4 and j is 4.

[Figure: mismatch between S and T at i = 4, j = 4]

Just before the mismatch, a proper suffix of the matched part of S is equal to a proper prefix of T. These equal parts do not need to be compared again, so the comparison can continue directly after them: j falls back to the position just after the equal string, which means the new value of j is the length of the equal string. The key is therefore to find the length k of this equal string for each mismatch.

In the first comparison, the four characters of S and T before the mismatch are pairwise equal, so a proper suffix of that part of S is also a proper suffix of the corresponding part of T. Hence k is the length of the longest string that is both a proper prefix and a proper suffix of the part of T before the mismatch.

[Figure: the comparison continues with j moved back to k]

Therefore, to find k, take the part of the substring before the mismatch and find its longest proper prefix that equals a proper suffix; the length is k. Equivalently:

Find the two longest equal proper substrings of the part of the substring before the mismatch that meet the following conditions:
1. One starts with the first character
2. The other ends with the last character before the mismatch
k is the length of these substrings
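
The rule above can be checked directly with a brute-force sketch that tries every candidate length, longest first. The function name longest_equal_len is hypothetical; this is only a way to verify k by hand, not the method KMP uses to build the next array.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: for a mismatch at subscript j of T, return the length
   k of the longest proper prefix of T[0..j-1] that equals a proper suffix
   of T[0..j-1]. Tries every length, longest first. */
int longest_equal_len(const char *T, int j)
{
    assert(T != NULL && j >= 0 && (size_t)j <= strlen(T));
    for (int k = j - 1; k > 0; k--) {
        /* compare T[0..k-1] with T[j-k..j-1] */
        if (memcmp(T, T + (j - k), (size_t)k) == 0)
            return k;
    }
    return 0;  /* no equal proper prefix and proper suffix */
}
```

For T = "ababc" and a mismatch at j = 4, the part before the mismatch is "abab"; its proper prefix "ab" equals its proper suffix "ab", so the sketch returns 2.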

The k value corresponding to each character of the T string in the above example is as follows:
Convention: next[0] = -1, next[1] = 0

Pattern string T:   a   b   a   b   c
next:              -1   0   0   1   2

Note: there is another commonly used representation of the next array, in which next[0] is 0, next[1] is 1, and every subsequent k value is 1 larger than in the table above. The idea of the algorithm is the same and either representation works; for ease of coding, next[0] = -1 is used here.
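
As a small illustration of how the two representations relate, the sketch below shifts every entry by 1. The function name shift_next_by_one is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: convert a next array that starts with -1 into the
   representation that starts with 0 by adding 1 to every entry. */
void shift_next_by_one(const int *next, size_t len, int *out)
{
    assert(next != NULL && out != NULL);
    for (size_t i = 0; i < len; i++)
        out[i] = next[i] + 1;
}
```

Applied to the table for ababc (-1 0 0 1 2), this gives 0 1 1 2 3.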

The values of the next array can be worked out by hand; the next step is to compute them in a program:

  1. For any string, next[0] = -1 and next[1] = 0.
  2. Suppose next[j] = k, that is, the longest equal proper prefix and proper suffix of the part of the string before subscript j has length k: P0...P(k-1) == P(j-k)...P(j-1). We want to find next[j+1].
  3. Case 1: if Pk == Pj, then P0...P(k-1)Pk == P(j-k)...P(j-1)Pj, so the equal proper prefix and proper suffix for j+1 has length k+1, i.e. next[j+1] = k+1.
  4. Case 2: if Pk != Pj, line up P0...Pk under P(j-k)...Pj as in the figure below and apply the same idea again: treat j as the main string cursor, which stays put, and k as the substring cursor, which falls back. The position that k falls back to has already been computed: it is next[k]. So set k = next[k] and compare Pk with Pj again, repeating until Pk == Pj or k falls back to -1. k == -1 means there is no equal proper prefix and proper suffix, so next[j+1] = 0, which is again k+1. This is the benefit of storing -1 as the first value of the next array.
    [Figure: Case 2, P0...Pk lined up under P(j-k)...Pj]
  5. C code for the next array:
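static void GetNext(const char* T, int* next)  // compute the next array (all the k values) of pattern T
{
	int lenT = (int)strlen(T);
	if (lenT == 0)
		return;  // nothing to fill for an empty pattern
	next[0] = -1;
	if (lenT == 1)
		return;  // next[1] would be out of bounds
	next[1] = 0;
	int j = 1;
	int k = 0;

	while (j + 1 < lenT)
	{
		if (k == -1 || T[k] == T[j])  // test k == -1 first so that T[-1] is never read
		{
			// next[j + 1] = k + 1, then advance both cursors
			next[++j] = ++k;
		}
		else  // Pk != Pj
		{
			k = next[k];  // j stays put, k falls back
		}
	}
}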

At this point the next array has been computed: it gives the position that the substring cursor j falls back to while the main string cursor stays put. Finally, put j = next[j] in the mismatch branch and the KMP algorithm is complete. The complete C code of the KMP algorithm is as follows:

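#include <stdlib.h>
#include <string.h>

static void GetNext(const char* T, int* next);  // forward declaration: computes the next array

int KMP(const char* S, const char* T, int pos)  // S: main string, T: pattern string
{
	if (S == NULL || T == NULL || pos < 0 || (size_t)pos >= strlen(S))
	{
		return -1;
	}

	int i = pos;
	int j = 0;
	int len1 = (int)strlen(S);  // length of the main string
	int len2 = (int)strlen(T);  // length of the pattern string

	if (len2 == 0)
		return pos;  // an empty pattern matches at pos, as in BF

	int* next = (int*)malloc((size_t)len2 * sizeof(int));
	if (next == NULL)
		return -1;  // allocation failed (not distinguished from "not found")
	GetNext(T, next);  // compute the next array

	while (i < len1 && j < len2)
	{
		if (j == -1 || S[i] == T[j])  // test j == -1 first so that T[-1] is never read
		{
			i++;
			j++;
		}
		else
		{
			// i does not fall back
			j = next[j];  // j falls back to k
		}
	}
	free(next);
	if (j >= len2)  // the whole pattern was consumed: match found
		return i - j;
	else
		return -1;
}

static void GetNext(const char* T, int* next)  // compute the next array (all the k values) of pattern T
{
	int lenT = (int)strlen(T);
	if (lenT == 0)
		return;  // nothing to fill for an empty pattern
	next[0] = -1;
	if (lenT == 1)
		return;  // next[1] would be out of bounds
	next[1] = 0;
	int j = 1;
	int k = 0;

	while (j + 1 < lenT)
	{
		if (k == -1 || T[k] == T[j])  // test k == -1 first so that T[-1] is never read
		{
			// next[j + 1] = k + 1, then advance both cursors
			next[++j] = ++k;
		}
		else  // Pk != Pj
		{
			k = next[k];  // j stays put, k falls back
		}
	}
}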
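
To see the effect of keeping i in place, here is a minimal sketch that counts character comparisons during a KMP scan. The function name kmp_comparisons and its return convention are assumptions for illustration; it builds the next array with next[0] = -1 and scans from position 0.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper (name and signature are assumptions): runs a KMP scan
   of S for T from index 0 and returns the number of S[i]/T[j] character
   comparisons, or -1 if the next array cannot be allocated. */
int kmp_comparisons(const char *S, const char *T)
{
    assert(S != NULL && T != NULL);
    int len1 = (int)strlen(S);
    int len2 = (int)strlen(T);
    if (len2 == 0)
        return 0;                       /* nothing to compare */

    int *next = malloc((size_t)len2 * sizeof *next);
    if (next == NULL)
        return -1;

    /* Build the next array, with next[0] = -1. */
    next[0] = -1;
    for (int j = 0, k = -1; j + 1 < len2; ) {
        if (k == -1 || T[k] == T[j])
            next[++j] = ++k;
        else
            k = next[k];
    }

    int i = 0, j = 0, count = 0;
    while (i < len1 && j < len2) {
        if (j == -1) {                  /* no prefix left to reuse */
            i++;
            j = 0;
            continue;
        }
        count++;                        /* one comparison of S[i] with T[j] */
        if (S[i] == T[j]) {
            i++;
            j++;
        } else {
            j = next[j];                /* i stays put, j falls back */
        }
    }
    free(next);
    return count;
}
```

With S = "aaaaaaab" and T = "aaab", this sketch returns 12. A brute-force scan of the same input, with i falling back to i-j+1 and j to 0 after each mismatch, needs 20 comparisons.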

Origin blog.csdn.net/huifaguangdemao/article/details/108589288