Chapter 4: Strings (KMP Algorithm and KMP Optimization)

concept:

A finite sequence of zero or more characters is called a string. A
subsequence made up of any number of consecutive characters of a string is called a substring of that string. The
string that contains the substring is called the main string.

Data structure definition

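#include <stdio.h>
#include <string.h>

typedef struct {
	char ch[10];   /* character storage (fixed capacity) */
	int length;    /* number of characters actually in use */
} SString;

int main(void)
{
	SString s;
	char a[5] = {'H', 'e', 'l', 'l', 'o'};   /* no terminating '\0' */

	s.length = 5;
	/* Copy the array. a is not null-terminated, so use sizeof, not strlen. */
	memcpy(s.ch, a, sizeof a);

	for (int i = 0; i < s.length; i++) {
		printf("%c\n", s.ch[i]);
	}
	return 0;
}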


Using a string is simple; the part that deserves attention is the substring matching algorithm.

(1) Simple pattern matching

The idea is simple: line the substring up with the first character of the main string and compare the two position by position. If a character does not match, shift the substring one position to the right and start comparing again.
Put plainly, it is a character-by-character comparison:
1: Use the main string as the outer loop and the substring as the inner loop. For each starting position in the main string, compare the character there with the first character of the substring. If they match, the index of the substring and the index of the main string both advance by 1, that is, the next pair is compared. If they do not match, the starting position in the main string advances by 1 and the substring index returns to the head of the substring, ready to match again.

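#include <stdio.h>
#include <string.h>

typedef struct {
	char ch[10];
	int length;
} SString;

int main(void)
{
	SString s;
	char a[5] = {'H', 'e', 'l', 'l', 'o'};
	char b[2] = {'l', 'o'};   /* the substring to look for */
	int blen = 2;             /* length of the substring */

	s.length = 5;
	/* a has no terminating '\0', so strlen(a) would be undefined; use sizeof. */
	memcpy(s.ch, a, sizeof a);

	for (int i = 0; i < s.length; i++) {
		int j = i;   /* position in the main string */
		int k = 0;   /* position in the substring */
		/* The bounds checks keep j and k inside their arrays. */
		while (j < s.length && k < blen && s.ch[j] == b[k]) {
			printf("Matched: %c\n", s.ch[j]);
			k++;
			j++;
		}
		if (k == blen) {
			printf("All characters matched\n");
			return 0;
		}
		if (j < s.length) {
			printf("Character %c failed to match\n", s.ch[j]);
		}
	}
	return 0;
}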

There are two nested loops here, one over the main string and one over the substring, so the worst-case time complexity is O(M×N), where N is the length of the main string and M is the length of the substring. In the worst case every attempt matches all the way up to the last character of the substring and only then fails, so the inner loop runs about M times for each of the N iterations of the outer loop, giving about M×N comparisons in total.

The example shows the problem. Suppose the substring is very long, say 6000 characters, and the main string has millions of characters. Each time the first character matches at position i, the substring pointer and the main-string pointer move forward together. If the mismatch only happens at the 5998th character, the substring pointer still goes back to 0 and the comparison restarts from position i+1 of the main string, which is very inefficient.

(2) KMP algorithm

As the example above shows, the root cause of the inefficiency is that the pointers backtrack. Now consider what we already know when a mismatch happens: every character of the substring before the mismatch position equals the corresponding character of the main string. If that matched part has a prefix that equals its suffix, we can reuse it. For example, take the substring aabaac, where the matched part aabaa has aa as both a prefix and a suffix. If the mismatch happens at c, the aa just before the mismatch in the main string is known to equal the aa at the start of the substring. So the main-string pointer can stay where it is. We move the substring pointer back to just after the prefix aa and check whether the main string continues with b, that is, whether aab matches. The aa part does not need to be compared again; comparison resumes from the third character of the substring. (This part is hard to follow at first. Don't worry, the examples below make it clear.)

The question is how to know where the substring pointer should go back to, that is, which index of the substring follows the longest matching prefix. This is what the next array is for.

The next array records, for each position X of the substring, where the substring pointer should go back to when a mismatch occurs at position X.
How do we find the next array?

Example 1: Suppose the substring S = 'aaab'; find the next array

If the mismatch is at position 1, there is nothing to go back to, so the first value is fixed at 0.
If the mismatch is at position 2, position 1 has matched, so the pointer goes back to 1; the second value is fixed at 1.
If the mismatch is at position 3, the first two characters aa of the substring have matched the main string but the third has not, so we look at the prefix. Write the unknown third character of the main string as ?.
Slide the substring one position to the right and look for the longest matching prefix: the first a of the shifted substring lines up with the second a of the main string, and the second a of the substring lines up with ?.
So the second a is compared with ? to see whether ? equals a. In other words, the substring pointer goes back to position 2.
Therefore next[3] = 2.

If the mismatch is at position 4, the first three characters aaa have matched and the fourth character of the main string is ?. Slide the substring one position to the right: its first two characters aa line up with the last two matched a's, and its third character lines up with ?.
So the third a of the substring is compared with ? to see whether ? is a, that is, the substring pointer goes back to position 3.
If position 4 also matches, the whole substring has matched.
So the next array is: 0 1 2 3

Summary: the steps to find the next array are as follows

1: The first two values are fixed at 0 and 1.
2: From position 3 onward, draw a vertical line just before the mismatch position and slide the substring to the right one step at a time until the parts of the two strings to the left of the line match completely. This gives the longest matching prefix. The substring position that now lines up with the mismatch position is the next value, that is, the position to go back to.
3: If the substring slides completely past the line without finding a match, the next value is set to 1. None of the known part can be reused, so matching has to restart from the first character.

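Example 2: Find the next array of ababaaababaa:

Applying the steps above to each position gives:
position:   1  2  3  4  5  6  7  8  9 10 11 12
character:  a  b  a  b  a  a  a  b  a  b  a  a
next:       0  1  1  2  3  4  2  2  3  4  5  6
Some textbooks use a different convention in which the first two values of the next array are fixed at -1 and 0.
That version is simply the array calculated above with 1 subtracted from every value.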

After these examples the meaning of next[i] should be taking shape: when the substring mismatches at position i, the substring pointer goes back to position next[i] while the main-string pointer stays where it is. The character at position next[i] of the substring is then compared with the main-string character that just failed to match.

How does the comparison with the main string continue from there? An example makes it easier to understand:

Example 3: Suppose the string S = 'aabaabaabaac' and P = 'aabaac'

(1) Find the next array of P

position:   1  2  3  4  5  6
character:  a  a  b  a  a  c
next:       0  1  2  1  2  3

(2) If S is the main string and P is the pattern string, describe the matching process of the KMP algorithm

Main string S = 'aabaabaabaac'
Pattern string P = 'aabaac'
(1) The comparison starts the same way as before: if S[j] == P[i], then i++ and j++ and the next pair is compared.
(2) The first five characters aabaa all match. At position 6 a mismatch occurs, so i = 6. Look up next[6], which is 3, and move the pattern pointer back to i = 3. The main-string pointer j = 6 does not move. Compare P[3] = b with S[6] and find that P[3] == S[6] == b.
(3) Continue matching with i++ and j++ until P[6] = c and S[9] = b, where there is another mismatch. i is 6 again and next[6] is 3, so the pattern pointer goes back to i = 3.
(4) P[3] == S[9] == b, so i++ and j++ continue.
(5) This time there is no mismatch all the way to P[6] == S[12] == c. The pattern has reached its end, so the match succeeds.
That is the whole matching process.

Once you understand the KMP matching process and can simulate it by hand, it helps to look at some past exam questions excerpted from the Wangdao review book to reinforce it. If you understand the process, the questions are straightforward.

The next array for questions 8 and 9 is the same: 0 1 1 2 2 3
Question 8: the first mismatch is s[5] != t[5], that is, a != c. By the nature of KMP, the main-string pointer i does not backtrack, and the pattern pointer goes back to position 3 according to the next array. Since the indices in this question start from 0, that position is t[2], so i = 5 and j = 2.

Question 9 asks for the number of comparisons. The first pass is a==a, b==b, a==a, a==a, b==b, all matching, and the sixth comparison is a != c. The pattern pointer then goes back to position 3, which means the pattern's prefix ab does not need to be compared again. The comparison continues with a==a, a==a, b==b, c==c, which completes the match.
The total is 6 + 4 = 10 comparisons.

(3) Optimization of KMP

Optimizing KMP really means optimizing the next array. The optimized array is called nextval, and it performs better than next. The rule is: if the character at the position that next points to is the same as the character at the mismatch position, let i make a "second jump" and go straight to the position that the next value of that position points to.
Put plainly: suppose the pattern is at i = 6 with P[6] = a and the match fails. According to the next array, i goes back to 3, but P[3] is also a. The main-string character has just failed to match a, so comparing it with a again is guaranteed to fail and is a waste of time. It is better to jump straight to wherever a mismatch at i = 3 would send us.
By the same reasoning, as long as the character at the jump target is still the same, keep jumping. Only when the characters differ is the comparison worth making.
Again, an example:

Example 1: Find the nextval array of ababaaababaa:

We computed the next array of this string earlier, so we can use that result directly:
next:       0  1  1  2  3  4  2  2  3  4  5  6
Filling steps, from front to back:
Mismatch at 1: no jump, nextval is still 0.
Mismatch at 2: jump back to 1, and p[2] != p[1] (b != a). The characters differ, so the value is not overwritten: nextval = 1.
Mismatch at 3: jump back to 1, and p[3] == p[1] == a. So take the nextval of position 1 directly: nextval = 0.
Mismatch at 4: jump back to 2, and p[4] == p[2] == b. So take the nextval of position 2 directly: nextval = 1.
Mismatch at 5: jump back to 3, and p[5] == p[3] == a. So take the nextval of position 3 directly: nextval = 0.
Mismatch at 6: jump back to 4, and p[6] != p[4]. The characters differ, so the comparison must be made and cannot be skipped: nextval is the same as next, nextval = 4.
Continuing step by step in the same way is straightforward, although the first time through is bound to be a little bumpy. The final result is:
nextval:    0  1  0  1  0  4  2  1  0  1  0  4

(4) Time complexity

As the examples above show, KMP has two main steps:
(1) Build the next (or nextval) array with a loop. If the length of the substring is M, this takes only O(M).
(2) Using the array as a "navigation map", match the substring against the main string. The main-string pointer never backtracks, so if the length of the main string is N, this takes only O(N). The total is O(M+N).

The above is my own summary and understanding of string matching algorithms. If anything is wrong, corrections and discussion are welcome~