Implementation and understanding of the KMP algorithm

How the KMP algorithm is implemented and what it means

1. Purpose of the KMP algorithm

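In string pattern matching, the naive algorithm backtracks while looking for the first occurrence of a substring: after every mismatch, the main-string pointer returns to just past the start of the failed attempt, which costs O(n * m) time in the worst case for a main string of length n and a pattern string of length m. The KMP algorithm was introduced to reduce this cost. On a mismatch it shifts the pattern string instead of moving the main-string pointer backward, so the main-string pointer never backtracks and the whole search takes O(n + m) time, including the construction of the next array.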
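For comparison, the backtracking in the naive algorithm can be sketched as follows. This is a minimal illustration rather than code from the article; the function name naiveFind is a hypothetical choice, and pos is assumed to be a valid non-negative start position.

```cpp
#include <string>

// Hypothetical helper (not from the original article): naive pattern matching.
// Returns the index of the first occurrence of t in s at or after pos, or -1.
// Assumes 0 <= pos.
int naiveFind(const std::string& s, const std::string& t, int pos) {
    int sLen = static_cast<int>(s.length());
    int tLen = static_cast<int>(t.length());
    int i = pos, j = 0;
    while (i < sLen && j < tLen) {
        if (s[i] == t[j]) {
            i++;
            j++;
        } else {
            // Backtracking: i returns to the character right after the start
            // of the failed attempt, and j restarts at the pattern's beginning.
            i = i - j + 1;
            j = 0;
        }
    }
    return (j == tLen) ? i - j : -1;
}
```

The line i = i - j + 1 is the step KMP removes: characters of the main string that were already compared get compared again after every mismatch.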

2. Implementation of KMP algorithm

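(1) First, the code of the main KMP search routine:
#include <string>
using namespace std;

int KMP(string s, string t, int pos, int next[]){
	// s is the main string, t is the pattern string, pos is the position to start searching from.
	// next[] stores, for each character of the pattern string, the position to resume from after a mismatch there.
	// pos may start from 0.
	int i = pos, j = 0;
	int sLen = s.length(), tLen = t.length();
	while(i < sLen && j < tLen){
		// compare the characters themselves; j == -1 means no fallback is left
		if(j == -1 || s[i] == t[j]){
			i++;
			j++;
		}else{
			// mismatch: only the pattern pointer moves, i stays where it is
			j = next[j];
		}
	}
	if(j < tLen){
		// the pattern string was not matched
		return -1;
	}
	// return the subscript where the match starts
	return i - j;
}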

The search procedure of the KMP algorithm is not difficult to understand. The main idea is: when the current character of the pattern string equals the current character of the main string, both pointers move forward together; when they differ, the pointer of the main string does not backtrack. Instead, the pattern string is shifted to the right, which means the pattern pointer is adjusted to an earlier position and the comparison with the same main-string character continues. Knowing where to move the pattern pointer after a mismatch at a given position is therefore the core of the KMP algorithm. The next array stores, for each position of the pattern string, the position the pattern pointer should be adjusted to when a mismatch occurs there.
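The search idea above can be sketched as a self-contained pair of functions. This is an illustrative version, not the article's exact code: the names buildNext and kmpSearch are hypothetical, and std::vector is used instead of a raw array as an assumption for convenience.

```cpp
#include <string>
#include <vector>

// Hypothetical helper: builds the next table, with next[0] = -1 as the
// marker meaning "no fallback position is left".
std::vector<int> buildNext(const std::string& t) {
    int len = static_cast<int>(t.length());
    std::vector<int> next(len > 0 ? len : 1, -1);
    int i = 0, k = -1;
    while (i < len - 1) {
        if (k == -1 || t[i] == t[k]) {
            i++;
            k++;
            next[i] = k;
        } else {
            k = next[k];
        }
    }
    return next;
}

// Hypothetical helper: returns the index of the first occurrence of t in s
// at or after pos, or -1. Assumes 0 <= pos.
int kmpSearch(const std::string& s, const std::string& t, int pos) {
    std::vector<int> next = buildNext(t);
    int sLen = static_cast<int>(s.length());
    int tLen = static_cast<int>(t.length());
    int i = pos, j = 0;
    while (i < sLen && j < tLen) {
        if (j == -1 || s[i] == t[j]) {
            i++;  // characters agree (or the pattern restarts): advance both
            j++;
        } else {
            j = next[j];  // mismatch: i is never decreased, only j moves
        }
    }
    return (j == tLen) ? i - j : -1;
}
```

With the main string "abaabaabc" and the pattern "abaabc", the first mismatch happens at i = 5, j = 5. The pattern pointer falls back to next[5] = 2 while i stays at 5, and the match is then completed at subscript 3.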

(2) How to find the next array
  • First, an example:
Subscript   0   1   2   3   4   5
String      a   b   a   a   b   c
next       -1   0   0   1   1   2

The idea here is to run pattern matching of the pattern string against itself. Initialize next[0] = -1 as a marker, and use the pointer i to traverse the pattern string. At position i, we look at the previous character t[i-1] and check whether it equals the character whose subscript is next[i-1], that is, whether t[i-1] == t[next[i-1]]. If they are equal, or if next[i-1] is -1, then next[i] = next[i-1] + 1 and the pointer moves on. If not, we use another pointer k, starting at next[i-1] (the subscript of the character currently compared with t[i-1]), and repeatedly set k = next[k], so that t[i-1] keeps being compared with t[k] until the two are equal or k becomes -1 (the beginning has been reached). Then next[i] = k + 1. In other words, the result of comparing each character determines the next value of the character after it.

  • The implementation code:
void getNext(string t, int next[]){
	int len = t.length();
	// marker: no fallback position is left
	next[0] = -1;
	int i = 0, k = -1;
	// stop at len - 1 so that next[i] is never written out of bounds
	while(i < len - 1){
		if(k == -1 || t[i] == t[k]){
			i++;
			k++;
			next[i] = k;
		}else{
			k = next[k];
		}
	}
}
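The prose description of the next array can also be written out literally, one position at a time. The sketch below is only an illustration of that description; the name nextByDescription is hypothetical, and std::vector is an assumption made for convenience.

```cpp
#include <string>
#include <vector>

// Hypothetical helper: follows the prose description step by step.
std::vector<int> nextByDescription(const std::string& t) {
    int len = static_cast<int>(t.length());
    std::vector<int> next(len, -1);  // next[0] = -1 is the marker
    for (int i = 1; i < len; i++) {
        int k = next[i - 1];
        // Keep falling back until t[i-1] equals t[k] or nothing is left.
        while (k != -1 && t[i - 1] != t[k]) {
            k = next[k];
        }
        next[i] = k + 1;
    }
    return next;
}
```

For "abaabc" the interesting step is i = 4: k starts at next[3] = 1, t[3] = 'a' differs from t[1] = 'b', so k falls back to next[1] = 0, where t[3] equals t[0], giving next[4] = 1 as in the table.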

3. Understanding the next array

(1) The most basic meaning: when the character at subscript i of the pattern string is mismatched, next[i] tells how far the pattern string should be moved to the right, that is, which position the pattern pointer is adjusted to, before matching continues at the current mismatch position of the main string.
(2) The value of each entry next[i] is the length of the longest prefix of the pattern string that is repeated as a suffix of the characters before position i (the prefix must be shorter than that span).
  • For example, here we expand the next array by one entry and modify the function that computes it accordingly.
Subscript   0   1   2   3   4   5   6
String      a   b   c   a   b   c
next       -1   0   0   0   1   2   3
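int *getNext(string t){
	int len = t.length();
	// one extra slot, so that next[len] describes the whole string;
	// the caller owns the array and must release it with delete[]
	int *next = new int[len + 1];
	next[0] = -1;
	int i = 0, k = -1;
	// run up to len (not len - 1) to fill the extra entry next[len]
	while(i < len){
		if(k == -1 || t[i] == t[k]){
			i++;
			k++;
			next[i] = k;
		}else{
			k = next[k];
		}
	}
	return next;
}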

So what does next[6] mean here? next[6] = 3 means that the last 3 characters of the pattern string are exactly the same as its prefix of length 3. That is, "abc" starting at subscript 3 equals the prefix "abc" starting at subscript 0.
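This meaning can be checked directly, without the next array, by comparing prefixes and suffixes by brute force. The helper below is a hypothetical sketch written only for this check (its name longestBorder is not from the article), and it assumes n does not exceed the length of t.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: length of the longest prefix of the first n characters
// of t that also ends those n characters, excluding the whole span itself.
// Assumes n <= t.length().
std::size_t longestBorder(const std::string& t, std::size_t n) {
    if (n == 0) {
        return 0;
    }
    for (std::size_t len = n - 1; len > 0; len--) {
        // Compare t[0 .. len-1] with t[n-len .. n-1].
        if (t.compare(0, len, t, n - len, len) == 0) {
            return len;
        }
    }
    return 0;
}
```

For "abcabc" and n = 6 the lengths 5 and 4 fail, and length 3 succeeds because "abc" at subscript 0 equals "abc" at subscript 3, which agrees with next[6] = 3. This brute-force check is quadratic or worse, so it is useful for verifying a next table rather than for building one.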

  • Understanding this meaning is very useful. Expanding the next array by one entry makes it possible to solve problems such as finding the longest repeated prefix of a string or detecting a repeating (cyclic) substring.
  • Further discussion: next[i] is the length of the longest prefix that is repeated as a suffix of the characters before position i. There are three cases:
    (1) next[i] = 0: no suffix of the characters before position i equals a prefix, that is, the string formed by the characters before position i has no repeated prefix. For example:
Subscript   0   1   2   3   4
String      a   b   c   d
next       -1   0   0   0   0

next[4] = 0 means that the string "abcd" has no prefix that is repeated as its suffix.

    (2) next[len] = len - 1, where len = s.length(): the entire string consists of one repeated character. For example:

Subscript   0   1   2   3   4
String      a   a   a   a
next       -1   0   1   2   3

next[4] = 3 = len - 1 indicates that the string is made up of a single repeated character.

    (3) next[len] >= len / 2 (precisely, 2 * next[len] >= len): a repeated substring exists, and the matching prefix and suffix together cover the whole string. When 2 * next[len] > len they overlap, so some characters are counted in both. For example:

Subscript   0   1   2   3   4   5   6   7   8   9
String      a   b   c   a   b   c   a   b   c
next       -1   0   0   0   1   2   3   4   5   6

First, a concept: the minimum cycle section is the shortest substring whose repetition produces the string (the last copy may be cut short). Using the next array expanded by one entry, its length is min_circle = len - next[len].

In this example the minimum cycle section is "abc", so min_circle = 9 - 6 = 3. next[9] = 6 is more than half the length because the matching prefix "abcabc" (subscripts 0 to 5) and the matching suffix "abcabc" (subscripts 3 to 8) overlap, so the middle "abc" is counted in both. In addition, if min_circle == len (that is, next[len] == 0), the string has no repeated part; otherwise, if len % min_circle == 0, the entire string is formed by repeating the minimum cycle section.
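The formula min_circle = len - next[len] can be sketched in code as follows. This is an illustrative version under stated assumptions: the names buildExtendedNext, minCycleLength and isWholeRepetition are hypothetical, std::vector replaces the raw array, and the empty string is treated as having cycle length 0.

```cpp
#include <string>
#include <vector>

// Hypothetical helper: next table with len + 1 entries, so that next[len]
// describes the whole string.
std::vector<int> buildExtendedNext(const std::string& t) {
    int len = static_cast<int>(t.length());
    std::vector<int> next(len + 1, -1);
    int i = 0, k = -1;
    while (i < len) {
        if (k == -1 || t[i] == t[k]) {
            i++;
            k++;
            next[i] = k;
        } else {
            k = next[k];
        }
    }
    return next;
}

// Length of the minimum cycle section: len - next[len].
// Assumption: returns 0 for the empty string.
int minCycleLength(const std::string& t) {
    int len = static_cast<int>(t.length());
    if (len == 0) {
        return 0;
    }
    return len - buildExtendedNext(t)[len];
}

// True when t consists of two or more whole copies of its minimum cycle section.
bool isWholeRepetition(const std::string& t) {
    int len = static_cast<int>(t.length());
    int cycle = minCycleLength(t);
    // cycle < len rules out the "no repeated part" case, where cycle == len.
    return len > 0 && cycle < len && len % cycle == 0;
}
```

For "abcabcabc" this gives a cycle length of 3 and a whole repetition. For "abcab" the cycle length is also 3, but 5 % 3 != 0, so the last copy is cut short. For "abcd" the cycle length equals the string length, so there is no repetition.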

Origin blog.csdn.net/weixin_45688536/article/details/109296350