String Matching Algorithm - KMP Algorithm

Introduction to KMP Algorithm

The KMP (Knuth-Morris-Pratt) algorithm is a classic string matching algorithm used to find occurrences of a pattern string inside a target string. Its time complexity is O(m+n), where m is the length of the target string and n is the length of the pattern string.

In the naive string matching algorithm, the pattern string is compared character by character at every possible position in the target string, which takes O(m*n) time in the worst case. The KMP algorithm uses the internal structure of the pattern string (its repeated prefixes) to avoid re-comparing characters that are already known to match, thereby improving efficiency.
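For comparison, the naive approach described above can be sketched as follows. This is only an illustration and not part of the original implementation; the class name NaiveSearch and the method indexOf are hypothetical names chosen for this example.

```java
// Illustrative sketch only: NaiveSearch and indexOf are hypothetical names.
final class NaiveSearch {

    // Returns the index of the first occurrence of pattern in text, or -1.
    static int indexOf(String text, String pattern) {
        int textLen = text.length();
        int patternLen = pattern.length();
        // Try every alignment of the pattern against the text.
        for (int start = 0; start + patternLen <= textLen; start++) {
            int j = 0;
            // Each alignment restarts from the first pattern character,
            // which is what leads to the O(m*n) worst case.
            while (j < patternLen && text.charAt(start + j) == pattern.charAt(j)) {
                j++;
            }
            if (j == patternLen) {
                return start;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        System.out.println(indexOf("ABCABABCABABD", "ABAB"));     // prints 3
        System.out.println(indexOf("ABCABABCABABD", "ABABCABD")); // prints -1
    }
}
```

Nothing learned at one alignment is reused at the next, and this is the redundancy that KMP removes.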

KMP algorithm steps

The key to the KMP algorithm is the mismatch function (also called the failure function) of the pattern string. It is computed with a dynamic-programming-style recurrence in which each value is derived from the values before it. Taking the pattern string "ABABCABD" as an example, the mismatch function is computed as follows:

First, set up an array next whose length equals the length of the pattern string, with next[0]=-1. For i>0, next[i] is the length of the longest proper prefix of the first i characters of the pattern string that is also a suffix of those i characters. The values are calculated one by one starting from the second character. next[1] is always 0, because the single character before position 1 has no proper prefix; if a mismatch occurs at the second character, the pattern string can only be moved forward by one position so that its first character is compared next.

Next, calculate the value of next[2]. The first two characters are "AB"; the only proper prefix is "A" and the only proper suffix is "B", and they are not equal, so next[2]=0. For next[3], the first three characters are "ABA", whose prefix "A" equals its suffix "A", so next[3]=1. For next[4], the first four characters "ABAB" have the prefix "AB" equal to the suffix "AB", so next[4]=2.

And so on until the value of next[7] is calculated (the pattern string has 8 characters and the array subscript starts from 0, so the maximum subscript is 7). The resulting next array is: [-1,0,0,1,2,0,1,2]. Here next[i] is the index in the pattern string at which comparison resumes when the character at index i of the pattern string does not match the current character of the target string. The value -1 means that the target string advances to its next character and the pattern string restarts from its first character.
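The construction of the next array can be sketched in code as follows. This is a minimal sketch using the next[0]=-1 convention from the text; the class name NextTable and the method buildNext are hypothetical names introduced for illustration.

```java
import java.util.Arrays;

// Illustrative sketch only: NextTable and buildNext are hypothetical names.
final class NextTable {

    // next[0] = -1; for i > 0, next[i] is the length of the longest proper
    // prefix of pattern[0..i-1] that is also a suffix of pattern[0..i-1].
    static int[] buildNext(String pattern) {
        int m = pattern.length();
        int[] next = new int[m];
        if (m == 0) {
            return next;
        }
        next[0] = -1;
        int i = 0;  // last index whose next value is already known
        int k = -1; // candidate prefix length, equal to next[i] on entry
        while (i < m - 1) {
            if (k == -1 || pattern.charAt(i) == pattern.charAt(k)) {
                // The prefix of length k can be extended by one character.
                i++;
                k++;
                next[i] = k;
            } else {
                // Reuse an earlier value: fall back to a shorter prefix.
                k = next[k];
            }
        }
        return next;
    }

    public static void main(String[] args) {
        // prints [-1, 0, 0, 1, 2, 0, 1, 2]
        System.out.println(Arrays.toString(buildNext("ABABCABD")));
    }
}
```

The fallback step k = next[k] is where earlier results are reused, which is why the construction is described as dynamic programming.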

Next, take the target string "ABCABABCABABD" and the pattern string "ABABCABD" as an example to walk through the matching process of the KMP algorithm.

First, compare the first character of the target string with the first character of the pattern string and find a match, so continue with the second character, which also matches. The third character of the target string is "C" and the third character of the pattern string is "A", so a mismatch occurs. Since next[2]=0, the pattern string restarts from its first character against the same "C". This comparison also fails, and next[0]=-1 means that the target string advances to its fourth character.

Next, the pattern string is aligned with the fourth character of the target string. The fourth through tenth characters of the target string ("ABABCAB") match the first seven characters of the pattern string one by one. At the eleventh character of the target string, "A" is compared with the eighth character of the pattern string, "D", and a mismatch occurs again.

The seven matched characters "ABABCAB" begin and end with "AB", so next[7]=2: the pattern string is moved forward by 7-2=5 positions and comparison resumes at its third character, without going back in the target string. The eleventh and twelfth characters of the target string then match, but the thirteenth character "D" matches neither the "C" at the fifth position of the pattern string nor, after falling back through next[4]=2 and next[2]=0, the "A" at its third and first positions. The target string is then exhausted, so in this example no complete match exists and the search reports failure. In general, the process continues until a complete matching substring is found or the end of the target string is reached.
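The matching process driven by the next array can be sketched as follows. This is a sketch under the next[0]=-1 convention used in the walkthrough, not the only way to write KMP; the names NextArraySearch, buildNext and search are hypothetical.

```java
// Illustrative sketch only: NextArraySearch, buildNext and search are
// hypothetical names.
final class NextArraySearch {

    // Builds the next array with next[0] = -1 (pattern must be non-empty).
    static int[] buildNext(String pattern) {
        int m = pattern.length();
        int[] next = new int[m];
        next[0] = -1;
        int i = 0;
        int k = -1;
        while (i < m - 1) {
            if (k == -1 || pattern.charAt(i) == pattern.charAt(k)) {
                next[++i] = ++k;
            } else {
                k = next[k];
            }
        }
        return next;
    }

    // Returns the index of the first occurrence of pattern in text, or -1.
    static int search(String text, String pattern) {
        if (pattern.isEmpty()) {
            return 0; // assumption: an empty pattern matches at index 0
        }
        int[] next = buildNext(pattern);
        int i = 0; // position in the text; it never moves backward
        int j = 0; // position in the pattern
        while (i < text.length() && j < pattern.length()) {
            if (j == -1 || text.charAt(i) == pattern.charAt(j)) {
                // Match, or the pattern restarts: advance both positions.
                i++;
                j++;
            } else {
                // Mismatch: resume comparison at next[j] in the pattern.
                j = next[j];
            }
        }
        return j == pattern.length() ? i - j : -1;
    }

    public static void main(String[] args) {
        System.out.println(search("ABCABABCABABD", "ABABCABD")); // prints -1
        System.out.println(search("ABABABCABD", "ABABCABD"));    // prints 2
    }
}
```

Because i only increases and every fallback strictly decreases j, the total number of comparisons stays linear in the combined length of the two strings.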

In summary, the KMP algorithm avoids useless comparisons by precomputing a mismatch function for the pattern string, thereby improving the efficiency of string matching. Although its implementation is more involved than the naive approach, it has significant advantages when matching against large texts, and it remains one of the classic string matching algorithms.

KMP algorithm implementation

The following is the implementation of the KMP algorithm in the Java language:

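public class KMP {

    // Builds the prefix table: table[i] is the length of the longest proper
    // prefix of pattern[0..i] that is also a suffix of pattern[0..i].
    public int[] getPrefixTable(char[] pattern) {
        int[] table = new int[pattern.length];
        int index = 0; // length of the prefix matched so far
        for (int i = 1; i < pattern.length; ) {
            if (pattern[i] == pattern[index]) {
                table[i] = ++index;
                i++;
            } else if (index > 0) {
                // Fall back to the next shorter candidate prefix; i stays put.
                index = table[index - 1];
            } else {
                // No prefix can be extended, so table[i] keeps its default 0.
                i++;
            }
        }
        return table;
    }

    // Returns the index of the first occurrence of pattern in text, or -1.
    public int search(char[] text, char[] pattern) {
        int[] prefixTable = getPrefixTable(pattern);
        int i = 0; // position in text; never moves backward
        int j = 0; // position in pattern
        while (i < text.length && j < pattern.length) {
            if (text[i] == pattern[j]) {
                i++;
                j++;
            } else if (j > 0) {
                // Mismatch after j matched characters: reuse the longest
                // prefix of the pattern that is also a suffix of them.
                j = prefixTable[j - 1];
            } else {
                // Mismatch at the first pattern character: advance the text.
                i++;
            }
        }
        if (j == pattern.length) {
            return i - j;
        } else {
            return -1;
        }
    }
}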

The above is a complete implementation of the KMP algorithm. The getPrefixTable method generates the prefix table of the pattern string (also called a partial match table), and the search method returns the position of the first occurrence of the pattern string in the text string, or -1 if it does not occur.
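The prefix table in this implementation is indexed differently from the next array used in the step-by-step description: it has no -1 entry, and each value sits one position earlier. The sketch below relates the two conventions. The prefixTable method mirrors the logic of getPrefixTable, while the class name PrefixToNext and the helper toNext are hypothetical names added for illustration.

```java
import java.util.Arrays;

// Illustrative sketch only: PrefixToNext and toNext are hypothetical names.
final class PrefixToNext {

    // Same logic as getPrefixTable: table[i] is the length of the longest
    // proper prefix of pattern[0..i] that is also a suffix of pattern[0..i].
    static int[] prefixTable(char[] pattern) {
        int[] table = new int[pattern.length];
        int index = 0;
        for (int i = 1; i < pattern.length; ) {
            if (pattern[i] == pattern[index]) {
                table[i] = ++index;
                i++;
            } else if (index > 0) {
                index = table[index - 1];
            } else {
                i++;
            }
        }
        return table;
    }

    // Shifts the prefix table right by one and puts -1 in front,
    // which gives the next array with next[0] = -1.
    static int[] toNext(int[] table) {
        int[] next = new int[table.length];
        if (table.length == 0) {
            return next;
        }
        next[0] = -1;
        System.arraycopy(table, 0, next, 1, table.length - 1);
        return next;
    }

    public static void main(String[] args) {
        int[] table = prefixTable("ABABCABD".toCharArray());
        System.out.println(Arrays.toString(table));         // prints [0, 0, 1, 2, 0, 1, 2, 0]
        System.out.println(Arrays.toString(toNext(table))); // prints [-1, 0, 0, 1, 2, 0, 1, 2]
    }
}
```

This is why search falls back with prefixTable[j-1] rather than next[j]: both expressions denote the same value.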

Origin blog.csdn.net/weixin_44008788/article/details/129659089