Data Structures and Algorithms: Revisiting the KMP Algorithm

Foreword

The KMP algorithm is a string matching algorithm that finds where a pattern string occurs in a main string. String matching is a very common problem in practice, for example searching for keywords in a search engine or finding a string in a text editor.

The KMP algorithm is named after its inventors, Donald Knuth, James H. Morris and Vaughan Pratt, who published the paper "Fast Pattern Matching in Strings" in 1977. Donald Knuth is also the author of "The Art of Computer Programming".

The brute force matching algorithm has a worst-case time complexity of O(nm). The advantage of the KMP algorithm is that its time complexity is O(n+m), where n is the length of the text string and m is the length of the pattern string. KMP keeps this bound even for inputs that are hard for brute force, such as pattern strings that contain repeated substrings.

1. Principle

1.1 Brute force method

String matching takes two strings, the text string s and the pattern string p, and computes the position where p first appears in s, returning -1 if p does not appear.

The brute force method is the simplest string matching algorithm. The idea is straightforward: start from the first character of the main string and compare it with the characters of the pattern string one by one. If a character matches, continue with the next character; otherwise, start matching again from the next character of the main string.

The time complexity of the brute force method is O(n*m).

public int bruteForce(String s, String p) {
    int lenS = s.length();
    int lenP = p.length();
    if (lenS < lenP) return -1;

    // Try every alignment i of p against s.
    for (int i = 0; i <= lenS - lenP; i++) {
        int pos = 0;
        // Compare character by character until a mismatch or the end of p.
        while (pos < lenP) {
            if (s.charAt(i + pos) != p.charAt(pos)) {
                break;
            }
            pos++;
        }
        if (pos == lenP) return i; // all of p matched at alignment i
    }
    return -1;
}
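
To make the O(n*m) bound concrete, here is a minimal Python sketch of the same brute force loop with a comparison counter added. The function name brute_force_count and the test input are assumptions chosen for illustration; the input is built so that every alignment matches until the last character of the pattern.

```python
def brute_force_count(s, p):
    """Return (index of first match or -1, number of character comparisons)."""
    comparisons = 0
    for i in range(len(s) - len(p) + 1):
        pos = 0
        while pos < len(p):
            comparisons += 1
            if s[i + pos] != p[pos]:
                break
            pos += 1
        if pos == len(p):
            return i, comparisons
    return -1, comparisons


# Hypothetical worst-case style input: 1000 'a' characters searched for "aaaaaaaaab".
n, m = 1000, 10
index, comparisons = brute_force_count("a" * n, "a" * (m - 1) + "b")
print(index, comparisons)  # -1 9910, that is (n - m + 1) * m comparisons
```

Each of the n - m + 1 alignments costs m comparisons here, which is where the product of the two lengths comes from.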

1.2 Longest common prefix and suffix

kmp_01.png

Using the example above as a reference, you can see that the brute force method makes many redundant comparisons: whenever a mismatch occurs, it restarts the comparison from the very next position. KMP optimizes exactly this step. Instead of always moving by one position, it uses a next array to decide how far to jump.

The core idea of the KMP algorithm is to use the information from the characters that have already matched to minimize the number of comparisons between the pattern string and the main string. Specifically, the KMP algorithm maintains a next array that records, for each character of the pattern string, the length of the longest common prefix and suffix of the part of the pattern before that character. When the current character fails to match, the pattern string is shifted according to the value in the next array, as far to the right as is possible without skipping a potential match, which reduces the number of comparisons.

Why the longest common prefix and suffix? The following example makes it easy to understand.

kmp_02.png

When the match fails at j=6, we know that the first 6 characters matched. How do we find the next position to jump to?

  • At a glance, the pattern starts with a, that is, it has the prefix a, so the pattern could jump to the position of the next a (i=2)

  • The prefix ab also appears again, so a better choice is the position of the next ab (i=4)

  • By analogy, it looks as if we should find the longest substring that equals a prefix of the pattern and jump to its position

  • However, the position of the longest substring equal to a prefix is not necessarily the best choice. The best choice depends on the suffix of the matched part, because only after a suffix can the comparison continue into characters that have not been examined yet

  • Suppose t is the longest substring that equals a prefix but is not a suffix of the matched part. Because t is the longest, the prefix extended by one more character cannot match t extended by one more character, so aligning the pattern with t is guaranteed to fail. That is a wasted matching attempt, not the best position

  • So the best position is given by the longest common prefix and suffix. In the example above, it is the longest common part of the suffixes of s[0:5] and the prefixes of p[0:5]. Since s[0:5] and p[0:5] are the same string, it is the longest common prefix and suffix ab of abacab
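
The conclusion of the list above can be checked with a few lines of Python. This is only a sketch that follows the definition directly (it tries every length from longest to shortest); the names longest_border, matched and shift are assumptions introduced for illustration and are not part of KMP itself.

```python
def longest_border(t):
    """Longest proper prefix of t that is also a suffix of t, found by brute force."""
    for length in range(len(t) - 1, 0, -1):
        if t[:length] == t[-length:]:
            return t[:length]
    return ""


matched = "abacab"                    # the 6 characters that matched before the failure at j=6
border = longest_border(matched)
shift = len(matched) - len(border)    # how far the pattern moves to the right
print(border, shift)                  # ab 4
```

A shift of 4 places the start of the pattern at i=4, the position of the second ab, which agrees with the example.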

Note: The KMP algorithm shows its strength mainly when the pattern string contains many repeated characters. If all characters of the pattern are different, every next value after next[0] is 0 and the brute force method also fails almost immediately at each position, so KMP gains little over it.

2. Code implementation

2.1 next array

The calculation of the next array is the key to the KMP algorithm, which is defined as follows:

next[i] is the length of the longest common prefix and suffix of the part of the pattern string before position i. That is, for the substring p[0:i-1] ending with the (i-1)th character, it is the length of the longest string that is both a prefix and a suffix of it (not counting the whole substring itself).

Specifically, the next array can be computed incrementally, using the following steps:

  1. next[0]=-1
  2. Let i be the position currently being processed, and j the length of the longest common prefix and suffix before the current position
  3. The goal is to find the largest j such that p[0 : j-1] == p[i-j : i-1]. If such a j is found, the value of next[i] is j. If no such j is found, the value of next[i] is 0
  4. If p[i] == p[j], the common prefix and suffix can be extended by one character, so next[i+1] = j + 1. When no backtracking has happened, j is next[i], so this is next[i+1] = next[i] + 1. In code: next[++i] = ++j
  5. If p[i] != p[j], j falls back to next[j], the next shorter common prefix and suffix, and the comparison is repeated
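public static int[] getNext(String pattern) {
    int[] next = new int[pattern.length()];
    if (pattern.isEmpty()) return next; // nothing to compute for an empty pattern
    next[0] = -1;
    int i = 0, j = -1; // j is the length of the longest common prefix and suffix of the part matched so far
    while (i < pattern.length() - 1) {
        if (j == -1 || pattern.charAt(i) == pattern.charAt(j)) {
            next[++i] = ++j; // extend the length by 1 and move both pointers to the next position
        } else {
            j = next[j]; // backtrack
        }
    }
    return next;
}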
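
As a sanity check on the steps above, the following Python sketch ports the same loop and compares it with a slow version that applies the definition of next[i] literally. Both function names (get_next, get_next_by_definition) are assumptions made for this example, and the array is called next_arr only to avoid shadowing Python's built-in next.

```python
def get_next(pattern):
    """Python port of the getNext loop; next_arr[0] holds the sentinel -1."""
    next_arr = [-1] * len(pattern)
    i, j = 0, -1
    while i < len(pattern) - 1:
        if j == -1 or pattern[i] == pattern[j]:
            i += 1
            j += 1
            next_arr[i] = j        # same as next[++i] = ++j
        else:
            j = next_arr[j]        # fall back to the next shorter candidate
    return next_arr


def get_next_by_definition(pattern):
    """Slow reference: for each i, try every length from longest to shortest."""
    result = [-1] * len(pattern)
    for i in range(1, len(pattern)):
        head = pattern[:i]
        result[i] = next(
            (k for k in range(i - 1, 0, -1) if head[:k] == head[-k:]), 0
        )
    return result


print(get_next("abacabb"))  # [-1, 0, 0, 1, 0, 1, 2]
print(get_next("abcdef"))   # [-1, 0, 0, 0, 0, 0]
```

The second call shows the case where all characters are different: apart from the sentinel, every entry is 0.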

2.2 Visualizing next

Generating the next array is the most important part of understanding KMP, and visualizing how it is generated helps a lot.

import tkinter as tk
import time

def changeColor(canvas, rects, color):
    for rect in rects:
        canvas.itemconfig(rect, fill=color)
def visualize_next(pattern):
    next = [-1] * len(pattern)
    root = tk.Tk()
    root.title("KMP Next Visualization")
    canvas_width = 800
    canvas_height = 600
    canvas = tk.Canvas(root, width=canvas_width, height=canvas_height)
    canvas.pack()
    block_width = 50
    block_height = 50
    x_margin = 50
    y_margin = 50
    nextbox = []
    pbox = []


    x = x_margin
    y = y_margin
    canvas.create_text(x-block_width/2, y+block_height/2, text="Index", font=("Arial", 16))
    for i in range(len(pattern)):
        canvas.create_rectangle(x, y, x+block_width, y+block_height, outline="black")
        canvas.create_text(x+block_width/2, y+block_height/2, text=str(i), font=("Arial", 16))
        x += block_width

    x = x_margin
    y += block_height
    canvas.create_text(x-block_width/2, y+block_height/2, text="p", font=("Arial", 16))
    for i in range(len(pattern)):
        pbox.append(canvas.create_rectangle(x, y, x+block_width, y+block_height, outline="black"))
        canvas.create_text(x+block_width/2, y+block_height/2, text=str(pattern[i]), font=("Arial", 16))
        x += block_width

    x = x_margin
    y += block_height
    canvas.create_text(x-block_width/2, y+block_height/2, text="Next", font=("Arial", 16))

    for i in range(len(pattern)):
        canvas.create_rectangle(x, y, x+block_width, y+block_height, outline="black")
        nextbox.append(canvas.create_text(x+block_width/2, y+block_height/2, text="", font=("Arial", 16)))
        x += block_width
    
    i = 0
    j = -1
    x = x_margin
    y += block_height
    i_rect = canvas.create_rectangle(x, y, x+block_width, y+block_height, fill="red")
    i_text = canvas.create_text(x+block_width/2, y+block_height/2, text=str("i"), font=("Arial", 16))
    y += block_height
    j_rect = canvas.create_rectangle(x - block_width, y, x, y+block_height, fill="blue")
    j_text = canvas.create_text(x- block_width/2, y+block_height/2, text=str("j"), font=("Arial", 16))

    canvas.itemconfig(nextbox[0], text=str("-1"))
    time.sleep(1)
    while i < len(pattern) - 1:
        changeColor(canvas, pbox, '')
        if j == -1 or pattern[i] == pattern[j]:
            i += 1
            j += 1
            next[i] = j  # store the value so that the backtracking step j = next[j] reads real data
            canvas.move(i_rect, block_width, 0)
            canvas.move(i_text, block_width, 0)
            canvas.move(j_rect, block_width, 0)
            canvas.move(j_text, block_width, 0)
            canvas.itemconfig(nextbox[i], text=str(j))
            changeColor(canvas, pbox[0:j], 'blue')
            changeColor(canvas, pbox[i-j:i], 'red')
            canvas.update()
            time.sleep(1)
        else:
            tmp = j
            j = next[j]
            canvas.move(j_rect, (j - tmp)*block_width, 0)
            canvas.move(j_text, (j - tmp)*block_width, 0)

            canvas.update()
            time.sleep(1)
    root.mainloop()

if __name__ == "__main__":
    pattern = "abacabb"
    visualize_next(pattern)

kmp_next.gif

2.3 KMP

Transform the brute force algorithm so that, after a mismatch, the pattern jumps according to the next array:

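public static int kmp(String s, String p) {
    int lenS = s.length();
    int lenP = p.length();
    if (lenP == 0) return 0;
    if (lenS < lenP) return -1;
    int[] next = getNext(p);

    int i = 0;   // start of the current alignment in s
    int pos = 0; // number of pattern characters already known to match at this alignment
    while (i <= lenS - lenP) {
        while (pos < lenP && s.charAt(i + pos) == p.charAt(pos)) {
            pos++;
        }
        if (pos == lenP) return i;
        i += pos - next[pos];         // shift the pattern according to next
        pos = Math.max(next[pos], 0); // the first next[pos] characters still match, so they are not compared again
    }
    return -1;
}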
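
The search can also be written in the more common two-index form, in which the index into the main string never moves backwards. The Python sketch below is one possible version under that assumption; kmp_count, get_next and the comparison counter are names added for illustration only, and the counter exists purely to show that the work stays linear in the length of the main string.

```python
def get_next(pattern):
    """next array with the sentinel -1 at index 0."""
    next_arr = [-1] * len(pattern)
    i, j = 0, -1
    while i < len(pattern) - 1:
        if j == -1 or pattern[i] == pattern[j]:
            i += 1
            j += 1
            next_arr[i] = j
        else:
            j = next_arr[j]
    return next_arr


def kmp_count(s, p):
    """Return (index of first match or -1, number of character comparisons)."""
    if not p:
        return 0, 0
    next_arr = get_next(p)
    comparisons = 0
    i = j = 0                      # i scans s and never decreases; j scans p
    while i < len(s) and j < len(p):
        if j == -1:                # even p[0] failed: move on to the next character of s
            i += 1
            j = 0
            continue
        comparisons += 1
        if s[i] == p[j]:
            i += 1
            j += 1
        else:
            j = next_arr[j]        # reuse the part that is known to match
    return (i - j, comparisons) if j == len(p) else (-1, comparisons)


print(kmp_count("abacabab", "cab"))        # (3, 6)

n = 1000
index, comparisons = kmp_count("a" * n, "a" * 9 + "b")
print(index, comparisons <= 2 * n)         # -1 True
```

On the second input the brute force method needs on the order of n*m comparisons, while this version stays within about twice the length of the main string.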

3. Summary

3.1 Advantages

  • The time complexity of the KMP algorithm is O(n+m), where n is the length of the main string and m is the length of the pattern string. Compared with the O(n*m) of the brute force method, the KMP algorithm is more efficient.

  • The ideas behind KMP also appear in widely used libraries. In CPython, for example, the matching engine of the re module is implemented in C, and the pattern compiler builds a KMP-style overlap table for the literal prefix of a regular expression so that candidate positions for that prefix can be found quickly. The rest of the expression is matched by a backtracking engine rather than by KMP itself. Regular expressions are usually used to process large amounts of text, so this kind of optimization matters for performance.

  • The KMP algorithm handles pattern strings with repeated substrings efficiently: the next array lets it reuse what has already matched, whereas the brute force method compares the same characters again and again.

3.2 Disadvantages

  • KMP is rarely used for short strings. The String.indexOf method in the JDK uses an algorithm based on brute force matching instead of KMP. In practical applications strings are usually not very long, and the KMP algorithm needs additional space to store the prefix table, which increases memory usage, so for shorter strings brute force matching can perform better. The String.indexOf implementation also applies simple optimizations, such as first scanning for the first character of the pattern and only then comparing the remaining characters, which reduces the number of comparisons. In most cases these optimizations are efficient enough for practical needs.
  • The KMP algorithm needs to preprocess the pattern string. This preprocessing takes O(m) time, so the efficiency of the KMP algorithm may be affected when the pattern string is very long.
  • The KMP algorithm can only be used to match a single pattern string, and cannot handle the matching of multiple pattern strings.


Origin blog.csdn.net/qq_23091073/article/details/131423272