[Algorithm Notes] BF and KMP Algorithms

1. BF algorithm

1.1 Introduction

The BF (brute-force) algorithm is a common pattern matching algorithm. Its idea is to compare the first character of the main string S with the first character of the substring T. If they are equal, it goes on to compare the second character of S with the second character of T; if not, it compares the second character of S with the first character of T. The comparison continues in this way until the final matching result is obtained.

1.2 Simulation process (idea)

We can first assume that the main string is: ababcabcdabcde

Suppose the substring is: abcd

Next, we will simulate how to find the part of the main string that matches the substring and return its position.

In the above string, i and j are the indices of the main string and the substring respectively, both starting from 0.

  • When the character at index i of the main string equals the character at index j of the substring, i and j each move one step forward.

    For the strings above, at i == j == 0 the substring character equals the main string character, so i and j each move one step forward. At i == j == 1 the characters are equal again, so i and j each move one more step forward.

  • When the two characters are not equal, set i = i - j + 1 and j = 0 (i and j have moved forward together one step at a time, so i - j is where the current attempt started).

    For the strings above, at i == j == 2 the substring character is not equal to the main string character, so the substring must be traversed from the beginning again, i.e. j = 0. The main string index goes back to the position right after the one where this attempt started. Since the substring and the main string advance together one step at a time, the attempt started at index i - j, and the next index after it is i - j + 1.

  • At the end of the traversal, if j == the substring length, the substring has been found in the main string, and its starting position i - j is returned.

  • At the end of the traversal, if j != the substring length, the substring was not found in the main string, and -1 is returned to indicate that the search failed.
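
The rollback rule can be watched in action with a small trace. The following is only a sketch: the class name BfTrace and the search method are made up for illustration, and the search always starts at position 0. It records every mismatch together with the new values of i and j.

```java
import java.util.ArrayList;
import java.util.List;

class BfTrace {
    // Brute-force search from position 0 that records each rollback.
    // Returns the start index of the first match, or -1 if there is none.
    static int search(String str, String sub, List<String> rollbacks) {
        int i = 0;
        int j = 0;
        while (i < str.length() && j < sub.length()) {
            if (str.charAt(i) == sub.charAt(j)) {
                i++;
                j++;
            } else {
                int restart = i - j + 1; // one past the start of this attempt
                rollbacks.add("i=" + i + ", j=" + j + " -> i=" + restart + ", j=0");
                i = restart;
                j = 0;
            }
        }
        return j == sub.length() ? i - j : -1;
    }

    public static void main(String[] args) {
        List<String> rollbacks = new ArrayList<>();
        int pos = search("ababcabcdabcde", "abcd", rollbacks);
        rollbacks.forEach(System.out::println); // first line: i=2, j=2 -> i=1, j=0
        System.out.println("match at " + pos);  // match at 5
    }
}
```

For the example strings, the trace shows five rollbacks before the match is found at index 5.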

1.3 Sample code

We can write a BF algorithm according to the above simulation process and its ideas

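public static int BF(String str, String sub, int pos) {
    // First handle the special cases for the main string, the substring,
    // and the starting search position in the main string
    if (str == null || sub == null) {
        return -1;
    }
    int lenStr = str.length();
    int lenSub = sub.length();
    if (lenStr == 0 || lenSub == 0) {
        return -1;
    }
    if (pos < 0 || pos >= lenStr) {
        return -1;
    }
    int i = pos;  // i is the current index in the main string
    int j = 0;    // j is the current index in the substring
    // Traverse the main string, searching for the substring
    while (i < lenStr && j < lenSub) {
        if (str.charAt(i) == sub.charAt(j)) {
            i++;
            j++;
        } else {
            // Mismatch: restart one position after where this attempt began
            i = i - j + 1;
            j = 0;
        }
    }
    if (j == lenSub) {
        return i - j;  // start position of the match
    }
    return -1;
}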

Note:

charAt(index) is a method of the String class that returns the character at the specified index of the string.

1.4 Time Complexity

The worst-case time complexity of the BF algorithm is O(m*n), where m is the length of the main string and n is the length of the substring.
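
To get a feel for the worst case, one can count character comparisons on an input that forces a rollback at almost every position. The sketch below is an illustration only: BfComparisonCount, countComparisons, and repeat are hypothetical names, and the input (a main string of 20 a's and a substring of nine a's followed by b) is chosen by hand.

```java
import java.util.Arrays;

class BfComparisonCount {
    // Runs the brute-force loop from position 0 and counts character comparisons.
    static long countComparisons(String str, String sub) {
        int i = 0;
        int j = 0;
        long comparisons = 0;
        while (i < str.length() && j < sub.length()) {
            comparisons++;
            if (str.charAt(i) == sub.charAt(j)) {
                i++;
                j++;
            } else {
                i = i - j + 1;
                j = 0;
            }
        }
        return comparisons;
    }

    // Builds a string of n copies of c.
    static String repeat(char c, int n) {
        char[] chars = new char[n];
        Arrays.fill(chars, c);
        return new String(chars);
    }

    public static void main(String[] args) {
        String str = repeat('a', 20);      // m = 20
        String sub = repeat('a', 9) + "b"; // n = 10
        System.out.println(countComparisons(str, sub)); // 119
    }
}
```

Each of the 11 full attempts costs 10 comparisons before failing on the final b, and the last partial attempt costs 9 more, so the count grows with the product of the two lengths.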

2. KMP algorithm

2.1 Introduction

The KMP algorithm is an improved string matching algorithm proposed by D. E. Knuth, J. H. Morris, and V. R. Pratt, hence the name Knuth-Morris-Pratt algorithm (KMP algorithm for short). The core of the KMP algorithm is to use the information gained from a failed match to minimize the number of comparisons between the substring and the main string, and so achieve fast matching. Concretely, this is implemented through a next array, which holds local matching information about the substring itself.

2.2 Simulation process (idea)

2.2.1 Overall idea

We can first assume that the main string is: abcababcabc

Suppose the substring is: abcabda

Next, we will use the idea of the KMP algorithm to simulate searching for the substring in the main string.

In the above string, i and j are the indices of the main string and the substring respectively, both starting from 0.

  • When the character at index i of the main string equals the character at index j of the substring, i and j each move one step forward (the same as in the BF algorithm).

    For the strings above, at i == j == 0 the substring character equals the main string character, so i and j each move one step forward. At i == j == 1 the characters are equal again, so i and j each move one more step forward.

  • When the two characters are not equal, the substring index j falls back to a specified value, and the main string index i does not fall back (this is the only difference between KMP and BF).

    As for why the main string index does not roll back, and why the substring index falls back to a specified value, a detailed analysis follows.

  • At the end of the traversal, if j == the substring length, the substring has been found in the main string, and its starting position i - j is returned.

  • At the end of the traversal, if j != the substring length, the substring was not found in the main string, and -1 is returned to indicate that the search failed.

2.2.2 The index of the main string is not rolled back

According to the analysis above, at i == j == 5 the main string and the substring fail to match. With the brute-force approach, the main string index i would now fall back to the position right after the starting position of this attempt, that is, i == 1.

But i does not actually need to roll back. The analysis is as follows:

  • At i == j == 5, all the characters before this position in the main string and the substring have already matched. Any later successful match must therefore either cover position i == 5 or start after it, so i can stay unchanged at i == 5.
  • Since i does not roll back, we have to adjust the substring index j instead. But j cannot simply go back to the initial position 0: first, because the main string did not roll back, resetting j to 0 could skip a match that begins before i; second, there is a more ingenious way.

The impact of the main string not being rolled back:

Since the main string does not roll back, the main string only needs to be traversed once, which is a certain improvement compared to the BF algorithm!

2.2.3 Substring index fallback position

In fact, each position of the substring has a fixed position to fall back to after a match fails there, and these fallback positions can be stored in a next array. The essence of KMP is this next array.

Next, let's explore how to find the fallback position for each position of the substring. The following figure is an example.

[Figure: the fallback position of the substring after the mismatch at i == j == 5, with strings A, B, and C marked]

  • First, at i == j == 5 the match fails for the first time, so we need to find a fallback position for the substring such that, after j falls back, the characters just before index i of the main string still match the characters of the substring before its new position j.

  • When a match fails, all the characters before the failure position have matched successfully. Therefore, we look for a string C that starts at position 0 of the substring, and a string B that ends at the position just before the failure position (both shorter than the matched part). If string C equals string B, we take the longest such pair and use its length as the position j falls back to.

  • As shown in the figure above, string C and string B are the two strings we found (both are ab), so when the match fails at substring position 5, we let the substring index go back to position 2.

  • Once the fallback position is found, we can see that string C of the substring is the same as string A of the main string, so there is no need to roll the main string index back and reset the substring index to 0.

  • Following the same rule, we continue matching. Now i == 5 and j == 2, and the corresponding characters are still not equal. We again look for a string C that starts at position 0 of the substring and a string B that ends at the position just before the failure position, but this time no equal pair of strings C and B can be found, so j falls back to the starting position 0.
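
The search for strings C and B can be written down directly from this description. The following is a deliberately naive sketch (the class FallbackByDefinition and the method fallback are names invented here, and j is assumed to be at least 1): it tries every candidate length from longest to shortest and returns the first one where the prefix and the suffix agree.

```java
class FallbackByDefinition {
    // Returns the largest len < j such that the first len characters of sub (string C)
    // equal the len characters that end just before position j (string B).
    // Returns 0 when no such pair exists. Assumes 1 <= j <= sub.length().
    static int fallback(String sub, int j) {
        for (int len = j - 1; len > 0; len--) {
            String c = sub.substring(0, len);
            String b = sub.substring(j - len, j);
            if (c.equals(b)) {
                return len;
            }
        }
        return 0;
    }

    public static void main(String[] args) {
        String sub = "abcabda";
        System.out.println(fallback(sub, 5)); // 2: C and B are both "ab"
        System.out.println(fallback(sub, 2)); // 0: no equal C and B
    }
}
```

This brute-force version is only meant to make the definition concrete; it is much slower than building all the fallback positions in a single pass over the substring.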

2.2.4 Implementation method of next array

From the above, we can see that the only difference between the KMP algorithm and the BF algorithm is that i does not fall back, and the position j falls back to is specified.

The fallback position for each failed match can be worked out from the substring alone and stored in a next array in advance.

The next array is built as follows:

  • Set next[0] = -1 and next[1] = 0.
  • Define k = 0, which holds the next value of the position before j (initially next[1], since we start from j == 2).
  • Starting from j == 2, if the character at position j - 1 of the substring equals the character at position k, then next[j] = k + 1. Otherwise, set k = next[k] and compare the character at position j - 1 with the character at the new position k, repeating until they are equal or k becomes -1, in which case next[j] = 0.

2.2.5 Optimization of next array

Suppose the substring is aaaaaaaab. Its next array is then {-1, 0, 1, 2, 3, 4, 5, 6, 7}.

When the character at position i of the main string is not equal to the last a in the substring, the substring index falls back to position 6. The character there is still a, which still does not match the main string, so it has to fall back again, and this only ends once it has fallen back all the way to the a at the first position.

In this case, we can make an optimization. When the character x1 at the fallback position is the same as the current character x2, keep falling back to the next value of x1, until the character at the fallback position differs from x2 or the fallback value is -1.

This gives an optimized nextVal array, {-1, -1, -1, -1, -1, -1, -1, -1, 7}, which improves efficiency further.
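
One possible way to derive nextVal from next is sketched below. The names NextValSketch, buildNext, and buildNextVal are assumptions made for this example, the substring is assumed to be non-empty, and buildNext is a compact variant that uses the same next[0] = -1 convention as the rest of this article.

```java
import java.util.Arrays;

class NextValSketch {
    // Builds the next array with next[0] = -1. Assumes sub is non-empty.
    static int[] buildNext(String sub) {
        int[] next = new int[sub.length()];
        next[0] = -1;
        int j = 0;
        int k = -1;
        while (j < sub.length() - 1) {
            if (k == -1 || sub.charAt(j) == sub.charAt(k)) {
                j++;
                k++;
                next[j] = k;
            } else {
                k = next[k];
            }
        }
        return next;
    }

    // If the character at the fallback position equals the current character,
    // falling back there would fail again, so reuse that position's nextVal instead.
    static int[] buildNextVal(String sub, int[] next) {
        int[] nextVal = new int[next.length];
        nextVal[0] = -1;
        for (int j = 1; j < next.length; j++) {
            int k = next[j]; // always >= 0 for j >= 1
            nextVal[j] = (sub.charAt(j) == sub.charAt(k)) ? nextVal[k] : k;
        }
        return nextVal;
    }

    public static void main(String[] args) {
        String sub = "aaaaaaaab";
        int[] next = buildNext(sub);
        System.out.println(Arrays.toString(next));                    // [-1, 0, 1, 2, 3, 4, 5, 6, 7]
        System.out.println(Arrays.toString(buildNextVal(sub, next))); // [-1, -1, -1, -1, -1, -1, -1, -1, 7]
    }
}
```

Because nextVal is filled from left to right, each entry can reuse the already optimized entry at its fallback position, so no inner loop is needed. A search loop that already treats j == -1 as "advance i and restart at j = 0" should be able to use nextVal in place of next unchanged.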

2.3 Sample code

We can write a KMP algorithm according to the above simulation process and its ideas

public static int KMP(String str, String sub, int pos) {
    if (str == null || sub == null) {
        return -1;
    }
    int lenStr = str.length();
    int lenSub = sub.length();
    if (lenStr == 0 || lenSub == 0) {
        return -1;
    }
    if (pos < 0 || pos >= lenStr) {
        return -1;
    }
    int i = pos;
    int j = 0;
    int[] next = new int[lenSub];
    getNext(sub, next);
    while (i < lenStr && j < lenSub) {
        // j == -1 means even sub[0] did not match, so move on in the main string
        if ((j == -1) || str.charAt(i) == sub.charAt(j)) {
            i++;
            j++;
        } else {
            // Mismatch: i stays where it is, j falls back to its precomputed position
            j = next[j];
        }
    }
    if (j == lenSub) {
        return i - j;
    }
    return -1;
}
public static void getNext(String sub, int[] next) {
    if (sub == null || sub.length() == 0) {
        return;
    }
    next[0] = -1;
    if (sub.length() == 1) {
        return;  // a one-character substring has no next[1]
    }
    next[1] = 0;
    int i = 2;
    int k = 0;    // next value of the previous position, initially next[1]
    while (i < sub.length()) {
        if ((k == -1) || sub.charAt(i - 1) == sub.charAt(k)) {
            next[i] = k + 1;
            i++;
            k++;
        } else {
            k = next[k];
        }
    }
}
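
To see that i never moves backward, the search can be traced on the example strings from section 2.2. This is a sketch under assumptions: KmpTrace, buildNext, and search are illustrative names, the search starts at position 0, the substring is assumed non-empty, and buildNext is a compact variant that yields the same next values for this example.

```java
import java.util.ArrayList;
import java.util.List;

class KmpTrace {
    // Builds the next array with next[0] = -1. Assumes sub is non-empty.
    static int[] buildNext(String sub) {
        int[] next = new int[sub.length()];
        next[0] = -1;
        int j = 0;
        int k = -1;
        while (j < sub.length() - 1) {
            if (k == -1 || sub.charAt(j) == sub.charAt(k)) {
                j++;
                k++;
                next[j] = k;
            } else {
                k = next[k];
            }
        }
        return next;
    }

    // KMP search from position 0 that records every fallback of j.
    static int search(String str, String sub, List<String> fallbacks) {
        int[] next = buildNext(sub);
        int i = 0;
        int j = 0;
        while (i < str.length() && j < sub.length()) {
            if (j == -1 || str.charAt(i) == sub.charAt(j)) {
                i++;
                j++;
            } else {
                fallbacks.add("i=" + i + ": j " + j + " -> " + next[j]);
                j = next[j];
            }
        }
        return j == sub.length() ? i - j : -1;
    }

    public static void main(String[] args) {
        List<String> fallbacks = new ArrayList<>();
        int pos = search("abcababcabc", "abcabda", fallbacks);
        fallbacks.forEach(System.out::println);
        System.out.println("result " + pos); // result -1
    }
}
```

The first two recorded lines are "i=5: j 5 -> 2" and "i=5: j 2 -> 0", which is the two-step fallback walked through in section 2.2.3. The substring does not occur in this main string, so the result is -1.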

2.4 Time complexity

The time complexity of the KMP algorithm is: O(m+n), where m is the length of the main string and n is the length of the substring
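
As a rough check of this bound, character comparisons can be counted on an input that is expensive for a restart-from-scratch search: a main string of 20 a's and a substring of nine a's followed by b. The class KmpComparisonCount and its helpers are hypothetical names used only for this sketch, and the substring is assumed to be non-empty.

```java
import java.util.Arrays;

class KmpComparisonCount {
    // Builds the next array with next[0] = -1. Assumes sub is non-empty.
    static int[] buildNext(String sub) {
        int[] next = new int[sub.length()];
        next[0] = -1;
        int j = 0;
        int k = -1;
        while (j < sub.length() - 1) {
            if (k == -1 || sub.charAt(j) == sub.charAt(k)) {
                j++;
                k++;
                next[j] = k;
            } else {
                k = next[k];
            }
        }
        return next;
    }

    // Runs the KMP main loop from position 0 and counts character comparisons
    // between the main string and the substring (building next is not counted).
    static long countComparisons(String str, String sub) {
        int[] next = buildNext(sub);
        int i = 0;
        int j = 0;
        long comparisons = 0;
        while (i < str.length() && j < sub.length()) {
            if (j != -1) {
                comparisons++;
            }
            if (j == -1 || str.charAt(i) == sub.charAt(j)) {
                i++;
                j++;
            } else {
                j = next[j];
            }
        }
        return comparisons;
    }

    // Builds a string of n copies of c.
    static String repeat(char c, int n) {
        char[] chars = new char[n];
        Arrays.fill(chars, c);
        return new String(chars);
    }

    public static void main(String[] args) {
        String str = repeat('a', 20);      // m = 20
        String sub = repeat('a', 9) + "b"; // n = 10
        System.out.println(countComparisons(str, sub)); // 31
    }
}
```

On this input the main loop makes 31 comparisons, which is close to m + n = 30 and far below m * n = 200.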

Origin blog.csdn.net/weixin_51367845/article/details/122098687