An illustrated guide to the KMP algorithm

Preface: If you opened this article, you probably already know a little about the KMP algorithm. If you do not, that is no problem: today we will go through what KMP is in detail, so that you understand how the algorithm works and can actually apply it.

First, what is the KMP algorithm

  • KMP stands for Knuth-Morris-Pratt. It is an improved string matching algorithm, proposed by D. E. Knuth, J. H. Morris and V. R. Pratt, and its arrival was a breath of fresh air for string matching.
  • The essence of KMP is this: when a mismatch occurs during matching, we do not go back and start again from the head of the pattern string. Instead, we resume from the longest common prefix and suffix of the part of the pattern that has already matched. In other words, we reuse the information we already have, skip the unnecessary comparisons, and keep the matching time to a minimum. This is hard to put into words, so see the explanation below.
  • Every algorithm is born to meet a need, and KMP is no exception. The following is the classic scenario for KMP:
    I have a string str and another string pattern (let us call it the pattern string).
    I want to know whether pattern occurs in str, or how many times pattern occurs in str.

Second, brute-force matching

Faced with the requirement above, most readers who have never met KMP will first think of brute-force (BF) matching. The idea is simple: define i = 0 so that it points to the i-th character of str, and let i run from 0 to str.length - pattern.length. In each iteration, define a variable j that points to the j-th character of pattern, with j running from 0 to pattern.length - 1. If str[i + j] == pattern[j] holds for every j in the inner loop, the match succeeds. If the outer loop finishes without a successful match, pattern does not occur in str.
The code below folds the two loops into one: here i is the current position in str, and on a mismatch i goes back to just after the previous starting point while j is reset to 0.
Here is the code, which is probably the first method most people think of:

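bool BF(string str,string pattern){
    int i=0,j=0;                        // i indexes str, j indexes pattern
    while(i < str.length() && j < pattern.length()){
        if(str[i] == pattern[j]){       // characters match: advance both
            i++;
            j++;
        }else{                          // mismatch: restart just after the previous starting point
            i = i-j+1;
            j=0;
        }
    }
    return j == pattern.length();       // true if the whole pattern was matched
}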

The execution of this algorithm can be illustrated as follows:
[Figure: the brute-force matching process]
It is that simple and crude. The advantage of this algorithm is that it is simple and hard to get wrong. Its fatal drawback is the cost in time: from the code you can see at once that the time complexity is O(m * n), where m and n are the lengths of str and pattern. When both are long, using this algorithm is a disaster. In today's age of big data we want something more elegant and efficient, and that is today's hero, the KMP algorithm.
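To make the O(m * n) cost more concrete, here is a small sketch that counts the character comparisons made by the brute-force loop. The helper name countComparisonsBF and the test strings are made up for illustration; they are not part of the original code.

```cpp
#include <cassert>
#include <string>
using std::string;

// Same loop as BF, but returns how many character comparisons were made.
// Illustrative helper only.
long countComparisonsBF(const string& str, const string& pattern) {
    assert(!pattern.empty());
    long comparisons = 0;
    size_t i = 0, j = 0;
    while (i < str.length() && j < pattern.length()) {
        ++comparisons;
        if (str[i] == pattern[j]) {
            ++i;
            ++j;
        } else {
            i = i - j + 1;  // go back to just after the previous starting point
            j = 0;
        }
    }
    return comparisons;
}
```

For a text of nine 'a' characters followed by 'b' and the pattern "aaab", this makes 28 comparisons: seven starting positions with four comparisons each. Almost every starting position re-reads characters that were already compared, which is exactly the waste KMP removes.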

Third, the KMP algorithm

  • To understand the KMP algorithm, we first need the concepts of the prefixes and suffixes of a string. What are they? Look at the following example and I am sure you will see it at once:
    Example: String = "abab"

    Prefix   Suffix   Length
    a        b        1
    ab       ab       2
    aba      bab      3
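The table can also be produced mechanically. The following is a minimal sketch; the function name properAffixes is an assumption made for this example.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>
using std::string;
using std::vector;

// Returns the (prefix, suffix) pair for every length 1 .. s.length() - 1.
// Illustrative helper only.
vector<std::pair<string, string>> properAffixes(const string& s) {
    vector<std::pair<string, string>> result;
    for (size_t len = 1; len < s.length(); ++len) {
        // prefix: the first len characters, suffix: the last len characters
        result.push_back({s.substr(0, len), s.substr(s.length() - len)});
    }
    return result;
}
```

Calling properAffixes("abab") gives the three pairs (a, b), (ab, ab) and (aba, bab), in that order. Note that the whole string itself is not counted as a prefix or a suffix here.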

The example shows at a glance what prefixes and suffixes are. Now let us imitate how a person thinks: how does the human brain match two strings? See the figure below:
[Figure: matching by eye, with the first mismatch at c]
The first time through, the match fails when we reach c. But look: the part we have already matched contains aba, which is also a prefix of the pattern string. So we can slide the pattern so that its prefix aba lines up with that aba in the text, which saves us the wasted work of going back to zero. The core idea is to use the information that has already been matched to find the best position to start from next.
Now let us derive the formulas. P is the pattern string, T is the text being matched, i is the current index into T, and j is the current index into P.
When a mismatch occurs:

1. We know that P[0 .. j-1] == T[i-j .. i-1] holds:
[Figure: the matched part of P lined up with T]
If the formula alone is not clear, the figure should make it obvious.

2. Next, let us analyse the structure of the part that has already matched (the green segment).
We can see something curious about it:
[Figure: the matched part, with an identical prefix and suffix marked]
Namely, some prefix of this string is identical to some suffix of it. Here we introduce the concept of the longest common prefix and suffix.

String    Prefix   Suffix   Length of the longest common prefix and suffix
a         (none)   (none)   0
aa        a        a        1
abab      ab       ab       2
abcabc    abc      abc      3
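For short strings the value in the last column can be computed by brute force, which is handy for checking examples by hand. This is only a sketch of the definition, not the KMP method itself; the function name longestCommonPrefixSuffix is an assumption made for this example.

```cpp
#include <cassert>
#include <string>
using std::string;

// Length of the longest prefix of s (shorter than s itself) that is also a
// suffix of s. Brute force: try every length from longest to shortest.
int longestCommonPrefixSuffix(const string& s) {
    const int n = static_cast<int>(s.length());
    for (int len = n - 1; len >= 1; --len) {
        // compare the first len characters with the last len characters
        if (s.compare(0, len, s, n - len, len) == 0) {
            return len;
        }
    }
    return 0;  // no common prefix and suffix
}
```

For example, it returns 0 for "a", 1 for "aa" and 3 for "abcabc". The whole string is deliberately excluded, otherwise every string would trivially match itself.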

So now we can derive another formula. The part of P that has already matched is P[0 .. j-1]. If the length of its longest common prefix and suffix is k, then
P[0 .. k-1] == P[j-k .. j-1]
Figure:
[Figure: the prefix P[0 .. k-1] and the suffix P[j-k .. j-1] of the matched part]

3. Putting the two formulas above together, we get:

(1) P[0 .. j-1] == T[i-j .. i-1]
(2) P[0 .. k-1] == P[j-k .. j-1]
(3) By (1), the last k characters on both sides are equal, so P[j-k .. j-1] == T[i-k .. i-1]. Combined with (2): T[i-k .. i-1] == P[0 .. k-1]

If that is still unclear, look at the figure:
[Figure: the prefix P[0 .. k-1] lined up with T[i-k .. i-1]]
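As a quick sanity check, the following sketch tests conclusion (3) on one concrete mismatch. The strings, the indices and the helper name conclusionHolds are made up for illustration.

```cpp
#include <cassert>
#include <string>
using std::string;

// Checks conclusion (3): T[i-k .. i-1] == P[0 .. k-1].
// i is the index in T where the mismatch happened, and k is the length of the
// longest common prefix and suffix of the part of P that had matched.
bool conclusionHolds(const string& T, const string& P, int i, int k) {
    assert(0 <= k && k <= i);
    return T.compare(i - k, k, P, 0, k) == 0;
}
```

Take T = "abababca" and P = "ababc". The first four characters match and the mismatch happens at i = 4, j = 4 (T[4] is 'a', P[4] is 'c'). The matched part is "abab", so k = 2, and conclusionHolds(T, P, 4, 2) is true: T[2 .. 3] is "ab", the same as P[0 .. 1]. That is why matching can resume by comparing T[4] with P[2] without moving i back.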
4. After the first three steps you should understand the principle and the pattern behind it. So how do we implement it in a program? You may or may not have noticed already: the only difference from the brute-force method we started with is how i and j change when a mismatch occurs.
How do they differ? On a mismatch, BF resets j to zero and moves i back to just after the start of the substring that had matched.
With KMP, on a mismatch j is not set to zero but to k, where k is the length of the longest common prefix and suffix of P[0 .. j-1], and i does not move. If we work out k for every position of P (the pattern string) in advance, the job is done. Let us use a container named next to hold the value of k for each position,
so that the k for position j of P is stored in next[j].
Then we can write a first version of the program:

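bool KMP(string str,string pattern,const vector<int>& next){
    int i=0,j=0;
    while(i < str.length() && j < pattern.length()){
        if(str[i] == pattern[j]){
            i++;
            j++;
        }else if(j == 0){
            i++;                // nothing has matched yet: just move on in str
        }else{
            j = next[j];        // j falls back to next[j], i does not move
        }
    }
    return j == pattern.length();
}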

In case anyone is getting dizzy, let me stress this: the value k in next[j] does not come from P[0 .. j]. It is the length of the longest common prefix and suffix of P[0 .. j-1]. Be sure to keep this in mind!
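To make the off-by-one concrete, here is a brute-force sketch that fills next straight from this definition. The names borderLength and naiveNext are assumptions made for this example, and next[0] is simply set to 0 here.

```cpp
#include <cassert>
#include <string>
#include <vector>
using std::string;
using std::vector;

// Length of the longest common prefix and suffix of s (brute force).
int borderLength(const string& s) {
    const int n = static_cast<int>(s.length());
    for (int len = n - 1; len >= 1; --len) {
        if (s.compare(0, len, s, n - len, len) == 0) {
            return len;
        }
    }
    return 0;
}

// next[j] describes P[0 .. j-1], the part BEFORE position j.
// Assumption for this sketch: next[0] = 0.
vector<int> naiveNext(const string& P) {
    vector<int> next(P.length(), 0);
    for (size_t j = 1; j < P.length(); ++j) {
        next[j] = borderLength(P.substr(0, j));  // substr(0, j) is P[0 .. j-1]
    }
    return next;
}
```

For P = "ababc" this gives 0, 0, 0, 1, 2. Note that next[3] is 1 because it describes "aba", not "abab", and next[4] is 2 because it describes "abab". This brute-force construction is slow and only meant for checking small examples.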

So now everything depends on finding the next array. Once we have the next array, the problem is solved.

Fourth, finding the next array

Before computing the next array, look at the next array in the following example and try to spot the pattern:
[Figure: an example pattern string with its next array]
With the naked eye we can easily read off the values of the next array, but a program needs an algorithm. So let us analyse how the human brain works out these values. I treat computing the next array as a dynamic programming problem. If you do not know dynamic programming, that is fine, just follow the analysis below.
If you have never studied dynamic programming, think of it as mathematical induction.
Assume we have already computed the next array up to position j, and that next[j] = k. We now have this array:
[Figure: the next array filled in up to position j]
Now we want to find next[j + 1]. Since next[j] = k, the longest common prefix and suffix of the string P[0 .. j-1] has length k, which gives the following picture:
[Figure: P[0 .. j-1] with its common prefix and suffix of length k]
So, to find next[j + 1],
the first step is simply to check whether the (k+1)-th character of the string, P[k], equals the character at index (j + 1) - 1, that is, P[j]. It can be pictured like this:
[Figure: comparing P[k] with P[j]]
If P[k] equals P[j], then next[j + 1] = k + 1 (in the figure, the length of the yellow part plus the green part).
The equal case is the easy one. What if they are not equal? How do we find next[j + 1] then? Don't worry, first study the figure below:
[Figure: the matched part divided into A1, A2, B1 to B4 and C1 to C8]
From this figure we can derive the following:

1. Let the pattern string be P, and assume the next array has been filled in up to position j.
2. We have k = next[j]. In the figure, A1 has length k.
   Since next[j] is the length of the longest common prefix and suffix of P[0 .. j-1], we have A1 (prefix) = A2 (suffix).
3. Let k' = next[k]. In the same way, B1 = B2.
4. By the same reasoning as in points 2 and 3, C1 has length next[k'], and C1 = C2.
5. Combining points 2, 3 and 4 gives:
   ∵ B1 = B2 and C1 = C2
   ∴ C1 = C2 = C3 = C4
   ∵ A1 = A2
   ∴ B1 = B2 = B3 = B4
   ∴ C1 = C2 = C3 = C4 = C5 = C6 = C7 = C8
   Tidying up, we get:
   (1) A1 = A2
   (2) B1 = B4
   (3) C1 = C8
   ......
   (n) N1 = Nn (continue until P[0 .. n-1] has no common prefix and suffix left, that is, next[n] = 0 and n = 0)
   Each of these lines says that a prefix of P[0 .. j-1] equals a suffix of it, with lengths k, next[k], next[next[k]] and so on. So when P[k] and P[j] differ, the next candidate to try is k = next[k].
If it is still unclear, follow the animation below and think it through again:
[Animation: k falling back along next]
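The fallback chain can also be checked with a few lines of code. The sketch below lists k = next[j], next[k], next[next[k]], ... for a given position j, using a brute-force helper in place of the next array. The names borderLength and fallbackChain are assumptions made for this example.

```cpp
#include <cassert>
#include <string>
#include <vector>
using std::string;
using std::vector;

// Length of the longest common prefix and suffix of s (brute force).
int borderLength(const string& s) {
    const int n = static_cast<int>(s.length());
    for (int len = n - 1; len >= 1; --len) {
        if (s.compare(0, len, s, n - len, len) == 0) {
            return len;
        }
    }
    return 0;
}

// Follows k = next[j], k = next[k], ... until k reaches 0 and records every
// k on the way. Each recorded k is the length of a prefix of P[0 .. j-1]
// that is also a suffix of P[0 .. j-1].
vector<int> fallbackChain(const string& P, int j) {
    assert(0 <= j && j <= static_cast<int>(P.length()));
    vector<int> chain;
    int k = borderLength(P.substr(0, j));   // plays the role of next[j]
    while (k > 0) {
        chain.push_back(k);
        k = borderLength(P.substr(0, k));   // plays the role of next[k]
    }
    return chain;
}
```

For P = "abababc" and j = 6 the matched part is "ababab", and the chain is 4, 2: first A1 = "abab", then B1 = "ab". These are exactly the lengths worth trying when P[k] and P[j] differ.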
From these rules we can write a first method for computing the next array, as follows:

vector<int> initNext(string pattern){
    vector<int> next(pattern.length());
    int k;
    next[0] = 0;

    for(int j=0;j<pattern.length()-1;j++){
        k = next[j];
        
        while((k != 0 || next[k] != 0) && pattern[k] != pattern[j]){ // loop only while k has not reached 0 and the two characters differ
            k = next[k];
        }

        if(pattern[k] == pattern[j] && j != k){  // the two characters are equal and the two indices are different
            next[j+1] = k+1;
        }else{
            next[j+1] = 0;
        }
    }
    return next;
}

This code is a bit of a mess, because I tried to follow the logic of the analysis above as closely as I could, so it is a patchwork. Reading it from top to bottom should still help you understand how the next array is computed.
The code follows directly from the preliminary analysis above. Computing next this way is clearly not elegant, so let us analyse the code and simplify it.
First look at this part:

[Code screenshot not available: the while loop from the code above]
The first half of the loop condition is k != 0 || next[k] != 0. What does it mean? The figure makes it clear:

[Figure: a mismatch with k = 0]
The figure shows a mismatch, but at this point k = 0 and next[k] is still 0. We need some way to mark this state, in which k already points to the head of the pattern. So we might as well change next[0] from 0 to -1, as a marker that we have reached the end and there is nothing in front of it. The code can then be simplified to:
[Code screenshot not available]
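The screenshot is not available here. The following is one possible reconstruction of the code at this stage, assuming the only change is the -1 marker in next[0]; the function name initNextSentinel is made up for this sketch.

```cpp
#include <cassert>
#include <string>
#include <vector>
using std::string;
using std::vector;

// Two-loop version with next[0] = -1 as the "nothing in front" marker.
// A reconstruction for illustration, not the author's exact code.
vector<int> initNextSentinel(const string& pattern) {
    assert(!pattern.empty());
    const int n = static_cast<int>(pattern.length());
    vector<int> next(n);
    next[0] = -1;

    for (int j = 0; j < n - 1; j++) {
        int k = next[j];
        // fall back while there is still a candidate and the characters differ
        while (k != -1 && pattern[k] != pattern[j]) {
            k = next[k];
        }
        if (k != -1) {
            next[j + 1] = k + 1;  // here pattern[k] == pattern[j]
        } else {
            next[j + 1] = 0;      // ran off the head of the pattern
        }
    }
    return next;
}
```

For "ababc" this produces -1, 0, 0, 1, 2. The awkward j != k test from the first version is no longer needed, because k = -1 now covers the case where nothing can be extended.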
Haha, still a mess. Looking at this version, we can simplify further: when k == -1, the else branch is triggered and next[j + 1] = 0.
Isn't that statement the same as next[j + 1] = k + 1? (k is -1 in this case, and adding 1 gives 0 anyway.)
So it becomes:
[Code screenshot not available]
Unfortunately there are still two loops here, so let us compress them into one. The inner while loop has an odd property: as long as there is no match, it keeps iterating on k while j does not change. So why not control the value of j by hand?
[Code screenshot not available]
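The screenshot is not available here either. Below is one possible reconstruction of the single-loop version, assuming k is carried over between iterations and j is decremented by hand on a mismatch; the function name initNextSingleLoop is made up for this sketch.

```cpp
#include <cassert>
#include <string>
#include <vector>
using std::string;
using std::vector;

// Single-loop version: the inner while loop is replaced by holding j in place.
// A reconstruction for illustration, not the author's exact code.
vector<int> initNextSingleLoop(const string& pattern) {
    assert(!pattern.empty());
    const int n = static_cast<int>(pattern.length());
    vector<int> next(n);
    next[0] = -1;
    int k = -1;  // always equal to next[j] of the two-loop version

    for (int j = 0; j < n - 1; j++) {
        if (k != -1 && pattern[k] != pattern[j]) {
            k = next[k];
            j--;                 // cancel the j++ of the for loop so j stays put
        } else {
            next[j + 1] = ++k;   // covers both k == -1 and a matching character
        }
    }
    return next;
}
```

It gives the same result as the two-loop version, for example -1, 0, 0, 1, 2 for "ababc". The else branch relies on the earlier observation that next[j + 1] = 0 and next[j + 1] = k + 1 are the same statement when k is -1.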
I know this looks a little silly, but everything takes steps (the j-- is there to cancel out the j++ at the end of each iteration, forcing j to stay where it is). As you can see, the code has now been reduced to a single loop,
but it is still not very pretty. After a final tidy-up we get the following code:

vector<int> initNext(string pattern){
    vector<int> next(pattern.length());
    int k=-1,j=0;
    next[0] = -1;

    while(j < pattern.length() - 1){
        if(k == -1 || pattern[k] == pattern[j]){
            next[++j] = ++k;
        }
        else{
            k = next[k];
        }
    }
    return next;
}

This is the version we all know and love. Compared with the code in the last figure, the for loop has simply become a while loop, and the if and else branches have been swapped by negating the condition.
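One way to gain confidence in this version is to compare it against the definition on a few patterns. The sketch below repeats the final loop and checks each entry against a brute-force computation; the helper names borderLength and agreesWithDefinition are assumptions made for this example.

```cpp
#include <cassert>
#include <string>
#include <vector>
using std::string;
using std::vector;

// The final version from above, with int lengths to avoid unsigned arithmetic.
vector<int> initNext(const string& pattern) {
    assert(!pattern.empty());
    const int n = static_cast<int>(pattern.length());
    vector<int> next(n);
    int k = -1, j = 0;
    next[0] = -1;
    while (j < n - 1) {
        if (k == -1 || pattern[k] == pattern[j]) {
            next[++j] = ++k;
        } else {
            k = next[k];
        }
    }
    return next;
}

// Length of the longest common prefix and suffix of s (brute force).
int borderLength(const string& s) {
    const int n = static_cast<int>(s.length());
    for (int len = n - 1; len >= 1; --len) {
        if (s.compare(0, len, s, n - len, len) == 0) {
            return len;
        }
    }
    return 0;
}

// True if next[0] == -1 and, for every j >= 1, next[j] is the length of the
// longest common prefix and suffix of pattern[0 .. j-1].
bool agreesWithDefinition(const string& pattern) {
    const vector<int> next = initNext(pattern);
    if (next[0] != -1) {
        return false;
    }
    for (size_t j = 1; j < pattern.length(); ++j) {
        if (next[j] != borderLength(pattern.substr(0, j))) {
            return false;
        }
    }
    return true;
}
```

A check like agreesWithDefinition("abcabc") passes, and the table for "abcabc" is -1, 0, 0, 0, 1, 2. The brute-force side is far slower than the loop it checks, so it is only useful as a test.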

Fifth, the final touch

After the explanation above, you know the principle of the KMP algorithm and how to compute the next array. In the spirit of excellence, there is one last thing to fix, because the code above has a flaw. It is not fatal, but look at the following case:
[Figure: a mismatch where pattern[next[j]] equals pattern[j]]
From the figure it is easy to see that at the mismatch, pattern[j] != text[i]. With this particular pattern string, which is clearly an "AA"-type string (its first half and second half are the same), we get the following situation:
pattern[next[j]] == pattern[j]. When that happens:
Proof:
∵ pattern[j] != text[i]
let k = next[j];
by the property of next: pattern[0 .. k-1] == pattern[j-k .. j-1]
∵ pattern[k] == pattern[j]
∴ pattern[k] != text[i]

So it is clear that in this case the fallback is wasted work (j falls back to k, but the comparison at k cannot succeed either). We therefore apply a patch when constructing the next array:

vector<int> initNext(string pattern){
    vector<int> next(pattern.length());
    int k=-1,j=0;
    next[0] = -1;

    while(j < pattern.length() - 1){
        if(k == -1 || pattern[k] == pattern[j]){
            if(pattern[++j] == pattern[++k]){
                next[j] = next[k];   // pattern[j] == pattern[k]: falling back to k would fail again, so reuse next[k]
            }else{
                next[j] = k;
            }
        }
        else{
            k = next[k];
        }
    }
    return next;
}

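bool search(string pattern,string text){
    if(pattern.empty()) return true;   // an empty pattern matches trivially; initNext needs at least one character
    vector<int> next = initNext(pattern);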
    int i=0,j=0;
    const int pLen = pattern.length();
    const int tLen = text.length();

    while((j < pLen) && (i < tLen)){
        if(j == -1 || text[i] == pattern[j]){
            i++;
            j++;
        }
        else{
            j = next[j];
        }
    }
    return j == pLen;
}
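To see what the patch buys us, the sketch below builds the table with and without it and counts how often j falls back during a search. The names buildNext, SearchResult and searchCounting, the patched flag and the test strings are all assumptions made for this example.

```cpp
#include <cassert>
#include <string>
#include <vector>
using std::string;
using std::vector;

// Builds the next table. With patched == true the patch from this section is
// applied; with patched == false it is the plain version from section four.
vector<int> buildNext(const string& pattern, bool patched) {
    assert(!pattern.empty());
    const int n = static_cast<int>(pattern.length());
    vector<int> next(n);
    int k = -1, j = 0;
    next[0] = -1;
    while (j < n - 1) {
        if (k == -1 || pattern[k] == pattern[j]) {
            ++j;
            ++k;
            // Patch: if pattern[j] == pattern[k], falling back to k would
            // fail on the same text character, so reuse next[k] instead.
            next[j] = (patched && pattern[j] == pattern[k]) ? next[k] : k;
        } else {
            k = next[k];
        }
    }
    return next;
}

struct SearchResult {
    bool found;
    int fallbacks;  // how many times j = next[j] was executed
};

// Same loop as the search function above, with a fallback counter added.
SearchResult searchCounting(const string& pattern, const string& text, bool patched) {
    const vector<int> next = buildNext(pattern, patched);
    const int pLen = static_cast<int>(pattern.length());
    const int tLen = static_cast<int>(text.length());
    int i = 0, j = 0, fallbacks = 0;
    while (j < pLen && i < tLen) {
        if (j == -1 || text[i] == pattern[j]) {
            ++i;
            ++j;
        } else {
            j = next[j];
            ++fallbacks;
        }
    }
    return {j == pLen, fallbacks};
}
```

For the pattern "aaaab" the plain table is -1, 0, 1, 2, 3 and the patched table is -1, -1, -1, -1, 3. Searching the text "aaacaaaab", both versions find the pattern, but the plain table falls back 4 times at the 'c' (to 2, 1, 0 and then -1), while the patched table falls back once, straight to -1.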
  • Conclusion:
    Building the next array is linear in the length of pattern and the matching pass is linear in the length of str, so KMP brings the time complexity down to O(n + m), in sharp contrast with the brute-force method. In fact, among the many string matching algorithms, KMP is only a drop in the ocean. No algorithm is simply good or bad; different algorithms suit different scenarios. If you are interested in string matching, you might take a look at the Sunday algorithm next. Learning has no end, so let us keep at it together!

Hey, drawing these figures was not easy. If this article helped you, please leave a like before you go. If you have any questions, feel free to message me or leave a comment to discuss.

Origin blog.csdn.net/qq_42359956/article/details/105242127