KMP algorithm

The KMP algorithm is an improved string-matching algorithm. It differs from the BF (brute-force) algorithm in what happens each time the substring fails to match the main string: instead of always restarting from the substring's first character, KMP changes the position in the substring from which matching resumes.

The difference between the two algorithms is described below:

Main string: BABCDABABCDABCED

Substring: ABCDABCED

BF algorithm:

Step 1:

B A B C D A B A B C D A B C E D
A                              

Match the first character of the substring against the first character of the main string. The match fails.

Step 2:

B A B C D A B A B C D A B C E D
  A B C D A B C                

Match the first character of the substring against the second character of the main string. The first six characters match, but the seventh (C against A) fails.

Step 3:

B A B C D A B A B C D A B C E D
    A                          

Match the first character of the substring against the third character of the main string. The match fails.

...........

Step 7:

B A B C D A B A B C D A B C E D
            A                  

Match the first character of the substring against the seventh character of the main string. The match fails.

Step 8:

B A B C D A B A B C D A B C E D
              A B C D A B C E D

The match succeeds. This is the BF algorithm: after each failed match, the first character of the substring is matched against the next character position of the main string.
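The BF steps above can be sketched in a few lines. This is only a minimal sketch, not part of the original program: the function name bf_find is made up for illustration, and it returns a 0-based index, so the match found in step 8 (the eighth character position) comes back as 7.

```cpp
#include <string>

// Brute-force (BF) search: returns the 0-based index of the first
// occurrence of sub in text, or -1 if there is none.
// The name bf_find is hypothetical, chosen for this sketch.
int bf_find(const std::string& text, const std::string& sub)
{
    if (sub.size() > text.size())
        return -1;
    for (std::size_t start = 0; start + sub.size() <= text.size(); ++start) {
        std::size_t j = 0;
        // Compare the substring against the main string from this start position.
        while (j < sub.size() && text[start + j] == sub[j])
            ++j;
        if (j == sub.size())
            return static_cast<int>(start);   // every character matched
        // Otherwise move the start one position right and begin again at sub[0].
    }
    return -1;
}
```

Calling bf_find("BABCDABABCDABCED", "ABCDABCED") walks through the same start positions as steps 1 to 8.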

KMP algorithm:

Step 1:

B A B C D A B A B C D A B C E D
A                              
The first character of the substring does not match the main string.

Step 2:

B A B C D A B A B C D A B C E D
  A B C D A B C

These two steps are the same as bf

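Step 3:

B A B C D A B A B C D A B C E D
          A B C

This is where KMP differs from BF. The part of the substring that matched in step 2 (ABCDAB) begins and ends with the same characters (AB), so the substring can jump ahead until its leading AB lines up with the AB the main string has just matched, and the comparison resumes at C. For now just note the difference; I will talk about how it is implemented later.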

Step 4:

B A B C D A B A B C D A B C E D
              A B C D A B C E D

If this is not clear yet, I will explain it in detail below; following the subscripts in the program then shows the reason for each jump.

The KMP algorithm is divided into two parts:

① the next array (the core of the KMP algorithm, which decides the jumps shown above)

② string matching

Before explaining the next array, we first need prefixes and suffixes, and the longest length at which a prefix matches a suffix.

For example, take the substring above: ABCDABCED. (A prefix is read starting from the first character. A suffix ends at the last character, but it is still read from left to right, not backwards; see the cases below.)

Before the first character there is no string, so there is no prefix or suffix and no longest length; I initialize the length to 0.

Before the second character, the string is A. A prefix or suffix cannot be the whole string itself, so there is no prefix or suffix, and the length is 0.

Before the third character, the string is AB. The prefix is A and the suffix is B: no match, so the length is 0.

Before the fourth character, the string is ABC.
    First, the prefix is A and the suffix is C: no match, length 0.
    Second, the prefix is AB and the suffix is BC: no match, length 0.
So the longest length is 0.

Before the fifth character, the string is ABCD.
    First, the prefix is A and the suffix is D: no match, length 0.
    Second, the prefix is AB and the suffix is CD: no match, length 0.
    Third, the prefix is ABC and the suffix is BCD: no match, length 0.
So the longest length is 0.

Before the sixth character, the string is ABCDA.
    First, the prefix is A and the suffix is A: match, length 1.
    Second, the prefix is AB and the suffix is DA: no match, length 0.
    Third, the prefix is ABC and the suffix is CDA: no match, length 0.
    Fourth, the prefix is ABCD and the suffix is BCDA: no match, length 0.
So the longest length is 1.

Before the seventh character, the string is ABCDAB.
    First, the prefix is A and the suffix is B: no match, length 0.
    Second, the prefix is AB and the suffix is AB: match, length 2.
    Third, the prefix is ABC and the suffix is DAB: no match, length 0.
    Fourth, the prefix is ABCD and the suffix is CDAB: no match, length 0.
    Fifth, the prefix is ABCDA and the suffix is BCDAB: no match, length 0.
So the longest length is 2.

Before the eighth character, the string is ABCDABC.
    First, the prefix is A and the suffix is C: no match, length 0.
    Second, the prefix is AB and the suffix is BC: no match, length 0.
    Third, the prefix is ABC and the suffix is ABC: match, length 3.
    Fourth, the prefix is ABCD and the suffix is DABC: no match, length 0.
    Fifth, the prefix is ABCDA and the suffix is CDABC: no match, length 0.
    Sixth, the prefix is ABCDAB and the suffix is BCDABC: no match, length 0.
So the longest length is 3.

Before the ninth character, the string is ABCDABCE.
    First, the prefix is A and the suffix is E: no match, length 0.
    Second, the prefix is AB and the suffix is CE: no match, length 0.
    Third, the prefix is ABC and the suffix is BCE: no match, length 0.
    Fourth, the prefix is ABCD and the suffix is ABCE: no match, length 0.
    Fifth, the prefix is ABCDA and the suffix is DABCE: no match, length 0.
    Sixth, the prefix is ABCDAB and the suffix is CDABCE: no match, length 0.
    Seventh, the prefix is ABCDABC and the suffix is BCDABCE: no match, length 0.
So the longest length is 0.

Then according to the longest length we found, we can draw the next table,

subscript    1 2 3 4 5 6 7 8 9
substring    A B C D A B C E D
next         0 0 0 0 0 1 2 3 0

Other people's next array and KMP code may differ from what I wrote here, but this is how I understand it. If something is wrong, please point it out.

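The hand calculation above can be checked by brute force. The following is a rough sketch that applies the prefix/suffix definition directly; the name next_by_definition and the choice of a std::vector with index 0 left unused (so the subscripts match the 1-based table) are assumptions made for this example, and it is far slower than the real next array routine.

```cpp
#include <string>
#include <vector>

// Returns a 1-based table (index 0 is unused). table[i] is the length of the
// longest proper prefix of the text before the i-th character that is also a
// suffix of that text. Hypothetical helper, written for illustration only.
std::vector<int> next_by_definition(const std::string& sub)
{
    std::vector<int> table(sub.size() + 1, 0);
    for (std::size_t i = 1; i <= sub.size(); ++i) {
        const std::string before = sub.substr(0, i - 1);   // text before the i-th character
        // Try every proper prefix length; the last one that matches is the longest.
        for (std::size_t len = 1; len < before.size(); ++len) {
            if (before.compare(0, len, before, before.size() - len, len) == 0)
                table[i] = static_cast<int>(len);
        }
    }
    return table;
}
```

For ABCDABCED, entries 1 to 9 of the result are 0 0 0 0 0 1 2 3 0, the same as the next row in the table.
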
This next array works, but it can be improved. Look again at the comparison of BF and KMP above: is step 3 of KMP redundant? It is. In step 2 the second C of ABCDABC did not match the A in the main string, and step 3 lines up ABC so that its C meets that same A, which must fail too. That is why the next array can be improved, and the improvement is simple: if a character is equal to the character the substring would jump back to, give it that earlier character's next value. See the table:

subscript    1 2 3 4 5 6 7 8 9
substring    A B C D A B C E D
next         0 0 0 0 0 0 0 3 0

The second B has the same next value of 0 as the first B, and the second C has the same 0 next value as the first C.

This is the next array and its improved next array.
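The improvement rule can also be written as a separate pass over the plain table. This is a sketch only: improve_next is a hypothetical name, and it assumes the plain table is passed in as a std::vector with index 0 unused and one entry per character, matching the 1-based subscripts of the tables.

```cpp
#include <string>
#include <vector>

// plain: 1-based table (index 0 unused) of longest prefix/suffix lengths,
// with plain.size() == sub.size() + 1. Returns the improved table with the
// same indexing. Hypothetical helper, written for illustration only.
std::vector<int> improve_next(const std::string& sub, const std::vector<int>& plain)
{
    std::vector<int> improved(plain);
    for (std::size_t i = 2; i <= sub.size(); ++i) {
        // After a mismatch at the i-th character, the substring would resume at
        // its (plain[i] + 1)-th character, which is sub[plain[i]] in 0-based terms.
        const std::size_t resume = static_cast<std::size_t>(plain[i]);
        if (sub[i - 1] == sub[resume])
            improved[i] = improved[resume + 1];   // same character, so it would fail again
    }
    return improved;
}
```

Given ABCDABCED and the plain values 0 0 0 0 0 1 2 3 0, entries 1 to 9 of the result are 0 0 0 0 0 0 0 3 0.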

next array program:

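// Fills next[1..s2.length()] with the improved next values (1-based subscripts).
void Get_next(int *next, string s2)
{
    int n = s2.length();
    int i = 1;    // 1-based position; the loop computes next[i + 1]
    int j = 0;    // 1-based position in the prefix being compared; 0 means "start over"
    next[1] = 0;
    while(i < n)    // stop at n so nothing is written past next[n]
    {
        if(j == 0 || s2[i - 1] == s2[j - 1])
        {
            i++;
            j++;
            if(s2[i - 1] == s2[j - 1])
                next[i] = next[j];    // same character would fail again, reuse its value
            else
                next[i] = j - 1;      // length of the matched prefix
        }
        else
            j = (j == 1) ? 0 : next[j] + 1;    // fall back to a shorter prefix, not straight to 0
    }
}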

If you understand how the next array is evaluated, you should be able to write the KMP matching function by following the idea of the BF algorithm.

The following is the overall program of the kmp algorithm:

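#include <iostream>
#include <string>
#include <vector>
using namespace std;

// Build the improved next array; next[1..s2.length()] uses 1-based subscripts
void Get_next(int *next, string s2)
{
    int n = s2.length();
    int i = 1;    // 1-based position; the loop computes next[i + 1]
    int j = 0;    // 1-based position in the prefix being compared; 0 means "start over"
    next[1] = 0;
    while(i < n)    // stop at n so nothing is written past next[n]
    {
        if(j == 0 || s2[i - 1] == s2[j - 1])
        {
            i++;
            j++;
            if(s2[i - 1] == s2[j - 1])
                next[i] = next[j];    // same character would fail again, reuse its value
            else
                next[i] = j - 1;      // length of the matched prefix
        }
        else
            j = (j == 1) ? 0 : next[j] + 1;    // fall back to a shorter prefix, not straight to 0
    }
}
// Matching part: returns the 0-based index of the first match at or after pos, or -1
int kmp(string s1, string s2, int pos)
{
    if(pos < 0)
        pos = 0;
    vector<int> next(s2.length() + 2);    // sized from the substring instead of a fixed 100
    Get_next(next.data(), s2);
    size_t i = pos;    // index into the main string
    size_t j = 0;      // index into the substring
    while(i < s1.length() && j < s2.length())
    {
        if(s1[i] == s2[j])
        {
            i++;
            j++;
        }
        else if(j == 0)
            i++;                // first character failed: move on in the main string
        else
            j = next[j + 1];    // resume at the position given by the next array
    }
    if(j >= s2.length())
        return i - s2.length();
    return -1;
}
int main()
{
    string s1 = "babcdababcdabced";
    string s2 = "abcdabced";
    int pos;    // 0-based position in s1 where the search starts
    cin >> pos;
    cout << kmp(s1, s2, pos) << endl;
    return 0;
}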
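
One simple way to sanity-check a hand-written matcher is to compare its answer with the standard library. The sketch below wraps std::string::find so that it returns -1 when nothing is found; the wrapper name find_with_std and the 0-based start position parameter are assumptions made for this example.

```cpp
#include <string>

// Returns the 0-based index of the first occurrence of sub in text at or
// after pos, or -1 if there is none, using the standard library search.
// Hypothetical wrapper, written for illustration only.
int find_with_std(const std::string& text, const std::string& sub, std::size_t pos = 0)
{
    const std::size_t found = text.find(sub, pos);
    return found == std::string::npos ? -1 : static_cast<int>(found);
}
```

For the two strings used in main, find_with_std("babcdababcdabced", "abcdabced") gives 7, which is the eighth character position from the walkthrough counted from 0.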

This program is just my personal thoughts and ideas. If there are any mistakes, please suggest them. If there is something you don't understand, please comment below. Thank you!


Origin http://43.154.161.224:23101/article/api/json?id=324857138&siteId=291194637