Luogu P3375 [Template] KMP String Matching

The KMP algorithm solves the string matching problem: given two strings A and B, you must answer whether B is a substring of A, how many times B occurs in A, and at which positions in A it occurs.

The practical point of KMP is this: if you post something on Luogu, kkksc03 can use KMP to check whether you used any inappropriate words and mask them out of your sentence (for example, the meme phrase "chicken you are so beautiful, cxk" would be masked to "****cxk"), and you will be punished according to how many times inappropriate words appear in your sentence.

 

For example: A: GCAKIOI, B: GC. Here we say that B is a substring of A.

We call A, the string being searched, the main string, and B, the string we search for, the pattern string.

The naive approach is to enumerate every position in A where the first letter of B could be placed and check whether B matches there. The time complexity of this approach is O(mn), so it will obviously time out on a long article.

We notice that most attempts fail during string matching. Is there an algorithm that makes use of these failed attempts?

KMP is exactly that.

KMP uses the information gained before a mismatch to minimize the number of comparisons between the pattern string and the main string, and so achieves fast matching.

Let the main string be T.

Let the pattern string be W.

In brute-force matching, we compare T[0] with W[0]; if they are equal we compare the next characters, and so on until two characters differ. We then throw away everything we have matched so far and start over by comparing T[1] with W[0], repeating until we reach the end of the main string or find a full match. Because this method discards the earlier matching information, its efficiency is greatly reduced.
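The brute-force procedure described above can be sketched as follows. This is only an illustration, not code from the original post: the function name naive_find is made up, it works on 0-indexed std::string, and it returns 1-based start positions so the numbers agree with the rest of this article.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Brute-force matching: try every start position s in T and compare W
// character by character. Everything matched so far is thrown away on a
// mismatch, which is why the worst case is O(n*m).
std::vector<int> naive_find(const std::string& T, const std::string& W) {
    std::vector<int> pos;
    const std::size_t n = T.size(), m = W.size();
    if (m == 0 || m > n) return pos;
    for (std::size_t s = 0; s + m <= n; ++s) {
        std::size_t k = 0;
        while (k < m && T[s + k] == W[k]) ++k;  // extend the match as far as possible
        if (k == m) pos.push_back(static_cast<int>(s) + 1);  // 1-based start
    }
    return pos;
}
```

For A = GCAKIOI and B = GC this returns the single position 1.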

Let's look at how KMP works.

In KMP, for each pattern string we precompute information about how the pattern matches against itself (this depends only on the pattern string, so it can be preprocessed; the procedure is described later). When a match fails, this information lets us shift the pattern string as far as possible, which reduces the number of comparisons.

For example, after a mismatch in naive matching we would shift the pattern string to the right and try matching against the main string again. KMP computes the shift distance like this: within the part of the pattern string that has already been matched, find the longest proper prefix that is also a suffix, then shift the pattern so that this prefix lands on top of that suffix.

We use two pointers i and j to express that A[i-j+1 ... i] and B[1 ... j] are exactly equal. That is, i keeps increasing, and j changes accordingly as i grows, always satisfying the condition that the string of length j ending at A[i] exactly matches the first j characters of B. What we need to examine now is the relationship between A[i+1] and B[j+1].

 

  • When A[i+1] = B[j+1], we increase both i and j by 1.
  • Otherwise, we reduce j to a smaller value for which A[i-j+1 ... i] and B[1 ... j] still match, and then try to match A[i+1] with B[j+1] again.

For example:

T: a b a b a b a a b a b a c b

W: a b a b a c b

When i = j = 5 we have T[6] != W[6], which tells us that j cannot stay equal to 5. We have to change j to a smaller value j' such that the first j' letters of W[1 ... j] are the same as its last j' letters, because only then does changing j to j' (that is, shifting W to the right by j - j' positions) preserve the property of i and j. Obviously, the larger j' the better. Here W[1 ... 5] = ababa has been matched, and both its first three letters and its last three letters are aba, so the largest j' is 3. The situation then looks like this:
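To make the definition of j' concrete, here is a minimal sketch that finds it directly from the definition, by trying every candidate length from the largest down. The name longest_border is hypothetical, the string is 0-indexed, and this deliberately slow version is only meant to show what is being computed, not how KMP computes it.

```cpp
#include <string>

// Largest j' < j such that the first j' letters of the length-j prefix of w
// equal its last j' letters. Straight from the definition, O(j^2) per call.
// Assumes 0 <= j <= w.size().
int longest_border(const std::string& w, int j) {
    for (int len = j - 1; len > 0; --len) {
        // compare prefix w[0 .. len) with suffix w[j-len .. j)
        if (w.compare(0, len, w, j - len, len) == 0) return len;
    }
    return 0;  // no non-empty proper prefix is also a suffix
}
```

With w = "ababacb", longest_border(w, 5) is 3 and longest_border(w, 3) is 1, which are the values of j' used in this walkthrough.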

T: a b a b a b a a b a b a c b

W:     a b a b a c b

Now i = 5 and j = 3. We find that T[6] and W[4] are equal, and then that T[7] and W[5] are equal (two steps at once).

So the situation now is i = 7, j = 5:

T: a b a b a b a a b a b a c b

W:     a b a b a c b

Now T[8] != W[6], so we keep reducing j. We have just worked out that j' = 3 when j = 5, so we can reuse that value directly (this also shows that j' has nothing to do with the main string; it depends only on the pattern string).

So the situation becomes:

T: a b a b a b a a b a b a c b

W:         a b a b a c b

The new j = 3 still does not satisfy T[i+1] = W[j+1], so we have to take j' again.

When j = 3, the first and last letters of aba are both a, so j' = 1.

The new situation:

T: a b a b a b a a b a b a c b

W:             a b a b a c b

This still does not match, so j has to be reduced to j' = 0 (we define j' = 0 when j = 1).

T: a b a b a b a a b a b a c b

W:               a b a b a c b

Finally T[8] = W[1], so i becomes 8 and j becomes 1. The following characters all match one after another until j = 7 is reached, and we can conclude that W is a substring of T. We also know where it starts in the main string: position i-m+1 with the 1-based i used in this walkthrough (the code prints i+1-m+1 because its loop variable i starts from 0).
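The walkthrough above can be replayed by recording the value of j after each character of T has been consumed. The sketch below is an illustration under a few assumptions: the names build_nxt and trace_j are made up, strings are 0-indexed std::string rather than the 1-indexed char arrays used in this article, and the nxt table is built with the preprocessing that is explained later in the article.

```cpp
#include <string>
#include <vector>

// nxt[j] = length of the longest proper prefix of W[0 .. j) that is also its suffix.
std::vector<int> build_nxt(const std::string& W) {
    const int m = static_cast<int>(W.size());
    std::vector<int> nxt(m + 1, 0);
    for (int i = 1, j = 0; i < m; ++i) {
        while (j > 0 && W[j] != W[i]) j = nxt[j];
        if (W[j] == W[i]) ++j;
        nxt[i + 1] = j;
    }
    return nxt;
}

// Returns the matched length j after each character of T has been processed.
std::vector<int> trace_j(const std::string& T, const std::string& W) {
    const int m = static_cast<int>(W.size());
    if (m == 0) return std::vector<int>(T.size(), 0);
    const std::vector<int> nxt = build_nxt(W);
    std::vector<int> js;
    int j = 0;
    for (char c : T) {
        if (j == m) j = nxt[j];                 // after a full match, fall back first
        while (j > 0 && W[j] != c) j = nxt[j];  // mismatch: shrink j to j'
        if (W[j] == c) ++j;                     // match: extend by one
        js.push_back(j);
    }
    return js;
}
```

For T = abababaababacb and W = ababacb the recorded values are 1 2 3 4 5 4 5 1 2 3 4 5 6 7: j drops from 5 to 3 at T[6], drops all the way to 0 at T[8] before matching W[1], and reaches 7 at T[14], so the match starts at position 14-7+1 = 8.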

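The code for this part is very short, since it is just one for loop:

inline void kmp()
{
    int j=0;
    for(int i=0;i<n;i++)
    {
        // mismatch and j has not reached 0 yet: fall back to j'
        while(j>0&&b[j+1]!=a[i+1]) j=nxt[j];
        // the next characters agree: extend the match by one
        if(b[j+1]==a[i+1]) j++;
        if(j==m) 
        {
            printf("%d\n",i+1-m+1);
            j=nxt[j]; 
            // to output only the first position: break here
            // to output all positions: j=nxt[j];
            // to output only non-overlapping positions: j=0
        }
    }
}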
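The three options listed in the comments at the end of kmp() can be tried side by side. The following sketch is not the author's code: kmp_find and the Report enum are hypothetical names, it uses 0-indexed std::string, and it returns the 1-based positions instead of printing them.

```cpp
#include <string>
#include <vector>

enum class Report { First, All, NonOverlapping };

// Returns 1-based start positions of W in T according to the chosen mode.
std::vector<int> kmp_find(const std::string& T, const std::string& W, Report mode) {
    std::vector<int> pos;
    const int n = static_cast<int>(T.size());
    const int m = static_cast<int>(W.size());
    if (m == 0) return pos;

    // Preprocessing: nxt[j] is the j' used when the matched length is j.
    std::vector<int> nxt(m + 1, 0);
    for (int i = 1, j = 0; i < m; ++i) {
        while (j > 0 && W[j] != W[i]) j = nxt[j];
        if (W[j] == W[i]) ++j;
        nxt[i + 1] = j;
    }

    int j = 0;
    for (int i = 0; i < n; ++i) {
        while (j > 0 && W[j] != T[i]) j = nxt[j];
        if (W[j] == T[i]) ++j;
        if (j == m) {
            pos.push_back(i + 1 - m + 1);            // same formula as in kmp()
            if (mode == Report::First) break;        // first position only
            j = (mode == Report::All) ? nxt[j] : 0;  // all vs. non-overlapping
        }
    }
    return pos;
}
```

With T = aaaa and W = aa, Report::All gives 1 2 3, Report::NonOverlapping gives 1 3, and Report::First gives 1. Resetting j to nxt[j] keeps the overlap with the match just found, while resetting it to 0 forces the next match to start after it.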

 

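This raises a question: why is the time complexity linear?

We start from the value of j. Every execution of the while loop decreases j (but never below 0), and after the while loop j increases by at most 1, so over the whole process j is increased by 1 at most n times. Therefore j has at most n chances to decrease. This tells us that the while loop runs at most n times in total. Amortized over the for loop, one iteration of the for loop costs O(1), so the total time complexity is O(n), where n is the length of the main string. The same analysis works for the preprocessing described below, which gives a preprocessing time complexity of O(m), where m is the length of the pattern string.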
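The amortized argument can be checked empirically by counting how often each part of the loop runs. This is a small instrumented sketch, not part of the original solution: count_steps and StepCount are hypothetical names and the strings are 0-indexed.

```cpp
#include <string>
#include <vector>

struct StepCount {
    long long increments;   // how many times j++ ran
    long long while_steps;  // how many times the while loop body ran
};

// Runs the matching loop of KMP on T and W and counts the two kinds of steps.
StepCount count_steps(const std::string& T, const std::string& W) {
    StepCount c{0, 0};
    const int m = static_cast<int>(W.size());
    if (m == 0) return c;

    std::vector<int> nxt(m + 1, 0);           // preprocessing, not counted here
    for (int i = 1, j = 0; i < m; ++i) {
        while (j > 0 && W[j] != W[i]) j = nxt[j];
        if (W[j] == W[i]) ++j;
        nxt[i + 1] = j;
    }

    int j = 0;
    for (char ch : T) {
        while (j > 0 && W[j] != ch) { j = nxt[j]; ++c.while_steps; }
        if (W[j] == ch) { ++j; ++c.increments; }
        if (j == m) j = nxt[j];
    }
    return c;
}
```

For T = abababaababacb and W = ababacb this counts 14 increments and only 4 while steps. Whatever the input, while_steps can never exceed increments, and increments can never exceed the length of T, which is the bound used in the argument above.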

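Next comes the preprocessing.

The preprocessing does not have to follow the definition literally, which would cost O(m^2) or even O(m^3). Instead, we can obtain nxt[n] from nxt[1], nxt[2], ..., nxt[n-1].

For example:

W  : a b a b a c b

nxt: 0 0 1 2 ? ?

Suppose we have this string and already know nxt[1~4]. How do we find nxt[5] and nxt[6]?

Since nxt[4]=2, we know that w[1~2]=w[3~4]. When computing nxt[5] we see that w[3]=w[5], so we can add 1 to the previous result and get a longer common prefix and suffix: nxt[5]=nxt[4]+1=3.

W  : a b a b a c b

nxt: 0 0 1 2 3 ?

Is nxt[6] then nxt[5]+1 as well? Clearly not, because w[nxt[5]+1]!=w[6]. In that case we can step back and ask whether nxt[6] can be obtained from a shorter prefix contained in the one for nxt[5], that is, whether nxt[6]=nxt[nxt[5]]+1.

In fact this fails too (w[nxt[nxt[5]]+1]=w[2]=b, not c), and stepping back further does not help either (w[1]=a), so we know that nxt[6]=0.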
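The single step "compute nxt[i+1] from nxt[1..i]" used in this example can be isolated as a small function. This is a sketch under stated assumptions: the name extend is hypothetical, and to keep the article's 1-based indices, w is a std::string with one padding character at w[0] and nxt is a vector whose entry 0 is unused.

```cpp
#include <string>
#include <vector>

// Computes nxt[i+1] from the already known nxt[1..i].
// w is 1-indexed through a padding character at w[0]; requires i + 1 < w.size().
int extend(const std::string& w, const std::vector<int>& nxt, int i) {
    int k = nxt[i];
    // Step back through shorter and shorter common prefix/suffix lengths
    // until the next characters agree or nothing is left.
    while (k > 0 && w[k + 1] != w[i + 1]) k = nxt[k];
    return (w[k + 1] == w[i + 1]) ? k + 1 : 0;
}
```

With w = " ababacb" and nxt[1..4] = 0 0 1 2, extend(w, nxt, 4) gives 3: w[3] equals w[5], so nxt[4] is extended by one. After storing nxt[5] = 3, extend(w, nxt, 5) tries k = 3, 1, 0 in turn, finds no agreement with w[6] = c, and gives 0.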

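So the preprocessing code looks like this:

inline void pre()
{
    nxt[1]=0;// by definition nxt[1]=0
    int j=0;
    rep(i,1,m-1)// rep(i,a,n) loops i from a to n inclusive; the macro is defined in the complete code
    {
        while(j>0&&b[j+1]!=b[i+1]) j=nxt[j];
        // cannot extend the match and j has not dropped to 0 yet: step back
        if(b[j+1]==b[i+1]) j++;
        // if the characters match, j++
        nxt[i+1]=j;// store the value for the next position
    }
}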

 

Complete code:

#include<iostream>
#include<cstdio>
#include<cstdlib>
#include<cstring>
#include<string>
#include<cmath>
#include<queue>
#include<algorithm>
#include<iomanip>
using namespace std;
#define rep(i,a,n) for(int i=a;i<=n;i++)
#define per(i,n,a) for(int i=n;i>=a;i--)
typedef long long ll;
ll read()
{
    ll ans=0;
    char last=' ',ch=getchar();
    while(ch<'0'||ch>'9') last=ch,ch=getchar();
    while(ch>='0'&&ch<='9') ans=ans*10+ch-'0',ch=getchar();
    if(last=='-') ans=-ans;
    return ans;
}

char a[1000005],b[1000005];
int nxt[1000005],n,m;

inline void pre()
{
    nxt[1]=0;
    int j=0;
    rep(i,1,m-1)
    {
        while(j>0&&b[j+1]!=b[i+1]) j=nxt[j]; 
        if(b[j+1]==b[i+1]) j++;
        nxt[i+1]=j; 
    }
}

inline void kmp()
{
    int j=0;
    for(int i=0;i<n;i++)
    {
        while(j>0&&b[j+1]!=a[i+1]) j=nxt[j];
        if(b[j+1]==a[i+1]) j++;
        if(j==m) 
        {
            printf("%d\n",i+1-m+1);
            j=nxt[j]; 
        }
    }
    // after all positions, print the nxt value of every prefix of the pattern
    rep(i,1,m) printf("%d ",nxt[i]);
}

int main()
{
    scanf("%s%s",a+1,b+1);
    n=strlen(a+1),m=strlen(b+1);
    pre();
    kmp();
    return 0;
}

 


Origin www.cnblogs.com/lcezych/p/11002026.html