String pattern matching algorithm: KMP

Foreword

The world of algorithms is wonderful. I recently studied string pattern matching in Yan Weimin's Data Structures textbook and struggled with it for two days. Maybe it is a simple problem and I am just slow, ha ha. I read the section back and forth several times and finally understood it reasonably well, so I want to write down what I understand. If anything here is wrong or unclear, please leave a comment.

First, the traditional algorithm

Principle

Take the strings from the book as an example:

     i->  0 1 2 3 4 5 6 7 8....
Main:     a c a b a a b a a b c a c a a b c

     j->  0 1 2 3 4 5
Pattern:  a b a a b c

Let s[i] point into the main string and p[j] point into the pattern string. Initially i = 0 and j = 0. Compare p[j] with s[i]; if s[i] == p[j], advance both i and j and compare p[j+1] with s[i+1], and so on, until every character of the pattern has matched a run of consecutive characters in the main string. If a mismatch occurs, i goes back to the position just after where the current attempt started (after the first attempt this is the second position of the main string, i.e. i == 1) and j goes back to the first character of the pattern (j == 0). The comparison continues in this way until it finishes.

This algorithm looks fine at first glance. Let the main string have length n and the pattern string length m, and look at the time complexity in the following example:

     i->  0 1 2 3 4 5 6 7 8....
Main:     a b c d e f g h i j k l m n

     j->  0 1 2 3 4 5
Pattern:  i j k l m n

The example above needs only n comparisons, so the time complexity is O(n), which is clearly very efficient. Now look at the next example:

     i->  0 1 2 3 4 5 6 7 8....
Main:     a a a a a a a a a a b

     j->  0 1 2 3 4 5
Pattern:  a a a a a b

In this case every round fails only on the last character of the pattern. After the first round, i does not actually need to go back to s[1], because the leading characters of the main string have already been compared and are known to be a. Here the traditional algorithm repeats a lot of work (its worst case is O(n*m)), which is what the improved KMP algorithm addresses.
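
To make the difference between the two examples concrete, here is a minimal sketch in C that runs the traditional algorithm on plain C strings and counts character comparisons. The function name naive_count and its comparisons parameter are made up for this illustration and are not from the book; it returns a 0-based index, or -1 when there is no match.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper for illustration only (not from the book).
   Brute-force search that also counts character comparisons.
   Returns the 0-based index of the first match, or -1 if none. */
int naive_count(const char *s, const char *p, long *comparisons)
{
    size_t n = strlen(s), m = strlen(p);
    size_t i = 0, j = 0;

    *comparisons = 0;
    while (i < n && j < m) {
        ++*comparisons;
        if (s[i] == p[j]) {
            ++i;
            ++j;
        } else {
            i = i - j + 1;  /* back up to one past where this attempt began */
            j = 0;          /* restart from the first pattern character */
        }
    }
    return (j == m) ? (int)(i - m) : -1;
}
```

Called on the first example (abcdefghijklmn, ijklmn) it makes 14 comparisons, one per main-string character. On the second example (aaaaaaaaaab, aaaaab) it makes 36, because each of the five failed attempts re-compares five characters that were already known to be a.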

Code

Note: the code below differs slightly from the book. Here the first slot is not used to store the length of the string, i.e. strings start at index 0, whereas in the book the first slot of a string stores its length.

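// T is the main string, S is the pattern string, pos is the 0-based start index.
// Returns the 1-based position of the first match, or 0 if there is no match.
int  Index(HString *T,HString *S, int pos)
{
    int j=0;
    while(pos<T->length&&j<S->length)
    {
        if(T->ch[pos]==S->ch[j])
        {
            pos++;j++;          // characters match: advance both pointers
        }
        else
        {
            pos=pos-j+1;        // back up to one past where this attempt began
            j=0;                // restart from the first pattern character
        }
    }
    if(j>=S->length)            // the whole pattern has been matched
    {
        int len;
        len=pos-S->length+1;    // 1-based start position of the match
        printf("%d",len);
        return len;
    }
    else
    {
       return 0;
    }
}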

KMP algorithm

Principle

The basic idea of the KMP algorithm is that the main-string pointer i never backtracks. Instead the pattern string is shifted, and the position to shift to is computed from the substring that has already been matched.

Take the example from the book:

     i->  0 1 2 3 4 5 6 7 8....
Main:     a b a b c a b c a c b a b

     j->  0 1 2 3 4
Pattern:  a b c a c

The matching process is as follows:

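Round 1:
     i->  0 1 i 3 4 5 6 7 8....
Main:     a b a b c a b c a c b a b
     j->  0 1 j
Pattern:  a b c

Analysis: when matching reaches the third character, s[2] != p[2]. At this point i does not need to move; j moves to the first character of the pattern, i.e. s[2] is compared with p[0] next.

Round 2:
     i->  0 1 2 3 4 5 i 7 8....
Main:     a b a b c a b c a c b a b
     j->      0 1 2 3 j
Pattern:      a b c a c

Analysis: round 2 starts by matching s[2] against p[0]. When i has moved to 6 and j to 4, another mismatch occurs and the pattern has to shift again. But which position of the pattern should j move to? Looking closely, s[5] == p[0] and s[6] == p[1], so we only need to move j to p[1]. That is equivalent to having two characters already lined up, and we simply continue with the next comparison.

Round 3:
     i->  0 1 2 3 4 5 6 7 8....
Main:     a b a b c a b c a c b a b
     j->            0 1 2 3 4
Pattern:            a b c a c

The match completes in round 3: the pattern is found at s[5] to s[9].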

As the matching process above shows, when a mismatch occurs only j needs to move. We can use an array next[] to store, for each position of the pattern string, where j should move to when a mismatch occurs at that position. The code can then be changed to look like this:

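// T is the main string, S is the pattern string, pos is the 0-based start index,
// next is the table computed from S. Returns the 1-based position of the
// first match, or 0 if there is no match.
int  Index(HString *T,HString *S, int pos,int *next)
{
    int j=0;
    while(pos<T->length&&j<S->length)
    {
        if(j==-1||T->ch[pos]==S->ch[j])
        {
            pos++;j++;    // match, or j==-1: move on to the next main-string character
        }
        else
        {
            j=*(next+j);  // the key difference from the traditional algorithm: pos stays put
        }
    }
    if(j>=S->length)      // the whole pattern has been matched
    {
        int len;
        len=pos-S->length+1;   // 1-based start position of the match
        printf("%d\n",len);
        return len;
    }
    else
    {
        return 0;
    }
}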
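
The two mismatches in the walkthrough can be reproduced with a small sketch of the same loop on plain C strings. The name kmp_search and the fallbacks counter are assumptions added for illustration. The table is supplied by the caller, so for the pattern abcac it has to be written out by hand as {-1, 0, 0, 0, 1}. Unlike Index above, this sketch returns a 0-based index, or -1 when nothing is found.

```c
#include <string.h>

/* Hypothetical helper for illustration only.
   s: main string, p: pattern, next: precomputed table with next[0] == -1.
   Returns the 0-based index of the first match, or -1 if none.
   *fallbacks counts how many times j was reset through next[]. */
int kmp_search(const char *s, const char *p, const int *next, int *fallbacks)
{
    int n = (int)strlen(s), m = (int)strlen(p);
    int i = 0, j = 0;

    *fallbacks = 0;
    while (i < n && j < m) {
        if (j == -1 || s[i] == p[j]) {
            ++i;            /* i only ever moves forward */
            ++j;
        } else {
            j = next[j];    /* shift the pattern; i stays where it is */
            ++*fallbacks;
        }
    }
    return (j == m) ? i - m : -1;
}
```

For the main string ababcabcacbab and the pattern abcac this returns 5 with two fallbacks, at (i, j) = (2, 2) and (6, 4), which are the ends of rounds 1 and 2 above.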

Computing the next[] values

From the above we know that for the code to work we need the value of next[j] for every j.

For example:

    j->   0  1  2  3  4  5  6  7
Pattern:  a  b  a  a  b  c  a  c
next[]:  -1  0  0  1  1  2  0  1
  • When the first character of the pattern does not match s[i] of the main string, i should advance by one, i.e. s[i+1] is matched against the first character of the pattern next, so we set next[0] == -1;
  • After that there are two cases (from here on s denotes the pattern string). The first: if the current character s[j] equals the character that next[j] points to, i.e. s[j] == s[next[j]], then next[j+1] equals next[j] + 1. In addition, since next[0] == -1 and indices start at 0, the value for s[1] is always 0, i.e. next[1] == 0;
  • The second: if s[j] and the character that next[j] points to are not equal, fall back again: take k = next[next[j]] and compare s[j] with s[k], repeating until the characters are equal or k == -1; then next[j+1] equals k + 1;
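
One way to cross-check the table: with these conventions, next[j] for j >= 1 is the length of the longest proper prefix of s[0..j-1] that is also a suffix of it. The following is a brute-force sketch of that definition in C. It is not the book's method and is far slower than the real algorithm; the name next_by_definition is made up for illustration, and the caller is assumed to pass an array with room for one int per pattern character.

```c
#include <string.h>

/* Hypothetical helper for illustration only.
   Fills next[0..m-1] for pattern p straight from the definition:
   next[0] = -1, and for j >= 1, next[j] is the length of the longest
   proper prefix of p[0..j-1] that is also a suffix of p[0..j-1]. */
void next_by_definition(const char *p, int *next)
{
    int m = (int)strlen(p);

    for (int j = 0; j < m; ++j) {
        if (j == 0) {
            next[0] = -1;
            continue;
        }
        int best = 0;
        /* try every proper prefix length len < j; the last hit is the longest */
        for (int len = 1; len < j; ++len) {
            /* prefix p[0..len-1] against suffix p[j-len..j-1] */
            if (memcmp(p, p + (j - len), (size_t)len) == 0)
                best = len;
        }
        next[j] = best;
    }
}
```

For the pattern abaabcac this fills the array with -1 0 0 1 1 2 0 1, the same values as the table above.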

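The evaluation process (s is the pattern string a b a a b c a c):

First:
next[0] == -1;

Second:
next[1] == 0;

Third:
s[1] != s[next[1]], and falling back gives next[next[1]] == -1, so next[2] == 0;

Fourth:
s[2] == s[next[2]], so whether or not s[3] matches, next[3] equals next[2] + 1, i.e. next[3] == 1;

Fifth:
s[3] != s[next[3]], so keep comparing further back: s[3] == s[next[next[3]]], so next[4] == next[next[3]] + 1 == 1;

Sixth:
s[4] == s[next[4]], so next[5] == next[4] + 1 == 2;

The remaining entries follow in the same way.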

Code:

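// T is the pattern string; next must have room for T->length ints
void Get_next(HString *T,int *next)
{
    int i=0,j=-1;
    *(next +0)=-1;
    while(i<T->length-1)        // stop once next[length-1] has been filled
    {
        if(j==-1||T->ch[i]==T->ch[j])
        {
            ++j;++i;
            *(next+i)=j;        // first case: extend the previous value by one
        }
        else
        {
            j=*(next+j);        // second case: fall back and compare again
        }
    }
    /*  Test: print next[] to check that the values are correct
    for(int t=0;t<T->length;t++)
    {
        printf("%d ",*(next+t));
    }
     */
}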

Complete code:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

typedef struct
{
    char *ch;
    int length;
}HString;

// Copy the C string chars into T (the terminating '\0' is not stored)
void StrAssign(HString *T,char *chars)
{
    int len = 0;
    while(*(chars+len)!='\0')
    {
        len++;
    }
    if(len==0)
    {
        T->ch=NULL;
        T->length=0;
    }
    else
    {
        T->ch=(char *)malloc(len * sizeof(char));
        for(int i=0;i<len;i++)
        {
            T->ch[i]=*(chars+i);
        }
        T->length=len;
    }
}

// T is the pattern string; next must have room for T->length ints
void Get_next(HString *T,int *next)
{
    int i=0,j=-1;
    *(next +0)=-1;
    while(i<T->length-1)        // stop once next[length-1] has been filled
    {
        if(j==-1||T->ch[i]==T->ch[j])
        {
            ++j;++i;
            *(next+i)=j;
        }
        else
        {
            j=*(next+j);
        }
    }
}

// T is the main string, S is the pattern string, pos is the 0-based start index.
// Returns the 1-based position of the first match, or 0 if there is no match.
int  Index(HString *T,HString *S, int pos,int *next)
{
    int j=0;
    while(pos<T->length&&j<S->length)
    {
        if(j==-1||T->ch[pos]==S->ch[j])   // arrays start at 0, so when j==-1 just advance
        {
            pos++;j++;
        }
        else
        {
            j=*(next+j);
        }
    }
    if(j>=S->length)
    {
        int len;
        len=pos-S->length+1;
        printf("%d\n",len);
        return len;
    }
    else
    {
        return 0;
    }
}

int main()
{
    int next[100];               // one entry per pattern character (s2 holds at most 99)
    char s1[100];
    char s2[100];
    printf("Enter string s1:\n");
    if(fgets(s1,sizeof(s1),stdin)==NULL) return 1;   // fgets replaces the unsafe gets()
    s1[strcspn(s1,"\n")]='\0';   // strip the trailing newline kept by fgets
    printf("Enter string s2:\n");
    if(fgets(s2,sizeof(s2),stdin)==NULL) return 1;
    s2[strcspn(s2,"\n")]='\0';

    HString S1,S2;
    StrAssign(&S1, s1);
    StrAssign(&S2, s2);
    Get_next(&S2, next);
    Index(&S1, &S2, 0, next);    // search from the start of s1
    free(S1.ch);
    free(S2.ch);
    return 0;
}
/*  Test run
Enter string s1:
ababcabcacbab
Enter string s2:
abcac
6
*/

Postscript: my explanation may still be a bit vague, but I really did my best.

Origin www.cnblogs.com/zhulmz/p/11707354.html