KMP string matching algorithm (Knuth-Morris-Pratt string searching): C implementation and explanation

I. Introduction

  In computer science, the Knuth-Morris-Pratt string-searching algorithm (KMP algorithm for short) finds the positions at which a word W occurs in a main text string S. It exploits the fact that, when a mismatch occurs, the word itself contains enough information to determine where the next match could begin, so previously matched characters never have to be re-examined. The algorithm was conceived in 1974 by Donald Knuth and Vaughan Pratt; James H. Morris independently devised it in the same year, and the three published it jointly in 1977. (From: Wikipedia)

  KMP (Knuth-Morris-Pratt string searching) is one of the more efficient string matching algorithms. It remedies some shortcomings of brute-force matching by avoiding unnecessary backtracking during matching, which shortens the matching time: its time complexity is only O(n+m), where n is the length of the main string and m is the length of the pattern. This makes it suitable for time-critical situations, and it is also quite useful in exams and programming contests. However, the method is essentially a special case of the Aho-Corasick (AC) automaton and can be somewhat difficult to understand. This article explains how to understand and implement the KMP algorithm; for the underlying mathematics, refer to the string matching chapter of "Introduction to Algorithms".

II. Code

The implementation is given below; read through it first, then follow the analysis.

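#include <stdio.h>
#include <string.h>
void getnext(char *t);        // computes the state-transition (next) array of the pattern
int kmp(char *s,char *t);     // main KMP matching/search function
int next[255];                // a global next array is convenient to use; adjust the size as needed
int main(void)
{
    int n;
    char s[255],t[255];
    printf("Main string: ");
    if(scanf("%254s",s)!=1)   // the width limit keeps the input inside s
        return 1;
    printf("Pattern: ");
    if(scanf("%254s",t)!=1)
        return 1;
    n=kmp(s,t);
    if(n==0)
        printf("No match\n");
    else
        printf("Match found at position %d\n",n);
    return 0;
}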

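void getnext(char *t)
{
    int i=0,j=-1,l=strlen(t);       // j starts at -1 purely for convenience of calculation; it has no special meaning
    next[0]=-1;                     // assigned separately: writing next[i]=j here could lead to an infinite loop later
    while(i<l)
    {
        if(j==-1||t[i]==t[j])       // t[j] is the current prefix character and t[i] the current suffix character; on a match
        {                           // both advance, extending the match one character at a time (try tracing it by hand)
            ++i,++j;
            if(t[i]!=t[j])          // optimization over the basic method, explained later
                next[i]=j;
            else
                next[i]=next[j];
        }
        else
            j=next[j];              // characters differ: backtrack
    }
}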

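int kmp(char *s,char *t)
{
    int i=0,j=0;
    int sl=strlen(s),tl=strlen(t);
    getnext(t);
    while(i<sl&&j<tl)
    {
        if(j==-1||s[i]==t[j])
            ++i,++j;
        else
            j=next[j];        // on a mismatch, fall back to the right position in the pattern and try again
    }
    if(j==tl)
        return i-tl+1;        // 1-based position of the match in s
    else
        return 0;             // no match
}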

III. Detailed analysis

1. Computing the next (transition) array, with analysis

Suppose we have a main string s and a pattern (substring) t:

s="abcdefgab"
t="abcdex"

We can see that the first five characters of the two strings are equal and that the mismatch occurs only at the sixth. Brute-force matching would now shift the pattern by one position and compare from the start again. But notice that the six letters of the pattern are all different, while the first five characters of s and t are the same. This means the first character of t cannot match any of the 2nd to 5th characters of the main string, so those brute-force steps can be skipped entirely. The comparison of t[0] with s[5] cannot be skipped, however: even though we know that s[5] != t[5] and t[0] != t[5], we cannot be sure that t[0] != s[5], so this comparison still has to be made.

t[i] == s[i]  (i = 0,1,2,3,4)
t[0] != t[j]  (j = 1,2,3,4)
from which it follows that: t[0] != s[j]  (j = 1,2,3,4)
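
To make the cost of the skipped steps concrete, here is a minimal sketch of brute-force matching that also counts character comparisons. The function name brute_force and the counter parameter cmp are illustrative assumptions and are not part of the article's code; the return convention (1-based position, 0 for no match) simply mirrors kmp().

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

// Hypothetical helper: naive matching that also counts character comparisons.
// Returns the 1-based position of the first match, or 0 if there is none.
int brute_force(const char *s, const char *t, int *cmp)
{
    assert(s != NULL && t != NULL && cmp != NULL);
    int sl = (int)strlen(s), tl = (int)strlen(t);
    *cmp = 0;
    for (int i = 0; i + tl <= sl; ++i) {   // try every alignment of t inside s
        int j = 0;
        while (j < tl) {
            ++*cmp;                        // one comparison of s[i+j] with t[j]
            if (s[i + j] != t[j])
                break;
            ++j;
        }
        if (j == tl)
            return i + 1;                  // all tl characters matched
    }
    return 0;
}
```

Called with s="abcdefgab" and t="abcdex", this returns 0 after 9 comparisons: six at the first alignment, then one each comparing t[0] with s[1], s[2] and s[3]. Those last three are among the comparisons the derivation above shows can never succeed.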

The example above shows what the backtracking in the KMP algorithm is based on, and why it is better than brute-force matching. Since we are matching the pattern against the main string, backtracking always moves to a position in the pattern; in other words, the next value of each position depends only on the pattern and has nothing to do with the main string. We can now go on to examine the case where characters repeat. Suppose the pattern t is now:

t="abcabx"

We first need two concepts: "prefix" and "suffix". The prefixes of a string are all of its leading substrings that do not include the last character; the suffixes are all of its trailing substrings that do not include the first character. The common value is the length of the longest element that appears among both the prefixes and the suffixes. Next, note that the entry at index n of the next array is the common value computed from the characters before position n: the next array describes where to move when a mismatch occurs at position n, so the character at position n itself is not considered. In "abcabx", "ab" is repeated, so the entry of the next array at x is 2, the length of the longest common prefix and suffix of "abcab". This is also the benefit of setting next[0] = -1: the values are easier to understand and more intuitive. If a mismatch occurs at x, we can shift the whole pattern to the right so that its first "ab" lands where the second "ab" was, and continue matching from c.
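
The definition of the common value can be checked directly with a small sketch. The helper name border_len is an assumption made for illustration; it computes the value by brute force from the definition and is not how getnext() works.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

// Hypothetical helper: length of the longest prefix of t[0..n-1] (excluding
// the whole string) that is also a suffix of it, straight from the definition.
int border_len(const char *t, int n)
{
    assert(t != NULL && n >= 0 && (size_t)n <= strlen(t));
    for (int k = n - 1; k > 0; --k)        // try the longest candidate first
        if (memcmp(t, t + n - k, (size_t)k) == 0)
            return k;                      // first k chars equal last k chars
    return 0;
}
```

For t="abcabx", border_len(t, 5) examines "abcab" and yields 2, which is the value the text assigns to the position of x.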
  Thinking further, we see that the string t contains two a's and two b's. In fact, if a later character is identical to the character its next value points to, we can give it that earlier character's next value instead. This avoids a defect of the basic next array method, which repeats comparisons that are bound to fail; the defect shows up in strings with runs of the same letter. We will not go into the reasons here; you can analyze the strings "aaaabcde" and "aaaaax" with the basic method to see the result. Finally, the next array of the string t (shown in a figure in the original post) is worth working out yourself.

At this point, we have obtained the next (shift) array of the pattern.
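
To compare the basic method with the optimized one, the following sketch computes both arrays side by side. The names next_basic and next_optimized and the caller-supplied array nx are assumptions for illustration; unlike getnext(), both functions fill only the entries nx[0] to nx[l-1].

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

// Basic next array: nx[i] is the common value of t[0..i-1], with nx[0] = -1.
void next_basic(const char *t, int *nx)
{
    assert(t != NULL && nx != NULL);
    int i = 0, j = -1, l = (int)strlen(t);
    nx[0] = -1;
    while (i < l - 1) {
        if (j == -1 || t[i] == t[j]) {
            ++i; ++j;
            nx[i] = j;
        } else {
            j = nx[j];                     // fall back to a shorter prefix
        }
    }
}

// Optimized next array: if t[i] equals t[j], a mismatch at i would also be
// a mismatch at j, so reuse nx[j] instead of storing j.
void next_optimized(const char *t, int *nx)
{
    assert(t != NULL && nx != NULL);
    int i = 0, j = -1, l = (int)strlen(t);
    nx[0] = -1;
    while (i < l - 1) {
        if (j == -1 || t[i] == t[j]) {
            ++i; ++j;
            nx[i] = (t[i] != t[j]) ? j : nx[j];
        } else {
            j = nx[j];
        }
    }
}
```

For "abcabx" the basic array is -1 0 0 0 1 2 and the optimized one is -1 0 0 -1 0 2. For "aaaaax" the basic array is -1 0 1 2 3 4 and the optimized one is -1 -1 -1 -1 -1 4, so a mismatch inside the run of a's no longer retries the same letter again and again.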

2. Analysis of the kmp matching function

The kmp search function is relatively simple; the main difficulty lies in understanding the next function. It takes the main string and the pattern and uses the next array to carry out the match. It returns 0 if the match fails, and the 1-based position of the match if it succeeds. This is only the simplest use of KMP; you can extend the function as needed, for example to find the longest matching substring, or to find every position at which the pattern occurs in the main string.
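
As one possible extension, the sketch below collects every position at which the pattern occurs, including overlapping ones. The name kmp_all, the parameters pos and max, and the choice to treat an empty pattern as never matching are all assumptions for illustration. It builds its own local next array with the same optimized construction as getnext(), including the extra entry nx[tl], which tells the search where to resume after a full match.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

// Hypothetical extension: store the 1-based start positions of up to max
// occurrences of t in s into pos. Returns the number of positions stored,
// or -1 if memory allocation fails.
int kmp_all(const char *s, const char *t, int *pos, int max)
{
    assert(s != NULL && t != NULL && pos != NULL);
    int sl = (int)strlen(s), tl = (int)strlen(t), count = 0;
    if (tl == 0)
        return 0;                          // assumption: empty pattern never matches
    int *nx = malloc(((size_t)tl + 1) * sizeof *nx);
    if (nx == NULL)
        return -1;

    int i = 0, j = -1;                     // build nx[0..tl], as getnext() does
    nx[0] = -1;
    while (i < tl) {
        if (j == -1 || t[i] == t[j]) {
            ++i; ++j;
            nx[i] = (t[i] != t[j]) ? j : nx[j];
        } else {
            j = nx[j];
        }
    }

    i = 0; j = 0;
    while (i < sl && count < max) {
        if (j == -1 || s[i] == t[j]) {
            ++i; ++j;
            if (j == tl) {                 // full match ending at s[i-1]
                pos[count++] = i - tl + 1;
                j = nx[tl];                // resume so overlapping matches are found
            }
        } else {
            j = nx[j];
        }
    }
    free(nx);
    return count;
}
```

With s="abababa" and t="aba", the function stores the positions 1, 3 and 5.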

IV. Conclusion

There are in fact many other string matching algorithms, such as the Sunday algorithm, a well-regarded pattern matching algorithm that is often somewhat more efficient than KMP in practice. Even so, understanding the KMP algorithm helps us understand the other algorithms better.

Origin www.cnblogs.com/comixH/p/12232712.html