A short note on string matching algorithms

Brute Force algorithm

The BF (brute-force) algorithm, also known as the naive pattern-matching algorithm, is based on a simple and crude idea: compare the pattern with the main string character by character and, on a mismatch, start again from the next position of the main string. The C implementation is simple, but I ran into a problem in the details.
The problematic code

#include <stdlib.h>
#include <stdio.h>
#include <string.h>

#define LEN 100

int B_FIndex(char S[],char T[])
{
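    int i = 0,j = 0;

    if(S == NULL || T == NULL)     // Error Input: check before calling strlen
        return -1;

    int Len_S = strlen(S); // get the string length with strlen
    int Len_T = strlen(T);

    if(Len_S < Len_T)              // the pattern cannot be longer than the main string
        return -1;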

    while(i < Len_S && j < Len_T)
    {
        if(S[i] == T[j])
        {
            i++;
            j++;
        }// end if
        else
        {
            i = i - j +1;
            j = 0;
        }// end else
    }//end while

    if(j >=Len_T)
    {
        printf("Success patterened at %d",i-j);    // report the position of the match
        return 0;
    }
    else
    {
        printf("Fail patterened");
        return -1;
    }
}//end BF

int main(int argc,char * argv[])
{
    char S[LEN+1],T[LEN+1];

    fgets(S,sizeof(S),stdin);    // use fgets, the recommended safer function
    fgets(T,sizeof(T),stdin);

    B_FIndex(S,T);
}

At first glance this code looks fine and is no different from the code on other blogs. It compiles, but it does not always match correctly. Why?
case1: substring = main string

hello world
hello world
Success patterened at 0
Process returned 0 (0x0)   execution time : 16.536 s
Press any key to continue.

The output is as expected.
case2: substring < main string

hello world
hello
Fail patterened
Process returned 0 (0x0)   execution time : 9.249 s
Press any key to continue.

What causes this?
Because scanf() (with %s) stops reading at the first whitespace character, it cannot read a line such as "hello world" in one piece, which would affect the matching. So I chose fgets(), which is safer and more general than gets().
If we switch to gets(), the code above works correctly in typical use. But the biggest problem with gets() is that it does not check whether the input exceeds the length of the array, which leads to buffer overflows and security problems.
The real cause of the problem lies in another difference between fgets() and gets():

  • gets() reads characters one by one and stops at the newline (the newline is discarded).
  • fgets() reads characters until it encounters the first newline or has read sizeof(str) - 1 characters. If a newline is read, it is stored in the buffer together with the other characters.
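The difference is easy to observe. The following is a minimal sketch, assuming a hypothetical helper read_line_len (not part of the code above) that reads one line with fgets and reports how many characters were stored:

```cpp
#include <cassert>
#include <cstdio>
#include <cstring>

// Hypothetical helper: read one line from `in` into buf with fgets and
// return the number of characters stored (0 if nothing could be read).
std::size_t read_line_len(std::FILE *in, char *buf, int size)
{
    assert(in != nullptr && buf != nullptr && size > 0);
    if (std::fgets(buf, size, in) == nullptr)
        return 0;
    return std::strlen(buf);   // counts the '\n' if fgets stored one
}
```

Called on stdin, typing hello followed by Enter gives a length of 6 rather than 5, because the buffer holds "hello\n".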

So the key issue is the newline character.
In case1, the main string is

h e l l o ' ' w o r l d '\n' '\0'

and the substring is also

h e l l o ' ' w o r l d '\n' '\0'

So the match succeeds without any problem.
But in case2, the substring is

h e l l o '\n' '\0'

so the newline is treated as part of the substring. In the main string "hello" is followed by a space rather than a newline, so the match fails.
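This failure can be reproduced without any keyboard input by writing the newline into the strings by hand. The sketch below uses the standard library's strstr as a stand-in for the BF search; the helper name contains is an assumption made for this example:

```cpp
#include <cassert>
#include <cstring>

// Hypothetical helper: true if pattern t occurs somewhere in s.
// std::strstr stands in for the BF search here.
bool contains(const char *s, const char *t)
{
    assert(s != nullptr && t != nullptr);
    return std::strstr(s, t) != nullptr;
}
```

With these definitions, contains("hello world\n", "hello world\n") and contains("hello world\n", "hello") are both true, while contains("hello world\n", "hello\n") is false, which is exactly the situation in case2.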

Based on the analysis above, the fix is to leave the newline out when the BF function computes the string lengths. The key lines are modified as follows:

    int Len_S = strlen(S) - 1; // strlen returns the length without the terminator '\0';
    int Len_T = strlen(T) - 1; // subtracting 1 also drops the newline.

Of course there are other ways to handle this; the change above is just one specific fix for this problem.
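One alternative, sketched here under the assumption that the input was read with fgets, is to remove the newline from the buffer before matching instead of adjusting the lengths. Unlike subtracting 1, this also behaves correctly when the line contains no newline at all (for example when the input ends without one). The name strip_newline is hypothetical:

```cpp
#include <cassert>
#include <cstring>

// Hypothetical helper: cut the string at its first '\n', if there is one.
void strip_newline(char *s)
{
    assert(s != nullptr);
    // strcspn returns the index of the first '\n', or strlen(s) when
    // there is none, so this write always lands inside the string.
    s[std::strcspn(s, "\n")] = '\0';
}
```

In the program above it would be called as strip_newline(S); strip_newline(T); right after the two fgets calls, leaving the original strlen lines unchanged.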

KMP algorithm

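The algorithm itself is not explained here; the C++ code is given directly.
next function
Two versions of the next function are given here. Note that they do not produce the same array: in the output below the last entries differ (0 versus 5). get_next2 follows the usual definition, and the KMP implementation later in this note uses the same form.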

/* Compute the next[] array of the pattern string.
   next[] is indexed from 0 and its first value is -1.
   S:cdbcbabcdbcdbcacabc
   T:bcdbcdbca
*/

#include <iostream>
#include <string>
#include <vector>

using namespace std;

void get_next1(const string& T,int next[])
{
    next[0] = -1;
    int k = -1;

    for (int i = 1; i < (int)T.size(); i++)
    {
        while (k > -1 && T[k + 1] != T[i])  // on a mismatch, start backtracking
        {
            k = next[k];    // go back to the position recorded in next and try again
        }
        if (T[k + 1] == T[i] || k == -1)    // if the characters are equal (or k is -1), move forward
        {
            k++;
        }
        next[i] = k;    // store the current match value in next
    }
}

void get_next2(const string& T,int next[])
{   // compute the values of the next function
    int i = 0;
    int j = -1;
    next[0] = -1;       // initialization
    while(i + 1 < (int)T.size())    // stop one step early so that next[i] stays inside the array
    {
        if(j == -1 || T[i] == T[j])
        {
            ++i;
            ++j;
            next[i] = j;
        }
        else
            j = next[j];
    }
}

int main(int argc,char* argv[])
{
    string T;

    cout << "Please input Substring: ";
    cin >> T;
    vector<int> next(T.size(), 0);  // replaces the non-standard variable-length array

    get_next1(T,next.data());
    cout << "The get_next1 of substring T:";
    for(size_t i = 0; i < next.size(); i++)
        cout << next[i] << " ";

    cout << endl;

    get_next2(T,next.data());
    cout << "The get_next2 of substring T:";
    for(size_t i = 0; i < next.size(); i++)
        cout << next[i] << " ";

    cout << endl;
    return 0;
}

Output

Please input Substring: bcdbcdbca
The get_next1 of substring T:-1 0 0 0 1 2 3 4 0
The get_next2 of substring T:-1 0 0 0 1 2 3 4 5

Process returned 0 (0x0)   execution time : 4.611 s
Press any key to continue.
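To judge which of the two outputs agrees with the usual textbook definition, next[i] can also be computed directly: it is the length of the longest proper prefix of T[0..i-1] that is also a suffix of it, with next[0] fixed at -1. The following sketch does this by brute force. The function name next_by_definition is an assumption, and the code is meant as a cross-check rather than an efficient implementation:

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical cross-check: build next[] straight from the definition.
std::vector<int> next_by_definition(const std::string &T)
{
    assert(!T.empty());
    std::vector<int> next(T.size(), 0);
    next[0] = -1;
    for (std::size_t i = 1; i < T.size(); ++i) {
        // Try candidate lengths from the longest (i - 1) down to 1.
        for (std::size_t len = i - 1; len > 0; --len) {
            // Compare prefix T[0..len-1] with suffix T[i-len..i-1].
            if (T.compare(0, len, T, i - len, len) == 0) {
                next[i] = static_cast<int>(len);
                break;
            }
        }
    }
    return next;
}
```

For T = bcdbcdbca this yields -1 0 0 0 1 2 3 4 5, the same array that get_next2 printed.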

KMP implementation

#include <iostream>
#include <string>
#include <vector>
using namespace std;


/*
    Compute the next array
*/
void get_next(const string& T, int *next)
{
    int j, k;
    j = 0;
    k = -1;
    next[0] = -1;
    while(j + 1 < (int)T.size())    // stop one step early so that next[j] stays inside the array
    {
        if(k == -1 || T[j] == T[k])
        {
            j++;
            k++;
            next[j] = k;
        }
        else
        {
            k = next[k];
        }
    }
}

/*
    KMP algorithm
*/
int KMP(const string& S, const string& T, int *next)
{
    int i = 0, j = 0;
    get_next(T, next);
    int Len_S = S.size();
    int Len_T = T.size();

    while (i < Len_S && j < Len_T)
    {
        if (j == -1 || S[i] == T[j])    // j == -1: nothing left to reuse, move on in S
        {
            i++;
            j++;
        }
        else
        {
            j = next[j];    // keep i and slide the pattern according to next
        }
    }
    if (j == Len_T)
    {
        cout << "Success patterned with shift " << i-j <<endl;
        return i - j;
    }
    else
    {
        cout << "Fail patterned" << endl;
        return -1;
    }
}

int main()
{
    string S,T;
    cin >> S ;
    cin >> T;
    vector<int> next(T.size());     // replaces the non-standard variable-length array

    get_next(T, next.data());
    cout << "The next values of the pattern: ";
    for (size_t i = 0; i < next.size(); i++)
        cout << next[i] << " ";
    cout << endl;
    KMP(S, T, next.data());
    return 0;
}

Output

cdbcbabcdbcdbcacabc
bcdbcdbca
The next values of the pattern: -1 0 0 0 1 2 3 4 5
Success patterned with shift 6

Process returned 0 (0x0)   execution time : 19.706 s
Press any key to continue.
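The shift can be cross-checked against the standard library. The sketch below is a compact variant of the same idea rather than the code above: it keeps the next array in a std::vector and returns the shift instead of printing it. The name kmp_find is an assumption:

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical compact KMP: returns the first shift at which T occurs
// in S, or -1 if there is no occurrence.
int kmp_find(const std::string &S, const std::string &T)
{
    assert(!T.empty());
    const int n = static_cast<int>(S.size());
    const int m = static_cast<int>(T.size());

    // next[j]: where to resume in T after a mismatch at T[j]
    // (-1 means "advance in S and restart T from 0").
    std::vector<int> next(m, 0);
    next[0] = -1;
    int j = 0, k = -1;
    while (j < m - 1) {
        if (k == -1 || T[j] == T[k]) {
            ++j;
            ++k;
            next[j] = k;
        } else {
            k = next[k];
        }
    }

    int i = 0;
    j = 0;
    while (i < n && j < m) {
        if (j == -1 || S[i] == T[j]) {
            ++i;
            ++j;
        } else {
            j = next[j];
        }
    }
    return j == m ? i - j : -1;
}
```

For the inputs above, kmp_find(S, T) and static_cast<int>(S.find(T)) both give 6.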

Origin blog.csdn.net/Secur17y/article/details/100585369