Multi-string search in C

1. Problem statement

Write a C program that searches the text file Android.log for the strings "CameraService::connect" and "logicalCameraId: 5, cameraId: 5", and prints every line that contains either string.

String search is a common problem, and it is especially important for locating specific information in logs. Finding the information we want in a large volume of text places high demands on both the speed and the accuracy of the search algorithm.
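As a point of comparison for the task above, here is a minimal baseline sketch that relies on the standard library function strstr instead of a hand-written algorithm. The function name first_pattern_in_line and its parameters are hypothetical and do not come from the original article, which implements KMP itself.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not part of the original program):
   returns the index of the first pattern that occurs in line,
   or -1 if none of the patterns occurs. */
int first_pattern_in_line(const char *line, const char *const patterns[], size_t count)
{
    for (size_t i = 0; i < count; i++) {
        /* strstr returns NULL when patterns[i] is not a substring of line */
        if (strstr(line, patterns[i]) != NULL)
            return (int)i;
    }
    return -1;
}
```

A caller would read the file line by line with fgets and print the line whenever the result is not -1. This is often sufficient in practice; the rest of the article is about how such a search can be implemented by hand.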

2. Algorithm selection

There are many string search algorithms, for example: brute-force (BF) search, the KMP algorithm, the Rabin-Karp algorithm, the Sunday algorithm, and regular-expression matching.

For a detailed explanation and implementation of each string search algorithm, see: link

This article uses the KMP (Knuth-Morris-Pratt) algorithm, a linear-time string matching algorithm that improves on BF (brute force, the most basic string matching algorithm). KMP avoids backing up in the main string during the search and skips comparisons whose outcome is already known. Its time complexity is O(n+m), where n is the length of the main string and m is the length of the pattern string.
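To make the contrast concrete, here is a minimal sketch of the brute-force (BF) search that KMP improves on. The name bf_search is an assumption for illustration. Note how, after a mismatch, the comparison restarts one position to the right of the previous start, so characters of the main string that were already examined are examined again.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical brute-force search: returns the 0-based offset of the
   first occurrence of t in s, or -1 if t does not occur in s. */
int bf_search(const char *s, const char *t)
{
    size_t slen = strlen(s);
    size_t tlen = strlen(t);

    /* Try every possible starting offset in s. */
    for (size_t start = 0; start + tlen <= slen; start++) {
        size_t j = 0;
        /* Compare the pattern character by character from this offset. */
        while (j < tlen && s[start + j] == t[j])
            j++;
        if (j == tlen)
            return (int)start;   /* every character of t matched */
        /* Mismatch: the next iteration backs up to start + 1. */
    }
    return -1;
}
```

For example, bf_search("GTGTGAGC", "GTGA") returns 2. In the worst case this approach performs on the order of n*m comparisons, which is the cost that KMP removes.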

What is the overall idea of the KMP algorithm? Let's walk through an example.

The KMP algorithm opens the same way as the BF algorithm: align the first characters of the main string and the pattern string, then compare character by character from left to right.

In the first round, the pattern string is compared with the first equal-length substring of the main string. The first 5 characters match, but the 6th does not; the mismatching character in the main string is called a "bad character".

At this point, how can we make effective use of the matched prefix "GTGTG"?

Notice that in the prefix "GTGTG", the last three characters "GTG" are the same as the first three characters "GTG".

In the next round of comparison, a match is possible only if these two identical fragments are aligned. The two fragments are called the longest matching suffix substring and the longest matching prefix substring, respectively.
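The length of these two fragments can be computed directly from the matched prefix. The following is a minimal sketch with a hypothetical helper, longest_border, that uses a simple (not the fastest) comparison: it tries every candidate length from longest to shortest and returns the first one for which the prefix and the suffix are equal.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: length of the longest proper prefix of s[0..len-1]
   that is also a suffix of it (0 if there is none). */
size_t longest_border(const char *s, size_t len)
{
    /* "Proper" means the fragment must be shorter than the whole prefix. */
    for (size_t k = len; k-- > 1; ) {
        /* Compare the first k characters with the last k characters. */
        if (memcmp(s, s + len - k, k) == 0)
            return k;
    }
    return 0;
}
```

For the matched prefix "GTGTG", longest_border("GTGTG", 5) returns 3 (the fragment "GTG"), so the pattern string can move 5 - 3 = 2 positions. For "GTG", longest_border("GTG", 3) returns 1 (the fragment "G"), which again gives a move of 3 - 1 = 2 positions.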

In the second round, we move the pattern string two positions to the right so that the two "GTG" fragments are aligned, and continue comparing from the bad character A in the main string.

The character A in the main string is still a bad character, and the matched prefix is now shortened to "GTG".

Following the same reasoning as in the first round, we again determine the longest matching suffix substring and the longest matching prefix substring, which are now both "G".

In the third round, we move the pattern string two positions to the right again, aligning the two "G" characters, and continue comparing from the bad character A in the main string.

That is the overall idea of the KMP algorithm: within the matched prefix, find the longest matching suffix substring and the longest matching prefix substring, and align the two directly in the next round, so that the pattern string moves forward quickly.
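The rounds described above can be reproduced with a short trace. This is a sketch under stated assumptions: kmp_trace is a hypothetical helper, it uses the plain prefix-length table (called fail here) rather than the optimized next array of the program in the next section, and the full main string and pattern string in the usage note are assumptions, because only the matched prefix "GTGTG" and the bad character A are given in the text.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: searches for t in s with KMP and records in starts[]
   every offset of s at which the pattern string was aligned (at most cap
   entries). Returns the match offset or -1; *count receives the number of
   recorded alignments. */
int kmp_trace(const char *s, const char *t, int starts[], int cap, int *count)
{
    int slen = (int)strlen(s), tlen = (int)strlen(t);
    *count = 0;
    if (tlen == 0)
        return 0;

    /* fail[j] = length of the longest proper prefix of t[0..j]
       that is also a suffix of t[0..j]. */
    int *fail = malloc((size_t)tlen * sizeof *fail);
    if (fail == NULL)
        return -1;
    fail[0] = 0;
    for (int j = 1, k = 0; j < tlen; j++) {
        while (k > 0 && t[j] != t[k])
            k = fail[k - 1];
        if (t[j] == t[k])
            k++;
        fail[j] = k;
    }

    int result = -1;
    int i = 0, j = 0;   /* i: position in s, j: matched characters of t */
    if (cap > 0)
        starts[(*count)++] = 0;          /* first round: aligned at offset 0 */
    while (i < slen) {
        if (s[i] == t[j]) {
            i++;
            j++;
            if (j == tlen) {
                result = i - j;          /* whole pattern matched */
                break;
            }
        } else {
            if (j > 0)
                j = fail[j - 1];         /* keep the longest reusable prefix */
            else
                i++;                     /* nothing matched: step past s[i] */
            if (*count < cap)
                starts[(*count)++] = i - j;   /* new alignment of the pattern */
        }
    }
    free(fail);
    return result;
}
```

With the assumed main string "GTGTGAGCTGGTGTGTGCFAA" and the assumed pattern string "GTGTGCF", the first three recorded alignments are 0, 2 and 4, which correspond to the three rounds described above, and the function returns 12. The index i into the main string never decreases, which is where the linear running time comes from.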

Source of the example above: link

3. Program implementation

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
int KMP(char s[],char t[]);
void Getnext(int next[],char t[]);

int main(int argc, char *argv[])
{
	// Open the log file for reading.
	FILE *fp = fopen("/root/program/Android.log","r");
	if(fp == NULL)
	{
		printf("Open error!\n");
		return 1;	// non-zero exit status signals the failure
	}
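	// Read the strings that the user wants to look up.
	printf("Please enter the number of strings you want to find: \n");
	int x;
	if (scanf("%d",&x) != 1 || x <= 0)
	{
		printf("Invalid number of strings!\n");
		fclose(fp);
		return 1;
	}
	int c;
	while ((c = getchar()) != '\n' && c != EOF)
		;	// discard the rest of the line that held the number
	char str[x][100];
	for (int i = 0; i < x; i ++)
	{
		printf("Please enter a string: \n");
		// fgets reads a whole line, so a search string may contain spaces
		// (e.g. "logicalCameraId: 5, cameraId: 5"); scanf("%s") would stop at the first space.
		if (fgets(str[i],sizeof(str[i]),stdin) == NULL)
		{
			printf("Input error!\n");
			fclose(fp);
			return 1;
		}
		str[i][strcspn(str[i],"\n")] = '\0';	// strip the trailing newline
		if (str[i][0] == '\0')
			i--;	// empty input: ask for this string again
	}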

	// Read the file line by line into s and print every line that contains a search string.
	// A line longer than 999 characters is read in several pieces and counted as several rows.
	char s[1000];
	int m = 1;	// row (line number, starting at 1)
	while(fgets(s,sizeof(s),fp))
	{
		for (int i = 0; i < x; i ++)
		{
			int n = KMP(s,str[i]);	// column (0-based offset of the match), or -1
			if (n != -1)
			{
				printf("%s is in the %d row %d column \n",str[i],m,n);
				printf("%s \n",s);
			}
		}
		m++;	// row + 1
	}
	fclose(fp);	// close the file
	return 0;
}

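/*
KMP algorithm.
The key is the next array, which gives the position that j falls back to after a mismatch.
Returns the 0-based position of the first occurrence of t in s, or -1 if there is none.
*/
int KMP(char s[],char t[])
{
	int i=0,j=0;
	int slen = strlen(s);
	int tlen = strlen(t);
	if (tlen == 0)
		return 0;	// an empty pattern matches at position 0; this also avoids a zero-length array
	int next[tlen];	// one fallback entry per character of t
	Getnext(next,t);
	while(i < slen && j < tlen)
	{
		if(j == -1 || s[i] == t[j])
		{
			i++;
			j++;
		}
		else
		{
			j = next[j];	// j falls back; i never moves backwards
		}
	}
	if (j == tlen)
		return i - j;	// match found: return the position of the substring
	else
		return (-1);	// no match
}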

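/*
Builds the next (fallback) array for the pattern string t.
next[j] is the index in t to compare next when t[j] mismatches;
-1 means "advance to the next character of the main string".
*/
void Getnext(int next[],char t[])
{
	int j = 0,k = -1;
	next[0] = -1;
	int tlen = strlen(t);
	while(j < tlen -1)
	{
		if (k == -1 || t[j] == t[k])
		{
			j++;
			k++;
			if(t[j] == t[k])	// the same character would mismatch again, so skip straight to next[k]
				next[j] = next[k];
			else
				next[j] = k;
		}
		else
		{
			k = next[k];
		}
	}
}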

Origin blog.csdn.net/Fighting_gua_biu/article/details/112258867