Data structures: string basics, naive pattern matching, and KMP pattern matching

1. Basic knowledge of strings

1. Definition of string:

A string (String) is a finite sequence of zero or more characters.
It is usually written as:
s = "a1a2...an" (n >= 0)

Note: s is the name of the string. The sequence of characters enclosed in double quotes is the value of the string; the double quotes themselves are not part of it. Each ai (1 <= i <= n) is an arbitrary character, called an element of the string, and is the basic unit that makes up the string. i is the position of that element in the string, and n is the length of the string, that is, the number of characters it contains.

2. Space string:

A string that contains only spaces. Unlike the empty string, a space string has a nonzero length: each space counts as one character.

3. Empty string:

A string of zero length is called an empty string. It can be written directly as two double quotes: "".
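
The difference between a space string and an empty string can be sketched in C as follows. The helper names is_empty_string and is_space_string are made up for this illustration and assume ordinary '\0'-terminated C strings.

```c
#include <string.h>

/* Returns 1 if s has length zero (the empty string), otherwise 0. */
int is_empty_string(const char *s)
{
    return s[0] == '\0';
}

/* Returns 1 if s is non-empty and consists only of spaces, otherwise 0. */
int is_space_string(const char *s)
{
    size_t n = strlen(s);
    /* strspn gives the length of the leading run made up of spaces only. */
    return n > 0 && strspn(s, " ") == n;
}
```

With these helpers, "   " is a space string of length 3, while "" is the empty string and is not a space string.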

4. Substring and main string:

A subsequence made up of any number of consecutive characters in a string is called a substring of that string. The string that contains the substring is called the main string of that substring.

5. Comparison of strings:

String comparison works like strcmp in C: the strings are compared character by character using their character codes (ASCII here).

Given two strings s = "a1a2...an" and t = "b1b2...bm", s < t when one of the following conditions holds:

1. n < m and ai = bi (i = 1, 2, ..., n), where n is the length of s and m is the length of t. In other words, s is a proper prefix of t.
2. There is some k <= min(m, n) such that ai = bi (i = 1, 2, ..., k-1) and ak < bk. When k = 1 the two strings already differ at the first character (a1 != b1), so only a1 and b1 are compared.
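
The two conditions above can be sketched as a small comparison function. The name str_compare is hypothetical, and the sketch assumes '\0'-terminated C strings; it is meant to mirror the rules, not to replace strcmp.

```c
/* Returns a value < 0 if s < t, 0 if s equals t, and > 0 if s > t. */
int str_compare(const char *s, const char *t)
{
    int i = 0;
    /* Skip the common prefix of the two strings. */
    while (s[i] != '\0' && s[i] == t[i]) {
        ++i;
    }
    /* Either the characters differ here (condition 2), or one string
       has ended first and its '\0' compares as smaller (condition 1). */
    return (unsigned char)s[i] - (unsigned char)t[i];
}
```

For instance, "abc" compares less than "abcd" by condition 1, and "abcz" compares less than "abd" by condition 2 with k = 3.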

6. The relevant data structure of the string:

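ADT String
Data
    Each element of a string consists of a single character; adjacent elements have a predecessor and successor relationship.
Operation
    StrAssign(T, *chars):        Create a string T whose value equals the string constant chars.
    StrCopy(T, S):               S exists; copy S to obtain T.
    ClearString(S):              S exists; clear the string.
    StringEmpty(S):              Return true if S is empty, otherwise return false.
    StrLength(S):                Return the number of elements in S, that is, the length of the string.
    StrCompare(S, T):            Return a value > 0 if S > T, 0 if S = T, and a value < 0 if S < T.
    Concat(T, S1, S2):           Return in T the new string formed by joining S1 and S2.
    SubString(Sub, S, pos, len): S exists, 1 <= pos <= StrLength(S),
                                 and 0 <= len <= StrLength(S) - pos + 1. Return in Sub
                                 the substring of S of length len starting at character pos.
    Index(S, T, pos):            S and T exist, T is non-empty, 1 <= pos <= StrLength(S).
                                 If the main string S contains a substring equal to T, return the
                                 position of its first occurrence from character pos of S onward; otherwise return 0.
    Replace(S, T, V):            S, T and V exist, and T is non-empty. In the main string S, replace with V all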
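
A few of these operations might look like the following in C. This is only a sketch under assumptions: MAXSIZE, the String typedef with the length kept in element 0, and the lowercase function names are all chosen for illustration and are not taken from the ADT itself.

```c
#include <string.h>

#define MAXSIZE 100                 /* assumed capacity */
typedef char String[MAXSIZE + 1];   /* assumed layout: element 0 holds the length */

/* StrAssign: build T from a C string constant. Returns 1 on success, 0 if chars is too long. */
int str_assign(String T, const char *chars)
{
    size_t len = strlen(chars);
    if (len > MAXSIZE) {
        return 0;
    }
    T[0] = (char)len;
    memcpy(T + 1, chars, len);      /* characters occupy T[1..len] */
    return 1;
}

/* StrLength: the length is read directly from element 0. */
int str_length(const char *S)
{
    return S[0];
}

/* SubString: copy len characters of S starting at position pos (1-based) into Sub.
   Returns 1 on success, 0 if pos or len violates the preconditions. */
int sub_string(String Sub, const char *S, int pos, int len)
{
    if (pos < 1 || pos > S[0] || len < 0 || len > S[0] - pos + 1) {
        return 0;
    }
    memcpy(Sub + 1, S + pos, (size_t)len);
    Sub[0] = (char)len;
    return 1;
}
```

Keeping the length in element 0 is what lets the matching code later in this article write S[0] and T[0] for the lengths.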
    

2. Storage of strings

1. Sequential storage structure:

The sequential storage structure stores the characters of a string in a group of storage units with consecutive addresses. A fixed-length storage area is allocated for the string, usually defined as a fixed-length array.

Note: When defining a character array in C, reserve one extra element. A string literal is stored with a terminating '\0' that the compiler adds automatically, which makes it easy to tell where the string ends. This '\0' is not counted in the string length.
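
As a small illustration of the extra element for '\0', here is a sketch of copying a string into a fixed-length array. The function name store_fixed and its truncation behaviour are assumptions made for this example.

```c
#include <string.h>

/* Copies src into dst, a fixed-size array of cap elements (cap includes the
   slot for '\0'). Returns the number of characters stored, not counting '\0'. */
size_t store_fixed(char *dst, size_t cap, const char *src)
{
    size_t n = strlen(src);
    if (cap == 0) {
        return 0;
    }
    if (n > cap - 1) {
        n = cap - 1;        /* truncate so the terminator still fits */
    }
    memcpy(dst, src, n);
    dst[n] = '\0';          /* the terminator is stored but not counted */
    return n;
}
```

An array declared as char buf[6] can therefore hold at most the five characters of "hello" plus the terminator.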

2. Chain storage structure:

When a linked list is used to store a string, each node has two fields: a data field (data) and a pointer field (next). The data field stores characters of the string; the pointer field stores the address of the next node.

Note: A node can store a single character or a character array. With one character per node, the pointer takes more space than the character it accompanies, and most string operations are less convenient than with sequential storage, so chain storage is rarely used for strings.
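
A node that stores a small character array could be sketched like this. CHUNK, the used field, and the function names are assumptions for illustration only.

```c
#include <stdlib.h>
#include <string.h>

#define CHUNK 4                 /* assumed number of characters per node */

typedef struct Node {
    char data[CHUNK];           /* data field: up to CHUNK characters */
    int used;                   /* how many elements of data are filled */
    struct Node *next;          /* pointer field: address of the next node */
} Node;

/* Frees every node of the list. */
void chain_free(Node *head)
{
    while (head != NULL) {
        Node *next = head->next;
        free(head);
        head = next;
    }
}

/* Builds a list from a C string. Returns NULL for "" or if allocation fails. */
Node *chain_from(const char *s)
{
    Node *head = NULL, *tail = NULL;
    size_t len = strlen(s), i = 0;
    while (i < len) {
        Node *n = malloc(sizeof *n);
        if (n == NULL) {
            chain_free(head);
            return NULL;
        }
        n->used = (int)(len - i < CHUNK ? len - i : CHUNK);
        memcpy(n->data, s + i, (size_t)n->used);
        n->next = NULL;
        if (tail == NULL) {
            head = n;
        } else {
            tail->next = n;
        }
        tail = n;
        i += (size_t)n->used;
    }
    return head;
}

/* Length of the stored string: the sum of the used counts. */
size_t chain_length(const Node *head)
{
    size_t total = 0;
    for (; head != NULL; head = head->next) {
        total += (size_t)head->used;
    }
    return total;
}
```

Even finding the length requires walking the whole list here, which hints at why the sequential structure is usually preferred.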

3. Naive pattern matching algorithm

The naive algorithm walks through the main string and compares the pattern with the substring that starts at each position. Starting from the current position, compare the first character of the pattern with the main string. While the characters match, increment both indices. On a mismatch, move the main-string index back to the character just after the position where this attempt started, and reset the pattern index to the first character. Once the pattern index passes the end of the pattern, the match has succeeded and the function returns the position in the main string where the match starts; otherwise it returns 0.
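#define MAXSIZE 100                 /* assumed capacity; not given in the original */
typedef char String[MAXSIZE + 1];   /* assumed layout: element 0 holds the length */

/* Returns the position of the first occurrence of T in S from position pos onward, or 0 if there is none. */
int Index(String S, String T, int pos)
{
    int i = pos;    /* current position in the main string S; the search starts here */
    int j = 1;      /* current position in T; it starts at 1, not 0, because T[0]
                       holds the length and the characters are T[1..T[0]] */
    while (i <= S[0] && j <= T[0])
    {
        if (S[i] == T[j])   /* the two characters are equal */
        {
            ++i;            /* advance both; after a full match j == T[0] + 1 */
            ++j;
        }
        else
        {
            i = i - j + 2;  /* i - j + 1 is where this attempt started, so
                               i - j + 2 is the next starting position in S */
            j = 1;          /* go back to the first character of T */
        }
    }
    if (j > T[0])
        return i - T[0];    /* position in S where the match starts */
    else
        return 0;
}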
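
For comparison, the same idea can be sketched on ordinary '\0'-terminated C strings with 0-based indices. This variant is not the code above: the name index_naive, the 0-based positions, and the -1 return value for "not found" are all choices made for this illustration.

```c
#include <string.h>

/* Naive matching on '\0'-terminated strings.
   Returns the 0-based index of the first occurrence of t in s at or
   after pos, or -1 if there is none (or if t is empty or pos is negative). */
int index_naive(const char *s, const char *t, int pos)
{
    int n = (int)strlen(s);
    int m = (int)strlen(t);
    int i = pos;    /* position in the main string */
    int j = 0;      /* position in the pattern */
    if (pos < 0 || m == 0) {
        return -1;
    }
    while (i < n && j < m) {
        if (s[i] == t[j]) {
            ++i;
            ++j;
        } else {
            i = i - j + 1;  /* next starting position in s */
            j = 0;          /* restart the pattern */
        }
    }
    return j == m ? i - m : -1;
}
```

Searching for "google" in "goodgoogle" from position 0 returns 4: the first attempt fails at the fourth character, and the match is found after three more failed starts.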

However, if the main string contains many places where several characters match before a mismatch is found, the algorithm has to back up and start matching again each time. This repetition wastes a lot of time, which is the problem the KMP pattern matching algorithm addresses.
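
To get a feel for how much work is repeated, the number of character comparisons can be counted. The sketch below is an illustration on '\0'-terminated C strings with 0-based indices; the function name count_naive_comparisons is made up for this purpose.

```c
#include <string.h>

/* Counts the character comparisons that naive matching performs
   while searching for t in s from the beginning of s. */
long count_naive_comparisons(const char *s, const char *t)
{
    int n = (int)strlen(s);
    int m = (int)strlen(t);
    int i = 0, j = 0;
    long count = 0;
    while (i < n && j < m) {
        ++count;                /* one comparison of s[i] with t[j] */
        if (s[i] == t[j]) {
            ++i;
            ++j;
        } else {
            i = i - j + 1;
            j = 0;
        }
    }
    return count;
}
```

For s = "0000000001" and t = "00001" this returns 30: each of the 6 possible starting positions costs 5 comparisons, which is (n - m + 1) * m with n = 10 and m = 5.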

4. KMP pattern matching

In the naive algorithm, when an attempt starts at position i of the main string and a mismatch is found at position j of the pattern, the main string jumps back to position i + 1 and the pattern jumps back to its first character. This makes matching inefficient: in the worst case the time complexity is O(n*m) for a main string of length n and a pattern of length m.

In KMP, when a mismatch occurs at position j of the pattern, the main-string index does not move back. Instead, the pattern index falls back to a position determined by the longest prefix of the already matched part of the pattern that is also a suffix of that part. These fallback positions are stored in the next array, so the essence of the KMP algorithm is the process of computing next.

1. How to find the next array

Note: 1. As I understand it, a prefix must be followed by at least one more character, so the whole string cannot be its own prefix. Likewise, a suffix must be preceded by at least one character.

For example: for the string "aaaa", the longest common prefix and suffix is "aaa" (the prefix and the suffix may overlap); it cannot be "aaaa".
2. For the pattern string T, next[j] is the length of the longest common prefix and suffix of the substring made up of the first j-1 characters of T, plus 1. next[1] is defined as 0.
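
The definition in point 2 can be checked with a brute-force sketch that computes next directly, without any clever fallback. It assumes a '\0'-terminated pattern and keeps the 1-based positions used in this article; the names longest_border and next_by_definition are hypothetical.

```c
#include <string.h>

/* Length of the longest string that is both a proper prefix and a
   proper suffix of the first len characters of t. */
int longest_border(const char *t, int len)
{
    int k;
    for (k = len - 1; k > 0; --k) {
        /* compare the first k characters with the last k characters */
        if (memcmp(t, t + len - k, (size_t)k) == 0) {
            return k;
        }
    }
    return 0;
}

/* Fills next[1..m] for the pattern t, where m = strlen(t).
   next must have room for at least m + 1 ints; next[0] is unused. */
void next_by_definition(const char *t, int *next)
{
    int m = (int)strlen(t);
    int j;
    if (m == 0) {
        return;
    }
    next[1] = 0;
    for (j = 2; j <= m; ++j) {
        next[j] = longest_border(t, j - 1) + 1;
    }
}
```

For the pattern "ababaaaba" this gives next[1..9] = 0 1 1 2 3 4 2 2 3, and for "aaaa" it gives 0 1 2 3. It is much slower than a linear-time computation, so it is only useful for checking results on small patterns.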

The code is as follows:

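/* Fills next[1..T[0]] for the pattern T, whose length is stored in T[0]. */
void get_next(char T[], int *next)
{
    int i, j;
    i = 1;          /* T[i] is the current character of the suffix */
    j = 0;          /* T[j] is the current character of the prefix */
    next[1] = 0;
    while (i < T[0])    /* T[0] is the length of the pattern T */
    {
        if (j == 0 || T[i] == T[j])
        {
            ++i;
            ++j;
            next[i] = j;
        }
        else
        {
            j = next[j];    /* mismatch: fall back to a shorter prefix */
        }
    }
}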

The overall KMP code is as follows:


 
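/* Returns the position of the first occurrence of T in S from position pos onward, or 0 if there is none. */
int Index_KMP(char S[], char T[], int pos)
{
    int i = pos;    /* current position in the main string S */
    int j = 1;      /* current position in the pattern T */
    int next[255];
    get_next(T, next);

    while (i <= S[0] && j <= T[0]) {
        /* compared with the naive algorithm, there is an extra j == 0 check */
        if (j == 0 || S[i] == T[j]) {
            ++i;
            ++j;
        } else {
            /* j falls back to a suitable position; i does not change */
            j = next[j];
        }
    }
    if (j > T[0]) {
        return i - T[0];
    } else {
        return 0;
    }
}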
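
For readers who work with ordinary '\0'-terminated C strings, here is a sketch of the same algorithm in a 0-based form. It is a variant, not the code above: it uses a failure table fail (where fail[k] is the length of the longest common prefix and suffix of t[0..k]) instead of the 1-based next array, allocates that table on the heap instead of using a fixed next[255], and returns -1 when there is no match. The names build_fail and index_kmp are hypothetical.

```c
#include <stdlib.h>
#include <string.h>

/* fail[k] = length of the longest proper prefix of t[0..k]
   that is also a suffix of t[0..k]. */
void build_fail(const char *t, int m, int *fail)
{
    int k = 0;
    int i;
    fail[0] = 0;
    for (i = 1; i < m; ++i) {
        while (k > 0 && t[i] != t[k]) {
            k = fail[k - 1];    /* fall back to a shorter prefix */
        }
        if (t[i] == t[k]) {
            ++k;
        }
        fail[i] = k;
    }
}

/* Returns the 0-based index of the first occurrence of t in s at or after
   pos, or -1 if there is none, t is empty, pos is negative, or allocation fails. */
int index_kmp(const char *s, const char *t, int pos)
{
    int n = (int)strlen(s);
    int m = (int)strlen(t);
    int *fail;
    int i, j = 0, result = -1;
    if (pos < 0 || m == 0) {
        return -1;
    }
    fail = malloc((size_t)m * sizeof *fail);
    if (fail == NULL) {
        return -1;
    }
    build_fail(t, m, fail);
    for (i = pos; i < n; ++i) {         /* i never moves backwards */
        while (j > 0 && s[i] != t[j]) {
            j = fail[j - 1];            /* only the pattern index falls back */
        }
        if (s[i] == t[j]) {
            ++j;
        }
        if (j == m) {
            result = i - m + 1;         /* start of the match */
            break;
        }
    }
    free(fail);
    return result;
}
```

Searching for "abcabd" in "abcabcabd" returns 3. After the mismatch at the sixth character, the pattern index falls back to 2 and the comparison continues from the same position in the main string.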

Origin blog.csdn.net/dfwef24t5/article/details/105441813