Data structure: detailed explanation of strings

Table of contents

Preface

1. Definition of string

Related types

2. String storage structure

sequential storage representation

Heap allocated storage representation

Block-linked storage representation

3. String operation method

4. String matching algorithm

(1) BF algorithm

process principle 

Code implementation (C/C++) 

Analysis of Algorithms

(2)KMP algorithm

process principle

Matching process: 

 Get next array:

Code implementation (C/C++)

Analysis of Algorithms


Preface

We learned about sequential lists and linked lists earlier. The data fields of those two structures can hold any type: integers, strings, structures combining several types, and so on. Today we learn a new data structure, the string, whose data field can only be of character type. Let's take a look below!

1. Definition of string

  • A string is a finite sequence of zero or more characters.
  • The number of characters in a string is called the length of the string, and a string containing zero elements is called an empty string.
  • A string is a linear list whose elements are restricted to characters.

(Note: String operations differ a lot from general linear list operations. Linear list operations mainly target a single element of the list, while string operations mainly target substrings.)

Related types

  • String : A finite sequence of 0 or more characters, written S = 'a1a2...an' (n >= 0).
  • String length : the number of characters in the string
  • Empty string : a string of length 0
  • Space string : a string made up of one or more space characters (its length is not 0, so it is not an empty string)
  • Substring : a subsequence of any consecutive characters in the string
  • Main string : a string containing substrings
  • Position : The sequence number of a character in the string. Note : The position of a substring in the main string is the position of its first character in the main string.
  • String equality : the length and corresponding characters are equal

2. String storage structure

sequential storage representation

The sequential storage structure uses a region of contiguous addresses to store the characters; in practice it is stored as an array. The sequential method must fix a maximum capacity in advance. The number of characters stored cannot exceed that maximum, and any excess is discarded. The sequential storage structure is as follows:

#define Maxsize 256 // maximum capacity
// sequential storage
typedef struct string {
	char ch[Maxsize]; // stores the characters
	int length;       // current length of the string
}String;
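
The rule that the excess part is discarded can be sketched as below. This is only a minimal illustration: StrAssignSeq is a hypothetical helper name that does not appear in the original, and its return value (1 if everything fit, 0 if the input was cut off) is an assumption made for the example.

```c
#include <assert.h>
#include <string.h>

#define Maxsize 256  /* maximum capacity, as in the definition above */

typedef struct string {
	char ch[Maxsize];  /* stores the characters */
	int length;        /* current length of the string */
} String;

/* Hypothetical helper: copy a C string into S.
   Characters beyond Maxsize are discarded.
   Returns 1 if everything fit, 0 if the input was truncated. */
int StrAssignSeq(String* S, const char* chars) {
	assert(S);
	assert(chars);
	size_t n = strlen(chars);
	int fits = (n <= Maxsize);
	if (!fits)
		n = Maxsize;          /* drop the excess part */
	memcpy(S->ch, chars, n);
	S->length = (int)n;
	return fits;
}
```

Note that ch is not null-terminated here; the length field alone says how many characters are valid.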

Heap allocated storage representation

The heap allocation storage method dynamically allocates space for the data and links the non-contiguous pieces allocated on the heap through pointer fields, forming a linked list. Dynamic allocation lets the capacity follow actual needs, but each node's pointer field occupies 4 bytes (8 bytes on a 64-bit platform) while its data field holds only 1 byte, which wastes a fair amount of space, so we often use the sequential storage method to implement a string instead. (Many textbooks call this one-character-per-node form the linked storage representation and use the name heap allocation for a single dynamically sized array.) The heap allocation storage structure is as follows:

typedef struct string {
	char ch;             // data field
	struct string* next; // pointer field
}String;
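
To make the space overhead concrete, here is a small sketch that builds such a list from a C string. The helper names LinkFromChars, LinkLength and LinkFree are hypothetical and only serve the illustration.

```c
#include <assert.h>
#include <stdlib.h>

typedef struct string {
	char ch;              /* data field: one character */
	struct string* next;  /* pointer field */
} String;

/* Hypothetical helper: release every node of the list. */
void LinkFree(String* s) {
	while (s != NULL) {
		String* next = s->next;
		free(s);
		s = next;
	}
}

/* Hypothetical helper: build a linked string from a C string.
   Returns NULL for an empty input or if allocation fails. */
String* LinkFromChars(const char* chars) {
	assert(chars);
	String* head = NULL;
	String** tail = &head;   /* where the next node gets attached */
	for (; *chars != '\0'; chars++) {
		String* node = (String*)malloc(sizeof(String));
		if (node == NULL) {
			LinkFree(head);
			return NULL;
		}
		node->ch = *chars;
		node->next = NULL;
		*tail = node;
		tail = &node->next;
	}
	return head;
}

/* Hypothetical helper: count the characters by walking the list. */
int LinkLength(const String* s) {
	int n = 0;
	for (; s != NULL; s = s->next)
		n++;
	return n;
}
```

Each node costs sizeof(String) bytes, which is larger than the single byte of useful data it carries; the exact figure depends on the platform's pointer size and padding.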

Block-linked storage representation

The block-linked storage structure improves on the heap allocation method: each node stores several characters instead of one, so the data field takes up a larger share of every node and space is used more efficiently.
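
A node that holds several characters could look like the sketch below. Everything here is an assumption made for illustration: the names Chunk, LString, ChunkFromChars and ChunkFree, the node size CHUNKSIZE of 4, and the convention of padding the unused tail of the last node with '#'.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define CHUNKSIZE 4   /* characters per node; value chosen for illustration */

typedef struct Chunk {
	char ch[CHUNKSIZE];   /* data field: several characters */
	struct Chunk* next;   /* pointer field */
} Chunk;

typedef struct {
	Chunk* head;
	Chunk* tail;
	int length;           /* total number of characters */
} LString;

/* Release all nodes and reset S to an empty string. */
void ChunkFree(LString* S) {
	Chunk* c = S->head;
	while (c != NULL) {
		Chunk* next = c->next;
		free(c);
		c = next;
	}
	S->head = S->tail = NULL;
	S->length = 0;
}

/* Fill S from a C string. Returns the number of nodes used,
   or -1 if allocation fails. */
int ChunkFromChars(LString* S, const char* chars) {
	assert(S && chars);
	S->head = S->tail = NULL;
	S->length = 0;
	int nodes = 0;
	size_t n = strlen(chars);
	for (size_t pos = 0; pos < n; pos += CHUNKSIZE) {
		Chunk* c = (Chunk*)malloc(sizeof(Chunk));
		if (c == NULL) {
			ChunkFree(S);
			return -1;
		}
		for (size_t k = 0; k < CHUNKSIZE; k++)
			c->ch[k] = (pos + k < n) ? chars[pos + k] : '#';  /* pad the last node */
		c->next = NULL;
		if (S->tail != NULL)
			S->tail->next = c;
		else
			S->head = c;
		S->tail = c;
		nodes++;
	}
	S->length = (int)n;
	return nodes;
}
```

With CHUNKSIZE set to 4, a 7-character string needs 2 nodes instead of 7, so only 2 pointer fields are spent on it.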

3. String operation method

The following are the operations on strings. They are not much different from the operations on sequential lists and linked lists that we learned earlier, so I will not go into details here. Interested friends can write a complete implementation of these functions themselves.

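StrAssign(&T, chars);// Assignment. Sets string T to the value chars.

Strcopy(&T, S);// Copy. Copies string S into string T.

StrEmpty(S);// Empty test. Returns TRUE if S is an empty string, otherwise FALSE.

Replace(&S, T, V);// Replacement. Replaces the substring T in S with V.

StrCompare(S, T);// Comparison. Returns a value > 0 if S > T, 0 if S = T, and a value < 0 if S < T.

StrLength(S);// Length. Returns the number of elements in string S.

Substring(&Sub, S, pos, len);// Substring. Returns in Sub the substring of S of length len starting at the pos-th character.

Concat(&T, S1, S2);// Concatenation. Returns in T the new string formed by joining S1 and S2.

StrInsert(&S, pos, T);// Insertion. Inserts T into S at position pos.

Index(S, T);// Substring location. If the main string S contains a substring equal to T, returns the position of its first occurrence in S; otherwise returns 0.

Clearstring(&S);// Clear. Resets S to an empty string.

Destroystring(&S);// Destroy. Destroys string S.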
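
As a starting point for writing these yourself, here is a minimal sketch of two of the operations on the sequential structure. The signatures are assumptions: the list above does not fix parameter or return types, so this version takes pointers, spells the substring operation SubString, and has it return 1 on success and 0 for an invalid range.

```c
#include <assert.h>
#include <string.h>

#define Maxsize 256

typedef struct string {
	char ch[Maxsize];  /* stores the characters */
	int length;        /* current length of the string */
} String;

/* Comparison: > 0 if S > T, 0 if S = T, < 0 if S < T. */
int StrCompare(const String* S, const String* T) {
	assert(S && T);
	for (int i = 0; i < S->length && i < T->length; i++) {
		if (S->ch[i] != T->ch[i])
			return (unsigned char)S->ch[i] - (unsigned char)T->ch[i];
	}
	return S->length - T->length;  /* common part equal: the longer string is greater */
}

/* Substring: copy len characters of S, starting at the 1-based
   position pos, into Sub. Returns 1 on success, 0 if the range
   does not lie inside S. */
int SubString(String* Sub, const String* S, int pos, int len) {
	assert(Sub && S);
	if (pos < 1 || pos > S->length || len < 0 || len > S->length - pos + 1)
		return 0;
	memcpy(Sub->ch, S->ch + pos - 1, (size_t)len);
	Sub->length = len;
	return 1;
}
```

For example, taking 3 characters from position 2 of abcdef gives bcd, while asking for 3 characters from position 5 is rejected because only 2 remain.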
	

What I want to focus on next is string searching and matching, that is, checking whether a main string contains a substring equal to a given pattern. Let's take a look!

4. String matching algorithm

(1) BF algorithm

The BF algorithm, also known as the Brute Force algorithm, is the ordinary pattern matching algorithm. Its idea is to compare the first character of the target string S with the first character of the pattern string T. If they are equal, continue by comparing the second character of S with the second character of T. If they are not equal, start over by comparing the second character of S with the first character of T. The comparison continues in this way until the pattern is matched in full or the target string is exhausted.

process principle 
Code implementation (C/C++) 
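#include<stdio.h>
#include<assert.h>
#include<stdlib.h>
#define Maxsize 256
// sequential storage
typedef struct string {
	char ch[Maxsize]; // stores the characters
	int length;       // current length of the string
}String;

// 01---string pattern matching: brute-force search
int Index_BF(String* S, String* T) {
	assert(S);
	assert(T);
	int i = 0, j = 0;
	while (i < S->length && j < T->length) {
		if (S->ch[i] == T->ch[j])	// main-string and pattern characters match
		{
			i++;		// advance both the main-string index and the pattern index
			j++;
		}
		else
		{
			i = i - j + 1;	// main string backtracks to one past where this attempt started
			j = 0;			// pattern backtracks to its first character
		}
	}
	if (j == T->length)
		return i - j + 1;	// 1-based position of the match in the main string
	return -1;			// no match
}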
Analysis of Algorithms

It can be seen that the BF algorithm is indeed very brute: it tries the alignments one by one until one succeeds. In the worst case the time complexity is O(mn), where m is the length of the main string and n is the length of the pattern string.
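
To see where the O(mn) comes from, the sketch below counts character comparisons. It is an illustration under assumptions: it works on plain C strings rather than the String structure to stay short, and the name bf_count and its out-parameter are hypothetical.

```c
#include <assert.h>
#include <string.h>

/* Brute-force search over plain C strings. Returns the 1-based
   position of the first match, or -1 if there is none.
   *comparisons receives the number of character comparisons made. */
int bf_count(const char* s, const char* t, long* comparisons) {
	assert(s && t && comparisons);
	int m = (int)strlen(s), n = (int)strlen(t);
	int i = 0, j = 0;
	*comparisons = 0;
	while (i < m && j < n) {
		(*comparisons)++;
		if (s[i] == t[j]) {
			i++;
			j++;
		} else {
			i = i - j + 1;   /* main string backtracks */
			j = 0;
		}
	}
	return (j == n) ? i - j + 1 : -1;
}
```

Searching for aaab in aaaaaaaab finds the match at position 6 after 24 comparisons: each of the five failed alignments compares all four pattern characters before giving up, which is the pattern of behaviour behind the mn bound.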

(2)KMP algorithm

The KMP algorithm is an improved string matching algorithm proposed by D. E. Knuth, J. H. Morris and V. R. Pratt, so people call it the Knuth-Morris-Pratt algorithm (KMP algorithm for short). The core of the KMP algorithm is to use the information gained from a failed match to reduce the number of comparisons between the pattern string and the main string as far as possible, and so match quickly. It is implemented through a next array, which holds the local matching information of the pattern string itself.

process principle

The KMP algorithm greatly reduces the number of comparisons because the main string never needs to backtrack, which improves matching efficiency. However, the KMP algorithm is a bit hard to understand and requires a next array (which tells the pattern where to resume after a mismatch). I will explain in detail below.

Matching process: 

Suppose:

Main string:    a b c d a b e a b c d a b d

Pattern string: a b c d a b d

first match

a b c d a b e a b c d a b d

a b c d a b d

The seventh characters differ (e in the main string, d in the pattern). The six characters matched before that, abcdab, have a longest equal prefix and suffix of length 2 (namely ab).

second match

Since the matched part has an equal prefix and suffix of length 2, we slide the pattern so that its prefix ab lines up with the ab just matched in the main string:

a b c d a b e a b c d a b d

        a b c d a b d

Here the e in the main string does not match the third character of the pattern string (c), so next we need to compare the first character of the pattern string with e.

third match

a b c d a b e a b c d a b d

            a b c d a b d

The first character of the pattern string does not match e either, so the comparison position in the main string moves forward one place, the pattern string slides forward one place as well, and we compare again.

fourth match

a b c d a b e a b c d a b d

              a b c d a b d

This time the match succeeds, and the position of the first matched character in the main string is returned (the position of that a in the main string, which is 8).
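
The four alignments above can be reproduced with a short trace. This is a sketch under assumptions: kmp_trace is a hypothetical name, it works on plain C strings, and it takes the next array as a parameter (for abcdabd that array is -1 0 0 0 0 1 2) instead of computing it.

```c
#include <assert.h>
#include <string.h>

/* Runs the KMP matching loop and records in starts[] the 0-based
   main-string index under which the pattern's first character sits
   for every alignment tried. *count receives how many alignments
   were recorded (at most cap). Returns the 1-based match position,
   or -1 if there is no match. */
int kmp_trace(const char* s, const char* t, const int* next,
              int* starts, int cap, int* count) {
	assert(s && t && next && starts && count);
	int m = (int)strlen(s), n = (int)strlen(t);
	int i = 0, j = 0;
	*count = 0;
	if (cap > 0)
		starts[(*count)++] = 0;            /* first alignment */
	while (i < m && j < n) {
		if (j == -1 || s[i] == t[j]) {
			i++;
			j++;
		} else {
			j = next[j];                   /* i stays where it is */
			int start = (j == -1) ? i + 1 : i - j;
			if (*count < cap)
				starts[(*count)++] = start;
		}
	}
	return (j == n) ? i - j + 1 : -1;
}
```

For the example strings the recorded starts are 0, 4, 6 and 7, matching the first to fourth match shown above, and the returned position is 8.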

 Get next array:

For pattern string: abcdabd

For each character we look at the part of the pattern in front of it and take the length of the longest prefix of that part that equals a suffix of it (neither the prefix nor the suffix may be the whole part).

1. There is no character in front of the first character a, so there is no prefix or suffix. Its next array value is -1, that is, next[0]=-1

2. For the second character b, the part in front of it is the single character a, which has no prefix or suffix shorter than itself, so there is no equal prefix and suffix and its next array value is 0, that is, next[1]=0

3. For the third character c, the part in front of it is ab. Its prefix a and its suffix b are not equal, so there is no equal prefix and suffix, so next[2]=0

4. For the fourth character d, the part in front of it is abc. Its prefixes are a, ab, and its suffixes are bc, c. No prefix equals a suffix, so next[3]=0

5. For the fifth character a, the part in front of it is abcd. Its prefixes are a, ab, and abc, and its suffixes are bcd, cd, and d. They are also not equal, so next[4]=0

6. For the sixth character b, the part in front of it is abcda. Its prefixes are a, ab, abc, abcd, and its suffixes are bcda, cda, da, a. Here we find a prefix equal to a suffix, which is a, of length 1, so next[5]=1

7. For the seventh character d, the part in front of it is abcdab. Its prefixes are a, ab, abc, abcd, abcda, and its suffixes are bcdab, cdab, dab, ab, b. The longest prefix equal to a suffix is ab, of length 2, so next[6]=2

             a        b        c        d        a        b        d
          next[0]  next[1]  next[2]  next[3]  next[4]  next[5]  next[6]
            -1        0        0        0        0        1        2

Value of next array: -1 0 0 0 0 1 2

Note: The value of the next array is only related to the pattern string and has nothing to do with the main string.       

You can also try working out the next array for other pattern strings in the same way.
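
A hand derivation like the one above can be cross-checked with a function that follows the definition literally. This is a deliberately slow sketch meant only for checking; the name next_by_definition is hypothetical and it works on a plain C string.

```c
#include <assert.h>
#include <string.h>

/* Fill next[0..n-1] for pattern t straight from the definition:
   next[0] = -1, and for i >= 1, next[i] is the length of the longest
   prefix of t[0..i-1] that equals a suffix of it, excluding the whole
   part itself. Cubic time in the worst case. */
void next_by_definition(const char* t, int* next) {
	assert(t && next);
	int n = (int)strlen(t);
	if (n == 0)
		return;
	next[0] = -1;
	for (int i = 1; i < n; i++) {
		next[i] = 0;
		for (int k = i - 1; k >= 1; k--) {      /* try the longest length first */
			if (memcmp(t, t + i - k, (size_t)k) == 0) {
				next[i] = k;
				break;
			}
		}
	}
}
```

For abcdabd it produces -1 0 0 0 0 1 2, the same values as the table, and for the pattern ababc it produces -1 0 0 1 2.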

Code implementation (C/C++)
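#include<stdio.h>
#include<assert.h>
#include<stdlib.h>
#define Maxsize 256
// sequential storage
typedef struct string {
	char ch[Maxsize]; // stores the characters
	int length;       // current length of the string
}String;

// 02---string pattern matching: KMP algorithm
// build the next array for pattern S; the caller must free() the result
int* Next(String* S) {
	assert(S);
	if (S->length <= 0)
		return NULL;	// an empty pattern has no next array
	int* next = (int*)malloc(sizeof(int) * S->length);	// array as long as the pattern
	if (next == NULL)
		return NULL;	// allocation failed
	int j = -1, i = 0;	// i scans the pattern, j is the length of the current equal prefix
	next[0] = -1;
	while (i < S->length-1) {
		if (j == -1 || S->ch[i] == S->ch[j]) {	// nothing left to fall back to, or the equal prefix grows by one character
			i++;
			j++;
			next[i] = j;	// 0 when there is no equal prefix and suffix, otherwise one more than before
		}
		else {
			j = next[j];	// fall back to the next shorter equal prefix
		}
	}
	return next;
}

// algorithm interface
int Index_KMP(String* S, String* T) {
	assert(S);
	assert(T);
	int i = 0, j = 0;
	int* next = Next(T);
	if (next == NULL)
		return -1;	// empty pattern or allocation failure
	while (i < S->length && j < T->length) {
		if (j == -1 || S->ch[i] == T->ch[j])	// j == -1: even the first pattern character failed, so move on
		{
			i++;
			j++;
		}
		else
		{
			// compared with the BF algorithm, the main-string index i never backtracks
			j = next[j];	// the pattern index j jumps straight to its next value
		}
	}
	free(next);
	if (j == T->length)
		return i - j + 1;	// 1-based position of the match in the main string
	return -1;
}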
Analysis of Algorithms

Compared with the BF algorithm, the KMP algorithm greatly improves matching efficiency. Its time complexity is O(m+n), where m is the length of the main string and n is the length of the pattern string.
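
The linear bound can be observed by counting character comparisons. The sketch below is an illustration under assumptions: kmp_count is a hypothetical name, it works on plain C strings, and it treats an empty pattern as matching at position 1.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* KMP search over plain C strings. Returns the 1-based position of
   the first match, or -1 if there is none (or if allocation fails).
   *comparisons receives the number of main-string character
   comparisons made while matching. */
int kmp_count(const char* s, const char* t, long* comparisons) {
	assert(s && t && comparisons);
	int m = (int)strlen(s), n = (int)strlen(t);
	*comparisons = 0;
	if (n == 0)
		return 1;
	int* next = (int*)malloc(sizeof(int) * (size_t)n);
	if (next == NULL)
		return -1;
	/* build the next array of the pattern */
	int i = 0, j = -1;
	next[0] = -1;
	while (i < n - 1) {
		if (j == -1 || t[i] == t[j]) {
			i++;
			j++;
			next[i] = j;
		} else {
			j = next[j];
		}
	}
	/* match: i only ever moves forward */
	i = 0;
	j = 0;
	while (i < m && j < n) {
		if (j == -1) {
			i++;
			j = 0;
		} else {
			(*comparisons)++;
			if (s[i] == t[j]) {
				i++;
				j++;
			} else {
				j = next[j];
			}
		}
	}
	free(next);
	return (j == n) ? i - j + 1 : -1;
}
```

Searching for aaab in aaaaaaaab finds position 6 with 14 comparisons. In general the matching loop makes at most 2m comparisons, since every comparison either advances i or slides the pattern forward.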

The above is all the content of this issue. Have you learned it? See you in the next issue!


Origin blog.csdn.net/m0_73633088/article/details/133149903