Using the Regex class in C# and Unity: a plain-language, step-by-step walkthrough

Background: my project uses regular expressions to parse some network protocol definitions. A new protocol, A, looked identical in format to the earlier ones, yet the existing regular expression could not match it, so I set out to properly understand how regular expressions are written and read. The cause turned out to be that the pattern handled Windows line endings but not Linux line endings. Since I was unfamiliar with this area, I am recording how I worked through the puzzle.

For regular expressions in general, read the official documentation directly. Almost every question can be answered from it, but to save time, below I use one example to describe the problem I ran into, the conclusions I reached, and the methods I used.

Below is the structure of the text I need to match. In practice, a single string contains several of these structures.

// 创建请求 req
struct TestCreateRq {
    string Name;            // 名称
    i32 Icon;               // 图标
    i32 Type;               // 条件类型 
    i32 requirementParam;   // 条件参数
}

Below is the regular expression I use. The goal is to find every such structure in the string, then print each structure's description comment and its name.

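    // The matching rule we use.
    Regex reg = new Regex(@"(//.*\r\n)?struct\s*(\w*)\s*{\r\n([\w\W]*?)}\s*", RegexOptions.Multiline);
    // content is the string to search; it contains several of the structures above.
    var matches = reg.Matches(content);
    foreach (Match item in matches)
    {
        // Group 1: the description comment, with "//" and the line break stripped.
        Debug.Log(item.Result("$1").Replace("//", "").Replace("\r\n", ""));
        // Group 2: the struct name.
        Debug.Log(item.Result("$2"));
    }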

Let's look directly at the matching rules section

Questions about the regex

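Question 1: What does the @ in front of the string mean?
Answer: @ marks a verbatim string literal in C#. Backslash escape sequences inside it are not processed, so \t stays as a backslash followed by t. This is why regex patterns are usually written with @: the backslashes reach the regex engine untouched.
Example:

Console.WriteLine("How are\tyou?");
// Output: How are	you?   (\t became a tab)
Console.WriteLine(@"How are\tyou?");
// Output: How are\tyou?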
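
To see what the @ prefix means for a regex pattern, here is a minimal sketch. The class name VerbatimDemo is made up for illustration, and it uses Console.WriteLine so that it runs outside Unity; inside Unity you would call Debug.Log instead.

```csharp
using System;
using System.Text.RegularExpressions;

public static class VerbatimDemo
{
    // Regular literal: the C# compiler turns \t into a single tab character.
    public const string Regular = "a\tb";
    // Verbatim literal: the backslash and the 't' stay as two separate characters.
    public const string Verbatim = @"a\tb";

    public static void Main()
    {
        Console.WriteLine(Regular.Length);   // 3
        Console.WriteLine(Verbatim.Length);  // 4
        // The regex engine itself understands \t, so the verbatim pattern
        // still matches a real tab character in the input.
        Console.WriteLine(Regex.IsMatch("a\tb", Verbatim)); // True
    }
}
```

The same applies to \r, \n, \s and \w in the pattern discussed here: with @, C# leaves them alone and the regex engine interprets them.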

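Question 2: What do ( ) mean in a regular expression?
Answer: They form a capturing group. The text matched by the subpattern inside the parentheses is captured, and groups are numbered from 1 in the order of their opening parentheses.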

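Question 3: What is RegexOptions.Multiline?
Answer: It is one value of the RegexOptions enumeration, which controls how matching behaves. Multiline makes ^ and $ match at the start and end of every line instead of only at the start and end of the whole string. The pattern in this article uses neither ^ nor $, so the option does not change its result.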
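
A small sketch of what Multiline changes, assuming a two-line input made up for this illustration (the class and method names are also hypothetical):

```csharp
using System;
using System.Text.RegularExpressions;

public static class MultilineDemo
{
    // Counts how often "struct" appears at a position where ^ is allowed to match.
    public static int CountLineStarts(string text, RegexOptions options)
    {
        return Regex.Matches(text, @"^struct", options).Count;
    }

    public static void Main()
    {
        string text = "struct A {}\nstruct B {}";
        // Without Multiline, ^ matches only at the very start of the string.
        Console.WriteLine(CountLineStarts(text, RegexOptions.None));      // 1
        // With Multiline, ^ also matches right after each "\n".
        Console.WriteLine(CountLineStarts(text, RegexOptions.Multiline)); // 2
    }
}
```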

Question 4: What are "$1" and "$2"?
Answer: They are substitutions in a regular expression replacement pattern.
Read together with the pattern, they refer to the contents of the ( ) groups above: $1 is group 1 and $2 is group 2. The official description is: the $number language element includes the last substring matched by the number capturing group in the replacement string, where number is the index of the capturing group. For example, the replacement pattern $1 indicates that the matched substring is to be replaced by the first captured group.
So item.Result("$1") returns the last substring matched by the first group, (//.*\r\n), and item.Result("$2") returns the last substring matched by the second group, (\w*).
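
As a rough check, the sketch below reads group 2 in two ways, through Match.Result("$2") and through Match.Groups[2].Value. The class name, the helper names and the sample input are assumptions made for this illustration, and the pattern is shortened to the part that contains the two groups.

```csharp
using System;
using System.Text.RegularExpressions;

public static class SubstitutionDemo
{
    // Shortened version of the article's pattern: only groups 1 and 2.
    public const string Pattern = @"(//.*\r\n)?struct\s*(\w*)";

    // Struct name obtained through the "$2" substitution.
    public static string NameViaResult(string text)
    {
        return Regex.Match(text, Pattern).Result("$2");
    }

    // Struct name obtained directly from the group collection.
    public static string NameViaGroups(string text)
    {
        return Regex.Match(text, Pattern).Groups[2].Value;
    }

    public static void Main()
    {
        string text = "// comment\r\nstruct Foo {\r\n}";
        Console.WriteLine(NameViaResult(text)); // Foo
        Console.WriteLine(NameViaGroups(text)); // Foo
    }
}
```

Groups[n].Value is often the more direct choice when you only need one group; Result is convenient when you want to combine several groups into one output string.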

Above are some answers to frequently asked questions.

Reading and writing the regex

Writing a regex and reading one are the same thing done in opposite directions. Either way, you need to check each symbol against a reference of regex metacharacters.

Let's analyze the pattern symbol by symbol.

@"(//.*\r\n)?struct\s*(\w*)\s*{\r\n([\w\W]*?)}\s*"

1. @: see Question 1 above.
2. Group 1, (//.*\r\n), matches the comment line // 创建请求 req

  1. // matches the literal //
  2. . matches any single character except "\n"
  3. * matches the preceding subexpression zero or more times
  4. \r\n matches a carriage return followed by a line feed
  5. Overall: // + zero or more characters other than "\n" + carriage return and line feed, which matches the // 创建请求 req line

3. ? matches the preceding subexpression zero or one time. Here it applies to group 1, so the // 创建请求 req comment line may appear at most once, and it is fine if it is absent.
4. struct\s*(\w*)\s* matches struct TestCreateRq

  1. struct matches the literal struct
  2. \s matches any whitespace character, including spaces, tabs, form feeds, etc.
  3. * matches the preceding subexpression zero or more times
  4. (\w*) is group 2: \w* matches zero or more word characters (letters, digits and the underscore)
  5. \s* matches zero or more whitespace characters
  6. Overall: struct + zero or more whitespace characters + zero or more word characters + zero or more whitespace characters, which matches struct TestCreateRq

5. {\r\n([\w\W]*?)}\s* matches the body of the structure:
{
    string Name;            // 名称
    i32 Icon;               // 图标
    i32 Type;               // 条件类型
    i32 requirementParam;   // 条件参数
}

  1. { matches the literal {
  2. \r\n matches a carriage return followed by a line feed
  3. ([\w\W]*?) is group 3. [\w\W] is a character set that combines word characters (\w) and non-word characters (\W), so it matches any character at all, line breaks included. * repeats it zero or more times, and the trailing ? makes the repetition lazy: when ? directly follows another quantifier (*, +, ?, {n}, {n,}, {n,m}), the pattern matches as little of the searched string as possible.
  4. } matches the literal }
  5. \s* matches zero or more whitespace characters
  6. Overall: { + carriage return and line feed + as few characters as possible of any kind + } + zero or more whitespace characters, which also matches our target text.

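That is the full analysis of the regular expression I used. The difficult part is understanding the order in which the lazy quantifier *? is evaluated. With lazy matching, the engine first tries to match whatever follows the *? (here, the }), and only when that fails does it let the lazy part consume one more character. As a result, group 3 stops at the first } it meets.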
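
The difference between lazy and greedy matching is easier to see on a tiny input. The following is a sketch with made-up names and a made-up input containing two brace blocks; the braces are escaped as \{ and \} here only to keep them visibly literal.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LazyDemo
{
    // Returns what group 1 captures for the first match of the given pattern.
    public static string FirstBody(string pattern, string text)
    {
        return Regex.Match(text, pattern).Groups[1].Value;
    }

    public static void Main()
    {
        string text = "{a}{b}";
        // Lazy: stops at the first closing brace.
        Console.WriteLine(FirstBody(@"\{([\w\W]*?)\}", text)); // a
        // Greedy: runs to the end, then backtracks to the last closing brace.
        Console.WriteLine(FirstBody(@"\{([\w\W]*)\}", text));  // a}{b
    }
}
```

With several structures in one string, the greedy form would swallow everything between the first { and the last }, which is why the pattern uses *?.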

Going the other way, we start from the meaning and build the pattern. As an example, here is how one might write a pattern for // 创建请求 req

  1. Start with //
  2. The characters in the middle (创建请求 and the surrounding spaces) can be covered by .*
  3. Then comes req
  4. The line ends with a carriage return and line feed, \r\n. To make the whole thing a group, wrap it in ()
  5. The result is (//.*req\r\n)

This certainly matches, so why is it written differently from the pattern above? The answer is generality. If you want to match // 创建请求 res, or // 啦啦啦, or // 啦 // 啦 // 啦, the pattern (//.*req\r\n) will not match them. You can think of a regex as a fuzzy search: the fuzzier the pattern, the more general it is. When writing one, it is best to make it as general as possible while still meeting the requirement.
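
To make the trade-off concrete, this sketch counts how many of the comment lines mentioned above each pattern accepts. The class and method names are hypothetical, and each sample line is given a Windows line ending so that both patterns can be compared fairly.

```csharp
using System;
using System.Text.RegularExpressions;

public static class GeneralityDemo
{
    public static readonly string[] Lines =
    {
        "// 创建请求 req\r\n",
        "// 创建请求 res\r\n",
        "// 啦 // 啦 // 啦\r\n",
    };

    // Counts how many sample lines the given pattern matches.
    public static int CountMatching(string pattern)
    {
        int count = 0;
        foreach (string line in Lines)
        {
            if (Regex.IsMatch(line, pattern)) count++;
        }
        return count;
    }

    public static void Main()
    {
        // The specific pattern insists on "req" right before the line break.
        Console.WriteLine(CountMatching(@"(//.*req\r\n)")); // 1
        // The general pattern accepts any comment line.
        Console.WriteLine(CountMatching(@"(//.*\r\n)"));    // 3
    }
}
```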

Now back to the original problem.

When I used @"(//.*\r\n)?struct\s*(\w*)\s*{\r\n([\w\W]*?)}\s*" to match the following structure, it did not match.

struct TestCreateRq {
    string Name;            // 名称
    i32 Icon;               // 图标
    i32 Type;               // 条件类型 
    i32 requirementParam;   // 条件参数
}

No more suspense: the cause was that the supplied files used different line endings. A Windows line break is \r\n, while a Linux line break is \n. The text looks identical on screen, but the pattern, which requires \r\n, does not match a file that uses \n alone.

So how can the pattern accept both kinds of line break? Following the analysis above, every place where the pattern checks for \r\n must also accept a bare \n.

So the pattern can be changed as follows:

    // Add ? between \r and \n in each line break, so that \r matches at most once and may be absent.
    Regex reg = new Regex(@"(//.*\r?\n)?struct\s*(\w*)\s*{\r?\n([\w\W]*?)}\s*", RegexOptions.Multiline);
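
To confirm the fix, the sketch below runs the old and the new pattern against the same small structure saved with each kind of line ending. The class name, the helper and the shortened sample text are assumptions for this illustration; RegexOptions.Multiline is left out because the pattern has no ^ or $.

```csharp
using System;
using System.Text.RegularExpressions;

public static class NewlineDemo
{
    public const string Old = @"(//.*\r\n)?struct\s*(\w*)\s*{\r\n([\w\W]*?)}\s*";
    public const string Fixed = @"(//.*\r?\n)?struct\s*(\w*)\s*{\r?\n([\w\W]*?)}\s*";

    // Number of structures the pattern finds in the text.
    public static int Count(string pattern, string text)
    {
        return Regex.Matches(text, pattern).Count;
    }

    public static void Main()
    {
        string lf = "// req\nstruct A {\n    i32 Icon;\n}\n"; // Linux line endings
        string crlf = lf.Replace("\n", "\r\n");               // Windows line endings

        Console.WriteLine(Count(Old, crlf));   // 1
        Console.WriteLine(Count(Old, lf));     // 0, the original problem
        Console.WriteLine(Count(Fixed, crlf)); // 1
        Console.WriteLine(Count(Fixed, lf));   // 1
    }
}
```

An alternative, if you control the input, would be to normalize line endings first with content.Replace("\r\n", "\n") and write the pattern with \n only; the \r?\n approach has the advantage of leaving the input untouched.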

Finally, an online regex testing site, recommended to me by a more experienced developer, is a convenient way to verify and test regular expressions.

That's all.


Origin blog.csdn.net/qq_39860954/article/details/126784143