Escaping in regex

Escape sequences generally serve two purposes. The first is to encode special data that cannot be represented directly by the alphabet. The second is to represent characters that cannot be typed directly on a keyboard (such as a carriage return).

In the C language, the backslash "\" is the escape character, used among other things to write non-printable ASCII control characters. Similarly, in a URI some symbols in the request string have special meanings and must be escaped; there the escape character is the percent sign "%". It is called an escape character because whatever follows it no longer carries its original meaning.
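As a small illustration of both kinds of escaping, here is a sketch in Python rather than C. The sample query value "a&b=c d" is made up for the example, and `urllib.parse.quote` is used as one convenient way to percent-encode.

```python
from urllib.parse import quote, unquote

# Backslash escapes: each sequence stands for one control character.
newline = "\n"
carriage_return = "\r"
print(len(newline), ord(newline))                  # 1 10
print(len(carriage_return), ord(carriage_return))  # 1 13

# Percent-encoding: characters with a special meaning in a query
# string are replaced by % followed by their hexadecimal byte value.
encoded = quote("a&b=c d", safe="")
print(encoded)            # a%26b%3Dc%20d
print(unquote(encoded))   # a&b=c d
```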

Below are some common escape characters and what they mean.
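A few common backslash escapes, written as Python string literals, are sketched here. The selection is my own illustrative assumption and not an exhaustive list.

```python
# Each key is the escape as typed in source code (a raw string keeps
# the backslash); each value is the single character it produces.
common_escapes = {
    r"\n": "\n",   # line feed
    r"\r": "\r",   # carriage return
    r"\t": "\t",   # horizontal tab
    r"\\": "\\",   # a literal backslash
    r"\'": "\'",   # single quote
    r"\"": "\"",   # double quote
    r"\0": "\0",   # null character
}

for written, char in common_escapes.items():
    print(f"{written} -> code point {ord(char)}")
```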

Regular expressions also use the backslash for escaping. Normally, \d in a regular expression matches a single digit. If we want it to mean a literal backslash followed by the letter d, we have to escape the backslash and write \\d.
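The difference between \d and \\d can be checked with a short sketch. The sample text is made up, and the patterns are written as raw strings so that Python does not process the backslashes before the regex engine sees them.

```python
import re

# The text holds four characters: 7, space, backslash, d.
text = r"7 \d"

digits = re.findall(r"\d", text)     # \d is the digit metacharacter
literal = re.findall(r"\\d", text)   # \\d is a literal backslash, then d

print(digits)    # ['7']
print(literal)   # ['\\d']  (the repr of the two characters \ and d)
```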

The pattern above matches a backslash and a d that appear consecutively. To match a backslash or a d instead, use the pipe symbol or square brackets, as in \\|d or [\\d].
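A quick sketch of the "backslash or d" case, using a made-up sample string. In both patterns the backslash itself is escaped (\\|d and [\\d]), because an unescaped \d inside or outside brackets would still mean "digit".

```python
import re

# The text holds three characters: a, backslash, d.
text = "a\\d"

with_pipe = re.findall(r"\\|d", text)       # backslash OR d, via alternation
with_brackets = re.findall(r"[\\d]", text)  # backslash OR d, via a character class

print(with_pipe)       # ['\\', 'd']
print(with_brackets)   # ['\\', 'd']
```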

Matching a literal backslash takes two rounds of escaping when the pattern is written as an ordinary (non-raw) string. We type four backslashes, \\\\. In the first step, string escaping turns them into two backslashes, \\. In the second step, the regex engine treats those two backslashes as an escaped backslash, which matches a single backslash \.
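The two steps can be observed in Python. This is a minimal sketch with a made-up sample string; it also shows that a raw string skips the first step, so only two backslashes need to be typed.

```python
import re

# Step 1 (string escaping): four typed backslashes become two characters.
pattern = "\\\\"
print(len(pattern))   # 2

# Step 2 (regex escaping): those two characters match one backslash.
text = "a\\b"         # three characters: a, backslash, b
matches = re.findall(pattern, text)
print(matches)        # ['\\']  (one backslash, shown in repr form)

# A raw string is not processed by step 1, so r"\\" is the same pattern.
raw_pattern = r"\\"
print(raw_pattern == pattern)   # True
```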

To match the asterisk (*), plus sign (+), or question mark (?) as literal characters rather than as metacharacters, escape them by putting a backslash in front of them.
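For example, with a made-up sample string. The last lines use `re.escape`, a standard-library helper that adds the backslashes for you; it is not mentioned in the course notes and is shown only as a convenience.

```python
import re

text = "1+1=2? yes*"

plus = re.findall(r"\+", text)       # literal plus sign
question = re.findall(r"\?", text)   # literal question mark
star = re.findall(r"\*", text)       # literal asterisk

print(plus, question, star)   # ['+'] ['?'] ['*']

# re.escape builds a pattern that matches its argument literally.
escaped = re.escape("1+1")
found = re.findall(escaped, "1+1 is not 11")
print(found)   # ['1+1']
```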

For square brackets [] and curly braces {}, only the opening bracket needs to be escaped in a regular expression, but for parentheses () both sides must be escaped. Parentheses are normally used for grouping, that is, to treat part of a pattern as a whole. If you escape only the opening or only the closing parenthesis, the regex engine considers the other half unmatched and reports an error.
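A sketch of this behavior in Python's `re` module, with a made-up sample string. Other regex engines may treat an unescaped closing bracket or brace differently, so take this as a description of Python only.

```python
import re

text = "f(x) [a] {1}"

square = re.findall(r"\[a]", text)    # only the opening [ is escaped
curly = re.findall(r"\{1}", text)     # only the opening { is escaped
parens = re.findall(r"\(x\)", text)   # both parentheses are escaped

print(square, curly, parens)   # ['[a]'] ['{1}'] ['(x)']

# Escaping just one parenthesis leaves the other one unbalanced.
half_escaped_fails = False
try:
    re.compile(r"\(x)")
except re.error as exc:
    half_escaped_fails = True
    print("error:", exc)
```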

Inside a character class (square brackets), three cases need escaping:

1. The caret (^), when it is the first character inside the square brackets:

>>> import re
>>> re.findall(r'[^ab]', '^ab')  # unescaped: means "not"
['^']
>>> re.findall(r'[\^ab]', '^ab')  # escaped: an ordinary character
['^', 'a', 'b']

2. The hyphen (-), when it is inside the square brackets and not at the beginning or end:

>>> import re
>>> re.findall(r'[a-c]', 'abc-')  # hyphen in the middle: means "range"
['a', 'b', 'c']
>>> re.findall(r'[a\-c]', 'abc-')  # hyphen in the middle, escaped
['a', 'c', '-']
>>> re.findall(r'[-ac]', 'abc-')  # at the beginning: no escape needed
['a', 'c', '-']
>>> re.findall(r'[ac-]', 'abc-')  # at the end: no escape needed
['a', 'c', '-']

3. The right square bracket (]), when it is inside the square brackets and not in the first position:

>>> import re
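>>> re.findall(r'[]ab]', ']ab')  # right bracket unescaped, in first position
[']', 'a', 'b']
>>> re.findall(r'[a]b]', ']ab')  # right bracket unescaped, not in first position
[]  # no match, because the pattern means "a" followed by "b]"
>>> re.findall(r'[a\]b]', ']ab')  # escaped: an ordinary character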
[']', 'a', 'b']

Generally speaking, if we want metacharacters (such as . * + ? and parentheses) to be taken literally, we need to escape them. When they appear inside the square brackets of a character class, however, they do not need to be escaped. This applies to single-character metacharacters such as the dot (.), asterisk (*), plus sign (+), question mark (?), and the left and right parentheses: inside the brackets they lose their special meaning and stand for the character itself. Sequences such as \d or \w, on the other hand, keep their metacharacter meaning inside square brackets.

>>> import re
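>>> re.findall(r'[.*+?()]', '[.*+?()]')  # single-character metacharacters
['.', '*', '+', '?', '(', ')']
>>> re.findall(r'[\d]', 'd12\\')  # \w, \d, etc. are still metacharacters inside square brackets
['1', '2']  # matches the digits, not the backslash \ or the letter d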

This article is my study notes for Day 23 in August. The content comes from the Geek Time course "Introduction to Regular Expressions Course", which I recommend.

Origin blog.csdn.net/key_3_feng/article/details/132462417