Why does re.match(r"\\w","\w") match '\\w'? Why does re.match("\a","\a") display '\x07' as the match? An investigation


Basic knowledge: matching special characters in Python
Meaning of \: it marks the next character as a special character, a literal character, a backreference, or an octal escape.
For example, 'n' matches the character "n". '\n' matches a newline. The sequence '\\' matches "\" and "\(" matches "(".
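
The four cases above can be checked with a minimal sketch. The variable names are illustrative and not part of the original notes.

```python
import re

# 'n' matches the ordinary character n.
plain = re.match("n", "n")

# '\n' matches a newline character.
newline = re.match("\n", "\n")

# The regex \\ matches one backslash; "\\" is a one-character string.
backslash = re.match(r"\\", "\\")

# The regex \( matches a literal opening parenthesis.
paren = re.match(r"\(", "(")

print(plain.group(), repr(newline.group()), backslash.group(), paren.group())
```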


\s matches any whitespace character, including spaces, tabs, form feeds, etc. For ASCII text it is equivalent to [ \f\n\r\t\v].
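
As a quick check of the character class above, the following sketch tests each listed character against \s. The names whitespace_chars, all_match and letter_match are illustrative.

```python
import re

# Each of the six characters listed in the class is matched by \s.
whitespace_chars = " \f\n\r\t\v"
all_match = all(re.fullmatch(r"\s", ch) for ch in whitespace_chars)

# A letter is not whitespace, so \s does not match it.
letter_match = re.fullmatch(r"\s", "a")

print(all_match, letter_match)
```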

Escape sequences

Regular expressions also support most of the escapes used in Python string literals: \a, \b, \f, \n, \r, \t, \u, \U, \v, \x, \\
Note 1: \b normally matches a word boundary; it means "backspace" only inside a character class.
Note 2: \u and \U are only recognized in Unicode (str) patterns.
Note 3: Octal escapes (\digit) are supported in a limited form. If the first digit is 0, or if there are three octal digits,
the sequence is treated as an octal escape; otherwise it is treated as a group reference. As in string literals, octal escapes are at most three digits long.
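
The rule in Note 3 can be illustrated with a small sketch. The three patterns are chosen here for illustration only.

```python
import re

# First digit is 0, so \07 is an octal escape: character code 7 (BEL, "\a").
octal_zero = re.match(r"\07", "\a")

# Three octal digits, so \101 is an octal escape: 0o101 == 65 == ord("A").
octal_three = re.match(r"\101", "A")

# A single non-zero digit, so \1 is a reference to group 1.
group_ref = re.match(r"(a)\1", "aa")

print(octal_zero.group() == "\x07", octal_three.group(), group_ref.group())
```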


Question 1: Why do I get '\x07' when re.match("\a","\a") matches?

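'\x07' is simply how Python displays this character. In a string literal, \a denotes the ASCII BEL (bell) control character, whose code is 7. It has no printable form, so repr() writes it as the hex escape '\x07'. "\a" and "\x07" are the same one-character string; the display has nothing to do with UTF-8.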
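
A minimal sketch of this point, independent of the re module (the name bell is illustrative):

```python
bell = "\a"  # in a string literal, \a denotes the BEL control character

print(len(bell))       # 1: a single character
print(ord(bell))       # 7: its character code
print(bell == "\x07")  # True: "\a" and "\x07" are the same string
print(repr(bell))      # '\x07': repr() has no shorter spelling for it
```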

Here is the IPython 3 terminal session:
In [6]: re.match("a","a")
Out[6]: <_sre.SRE_Match object; span=(0, 1), match='a'>

In [7]: re.match("\a","\a")  # One would expect the match to be shown as "\a"; why is it '\x07'?
Out[7]: <_sre.SRE_Match object; span=(0, 1), match='\x07'>

In [8]: re.match(r"\a","\a")
Out[8]: <_sre.SRE_Match object; span=(0, 1), match='\x07'>

In [183]: len("abc")  # Three ordinary characters, so the length is 3
Out[183]: 3

In [9]: rs = r"\a"

In [10]: len("\x07")  # The length is 1: "\x07" is a single control character, not several ordinary characters
Out[10]: 1

The following shows that the pattern \x41 matches A, because 0x41 is the ASCII code of A:

In [103]: re.match(r"\x41","A")
Out[103]: <_sre.SRE_Match object; span=(0, 1), match='A'>

In [104]: re.match("\x41","A")
Out[104]: <_sre.SRE_Match object; span=(0, 1), match='A'>

The following shows that \x07 is the same character as \a:

In [106]: re.match("\x07","\a")
Out[106]: <_sre.SRE_Match object; span=(0, 1), match='\x07'>

In [107]: re.match(r"\x07","\a")
Out[107]: <_sre.SRE_Match object; span=(0, 1), match='\x07'>
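
The raw and non-raw patterns in the session above are different strings that end up matching the same character. A short sketch, with illustrative variable names:

```python
import re

plain_pattern = "\a"  # one character: BEL itself
raw_pattern = r"\a"   # two characters: a backslash and the letter a

print(len(plain_pattern), len(raw_pattern))  # 1 2

# Python resolves the escape in the first pattern; the regex engine
# resolves it in the second. Both end up matching the BEL character.
m_plain = re.match(plain_pattern, "\x07")
m_raw = re.match(raw_pattern, "\x07")
print(m_plain.group() == m_raw.group() == "\a")  # True
```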







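Question 2: Why does re.match(r"\\w","\w") match '\\w'?

Reason: \w is not a recognized escape in a Python string literal, so the backslash is kept and "\w" is the same two-character string as "\\w". The pattern r"\\w" matches a literal backslash followed by w, which is exactly that string. (Recent Python versions warn about unrecognized escapes such as "\w".)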

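The following three fail to match:
In [134]: re.match("\w","\w")  # No match: the pattern \w is the word-character class, and the string begins with a backslash

In [135]: re.match(r"\w","\w")  # No match: "\w" and r"\w" are the same string, so this is the same pattern


In [142]: re.match("\\w","\w")  # No match: "\\w" is also the two characters \w, so again the same pattern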

In [136]: re.match(r"\\w","\w")
Out[136]: <_sre.SRE_Match object; span=(0, 2), match='\\w'>
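
The difference between the failing and the succeeding patterns can be sketched as follows. The literal "\w" is evaluated at run time with warnings silenced, on the assumption that the running Python version still only warns about unrecognized escapes; the name subject is illustrative.

```python
import re
import warnings

# Evaluate the literal "\w" with warnings silenced, because recent
# Python versions warn about unrecognized escapes such as \w.
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    subject = eval(r'"\w"')

print(len(subject))      # 2: a backslash and the letter w
print(subject == "\\w")  # True: the same string, written unambiguously

# The regex \w (word character) cannot match the leading backslash.
no_match = re.match(r"\w", subject)
print(no_match)          # None

# The regex \\w (literal backslash, then w) matches both characters.
full_match = re.match(r"\\w", subject)
print(full_match.span())  # (0, 2)
```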

About the displayed '\\w'

Why is there an extra '\' in the displayed match '\\w'? The interactive terminal shows the repr of the string, in which a backslash is written as \\, whereas print() shows the string's actual characters. Run the same code as a script, for example in PyCharm:
import re
result = re.match(r"\\w","\w").group()
print(result)  # prints \w

What is printed is \w. So the terminal's '\\w' stands for the two-character string \w, with the backslash escaped by doubling it; print() outputs the string without that escaping.
In other words, the matched text is \w, and only its repr adds the second \.
To check this:
s = "\\w"
print(s)  # prints \w both in the terminal and in PyCharm, which confirms the idea
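
The difference between what print() shows and what the interactive prompt shows can be seen directly by comparing the string with its repr:

```python
s = "\\w"

print(len(s))   # 2
print(s)        # \w
print(repr(s))  # '\\w'
```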

Next, explain why this matches:
In [136]: re.match(r"\\w","\w")
<_sre.SRE_Match object; span=(0, 2), match='\\w'>


r"\\w" is equivalent to "\\\\w".
So the question is, how can "\\\\w" match "\w"? My guess: the string literal "\w" is really the same string as "\\w".
Is my guess correct?
In [143]: re.match("\\\\w","\\w")
Out[143]: <_sre.SRE_Match object; span=(0, 2), match='\\w'>

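The guess is correct: since \w is not a recognized escape in a string literal, "\w" and "\\w" denote the same two-character string (a backslash followed by w). This is decided when the literal is read, before any matching takes place, and the pattern \\w then matches that string.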
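
The equivalence between the raw and the doubled spelling of the pattern can be checked with a short sketch (variable names are illustrative):

```python
import re

raw_pattern = r"\\w"
doubled_pattern = "\\\\w"

print(raw_pattern == doubled_pattern)  # True: backslash, backslash, w
print(len(raw_pattern))                # 3

subject = "\\w"  # backslash, w
m = re.match(doubled_pattern, subject)
print(m.span())              # (0, 2)
print(m.group() == subject)  # True
```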




By the same principle, the following results can be explained.

Here is the IPython 3 terminal session:

In [147]: re.match(r"\\d","\d")
Out[147]: <_sre.SRE_Match object; span=(0, 2), match='\\d'>

In [152]: re.match(r"\\s","\s")
Out[152]: <_sre.SRE_Match object; span=(0, 2), match='\\s'>

In [153]: re.match(r"\\W","\W")
Out[153]: <_sre.SRE_Match object; span=(0, 2), match='\\W'>

In [154]: re.match(r"\\S","\S")
Out[154]: <_sre.SRE_Match object; span=(0, 2), match='\\S'>

In [155]: re.match(r"\\D","\D")
Out[155]: <_sre.SRE_Match object; span=(0, 2), match='\\D'>

In [177]: re.match(r"\\B","\B")
Out[177]: <_sre.SRE_Match object; span=(0, 2), match='\\B'>


In [36]: re.match(r"\\j","\j")
Out[36]: <_sre.SRE_Match object; span=(0, 2), match='\\j'>
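
The cases in this session follow one pattern, which a loop makes explicit. This is a sketch: the subjects are built as "\\" + letter to avoid unrecognized escapes in the source, and re.escape is used only to show that it produces the same pattern.

```python
import re

letters = "dsWSDBj"
spans = {}
escape_agrees = True
for ch in letters:
    subject = "\\" + ch   # backslash followed by the letter
    pattern = r"\\" + ch  # regex: literal backslash, then the letter
    # re.escape builds the same pattern from the subject.
    escape_agrees = escape_agrees and (pattern == re.escape(subject))
    spans[ch] = re.match(pattern, subject).span()

print(spans)
print(escape_agrees)
```

Every span comes out as (0, 2), matching the session output.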





Question 3: Why does re.match(r"\\b","\b") fail to match? Because in a string literal "\b" is the backspace character itself, not a backslash followed by b.
In [166]: re.match(r"\\w","\w")  # Matches, for the reason confirmed above
Out[166]: <_sre.SRE_Match object; span=(0, 2), match='\\w'>

In [167]: re.match(r"\\b","\b")  # Why doesn't this match?

In [169]: re.match(r"\b","\b")  # No match
The string literal "\b" is not equivalent to "\\b": it is a single character, backspace (\x08). The raw string r"\b", by contrast, is equivalent to "\\b": two characters, a backslash and b.
In In [167] the regular expression r"\\b" looks for a literal backslash followed by b, but the string to be matched contains only the backspace character, so there is no match. In In [169] the regular expression \b is a word-boundary assertion, and a lone backspace has no word boundary next to it.
To match the backspace character itself, the pattern can simply be the non-raw string "\b".

To confirm this:

In [170]: re.match("\b","\b")
Out[170]: <_sre.SRE_Match object; span=(0, 1), match='\x08'>


In [172]: re.match("\x08","\b")
Out[172]: <_sre.SRE_Match object; span=(0, 1), match='\x08'>

In [174]: re.match(r"\b","\b")  # No match


In summary, \b means a word boundary inside a regular expression, but in an ordinary (non-raw) string literal it is the backspace character itself.
That is why re.match(r"\\b","\b") does not match: the pattern expects a backslash followed by b, and the string contains a single backspace.
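
The different meanings of \b can be put side by side in one sketch. The character-class form [\b] follows Note 1 earlier in these notes; variable names are illustrative.

```python
import re

backspace = "\b"  # in a string literal, \b is backspace (code 8)
print(backspace == "\x08")  # True

# In a regex, \b outside a character class is a word boundary, and there
# is no word boundary next to a lone backspace character.
boundary = re.match(r"\b", backspace)
print(boundary)  # None

# The regex \\b wants a backslash and then b; the subject has neither.
escaped = re.match(r"\\b", backspace)
print(escaped)  # None

# Three patterns that do match the backspace character.
in_class = re.match(r"[\b]", backspace)
by_hex = re.match(r"\x08", backspace)
by_literal = re.match("\b", backspace)
print(in_class.span(), by_hex.span(), by_literal.span())  # (0, 1) (0, 1) (0, 1)

# The regex \\b does match the two-character text backslash + b.
two_chars = re.match(r"\\b", "\\b")
print(two_chars.span())  # (0, 2)
```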

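A related case is a regex metacharacter that follows the escaped backslash:

In [179]: re.match(r"\\$","\$")  # No match: after the literal backslash, $ asserts end of string, but a "$" character follows
In [179]: re.match(r"\\^","\^")  # No match: after the literal backslash, ^ asserts start of string, which position 1 is not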
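
A sketch of the anchor behaviour, and of one way to match the text literally by escaping the anchor as well. The subjects are written as "\\$" and "\\^", which are the same strings as "\$" and "\^"; the variable names are illustrative.

```python
import re

dollar_text = "\\$"  # backslash, dollar sign
caret_text = "\\^"   # backslash, caret

# \\ matches the backslash, then $ asserts end of string: fails here.
anchored = re.match(r"\\$", dollar_text)
print(anchored)  # None

# The same pattern succeeds when the backslash really is the last character.
at_end = re.match(r"\\$", "\\")
print(at_end.span())  # (0, 1)

# Escaping the anchors makes them literal characters.
literal_dollar = re.match(r"\\\$", dollar_text)
literal_caret = re.match(r"\\\^", caret_text)
print(literal_dollar.span(), literal_caret.span())  # (0, 2) (0, 2)
```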


The same reasoning also explains:

In [55]: re.match("\n","\n")
Out[55]: <_sre.SRE_Match object; span=(0, 1), match='\n'>

In [56]: re.match("\na","\na")
Out[56]: <_sre.SRE_Match object; span=(0, 2), match='\na'>


In [59]: re.match("\\\\nabc","\\nabc")
Out[59]: <_sre.SRE_Match object; span=(0, 5), match='\\nabc'>


In [60]: re.match(r"\\nabc","\\nabc")
Out[60]: <_sre.SRE_Match object; span=(0, 5), match='\\nabc'>
r"\\nabc" is equivalent to '\\\\nabc'
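
The newline cases can be summarized in a short sketch that contrasts a real newline with the two-character text backslash + n (variable names are illustrative):

```python
import re

newline_text = "\nabc"        # newline, a, b, c (4 characters)
backslash_n_text = "\\nabc"   # backslash, n, a, b, c (5 characters)

# Raw and non-raw spellings of the newline pattern match the same thing.
plain_nl = re.match("\n", newline_text)
raw_nl = re.match(r"\n", newline_text)
print(plain_nl.span(), raw_nl.span())  # (0, 1) (0, 1)

# \\n in a regex is a literal backslash followed by n.
literal = re.match(r"\\nabc", backslash_n_text)
print(literal.span())  # (0, 5)

# It does not match a string that starts with a real newline.
mismatch = re.match(r"\\nabc", newline_text)
print(mismatch)  # None
```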



 
