Detailed explanation of metacharacters in Python regular expressions (1)

Introduction

Regular expressions (also called REs, regexes, or regex patterns) are essentially a tiny, highly specialized programming language embedded in Python and made available through the re module. With this little language, you specify rules that describe the set of strings you want to match. That set might contain English sentences, e-mail addresses, TeX commands, or anything else you like.

Note:

Python's regular expression engine is written in C, so matching is fast. The "regular expression" (RE) referred to here is simply the set of rules mentioned above.

Character match

Most letters and characters simply match themselves. For example, the regular expression Dog matches the string "Dog" exactly, and matching is case sensitive by default. There are exceptions to this rule: a few special characters, called metacharacters, do not match themselves. Instead, they signal that something out of the ordinary should be matched, such as a character class, a group, or a repetition of another part of the pattern.
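As a small illustration of literal matching, the sketch below checks the pattern Dog against two strings. The variable names are made up for this example; re.fullmatch is used so that the whole string has to match.

```python
import re

# fullmatch() succeeds only if the entire string matches the pattern.
exact = re.fullmatch(r"Dog", "Dog")
wrong_case = re.fullmatch(r"Dog", "dog")

print(exact is not None)   # True: ordinary characters match themselves
print(wrong_case is None)  # True: matching is case sensitive by default
```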

Metacharacter

Below is the complete list of metacharacters:

.   ^   $   *   +   ?   { }   [ ]   \   |   ( )

Without these metacharacters, regular expressions would be no more powerful than the ordinary string method find().

[ ]

First come the square brackets, [ and ]. They specify a character class, which is the set of characters you want to match. The characters can be listed individually, or a range can be given as two characters separated by a hyphen -. For example, [abc] matches the character a, b or c; [a-c] does the same thing, using a range to express the same set of characters. If you want to match only lowercase letters, your RE can be written as [a-z].

Note:

Metacharacters do not trigger their "special function" inside square brackets; in a character class they only match themselves. For example, [akm$] matches any of the characters 'a', 'k', 'm' or '$'. '$' is normally a metacharacter, but inside square brackets it has no special meaning and matches only the '$' character itself.

You can also match every character that is not listed in the square brackets. To do this, put a caret ^ as the first character of the class. For example, [^5] matches any character except '5'.
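The three kinds of character class described above (a range, a class containing a metacharacter, and a negated class) can be tried out with re.findall. This is only a sketch; the sample strings are invented for the example.

```python
import re

text = "a1b2c3$5"

# A range: every character from 'a' to 'c'.
in_range = re.findall(r"[a-c]", text)

# '$' has no special meaning inside the brackets.
with_dollar = re.findall(r"[akm$]", text)

# A leading caret negates the class: everything except '5'.
not_five = re.findall(r"[^5]", "1525")

print(in_range)     # ['a', 'b', 'c']
print(with_dollar)  # ['a', '$']
print(not_five)     # ['1', '2']
```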

Backslash \

As in Python string literals, if a metacharacter is immediately preceded by a backslash, its "special function" is not triggered. For example, if you need to match the symbol [ or \, put a backslash in front of it to remove its special meaning: \[ or \\.

Some characters that follow a backslash take on a special meaning instead, such as representing the set of decimal digits, the set of letters, or the set of non-whitespace characters.

In short: a backslash followed by a metacharacter removes the special function, and a backslash followed by certain ordinary characters adds one.
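A minimal sketch of both uses of the backslash follows. The sample text is invented, and the patterns are written as raw strings (r"...") so that Python itself does not interpret the backslashes before the re module sees them.

```python
import re

# A raw string keeps the single backslash in the sample text as-is.
text = r"list[0] lives in C:\temp"

open_brackets = re.findall(r"\[", text)  # escaped metacharacter: a literal '['
backslashes = re.findall(r"\\", text)    # escaped backslash: a literal '\'
digits = re.findall(r"\d", text)         # special sequence: any decimal digit

print(open_brackets)     # ['[']
print(len(backslashes))  # 1
print(digits)            # ['0']
```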

For example, \w matches any word character. If the pattern is a bytes object, \w is equivalent to the character class [a-zA-Z0-9_]. If the pattern is a string, \w matches all characters that the Unicode database (exposed through the unicodedata module) marks as letters or digits, plus the underscore. You can restrict \w to the narrower definition in a string pattern by supplying the re.ASCII flag when compiling the regular expression.

The re.ASCII flag makes \w match only ASCII characters. Do not forget that strings in Python 3 are Unicode.
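The difference between the Unicode and ASCII definitions of \w can be seen with a string that contains accented letters. The sample text below is an assumption made for the example; the accented characters are written as escapes so the snippet does not depend on the file encoding.

```python
import re

text = "na\u00efve caf\u00e9_42"  # "naive cafe_42" with accented i and e

# String pattern: \w follows the Unicode definition of a word character.
unicode_words = re.findall(r"\w+", text)

# Same pattern with re.ASCII: accented letters no longer count.
ascii_words = re.findall(r"\w+", text, re.ASCII)

# Bytes pattern: \w is limited to [a-zA-Z0-9_].
bytes_words = re.findall(rb"\w+", b"caf\xe9_42")

print(unicode_words)  # ['naïve', 'café_42']
print(ascii_words)    # ['na', 've', 'caf', '_42']
print(bytes_words)    # [b'caf', b'_42']
```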

Special sequence   Meaning
\d                 Matches any decimal digit; for ASCII text, equivalent to the class [0-9]
\D                 The opposite of \d: matches any non-digit character; equivalent to the class [^0-9]
\s                 Matches any whitespace character (space, newline, tab, etc.); for ASCII text, equivalent to the class [ \t\n\r\f\v]
\S                 The opposite of \s: matches any non-whitespace character; equivalent to the class [^ \t\n\r\f\v]
\w                 Matches any word character; see the explanation above
\W                 The opposite of \w: matches any non-word character
\b                 Matches the empty position at the beginning or end of a word
\B                 The opposite of \b: matches a position that is not a word boundary

These sequences can also be used inside a character class, where they keep their special meaning. For example, [\s,.] is a character class that matches any whitespace character (\s keeps its special meaning), ',' or '.'.
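A few of the sequences from the table can be sketched as follows, including the [\s,.] class. The sample sentence is made up for the example.

```python
import re

text = "Room 42,\tfloor 3."

digits = re.findall(r"\d+", text)         # runs of decimal digits
non_space = re.findall(r"\S+", text)      # runs of non-whitespace characters
separators = re.findall(r"[\s,.]", text)  # whitespace, ',' or '.'

# \b only matches at a word boundary, so "floorplan" is not a hit.
whole_word = re.findall(r"\bfloor\b", "floor floorplan")

print(digits)      # ['42', '3']
print(non_space)   # ['Room', '42,', 'floor', '3.']
print(separators)  # [' ', ',', '\t', ' ', '.']
print(whole_word)  # ['floor']
```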

.

Another metacharacter is the dot, ., which matches any character except a newline. If the re.DOTALL flag is set, . matches any character at all, including a newline.
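The effect of re.DOTALL can be checked on a two-line string. This is a minimal sketch with an invented sample.

```python
import re

text = "one\ntwo"

# Without the flag, . stops at the newline.
default = re.match(r".+", text).group()

# With re.DOTALL, . also matches the newline.
dotall = re.match(r".+", text, re.DOTALL).group()

print(repr(default))  # 'one'
print(repr(dotall))   # 'one\ntwo'
```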

*

Next is the metacharacter *. It does not match the literal character '*'; instead, it specifies that the preceding character may be matched zero or more times.

For example, ca*t matches ct (0 'a' characters), cat (1 'a'), caaat (3 'a' characters), and so on. Note that, because of internal limits tied to the size of C's int type, the regular expression engine caps the number of repetitions of 'a' at roughly 2 billion.

Repetitions such as * are greedy by default. When repeating part of an RE, the matching engine tries to repeat it as many times as possible. If the rest of the pattern then fails to match, the engine backs up one character and tries again with fewer repetitions.

A step-by-step example makes "greedy" clearer. Consider the expression a[bcd]*b. It matches the character 'a', then zero or more characters from the class [bcd], and finally ends with 'b'. Now imagine matching this RE against the string abcbd.

Step  Matched  Explanation
1     a        The 'a' in the RE matches.
2     abcbd    The engine matches [bcd]* as far as the rule allows, which is to the end of the string.
3     failure  The engine tries to match the final 'b' of the RE, but the current position is already at the end of the string, so it fails.
4     abcb     Back up, so that [bcd]* matches one fewer character.
5     failure  Try the final 'b' again, but the character at the current position is 'd', so it fails.
6     abc      Back up again, so that [bcd]* matches only 'bc'.
7     abcb     Try the final 'b' again. This time the character at the current position is 'b', so the match succeeds.

The RE finally matches abcb.

Repetition in regular expressions is greedy by default; how to use non-greedy matching is described later.
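The walk-through above can be mimicked with a short hand-written trace. This is only an illustrative sketch of greedy matching followed by backing up, specific to the pattern a[bcd]*b; it is not how the re engine is actually implemented, and trace_greedy is a hypothetical helper name.

```python
import re

def trace_greedy(text):
    """Simplified trace of matching a[bcd]*b at the start of text."""
    attempts = []  # what 'a' + [bcd]* covers at each attempt
    if not text.startswith("a"):
        return None, attempts
    # Greedy phase: let [bcd]* take as many characters as possible.
    end = 1
    while end < len(text) and text[end] in "bcd":
        end += 1
    # Back-up phase: give back one character at a time until 'b' fits.
    while end >= 1:
        attempts.append(text[:end])
        if end < len(text) and text[end] == "b":
            return text[:end + 1], attempts
        end -= 1
    return None, attempts

result, attempts = trace_greedy("abcbd")
print(attempts)  # ['abcbd', 'abcb', 'abc']
print(result)    # abcb

# The real engine reports the same final match.
print(re.match(r"a[bcd]*b", "abcbd").group())  # abcb
```

The three entries in attempts correspond to steps 2, 4 and 6 of the table.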

+

The metacharacter + matches one or more times. Pay special attention to the difference between * and +: * matches zero or more times, so the repeated content may not appear at all, while + requires at least one occurrence. For example, ca+t matches cat and caaat, but does not match ct.
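The difference is easy to see by running both patterns over the same candidate strings. A minimal sketch:

```python
import re

candidates = ["ct", "cat", "caaat"]

# * allows zero occurrences of 'a'; + needs at least one.
star = [s for s in candidates if re.fullmatch(r"ca*t", s)]
plus = [s for s in candidates if re.fullmatch(r"ca+t", s)]

print(star)  # ['ct', 'cat', 'caaat']
print(plus)  # ['cat', 'caaat']
```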

?

Two more metacharacters express repetition. The first is the question mark, ?, which matches the preceding character either zero times or once; you can think of it as marking something as optional. For example, home-?brew matches either homebrew or home-brew.
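As a small sketch of an optional character, the pattern home-?brew (chosen here purely as an illustration) makes the hyphen optional, but does not allow it twice:

```python
import re

pattern = re.compile(r"home-?brew")
candidates = ["homebrew", "home-brew", "home--brew"]

# ? allows the hyphen zero times or once, never twice.
optional_hits = [s for s in candidates if pattern.fullmatch(s)]

print(optional_hits)  # ['homebrew', 'home-brew']
```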

{}

The most flexible repetition metacharacter is {m,n}, where m and n are decimal integers. It means that the preceding character must be repeated at least m times and at most n times; the repetition metacharacters described above can all be expressed with it. For example, a/{1,3}b matches a/b, a//b and a///b. It does not match ab, which has no slashes, or a////b, which has four.
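The a/{1,3}b example can be verified directly. The dictionary below is just a convenient way to show which candidate strings match.

```python
import re

pattern = re.compile(r"a/{1,3}b")
candidates = ["ab", "a/b", "a//b", "a///b", "a////b"]

# Map each candidate to whether the whole string matches.
results = {s: bool(pattern.fullmatch(s)) for s in candidates}

for s, ok in results.items():
    print(s, ok)
# ab False
# a/b True
# a//b True
# a///b True
# a////b False
```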

You can omit either m or n, in which case the engine assumes a reasonable value for the missing one. Omitting m is interpreted as a lower limit of 0; omitting n is interpreted as an upper limit of infinity (in practice, the roughly 2 billion limit mentioned above).

So {,n} is equivalent to {0,n}, and {m,} is equivalent to {m,+infinity}; {n} means the preceding character is repeated exactly n times. A common mistake is to write {m, n} with a space after the comma. It looks tidier, but you cannot add spaces inside the braces: doing so changes the meaning of the pattern, and the braces no longer act as a repetition count.
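The sketch below tries the shortened forms and the space pitfall. The behaviour shown for {1, 3} (the braces being read as literal text) is what CPython's re module does as far as I am aware; treat it as an observation about current versions rather than a documented guarantee.

```python
import re

exact = re.fullmatch(r"a{3}", "aaa")        # exactly three
at_most = re.fullmatch(r"a{,2}", "")        # zero to two, so "" is fine
at_least = re.fullmatch(r"a{2,}", "aaaaa")  # two or more
too_few = re.fullmatch(r"a{2,}", "a")       # only one 'a'

# With a space, the braces stop being a repetition count...
spaced = re.fullmatch(r"a{1, 3}", "aa")
# ...and the pattern matches the literal text instead.
literal = re.fullmatch(r"a{1, 3}", "a{1, 3}")

print(exact is not None, at_most is not None, at_least is not None)  # True True True
print(too_few is None)      # True
print(spaced is None)       # True
print(literal is not None)  # True
```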

In fact, *, + and ? can all be written with {m,n}: {0,} is the same as *, {1,} is the same as +, and {0,1} is the same as ?. Still, I encourage you to remember and use *, + and ?, because they are shorter and easier to read.

Another reason is that the matching engine has been optimized for *, + and ?, so they may also be more efficient.
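The equivalences can be spot-checked by comparing each short form with its {m,n} spelling over a handful of strings. This is only a sanity check on a few invented samples, not a proof, and same_matches is a hypothetical helper.

```python
import re

samples = ["", "a", "aa", "aaa", "b", "ab"]

def same_matches(short, long):
    """True if both patterns accept exactly the same sample strings."""
    return all(
        bool(re.fullmatch(short, s)) == bool(re.fullmatch(long, s))
        for s in samples
    )

pairs = [("a*", "a{0,}"), ("a+", "a{1,}"), ("a?", "a{0,1}")]
equivalent = {short: same_matches(short, long) for short, long in pairs}

print(equivalent)  # {'a*': True, 'a+': True, 'a?': True}
```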

Origin blog.csdn.net/CSNN2019/article/details/114464524