Introduction
Regular expressions (also called REs, regexes, or regex patterns) are essentially a tiny, highly specialized programming language embedded in Python and made available through the re module. To use them, you write rules that describe the set of strings you want to match. That set might contain English sentences, e-mail addresses, TeX commands, and so on.
note:
Python's regular expression engine is written in C, so it is very efficient. The term regular expression (RE) as used here refers to the "rules" mentioned above.
Character match
Most letters and characters simply match themselves. For example, the regular expression Dog matches the string "Dog" exactly; matching is case sensitive by default. There are exceptions to this rule: a few special characters, called metacharacters, do not match themselves. Instead, they define character classes, subgroups, and pattern repetitions.
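As a minimal sketch of literal matching, the snippet below searches two sample sentences (made up for illustration) for the pattern Dog. The re.IGNORECASE flag is not discussed in this section; it is included only to contrast with the default case-sensitive behaviour.

```python
import re

# A literal pattern matches itself, and matching is case sensitive by default.
exact = re.search("Dog", "My Dog barks")
wrong_case = re.search("Dog", "my dog barks")

# Assumption for contrast: re.IGNORECASE switches off case sensitivity.
ignore_case = re.search("Dog", "my dog barks", re.IGNORECASE)

print(exact.group())        # Dog
print(wrong_case)           # None
print(ignore_case.group())  # dog
```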
Metacharacter
Below is the complete list of metacharacters:
. ^ $ * + ? { } [ ] \ | ( )
Without these metacharacters, regular expressions would be no more capable than the plain string method find().
[ ]
First, the square brackets [ and ]. They specify a character class, that is, the set of characters you want to match. The characters can be listed individually, or a range can be given as two characters separated by a hyphen (-). For example, [abc] matches the character a, b, or c; [a-c] does the same, using a range to express the same set. If you want to match only lowercase letters, your RE can be written as [a-z].
note:
Most metacharacters lose their special meaning inside square brackets and simply match themselves. For example, [akm$] matches any of the characters 'a', 'k', 'm', or '$'. Although '$' is normally a metacharacter, inside a character class it matches only the literal '$' character.
You can also match every character that is not listed in the brackets by placing a caret (^) at the beginning of the class. For example, [^5] matches any character except '5'.
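The character-class rules above can be tried out with re.findall. This is only a sketch; the sample strings are made up for illustration.

```python
import re

text = "a1b2c3$5k"  # hypothetical sample input

listed = re.findall(r"[abc]", text)           # characters listed individually
ranged = re.findall(r"[a-c]", text)           # the same set written as a range
literal_dollar = re.findall(r"[akm$]", text)  # '$' is an ordinary character here
not_five = re.findall(r"[^5]", "1525")        # complemented class: anything but '5'

print(listed)          # ['a', 'b', 'c']
print(ranged)          # ['a', 'b', 'c']
print(literal_dollar)  # ['a', '$', 'k']
print(not_five)        # ['1', '2']
```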
Backslash \
As in Python string literals, a metacharacter that is immediately preceded by a backslash loses its special meaning. For example, to match a literal [ or \, put a backslash in front of each: \[ and \\.
A backslash followed by certain ordinary characters has a special meaning of its own, such as the set of decimal digits, the set of letters, or the set of non-whitespace characters.
In short: a backslash in front of a metacharacter removes its special meaning, while a backslash in front of certain ordinary characters gives them one.
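A short sketch of escaping in practice. Raw string literals (the r"..." prefix) are used so that Python itself does not interpret the backslashes. The sample text is made up, and re.escape is shown only as a convenience for escaping a string automatically.

```python
import re

text = r"path[0]\name"  # hypothetical input containing '[' and one backslash

brackets = re.findall(r"\[", text)     # \[ matches a literal '['
backslashes = re.findall(r"\\", text)  # \\ matches a literal backslash

# re.escape adds the backslashes for you.
escaped = re.escape("[")

print(brackets)     # ['[']
print(backslashes)  # ['\\']  (one backslash, shown in repr form)
print(escaped)      # \[
```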
For example, \w matches any word character. If the regular expression is a bytes pattern, \w corresponds to the character class [a-zA-Z0-9_]. If the regular expression is a string, \w matches every character that the Unicode database (as exposed by the unicodedata module) marks as a letter or digit, plus the underscore. You can restrict \w further by supplying the re.ASCII flag when compiling the regular expression.
The re.ASCII flag makes \w match only ASCII characters. Do not forget that strings in Python 3 are Unicode.
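The difference between string patterns, bytes patterns, and re.ASCII can be sketched as follows. The sample text is made up; it is written with \u escapes so that the accented letters are unambiguous.

```python
import re

text = "na\u00efve_caf\u00e9 42"  # "naive_cafe 42" with accented i and e

unicode_words = re.findall(r"\w+", text)          # str pattern: Unicode-aware
ascii_words = re.findall(r"\w+", text, re.ASCII)  # limited to [a-zA-Z0-9_]
bytes_words = re.findall(rb"\w+", text.encode("utf-8"))  # bytes pattern

print(unicode_words)  # ['naïve_café', '42']
print(ascii_words)    # ['na', 've_caf', '42']
print(bytes_words)    # [b'na', b've_caf', b'42']
```

With re.ASCII or a bytes pattern, the accented letters are no longer word characters, so they split the words.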
Special character | Meaning |
---|---|
\d | Matches any decimal digit; equivalent to the class [0-9] |
\D | The opposite of \d: matches any non-digit character; equivalent to the class [^0-9] |
\s | Matches any whitespace character (space, newline, tab, etc.); equivalent to the class [ \t\n\r\f\v] |
\S | The opposite of \s: matches any non-whitespace character; equivalent to the class [^ \t\n\r\f\v] |
\w | Matches any word character; see the explanation above |
\W | The opposite of \w: matches any non-word character |
\b | Matches the boundary at the beginning or end of a word |
\B | The opposite of \b: matches a position that is not a word boundary |
These sequences can also appear inside a character class, where they keep their special meaning. For example, [\s,.] is a character class that matches any whitespace character (the special meaning of \s), ',' or '.'.
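A few of the special sequences from the table, sketched with re.findall. The sample strings are made up for illustration.

```python
import re

text = "Order 66, shipped.\tDone"  # hypothetical sample input

digits = re.findall(r"\d", text)          # decimal digits
separators = re.findall(r"[\s,.]", text)  # whitespace, ',' or '.'
whole_word = re.findall(r"\bcat\b", "cat concat cats")  # 'cat' as a whole word

print(digits)      # ['6', '6']
print(separators)  # [' ', ',', ' ', '.', '\t']
print(whole_word)  # ['cat']
```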
.
Another metacharacter is the dot (.), which matches any character except a newline. If the re.DOTALL flag is set, . matches any character, including a newline.
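A minimal sketch of the dot with and without re.DOTALL, using a made-up two-line string:

```python
import re

text = "a\nb"  # hypothetical input with a newline in the middle

default_dot = re.findall(r".", text)            # the newline is skipped
dotall_dot = re.findall(r".", text, re.DOTALL)  # the newline is matched too

print(default_dot)  # ['a', 'b']
print(dotall_dot)   # ['a', '\n', 'b']
```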
*
Now let's look at the metacharacter *. It does not match the literal character '*'; instead, it specifies that the previous character can be matched zero or more times.
For example, ca*t matches ct (zero 'a' characters), cat (one 'a'), caaat (three 'a' characters), and so on. Note that, because of internal limits stemming from the size of C's int type, the regular expression engine will not let 'a' repeat more than about 2 billion times.
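The ca*t example can be checked with re.fullmatch, which succeeds only if the whole string matches the pattern. The extra string cbt is an added counterexample, not from the text above.

```python
import re

pattern = re.compile(r"ca*t")

# True where the whole string matches, False otherwise.
star_results = {s: bool(pattern.fullmatch(s)) for s in ["ct", "cat", "caaat", "cbt"]}

print(star_results)
# {'ct': True, 'cat': True, 'caaat': True, 'cbt': False}
```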
Repetition in regular expressions is greedy by default. When repeating an RE, the matching engine tries to match as many repetitions as possible. If the rest of the pattern then fails to match, or the end of the string is reached, the engine backs up one character and tries again.
Let's walk through an example step by step to see what "greedy" means. Consider the expression a[bcd]*b. It matches the character 'a', then zero or more characters from the class [bcd], and finally ends with 'b'. Now imagine matching this RE against the string abcbd. What happens?
Step | Match | Description |
---|---|---|
1 | a | The 'a' at the start of the RE matches |
2 | abcbd | The engine matches [bcd]* as far as it can, which is to the end of the string |
3 | failure | The engine tries to match the final 'b' of the RE, but the current position is already at the end of the string, so it fails |
4 | abcb | Back up so that [bcd]* matches one character fewer |
5 | failure | Try to match the final 'b' of the RE again, but the character at the current position is 'd', so it fails |
6 | abc | Back up again so that [bcd]* matches only 'bc' |
7 | abcb | Try the final 'b' again; this time the character at the current position is 'b', so the match succeeds |
The final result of the match is abcb.
Regular expressions match greedily by default; non-greedy matching is described later.
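The outcome of the walkthrough can be confirmed directly. This sketch shows only the final result, since the re module does not expose the intermediate backtracking steps.

```python
import re

match = re.match(r"a[bcd]*b", "abcbd")

greedy_result = match.group()  # the text that was finally matched
greedy_span = match.span()     # start and end positions of the match

print(greedy_result)  # abcb
print(greedy_span)    # (0, 4)
```

The trailing 'd' is left out: [bcd]* first consumed it, then gave characters back until the final 'b' could match.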
+
Pay special attention to the difference between * and +: * matches zero or more times, so the repeated content may not appear at all, whereas + requires at least one occurrence. For example, ca+t matches cat and caaat, but does not match ct.
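The same check as for *, this time with +, as a short sketch:

```python
import re

pattern = re.compile(r"ca+t")

# At least one 'a' is required, so "ct" no longer matches.
plus_results = {s: bool(pattern.fullmatch(s)) for s in ["ct", "cat", "caaat"]}

print(plus_results)
# {'ct': False, 'cat': True, 'caaat': True}
```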
?
There are two more metacharacters that express repetition. The first is the question mark (?), which specifies that the previous character matches zero or one time. You can think of it as marking something as optional. For example, colou?r matches both color and colour.
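A small sketch of an optional character. The pattern colou?r is an illustrative choice: the u may appear once or not at all, but not twice.

```python
import re

pattern = re.compile(r"colou?r")  # hypothetical example pattern

optional_results = {s: bool(pattern.fullmatch(s)) for s in ["color", "colour", "colouur"]}

print(optional_results)
# {'color': True, 'colour': True, 'colouur': False}
```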
{}
The most flexible repetition metacharacter is {m,n}, where m and n are decimal integers. The repetition metacharacters described above can all be expressed with it. It means that the previous character must match at least m and at most n times. For example, a/{1,3}b matches a/b, a//b, and a///b. It does not match ab (no slashes), nor a////b (four slashes, more than the three allowed).
You can omit m or n, in which case the engine assumes a reasonable value. If m is omitted, the lower limit is 0; if n is omitted, the upper limit is infinity (in practice, the limit of about 2 billion mentioned above).
In other words, {,n} is equivalent to {0,n}, {m,} is equivalent to {m,infinity}, and {n} means the previous character is repeated exactly n times. A very common mistake is to write {m, n} with a space after the comma. It may look tidier, but a space inside the braces changes the meaning: in Python's re module the braces are then no longer treated as a repetition count.
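The {m,n} forms can be sketched as below. The last two lines illustrate the space pitfall as it behaves in Python's re module, where the braces with a space are matched as literal text.

```python
import re

bounded = re.compile(r"a/{1,3}b")
brace_results = {s: bool(bounded.fullmatch(s))
                 for s in ["ab", "a/b", "a//b", "a///b", "a////b"]}

no_lower = bool(re.fullmatch(r"a/{,2}b", "ab"))   # {,2} behaves like {0,2}
exact = bool(re.fullmatch(r"a/{2}b", "a//b"))     # {2}: exactly two slashes

# Pitfall: with a space after the comma, the braces are literal characters.
spaced = re.fullmatch(r"a/{1, 3}b", "a//b")
spaced_literal = re.fullmatch(r"a/{1, 3}b", "a/{1, 3}b")

print(brace_results)
# {'ab': False, 'a/b': True, 'a//b': True, 'a///b': True, 'a////b': False}
print(no_lower, exact)             # True True
print(spaced)                      # None
print(spaced_literal is not None)  # True
```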
In fact, *, +, and ? can all be replaced by {m,n}: {0,} is the same as *, {1,} is the same as +, and {0,1} is the same as ?. However, I encourage you to remember and use *, +, and ?, because they are shorter and easier to read.
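Readability is the main reason to prefer them: in CPython's re module, * + ? are parsed into the same repeat operations as the equivalent {m,n} forms, so no speed advantage should be expected.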
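The equivalence between the short forms and their {m,n} spellings can be spot-checked on a few sample strings. This is a sketch on made-up inputs, not a proof.

```python
import re

samples = ["ct", "cat", "caaat", "xcaty"]  # hypothetical test strings

# Each pair: short form, equivalent {m,n} form.
pairs = [
    (r"ca*t", r"ca{0,}t"),
    (r"ca+t", r"ca{1,}t"),
    (r"ca?t", r"ca{0,1}t"),
]

# True if both spellings agree on every sample string.
equivalent = all(
    bool(re.fullmatch(short, s)) == bool(re.fullmatch(braced, s))
    for short, braced in pairs
    for s in samples
)

print(equivalent)  # True
```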