Table of contents
1. Dot: match all characters
2. Asterisk: Repeat match any number of times
3. Plus sign: repeat matching multiple times
4. Curly braces: match the specified number of times
5. Greedy mode and non-greedy mode
6. Backslash: Escaping metacharacters
7. Square brackets: match one of several characters
8. Start, end position and single-line, multi-line mode
9. Parentheses: group selection
regular expression
Application scenario: extracting information from text
The key is knowing how to use regular expression syntax correctly
Verification website: https://regex101.com/
Character classification:
- Ordinary characters: no special meaning, directly used to match
- Special characters: also known as metacharacters; they have special meanings and are not matched literally
1. Dot: match all characters
".": matches any single character except a newline.
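import re

content = '''ive是芙
izone不是芙'''

# The r prefix creates a raw string, so backslashes are not treated as escapes
p = re.compile(r'.芙')
# findall returns every piece of text that matches the pattern
for one in p.findall(content):
    # <class 'str'>
    print(type(one))
    print(one)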
Check the type of p after calling compile:
# <class 're.Pattern'>
# p is a compiled Pattern object, which is why the methods of that class (such as findall) can be called on it
print(type(p))
2. Asterisk: Repeat match any number of times
By default, a dot matches exactly one character; adding a quantifier such as an asterisk lets it match several.
"*": repeats the preceding character or expression any number of times, including 0 times.
- "*" is often combined with "." as ".*", which matches any run of characters (except newlines), including an empty one.
- For example: ",.*" matches the Chinese comma and everything after it up to the end of the line.
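As a minimal sketch of the ".*" combination: the sample text below is invented for illustration, and an ASCII comma stands in for the Chinese comma. It shows that the part after the comma may also be empty, because "*" allows zero repetitions.

```python
import re

# Hypothetical sample text: the second line has nothing after the comma
content = "apple,green\nbanana,\ncherry"

# ",.*" = a comma, then any characters except newline, zero or more times
after_comma = re.findall(r",.*", content)
print(after_comma)  # [',green', ',']
```

The third line contains no comma, so it produces no match at all.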
3. Plus sign: repeat matching multiple times
"+": repeats the preceding character or expression one or more times; 0 times is not allowed.
- The only difference from "*" is that "*" accepts 0 repetitions and "+" does not
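The difference can be seen side by side in a small sketch; the sample text and the variable names are made up for this illustration.

```python
import re

text = "a ab abbb"  # hypothetical sample text

star = re.findall(r"ab*", text)  # "b" may be missing entirely
plus = re.findall(r"ab+", text)  # at least one "b" is required

print(star)  # ['a', 'ab', 'abbb']
print(plus)  # ['ab', 'abbb']
```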
4. Curly braces: match the specified number of times
"{}": repeats the character or expression in front of it the specified number of times.
- c{min,max}: c is the character to match, min is the minimum number of occurrences, and max is the maximum. Do not put a space after the comma; Python then treats the braces as literal text
- c{num}: match exactly num times
Matching an 11-digit phone number: \d{11}, where \d represents a digit.
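A short sketch of both forms follows. The phone number and the sample strings are invented for illustration; they are not taken from the original examples.

```python
import re

# Hypothetical sample text; the phone number is made up
text = "Tel: 13812345678, ext 1234"

# \d{11}: exactly 11 digits in a row
phones = re.findall(r"\d{11}", text)
print(phones)  # ['13812345678']

# o{2,3}: two or three "o" characters, taking as many as possible
runs = re.findall(r"o{2,3}", "o oo ooo oooo")
print(runs)  # ['oo', 'ooo', 'ooo']
```

The 4-digit extension is too short for \d{11}, and the run of four "o" characters yields only one match of three because the leftover single "o" is below the minimum.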
5. Greedy mode and non-greedy mode
"*", "+", "?" and "{}" are all greedy: they match as much content as possible.
<html><head><title>Title</title></head></html>
Adding "?" after a quantifier switches it to non-greedy mode, where it matches as little as possible.
Multiple objects are then matched separately instead of as one long match.
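The patterns used with the HTML text above are not shown, so the sketch below assumes `<.*>` for greedy mode and `<.*?>` for non-greedy mode; the variable names are likewise illustrative.

```python
import re

source = '<html><head><title>Title</title></head></html>'

# Greedy: ".*" grabs as much as possible, so the whole text is one match
greedy = re.findall(r'<.*>', source)
print(greedy)  # ['<html><head><title>Title</title></head></html>']

# Non-greedy: ".*?" stops at the first ">" it can reach, so each tag is a match
lazy = re.findall(r'<.*?>', source)
print(lazy)  # ['<html>', '<head>', '<title>', '</title>', '</head>', '</html>']
```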
6. Backslash: Escaping metacharacters
"\" placed before a metacharacter turns it into an ordinary character.
"\" followed by certain letters matches any character of a given type.
- \d: matches any digit 0-9, equivalent to [0-9]
- \D: matches any character that is not a digit, equivalent to [^0-9]
- \s: matches any whitespace character, including spaces, tabs and newlines, equivalent to [ \t\n\r\f\v]
- \S: matches any non-whitespace character, equivalent to [^ \t\n\r\f\v]
- \w: matches any word character, that is, uppercase and lowercase letters, digits and the underscore, equivalent to [a-zA-Z0-9_]
- \W: matches any non-word character, equivalent to [^a-zA-Z0-9_]
By default, \w also matches Unicode word characters (Chinese characters, for example); if the ASCII flag is specified, it matches only [a-zA-Z0-9_].
- re.compile(r'.芙', re.A)
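The following sketch tries out escaping and the character types. All sample strings and variable names are invented for illustration.

```python
import re

# Escaping: "\." matches a literal dot, while a bare "." matches any character
literal_dot = re.findall(r"3\.14", "3.14 and 3x14")
any_char = re.findall(r"3.14", "3.14 and 3x14")
print(literal_dot)  # ['3.14']
print(any_char)     # ['3.14', '3x14']

text = "id_7 = 42\t芙"  # hypothetical sample text containing a tab

digits = re.findall(r"\d+", text)
spaces = re.findall(r"\s", text)
words = re.findall(r"\w+", text)               # Unicode-aware by default
ascii_words = re.findall(r"\w+", text, re.A)   # ASCII-only with re.A

print(digits)       # ['7', '42']
print(spaces)       # [' ', ' ', '\t']
print(words)        # ['id_7', '42', '芙']
print(ascii_words)  # ['id_7', '42']
```

With re.A the Chinese character no longer counts as a word character, which is the only difference between the last two results.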
7. Square brackets: match one of several characters
- 1[35]\d{9}: the brackets list the allowed characters, so the second character must be 3 or 5
- 1[3-5]\d{9}: "-" indicates a range, so the second character can be 3, 4 or 5
Going one step further:
- "." becomes an ordinary character in "[]" and is no longer a metacharacter
- "^" as the first character in "[]" means "not": the brackets match any character that is not listed
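A small sketch of the bracket rules above; the phone numbers and other strings are made up for this illustration.

```python
import re

numbers = "13512345678 14512345678 16512345678"  # made-up numbers

listed = re.findall(r"1[35]\d{9}", numbers)   # second digit is 3 or 5
ranged = re.findall(r"1[3-5]\d{9}", numbers)  # second digit is 3, 4 or 5
print(listed)  # ['13512345678']
print(ranged)  # ['13512345678', '14512345678']

# Inside "[]" the dot is an ordinary character
dot_or_x = re.findall(r"[.x]", "a.bxc")
print(dot_or_x)  # ['.', 'x']

# A leading "^" inside "[]" negates the set
not_digits = re.findall(r"[^\d]+", "ab12cd")
print(not_digits)  # ['ab', 'cd']
```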
8. Start, end position and single-line, multi-line mode
"^" matches at the beginning of the text, so the pattern after it only matches there.
- The results differ between single-line mode (the default) and multi-line mode: in single-line mode "^" matches only at the start of the whole text, while in multi-line mode it matches at the start of every line
- Multi-line mode: re.compile(r'.fu', re.M)
"$" matches at the end of the text; in multi-line mode it matches at the end of every line.
9. Parentheses: group selection
Parentheses mark part of what the regular expression matches as a group, so that part can be extracted on its own.
A single regular expression can contain multiple groups.
With multiple groups, each match yields one value per group.
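As a sketch of how groups show up in results, the example below uses invented sample text and patterns. With findall, one group gives a list of strings and several groups give a list of tuples.

```python
import re

content = "Alice, age 30\nBob, age 25"  # hypothetical sample text

# One group: findall returns only the text captured by the group
names = re.findall(r"(\w+), age", content)
print(names)  # ['Alice', 'Bob']

# Two groups: findall returns one tuple per match, one item per group
pairs = re.findall(r"(\w+), age (\d+)", content)
print(pairs)  # [('Alice', '30'), ('Bob', '25')]

# search returns a match object; group(0) is the whole match
m = re.search(r"(\w+), age (\d+)", content)
print(m.group(0), "|", m.group(1), "|", m.group(2))  # Alice, age 30 | Alice | 30
```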