[Python] Getting started with regular expression syntax

Table of contents

regular expression

1. Dot: match all characters

2. Asterisk: Repeat match any number of times

3. Plus sign: repeat matching multiple times

4. Curly braces: match the specified number of times

5. Greedy mode and non-greedy mode

6. Backslash: Escaping metacharacters

7. Square brackets: match one of several characters

8. Start, end position and single-line, multi-line mode

9. Parentheses: group selection


regular expression

Application scenario: extracting information from text.

The key is knowing how to use regular expression syntax correctly.

Verification website: https://regex101.com/

Character classification:

  • Ordinary characters: have no special meaning and match themselves literally
  • Special characters: also known as metacharacters; they have special meanings and do not match themselves literally unless escaped

1. Dot: match all characters

".": matches any single character except a newline.

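import re

content = '''ive是芙
izone不是芙'''

# The r prefix makes a raw string, so backslashes are not treated as escapes
p = re.compile(r'.芙')
# findall returns every piece of text that matches the pattern
for one in p.findall(content):
    # Each result is a plain string: <class 'str'>
    print(type(one))
    print(one)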

Check the type of p returned by compile:

# <class 're.Pattern'>
# compile returns a Pattern object, which is why its methods (such as findall) can be called
print(type(p))

2. Asterisk: Repeat match any number of times

By default, a dot matches exactly one character. Quantifiers such as the asterisk let a pattern match several characters in a row.

"*": matches the preceding element any number of times, including 0 times.

  • "*" is often combined with "." as ".*", which matches any run of characters (except newlines), including an empty one. Placed before or after a specific character, it matches everything on the line up to or following that character.
  • For example: ",.*" matches a comma followed by all remaining characters on the same line (for a Chinese full-width comma, write "，.*").
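
As a minimal sketch of the asterisk in use, the snippet below applies ",.*" to a small sample. The text in content is hypothetical and exists only for illustration.

```python
import re

# Hypothetical sample text: a name, a comma, then an optional description.
content = "apple,green\norange,orange\nmonkey,"

# ",.*" matches a comma followed by zero or more non-newline characters.
p = re.compile(r",.*")
matches = p.findall(content)
print(matches)  # [',green', ',orange', ',']
```

The last line still produces a match, because "*" accepts zero repetitions after the comma.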

3. Plus sign: repeat matching multiple times

"+": matches the preceding element one or more times (0 times is not allowed).

  • The only difference from "*" is that "*" accepts 0 repetitions and "+" requires at least 1
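
The difference can be seen by running both quantifiers against the same text. This is only a sketch, and the sample string is hypothetical.

```python
import re

# Hypothetical sample text; the last line has nothing after the comma.
content = "apple,green\norange,orange\nmonkey,"

star_matches = re.findall(r",.*", content)
plus_matches = re.findall(r",.+", content)

print(star_matches)  # [',green', ',orange', ',']
print(plus_matches)  # [',green', ',orange']
```

With "+", the bare comma on the last line is skipped because at least one character must follow it.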

4. Curly braces: match the specified number of times

"{}": matches the element in front of "{}" the specified number of times.

  • c{min,max}: c is the matched character, min is the minimum number of occurrences, max is the maximum number of occurrences. Write no space after the comma, otherwise Python's re module treats the braces as literal text
  • c{num}: matches c exactly num times

For example, \d{11} matches an 11-digit phone number, where \d represents a digit.
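
A short sketch of both forms follows. The sample strings are made up for illustration and are not real phone numbers.

```python
import re

# Hypothetical text with one 11-digit run and one shorter run.
text = "Call 13812345678 or 5551234."

# \d{11}: exactly 11 consecutive digits.
phones = re.findall(r"\d{11}", text)
print(phones)  # ['13812345678']

# \d{3,4}: between 3 and 4 digits, taking as many as possible.
runs = re.findall(r"\d{3,4}", "a1 b12 c123 d1234 e12345")
print(runs)  # ['123', '1234', '1234']
```

In the last case, "12345" yields "1234" and the leftover "5" is too short to match on its own.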

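5. Greedy mode and non-greedy mode

"*", "+", "?" and "{}" are all greedy: they match as much content as possible. Take the following text as an example:

<html><head><title>Title</title></head></html>

Adding "?" after a quantifier (for example "*?") switches it to non-greedy mode, where it matches as little content as possible.

In non-greedy mode, the tags in the text above are matched one by one as separate results instead of as a single long match.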
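
The following sketch compares the two modes on the HTML line above. The patterns "<.*>" and "<.*?>" are an assumption chosen for illustration, since the original patterns are not shown in the text.

```python
import re

source = "<html><head><title>Title</title></head></html>"

# Greedy: ".*" grabs as much as possible, so the match runs to the last ">".
greedy = re.findall(r"<.*>", source)
# Non-greedy: ".*?" stops at the first ">" it can reach.
non_greedy = re.findall(r"<.*?>", source)

print(greedy)      # ['<html><head><title>Title</title></head></html>']
print(non_greedy)  # ['<html>', '<head>', '<title>', '</title>', '</head>', '</html>']
```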

6. Backslash: Escaping metacharacters

"\" escapes a metacharacter so that it is treated as an ordinary character. For example, "\." matches a literal dot.

"\" followed by certain letters matches any character of a given type.

  • \d: matches any digit 0-9, equivalent to [0-9]
  • \D: matches any character that is not a digit, equivalent to [^0-9]
  • \s: matches any whitespace character, including spaces, tabs, newlines, etc., equivalent to [ \t\n\r\f\v]
  • \S: matches any non-whitespace character, equivalent to [^ \t\n\r\f\v]
  • \w: matches any word character, including uppercase and lowercase letters, digits, and underscores, equivalent to [a-zA-Z0-9_]
  • \W: matches any non-word character, equivalent to [^a-zA-Z0-9_]

The bracket equivalents above hold exactly in ASCII mode. By default, \w also matches Unicode word characters (such as Chinese characters); if the ASCII flag (re.A) is specified, it matches only ASCII letters, digits, and underscores.

  • re.compile(r'.芙', re.A)
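
As a rough illustration of escaping and of the ASCII flag, consider the sketch below. The sample string is hypothetical.

```python
import re

# Hypothetical sample mixing ASCII text, a decimal number and a Chinese character.
text = "Price: 3.50 USD, 芙_1"

# "\." matches a literal dot instead of "any character".
decimals = re.findall(r"\d\.\d+", text)
print(decimals)  # ['3.50']

# Default (Unicode) mode: \w also matches the Chinese character.
unicode_words = re.findall(r"\w+", text)
print(unicode_words)  # ['Price', '3', '50', 'USD', '芙_1']

# ASCII mode: \w is limited to [a-zA-Z0-9_].
ascii_words = re.findall(r"\w+", text, re.A)
print(ascii_words)  # ['Price', '3', '50', 'USD', '_1']
```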

7. Square brackets: match one of several characters

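  • 1[35]\d{9}: "[]" matches any one of the listed characters, so the second digit must be 3 or 5
  • 1[3-5]\d{9}: "-" indicates a range, so the second digit can be 3, 4 or 5

Going one step further:

  • "." becomes an ordinary character inside "[]" and is no longer a metacharacter
  • "^" at the start of "[]" means "not": [^...] matches any character that is not listed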
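
The points above can be sketched as follows. The numbers are made-up samples, not real phone numbers.

```python
import re

# Hypothetical 11-digit numbers whose second digit is 3, 4 and 6.
numbers = "13512345678 14512345678 16512345678"

listed = re.findall(r"1[35]\d{9}", numbers)
ranged = re.findall(r"1[3-5]\d{9}", numbers)
print(listed)  # ['13512345678']
print(ranged)  # ['13512345678', '14512345678']

# Inside "[]" a dot is literal, and a leading "^" negates the set.
dots = re.findall(r"[.]", "a.b")
non_digits = re.findall(r"[^0-9]", "a1b2")
print(dots)        # ['.']
print(non_digits)  # ['a', 'b']
```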

8. Start, end position and single-line, multi-line mode

"^" matches only at the start position: the beginning of the whole text in single-line mode (the default), or the beginning of each line in multi-line mode.

  • The matching results in single-line mode and multi-line mode are therefore different
  • Multi-line mode: re.compile(r'.fu', re.M)

"$" matches only at the end position: the end of the whole text in single-line mode, or the end of each line in multi-line mode.

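A small sketch of how "^" and "$" behave with and without re.M is given below. The three-line sample text is hypothetical.

```python
import re

# Hypothetical sample: each line is a number, a dash and a name.
content = "001-apple\n002-orange\n003-banana"

# Single-line (default) mode: "^" only matches at the very start of the text.
start_single = re.findall(r"^\d+", content)
# Multi-line mode: "^" matches at the start of every line.
start_multi = re.findall(r"^\d+", content, re.M)
print(start_single)  # ['001']
print(start_multi)   # ['001', '002', '003']

# The same applies to "$" at the end position.
end_single = re.findall(r"[a-z]+$", content)
end_multi = re.findall(r"[a-z]+$", content, re.M)
print(end_single)  # ['banana']
print(end_multi)   # ['apple', 'orange', 'banana']
```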

 

9. Parentheses: group selection

Grouping marks parts of the content matched by a regular expression as groups, so that those parts can be extracted separately.

A regular expression can contain multiple groups.

When a pattern contains multiple groups, findall returns one tuple per match, with one element per group.
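
As a minimal sketch of grouping, the example below extracts a name and a number from each line. The sample text and the pattern are assumptions made for illustration.

```python
import re

# Hypothetical sample: a name and an 11-digit number on each line.
content = "Alice, phone 13512345678\nBob, phone 14512345678"

# One group: findall returns only the grouped part of each match.
names = re.findall(r"(\w+), phone", content)
print(names)  # ['Alice', 'Bob']

# Two groups: findall returns a tuple per match, one element per group.
p = re.compile(r"^(\w+), phone (\d{11})$", re.M)
pairs = p.findall(content)
print(pairs)  # [('Alice', '13512345678'), ('Bob', '14512345678')]

# With finditer, each group is available by its index on the match object.
for m in p.finditer(content):
    print(m.group(1), m.group(2))
```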

Origin blog.csdn.net/m0_64140451/article/details/131744240