Python regular expressions, commonly used parameters

Regular expression

1. Introduction

The best online tool for testing regular expressions, bar none:
https://regex101.com/

The latest official documentation:
https://docs.python.org/zh-cn/3/library/re.html?highlight=re#module-re

Step 1: Define the variable that holds the text
Step 2: Import the regular expression module
Step 3: Define the pattern to filter with
For example:

content ='''沐足城,消费98块钱
沐足城,消费168块钱
沐足城,消费298块钱
沐足城,消费998块钱'''
import re
p = re.compile(r'')
for dalao in p.findall(content):
	print(dalao)
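
The pattern above is left empty as a placeholder. A minimal sketch of the same three steps with a concrete pattern, assuming we want to pull out the amounts (the pattern \d+ and the name amounts are illustrative choices, not part of the original):

```python
import re

# Step 1: the text to search (same sample data as above).
content = '''沐足城,消费98块钱
沐足城,消费168块钱
沐足城,消费298块钱
沐足城,消费998块钱'''

# Steps 2 and 3: compile a pattern that matches one or more digits.
p = re.compile(r'\d+')

amounts = p.findall(content)
for amount in amounts:
    print(amount)  # 98, 168, 298, 998 (one per line)
```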

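Note

# Here the data starts on the first line of the string
content ='''沐足城,消费98块钱
'''
# Here the data starts on the second line, because the string begins with a newline
content ='''
沐足城,消费98块钱
'''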
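
Why the difference matters can be sketched as follows. The pattern ^.+ and the variable names are assumptions chosen for illustration: without flags, ^ only matches at the very start of the string, so a leading newline changes the result.

```python
import re

first = '''沐足城,消费98块钱
'''
second = '''
沐足城,消费98块钱
'''

# ^ anchors at position 0 only, and . never matches a newline.
starts_first = re.findall(r'^.+', first)
starts_second = re.findall(r'^.+', second)

print(starts_first)    # ['沐足城,消费98块钱']
print(starts_second)   # [] because the string starts with an empty line
```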

2. Special characters

Metacharacters

. * + ? \ [ ] ^ $ { } | ( )

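Practical tip
Because patterns contain these special symbols, the pattern string passed to re.compile() needs the prefix r:

re.compile(r'')

The r prefix marks a raw string, so Python does not process backslash escapes in the literal before the regex engine sees it.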
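
A small sketch of what goes wrong without the prefix. The sample text and names are made up for illustration: in a normal string '\b' is a backspace character, while in a raw string it reaches the regex engine as the word-boundary escape.

```python
import re

text = 'a cat sat'

# Raw string: \b is passed through to the regex engine as a word boundary.
with_raw = re.findall(r'\bcat\b', text)

# Normal string: Python turns '\b' into a backspace character first,
# so the pattern looks for backspace + 'cat' + backspace.
without_raw = re.findall('\bcat\b', text)

print(with_raw)      # ['cat']
print(without_raw)   # []
print(len('\n'), len(r'\n'))  # 1 2
```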

So why can we loop?
Because re.compile() returns a compiled pattern object (an instance of a class), so we can call its methods.

p = re.compile(r'')
print(type(p))
The result is <class 're.Pattern'>, which is a class

.findall returns a list of all the text that matches

for dalao in p.findall(content):
	print(type(dalao))
Each item is <class 'str'>, a string

3. Explanation of metacharacters

  1. 【.】

.

The dot matches any single character except a newline (exactly one character, never zero).
In the example below, the match ends with 块 and includes the one character before it.
Example

p = re.compile(r'.块')
8块
8块
8块
8块
  1. 【*】

*

Matches the preceding subexpression any number of times, including zero.
By default matching works line by line: the dot does not match a newline, so .* stops at the end of the line.
Example

p = re.compile(r',.*')
,消费98块钱
,消费168块钱
,消费298块钱
,消费998块钱
【,】 starts with a comma; 【.】 means any character; 【*】 means any number of times; 【.*】 means any character repeated any number of times

Example

p = re.compile(r'9*')
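9
9
99
(findall also returns an empty string at each position where no 9 starts, because * allows zero repetitions)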

Example

p = re.compile(r'费9*')
费9
费
费
费99

The match starts with 费, followed by 9 any number of times, possibly 0 times
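
The zero-repetition behaviour of * can be checked directly in Python. This is a sketch using the same sample text; the variable names are illustrative.

```python
import re

content = '''沐足城,消费98块钱
沐足城,消费168块钱
沐足城,消费298块钱
沐足城,消费998块钱'''

# 9* may match zero characters, so findall returns many empty strings.
all_matches = re.findall(r'9*', content)
non_empty = [m for m in all_matches if m]
print(non_empty)     # ['9', '9', '99']

# With a literal in front, every match contains at least 费.
fee_matches = re.findall(r'费9*', content)
print(fee_matches)   # ['费9', '费', '费', '费99']
```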

  1. 【+】

+

The + sign matches the preceding subexpression one or more times, so zero occurrences do not match.
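
A short sketch of + on the same sample text (variable names are illustrative). Because at least one repetition is required, no empty strings are returned and 费 alone no longer matches.

```python
import re

content = '''沐足城,消费98块钱
沐足城,消费168块钱
沐足城,消费298块钱
沐足城,消费998块钱'''

nines = re.findall(r'9+', content)
print(nines)      # ['9', '9', '99']

fee_nines = re.findall(r'费9+', content)
print(fee_nines)  # ['费9', '费99']
```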

  1. {}

{}

Curly braces match the preceding subexpression a specified number of times.
9{2}8 means 9 appears exactly 2 times, followed by 8.

9{2,3} means 9 appears 2 to 3 times.
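
Both forms can be tried out as follows. This is a sketch; the second test string is made up for illustration.

```python
import re

content = '''沐足城,消费98块钱
沐足城,消费168块钱
沐足城,消费298块钱
沐足城,消费998块钱'''

# Exactly two 9s followed by an 8.
exact = re.findall(r'9{2}8', content)
print(exact)    # ['998']

# Two to three 9s; the quantifier is greedy and takes three when it can.
ranged = re.findall(r'9{2,3}', '9 99 999 9999')
print(ranged)   # ['99', '999', '999']
```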

  1. 【\】

\

A backslash placed in front of a metacharacter escapes it.
\. matches a literal [.]; the dot no longer means any character.
\d matches any digit character, 0-9.
\D matches any character that is not a digit 0-9, including newline; equivalent to [^0-9].
\s matches any whitespace character (space, tab, newline, and so on); equivalent to [ \t\n\r\f\v].
\S matches any non-whitespace character; equivalent to [^ \t\n\r\f\v].
\w matches any word character (uppercase letter, lowercase letter, digit, underscore); equivalent to [a-zA-Z0-9_].
\W matches any non-word character; equivalent to [^a-zA-Z0-9_].
The bracket equivalents hold in ASCII mode; for str patterns Python 3 also matches Unicode digits, whitespace, and word characters by default, so \w matches Chinese characters too.
For example:
\d is a backslash plus d and represents a digit.
\d{3} means a digit appears 3 times.
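
A few of these escapes in action. This is a sketch; the short test strings other than content are invented for illustration.

```python
import re

content = '''沐足城,消费98块钱
沐足城,消费168块钱
沐足城,消费298块钱
沐足城,消费998块钱'''

three_digits = re.findall(r'\d{3}', content)
print(three_digits)   # ['168', '298', '998']

literal_dots = re.findall(r'\.', 'a.b.c')
print(literal_dots)   # ['.', '.']

non_digits = re.findall(r'\D+', 'a1b22c')
print(non_digits)     # ['a', 'b', 'c']

words = re.findall(r'\w+', 'user_1 = 42')
print(words)          # ['user_1', '42']
```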

  1. ?

?

First use
The ? sign matches the preceding subexpression 0 or 1 times.

Second use
A ? placed after a quantifier turns off greedy mode: each match is as short as possible, which usually yields more matches.

p = re.compile(r'<.*?>')

With the content <html><head></head></html>, the tool shows 4 matches on the right side.

In greedy mode

p = re.compile(r'<.*>')

The right side shows only 1 match, the whole string.
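
Both uses of ? can be reproduced in Python. This is a sketch; the colou?r example is an added illustration, not from the original.

```python
import re

html = '<html><head></head></html>'

# Non-greedy: stop at the first > each time.
lazy = re.findall(r'<.*?>', html)
print(lazy)      # ['<html>', '<head>', '</head>', '</html>']

# Greedy: run to the last > in the string.
greedy = re.findall(r'<.*>', html)
print(greedy)    # ['<html><head></head></html>']

# First use of ?: the preceding character is optional.
optional = re.findall(r'colou?r', 'color colour')
print(optional)  # ['color', 'colour']
```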

  1. [ ]

[]

Square brackets match exactly one character, and the characters listed inside define what is allowed.

[2-5]==[2345] the hyphen inside indicates a range
[25]==2 or 5
[a-c]==[abc]
[ac]==a or c
[A-C]==[ABC]
[AC]==A or C
[.]==only a literal dot here, no longer any character
[^...]==negation
[^\d]==not a digit
[^abc]==any character other than a, b, or c
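
A sketch of the character classes listed above, using small made-up strings:

```python
import re

in_range = re.findall(r'[2-5]', '1234567')
print(in_range)    # ['2', '3', '4', '5']

either = re.findall(r'[25]', '1234567')
print(either)      # ['2', '5']

dot_only = re.findall(r'[.]', 'a.b')
print(dot_only)    # ['.']

not_digits = re.findall(r'[^\d]+', 'a1b2')
print(not_digits)  # ['a', 'b']

not_abc = re.findall(r'[^abc]', 'abcde')
print(not_abc)     # ['d', 'e']
```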

  1. ^

^

The caret matches the beginning of the text

p = re.compile(r'^\d+')

Examples appear under re.MULTILINE below

  1. $

$

The dollar sign matches the end of the text

p = re.compile(r'\d+$')

Examples appear under re.MULTILINE below

  1. ()

()

Parentheses select a group.
From the text that matches the whole pattern, only the content inside the parentheses is returned.

import re
p = re.compile(r'^(.*)剑',re.MULTILINE)
for dalao in p.findall(content):
	print(dalao)

The result does not include the character 剑 itself.

Small example

Selecting multiple () groups

content = '''小明,是个学生,手机号码是18898836599
赵昊,是个大佬,手机号码是18898836588
王猛,是个技工,手机号码是18898836577'''

import re
p = re.compile(r'(.{2,3}),.*个(.*),.*(\d{11})',re.MULTILINE)
for dalao in p.findall(content):
	print(dalao)

Extract the name, occupation, and phone number.
Running it in our own editor produces:

('小明', '学生', '18898836599')
('赵昊', '大佬', '18898836588')
('王猛', '技工', '18898836577')

On the tool website,
"Full match" is the whole text that matched the pattern.
"Group" is the part captured by each pair of parentheses.
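
The same distinction is visible in Python through match objects. This is a sketch using finditer instead of findall (a swapped-in method for illustration): group(0) is the full match and groups() holds the captured groups.

```python
import re

content = '''小明,是个学生,手机号码是18898836599
赵昊,是个大佬,手机号码是18898836588
王猛,是个技工,手机号码是18898836577'''

p = re.compile(r'(.{2,3}),.*个(.*),.*(\d{11})')

results = []
for m in p.finditer(content):
    # group(0) is the full match; groups() is the tuple of () captures.
    results.append((m.group(0), m.groups()))
    print('full match:', m.group(0))
    print('groups:', m.groups())
```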

The second parameter (flags)

  1. re.ASCII (shorthand re.A)
p = re.compile(r'\w{2,4}',re.A)
p = re.compile(r'\w{2,4}',re.ASCII)

Makes \w, \W, \b, \B, \d, \D, \s and \S match ASCII only, not Unicode. This is only meaningful for Unicode (str) patterns and is ignored for bytes patterns.
After enabling the ASCII option in the tool (click the small gray flag), there are 4 matches on the right.
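
A sketch of the effect on \w, with a made-up sample string. By default Chinese characters count as word characters; with re.ASCII they do not.

```python
import re

text = '小明 abc_1 42'

unicode_words = re.findall(r'\w+', text)
print(unicode_words)  # ['小明', 'abc_1', '42']

ascii_words = re.findall(r'\w+', text, re.ASCII)
print(ascii_words)    # ['abc_1', '42']
```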

  1. re.MULTILINE (shorthand re.M)
p = re.compile(r'^\d+',re.M)
p = re.compile(r'^\d+',re.MULTILINE)

When this flag is set, the pattern character '^' matches at the beginning of the string and at the beginning of each line (immediately after each newline), and the pattern character '$' matches at the end of the string and at the end of each line (immediately before each newline). By default, '^' matches only at the beginning of the string, and '$' matches only at the end of the string and immediately before a newline at the end of the string. Corresponds to the inline flag (?m).

p = re.compile(r'^\d+',re.MULTILINE)

The beginning of each line of text

p = re.compile(r'^\d+')

Start of text only

p = re.compile(r'\d+$',re.MULTILINE)

End of each line

p = re.compile(r'\d+$')

Only at the end of the text
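
The four cases above can be compared side by side. This is a sketch; the three-line sample text is invented for illustration.

```python
import re

text = '''12 a 34
56 b 78
90 c 11'''

start_default = re.findall(r'^\d+', text)
start_multi = re.findall(r'^\d+', text, re.MULTILINE)
print(start_default)  # ['12']
print(start_multi)    # ['12', '56', '90']

end_default = re.findall(r'\d+$', text)
end_multi = re.findall(r'\d+$', text, re.MULTILINE)
print(end_default)    # ['11']
print(end_multi)      # ['34', '78', '11']
```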
Splitting with regular expressions

split

content ='''小明, 赵昊; 王猛。 张三'''
import re
contentlist = re.split(r'[,;。\s]\s*',content)
for dalao in contentlist:
	print(dalao)

Explanation

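[,;。\s] == a comma, a semicolon, a 。 full stop, or a whitespace character
\s* == whitespace repeated any number of times, from 0 to N

# new variable = re.split(pattern, string to split)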
contentlist = re.split(r'[,;。\s]\s*',content)

List of results

['小明', '赵昊', '王猛', '张三']
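
One thing to watch for is sketched below with a made-up input: when the text ends with a separator, re.split returns a trailing empty string, which can be filtered out.

```python
import re

text = '小明, 赵昊;'

parts = re.split(r'[,;。\s]\s*', text)
print(parts)    # ['小明', '赵昊', '']

# Drop the empty strings left by leading or trailing separators.
names = [part for part in parts if part]
print(names)    # ['小明', '赵昊']
```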

Origin blog.csdn.net/weixin_47021806/article/details/111399729