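Basic symbols:

^ matches the start of the string. (Exception: as the first character inside square brackets [ ], it negates the class, so [^abc] matches any single character that is not a, b, or c.)

$ matches the end of the string.

* matches the preceding element zero or more times.

+ matches the preceding element one or more times (at least once).

? matches the preceding element zero times or once.

. matches any single character except a newline.

| means "or": it matches either the expression on its left or the one on its right.

( ) parentheses group a sub-pattern so it is treated as one unit, and capture the text it matched.

[ ] square brackets match one character out of the set or ranges listed, e.g. [0-9a-zA-Z].

{ } curly braces limit how many times the preceding element repeats: {n} means exactly n times, {n,} at least n times, {n,m} at least n and at most m times.

\ is the escape character. To match any of the symbols above literally, put a backslash in front of it, e.g. \* matches a literal * character.

\w matches a letter, digit, or underscore; \W matches any character that \w does not.

\d matches a digit; \D matches any non-digit.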
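
The symbols above can be tried out in a few lines. This is only a minimal sketch: the sample strings and variable names are made up for illustration and do not come from the article.

```python
import re

# Hypothetical sample strings, chosen only to exercise the symbols above.
samples = ["cat", "cart", "ct", "caat", "dog"]

# ^ and $ anchor the pattern to the whole string;
# a+ needs at least one "a", r? allows zero or one "r".
pattern = re.compile(r"^ca+r?t$")
matched = [s for s in samples if pattern.search(s)]
print(matched)  # ['cat', 'cart', 'caat']

# [^0-9] is a negated class: any single character that is not a digit.
non_digits = re.findall(r"[^0-9]", "a1b22c")
print(non_digits)  # ['a', 'b', 'c']

# {2,3} repeats the preceding \d at least 2 and at most 3 times.
runs = re.findall(r"\d{2,3}", "1 22 333 4444")
print(runs)  # ['22', '333', '444']

# | chooses between alternatives; ( ) groups them and captures the result.
ext = re.search(r"\.(jpg|png)$", "photo.png").group(1)
print(ext)  # png
```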

1.match(pattern, string, flags=0)

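Tries to match the pattern at the very beginning of the string. Returns a match object, or None if the string does not start with a match.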

import re

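# 1. match() matches from the start of the string

content = 'Hello 123 456 Wrod_This is a Regex Dome'
# Raw string (r'...') so that \s and \d reach the regex engine unchanged
result = re.match(r'^Hello\s\d\d\d\s(\d{3}) Wrod_This', content)
print(result)
print(result.group())   # the full matched text
print(result.span())    # (start, end) indices of the match
print(result.group(1))  # text captured by the first group: 456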
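
Because match() only looks at position 0, it returns None whenever the pattern first occurs later in the string, and calling group() on None raises AttributeError. The following sketch reuses the article's sample string; the names late and first are illustrative only.

```python
import re

content = 'Hello 123 456 Wrod_This is a Regex Dome'

# The digits are not at the start of the string, so match() gives None.
late = re.match(r'\d{3}', content)
print(late)  # None

# Check for None before touching the match object.
first = re.match(r'Hello\s(\d{3})\s(\d{3})', content)
if first is not None:
    print(first.groups())  # ('123', '456')
    print(first.span(2))   # (10, 13), the position of the second group
```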

#2. Generic matching: .* matches any run of characters except newlines
content = 'Hello 123 456 Wrod_This is a Regex Dome'
result = re.match('Hello.*Dome$',content)
print(result)
#<re.Match object; span=(0, 39), match='Hello 123 456 Wrod_This is a Regex Dome'>

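#3. Greedy mode: .* matches as much as possible
content = 'Hello 123456 Wrod_This is a Regex Dome'
result = re.match(r'Hello.*(\d+).*', content)
print(result.group(1))
# Result: 6  (the first .* swallows "12345" and leaves \d+ only the last digit)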

#4. Non-greedy mode: .*?
content = 'Hello 123456 Wrod_This is a Regex Dome'
result = re.match(r'^He.*?(\d+).*Dome', content)
print(result.group(1))
# .*? matches as little as possible, so it stops as soon as the first digit appears
# Result: 123456
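
To see the difference directly, the same string can be run through a greedy and a non-greedy pattern side by side. A short sketch; the variable names are illustrative.

```python
import re

content = 'Hello 123456 Wrod_This is a Regex Dome'

# Greedy .* backs off only as far as it must: \d+ gets a single digit.
greedy = re.match(r'Hello.*(\d+)', content).group(1)
print(greedy)  # 6

# Non-greedy .*? stops at the first digit, so \d+ gets the whole number.
lazy = re.match(r'Hello.*?(\d+)', content).group(1)
print(lazy)  # 123456

# A non-greedy quantifier at the end of a pattern matches as little as it can.
tail = re.match(r'Hello.*?(\d+?)', content).group(1)
print(tail)  # 1
```

The last case is a common pitfall: a non-greedy group with nothing after it captures only the minimum.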

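#5. Flags (modifiers)
"""
re.I  case-insensitive matching
re.M  multi-line mode: ^ and $ also match at the start and end of each line
re.S  makes . match newline characters as well
"""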
content = """Hello 
            123456 Wrod_This 
            is a Regex Dome"""
result = re.match(r'^He.*?(\d+).*Dome', content, re.S)
print(result)
# Without re.S: None
# With re.S:    <re.Match object; span=(0, 64), match='Hello \n            123456 Wrod_This \n          >
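
The example above only exercises re.S. The other two flags can be sketched as follows; the three-line sample text is made up for illustration.

```python
import re

# Hypothetical three-line text, not from the article.
text = "first line\nSecond LINE\nthird line"

# re.I ignores case.
any_case = re.findall(r'line', text, re.I)
print(any_case)  # ['line', 'LINE', 'line']

# Without re.M, ^ matches only at the very start of the string.
start_only = re.findall(r'^\w+', text)
print(start_only)  # ['first']

# With re.M, ^ also matches right after every newline.
each_line = re.findall(r'^\w+', text, re.M)
print(each_line)  # ['first', 'Second', 'third']

# Flags are combined with the | operator.
combined = re.findall(r'^s\w+', text, re.I | re.M)
print(combined)  # ['Second']
```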

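#6. Escaping with \
# The sample text reads "(Baidu) what is the URL? www.baidu.com"
content = '(百度)网址多少？www.baidu.com'
# ( ) and . are metacharacters and must be escaped; the full-width ？ is an ordinary character and needs no backslash
result = re.match(r'\(百度\)网址多少？www\.(\w+)\.com', content)
print(result.group(1))
# Result: baidu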
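
When the literal text comes from a variable, escaping every metacharacter by hand is error-prone. The standard library's re.escape() does it automatically. A brief sketch, with illustrative test strings:

```python
import re

literal = 'www.baidu.com'

# re.escape() puts a backslash in front of every metacharacter in the text.
pattern = re.escape(literal)

hit = re.search(pattern, 'visit www.baidu.com today') is not None
print(hit)  # True

# With the escaped pattern, "." no longer matches an arbitrary character.
escaped_miss = re.search(pattern, 'wwwxbaiduxcom') is None
print(escaped_miss)  # True

# The unescaped text is itself a pattern in which "." matches anything.
unescaped_hit = re.search(literal, 'wwwxbaiduxcom') is not None
print(unescaped_hit)  # True
```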

2.search(pattern, string, flags=0)

Scans the whole string and returns the first match found anywhere in it.

import re
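with open('re_test.html', 'r', encoding='utf-8') as f:
    html = f.read()
result = re.search(r'<a href="/3\.mp3" singer="齐秦">(.*?)</a>', html, re.S)
print('Match result: ' + result.group(1))
# Match result: 往事随风

result = re.search(r'<a.*singer=".*">(.*?)</a>', html, re.S)
print('Match result: ' + result.group(1))
# The match still starts at the first <a, but the greedy .* (which spans newlines
# under re.S) runs on to the last link, so group(1) holds the last title, not the first
# Match result: 但愿人长久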
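
Since re_test.html is not reproduced here, the behaviour can be checked on a small inline snippet. The HTML below is a hypothetical stand-in with made-up titles and singers, not the article's file.

```python
import re

# Hypothetical stand-in for re_test.html (assumed structure, invented values).
html = '''<ul>
<li><a href="/1.mp3" singer="A">Song One</a></li>
<li><a href="/2.mp3" singer="B">Song Two</a></li>
<li><a href="/3.mp3" singer="C">Song Three</a></li>
</ul>'''

# A fully specified pattern picks out one particular link.
specific = re.search(r'<a href="/2\.mp3" singer="B">(.*?)</a>', html, re.S)
print(specific.group(1))  # Song Two

# Greedy .* stretches to the last link, so the last title is captured.
greedy = re.search(r'<a.*singer=".*">(.*?)</a>', html, re.S)
print(greedy.group(1))  # Song Three

# Non-greedy .*? stops at the first link.
lazy = re.search(r'<a.*?singer=".*?">(.*?)</a>', html, re.S)
print(lazy.group(1))  # Song One
```

To get the first link with a loose pattern, use .*? throughout, as in the last call.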

3.findall(pattern, string, flags=0)

Scans the whole string and returns all non-overlapping matches as a list.

import re
with open('re_test.html', 'r', encoding='utf-8') as f:
    html = f.read()
result = re.findall('<a href=".*?" singer=".*?">(.*?)</a>', html, re.S)
print(result)
# Result: ['沧海一声笑', '往事随风', '光辉岁月', '记事本', '但愿人长久']
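
With one capture group, findall() returns a list of strings, as above. With several groups it returns a list of tuples, one tuple per match. A sketch on a hypothetical two-link snippet (invented values, not taken from re_test.html):

```python
import re

# Hypothetical markup in the same shape as the article's links.
html = '''<a href="/1.mp3" singer="A">Song One</a>
<a href="/2.mp3" singer="B">Song Two</a>'''

# Three groups: each match becomes a (href, singer, title) tuple.
rows = re.findall(r'<a href="(.*?)" singer="(.*?)">(.*?)</a>', html, re.S)
print(rows)
# [('/1.mp3', 'A', 'Song One'), ('/2.mp3', 'B', 'Song Two')]

# Tuples unpack naturally in a loop.
titles_by_singer = {singer: title for href, singer, title in rows}
print(titles_by_singer)  # {'A': 'Song One', 'B': 'Song Two'}
```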

4.sub(pattern, repl, string, count=0, flags=0)

Finds every match of the pattern and replaces it with repl; count=0 means replace all occurrences.

import re
content = 'a2s31d2d31asdas52d1sa5d32dsa153ads1'
result = re.sub(r'\d', '', content)  # delete every digit
print(result)
# Result: asddasdasdsaddsaads
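
Beyond deleting matches, sub() accepts a count limit, backreferences in the replacement, and a function as repl. The following sketch uses a short made-up string so the results are easy to check by eye.

```python
import re

# Hypothetical short input, not from the article.
text = 'a1b22c333'

# Replace every run of digits.
all_runs = re.sub(r'\d+', '#', text)
print(all_runs)  # a#b#c#

# count=1 replaces only the first match.
first_only = re.sub(r'\d+', '#', text, count=1)
print(first_only)  # a#b22c333

# \1 in the replacement refers to what group 1 captured.
wrapped = re.sub(r'(\d+)', r'<\1>', text)
print(wrapped)  # a<1>b<22>c<333>

# repl may be a function that receives the match object.
doubled = re.sub(r'\d+', lambda m: str(int(m.group()) * 2), text)
print(doubled)  # a2b44c666

# subn() also reports how many replacements were made.
stripped, n = re.subn(r'\d', '', text)
print(stripped, n)  # abc 6
```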

5.compile(pattern, flags=0)

Compiles a pattern string into a regular expression object that can be reused across many calls.

import re
A='TEL:15304443352'
B='TEL:15304443355'
C='TEL:15304443356'
r = re.compile(r'TEL:(\d+)')
# The compiled object has its own search() method, so it can be reused directly
result1 = r.search(A)
result2 = r.search(B)
result3 = r.search(C)
print(result1.group(1))
print(result2.group(1))
print(result3.group(1))

# Output:
# 15304443352
# 15304443355
# 15304443356
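
A compiled pattern is most useful inside a loop, and flags can be fixed once at compile time. A small sketch under the assumption that some input lines contain no number at all; the list contents and names are illustrative.

```python
import re

tel = re.compile(r'TEL:(\d+)')

# Hypothetical input lines; the last one deliberately has no number.
lines = ['TEL:15304443352', 'TEL:15304443355', 'no number here']

numbers = []
for line in lines:
    m = tel.search(line)
    if m:  # skip lines where the pattern does not occur
        numbers.append(m.group(1))
print(numbers)  # ['15304443352', '15304443355']

# Flags passed to compile() apply to every later call on the object.
tel_ci = re.compile(r'tel:(\d+)', re.I)
upper = tel_ci.search('TEL:15304443356').group(1)
print(upper)  # 15304443356
```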

Python Web Scraping (Part 3) | Regular Expressions