Notes from a beginner learning Python web crawlers (2)
Regular expressions
Special characters
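^h means the string starts with h, . matches any single character (except a newline), and * repeats the preceding item any number of times, including zero
import re
line = 'hello 123'
# ^h anchors an h at the start, . matches any character, * repeats it any number of times
re_str = '^h.*'
if re.match(re_str, line):
    print('Match succeeded')  # Output: Match succeeded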
$ anchors the end of the string
import re
line = 'hello 123'
re_str = '.*3$'  # any number of any characters may come first, but the string must end with 3
if re.match(re_str, line):
    print('Match succeeded')  # Output: Match succeeded
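The two anchors can also be used with re.search, which scans the whole string rather than only its start. The following is a minimal sketch; the variable names are chosen here for illustration only.

```python
import re

line = 'hello 123'

# re.match only tries the start of the string, so '^' is implied there.
# re.search looks everywhere, so the anchors do the positioning.
starts_with_h = re.search(r'^h', line) is not None
starts_with_e = re.search(r'^e', line) is not None
ends_with_3 = re.search(r'3$', line) is not None

print(starts_with_h, starts_with_e, ends_with_3)  # True False True
```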
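? after a quantifier switches it to non-greedy (lazy) mode
import re
line = 'heeeello123'
re_str = '.*?(h.*?l).*'  # capture only the substring inside ()
match_obj = re.match(re_str, line)
if match_obj:
    print(match_obj.group(1))  # Output: heeeel
    # Without the ? inside the group, i.e. (h.*l), the output is: heeeell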
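To see the two modes side by side, the sketch below runs both variants of the group on the same input. The names greedy and lazy are illustrative only.

```python
import re

line = 'heeeello123'

# Greedy: .* inside the group runs as far as it can, up to the last l.
greedy = re.match(r'.*?(h.*l).*', line).group(1)
# Lazy: .*? inside the group stops at the first l it can reach.
lazy = re.match(r'.*?(h.*?l).*', line).group(1)

print(greedy)  # heeeell
print(lazy)    # heeeel
```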
+ means the preceding item appears at least once
import re
line = 'heeeello123'
re_str = '.*(h.+?l).*'
match_obj = re.match(re_str, line)
if match_obj:
    print(match_obj.group(1))  # Output: heeeel
{2} means the preceding item appears exactly twice
import re
line = 'heeeello123'
re_str = '.*?(e.{2}?l).*'  # matches e + any 2 characters + l; the ? after {2} has no effect on a fixed count
match_obj = re.match(re_str, line)
if match_obj:
    print(match_obj.group(1))  # Output: eeel
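Braces also accept a range. As a rough sketch on the same input, {m} is an exact count, {m,n} a bounded range, and {m,} an open-ended minimum; the variable names are illustrative.

```python
import re

line = 'heeeello123'  # contains four consecutive e characters

exactly_two = re.search(r'e{2}', line).group()     # exactly 2
two_to_three = re.search(r'e{2,3}', line).group()  # 2 to 3, as many as possible
at_least_one = re.search(r'e{1,}', line).group()   # 1 or more, same as e+
five = re.search(r'e{5}', line)                    # no run of 5, so no match

print(exactly_two, two_to_three, at_least_one, five)  # ee eee eeee None
```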
| means "or"
import re
line = 'hello123'
re_str = '((hello|heeello)123)'
match_obj = re.match(re_str, line)
if match_obj:
    print(match_obj.group(1))  # Output: hello123
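The parentheses around the alternatives matter, because | has the lowest precedence and otherwise splits the whole pattern. The sketch below uses re.fullmatch to make the difference visible; the variable names are illustrative.

```python
import re

line = 'hello123'

# Without a group, the alternatives are 'hello' and 'heeello123'.
ungrouped = re.fullmatch(r'hello|heeello123', line)
# With a group, the choice is limited to the prefix and 123 always follows.
grouped = re.fullmatch(r'(hello|heeello)123', line)

print(ungrouped)         # None
print(grouped.group(1))  # hello
```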
[] gives the set of characters allowed at a single position
import re
line = 'hello123'
re_str = "([jhk]ello123)"  # [jhk] accepts any one of j, h, or k
match_obj = re.match(re_str, line)
if match_obj:
    print(match_obj.group(1))  # Output: hello123
[^] is a negated character set: it matches any single character that is not listed
import re
line = 'hello123'
re_str = "([^j]ello123)"  # [^j] accepts any character except j
match_obj = re.match(re_str, line)
if match_obj:
    print(match_obj.group(1))  # Output: hello123
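Character sets also accept ranges such as a-z or 0-9, and most special characters lose their special meaning inside the brackets. A short sketch, with illustrative variable names:

```python
import re

line = 'hello123'

letters = re.match(r'[a-z]+', line).group()      # a range of lowercase letters
digits = re.search(r'[0-9]+', line).group()      # a range of digits
non_digits = re.match(r'[^0-9]+', line).group()  # negated range
# Inside [], . and * are literal characters, not wildcards or quantifiers.
literal = re.findall(r'[.*]', 'a.b*c')

print(letters, digits, non_digits, literal)  # hello 123 hello ['.', '*']
```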
\s matches a whitespace character, \S matches a non-whitespace character
import re
line = 'hello123 好'  # the string contains a space
re_str = r"(hello123\s好)"  # \s matches the space
match_obj = re.match(re_str, line)
if match_obj:
    print(match_obj.group(1))  # Output: hello123 好
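Note that \s covers tabs and newlines as well as the plain space. The sketch below shows this together with \S; the variable names are illustrative.

```python
import re

line = 'hello123 好'

words = re.findall(r'\S+', line)              # runs of non-whitespace
space_index = re.search(r'\s', line).start()  # position of the first whitespace
kinds = re.findall(r'\s', 'a b\tc\nd')        # space, tab, and newline all match

print(words)        # ['hello123', '好']
print(space_index)  # 8
print(kinds)        # [' ', '\t', '\n']
```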
[\u4E00-\u9FA5] matches Chinese characters
import re
line = 'hello 北京大学'
re_str = ".*?([\u4E00-\u9FA5]+大学)"  # the lazy .*? keeps the prefix from consuming Chinese characters
match_obj = re.match(re_str, line)
if match_obj:
    print(match_obj.group(1))  # Output: 北京大学
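The lazy .*? in front of the group is what keeps the whole name intact. As a sketch, compare it with a greedy prefix, which takes as much as it can and leaves the group only the minimum it needs (variable names are illustrative):

```python
import re

line = 'hello 北京大学'

lazy_prefix = re.match(r'.*?([\u4E00-\u9FA5]+大学)', line).group(1)
greedy_prefix = re.match(r'.*([\u4E00-\u9FA5]+大学)', line).group(1)

print(lazy_prefix)    # 北京大学
print(greedy_prefix)  # 京大学
```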
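A small example: extracting a date of birth
import re
# Sample inputs in several date formats. Each assignment overwrites the previous one,
# so comment lines in or out to try them one at a time.
line = 'xxx出生于2000年6月1日'
line = 'xxx出生于2000/6/1'
line = 'xxx出生于2000-6-1'
line = 'xxx出生于2000-06-01'
line = 'xxx出生于2000-06'
# 出生于 means "born on". The day part (separator plus 1-2 digits) is optional.
re_str = r".*出生于(\d{4}[年/-]\d{1,2}([月/-]\d{1,2}|[月/-]$|$))"
match_obj = re.match(re_str, line)
if match_obj:
    print(match_obj.group(1))  # Output for the last sample: 2000-06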
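To check all of the sample formats in one run, the inputs can be put in a list and looped over. This is a sketch only: the helper extract_birth and the constant BIRTH_RE are hypothetical names, and the pattern used here is an assumed variant in which the day digits are matched together with their separator so that the full date is captured.

```python
import re

# Assumed pattern: year, separator, month, then an optional separator + day.
BIRTH_RE = re.compile(r".*出生于(\d{4}[年/-]\d{1,2}([月/-]\d{1,2}|[月/-]$|$))")

samples = [
    'xxx出生于2000年6月1日',
    'xxx出生于2000/6/1',
    'xxx出生于2000-6-1',
    'xxx出生于2000-06-01',
    'xxx出生于2000-06',
]


def extract_birth(line):
    """Return the captured date text, or None when the line does not match."""
    match_obj = BIRTH_RE.match(line)
    return match_obj.group(1) if match_obj else None


results = [extract_birth(s) for s in samples]
for sample, result in zip(samples, results):
    print(sample, '->', result)
```

The trailing 日 in the first sample is not part of the capture, since the group ends after the day digits.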
Main functions of the re library:
Function | Description |
---|---|
re.search() | Scans a string for the first location where the regular expression matches and returns a match object (None if there is no match) |
re.match() | Matches the regular expression at the beginning of a string and returns a match object (None if there is no match) |
re.findall() | Searches a string and returns all matching substrings as a list |
re.split() | Splits a string at each match of the regular expression and returns a list |
re.finditer() | Searches a string and returns an iterator in which each element is a match object |
re.sub() | Replaces every substring that matches the regular expression and returns the resulting string |
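As a quick illustration of the four list-style functions in the table, the sketch below applies them to one sample string. The text and the six-digit pattern are made up for this example.

```python
import re

text = 'BIT100081 TSU100084'  # made-up sample text
pattern = r'[1-9]\d{5}'       # six digits, the first one non-zero

found = re.findall(pattern, text)                        # all matches as strings
pieces = re.split(pattern, text)                         # text between the matches
starts = [m.start() for m in re.finditer(pattern, text)] # match objects, one by one
replaced = re.sub(pattern, ':zipcode', text)             # replace every match

print(found)     # ['100081', '100084']
print(pieces)    # ['BIT', ' TSU', '']
print(starts)    # [3, 13]
print(replaced)  # BIT:zipcode TSU:zipcode
```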
Ways to write a regular expression:
Raw string type:
Regular expressions for the re library are usually written as raw strings, in the form r'text'
For example: r'[1-9]\d{5}'
A raw string is a string literal in which backslashes are not treated as escape characters
The ordinary string type is more cumbersome, because every backslash has to be doubled.
For example: '[1-9]\\d{5}'; '\\d{3}-\\d{8}|\\d{4}-\\d{7}'
When a regular expression contains escape characters, writing it as a raw string is recommended.
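The two spellings produce the same pattern text, which a small check can confirm. A minimal sketch, with illustrative variable names and a made-up search string:

```python
import re

raw = r'[1-9]\d{5}'       # raw string: the backslash is kept as typed
escaped = '[1-9]\\d{5}'   # ordinary string: the backslash must be doubled

print(raw == escaped)         # True
print(len(r'\n'), len('\n'))  # 2 1  (backslash + n versus a single newline)
print(re.search(raw, 'BIT 100081').group())  # 100081
```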
re.match()
import re
ss = 'I love you, do you?'
res = re.match(r'((\w)+(\W))+', ss)
print(res.group())  # Output: I love you,
re.search()
import re
ss = 'I love you, do you?'
res = re.search(r'(\w+)(,)', ss)
# print(res)
print(res.group(0))  # Output: you,
print(res.group(1))  # Output: you
print(res.group(2))  # Output: ,
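The practical difference between the two functions shows up when the pattern does not occur at the start of the string. A short sketch on the same sentence; both functions return None on failure, so the result should be checked before calling group().

```python
import re

ss = 'I love you, do you?'

at_start = re.match(r'you', ss)   # only tries position 0
anywhere = re.search(r'you', ss)  # scans forward to the first occurrence

print(at_start)         # None
print(anywhere.span())  # (7, 10)
```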
The other functions follow the same calling pattern, so they are not walked through one by one here.
Simple applications of regular expressions
1. Extracting the time value from a ping reply
import re
ping_ss = 'Reply from 220.181.57.216:bytes=32 time=3ms TTL=47'
res = re.search(r'(time=)(\d+\w+)+(.)+TTL', ping_ss)
print(res.group(2))  # Output: 3ms
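If the number and the unit are needed separately, named groups make the result easier to read. This is one possible sketch, not the only way to do it; the names PING_TIME_RE, value, and unit are chosen here for illustration.

```python
import re

ping_ss = 'Reply from 220.181.57.216:bytes=32 time=3ms TTL=47'

# (?P<name>...) gives a group a name that can be used instead of its number.
PING_TIME_RE = re.compile(r'time=(?P<value>\d+)(?P<unit>[a-z]+)')

m = PING_TIME_RE.search(ping_ss)
if m:
    latency = int(m.group('value'))
    unit = m.group('unit')
else:
    latency, unit = None, None

print(latency, unit)  # 3 ms
```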
2. Parsing web pages
import re
import requests
r = requests.get('https://www.baidu.com').content.decode('utf-8')
print(r)
# <, > and / need no escaping; in an ordinary string the backslashes of \S and \s are doubled
pt = re.compile('(<title>)([\\S\\s]+)(</title>)')
print(pt.search(r).group(2))  # Output: 百度一下,你就知道
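In addition, you can put r before the string to make it a raw string. This does not make matching faster, but the backslashes no longer need to be doubled, so the pattern is easier to read
pt = re.compile(r'(<title>)([\S\s]+)(</title>)')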
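The same idea can be tried without a network request by running the pattern on a literal HTML string. The sketch below makes a few assumptions: the HTML is invented for the example, extract_title is a hypothetical helper, and the pattern uses a lazy quantifier plus the re.IGNORECASE and re.DOTALL flags so that it stops at the first closing tag and tolerates uppercase tags and line breaks.

```python
import re

# Invented sample page; note the uppercase tag and the second title in the body.
html = ('<html><head><TITLE>Example page</TITLE></head>'
        '<body><title>not this one</title></body></html>')

TITLE_RE = re.compile(r'<title>(.*?)</title>', re.IGNORECASE | re.DOTALL)


def extract_title(page):
    """Return the text of the first title element, or None if there is none."""
    m = TITLE_RE.search(page)
    return m.group(1).strip() if m else None


title = extract_title(html)
print(title)  # Example page
```

Regular expressions are fine for a quick extraction like this, but they do not understand nesting, so a real HTML parser is usually a safer choice for anything more involved.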