Notes from a beginner learning Python web scraping (Part 2)

Regular expressions

Special characters

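^h matches a string that starts with h, . matches any single character, and * repeats the preceding item any number of times (including zero)

import re
line = 'hello 123'
# ^h requires a leading h, . matches any character, * repeats it any number of times
re_str = '^h.*'
if re.match(re_str, line):
    print('match succeeded')  # prints: match succeeded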

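$ anchors the match to the end of the string

import re
line = 'hello 123'
re_str = '.*3$'  # any characters may come first, but the string must end with 3
if re.match(re_str, line):
    print('match succeeded')  # prints: match succeeded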

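? placed after a quantifier makes it non-greedy (it matches as little as possible)

import re
line = 'heeeello123'
re_str = '.*?(h.*?l).*'  # we only want the substring captured by ()
match_obj = re.match(re_str, line)
if match_obj:
    print(match_obj.group(1))  # prints: heeeel
                               # without the ? inside the group, (h.*l), it prints: heeeell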
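To see the difference side by side, here is a minimal sketch that runs a greedy and a non-greedy version of the same pattern against the same string. The variable names are only for illustration.

```python
import re

line = 'heeeello123'

# Greedy: .* takes as much as it can, so the group ends at the LAST 'l'.
greedy = re.match(r'(h.*l)', line).group(1)

# Non-greedy: .*? takes as little as it can, so the group ends at the FIRST 'l'.
lazy = re.match(r'(h.*?l)', line).group(1)

print(greedy)  # heeeell
print(lazy)    # heeeel
```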

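+ means the preceding item appears at least once

import re
line = 'heeeello123'
re_str = '.*(h.+?l).*'  # h, then one or more characters (as few as possible), then l
match_obj = re.match(re_str, line)
if match_obj:
    print(match_obj.group(1))  # prints: heeeel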

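{2} means the preceding item appears exactly 2 times

import re
line = 'heeeello123'
# matches e + any 2 characters + l; the ? after {2} has no effect because the count is fixed
re_str = '.*?(e.{2}?l).*'
match_obj = re.match(re_str, line)
if match_obj:
    print(match_obj.group(1))  # prints: eeel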
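The braces also accept a range. The following sketch, using the same sample string, compares an exact count with the open-ended and bounded forms; the variable names are made up for this example.

```python
import re

line = 'heeeello123'

exact = re.search(r'e{2}', line).group()      # exactly two 'e'
at_least = re.search(r'e{2,}', line).group()  # two or more 'e', as many as possible
ranged = re.search(r'e{1,3}', line).group()   # between one and three 'e'

print(exact)     # ee
print(at_least)  # eeee
print(ranged)    # eee
```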

| means "or"

import re
line = 'hello123'
re_str = '((hello|heeello)123)'  # either hello or heeello, followed by 123
match_obj = re.match(re_str, line)
if match_obj:
    print(match_obj.group(1))  # prints: hello123

[] gives the set of allowed values for a single character

import re
line = 'hello123'
re_str = "([jhk]ello123)"  # [jhk] accepts any one of j, h, or k
match_obj = re.match(re_str, line)
if match_obj:
    print(match_obj.group(1))  # prints: hello123

[^] is a negated character set

import re
line = 'hello123'
re_str = "([^j]ello123)"  # [^j] accepts any character except j
match_obj = re.match(re_str, line)
if match_obj:
    print(match_obj.group(1))  # prints: hello123

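\s matches a whitespace character (space, tab, newline, and so on); \S matches a non-whitespace character

import re
line = 'hello123 好'  # the string contains a space
re_str = r"(hello123\s好)"  # \s matches the space; the raw string keeps the backslash intact
match_obj = re.match(re_str, line)
if match_obj:
    print(match_obj.group(1))  # prints: hello123 好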
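Note that \s is not limited to the space character. A small sketch, with sample strings invented for this illustration, shows it also accepting a tab and a newline:

```python
import re

samples = ['hello 123', 'hello\t123', 'hello\n123', 'hello123']

# True where exactly one whitespace character sits between 'hello' and '123'.
results = [bool(re.match(r'hello\s123', s)) for s in samples]

print(results)  # [True, True, True, False]
```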

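[\u4E00-\u9FA5] matches Chinese characters

import re
line = 'hello 北京大学'  # 北京大学 is "Peking University"
re_str = ".*?([\u4E00-\u9FA5]+大学)"  # one or more Chinese characters followed by 大学 ("University")
match_obj = re.match(re_str, line)
if match_obj:
    print(match_obj.group(1))  # prints: 北京大学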

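A small example: extracting a date of birth

import re
# 出生于 means "was born on"; the date appears in several formats
lines = [
    'xxx出生于2000年6月1日',
    'xxx出生于2000/6/1',
    'xxx出生于2000-6-1',
    'xxx出生于2000-06-01',
    'xxx出生于2000-06',
]
# year, separator, month, then either separator + day, a trailing separator, or end of string
re_str = r".*出生于(\d{4}[年/-]\d{1,2}([月/-]\d{1,2}|[月/-]$|$))"
for line in lines:
    match_obj = re.match(re_str, line)
    if match_obj:
        print(match_obj.group(1))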
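If the individual year, month, and day are needed rather than the raw substring, named groups are one option. The sketch below is only an illustration: the names `BIRTH_RE` and `parse_birth` are hypothetical, and the optional day group is an assumption about how a missing day should be handled.

```python
import re

# Hypothetical helper pattern: year and month are required, the day is optional.
BIRTH_RE = re.compile(
    r'出生于(?P<year>\d{4})[年/-](?P<month>\d{1,2})(?:[月/-](?P<day>\d{1,2}))?'
)

def parse_birth(line):
    """Return (year, month, day) as ints, with day=None if absent, or None if no date is found."""
    m = BIRTH_RE.search(line)
    if m is None:
        return None
    day = m.group('day')
    return (int(m.group('year')), int(m.group('month')), int(day) if day else None)

print(parse_birth('xxx出生于2000年6月1日'))  # (2000, 6, 1)
print(parse_birth('xxx出生于2000-06'))       # (2000, 6, None)
```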

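Main functions of the re library:

Function	Description
re.search()	Scans a string for the first position where the regular expression matches; returns a match object (or None)
re.match()	Matches the regular expression only at the start of the string; returns a match object (or None)
re.findall()	Searches the string and returns all matching substrings as a list
re.split()	Splits the string at every match of the regular expression; returns a list
re.finditer()	Searches the string and returns an iterator whose items are match objects
re.sub()	Replaces every substring that matches the regular expression; returns the resulting string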
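As a quick illustration of the four functions that do not return a single match object, here is a minimal sketch. The sample text and the replacement string are invented for this example; the pattern is a six-digit number that does not start with 0.

```python
import re

text = 'BIT100081 TSU100084'  # made-up sample containing two six-digit codes
pattern = r'[1-9]\d{5}'

codes = re.findall(pattern, text)                       # all matching substrings
parts = re.split(pattern, text)                         # the text between the matches
spans = [m.span() for m in re.finditer(pattern, text)]  # (start, end) of each match
masked = re.sub(pattern, ':zipcode', text)              # every match replaced

print(codes)   # ['100081', '100084']
print(parts)   # ['BIT', ' TSU', '']
print(spans)   # [(3, 9), (13, 19)]
print(masked)  # BIT:zipcode TSU:zipcode
```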

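Ways to write a regular expression:

raw string type:

Regular expressions for the re library are usually written as raw strings, in the form r'text'
For example: r'[1-9]\d{5}'
A raw string is a string literal in which backslashes are not treated as escape characters

string type, which is more cumbersome:

For example: '[1-9]\\d{5}'; '\\d{3}-\\d{8}|\\d{4}-\\d{7}'

When a regular expression contains backslashes, prefer the raw string type.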
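The two notations describe the same pattern; only the way the backslash is typed differs. A short sketch to check this (the test string 'zip 100081' is made up):

```python
import re

raw_pattern = r'[1-9]\d{5}'     # raw string: the backslash is kept as typed
plain_pattern = '[1-9]\\d{5}'   # ordinary string: the backslash must be doubled

same = raw_pattern == plain_pattern
print(same)  # True

# In an ordinary string '\n' is one newline character; in a raw string it is two characters.
lengths = (len('\n'), len(r'\n'))
print(lengths)  # (1, 2)

found = re.search(raw_pattern, 'zip 100081').group()
print(found)  # 100081
```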

re.match()

import re
ss = 'I love you, do you?'

# each repetition of the outer group consumes one word plus one non-word character
res = re.match(r'((\w)+(\W))+', ss)
print(res.group())

I love you,

re.search()

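import re
ss = 'I love you, do you?'
# re.search scans forward and stops at the first word that is followed by a comma
res = re.search(r'(\w+)(,)', ss)
print(res.group(0))  # the whole match
print(res.group(1))  # first group: the word
print(res.group(2))  # second group: the comma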

you,
you
,
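The practical difference between re.match() and re.search() is where they are allowed to start. The sketch below uses the same sentence; re.fullmatch() is included for comparison, and the variable names are illustrative.

```python
import re

ss = 'I love you, do you?'

at_start = re.match(r'love', ss)        # None: 'love' is not at position 0
anywhere = re.search(r'love', ss)       # found: search scans the whole string
partial = re.fullmatch(r'I love', ss)   # None: the pattern must cover the entire string
whole = re.fullmatch(r'I love.*', ss)   # found: .* consumes the rest

print(at_start)         # None
print(anywhere.span())  # (2, 6)
print(partial)          # None
print(whole.group())    # I love you, do you?
```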

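The other functions are not illustrated here for now.

Simple applications of regular expressions

1. Extracting the time value from ping output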

import re
ping_ss = 'Reply from 220.181.57.216:bytes=32 time=3ms TTL=47'
res = re.search(r'(time=)(\d+\w+)+(.)+TTL',ping_ss)
print(res.group(2))

3ms
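A slightly tighter variant is sketched below, assuming the reply line has the same shape as the sample. The names `PING_TIME_RE` and `ping_time_ms` are hypothetical; the pattern also accepts `time<1ms`, which Windows ping prints for very short round trips.

```python
import re

# Hypothetical helper: capture only the digits, so the result can be used as a number.
PING_TIME_RE = re.compile(r'time[=<](?P<ms>\d+)ms')

def ping_time_ms(reply):
    """Return the round-trip time in milliseconds, or None if the line has no time field."""
    m = PING_TIME_RE.search(reply)
    return int(m.group('ms')) if m else None

print(ping_time_ms('Reply from 220.181.57.216:bytes=32 time=3ms TTL=47'))  # 3
print(ping_time_ms('Request timed out.'))                                   # None
```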

2. Parsing a web page

import re
import requests
r = requests.get('https://www.baidu.com').content.decode('utf-8')
print(r)
# raw string, no escaping needed for < > /; +? stops at the first closing tag
pt = re.compile(r'(<title>)([\S\s]+?)(</title>)')
print(pt.search(r).group(2))

百度一下，你就知道

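Also, prefix the pattern with r to make it a raw string. This does not make matching faster; it stops Python from interpreting backslashes in the pattern before the re module sees them.

pt = re.compile(r'(<title>)([\S\s]+)(</title>)')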
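The title extraction can also be tried without a network request. The sketch below runs against a made-up HTML string; `TITLE_RE`, `extract_title`, and `sample_html` are hypothetical names, and the flags are an assumption to cope with upper-case tags and titles that span lines. For real pages an HTML parser is usually more robust than a regular expression.

```python
import re

# Compile once and reuse; IGNORECASE handles <TITLE>, DOTALL lets . match newlines.
TITLE_RE = re.compile(r'<title>(.*?)</title>', re.IGNORECASE | re.DOTALL)

def extract_title(html):
    """Return the text of the first <title> element, stripped, or None if there is none."""
    m = TITLE_RE.search(html)
    return m.group(1).strip() if m else None

sample_html = (
    '<html><head><TITLE>\n Example page </TITLE></head>'
    '<body><title>second</title></body></html>'
)

print(extract_title(sample_html))       # Example page
print(extract_title('<p>no title</p>'))  # None
```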
