Use cases for the re module

Foreword:
The re module offers many functions, but three of them, re.sub, re.findall, and re.match, come up constantly in crawlers, where they are used for data cleaning and extraction. Be sure to master these three.

The small examples below practice re.sub, re.findall, and re.match and show the output each one produces.

  1. findall: extraction. The core questions are what to extract (the regular expression you define) and which string to extract it from.
  2. sub: replacement. The core questions are what to replace, what to replace it with, and in which string.
  3. match: matching. The core questions are what pattern to match and which string to match it against. Note that re.match only matches at the beginning of the string.
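The three roles listed above can be sketched on a single short string. This is only an illustration: the sample text and the variable names text, numbers, and masked are made up for this sketch and do not come from the examples that follow.

```python
import re

# Hypothetical sample text, made up for this illustration
text = "tel: 400-660-0108, fax: 400-660-0109"

# findall: extract every substring that matches the pattern
numbers = re.findall(r"\d{3}-\d{3}-\d{4}", text)
print(numbers)  # ['400-660-0108', '400-660-0109']

# sub: replace every match (here, each run of four digits) with a replacement string
masked = re.sub(r"\d{4}", "****", text)
print(masked)  # tel: 400-660-****, fax: 400-660-****

# match: test the pattern at the beginning of the string only
print(re.match(r"tel", text))  # a Match object, because the text starts with "tel"
print(re.match(r"fax", text))  # None, because "fax" is not at the start
```

findall returns a list, sub returns a new string, and match returns either a Match object or None, so the return type is a quick way to tell which of the three fits a task.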
# Read the sample page index.html into a string
with open('index.html', 'r', encoding='utf-8') as f:
    html = f.read()
print(html)

The printed content of html is as follows:

<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <title>Title</title>
</head>
<body>
    <footer>
        <div>
            <div class="email">
                Email:kefu@CSDN.net
            </div>
            <div class="tel">
                手机号:400-660-0108
            </div>
        </div>
    </footer>
</body>
</html>
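# Define a regular expression that extracts the email
# Import the re module first
import re
pattern_1 = '<div class="email">(.*?)</div>'  # capture the content of the div with class="email"
ret_1 = re.findall(pattern_1, html)  # search html with the regular expression
print(ret_1)  # the result is an empty list, because . matches any character except a newline
# So remove the \n characters first, using re.sub()
html_s = re.sub('\n', '', html)
print(html_s)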
<!DOCTYPE html><html lang="en"><head>    <meta charset="UTF-8">    <title>Title</title></head><body>    <footer>        <div>            <div class="email">                Email:kefu@CSDN.net            </div>            <div class="tel">                手机号:400-660-0108            </div>        </div>    </footer></body></html>
# With the \n characters removed, run the extraction (findall) again
ret_2 = re.findall(pattern_1, html_s)
print(ret_2)
['                Email:kefu@CSDN.net            ']
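# The result above is a list whose single item has leading and trailing whitespace; take index 0 and call strip() to remove the whitespace at both ends
print(ret_2[0].strip())  # prints the extracted email address, which is the data we wanted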
Email:kefu@CSDN.net
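As an alternative to removing the newlines with re.sub, the re.S flag (an alias of re.DOTALL) makes . match newlines as well, so findall can run on the unmodified text. A minimal sketch, assuming a shortened copy of the sample page held in a hypothetical variable sample_html:

```python
import re

# Shortened, hypothetical copy of the sample page (an assumption for this sketch)
sample_html = """<div>
    <div class="email">
        Email:kefu@CSDN.net
    </div>
</div>"""

pattern = r'<div class="email">(.*?)</div>'

# Without a flag, . stops at newlines, so nothing is found
print(re.findall(pattern, sample_html))  # []

# With re.S (alias re.DOTALL), . also matches \n
found = re.findall(pattern, sample_html, flags=re.S)
email = found[0].strip()
print(email)  # Email:kefu@CSDN.net
```

This keeps the original line structure of the page intact, which can help when later patterns depend on line breaks.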
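# Define a regular expression that matches a password
# Note: the r prefix makes this a raw string so backslashes are not treated as escapes; ^ and $ anchor the pattern to the start and end of the string
password_pattern = r'^[a-zA-Z0-9_]{5,15}$'  # letters, digits, or underscores only, 5 to 15 characters long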
pass1='1234567'
pass2='k123456'
pass3='k123'
print(re.match(password_pattern,pass1))
print(re.match(password_pattern,pass2))
print(re.match(password_pattern,pass3))
<re.Match object; span=(0, 7), match='1234567'>
<re.Match object; span=(0, 7), match='k123456'>
None
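One detail of the pattern above is worth knowing: $ also matches just before a trailing newline, so re.match accepts a password that ends in '\n'. If that matters, re.fullmatch is a stricter option. A small sketch; the helper is_valid_password and the constant PASSWORD_PATTERN are hypothetical names, not part of the original example:

```python
import re

# Same character class and length range as above; anchors are not needed with fullmatch
PASSWORD_PATTERN = r'[a-zA-Z0-9_]{5,15}'

def is_valid_password(candidate):
    """Hypothetical helper: True only if the whole string fits the pattern."""
    return re.fullmatch(PASSWORD_PATTERN, candidate) is not None

print(is_valid_password('k123456'))    # True
print(is_valid_password('k123'))       # False, too short
print(is_valid_password('k123456\n'))  # False, the trailing newline is rejected

# For comparison, re.match with ^...$ accepts the trailing newline
accepts_newline = re.match(r'^[a-zA-Z0-9_]{5,15}$', 'k123456\n') is not None
print(accepts_newline)  # True
```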

Takeaway:
re.sub, re.findall, and re.match are used constantly in data cleaning, extraction, and crawlers, so be sure to master all three.

Origin blog.csdn.net/weixin_42961082/article/details/109786080