Python Web Scraping: Using BeautifulSoup (C)

Scraping Douban movie information with BeautifulSoup

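1. Install the BeautifulSoup library

Method 1: run pip install bs4 in cmd (bs4 is a thin wrapper package that pulls in beautifulsoup4)
Method 2: add bs4 in PyCharm's Settings
The code below also uses the requests library and the lxml parser, so install them the same way if they are missing.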

2. Import the required libraries

# Import the required libraries
from bs4 import BeautifulSoup
import requests
import re

3. Working with the web page

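① The five-step formula: URL + headers + the text of requests.get + BeautifulSoup(html, 'lxml') + soup.select("xxx")
That is: url +
     headers +
     html = requests.get(url, headers=headers).text +
     soup = BeautifulSoup(html, 'lxml') +
     allList = soup.select("xxx")
② The select method of a BeautifulSoup object takes CSS selectors:
     a class is written with . (a dot);
     an id is written with #;
     a tag that has no class or id can be selected directly by its tag name, e.g. a, li, or span.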
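
The three selector forms above can be tried offline before touching a real site. The sketch below is only an illustration: the HTML fragment is made up, and it uses Python's built-in html.parser instead of lxml so that it runs without an extra install.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML fragment, invented for this illustration only
html = """
<div id="content">
  <ol class="grid_view">
    <li><a href="/movie/1"><span class="title">Movie A</span></a></li>
    <li><a href="/movie/2"><span class="title">Movie B</span></a></li>
  </ol>
</div>
"""

# html.parser ships with Python; the article's own code uses 'lxml'
soup = BeautifulSoup(html, "html.parser")

by_class = soup.select(".grid_view li")  # class selector: leading dot
by_id = soup.select("#content")          # id selector: leading hash
by_tag = soup.select("a")                # plain tag name

print(len(by_class), len(by_id), len(by_tag))  # 2 1 2
print([a["href"] for a in by_tag])             # ['/movie/1', '/movie/2']
```

Note that select always returns a list, even when only one tag matches, as with "#content" here.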
The complete code is:

from bs4 import BeautifulSoup
import requests
import re

url = "https://movie.douban.com/top250"
# A browser-like User-Agent; some sites reject requests that lack one
header = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36"}
html = requests.get(url, headers=header).text
allMovieList = []  # collects one [name, score, people] list per movie
# Build a BeautifulSoup object to parse the HTML
movies_soup = BeautifulSoup(html, 'lxml')
# select() takes a CSS selector and returns a list of matching tags
allList = movies_soup.select(".grid_view li")
for one in allList:
    try:
        name = one.select(".item .info .hd a span")[0].string
        score = one.select(".item .info .bd .star span")[1].string
        people = one.select(".item .info .bd .star span")[3].string
        people = re.sub(r"\D", "", people)  # strip every non-digit character
        oneList = [name, score, people]
        allMovieList.append(oneList)
    except (IndexError, TypeError):
        # A selector matched too few tags, or .string returned None
        print("error")
print(allMovieList)

Running it prints allMovieList: a list of [name, score, people] entries, one for each movie parsed from the page.

4. Details of the code above

① What [0].string, [1].string, and [3].string do for name, score, and people

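(1) First, the indexes [0], [1], and [3]: picking out one exact tag
select returns a list of every tag that matches the selector. For name, the selector ends at the span tags inside the title link, and there are three of them. Without [0] the result would be all three spans, but we only need the content of the first one, so [0] picks exactly that span. The same reasoning gives [1] for score and [3] for people among the spans inside .star.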
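
To see the indexing in isolation, here is a small sketch. The fragment is hypothetical and only imitates the "three spans inside one link" situation described above; it is not copied from the real page, and html.parser stands in for lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: one link that contains three span tags
html = """
<div class="hd">
  <a href="#">
    <span class="title">Title One</span>
    <span class="title"> / Alternate Title</span>
    <span class="other"> / Other Name</span>
  </a>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

spans = soup.select(".hd a span")  # list of all three matching spans
print(len(spans))                  # 3
print(spans[0].string)             # Title One

# select_one returns only the first match, so no index is needed
first = soup.select_one(".hd a span")
print(first.string)                # Title One
```

When only the first match is wanted, select_one is an alternative to select(...)[0]; it returns None instead of raising IndexError when nothing matches.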
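(2) .string: drops the HTML markup and keeps only the text
Printing a tag directly shows the whole element, including its tag name and attributes; .string returns just the text inside it.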
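
The difference is easy to check on a tiny example. This is a sketch on a made-up fragment, again with html.parser; it also shows one caveat: .string is None when a tag has more than one child, in which case get_text() can be used instead.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment for illustration
html = '<p><span class="rating_num">9.7</span><span>Good <b>film</b></span></p>'
soup = BeautifulSoup(html, "html.parser")
spans = soup.select("span")

print(spans[0])             # <span class="rating_num">9.7</span>
print(spans[0].string)      # 9.7

# The second span has two children (a text node and a <b> tag),
# so .string cannot pick a single string and returns None
print(spans[1].string)      # None
print(spans[1].get_text())  # Good film
```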

② re.sub() performs substitution

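Without this call, people keeps the full text of the span, including the non-digit characters around the number.
With it:

# Remove non-digit characters; \D matches any character that is not a digit
people = re.sub(r"\D", "", people)

Only the digits remain.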
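
A standalone sketch of the same substitution follows. The input string is a hypothetical sample of what such a span might contain, not output captured from the site.

```python
import re

# Hypothetical sample text: a number followed by non-digit characters
raw = "123456人评价"

# The raw-string prefix r keeps Python from treating \D as a string escape
digits = re.sub(r"\D", "", raw)
print(digits)       # 123456

# The result is still a string; convert it if a number is needed
count = int(digits)
print(count + 1)    # 123457
```

Writing the pattern as r"\D" rather than "\D" avoids the invalid-escape warning that recent Python versions emit for the non-raw form.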

施施吖
