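Preface

Before scraping a website with a crawler, you need to understand its source code: request the page source first, then analyze it.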

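I. Analyzing the website

In the browser, press F12 (Fn+F12 on some laptops) to open the developer tools; alternatively, right-click the page and choose "Inspect". The developer tools have nine tabs: Elements, Console, Sources, Network, Performance, Memory, Application, Security, and Lighthouse.

Click the Elements tab to see the page's HTML; you can also edit the content there. The Elements tab is where you find the tag and attribute pattern that a regular expression for your target content has to match.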
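
As a rough illustration of turning what you see in the Elements tab into a pattern, the sketch below runs a regular expression over a hand-written HTML fragment. The fragment, the `BV...` ids, and the titles are made up for this example; the real page markup may differ and can change over time.

```python
import re

# Hypothetical fragment, shaped like what the Elements tab might show
# for two ranking entries. It is not copied from the real page.
sample_html = '''
<div class="info">
  <a class="title" href="//www.bilibili.com/video/BV0000000001" target="_blank">First video</a>
</div>
<div class="info">
  <a class="title" href="//www.bilibili.com/video/BV0000000002" target="_blank">Second video</a>
</div>
'''

# Capture the href value and the link text of every <a class="title"> tag.
pattern = re.compile(r'<a class="title" href="(.*?)" target="_blank">(.*?)</a>')

matches = pattern.findall(sample_html)
for href, title in matches:
    print(title, href)
```

A pattern like this depends on the exact attribute order and spacing, so it is worth re-checking against the Elements tab whenever the script stops returning results.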

II. Scraping the information

Take scraping the Bilibili (B站) ranking as an example:

1. Importing the libraries

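import requests
# If the import is greyed out, the package is not installed; run `pip install requests` in the Terminal tab
from bs4 import BeautifulSoup
# If the import is greyed out, the package is not installed; run `pip install beautifulsoup4` in the Terminal tab
import re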

2. Analyzing the page

This part sets the request URL and spoofs the User-Agent (UA).

url_Bzhan = 'https://www.bilibili.com/v/popular/rank/douga?spm_id_from=333.851.b_62696c695f7265706f72745f646f756761.39'  # page to scrape

headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/93.0.4577.63 Safari/537.36'}  # spoofed User-Agent
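
To see what the headers dictionary does without touching the network, here is a small sketch. It uses the standard library's `urllib.request.Request` instead of `requests`, a substitution made only for this illustration, and a shortened form of the URL above. Without an explicit header, Python HTTP clients announce themselves with a default User-Agent, which some sites may reject.

```python
from urllib.request import Request

# Shortened form of the ranking URL, used only for this illustration.
url = 'https://www.bilibili.com/v/popular/rank/douga'
headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
                  '(KHTML, like Gecko) Chrome/93.0.4577.63 Safari/537.36'
}

# Build the request object only; nothing is sent over the network here.
req = Request(url, headers=headers)

# urllib stores header names capitalized, so 'user-agent' becomes 'User-agent'.
ua = req.get_header('User-agent')
print(ua)
```

The same dictionary is what `requests.get(..., headers=headers)` sends along with the request in the code that follows.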

This part parses the <a> tags of the ranking list.

def Ranking():
    pa = requests.get(url=url_Bzhan, headers=headers)
    soup = BeautifulSoup(pa.text, 'html.parser')
    # print(soup.prettify())  # uncomment to inspect the fetched page source
    # Each ranking entry's title link is an <a class="title"> tag
    bs = soup.find_all('a', {'class': 'title'})
    # print(bs)
    for data in bs:
        print(data)

Ranking()

Each <a> tag holds both the video's URL and its title, and these are the two pieces of content we need to extract.

def Ranking():
    pa = requests.get(url=url_Bzhan, headers=headers)
    soup = BeautifulSoup(pa.text, 'html.parser')
    # print(soup.prettify())  # uncomment to inspect the fetched page source
    bs = soup.find_all('a', {'class': 'title'})
    for i, data in enumerate(bs, start=1):  # i is the rank number
        wenzi = data.get_text(strip=True)  # title text
        # re only matches strings, but data is a bs4 Tag, so convert it with str() first
        shuju = re.findall(r'<a class="title" href="(.*?)" target="_blank">', str(data))
        # rank + title + URL, separated by tabs
        all_data = str(i) + "\t" + wenzi + "\t" + ", ".join(shuju)
        print(all_data)

Ranking()
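
If the extracted links turn out to be scheme-relative (starting with `//`), which is an assumption about the page's markup rather than something verified here, they can be turned into full URLs with the standard library. The href values below are made up for illustration.

```python
from urllib.parse import urljoin

# Base page the links were found on (query string omitted for brevity).
base_url = 'https://www.bilibili.com/v/popular/rank/douga'

# Hypothetical href values of the kind the regex above may return.
hrefs = [
    '//www.bilibili.com/video/BV0000000001',
    '//www.bilibili.com/video/BV0000000002',
]

# urljoin fills in the missing scheme from the base URL.
full_urls = [urljoin(base_url, href) for href in hrefs]
for link in full_urls:
    print(link)
```

Links that are already absolute pass through `urljoin` unchanged, so the same call is safe either way.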

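After parsing, each output line holds the rank, the title, and the URL, separated by tabs.

The last step is to export the content to a text file.

Complete code: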

import requests
from bs4 import BeautifulSoup
import re

url_bzhan = 'https://www.bilibili.com/v/popular/rank/douga?spm_id_from=333.851.b_62696c695f7265706f72745f646f756761.39'

headers ={
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/93.0.4577.63 Safari/537.36'}

def Ranking():
    pa = requests.get(url=url_bzhan, headers=headers)
    soup = BeautifulSoup(pa.text, 'html.parser')
    bs = soup.find_all('a', {'class': 'title'})
    # Choose your own save location; the folder must already exist
    with open('D:/爬B站/B站.txt', 'w', encoding='utf-8') as f:
        for i, data in enumerate(bs, start=1):  # i is the rank number
            wenzi = data.get_text(strip=True)  # title text
            # re only matches strings, so convert the bs4 Tag with str() first
            shuju = re.findall(r'<a class="title" href="(.*?)" target="_blank">', str(data))
            # rank + title + URL, separated by tabs
            all_data = str(i) + "\t" + wenzi + "\t" + ", ".join(shuju)
            f.write(all_data + "\n")

Ranking()
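
One thing to watch with the export step: `open(..., 'w')` fails if the target folder does not exist. The sketch below shows one possible way to create the folder first and write tab-separated rows with the `csv` module. The rows, the folder name `bilibili`, and the file name `ranking.txt` are made up for the example, and a temporary directory stands in for the real save location.

```python
import csv
import tempfile
from pathlib import Path

# Made-up rows in the same order as the script: rank, title, URL.
rows = [
    (1, 'First video', 'https://www.bilibili.com/video/BV0000000001'),
    (2, 'Second video', 'https://www.bilibili.com/video/BV0000000002'),
]

# A temporary directory stands in for the real save location.
out_dir = Path(tempfile.mkdtemp()) / 'bilibili'
out_dir.mkdir(parents=True, exist_ok=True)  # create the folder if it is missing
out_file = out_dir / 'ranking.txt'

# newline='' lets the csv module control line endings itself.
with open(out_file, 'w', encoding='utf-8', newline='') as f:
    writer = csv.writer(f, delimiter='\t')
    writer.writerows(rows)

print(out_file.read_text(encoding='utf-8'))
```

Using `csv.writer` also takes care of quoting if a title happens to contain a tab or a newline, which plain string concatenation would not.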

Summary

Nothing much to summarize, so I'll just wish everyone good health.

Python Crawler Development Basics #2: Analyzing a Website, Using the Bilibili Ranking as an Example
