Crawler Project 4 [Scraping Douyu Live-Stream Data]

There is no need to parse the page source here; go straight to the endpoint that serves the data.

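Douyu's live-stream listing is a typical Ajax-driven page. Pages like this can be handled bluntly: open the browser's developer tools, filter the Network panel by XHR, and look for the data endpoint there.
Send the request with requests and parse the response with json().
Online JSON validator: https://www.bejson.com/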
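
As an alternative to the online validator, a response body can be checked locally with Python's standard json module. The sketch below is only illustrative: the sample string is made up, shaped like the fields the script in this post reads (data, rl, rid, rn, nn), and check_json is a hypothetical helper name.

```python
import json

# Hypothetical sample, shaped like the fields used later in this post (data -> rl -> rid/rn/nn).
sample_text = '{"data": {"rl": [{"rid": 1, "rn": "demo room", "nn": "demo streamer"}]}}'


def check_json(text):
    """Return (parsed, None) on success, or (None, error message) if text is not valid JSON."""
    try:
        return json.loads(text), None
    except json.JSONDecodeError as exc:
        return None, f"line {exc.lineno}, column {exc.colno}: {exc.msg}"


parsed, error = check_json(sample_text)
if error is None:
    # ensure_ascii=False keeps Chinese room titles readable when printed.
    print(json.dumps(parsed, indent=2, ensure_ascii=False))
else:
    print("invalid JSON:", error)
```

Pasting a copied response body into sample_text gives roughly the same check as the online tool, without leaving the terminal.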

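On the first page, nothing in the XHR list stands out, so keep looking.
On the second page, an XHR request named 2 appears. A bold guess is that the name is tied to the page number, so check one more page.
On the third page there is indeed a matching request as well. A request like this is clearly worth a closer look, so inspect its response.
Sure enough, the response is JSON. That makes things easy: build the request headers, fetch the JSON, and then clean up the data.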
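
If the guess about page numbers holds, the per-page URLs can be generated rather than copied one by one. This is a small sketch using the endpoint pattern from the script in this post; page_urls is a hypothetical helper name, and the pattern itself may change whenever the site is updated.

```python
# Endpoint pattern taken from the script in this post; the trailing {} is the page number.
BASE_URL = "https://www.douyu.com/gapi/rkc/directory/2_1/{}"


def page_urls(last_page):
    """Build the XHR URLs for pages 1..last_page (page numbers start at 1)."""
    return [BASE_URL.format(n) for n in range(1, last_page + 1)]


for url in page_urls(3):
    print(url)
```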
The code is as follows:

import requests

base_url = "https://www.douyu.com/gapi/rkc/directory/2_1/{}"

headers = {
    # ":authority", ":method" and ":scheme" are HTTP/2 pseudo-headers shown by DevTools, not regular request headers, so they are omitted here.
    "accept": "application/json, text/plain, */*",
    "accept-encoding": "gzip, deflate",  # "br" dropped: requests only decodes Brotli when a brotli package is installed
    "accept-language": "zh-CN,zh;q=0.9",
    "cookie": "dy_did=99d9bec8e3161267ca6f1b2700091501; acf_did=99d9bec8e3161267ca6f1b2700091501; smidV2=201910091127005e69063c81f439757b5c6853e98eb85600415c32cf59babd0; Hm_lvt_e99aee90ec1b2106afe7ec3b199020a7=1583281439; PHPSESSID=pifc2v49pv7eh3pfqh68vdmrp6; acf_auth=c805VIqQqC4NURXP%2BsXkVVLLs71Z3tGdFmlmwKvDfJddlPpBpHsZCb%2BAinbPuBGFqbJVR3zwn6rtV9neXmKxQjGRrSK212Jf4UlJNS5TrfPY6WwlpuI5I14; dy_auth=9679Wnn3NsJb2QR5Af1AKQpGbSYw6kgSwcujMSyG3AxQ3PSOPIINFiu%2FO7usyWfaQEGgY8xUgDHUVuTM0kSDrg4nj9Bg2Ib1AERZgYFzofeYDUjGrez85lo; wan_auth37wan=2d3ba7e8c7b7%2F2QURm%2FaQBYqJqHh6FwGQ26YRXP0y5n%2FjrR0gvtyc7%2FfBM%2FfhL%2F53HJ6mUBypKwmSw1Rk5ajw0Fx%2BpMyNOEG8bIiilruQGrYqED4kIA; acf_uid=329673281; acf_username=329673281; acf_nickname=%E7%94%A8%E6%88%B761411317; acf_own_room=0; acf_groupid=1; acf_phonestatus=1; acf_avatar=https%3A%2F%2Fapic.douyucdn.cn%2Fupload%2Favatar%2Fdefault%2F03_; acf_ct=0; acf_ltkid=69931249; acf_biz=1; acf_stk=391ed8ca5549845e; acf_ccn=b08c364a0d5c5aae33f1c5361ce1cfb6; Hm_lpvt_e99aee90ec1b2106afe7ec3b199020a7=1583281831",
    "referer": "https://www.douyu.com/g_LOL",
    "sec-fetch-dest": "empty",
    "sec-fetch-mode": "cors",
    "sec-fetch-site": "same-origin",
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.106 Safari/537.36",
    "x-requested-with": "XMLHttpRequest"
}

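page = 6  # number of pages to fetch; this changes over time, so set it to match the site
if __name__ == "__main__":
    for i in range(page):
        url = base_url.format(i + 1)  # page numbers start at 1
        response = requests.get(url, headers=headers, timeout=10)  # the response body is JSON
        datas = response.json()["data"]["rl"]
        for data in datas:  # simply print each room to the console
            room = data["rid"]  # room ID
            name = data["rn"]   # room title
            zhubo = data["nn"]  # streamer nickname
            print(room, name, zhubo)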
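
Because the page count is a moving target, one option is to keep requesting pages until a page comes back with no rooms. The sketch below illustrates that idea offline. It assumes, without having verified it against the live site, that the endpoint returns an empty rl list past the last page. extract_rooms, crawl, and fake_pages are hypothetical names, and the fetcher is passed in so the loop can be tried without touching the network.

```python
def extract_rooms(payload):
    """Pull (room ID, room title, streamer) tuples out of one decoded page."""
    rooms = (payload.get("data") or {}).get("rl") or []
    return [(r.get("rid"), r.get("rn"), r.get("nn")) for r in rooms]


def crawl(fetch_page, max_pages=50):
    """Call fetch_page(1), fetch_page(2), ... until a page has no rooms or max_pages is reached."""
    results = []
    for page_no in range(1, max_pages + 1):
        rows = extract_rooms(fetch_page(page_no))
        if not rows:  # assumption: an empty page means we are past the last one
            break
        results.extend(rows)
    return results


# Offline demo with made-up data. A real fetcher might instead return
# requests.get(base_url.format(page_no), headers=headers, timeout=10).json()
fake_pages = {
    1: {"data": {"rl": [{"rid": 1, "rn": "room A", "nn": "streamer A"}]}},
    2: {"data": {"rl": []}},
}
print(crawl(lambda page_no: fake_pages.get(page_no, {})))
```

The max_pages cap is there as a safety net in case the empty-page assumption turns out to be wrong.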

Result: the script prints one line per room with the room ID, room title, and streamer name.

Reposted from blog.csdn.net/Yanghongru/article/details/104646980