Python battle star show / dance video in Yan value area (everyone knows it ^.^)

foreword

Hello! Hello everyone, this is the Demon King~

Knowledge points:

  1. The basic process of crawler
  2. re regular expression is simple to use
  3. requests
  4. JSON data parsing method
  5. Video data saving

development environment]:

  • Python 3.8
  • Pycharm

module use]:

  • requests >>> pip install requests third-party modules
  • re

win + R Enter cmd Enter the installation command pip install If the module name is popular, it may be because the network connection timed out to switch the domestic mirror source

The basic process of the crawler (fixed):

1. Data source analysis

  1. Determine what the crawling content is? (target URL, data in the URL)
    video content

  2. Carry out packet capture analysis through developer tools, and analyze the data we want to obtain by requesting that url address
    . I. Through analysis, we can know what is the video playback url address?
    II. Through the video playback address, go to analyze and find, the video data packet is in Which?
    III. By comparing the request parameters of the two video data packets, we can know that the video content can be obtained as long as all the video IDs are obtained
    (picture ID, video ID, music ID or any ID can be obtained from the list page)
    IV. To analyze the video ID, you can Where to get it from (generally you can get it on the list page)

I want to get the video playback address >>> to get the video packet >>> to get the video ID

2. Code implementation steps Send request to obtain data analysis data save data

  1. send request, for dance video listing page send request
  2. Get data, the server returns the data content
  3. Parse the data, extract the video ID of the data content we want
  4. Send request, pass the video ID into the video packet to send the request
  5. Get data, the server returns the data content
  6. Parse the data, extract the video title and video playback address of the data content we want
  7. Save data, save video content locally
  8. Multi-page data collection

code

# 导入数据请求模块
import requests   # 第三方模块 pip install requests 需要自行安装
# 导入re正则表达式
import re   # 内置模块 不需要安装
# 导入格式化输出模块
import pprint   # 内置模块 不需要安装


# 1. 发送请求, 对于舞蹈视频列表页面发送请求
for page in range(1, 11):
    print(f'正在爬取第{
    
    page}页的数据内容')
    url = f'https://v.huya.com/g/all?set_id=51&order=hot&page={
    
    page}'
    # 爬虫是模拟浏览器对于服务器发送请求, 然后获取服务器返回数据内容
    # user-agent: 用户代理 表示浏览器基本身份信息  (一种简单反反爬手段)
    headers = {
    
    
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.51 Safari/537.36'
    }
    # 通过requests模块里面get请求方式对于url地址发送请求, 并且携带上headers请求进行伪装, 最后用自定义变量response接收返回数据
    response = requests.get(url=url, headers=headers)
    # <Response [200]> 表示请求成功, 请求网址成功了  *** 200状态码表示请求成功, 但是不一定能够得到数据
    # 2. 获取数据, 服务器返回数据内容 response.text 获取响应文本数据
    # print(response.text)
    # 3. 解析数据, 提取我们想要数据内容 视频ID
    # 解析方式: css re xpath
    # <li data-vid="676382675">  想要数据 可以(.*?) 从response.text 里面去找寻这样数据内容
    # .*?  是可以匹配任意字符(除了\n换行符以外)  如果你只是单纯提取数字 最好用 \d+ 匹配一个或者多个数字
    video_ids = re.findall('<li data-vid="(\d+)">', response.text)  # 返回列表数据
    for video_id in video_ids:  # 通过for循环遍历 提取列表里面元素 一个一个提取
        # print(video_id)
        # 4. 发送请求, 把视频ID传入到视频数据包里面发送请求
        # 5. 获取数据, 服务器返回数据内容
        # f 字符串格式化方法 {
    
    } 占位符
        video_info = f'https://liveapi.huya.com/moment/getMomentContent?videoId={
    
    video_id}&uid=&_=1647433310180'
        json_data = requests.get(url=video_info, headers=headers).json()
        # 有反爬 就有 反反爬
        # print(json_data)
        # pprint.pprint(json_data)
        # 根据冒号左边的内容, 提取冒号右边的内容
        # 6. 解析数据
        title = json_data['data']['moment']['title']
        video_url = json_data['data']['moment']['videoInfo']['definitions'][0]['url']
        # 7. 保存数据 >>> 发送请求 并且获取数据
        """
        response.text   >>> 文本数据返回字符串数据
        response.json() >>> json字典数据
        response.content >>> 二进制数据
        """
        video_content = requests.get(url=video_url, headers=headers).content
        with open('video\\' + title + '.mp4', mode='wb') as f:
            f.write(video_content)
        print(title, video_url)

epilogue

Well, this article of mine ends here!

If you have more suggestions or questions, feel free to comment or private message me! Let's work hard together (ง •_•)ง

Follow the blogger if you like it, or like and comment on my article! ! !

Guess you like

Origin blog.csdn.net/python56123/article/details/124149582