Python crawler tutorial for beginners (17): crawling short videos across the YY site

Preface

The text and images in this article come from the Internet and are for learning and discussion only, with no commercial use. If you have any concerns, please contact us so we can address them.

Tutorial videos on Python crawlers, data analysis, web development, and other case studies are free to watch online:

https://space.bilibili.com/523606542

Previous articles in this series

Python crawler tutorial for beginners (1): crawling Douban movie ranking information

Python crawler tutorial for beginners (2): crawling novels

Python crawler tutorial for beginners (3): crawling Lianjia second-hand housing data

Python crawler tutorial for beginners (4): crawling 51job.com recruitment information

Python crawler tutorial for beginners (5): crawling Bilibili video danmaku (bullet comments)

Python crawler tutorial for beginners (6): making word clouds

Python crawler tutorial for beginners (7): crawling Tencent Video danmaku

Python crawler tutorial for beginners (8): crawling forum articles and saving them as PDF

Python crawler tutorial for beginners (9): a multi-threaded crawler case study

Python crawler tutorial for beginners (10): crawling 4K ultra-HD wallpapers from Bi'an ("the other shore")

Python crawler tutorial for beginners (11): crawling the latest Honor of Kings skins

Python crawler tutorial for beginners (12): crawling the latest League of Legends skins

Python crawler tutorial for beginners (13): crawling high-quality ultra-HD wallpapers

Python crawler tutorial for beginners (14): crawling audiobook website data

Python crawler tutorial for beginners (15): crawling music material from websites

Python crawler tutorial for beginners (16): crawling Haokan Video

Basic development environment

  • Python 3.6
  • PyCharm

Modules used

import os
import requests

Install Python and add it to your PATH environment variable, then use pip to install the third-party module (requests). The re, json, and pprint modules used later are part of the standard library.

1. Define the target

 

 


Search for YY on Baidu, open the site, and select the short-video category. The short videos listed there are the data we want.

 

2. Analyze the web page data

The site loads more data as you scroll down the page. This pattern was explained in the previous article on Haokan Video, and YY works the same way: the surface differs, but the mechanism does not.

 


As shown in the figure, the URL highlighted by the box is the playback address of the short video.

 


Data interface address (from the captured request):

https://api-tinyvideo-web.yy.com/home/tinyvideosv2?callback=jQuery112409962628943012035_1613628479734&appId=svwebpc&sign=&data=%7B%22uid%22%3A0%2C%22page%22%3A1%2C%22pageSize%22%3A10%7D&_=1613628479736

Request parameters for the second page:

 


Request parameters for the third page:

 


Clearly, paging is controlled by the page field inside the data parameter; everything else stays the same.

So the plan is to build a paging loop, extract each video's URL and the publisher's name, and save the videos locally.

3. Code implementation
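To make the paging field concrete, the query string of the interface address above can be decoded offline. The sketch below uses only the standard library; build_data is a hypothetical helper name introduced here for illustration, and the uid and pageSize values are simply the ones seen in the captured request.

```python
import json
from urllib.parse import urlsplit, parse_qs

# Interface address copied from the article (page 1).
API_URL = (
    "https://api-tinyvideo-web.yy.com/home/tinyvideosv2"
    "?callback=jQuery112409962628943012035_1613628479734"
    "&appId=svwebpc&sign="
    "&data=%7B%22uid%22%3A0%2C%22page%22%3A1%2C%22pageSize%22%3A10%7D"
    "&_=1613628479736"
)

# parse_qs percent-decodes the values; keep_blank_values keeps the empty "sign".
query = parse_qs(urlsplit(API_URL).query, keep_blank_values=True)
data = json.loads(query["data"][0])
print(data)  # {'uid': 0, 'page': 1, 'pageSize': 10}


def build_data(page, page_size=10):
    """Hypothetical helper: serialize the "data" parameter for a given page."""
    payload = {"uid": 0, "page": page, "pageSize": page_size}
    return json.dumps(payload, separators=(",", ":"))


print(build_data(2))  # {"uid":0,"page":2,"pageSize":10}
```

Building the string with json.dumps instead of string formatting keeps the value valid JSON if more fields are added later.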

1. Request the data interface

import requests
url = 'https://api-tinyvideo-web.yy.com/home/tinyvideosv2'
params = {
    'callback': 'jQuery112409962628943012035_1613628479734',
    'appId': 'svwebpc',
    'sign': '',
    'data': '{"uid":0,"page":1,"pageSize":10}',
    '_': '1613628479737',
}
headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36'
}
response = requests.get(url=url, params=params, headers=headers)

The question is: is the returned data JSON?

 


As the figure above shows, the response looks like JSON at first glance, and many readers will assume that it is.

 


Parsing it as JSON raises a JSONDecodeError: the response body is not JSON but a plain string.

 


Inspecting the response shows that the data is wrapped in an extra jQuery112409962628943012035_1613628479734() call, which is the callback value sent in the request. The JSON sits inside the parentheses. There are three ways to extract the data.

1. Take response.text and use regular expressions to extract the video URLs and the publisher names directly

import re

# response.text is the raw wrapped string; pull each field out directly
video_url = re.findall('"resurl":"(.*?)"', response.text)
user_name = re.findall('"username":"(.*?)"', response.text)

 


2. Take response.text, use a regular expression to extract the content inside jQuery112409962628943012035_1613628479734(), convert that string with the json module, and then iterate over the result to extract the data.

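import json
import pprint
import re
# Greedy (.*) with re.S captures up to the last ")", so a ")" inside the JSON does not cut the match short
string = re.findall(r'jQuery112409962628943012035_1613628479734\((.*)\)', response.text, re.S)[0]
json_data = json.loads(string)
result = json_data['data']['data']
pprint.pprint(result)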
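The second approach can also be written without hard-coding the callback name. The following is a minimal sketch, assuming the response always has the shape callbackName(...); strip_jsonp is a hypothetical helper name, and the sample string is made up for illustration rather than real output from the interface.

```python
import json
import re

# Made-up sample in the wrapped shape described above; not real API output.
sample = (
    'jQuery112409962628943012035_1613628479734('
    '{"data":{"data":[{"username":"demo (test)",'
    '"resurl":"https://example.com/a.mp4"}]}})'
)


def strip_jsonp(text):
    """Return the JSON payload found inside callbackName( ... )."""
    # Greedy (.*) runs to the LAST ")", so parentheses inside values survive.
    match = re.search(r"^\s*[\w$.]+\s*\((.*)\)\s*;?\s*$", text, re.S)
    if match is None:
        raise ValueError("response is not wrapped in a callback")
    return match.group(1)


# The wrapped text itself is not valid JSON.
try:
    json.loads(sample)
except json.JSONDecodeError as exc:
    print("raw text is not JSON:", exc.msg)

# After removing the wrapper, it parses normally.
result = json.loads(strip_jsonp(sample))["data"]["data"]
print(result[0]["username"])  # demo (test)
```

Note that the sample user name contains parentheses on purpose: a non-greedy pattern such as \((.*?)\) would stop at the first ")" and hand truncated text to json.loads.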

 


3. Remove the callback parameter from the request, and the interface returns JSON directly

import pprint
import requests

url = 'https://api-tinyvideo-web.yy.com/home/tinyvideosv2'
params = {
    'appId': 'svwebpc',
    'sign': '',
    'data': '{"uid":0,"page":1,"pageSize":10}',
    '_': '1613628479737',
}
headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36'
}
response = requests.get(url=url, params=params, headers=headers)
json_data = response.json()
result = json_data['data']['data']
pprint.pprint(result)

2. Save the data

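import os

os.makedirs('video', exist_ok=True)  # the output folder must exist before writing
for index in result:
    video_url = index['resurl']
    user_name = index['username']
    # download the whole video into memory, then write it to disk
    video_content = requests.get(url=video_url, headers=headers).content
    with open(os.path.join('video', user_name + '.mp4'), mode='wb') as f:
        f.write(video_content)
    print(user_name)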
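Reading .content holds the whole video in memory before anything is written. For larger files, a streamed write is one alternative. The sketch below is an assumption-laden illustration, not part of the original article: save_stream is a hypothetical helper that accepts any object with an iter_content method, which is what a requests response provides when the request is made with stream=True.

```python
import os


def save_stream(response, path, chunk_size=64 * 1024):
    """Write a streamed response to path chunk by chunk; return bytes written."""
    # Create the parent folder if the path has one.
    os.makedirs(os.path.dirname(path) or ".", exist_ok=True)
    written = 0
    with open(path, mode="wb") as f:
        for chunk in response.iter_content(chunk_size=chunk_size):
            if chunk:  # skip empty keep-alive chunks
                f.write(chunk)
                written += len(chunk)
    return written


# Intended use with requests (needs network access, so shown as a comment):
#   with requests.get(video_url, headers=headers, stream=True, timeout=30) as r:
#       r.raise_for_status()
#       save_stream(r, os.path.join("video", user_name + ".mp4"))
```

Because the helper only depends on iter_content, it can be exercised with a small stand-in object instead of a live download.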

Note: some user names contain special characters, which causes an error when the file is saved.

 

 


So use a regular expression to replace the special characters:

import re


def change_title(title):
    pattern = re.compile(r"[\/\\\:\*\?\"\<\>\|]")  # '/ \ : * ? " < > |'
    new_title = re.sub(pattern, "_", title)  # replace with an underscore
    return new_title
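To see what the replacement does, here is a small offline check using made-up publisher names. safe_name is a hypothetical variant introduced only for this sketch: it applies the same character class and additionally falls back to a default when nothing usable remains, which is an assumption beyond the original function.

```python
import re


def safe_name(title, fallback="untitled"):
    """Replace / \\ : * ? " < > | with "_" and avoid an empty file name."""
    cleaned = re.sub(r'[\\/:*?"<>|]', "_", title).strip()
    return cleaned or fallback


# Made-up names for illustration only.
for name in ["a/b:c", "what?", "plain_name", "   "]:
    print(repr(name), "->", repr(safe_name(name)))
```

The first three names come out as a_b_c, what_, and plain_name; the whitespace-only name falls back to untitled.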

Complete implementation code

import os
import re

import requests


def change_title(title):
    """Replace characters that are not allowed in Windows file names."""
    pattern = re.compile(r"[\/\\\:\*\?\"\<\>\|]")  # '/ \ : * ? " < > |'
    new_title = re.sub(pattern, "_", title)  # replace with an underscore
    return new_title


url = 'https://api-tinyvideo-web.yy.com/home/tinyvideosv2'
headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36'
}
os.makedirs('video', exist_ok=True)  # make sure the output folder exists

page = 0
while True:
    page += 1
    # no callback parameter, so the interface returns plain JSON
    params = {
        'appId': 'svwebpc',
        'sign': '',
        'data': '{"uid":0,"page":%s,"pageSize":10}' % str(page),
        '_': '1613628479737',
    }
    response = requests.get(url=url, params=params, headers=headers)
    json_data = response.json()
    result = json_data['data']['data']
    if not result:  # assumption: an empty page means there is nothing left to fetch
        break
    for index in result:
        video_url = index['resurl']
        user_name = index['username']
        new_title = change_title(user_name)
        video_content = requests.get(url=video_url, headers=headers).content
        with open(os.path.join('video', new_title + '.mp4'), mode='wb') as f:
            f.write(video_content)
        print(user_name)
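A paging loop of this kind needs a stopping rule. The sketch below separates the paging logic from the network call so it can be checked offline. It assumes that an empty page marks the end of the data, which the article does not confirm; build_params, iter_videos, and fake_fetch are hypothetical names, and the "_" timestamp parameter is left out for brevity.

```python
import json


def build_params(page, page_size=10):
    """Query parameters for one page, mirroring the article's request."""
    data = {"uid": 0, "page": page, "pageSize": page_size}
    return {
        "appId": "svwebpc",
        "sign": "",
        "data": json.dumps(data, separators=(",", ":")),
    }


def iter_videos(fetch_page, max_pages=50):
    """Yield (username, resurl) pairs until a page is empty or max_pages is hit."""
    for page in range(1, max_pages + 1):
        items = fetch_page(build_params(page))
        if not items:  # assumed end-of-data signal
            break
        for item in items:
            yield item["username"], item["resurl"]


# Stand-in for the network call: two pages of fake data, then nothing.
FAKE_PAGES = {
    1: [{"username": "u1", "resurl": "https://example.com/1.mp4"}],
    2: [{"username": "u2", "resurl": "https://example.com/2.mp4"}],
}


def fake_fetch(params):
    page = json.loads(params["data"])["page"]
    return FAKE_PAGES.get(page, [])


videos = list(iter_videos(fake_fetch))
print(videos)
```

In a real run, fetch_page would wrap the request from the article, for example a function that calls requests.get with these parameters and returns response.json()['data']['data']. The max_pages cap guards against looping forever if the interface never returns an empty page.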

Origin blog.csdn.net/m0_48405781/article/details/113867436