Preface
The text and images in this article are taken from the Internet and are for learning and exchange purposes only; they are not for any commercial use. If you have any questions, please contact us.
Free video tutorials on Python crawlers, data analysis, website development, and other case studies are available online:
https://space.bilibili.com/523606542
Previous articles in this series
Python crawler beginner tutorial (1): crawling Douban movie ranking information
Python crawler beginner tutorial (2): crawling novels
Python crawler beginner tutorial (3): crawling Lianjia second-hand housing data
Python crawler beginner tutorial (4): crawling 51job.com recruitment information
Python crawler beginner tutorial (5): crawling Bilibili video danmaku (bullet comments)
Python crawler beginner tutorial (6): making word cloud diagrams
Python crawler beginner tutorial (7): crawling Tencent Video danmaku
Python crawler beginner tutorial (8): crawling forum articles and saving them as PDF
Python crawler beginner tutorial (9): a multi-threaded crawler case study
Python crawler beginner tutorial (10): crawling Bi'an 4K ultra-HD wallpapers
Python crawler beginner tutorial (11): crawling the latest Honor of Kings skins
Python crawler beginner tutorial (12): crawling the latest League of Legends skins
Python crawler beginner tutorial (13): crawling high-quality ultra-HD wallpapers
Python crawler beginner tutorial (14): crawling audiobook website data
Python crawler beginner tutorial (15): crawling website music assets
Python crawler beginner tutorial (16): crawling Haokan ("good-looking") videos
Basic development environment
- Python 3.6
- Pycharm
Related modules

```python
import os
import requests
```

Install Python, add it to the environment variables (PATH), and use pip to install the required modules.
1. Identify the target data
Search for YY on Baidu, open the site, and click the category selector to choose short videos. The selfie short videos of young women in that category are the data we want.
2. Analyze the web page data
The website loads more data as you scroll down the page. As explained in the earlier article on crawling Haokan ("good-looking") videos, YY video works the same way, just in a different wrapper.
As shown in the figure, the URL highlighted in the box is the playback address of the short video.
Packet-capture interface address:
https://api-tinyvideo-web.yy.com/home/tinyvideosv2?callback=jQuery112409962628943012035_1613628479734&appId=svwebpc&sign=&data=%7B%22uid%22%3A0%2C%22page%22%3A1%2C%22pageSize%22%3A10%7D&_=1613628479736
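The `data` query parameter in that address is URL-encoded JSON. Decoding it with the standard-library `urllib.parse.unquote` reveals the paging fields:

```python
from urllib.parse import unquote

# The URL-encoded `data` parameter copied from the captured interface address
encoded = '%7B%22uid%22%3A0%2C%22page%22%3A1%2C%22pageSize%22%3A10%7D'

decoded = unquote(encoded)
print(decoded)  # → {"uid":0,"page":1,"pageSize":10}
```

This is how the `page` and `pageSize` fields used below were identified.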
Data request parameters of the second page:
Data request parameters of the third page:
Clearly, only the `page` field inside the `data` parameter changes between pages.
Construct a page-turning loop, extract each video's URL and the publisher's name, and save the videos locally.
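The per-page `data` parameter can also be built with `json.dumps` instead of string formatting; a small sketch (the field names come from the captured request above):

```python
import json

def build_data_param(page, page_size=10, uid=0):
    """Serialize the paging fields into the compact JSON string the API expects."""
    payload = {'uid': uid, 'page': page, 'pageSize': page_size}
    # separators=(',', ':') removes the spaces json.dumps inserts by default
    return json.dumps(payload, separators=(',', ':'))

print(build_data_param(2))  # → {"uid":0,"page":2,"pageSize":10}
```

This avoids quoting mistakes when the payload grows beyond three fields.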
3. Code implementation
1. Requesting the data interface
```python
import requests

url = 'https://api-tinyvideo-web.yy.com/home/tinyvideosv2'
params = {
    'callback': 'jQuery112409962628943012035_1613628479734',
    'appId': 'svwebpc',
    'sign': '',
    'data': '{"uid":0,"page":0,"pageSize":10}',
    '_': '1613628479737',
}
headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36'
}
response = requests.get(url=url, params=params, headers=headers)
```
The question is: is the returned data JSON?
As the figure above shows, many people see data like this and assume it must be JSON.
But parsing it raises `JSONDecodeError`: it is not JSON, it is a string.
Inspecting the response shows that the returned data is wrapped in an extra `jQuery112409962628943012035_1613628479734(...)` callback call (a JSONP response); the JSON data sits inside it. To extract the data, there are three ways.
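A minimal demonstration of why the callback-wrapped (JSONP) response cannot be parsed directly; the payload here is illustrative, not real API output:

```python
import json

# A JSONP-style response: JSON wrapped in a JavaScript callback call
jsonp = 'jQuery112409962628943012035_1613628479734({"data": {"page": 1}})'

try:
    json.loads(jsonp)
except json.JSONDecodeError as err:
    # The callback wrapper makes the string invalid JSON
    print('not valid JSON:', err)
```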
1. Take `response.text` and use regular expressions to extract the video URL and the publisher's name (requires `import re`):

```python
import re

video_url = re.findall('"resurl":"(.*?)"', response.text)
user_name = re.findall('"username":"(.*?)"', response.text)
```
2. Take `response.text`, use a regular expression to extract the payload inside `jQuery112409962628943012035_1613628479734(...)`, convert the string to JSON with the `json` module, then traverse it to extract the data:

```python
import re
import json
import pprint

string = re.findall(r'jQuery112409962628943012035_1613628479734\((.*?)\)', response.text)[0]
json_data = json.loads(string)
result = json_data['data']['data']
pprint.pprint(result)
```
3. Remove the `callback` parameter from the request; the server then returns plain JSON directly:

```python
import pprint
import requests

url = 'https://api-tinyvideo-web.yy.com/home/tinyvideosv2'
params = {
    'appId': 'svwebpc',
    'sign': '',
    'data': '{"uid":0,"page":1,"pageSize":10}',
    '_': '1613628479737',
}
headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36'
}
response = requests.get(url=url, params=params, headers=headers)
json_data = response.json()
result = json_data['data']['data']
pprint.pprint(result)
```
2. Saving the data

```python
import os

os.makedirs('video', exist_ok=True)  # make sure the output folder exists

for index in result:
    video_url = index['resurl']
    user_name = index['username']
    video_content = requests.get(url=video_url, headers=headers).content
    with open('video\\' + user_name + '.mp4', mode='wb') as f:
        f.write(video_content)
    print(user_name)
```
Note: if a user name contains special characters, saving the file raises an error, so use a regular expression to replace those characters:

```python
import re

def change_title(title):
    # Characters not allowed in Windows file names: / \ : * ? " < > |
    pattern = re.compile(r'[\/\\\:\*\?\"\<\>\|]')
    new_title = re.sub(pattern, '_', title)  # replace each with an underscore
    return new_title
```
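A self-contained quick check of this sanitizer (the helper is redefined here so the snippet runs on its own; the sample name is made up for illustration):

```python
import re

def change_title(title):
    # Replace characters that are invalid in Windows file names with underscores
    pattern = re.compile(r'[\/\\\:\*\?\"\<\>\|]')
    return re.sub(pattern, '_', title)

print(change_title('user:name?*'))  # → user_name__
```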
Complete implementation code

```python
import os
import re

import requests


def change_title(title):
    # Replace characters that are invalid in Windows file names with underscores
    pattern = re.compile(r'[\/\\\:\*\?\"\<\>\|]')
    new_title = re.sub(pattern, '_', title)
    return new_title


os.makedirs('video', exist_ok=True)
page = 0
while True:
    page += 1
    url = 'https://api-tinyvideo-web.yy.com/home/tinyvideosv2'
    params = {
        'appId': 'svwebpc',
        'sign': '',
        'data': '{"uid":0,"page":%s,"pageSize":10}' % str(page),
        '_': '1613628479737',
    }
    headers = {
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36'
    }
    response = requests.get(url=url, params=params, headers=headers)
    json_data = response.json()
    result = json_data['data']['data']
    if not result:  # stop when a page comes back empty
        break
    for index in result:
        video_url = index['resurl']
        user_name = index['username']
        new_title = change_title(user_name)
        video_content = requests.get(url=video_url, headers=headers).content
        with open('video\\' + new_title + '.mp4', mode='wb') as f:
            f.write(video_content)
        print(user_name)
```
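The loop above reads each whole video into memory before writing it to disk. For large files, `requests` supports streaming via `stream=True` and `response.iter_content`; the write logic can be sketched (and tested) independently of the network:

```python
def write_chunks(chunks, path):
    """Write an iterable of byte chunks to `path`; return the bytes written."""
    total = 0
    with open(path, mode='wb') as f:
        for chunk in chunks:
            if chunk:  # skip empty keep-alive chunks
                f.write(chunk)
                total += len(chunk)
    return total

# With a live response it would be used roughly as:
# response = requests.get(video_url, headers=headers, stream=True)
# write_chunks(response.iter_content(chunk_size=8192), 'video\\name.mp4')
```

This keeps memory usage bounded by the chunk size instead of the video size.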