Crawling mini videos from Six Rooms (v.6.cn)
Aims
Crawl the mini videos from Six Rooms (v.6.cn) and save them in batches to a local folder.
Project preparation
Software: PyCharm
Third-party libraries: requests, fake_useragent
website address: https://v.6.cn/minivideo/
Website analysis
Open the website and examine how the page loads its content. Press F12 to open the browser's developer tools and watch the Network tab while scrolling: the video list arrives in captured XHR packets rather than in the initial HTML, so the page is dynamically loaded.
Page number analysis
Comparing the first three data packets:
https://v.6.cn/minivideo/getMiniVideoList.php?act=recommend&page=1&pagesize=25
https://v.6.cn/minivideo/getMiniVideoList.php?act=recommend&page=2&pagesize=25
https://v.6.cn/minivideo/getMiniVideoList.php?act=recommend&page=3&pagesize=25
Only the page= parameter changes from one packet to the next, so any page can be requested by filling in its number.
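The pattern above can be captured in a small URL template helper, a sketch of the approach used later in the class:

```python
# URL template observed in the captured packets; {} is the page number.
BASE_URL = 'https://v.6.cn/minivideo/getMiniVideoList.php?act=recommend&page={}&pagesize=25'

def build_page_urls(pages):
    """Return the request URL for each page number in `pages`."""
    return [BASE_URL.format(p) for p in pages]

for url in build_page_urls(range(1, 4)):
    print(url)
```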
Anti-scraping analysis
Repeated requests from the same client risk being blocked. Here, fake_useragent is used to generate a random User-Agent request header, so the requests look less uniform (note that this does not hide the IP address itself).
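The same idea can be sketched without the library by picking a User-Agent from a small pool at random; the strings below are illustrative examples only, whereas fake_useragent draws from a much larger, regularly updated database:

```python
import random

# Illustrative User-Agent strings (assumed examples, not from the site).
USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15',
    'Mozilla/5.0 (X11; Linux x86_64; rv:115.0) Gecko/20100101 Firefox/115.0',
]

def random_headers():
    """Return request headers with a randomly chosen User-Agent."""
    return {'User-Agent': random.choice(USER_AGENTS)}

print(random_headers())
```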
Code
1. Import the third-party libraries, then define a class inheriting from object, with an __init__ method and a main method.
import requests
from fake_useragent import UserAgent

class sixroom(object):
    def __init__(self):
        # URL template; the page number is filled in later with .format()
        self.url = 'https://v.6.cn/minivideo/getMiniVideoList.php?act=recommend&page={}&pagesize=25'
        ua = UserAgent(verify_ssl=False)
        # Pick one random User-Agent for this run
        self.headers = {'User-Agent': ua.random}

    def main(self):
        pass

if __name__ == '__main__':
    spider = sixroom()
    spider.main()
2. Send a request and fetch the page as JSON.
    def get_html(self, url):
        response = requests.get(url, headers=self.headers)
        html = response.json()  # the API returns JSON
        return html
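The JSON payload is expected to carry the entries under content → list. A defensive variant of the extraction (an assumption about graceful failure, not the site's guaranteed schema) returns an empty list when the keys are missing:

```python
def extract_video_list(payload):
    """Pull the video entries out of the API response dict.

    Returns [] when the expected 'content'/'list' keys are missing,
    instead of raising KeyError on a schema change.
    """
    content = payload.get('content') or {}
    return content.get('list') or []

sample = {'content': {'list': [{'alias': 'a', 'title': 't', 'playurl': 'u'}]}}
print(extract_video_list(sample))
print(extract_video_list({}))
```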
3. Parse the response to obtain each mini video's download link and save the file locally.
    def parse_html(self, html):
        content_list = html['content']['list']
        for data in content_list:
            alias = data['alias']
            title = data['title']
            playurl = data['playurl']
            filename = alias + '.' + title
            print(filename)
            r = requests.get(playurl, headers=self.headers)
            with open('F:/pycharm文件/document/' + filename + '.mp4', 'wb') as f:
                f.write(r.content)
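One caveat: video titles may contain characters that Windows forbids in filenames (such as / or ?), which would make open() fail. A minimal sanitizer sketch (the helper name and replacement character are my own choices):

```python
import re

# Characters Windows does not allow in filenames.
INVALID_CHARS = r'[\\/:*?"<>|]'

def safe_filename(alias, title):
    """Join alias and title, replacing filesystem-unsafe characters with '_'."""
    name = '{}.{}'.format(alias, title)
    return re.sub(INVALID_CHARS, '_', name).strip()

print(safe_filename('user1', 'my/video?'))  # user1.my_video_
```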
4. Main function and the function calls.
    def main(self):
        end_page = int(input("How many pages to crawl: "))
        for page in range(1, end_page + 1):
            url = self.url.format(page)
            print("Crawling page %s..." % page)
            html = self.get_html(url)
            self.parse_html(html)
            print("Page %s done" % page)
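One more note: r.content holds the entire video in memory before writing. For large files, a streamed download writes chunk by chunk instead. The helper below is a sketch that accepts any iterable of byte chunks; with requests you would pass it requests.get(playurl, stream=True).iter_content(chunk_size=8192):

```python
import os
import tempfile

def save_chunks(chunks, path):
    """Write an iterable of byte chunks to `path`; return bytes written."""
    written = 0
    with open(path, 'wb') as f:
        for chunk in chunks:
            if chunk:  # skip empty keep-alive chunks
                f.write(chunk)
                written += len(chunk)
    return written

# Demo with in-memory chunks standing in for iter_content():
tmp = os.path.join(tempfile.gettempdir(), 'demo_video.mp4')
n = save_chunks([b'abc', b'', b'de'], tmp)
print(n)  # 5
```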
Result
The complete code is as follows:
import requests
from fake_useragent import UserAgent

class sixroom(object):
    def __init__(self):
        # URL template; the page number is filled in with .format()
        self.url = 'https://v.6.cn/minivideo/getMiniVideoList.php?act=recommend&page={}&pagesize=25'
        ua = UserAgent(verify_ssl=False)
        # Pick one random User-Agent for this run
        self.headers = {'User-Agent': ua.random}

    def get_html(self, url):
        response = requests.get(url, headers=self.headers)
        html = response.json()  # the API returns JSON
        return html

    def parse_html(self, html):
        content_list = html['content']['list']
        for data in content_list:
            alias = data['alias']
            title = data['title']
            playurl = data['playurl']
            filename = alias + '.' + title
            print(filename)
            r = requests.get(playurl, headers=self.headers)
            with open('F:/pycharm文件/document/' + filename + '.mp4', 'wb') as f:
                f.write(r.content)

    def main(self):
        end_page = int(input("How many pages to crawl: "))
        for page in range(1, end_page + 1):
            url = self.url.format(page)
            print("Crawling page %s..." % page)
            html = self.get_html(url)
            self.parse_html(html)
            print("Page %s done" % page)

if __name__ == '__main__':
    spider = sixroom()
    spider.main()
Disclaimer: for personal study and reference only.