Crawler Project in Practice 9: Scraping 6.cn Mini Videos

Aim

Crawl mini videos from 6.cn and save them in batches to a local folder.

Project preparation

Software: PyCharm
Third-party libraries: requests, fake_useragent
Website: https://v.6.cn/minivideo/

Website analysis

Open the website and analyze how the page content loads. Press F12 to open the browser's developer tools and watch the network panel. The captured packets show that the video list is fetched by background requests, so the page is dynamically loaded.

Page number analysis

Comparing the first, second, and third data packets gives the following request URLs:

https://v.6.cn/minivideo/getMiniVideoList.php?act=recommend&page=1&pagesize=25
https://v.6.cn/minivideo/getMiniVideoList.php?act=recommend&page=2&pagesize=25
https://v.6.cn/minivideo/getMiniVideoList.php?act=recommend&page=3&pagesize=25

It can be seen that only the page= parameter changes from page to page.
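
The pattern above can be sketched by substituting the page number into a URL template:

```python
# Build the paginated API URLs by formatting the page number into a template.
url_template = 'https://v.6.cn/minivideo/getMiniVideoList.php?act=recommend&page={}&pagesize=25'

urls = [url_template.format(page) for page in range(1, 4)]
for url in urls:
    print(url)
```

This is exactly how the spider below generates each page's request URL.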

Anti-scraping analysis

Repeated requests from the same IP address risk being blocked. Here, fake_useragent is used to generate a random User-Agent request header for each visit.
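
fake_useragent fetches real browser strings from an online list; the same rotation idea can be sketched with only the standard library and a small hand-written pool (the strings below are illustrative examples, not taken from the article):

```python
import random

# A small hand-picked pool of User-Agent strings (illustrative values).
UA_POOL = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.93 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.0 Safari/605.1.15',
    'Mozilla/5.0 (X11; Linux x86_64; rv:88.0) Gecko/20100101 Firefox/88.0',
]

def random_headers():
    """Return request headers with a randomly chosen User-Agent."""
    return {'User-Agent': random.choice(UA_POOL)}

print(random_headers())
```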

Code

1. Import the required third-party libraries, define a class that inherits from object, and give it an __init__ method and a main method.

import requests
from fake_useragent import UserAgent

class sixroom(object):
    def __init__(self):
        self.url = 'https://v.6.cn/minivideo/getMiniVideoList.php?act=recommend&page={}&pagesize=25'
        ua = UserAgent(verify_ssl=False)
        # Pick a random User-Agent for this session
        self.headers = {
            'User-Agent': ua.random
        }

    def main(self):
        pass

if __name__ == '__main__':
    spider = sixroom()
    spider.main()

2. Send a request and parse the response as JSON.

    def get_html(self, url):
        response = requests.get(url, headers=self.headers)
        html = response.json()
        return html

3. Parse the JSON to obtain each mini video's link address and save the video locally.

    def parse_html(self, html):
        content_list = html['content']['list']
        for data in content_list:
            alias = data['alias']
            title = data['title']
            playurl = data['playurl']
            filename = alias + '.' + title
            print(filename)
            # Download the video and write it to disk
            r = requests.get(playurl, headers=self.headers)
            with open('F:/pycharm文件/document/' + filename + '.mp4', 'wb') as f:
                f.write(r.content)
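
One caveat, noted here as an extra safeguard rather than part of the original code: video titles may contain characters that are illegal in Windows filenames, which would make open() fail. A small sanitizer (the helper name safe_filename is my own) can be sketched as:

```python
import re

def safe_filename(name):
    # Replace characters that Windows forbids in filenames with underscores.
    return re.sub(r'[\\/:*?"<>|]', '_', name)

print(safe_filename('alias.my:video?title'))  # → alias.my_video_title
```

Passing filename through such a helper before building the path makes the save step more robust.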

4. Main function and function call.

    def main(self):
        end_page = int(input("How many pages to crawl: "))
        for page in range(1, end_page + 1):
            url = self.url.format(page)
            print("Page %s..." % page)
            html = self.get_html(url)
            self.parse_html(html)
            print("Page %s done" % page)
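
Since the anti-scraping measure above only rotates the User-Agent, a short random pause between pages (an extra precaution, not in the original code) would further reduce the risk of an IP ban:

```python
import random
import time

def polite_pause(min_s=1.0, max_s=3.0):
    """Sleep for a random interval between page requests and return it."""
    delay = random.uniform(min_s, max_s)
    time.sleep(delay)
    return delay

d = polite_pause(0.01, 0.02)  # very short interval, just for demonstration
print("slept %.3f s" % d)
```

A call like polite_pause() at the end of each loop iteration in main would space the requests out.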

Results

The complete code is as follows:

import requests
from fake_useragent import UserAgent

class sixroom(object):
    def __init__(self):
        self.url = 'https://v.6.cn/minivideo/getMiniVideoList.php?act=recommend&page={}&pagesize=25'
        ua = UserAgent(verify_ssl=False)
        # Pick a random User-Agent for this session
        self.headers = {
            'User-Agent': ua.random
        }

    def get_html(self, url):
        response = requests.get(url, headers=self.headers)
        html = response.json()
        return html

    def parse_html(self, html):
        content_list = html['content']['list']
        for data in content_list:
            alias = data['alias']
            title = data['title']
            playurl = data['playurl']
            filename = alias + '.' + title
            print(filename)
            # Download the video and write it to disk
            r = requests.get(playurl, headers=self.headers)
            with open('F:/pycharm文件/document/' + filename + '.mp4', 'wb') as f:
                f.write(r.content)

    def main(self):
        end_page = int(input("How many pages to crawl: "))
        for page in range(1, end_page + 1):
            url = self.url.format(page)
            print("Page %s..." % page)
            html = self.get_html(url)
            self.parse_html(html)
            print("Page %s done" % page)

if __name__ == '__main__':
    spider = sixroom()
    spider.main()

Disclaimer: For personal study and reference only.

Origin blog.csdn.net/qq_44862120/article/details/107826614