Python daily scraper example: scraping the Pear Video (梨视频) site and saving full-length short videos to local disk

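Preface:
1. Target site: Pear Video (梨视频)
2. Note: this is a commercial site. This example is for learning and testing only and is not intended for any other use.
3. Tech stack: requests + re + os
4. Code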


'''
Scrape the Pear Video site and save the downloaded videos locally

 version:01
 author:金鞍少年
 date:2020-03-19
'''

import requests
import re
import os

class PearVideo:
    def __init__(self):
        self.url = 'https://www.pearvideo.com/popular'
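        self.path = os.path.join('.', 'video')  # save directory, relative to the working directory
        os.makedirs(self.path, exist_ok=True)  # create it up front so open() does not fail on a missing folder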
        self.headers = {
            "Referer": "https://www.pearvideo.com/",
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.132 Safari/537.36"
        }

    # Strip special characters so the title is safe to use as a file name
    def clean_zh_text(self, text):
        # Keep only English letters, digits and Chinese characters
        comp = re.compile('[^A-Za-z0-9\u4e00-\u9fa5]')
        return comp.sub('', text)

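    # Fetch the HTML of a page
    def getHTMl(self,url):
        # Named resp rather than re, so the local variable does not shadow the re module
        resp = requests.get(url, headers=self.headers, timeout=10)
        resp.raise_for_status()  # fail loudly instead of silently returning None
        return resp.text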

    # Collect the URL of each video page from the listing page
    def get_Pageurl(self,Html):
        urls = []
        Page_id = re.findall('<a href="(.*?)" class="popularembd actplay">',Html)
        for i in Page_id:
            url = r'https://www.pearvideo.com/'+ i
            urls.append(url)
        return urls

    # Save each video to local disk
    def Download_video(self, urls):
        for url in urls:
            html = self.getHTMl(url)
            video_url = re.findall(',sdUrl="",ldUrl="",srcUrl="(.*?)",', html)[0]
            video_name = re.findall('data-type="2" data-title="(.*?)" ', html)[0]

            name = self.clean_zh_text(video_name)  # strip special characters from the video title
            path = os.path.join(self.path, name)  # build the save path

            res = requests.get(video_url, headers=self.headers)
            with open(path + '.mp4', 'wb') as f:  # the extension was misspelled as .pm4
                f.write(res.content)
            print('Downloaded video: %s' % name)

    # Main flow: fetch the listing page, collect the video page URLs, download each video
    def fun(self):
        self.Download_video(self.get_Pageurl(self.getHTMl(self.url)))

if __name__ == '__main__':
    g = PearVideo()
    g.fun()
    print('All downloads finished!')
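
The download step above reads each whole video into memory through `res.content` before writing it out. As a minimal sketch of an alternative, and not the method used in this post, the response could be streamed to disk in chunks. The names `write_chunks` and `download_streamed`, the chunk size and the timeout are assumptions made for illustration; `stream=True` and `iter_content` are standard requests features.

```python
import os
import tempfile


def write_chunks(chunks, dest):
    """Write an iterable of byte chunks to dest and return the number of bytes written."""
    total = 0
    with open(dest, 'wb') as f:
        for chunk in chunks:
            if chunk:  # skip empty chunks
                f.write(chunk)
                total += len(chunk)
    return total


def download_streamed(video_url, dest, headers=None, chunk_size=1 << 16):
    """Hypothetical helper: stream video_url to dest without holding the whole body in memory."""
    import requests  # imported here so write_chunks can be tried without requests installed

    with requests.get(video_url, headers=headers, stream=True, timeout=30) as resp:
        resp.raise_for_status()
        return write_chunks(resp.iter_content(chunk_size=chunk_size), dest)


# Offline check of the writing logic with in-memory chunks (no network access needed).
demo_path = os.path.join(tempfile.mkdtemp(), 'demo.mp4')
written = write_chunks([b'abc', b'', b'de'], demo_path)
print(written)
```

Inside `Download_video`, a call such as `download_streamed(video_url, path + '.mp4', headers=self.headers)` would take the place of the `requests.get` and `f.write` pair, assuming the video URL extracted by the regular expression is valid.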

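Summary:
1. Limitation: my Python scraping skills are still not good enough to scrape dynamically loaded content. Keep going!
2. A small highlight: I wrapped the text preprocessing in a helper that uses a regular expression to strip special characters from the title; otherwise the operating system may refuse to save the video under that name.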
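
The title cleanup mentioned in point 2 can also be done the other way round: instead of keeping only a whitelist of characters, remove only the characters a file system rejects. The sketch below is one possible variant, not the code used in this post. The helper name `safe_filename`, the fallback value and the character set (the characters Windows forbids in file names, plus control characters) are assumptions for illustration.

```python
import re

# Characters that Windows does not allow in file names, plus ASCII control characters.
_ILLEGAL = re.compile(r'[\\/:*?"<>|\x00-\x1f]')


def safe_filename(title, fallback='video'):
    """Remove illegal characters from title; return fallback if nothing usable is left."""
    cleaned = _ILLEGAL.sub('', title).strip(' .')  # also trim leading/trailing spaces and dots
    return cleaned or fallback


print(safe_filename('a/b: "c"?'))
print(safe_filename('???'))
```

Compared with the whitelist approach, this keeps spaces and ordinary punctuation in the title, and the fallback avoids an empty file name when a title consists only of forbidden characters.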


Reposted from blog.csdn.net/weixin_42444693/article/details/104977698