Scraping Pearvideo video info with a multiprocessing.dummy thread pool

Crawling approach:
- 1. Check whether the page loads its data via Ajax: clear the Network panel, refresh the page, and look for matching requests under the XHR tab. It is empty, so the data is not delivered by Ajax and we can scrape it straight from the page URL:

url = "https://www.pearvideo.com/category_4"

While we are at it, copy the request headers:

headers={"user-agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36 QIHU 360SE"}

- 2. Our goal is to download the four videos in the top row, so we first need each video's detail page. If we fetch the current URL's source with requests.get, we find it contains no usable video download link; presumably the mp4 data only appears once we click through to a video's own page. So the first step is to grab this page's source and use XPath to extract the four detail-page URLs:

page_text = requests.get(url=url, headers=headers).text
tree = etree.HTML(page_text)
li_list = tree.xpath('//ul[@id="listvideoListUl"]/li')
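The same XPath expressions can be sanity-checked offline against a minimal stand-in for the listing markup (the HTML fragment below is made up for illustration; the real page is larger):

```python
from lxml import etree

# Hypothetical, simplified fragment mimicking the category page structure
sample_html = """
<ul id="listvideoListUl">
  <li><div><a href="video_111">clip one</a></div></li>
  <li><div><a href="video_222">clip two</a></div></li>
</ul>
"""

tree = etree.HTML(sample_html)
li_list = tree.xpath('//ul[@id="listvideoListUl"]/li')
hrefs = [li.xpath('./div/a/@href')[0] for li in li_list]
print(hrefs)  # ['video_111', 'video_222']
```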

The code above gives us the list of the four detail entries, but the hrefs they contain are relative, so we have to prepend the site prefix ourselves while iterating over the list:

for li in li_list:
    detail_url = "https://www.pearvideo.com/" + li.xpath('./div/a/@href')[0]
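String concatenation works here, but urllib.parse.urljoin handles leading slashes and already-absolute hrefs more robustly; a small sketch with a hypothetical href:

```python
from urllib.parse import urljoin

base = "https://www.pearvideo.com/"
# Hypothetical relative href, as the XPath above would return
href = "video_1680000"
detail_url = urljoin(base, href)
print(detail_url)  # https://www.pearvideo.com/video_1680000
```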


- 3. With the four detail-page URLs in hand, we analyze one video in depth; the others work the same way.
The detail page's source contains the video's mp4 URL directly, and that mp4 file is exactly what we want to download. Once we have detail_url, we fetch the page and extract the mp4 link with a regular expression:

ex = 'srcUrl="(.*?)",vdoUrl=srcUrl'
video_url = re.findall(ex,detail_page_text,re.S)[0]
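The regex can be checked against a made-up fragment of the detail page's inline JavaScript (the URL and surrounding variables below are fabricated for illustration):

```python
import re

# Hypothetical excerpt of the detail page source
detail_page_text = 'var contId="168",srcUrl="https://video.pearvideo.com/mp4/fake.mp4",vdoUrl=srcUrl,skinRes="..."'

# Non-greedy group captures everything between srcUrl=" and ",vdoUrl=srcUrl
ex = 'srcUrl="(.*?)",vdoUrl=srcUrl'
video_url = re.findall(ex, detail_page_text, re.S)[0]
print(video_url)  # https://video.pearvideo.com/mp4/fake.mp4
```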

- 4. Finally, request each video's binary content and write it to disk with open(..., "wb").
That is the rough download flow. The tricky part is wrapping the steps into functions, and written serially like this, even short clips take noticeable time; large videos would tie the program up for much longer. Since downloading is I/O-bound, a thread pool lets the requests overlap instead of running one after another.
Import it with from multiprocessing.dummy import Pool; with four videos, create a pool of four threads: pool = Pool(4). The key call is pool.map(func, list), which applies func to every element of list using the pool's worker threads.
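pool.map(func, iterable) applies func to every element concurrently and returns the results in the original order; a minimal demonstration:

```python
from multiprocessing.dummy import Pool  # a thread pool, despite the module name

def square(x):
    return x * x

pool = Pool(4)                          # four worker threads
results = pool.map(square, [1, 2, 3, 4])
pool.close()
pool.join()
print(results)  # [1, 4, 9, 16]
```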

import re
import time
import random
import requests
from lxml import etree
from multiprocessing.dummy import Pool

start_time = time.time()
url = "https://www.pearvideo.com/category_4"
headers={"user-agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36 QIHU 360SE"}
page_text = requests.get(url=url, headers=headers).text
tree = etree.HTML(page_text)
li_list = tree.xpath('//ul[@id="listvideoListUl"]/li')
video_url_list = list()
for li in li_list:
    # Build the absolute detail-page URL from the relative href
    detail_url = "https://www.pearvideo.com/" + li.xpath('./div/a/@href')[0]
    detail_page_text = requests.get(url=detail_url, headers=headers).text
    # Pull the mp4 link out of the page's inline JavaScript
    ex = 'srcUrl="(.*?)",vdoUrl=srcUrl'
    video_url = re.findall(ex, detail_page_text, re.S)[0]
    video_url_list.append(video_url)

def request_video(url):
    '''
    Send a request to the video URL and return its binary content.
    '''
    return requests.get(url=url, headers=headers).content

def save_video(content):
    '''
    Write the video's binary data to a local file.
    '''
    video_name = str(random.randint(100, 999)) + ".mp4"
    with open(video_name, "wb") as f:
        f.write(content)
# Use the thread pool to download the videos' binary data
pool = Pool(4)
# pool.map(func, list)
content_list = pool.map(request_video, video_url_list)
# Use the thread pool to write the binary data to disk
pool.map(save_video, content_list)
pool.close()
pool.join()

print("Elapsed:", time.time() - start_time)
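One caveat in save_video: three-digit random names can collide when several videos download in the same run. A sketch that instead derives the file name from the video URL (the helper name and the URL below are hypothetical):

```python
import os
from urllib.parse import urlparse

def video_name_from_url(url):
    """Use the last path component of the URL as the local file name."""
    return os.path.basename(urlparse(url).path)

# Fabricated mp4 URL for illustration
name = video_name_from_url("https://video.pearvideo.com/mp4/adshort/20200410/123-15.mp4")
print(name)  # 123-15.mp4
```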
For comparison, the serial version without the thread pool, where each video is downloaded and saved one at a time:
import re
import time
import random
import requests
from lxml import etree

start_time = time.time()
url = "https://www.pearvideo.com/category_4"
headers={"user-agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36 QIHU 360SE"}
def request_video(url):
    '''
    Send a request to the video URL and return its binary content.
    '''
    return requests.get(url=url, headers=headers).content

def save_video(content):
    '''
    Write the video's binary data to a local file.
    '''
    video_name = str(random.randint(100, 999)) + ".mp4"
    with open(video_name, "wb") as f:
        f.write(content)

page_text = requests.get(url=url, headers=headers).text
tree = etree.HTML(page_text)
li_list = tree.xpath('//ul[@id="listvideoListUl"]/li')
video_url_list = list()
for li in li_list:
    detail_url = "https://www.pearvideo.com/" + li.xpath('./div/a/@href')[0]
    detail_page_text = requests.get(url=detail_url, headers=headers).text
    ex = 'srcUrl="(.*?)",vdoUrl=srcUrl'
    video_url = re.findall(ex, detail_page_text, re.S)[0]
    video_url_list.append(video_url)

    content = request_video(video_url)
    save_video(content)

print("Elapsed:", time.time() - start_time)
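The speedup from the thread pool comes from overlapping I/O waits. A toy comparison that replaces the network call with time.sleep (the 0.2 s delay is an arbitrary stand-in for a download):

```python
import time
from multiprocessing.dummy import Pool

def fake_download(_):
    time.sleep(0.2)  # stand-in for a blocking network request

# Serial: the four waits happen one after another (~0.8 s)
t0 = time.time()
for i in range(4):
    fake_download(i)
serial = time.time() - t0

# Thread pool: the four waits overlap (~0.2 s)
t0 = time.time()
pool = Pool(4)
pool.map(fake_download, range(4))
pool.close()
pool.join()
threaded = time.time() - t0

print(f"serial {serial:.2f}s, threaded {threaded:.2f}s")
```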


Reposted from www.cnblogs.com/groundcontrol/p/12678125.html