Scraping approach:
- 1. First check whether the page loads its data via Ajax: open the developer tools, clear the Network panel, refresh the page, and look for matching requests under the XHR tab. The tab is empty, which tells us the data is not delivered by Ajax, so we can scrape the listing page URL directly:

```python
url = "https://www.pearvideo.com/category_4"
```

While we are in the developer tools, grab the request headers as well:

```python
headers = {"user-agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36 QIHU 360SE"}
```

![dcacd1d2ddecbcbdb8caf52cc61366b2.png](en-resource://database/541:1)
- 2. The goal is to download the four videos in the first row, so we first need their detail-page URLs. If we fetch the listing page's source with `requests.get`, we find it does not contain the video download links; presumably the mp4 data only becomes available once we open each video's detail page. So the first step is to fetch the listing page's source and use XPath to extract the four detail-page links:
```python
page_text = requests.get(url=url, headers=headers).text
tree = etree.HTML(page_text)
li_list = tree.xpath('//ul[@id="listvideoListUl"]/li')
```
The code above gives us the list of the four detail links, but as the screenshot shows, the extracted URLs are incomplete relative paths, so we have to join them onto the site's base URL ourselves while iterating over the list:
```python
for li in li_list:
    detail_url = "https://www.pearvideo.com/" + li.xpath('./div/a/@href')[0]
```
- 3. With the four detail-page URLs in hand, we analyze one video in detail; the others work the same way. Searching the detail page's source directly turns up the video's mp4 URL, and that mp4 file is exactly what we want to download. After fetching `detail_url`, we can extract the mp4 link with a regular expression:
```python
ex = 'srcUrl="(.*?)",vdoUrl=srcUrl'
video_url = re.findall(ex, detail_page_text, re.S)[0]
```
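As a quick sanity check, the regex can be tried against a made-up fragment mimicking the inline script on a detail page (the snippet below is a hypothetical example, not the real page source):

```python
import re

# Hypothetical fragment standing in for the detail page's inline script
detail_page_text = 'contId="123", srcUrl="https://video.pearvideo.com/mp4/demo.mp4",vdoUrl=srcUrl'

ex = 'srcUrl="(.*?)",vdoUrl=srcUrl'
# re.S lets "." also match newlines, in case the script spans multiple lines
video_url = re.findall(ex, detail_page_text, re.S)[0]
print(video_url)  # https://video.pearvideo.com/mp4/demo.mp4
```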
- 4. Finally, we download each video and save it: fetch the binary content and write it to a file with `with open` in `"wb"` mode.
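The save step can be sketched like this (the dummy bytes below stand in for a downloaded video; the random 3-digit filename mirrors the full script later on):

```python
import os
import random

def save_video(content):
    # Name the file with a random 3-digit number and write the
    # raw bytes to disk in binary ("wb") mode
    video_name = str(random.randint(100, 999)) + ".mp4"
    with open(video_name, "wb") as f:
        f.write(content)
    return video_name

# Usage with dummy bytes in place of real mp4 data
name = save_video(b"\x00\x01fake-mp4-bytes")
print(name, os.path.getsize(name))
```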
That is the overall approach. The tricky part is wrapping the steps into functions, and we also notice that written naively like this, even small videos take a noticeable amount of time; for large videos the sequential downloads could waste a lot of time and memory. A thread pool lets the downloads run concurrently while avoiding the resource cost of spawning full processes.
Import it with `from multiprocessing.dummy import Pool`; since there are four videos, create a pool of four workers with `pool = Pool(4)`. The key method is `pool.map(func, list)`, which applies `func` to every element of `list`.
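`pool.map` behaves like the built-in `map` but spreads the calls across the pool's worker threads; a minimal demo with a toy function in place of the download functions:

```python
from multiprocessing.dummy import Pool  # "dummy" = thread pool with the Pool API

def square(x):
    # Stand-in for request_video/save_video: any one-argument function works
    return x * x

pool = Pool(4)                             # four worker threads, one per video
results = pool.map(square, [1, 2, 3, 4])   # applies square to every element, order preserved
pool.close()
pool.join()
print(results)  # [1, 4, 9, 16]
```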
The complete thread-pool version:

```python
import re
import time
import random
import requests
from lxml import etree
from multiprocessing.dummy import Pool

start_time = time.time()
url = "https://www.pearvideo.com/category_4"
headers = {"user-agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36 QIHU 360SE"}

page_text = requests.get(url=url, headers=headers).text
tree = etree.HTML(page_text)
li_list = tree.xpath('//ul[@id="listvideoListUl"]/li')

video_url_list = list()
for li in li_list:
    detail_url = "https://www.pearvideo.com/" + li.xpath('./div/a/@href')[0]
    detail_page_text = requests.get(url=detail_url, headers=headers).text
    ex = 'srcUrl="(.*?)",vdoUrl=srcUrl'
    video_url = re.findall(ex, detail_page_text, re.S)[0]
    video_url_list.append(video_url)

def request_video(url):
    '''Send a request to the video URL and return its binary content.'''
    return requests.get(url=url, headers=headers).content

def save_video(content):
    '''Save the video's binary data to a local file.'''
    video_name = str(random.randint(100, 999)) + ".mp4"
    with open(video_name, "wb") as f:
        f.write(content)

# Download the videos' binary data with a thread pool
pool = Pool(4)
# pool.map(func, list)
content_list = pool.map(request_video, video_url_list)
# Save the binary data to disk, also via the pool
pool.map(save_video, content_list)

print("Elapsed:", time.time() - start_time)
```
For comparison, the single-threaded version, which downloads and saves each video inside the loop:

```python
import re
import time
import random
import requests
from lxml import etree

start_time = time.time()
url = "https://www.pearvideo.com/category_4"
headers = {"user-agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36 QIHU 360SE"}

def request_video(url):
    '''Send a request to the video URL and return its binary content.'''
    return requests.get(url=url, headers=headers).content

def save_video(content):
    '''Save the video's binary data to a local file.'''
    video_name = str(random.randint(100, 999)) + ".mp4"
    with open(video_name, "wb") as f:
        f.write(content)

page_text = requests.get(url=url, headers=headers).text
tree = etree.HTML(page_text)
li_list = tree.xpath('//ul[@id="listvideoListUl"]/li')

video_url_list = list()
for li in li_list:
    detail_url = "https://www.pearvideo.com/" + li.xpath('./div/a/@href')[0]
    detail_page_text = requests.get(url=detail_url, headers=headers).text
    ex = 'srcUrl="(.*?)",vdoUrl=srcUrl'
    video_url = re.findall(ex, detail_page_text, re.S)[0]
    video_url_list.append(video_url)
    content = request_video(video_url)
    save_video(content)

print("Elapsed:", time.time() - start_time)
```