Crawling short videos with Python

Target site: Pear Video

This article will not be made fully public; its visibility will be restricted. Sorry about that!

First, open the Technology category page: https://www.pearvideo.com/category_8 . Any other category works just as well; pick whichever you like. Hehe...

The page loads its content dynamically, so open the browser's developer tools, switch to the Network panel, and filter by XHR.

Spotting the pattern in the requests is not hard, so I will just post the URL directly (look for it yourself if you want the practice): https://www.pearvideo.com/category_loading.jsp?reqType=5&categoryId=8&start=0

This is the target URL we are looking for. The 0 at the end is the paging parameter. Open this URL on its own and you will find that it returns a static page, which is the easiest case to scrape. (The code below requests categoryId=9; as noted above, any category works.)

The code is as follows:

import os
import re

import parsel
import requests

# Base URL of the videoStatus.jsp endpoint (used later in the article)
target = "https://www.pearvideo.com/videoStatus.jsp?contId="

# Listing URL found among the XHR requests
url = "https://www.pearvideo.com/category_loading.jsp?reqType=5&categoryId=9&start=0"
res = requests.get(url)
res.encoding = "utf-8"
html = parsel.Selector(res.text)
# Each href looks like "video_1703486"
lists = html.xpath('/html/body/li/div/a/@href').getall()
for each in lists:
    print("https://www.pearvideo.com/" + each)

Output:
https://www.pearvideo.com/video_1703486
https://www.pearvideo.com/video_1703189
https://www.pearvideo.com/video_1703161
https://www.pearvideo.com/video_1702880
https://www.pearvideo.com/video_1702773
...
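
The paging rule mentioned earlier can be sketched as a small helper that builds the listing URL for several pages. This is only an illustration: `build_page_urls` is a hypothetical name, and the step of 12 items per page is an assumption that should be checked against the XHR requests in the Network panel.

```python
from urllib.parse import urlencode

BASE = "https://www.pearvideo.com/category_loading.jsp"


def build_page_urls(category_id, pages, step=12):
    """Return listing URLs for the first `pages` pages of a category.

    `step` (how much `start` grows per page) is an assumption; verify it
    against the real requests before relying on it.
    """
    urls = []
    for page in range(pages):
        query = urlencode({
            "reqType": 5,
            "categoryId": category_id,
            "start": page * step,
        })
        urls.append(f"{BASE}?{query}")
    return urls


for page_url in build_page_urls(9, 3):
    print(page_url)
```

Each URL can then be fetched and parsed exactly like the single page above.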

That went smoothly. I then opened a playback page, but could not find the MP4 video anywhere in it. What now? After a lot of effort (and a few dozen torn-out hairs), I found that the video address comes from a different URL.

So we need a way to build this URL. On close inspection it looks unfamiliar; the only recognizable part is the string of digits. Compare it with the digits at the end of the playback-page URLs we obtained earlier and they are exactly the same. That makes things easy: just splice the digits onto the base URL. Look at the code:

for each in lists:
    url_num = each.replace('video_',"")
    urls = target+url_num
    print(urls)

Output:
https://www.pearvideo.com/videoStatus.jsp?contId=1703486
https://www.pearvideo.com/videoStatus.jsp?contId=1703189
https://www.pearvideo.com/videoStatus.jsp?contId=1703161
https://www.pearvideo.com/videoStatus.jsp?contId=1702880
https://www.pearvideo.com/videoStatus.jsp?contId=1702773
https://www.pearvideo.com/videoStatus.jsp?contId=1702633
...
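
The splicing step can also be written as a small function that pulls the digits out with a regular expression, so that an href with a leading slash or a full URL would not break it. This is a minimal sketch: `status_url_from_href` is a hypothetical helper name, and any href shape other than `video_<digits>` is an assumption rather than something observed on the site.

```python
import re

TARGET = "https://www.pearvideo.com/videoStatus.jsp?contId="


def status_url_from_href(href):
    """Turn an href such as 'video_1703486' into the videoStatus.jsp URL.

    Returns None when the href contains no 'video_<digits>' part.
    """
    match = re.search(r"video_(\d+)", href)
    if match is None:
        return None
    return TARGET + match.group(1)


print(status_url_from_href("video_1703486"))
```

Returning None for an unexpected href lets the loop skip that entry instead of requesting a malformed URL.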

The URLs come out fine, though they look slightly different from the real request, which has an extra &mrd=***************** parameter at the end. Ours does not. So what? If it is missing, it is missing. Readers of my earlier Baidu image post will know that some parameters in a request are unnecessary and seem to exist purely to annoy people who write crawlers. Fair enough, though: we are the ones crawling someone else's data.

The URL problem is solved, but opening one of these links does not return the data we want.

Clearly we have run into an anti-crawling mechanism. This is easy to handle: give the server what it wants, namely a User-Agent and a Referer header. The code is as follows:

    # This goes inside the for loop above
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.111 Safari/537.36',
        'Referer': 'https://www.pearvideo.com/video_' + str(url_num)
    }
    html = requests.get(urls, headers=headers).text
    print(html)

Done!
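
The header trick can be wrapped in a small helper so that the Referer always matches the video being requested. This is a sketch under the same assumptions as the snippet above; `build_headers` is a hypothetical name, and the User-Agent string is simply the one used in this article.

```python
USER_AGENT = (
    "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/86.0.4240.111 Safari/537.36"
)


def build_headers(cont_id):
    """Build request headers whose Referer points at the video's playback page."""
    return {
        "User-Agent": USER_AGENT,
        "Referer": f"https://www.pearvideo.com/video_{cont_id}",
    }


headers = build_headers("1703486")
print(headers["Referer"])
```

The returned dict is what gets passed as `headers=` to `requests.get` for the videoStatus.jsp URL.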

Finally, let's check whether the MP4 actually plays.
Damn, a 404! This part is a little more trouble. The srcUrl in the returned data contains a timestamp, and it has to be replaced with 'cont-' followed by the video number. I have written a lot already and my hands are getting tired, so here is the complete code:


import os
import re

import parsel
import requests

target = "https://www.pearvideo.com/videoStatus.jsp?contId="

url = "https://www.pearvideo.com/category_loading.jsp?reqType=5&categoryId=9&start=0"
res = requests.get(url)
res.encoding = "utf-8"
html = parsel.Selector(res.text)
lists = html.xpath('/html/body/li/div/a/@href').getall()

# Make sure the output directory exists before writing into it
os.makedirs("./images", exist_ok=True)

# Extract the number after "video_"; it is the key piece and is needed for both the Referer and urls
for each in lists:
    url_num = each.replace('video_', "")
    urls = target + url_num
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.111 Safari/537.36',
        'Referer': 'https://www.pearvideo.com/video_' + str(url_num)
    }
    html = requests.get(urls, headers=headers).text

    cont = 'cont-' + str(url_num)

    # Extract the MP4 address
    srcUrl = re.findall('"srcUrl":"(.*?)"', html)[0]
    # Replace the timestamp in the address with cont-<number> to get a URL that actually plays
    new_url = srcUrl.replace(srcUrl.split("-")[0].split("/")[-1], cont)
    print(new_url)

    # Use the last path segment of the address as the file name
    filename = srcUrl.split("/")[-1]

    # Save to disk
    with open("./images/" + filename, "wb") as f:
        f.write(requests.get(new_url).content)
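
The two string steps in the script above (pulling out srcUrl and swapping the timestamp) can be isolated into functions that are easy to check without touching the network. This is a sketch, not the site's documented behaviour: `extract_src_url` and `fix_src_url` are hypothetical names, the sample response and its video.example.com address are made up for illustration, and this variant only rewrites the last path segment instead of calling `replace` on the whole URL.

```python
import re


def extract_src_url(text):
    """Return the first "srcUrl" value found in the response text, or None."""
    match = re.search(r'"srcUrl":"(.*?)"', text)
    return match.group(1) if match else None


def fix_src_url(src_url, cont_id):
    """Replace the leading timestamp of the file name with 'cont-<cont_id>'.

    Only the part of the last path segment before its first '-' is swapped;
    a URL whose file name has no '-' is returned unchanged.
    """
    head, slash, tail = src_url.rpartition("/")
    stamp, dash, rest = tail.partition("-")
    if not dash:
        return src_url
    return f"{head}{slash}cont-{cont_id}{dash}{rest}"


# Made-up response fragment, only for illustration
sample = '{"videos":{"srcUrl":"https://video.example.com/mp4/20201026/1603700000000-15446688_hd.mp4"}}'
src = extract_src_url(sample)
if src is not None:
    print(fix_src_url(src, "1703486"))
```

Checking for None before indexing avoids the IndexError that `re.findall(...)[0]` raises when a response has no srcUrl.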


If anything is unclear, leave a comment and we can all discuss it together.


Origin blog.csdn.net/weixin_51211600/article/details/109289024