Python crawler: Pippi shrimp short video download without watermark (new version)

The crawling rules described in this blog post were last updated on 2021-2-28.

Reminder: please credit the author and include the original link when reprinting!

CSDN personal homepage: high IQ idiot
Original address: https://blog.csdn.net/qq_44700693/article/details/113826111


A small greeting

I don't know whether those of you reading this have seen my earlier post, "Python crawler: Pippi shrimp short video download without watermark".

In that article, I found the web version's watermark-free video API by opening a sharing link, but it did not last long: it stopped working after a few months.

I took that approach because, when a sharing link was opened in a browser, the web page itself loaded the watermark-free video, so parsing the web page was naturally the best option. Now, however, the web page loads the watermarked video by default, and the video link field in the original watermark-free web interface has become a null value.

Since then, many people have sent me private messages asking whether there is an updated method. I studied the web version again, but without success: whether it was the encryption of the parameters or the meaning of the request fields, I was at a loss.

Analysis process

As I said in the greeting above, I previously chose to parse the web interface because the web page itself loaded the watermark-free video. Now both the mobile and the desktop web pages load the watermarked video, so this time we start directly from the app.

A tip before getting started: Pippi Shrimp's Android client uses certificate pinning. For how to get around this, see my other blog post, "Fiddler: Summary of the new and old versions of Fiddler".

Packet capture configuration:
       Fiddler Everywhere
       Night God Emulator (NoxPlayer), Android 5

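Certificate pinning (SSL/TLS pinning), as the name suggests, builds the SSL/TLS certificate provided by the server into the mobile app client. When the client makes a request, it compares the built-in certificate with the certificate presented by the server to decide whether the connection is legitimate.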
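
To make the comparison step concrete, here is a minimal sketch of a pinning check. It is an illustration under assumptions, not how the Pippi Shrimp client necessarily works: it compares SHA-256 fingerprints rather than whole certificates, and the names `cert_fingerprint` and `is_pinned` and the sample bytes are hypothetical.

```python
import hashlib
import hmac


def cert_fingerprint(der_bytes: bytes) -> str:
    """Return the SHA-256 fingerprint (hex) of a DER-encoded certificate."""
    return hashlib.sha256(der_bytes).hexdigest()


def is_pinned(server_cert_der: bytes, pinned_fingerprint: str) -> bool:
    """Compare the server certificate's fingerprint with the built-in one."""
    actual = cert_fingerprint(server_cert_der)
    # compare_digest avoids leaking information through timing differences.
    return hmac.compare_digest(actual, pinned_fingerprint.lower())


# Stand-in bytes for illustration only; a real client would use the DER bytes
# of the certificate presented during the TLS handshake.
bundled_cert = b"certificate shipped inside the app"
pinned = cert_fingerprint(bundled_cert)

genuine_ok = is_pinned(bundled_cert, pinned)
proxy_ok = is_pinned(b"certificate generated by a capture proxy", pinned)
print(genuine_ok, proxy_ok)
```

This is why a capture proxy such as Fiddler is rejected by a pinned client: the certificate the proxy presents does not match the one built into the app.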

When everything is ready, we can start capturing and analyzing packets.
Here I have to mention the big pit I fell into during my previous attempts. I really don't know whether the problem was me or whether this API is just too confusing.
I almost always started from the same API, because I found the following field in its response body:
origin_video_download
Isn't this the watermark-free video link field from the web page?
I picked one of the links at random, opened it in the browser, and... it worked?
Even so, the request parameters are another matter. Never mind the request header fields; the request body alone was more than I could handle.
I could have tried to work it out from experience, but in the end I simply failed.

My next few attempts all got stuck at this same point. I was a bit unwilling to accept it, but that is the truth.


Main code

This time, though, I found a new interface.
Continuing from the capture above: after the app finished loading, I cleared all the captured packets and then, in the emulator, tapped into the details page of a video.
Then I stopped capturing packets.

Next, I slowly searched the captured traffic for useful information and analyzed the API I found.

Here is a small script I wrote that quickly converts a raw header string into a dictionary:

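def parse_header(s_header):
    """Convert raw 'Key: value' lines into a dict."""
    form = {}
    # splitlines() also handles a final line that has no trailing newline.
    for line in s_header.splitlines():
        # Split on the first colon only, so values keep their own colons.
        key, sep, value = line.partition(':')
        if sep and key.strip():
            form[key.strip()] = value.strip()
    return form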

It can be used directly if needed.
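
The same idea can be applied to the query string of a captured URL. The sketch below uses the standard library's `urllib.parse.parse_qsl`, which also decodes percent-escapes such as `%2F`. The function name `parse_query` and the sample string are hypothetical; the values are borrowed from the parameters used later in this post.

```python
from urllib.parse import parse_qsl


def parse_query(raw_query: str) -> dict:
    """Turn 'a=1&b=2' into {'a': '1', 'b': '2'}, decoding %XX escapes."""
    # keep_blank_values=True keeps empty parameters such as 'last_channel='.
    return dict(parse_qsl(raw_query.lstrip('?'), keep_blank_values=True))


captured = 'cell_id=6884917158271260935&time_zone=Asia%2FShanghai&last_channel='
params = parse_query(captured)
print(params)
```

Decoding matters because `requests` URL-encodes the `params` dictionary itself, so a value that is still percent-encoded would be encoded a second time.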

So I got the following version of the code:

import requests

api_url = 'https://i-lq.snssdk.com/bds/cell/cell_comment/'
headers = {
    'Accept-Encoding': 'gzip',
    'X-SS-QUERIES': '******',
    'X-SS-REQ-TICKET': '1614499328380',
    'x-vc-bdturing-sdk-version': '2.0.1',
    'passport-sdk-version': '30',
    'sdk-version': '2',
    'User-Agent': 'ttnet okhttp/3.10.0.2',
    'Cookie': '******',
    'X-Khronos': '1614499329',
    'X-Gorgon': '040400a50005d14c4b04f9fa5ac0c9ec9617070b6fcfe1bff0f2',
    'Host': 'i-lq.snssdk.com',
    'Connection': 'Keep-Alive'
}

param = {
    'cell_type': '1',
    'cell_id': '6884917158271260935',
    'offset': '0',
    'api_version': '1',
    'iid': '1284657529238599',
    'device_id': '1179092996603726',
    'ac': 'wifi',
    'mac_address': '54:BF:64:48:8C:87',  # decoded: requests URL-encodes params itself
    'channel': 'baidu',
    'aid': '1319',
    'app_name': 'super',
    'version_code': '331',
    'version_name': '3.3.1',
    'device_platform': 'android',
    'ssmix': 'a',
    'device_type': 'LIO-AN00',
    'device_brand': 'Android',
    'language': 'zh',
    'os_api': '22',
    'os_version': '5.1.1',
    'uuid': '863064547881401',
    'openudid': '5c2a72be0ea16ab0',
    'manifest_version_code': '331',
    'resolution': '900*1600',
    'dpi': '320', 
    'update_version_code': '33150',
    '_rticket': '1614499328336',
    'cdid': '6e3fcc11-5cc4-494f-acbe-d9887dd59e08',
    'app_region': 'CN',
    'sys_region': 'CN',
    'time_zone': 'Asia/Shanghai',  # decoded: requests URL-encodes params itself
    'app_language': 'ZH',
    'carrier_region': 'CN',
    'last_channel': '',
    'last_update_version_code': '0',
    'ts': '1614499328'
}

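def parse_url():
    # Replay the captured request; fail fast on timeouts or HTTP errors.
    response = requests.get(api_url, headers=headers, params=param, timeout=10)
    response.raise_for_status()
    # The video object sits under the first entry of 'cell_comments'.
    video = response.json()['data']['cell_comments'][0]['comment_info']['item']['video']
    video_name = video['text']
    # Take the first URL listed under 'video_high'.
    video_url = video['video_high']['url_list'][0]['url']
    print("video_name:" + video_name)
    print("video_url:" + video_url)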

if __name__ == '__main__':
    parse_url()

Output:

video_name:像极了当年没有手机的自己
video_url:http://v3-ppx.ixigua.com/a7e2629a5d88663e2f4ae8804fb67e01/603b62d7/video/m/220de8d705970384c929ba5f75e150f20b9116625db80000645ef6706767/?a=1319&br=1280&bt=320&cd=0%7C0%7C0&ch=0&cr=0&cs=0&cv=1&dr=6&ds=6&er=&l=202102281630490101351550433E018EB7&lr=&mime_type=video_mp4&pl=0&qs=0&rc=am80ZzNxamQ1eDMzaWYzM0ApPDU0NDdnaGQ8Nzw7ZzdnPGcyay5mZG8yNDNfLS1jMTBzczMzYC5gXmJjYDUyNGI2YjE6Yw%3D%3D&vl=&vr=
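
The chained lookup in `parse_url` raises an exception as soon as any key is missing or a list is empty. A more defensive variant might look like the sketch below. The helper name `extract_video` is hypothetical, and the sample dictionary is hand-built to mirror only the fields the code above reads; it is not real API output.

```python
from typing import Optional, Tuple


def extract_video(payload: dict) -> Optional[Tuple[str, str]]:
    """Return (title, url) from a cell_comment response, or None if absent."""
    try:
        video = payload['data']['cell_comments'][0]['comment_info']['item']['video']
        return video['text'], video['video_high']['url_list'][0]['url']
    except (KeyError, IndexError, TypeError):
        # A missing key, an empty list, or a null field all mean "no video".
        return None


# Hand-built sample that mirrors only the fields read above; not real API output.
sample = {'data': {'cell_comments': [{'comment_info': {'item': {'video': {
    'text': 'demo title',
    'video_high': {'url_list': [{'url': 'http://example.com/v.mp4'}]},
}}}}]}}

print(extract_video(sample))
print(extract_video({'data': {'cell_comments': []}}))
```

Returning `None` lets the caller report "no video found" instead of crashing on a `KeyError`.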

After checking, the link obtained really is watermark-free, so the next step would be to analyze the parameters...
Actually, there is no need to analyze them.
Many of the parameters are not required. After removing them one at a time and retrying, I arrived at the second version of the code:

import requests

path = "./PPX/"

api_url = 'https://i-lq.snssdk.com/bds/cell/cell_comment/'
headers = {
    'User-Agent': 'ttnet okhttp/3.10.0.2',
    'Host': 'i-lq.snssdk.com',
    'Connection': 'Keep-Alive'
}

param = {
    'cell_id': '6884917158271260935',
    'aid': '1319',
    'app_name': 'super',
}

def parse_url():
    # Same request as before, with only the parameters that proved necessary.
    response = requests.get(api_url, headers=headers, params=param, timeout=10)
    response.raise_for_status()
    video = response.json()['data']['cell_comments'][0]['comment_info']['item']['video']
    video_name = video['text']
    video_url = video['video_high']['url_list'][0]['url']
    print("video_name:" + video_name)
    print("video_url:" + video_url)

if __name__ == '__main__':
    parse_url()

Fewer than 30 lines of code...
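
The trial-and-error trimming of parameters described above could be automated roughly as follows. This is only a sketch: `minimize_params` and `fake_check` are hypothetical names, and `fake_check` is an offline stand-in that pretends the server needs exactly the three parameters kept in the second version.

```python
from typing import Callable, Dict


def minimize_params(params: Dict[str, str],
                    still_works: Callable[[Dict[str, str]], bool]) -> Dict[str, str]:
    """Drop every parameter whose removal leaves the request working."""
    kept = dict(params)
    for key in list(params):
        trial = {k: v for k, v in kept.items() if k != key}
        if still_works(trial):
            kept = trial  # the request survived without this key, so drop it
    return kept


# Offline stand-in for a real request: pretend the server needs only these keys.
def fake_check(trial: Dict[str, str]) -> bool:
    return {'cell_id', 'aid', 'app_name'} <= trial.keys()


full = {'cell_type': '1', 'cell_id': '6884917158271260935', 'offset': '0',
        'aid': '1319', 'app_name': 'super', 'ts': '1614499328'}
minimal = minimize_params(full, fake_check)
print(minimal)
```

In real use, `still_works` would send the request with the trial parameters and check that the response still contains a video URL, ideally with a short pause between attempts.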

To make it more robust and compatible with sharing links, here is the third version of the code:

import os
import random
import requests

path = "./PPX/"


class PpxNew:
    api_url = 'https://i-lq.snssdk.com/bds/cell/cell_comment/'
    headers = {
        'User-Agent': 'ttnet okhttp/3.10.0.2',
        'Host': 'i-lq.snssdk.com',
        'Connection': 'Keep-Alive'
    }

    def __init__(self, s_url):
        if '/item/' in s_url:
            # Full link: the cell id is the last path segment.
            self.cell_id = s_url.split('?')[0].rstrip('/').split('/')[-1]
        elif '/s/' in s_url:
            # Short sharing link: follow the redirect to get the full link first.
            self.rel_url = requests.get(s_url, headers={
                'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'}, timeout=10).url
            self.cell_id = self.rel_url.split('?')[0].rstrip('/').split('/')[-1]
        else:
            raise ValueError('Unrecognized link: ' + s_url)
        self.param = {
            'cell_id': self.cell_id,
            'aid': '1319',
            'app_name': 'super',
        }

    def parse_url(self):
        response = requests.get(self.api_url, headers=self.headers, params=self.param, timeout=10)
        video = response.json()['data']['cell_comments'][0]['comment_info']['item']['video']
        # Use the title (at most 20 characters) as the file name, or a random number if it is empty.
        video_name = video['text'][:20]
        if video_name == '':
            video_name = str(int(random.random() * 2 * 1000))
        video_url = video['video_high']['url_list'][0]['url']
        # Create the output directory if needed; open() does not create it.
        os.makedirs(path, exist_ok=True)
        with open(path + video_name + ".mp4", 'wb') as fp:
            fp.write(requests.get(video_url, timeout=30).content)
        print("[Pippi Shrimp]: {}.mp4 watermark-free video download complete!".format(video_name))


if __name__ == '__main__':
    s_url = 'https://h5.pipix.com/s/eJXwbxC/'
    PpxNew(s_url).parse_url()
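
One thing the code above does not handle is a video title containing characters that are not allowed in file names, such as `/` or `?`. A possible helper is sketched below. The name `safe_filename`, the fallback value, and the exact set of replaced characters are assumptions for illustration.

```python
import re


def safe_filename(title: str, max_len: int = 20, fallback: str = 'video') -> str:
    """Make a title usable as a file name on common file systems."""
    # Replace characters that Windows or POSIX file systems reject.
    cleaned = re.sub(r'[\\/:*?"<>|\r\n\t]', '_', title)
    # Truncate, then trim spaces and dots from both ends.
    cleaned = cleaned[:max_len].strip(' .')
    return cleaned or fallback


print(safe_filename('a/b:c?'))
print(safe_filename(''))
```

The result could be used in place of the raw title when building the output path.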
