Python crawler: multi-platform short video to watermark downloader

The crawling approach described in this tutorial was finalized on October 26, 2020.

Solemn reminder: the techniques introduced in this article are for learning purposes only and must not be used to maliciously attack short video platforms. You bear sole responsibility for any loss caused to the platforms' servers.

The embedded video is compressed; the original HD version is at: https://www.bilibili.com/video/BV1cr4y1w7yz/

Please credit the author and link to the original when reprinting~

Author: West Ya Man
CSDN profile: West Ya Man
Article address: https://blog.csdn.net/qq_41707308/article/details/109293116

Features

This tool is written in Python. The requests library handles the data crawling, and the GUI is built with PyQt5 so that non-technical users can use it easily:

  1. Complete watermark removal, so you no longer have to put up with watermarks and annoying credits on the videos you save;
  2. Multi-platform support: Douyin, Kuaishou, Weishi, and Pipixia, covering the short video apps most users rely on;
  3. A GUI designed for a quick, easy experience that even complete beginners can use;
  4. A progress bar, so you can see at a glance how far the program has run;
  5. Data crawling runs in separate threads so the interface does not appear to freeze.
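
The multithreading point deserves a quick illustration. The actual tool uses PyQt5, where a QThread plus signals would be the idiomatic choice; as a minimal, GUI-free sketch (the function names here are hypothetical, not from the original code), the idea is simply to move the blocking download off the main thread:

```python
import threading

def download_video(url, results):
    # Stand-in for the real network request; here we only record the URL
    # so the sketch runs without network access.
    results.append(url)

def start_download_in_background(url):
    """Run the download in a worker thread so the UI thread stays responsive."""
    results = []
    worker = threading.Thread(target=download_video, args=(url, results))
    worker.start()
    worker.join()  # a real GUI would use signals/callbacks instead of blocking
    return results

print(start_download_in_background("https://example.com/video.mp4"))
```

In the real program, the blocking `join` would be replaced by a completion signal that updates the progress bar.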

Breakdown of each platform

Douyin

First get the share link: open a video in Douyin (any shareable video works the same way), tap the share button in the lower right corner, and tap Copy Link.

How to get the link:
(screenshot omitted)

Copy link:
(screenshot omitted)

Looking at the copied text, you can see that it contains not just a URL but also some descriptive text. So the first step is to extract the URL and a file name from the copied share text:

    import re

    def compile_name_url(url_text):
        # Match the share text: the characters before the first
        # whitespace become the file name
        video_name = re.match(r'\S*', url_text)
        if video_name:
            video_name = video_name.group()
            print(video_name)

        # Match the share text again to pull out the short URL
        first_url = re.search(r'https://v.douyin.com/.*?/', url_text)
        if first_url:
            first_url = first_url.group()
            print('First URL ==>', first_url)

        return first_url, video_name
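
To see what the two regexes extract, here is a standalone run on a fabricated share string (the text Douyin actually copies has the same shape: a title, the short URL, then boilerplate):

```python
import re

# Fabricated Douyin share text; only the shape matters.
sample = "美好生活 https://v.douyin.com/AbCdEf/ 复制此链接,打开抖音,直接观看视频!"

video_name = re.match(r'\S*', sample).group()  # text before the first whitespace
first_url = re.search(r'https://v.douyin.com/.*?/', sample).group()

print(video_name)  # 美好生活
print(first_url)   # https://v.douyin.com/AbCdEf/
```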

Copy the link into a browser and capture the traffic: the target video link is in the JSON response shown in the third screenshot, and you only need to read the corresponding field.

(screenshots omitted)

However, it is not quite that simple. If you copy the captured URL straight into the browser, the result still has a watermark (see below). Look closely at the URL and you will notice a playwm segment; wm is short for "watermark". Remove the wm and request the URL again, and you get the watermark-free address. That's all there is to it!

(screenshot omitted)

Code:

    import re
    import requests

    def first_request(first_url):
        headers = {
            'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
            'accept-encoding': 'gzip, deflate, br',
            'accept-language': 'zh-CN,zh;q=0.9',
            'sec-fetch-dest': 'document',
            'sec-fetch-mode': 'navigate',
            'sec-fetch-site': 'none',
            'upgrade-insecure-requests': '1',
            'user-agent': 'Mozilla/5.0 (Linux; Android 5.0; SM-G900P Build/LRX21T) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4195.1 Mobile Safari/537.36'
        }

        response = requests.get(first_url, headers=headers)

        print('Second URL ==>', response.url)  # the URL after the first redirect

        # Extract the numeric video id needed by the second request
        content = re.search(r'\d+', response.url).group()
        params = {"item_ids": content}

        return params

    def second_request(params):
        response = requests.get('https://www.iesdouyin.com/web/api/v2/aweme/iteminfo/', params=params)
        result = response.json()  # JSON from the second request; it holds the URL for the third request
        # print(result)
        second_url = result['item_list'][0]['video']['play_addr']['url_list'][0]

        # Drop the "wm" (watermark) suffix to get the watermark-free address
        play_url = re.sub(r'playwm', 'play', second_url)
        print('Third URL ==>', play_url)
        return play_url
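
The article stops at the watermark-free play_url, so saving the video to disk is left implicit. A hedged sketch of that last step (the helper name and the file-name sanitizing are my own additions, not from the original tool):

```python
import re

def save_video(content: bytes, video_name: str, suffix: str = ".mp4") -> str:
    """Write downloaded bytes to disk, stripping characters that are
    illegal in file names. Hypothetical helper, not from the original code."""
    safe_name = re.sub(r'[\\/:*?"<>|]', '', video_name) or "video"
    path = safe_name + suffix
    with open(path, "wb") as f:
        f.write(content)
    return path

# In the real flow the bytes would come from, e.g.:
#     data = requests.get(play_url, headers=headers).content
path = save_video(b"fake video bytes", "demo:video")
print(path)  # demovideo.mp4
```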

Kuaishou

The Kuaishou platform is simpler. Get the share link the same way:

How to get the link:
(screenshot omitted)

Copy link:
(screenshot omitted)

Extract the URL and video name the same way as for Douyin (the code is the same, so it is not repeated here), and go straight to analyzing the response to the link:

(screenshots omitted)

The returned page source contains a watermark-free video URL under the telling name "srcNoMark" (no-watermark source); just request that URL. Then we can enjoy Miss Ai Bei's singing!

Code:

    import re
    import requests

    def second_request(second_url):
        headers = {
            'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
            'Accept-Encoding': 'gzip, deflate, br',
            'Accept-Language': 'zh-CN,zh;q=0.9',
            'Connection': 'keep-alive',
            'Cookie': 'did=web_209e6a4e64064f659be838aca3178ec1; didv=1603355622000',
            'Host': 'c.kuaishou.com',
            'Sec-Fetch-Dest': 'document',
            'Sec-Fetch-Mode': 'navigate',
            'Sec-Fetch-Site': 'none',
            'Upgrade-Insecure-Requests': '1',
            'User-Agent': 'Mozilla/5.0 (Linux; Android 5.0; SM-G900P Build/LRX21T) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4195.1 Mobile Safari/537.36'
        }
        response = requests.get(second_url, headers=headers)

        content = response.text
        # "srcNoMark" marks the watermark-free source; slice off the
        # '"srcNoMark":"' prefix (13 characters) and the trailing quote
        video_url = re.search(r'"srcNoMark":"https://txmov2.a.yximgs.com/.*?\"', content).group()[13:-1]
        print(video_url)

        return video_url
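
The [13:-1] slice above is easy to misread, so here is the same extraction run on a fabricated fragment shaped like the Kuaishou page source (the URL is made up):

```python
import re

# Fabricated fragment in the same shape as the Kuaishou page source.
content = '{"srcNoMark":"https://txmov2.a.yximgs.com/some/video.mp4"}'

match = re.search(r'"srcNoMark":"https://txmov2.a.yximgs.com/.*?"', content)
# Drop the '"srcNoMark":"' prefix (13 characters) and the trailing quote
video_url = match.group()[13:-1]
print(video_url)  # https://txmov2.a.yximgs.com/some/video.mp4
```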

Weishi

Same routine: get the link first.

How to get the link:
(screenshot omitted)

Copy link:
(screenshot omitted)

Next, paste the link into a browser and capture the traffic. The video URL turns out to be stored in the JSON returned by one of the requests, and that request is a POST with parameters, so the page has to be analyzed further.

(screenshots omitted)

Go back to the first request to look for those parameters. On close inspection, the required feedid parameter is actually part of the share link itself, so the crawling strategy changes: extract the link first, read the parameter from it, then make the request for the video.

(screenshot omitted)

Code:

    import re
    import requests

    def compile_name_url(url_text):
        video_name = re.findall(r'(\w*)', url_text)   # first word of the share text
        feedid = re.findall(r'feed/(\w*)', url_text)  # the id hidden in the URL path
        if video_name and feedid:
            video_name = video_name[0]
            feedid = feedid[0]

        return feedid, video_name

    def first_request(feedid):
        headers = {
            'accept': 'application/json',
            'accept-encoding': 'gzip, deflate, br',
            'accept-language': 'zh-CN,zh;q=0.9',
            # note: no hard-coded 'content-length'; requests computes it itself
            'content-type': 'application/json',
            'cookie': 'RK=BuAQ1v+yV3; ptcz=6f7072f84fa03d56ea047b407853df6a5375d719df1031ef066d11b09fb679e4; pgv_pvi=8434466816; pgv_pvid=1643353500; tvfe_boss_uuid=3b10306bf3ae662b; o_cookie=1074566721; pac_uid=1_1074566721; ied_qq=o1074566721; LW_sid=k1Y5n9s3Y0K866h7P246v4k6o8; LW_uid=u1v5i9V3p0L806m7R226s4W7F1; eas_sid=J1p5G9s3A0h8Z6c7l2a6x4E7w7; iip=0; ptui_loginuin=1074566721; person_id_bak=5881015637151283; person_id_wsbeacon=5920911274348544; wsreq_logseq=341295039',
            'origin': 'https://h5.weishi.qq.com',
            'sec-fetch-dest': 'empty',
            'sec-fetch-mode': 'cors',
            'sec-fetch-site': 'same-origin',
            'user-agent': 'Mozilla/5.0 (Linux; Android 5.0; SM-G900P Build/LRX21T) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4195.1 Mobile Safari/537.36',
            'x-requested-with': 'XMLHttpRequest'
        }
        rejson = {
            'datalvl': "all",
            'feedid': feedid,
            'recommendtype': '0',
            '_weishi_mapExt': '{}'
        }
        first_url = 'https://h5.weishi.qq.com/webapp/json/weishi/WSH5GetPlayPage'
        response = requests.post(first_url, headers=headers, json=rejson)
        result = response.json()
        video_url = result['data']['feeds'][0]['video_url']

        return video_url
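
A quick offline check of the feedid extraction, on a fabricated share text (only the "feed/<id>" shape in the URL path is assumed, which is all the regex relies on):

```python
import re

# Fabricated Weishi share text; the feedid sits after "feed/" in the URL path.
sample = "好听的歌 https://h5.weishi.qq.com/weishi/feed/78abcDEF123xyz/wsfeed"

video_name = re.findall(r'(\w*)', sample)[0]   # first word of the share text
feedid = re.findall(r'feed/(\w*)', sample)[0]
print(video_name, feedid)  # 好听的歌 78abcDEF123xyz
```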

Pipixia (皮皮搞笑)

Same first step: get the link.

How to get the link:
(screenshot omitted)

Copy link:
(screenshot omitted)

Looking at the shared text, this time the video's name is not included; all video names have to come from the response. First paste the link into the browser and analyze the data.

(screenshots omitted)

This too turns out to be a POST request that needs parameters. Drawing on the Weishi experience, a careful look at the original share link shows that, sure enough, the two changing parameters are hidden in it. Extract the pid and mid parameters, then make the request to get the video.

(screenshot omitted)

Code:

    import re
    import requests

    def compile_name_url(url_text):
        headers = {
            'Host': 'share.ippzone.com',
            'Origin': 'http://share.ippzone.com',
            'Referer': url_text,
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.105 Safari/537.36 Edg/84.0.522.52'
        }
        # Both changing parameters are hidden in the share link
        mid = re.findall(r'mid=(\d*)', url_text)
        pid = re.findall(r'pid=(\d*)', url_text)
        if mid and pid:
            mid = int(mid[0])
            pid = int(pid[0])

        payload = {
            'mid': mid,
            'pid': pid,
            'type': 'post'
        }
        url = 'http://share.ippzone.com/ppapi/share/fetch_content'
        r = requests.post(url, headers=headers, json=payload)
        result = r.json()
        video_name = result['data']['post']['content'].replace(' ', '')
        video_url = result['data']['post']['videos'][str(result['data']['post']['imgs'][0]['id'])]['url']
        return video_url, video_name
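
And the matching offline check for Pipixia: pulling mid and pid out of a fabricated share link (only the query-string shape is assumed):

```python
import re

# Fabricated Pipixia share link; both changing parameters live in the query string.
sample = "http://share.ippzone.com/pp/post?pid=123456&mid=789012"

mid = int(re.findall(r'mid=(\d*)', sample)[0])
pid = int(re.findall(r'pid=(\d*)', sample)[0])
print(mid, pid)  # 789012 123456
```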

Summary

This is my first blog post as a programming beginner, so the writing may not be great or entirely clear. This is purely a technical exchange: the content shared is for learning and research only, and malicious attacks on video platform servers are not allowed.

Finally, if you reprint this, please credit the source~

