A simple use of Python to capture videos from the Douyin app

This post records how to use Python to crawl data from a mobile app, using the Douyin video app (the Chinese version of TikTok) as an example.

Programming tool: PyCharm

App packet capture tool: mitmproxy

App automation tool: Appium

Operating environment: Windows 10

Ideas:

Assuming the tools we need are already configured, the plan is:

1. Use mitmproxy to capture the mobile app's traffic and get the content we want

2. Use the Appium automated testing tool to drive the app and simulate human actions (swiping, tapping, etc.)

3. Combine 1 and 2 to get an automated crawler

1. mitmproxy/mitmdump packet capture

Make sure that mitmproxy is installed, that the phone and the PC are on the same local area network, and that the mitmproxy CA certificate is configured. There are many configuration tutorials online, so I will skip that here.

Because the mitmproxy console interface does not support Windows, we use one of its components, mitmdump, instead. It is the command-line interface of mitmproxy, and it can load our Python scripts so that captured traffic is processed in Python.

After configuring mitmproxy, run mitmdump in the console and open the Douyin app on the phone. mitmdump will print every request the phone makes, as shown below.

Scroll through the videos in the Douyin app and watch the requests displayed by mitmdump. You will find URLs with the following prefixes:

http://v1-dy.ixigua.com/; http://v3-dy.ixigua.com/; http://v9-dy.ixigua.com/

URLs with these 3 prefixes are our target Douyin video URLs.
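The prefix check described above can be sketched in a few lines. This is only a minimal illustration: the names VIDEO_PREFIXES and is_video_url are hypothetical, and the prefix list reflects just the three hosts observed here, so it may need updating if the app changes its video servers.

```python
# Prefixes observed in the mitmdump output (assumption: they may change over time)
VIDEO_PREFIXES = (
    "http://v1-dy.ixigua.com/",
    "http://v3-dy.ixigua.com/",
    "http://v9-dy.ixigua.com/",
)


def is_video_url(url):
    """Return True if the URL starts with one of the observed video prefixes."""
    # str.startswith accepts a tuple and checks every prefix in one call
    return url.startswith(VIDEO_PREFIXES)
```

Matching on the full prefix rather than on a substring avoids false positives when a host name merely appears somewhere in another URL's path.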

Next, we write a Python script to download the videos. Run it with mitmdump -s scripts.py (scripts.py is the name of the Python file).

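import requests

# Directory where the videos are saved
path = 'D:/video/'
# Counter used to name the video files
num = 1788


def response(flow):
    global num
    # In testing, video URLs mainly used these 3 prefixes
    target_urls = ['http://v1-dy.ixigua.com/', 'http://v9-dy.ixigua.com/',
                   'http://v3-dy.ixigua.com/']
    for url in target_urls:
        # Filter out URLs we do not need
        if flow.request.url.startswith(url):
            # Build the video file name
            filename = path + str(num) + '.mp4'
            # Fetch the content of the video URL with requests
            # stream=True defers downloading the response body until
            # the Response.content attribute is accessed
            res = requests.get(flow.request.url, stream=True)
            # Write the video to the folder ('wb' so a rerun overwrites
            # an existing file instead of appending to it)
            with open(filename, 'wb') as f:
                f.write(res.content)
                print(filename + ' downloaded')
            num += 1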

The code is fairly rough, but the basic logic is clear, and it lets us download Douyin videos. This method has one flaw: someone has to keep swiping to the next video in Douyin for new videos to be captured. We can solve this with Appium, a powerful automated testing tool.
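As a rough alternative to the script above, the body that mitmdump has already captured can be written to disk directly, which avoids requesting each video a second time. This is a sketch under assumptions, not the author's method: it assumes mitmdump buffers the full response body (its default when response streaming is not enabled) and that the app fetched the file as a single 200 response. The names VIDEO_DIR, VIDEO_PREFIXES, counter, and save_video are hypothetical.

```python
import os

# Hypothetical output directory; adjust to your own setup
VIDEO_DIR = "D:/video/"
# Prefixes observed in the article (assumption: they may change over time)
VIDEO_PREFIXES = (
    "http://v1-dy.ixigua.com/",
    "http://v3-dy.ixigua.com/",
    "http://v9-dy.ixigua.com/",
)
counter = 0


def save_video(flow, directory, index):
    """Write the captured body to <directory>/<index>.mp4.

    Returns the file path, or None if the flow is not a complete video response.
    """
    if not flow.request.url.startswith(VIDEO_PREFIXES):
        return None
    response = flow.response
    # Skip missing, partial (206) or failed responses: only a 200 body is a whole file
    if response is None or response.status_code != 200 or response.content is None:
        return None
    os.makedirs(directory, exist_ok=True)
    filename = os.path.join(directory, str(index) + ".mp4")
    with open(filename, "wb") as f:
        f.write(response.content)
    return filename


def response(flow):
    """mitmdump hook: called for each completed response."""
    global counter
    if save_video(flow, VIDEO_DIR, counter) is not None:
        counter += 1
```

If the app downloads videos in ranged chunks, this sketch would skip them, and re-requesting the URL as in the original script remains the simpler option.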

2. Appium simulates the mobile phone

Make sure the Android SDK that Appium depends on is configured. There are many tutorials online, so I won't cover that here.

Appium is very simple to use. First, open Appium; the startup interface is as follows

Click the Start Server button to start the Appium service

Connect the Android phone to the PC with a data cable and turn on USB debugging. You can test the connection with an adb command (easy to find online). If the following results appear, the connection is successful

The model field is the name of the device, which is used in the configuration later. Then click the button indicated by the arrow in the figure below, and a configuration page will appear

In the JSON Representation panel in the lower right corner, configure the Desired Capabilities parameters for launching the app: platformName, deviceName, appPackage, and appActivity.

platformName: platform name, usually Android or iOS

deviceName: device name, the specific model of the phone

appPackage: package name of the app

appActivity: name of the entry Activity, usually starting with a dot (.)

platformName and deviceName are relatively easy to obtain, while appPackage and appActivity can be obtained by the following methods.

Run the command adb logcat>D:\log.log in the console and open the Douyin app on the phone. Then open the log.log file on the D drive and search for the keyword Displayed

From the figure above, we can see that com.ss.android.ugc.aweme, which follows Displayed, corresponds to appPackage, and .main.MainActivity corresponds to appActivity. Our final configuration is as follows:

{
  "platformName": "Android",
  "deviceName": "Mi_Note_3",
  "appPackage": "com.ss.android.ugc.aweme",
  "appActivity": ".main.MainActivity"
}
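The Displayed lookup described above can also be scripted. The following is a minimal sketch that assumes the log line has the form "Displayed <package>/<activity>: +<time>"; the sample line, the regular expression, and the function name parse_displayed are illustrative assumptions, not taken from the original log.

```python
import re

# Matches "Displayed <package>/<activity>" anywhere in a logcat line
DISPLAYED_RE = re.compile(r"Displayed\s+([\w.]+)/([\w.$]+)")


def parse_displayed(line):
    """Extract appPackage and appActivity from a logcat 'Displayed' line.

    Returns a dict with both values, or None if the line does not match.
    """
    match = DISPLAYED_RE.search(line)
    if match is None:
        return None
    return {"appPackage": match.group(1), "appActivity": match.group(2)}


# Hypothetical sample line, written in the assumed logcat format
sample = ("I/ActivityManager( 1234): Displayed "
          "com.ss.android.ugc.aweme/.main.MainActivity: +1s234ms")
print(parse_displayed(sample))
```

Running the function over each line of log.log and keeping the first match for the app you just opened should give the two values needed for the configuration.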

Then click Start Session to launch the Douyin app on the Android phone and enter its startup page. At the same time, a debugging window pops up on the PC, where you can preview the current phone screen and simulate various operations on the phone. That is not the focus of this article, so I will skip it.

Below we use a Python script to drive the app, running it directly in PyCharm

from appium import webdriver
from time import sleep


class Action():
    def __init__(self):
        # Initial configuration: set the Desired Capabilities parameters
        self.desired_caps = {
            "platformName": "Android",
            "deviceName": "Mi_Note_3",
            "appPackage": "com.ss.android.ugc.aweme",
            "appActivity": ".main.MainActivity"
        }
        # Address of the Appium Server
        self.server = 'http://localhost:4723/wd/hub'
        # Create a new session
        self.driver = webdriver.Remote(self.server, self.desired_caps)
        # Set the swipe start coordinates and the swipe distance
        self.start_x = 500
        self.start_y = 1500
        self.distance = 1300

    def comments(self):
        sleep(2)
        # Tap the screen once after the app starts to make sure the page is displayed
        self.driver.tap([(500, 1200)], 500)

    def scroll(self):
        # Swipe indefinitely
        while True:
            # Simulate an upward swipe
            self.driver.swipe(self.start_x, self.start_y, self.start_x,
                              self.start_y-self.distance)
            # Wait before the next swipe
            sleep(2)

    def main(self):
        self.comments()
        self.scroll()


if __name__ == '__main__':

    action = Action()
    action.main()
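The swipe coordinates in the script above are hard-coded for one phone. As a hedged sketch, they could instead be derived from the screen size so the same script works on other devices. The function name swipe_points and the 80%/20% defaults are assumptions chosen for illustration.

```python
def swipe_points(width, height, start_pct=80, end_pct=20):
    """Return (start_x, start_y, end_x, end_y) for an upward swipe.

    The swipe runs along the vertical center line of the screen, from
    start_pct percent of the height up to end_pct percent of the height.
    """
    x = width // 2
    # Integer arithmetic keeps the coordinates exact whole pixels
    start_y = height * start_pct // 100
    end_y = height * end_pct // 100
    return x, start_y, x, end_y


# Example for a 1080 x 1920 screen
print(swipe_points(1080, 1920))
```

With a live session, the size could be read with self.driver.get_window_size(), which returns a dictionary containing width and height, and the four values passed to self.driver.swipe in place of the fixed coordinates.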

The crawling process is shown below. P.S.: duplicate videos are occasionally crawled.
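One possible way to handle the occasional duplicates, sketched here as an assumption rather than as part of the original script, is to hash each video's bytes and skip content that has already been saved. The class name Deduplicator is hypothetical.

```python
import hashlib


class Deduplicator:
    """Remember content hashes so the same video is only saved once."""

    def __init__(self):
        self._seen = set()

    def is_new(self, data):
        """Return True the first time these bytes are seen, False afterwards."""
        digest = hashlib.sha256(data).hexdigest()
        if digest in self._seen:
            return False
        self._seen.add(digest)
        return True


dedup = Deduplicator()
print(dedup.is_new(b"video-bytes"))  # first time: True
print(dedup.is_new(b"video-bytes"))  # same bytes again: False
```

In the download script, the file would then be written only when is_new returns True for the downloaded content. Hashing the content rather than the URL also catches the case where the same video is served from different hosts.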

———————————————————————————————————————————

 

Origin blog.csdn.net/Python_sn/article/details/110429213