This article records how to use Python to crawl data from a mobile app, using the Douyin (TikTok) video app as an example.
Programming tool: PyCharm
App packet capture tool: mitmproxy
App automation tool: Appium
Operating environment: Windows 10
Ideas:
Assuming the required tools are already configured, the approach is:
1. Use mitmproxy to capture the app's network traffic and extract the content we want
2. Use Appium, an automated testing tool, to drive the app and simulate human actions (swiping, tapping, etc.)
3. Combine 1 and 2 to build an automated crawler
1. mitmproxy/mitmdump packet capture
Make sure that mitmproxy is installed, that the phone and the PC are on the same local area network, and that the mitmproxy CA certificate is configured. There are many configuration tutorials online, so I will skip that here.
Because the mitmproxy console interface is not supported on Windows, we use one of its components, mitmdump, instead. mitmdump is the command-line interface of mitmproxy. It can load our Python scripts, so the captured traffic can be processed in Python.
After configuring mitmproxy, enter mitmdump on the console and open the Douyin app on the phone. mitmdump will display all the requests made by the phone, as shown below
Swipe down through a few videos in the Douyin app and watch the requests displayed by mitmdump. You will find that the video requests use these prefixes:
http://v1-dy.ixigua.com/;http://v3-dy.ixigua.com/;http://v9-dy.ixigua.com/
URLs with these three prefixes are our target Douyin video URLs.
Next, we write a Python script to download the videos. Run it with mitmdump -s scripts.py (scripts.py is the name of the Python file).
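import requests

# Directory where the downloaded videos are saved (it must already exist)
path = 'D:/video/'
# Counter used to name the video files
num = 1788


def response(flow):
    global num
    # Testing showed that video URLs mainly use these three prefixes
    target_urls = ['http://v1-dy.ixigua.com/', 'http://v9-dy.ixigua.com/',
                   'http://v3-dy.ixigua.com/']
    for url in target_urls:
        # Filter out the URLs we do not need
        if flow.request.url.startswith(url):
            # Build the video file name
            filename = path + str(num) + '.mp4'
            # Fetch the content of the video URL with requests
            # stream=True defers downloading the response body until
            # the Response.content attribute is accessed
            res = requests.get(flow.request.url, stream=True)
            # Write the video to the folder
            with open(filename, 'ab') as f:
                f.write(res.content)
                f.flush()
            print(filename + ' download complete')
            num += 1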
The code is fairly rough, but the basic logic is clear, and it lets us download Douyin videos. The method has one flaw: someone has to keep swiping to the next video by hand for new videos to be captured. We can solve this with Appium, a powerful automated testing tool.
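Before moving on, one rough edge of the download script is worth noting: it passes stream=True but then reads res.content, which still loads the whole video into memory. The following is a minimal sketch of a chunked variant, not the script used above. The names save_chunks and download_video, the 64 KB chunk size, and the 30 second timeout are assumptions made for illustration.

```python
def save_chunks(chunks, filename):
    """Write an iterable of byte chunks to filename and return the bytes written."""
    total = 0
    with open(filename, 'wb') as f:
        for chunk in chunks:
            if chunk:  # skip empty chunks
                f.write(chunk)
                total += len(chunk)
    return total


def download_video(url, filename, chunk_size=64 * 1024):
    """Stream url to filename without holding the whole body in memory."""
    import requests  # imported here so save_chunks works without requests

    # timeout=30 is an assumed value, not taken from the original script
    with requests.get(url, stream=True, timeout=30) as res:
        res.raise_for_status()
        return save_chunks(res.iter_content(chunk_size=chunk_size), filename)
```

Inside response(flow), the requests.get and file-writing lines could then be replaced by a call such as download_video(flow.request.url, filename).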
2. Appium simulates the mobile phone
Make sure the Android SDK that Appium depends on is configured. There are many tutorials online, so I won't cover it here.
Appium is very simple to use. First, open Appium. The startup interface is as follows
Click the Start Server button to start the Appium service
Connect the Android phone to the PC with a data cable and turn on USB debugging. You can then run an adb command (easy to find online) to test the connection. If the following results appear, the connection is successful
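As a minimal sketch of this check in Python, assuming the adb command in question is adb devices -l (it is not named above), the output can be parsed to confirm the connection and read the model field. The function names and the sample serial number are hypothetical.

```python
import subprocess


def parse_adb_devices(output):
    """Parse the text printed by 'adb devices -l' into a list of dicts."""
    devices = []
    for line in output.splitlines():
        parts = line.split()
        # Skip the header, blank lines, and adb daemon start-up messages
        if len(parts) < 2 or line.startswith(('List of devices', '*')):
            continue
        info = {'serial': parts[0], 'state': parts[1]}
        for field in parts[2:]:
            key, sep, value = field.partition(':')
            if sep:  # keep only key:value fields such as model:Mi_Note_3
                info[key] = value
        devices.append(info)
    return devices


def list_devices():
    """Run adb (it must be on PATH) and return the parsed device list."""
    result = subprocess.run(['adb', 'devices', '-l'],
                            capture_output=True, text=True, check=True)
    return parse_adb_devices(result.stdout)


# Hypothetical sample output; the serial number is made up
SAMPLE = """List of devices attached
0123456789ab           device product:jason model:Mi_Note_3 device:jason transport_id:1
"""
```

A state of device means the phone is connected and authorized. A state of unauthorized usually means the USB debugging prompt on the phone has not been accepted yet.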
model is the name of the device, which is needed for the configuration later. Then click the button indicated by the arrow in the figure below, and a configuration page will appear
In the JSON Representation panel in the lower right corner, configure the Desired Capabilities parameters used to start the app: platformName, deviceName, appPackage, and appActivity.
platformName: the platform name, usually Android or iOS.
deviceName: the device name, i.e. the specific phone model.
appPackage: the package name of the app.
appActivity: the name of the entry Activity, usually starting with a dot (.).
platformName and deviceName are easy to obtain, while appPackage and appActivity can be obtained as follows.
Enter the command adb logcat>D:\log.log on the console and open the Douyin app on the phone. Then open the log.log file on the D drive and search for the keyword Displayed
From the figure above, we can see that com.ss.android.ugc.aweme, which follows Displayed, is the appPackage, and .main.MainActivity is the appActivity. Our final configuration is as follows:
{
    "platformName": "Android",
    "deviceName": "Mi_Note_3",
    "appPackage": "com.ss.android.ugc.aweme",
    "appActivity": ".main.MainActivity"
}
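The lookup of appPackage and appActivity in the log can also be scripted. The following is a minimal sketch, assuming the Displayed entries have the form Displayed package/activity: +time. The function name find_displayed and the sample log line (including its process id and timing) are hypothetical.

```python
import re

# Matches "Displayed <package>/<activity>" in a logcat line
DISPLAYED_RE = re.compile(r'Displayed\s+([\w.]+)/([\w.$]+)')


def find_displayed(log_text):
    """Return a list of (appPackage, appActivity) pairs found in the log text."""
    return [(m.group(1), m.group(2)) for m in DISPLAYED_RE.finditer(log_text)]


# Hypothetical sample line; the process id and timing are made up
SAMPLE_LINE = ('I/ActivityManager( 1234): Displayed '
               'com.ss.android.ugc.aweme/.main.MainActivity: +1s234ms')

# Usage with the file created by adb logcat (path taken from the text above):
# with open(r'D:\log.log', errors='ignore') as f:
#     print(find_displayed(f.read()))
```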
Then click Start Session to launch the Douyin app on the Android phone and open its start page. At the same time, a debugging window pops up on the PC. In this window you can preview the current phone screen and simulate various operations on the phone. This is not the focus here, so I will skip it.
Below we use a Python script to drive the app, running it directly in PyCharm
from appium import webdriver
from time import sleep


class Action():
    def __init__(self):
        # Initial configuration: set the Desired Capabilities parameters
        self.desired_caps = {
            "platformName": "Android",
            "deviceName": "Mi_Note_3",
            "appPackage": "com.ss.android.ugc.aweme",
            "appActivity": ".main.MainActivity"
        }
        # Address of the Appium server
        self.server = 'http://localhost:4723/wd/hub'
        # Create a new session
        self.driver = webdriver.Remote(self.server, self.desired_caps)
        # Starting coordinates and distance of the swipe
        self.start_x = 500
        self.start_y = 1500
        self.distance = 1300

    def comments(self):
        sleep(2)
        # Tap the screen once after the app starts to make sure the page is shown
        self.driver.tap([(500, 1200)], 500)

    def scroll(self):
        # Swipe forever
        while True:
            # Simulate a swipe up
            self.driver.swipe(self.start_x, self.start_y, self.start_x,
                              self.start_y - self.distance)
            # Wait before the next swipe
            sleep(2)

    def main(self):
        self.comments()
        self.scroll()


if __name__ == '__main__':
    action = Action()
    action.main()
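The swipe coordinates in the script are hard-coded for one screen resolution. The following is a minimal sketch of deriving them from the screen size instead. The helper name swipe_coords and the 80% and 20% positions are assumptions made for illustration, not values from the script above.

```python
def swipe_coords(width, height, start_pct=80, end_pct=20):
    """Return (start_x, start_y, end_x, end_y) for a vertical swipe up.

    The swipe runs along the horizontal center of the screen, from
    start_pct percent of the height to end_pct percent of the height.
    """
    x = width // 2
    return x, height * start_pct // 100, x, height * end_pct // 100


# Usage with an Appium driver (needs a running session):
# size = driver.get_window_size()  # dict with 'width' and 'height'
# driver.swipe(*swipe_coords(size['width'], size['height']), 500)
```

Integer percentages are used so the result does not depend on floating-point rounding.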
The following is the crawling process. Note: duplicate videos are occasionally crawled.
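One possible way to skip such duplicates is to hash the downloaded bytes before saving them. The following is a minimal sketch, assuming duplicate videos are byte-identical. The class name DuplicateFilter is hypothetical, and this check is not part of the scripts above.

```python
import hashlib


class DuplicateFilter:
    """Remember content hashes and report whether content is new."""

    def __init__(self):
        self._seen = set()

    def is_new(self, content):
        """Return True the first time this exact byte content is seen."""
        digest = hashlib.sha256(content).hexdigest()
        if digest in self._seen:
            return False
        self._seen.add(digest)
        return True
```

In the download script, a module-level DuplicateFilter could be consulted with is_new(res.content) before the file is written. The same video served with different bytes would not be detected by this check.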