Python crawler in practice: crawling App data with Airtest and mitmdump



Foreword

Crawling data from an App is often easier than crawling the web: anti-crawler measures tend to be weaker, and most data is transmitted as JSON, which is simple to parse. On the web, we can inspect network requests and responses with the browser's developer tools. To see the same traffic for an App, we need packet capture software. Commonly used tools include Wireshark, Fiddler, Charles, mitmproxy, and AnyProxy, and they work on broadly the same principle.

By setting a proxy on the phone, we route its traffic through the packet capture software and can see every request and response made while the App is running, much like analyzing Ajax requests on a web page. If the URLs and parameters of these requests follow a regular pattern, we can work out the pattern and simulate the requests directly in a program. If they do not, we can use mitmdump, which hooks a Python script into the proxy so that the script processes each response directly.

Crawling an App by operating it by hand is not practical either, so the App itself also has to be driven automatically. The library used for that here is Airtest.
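As a rough illustration of the "regular requests" case: if packet capture shows that a list endpoint differs only by a page number, the request URLs can be generated in code. The sketch below uses the endpoint that appears later in this article. The query parameter names `page` and `pagesize` and the helper `build_page_urls` are hypothetical placeholders, not the App's real parameters.

```python
from urllib.parse import urlencode

# Endpoint taken from the capture shown later in this article.
BASE_URL = 'https://app.taoshouyou.com/api/trades/gettradeslist'


def build_page_urls(base_url, pages, page_size=20):
    """Return one URL per page number from 1 to `pages`.

    `page` and `pagesize` are hypothetical parameter names; replace them
    with whatever the captured requests actually use.
    """
    urls = []
    for page in range(1, pages + 1):
        query = urlencode({'page': page, 'pagesize': page_size})
        urls.append(f'{base_url}?{query}')
    return urls


if __name__ == '__main__':
    for url in build_page_urls(BASE_URL, pages=3):
        print(url)
```

Whether the generated URLs can be requested directly depends on what the capture shows, for example whether the App signs its requests or sends extra headers.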


1. Preliminary preparation

  1. Install Airtest (for automated clicks)
  2. Install the Nox emulator (夜神模拟器, NoxPlayer). The emulated phone system should preferably be Android 5; my Nox emulator is version 7.0.2.7
  3. Install mitmdump (for processing response data)

2. Approach

1. Configure the Nox emulator

  1. The adb in the Nox emulator (adb.exe) and the adb in Airtest (nox_adb.exe) must be exactly the same version. Copy one over the other so that the two versions match.

    adb location in the Nox emulator: ...\bin
    adb location in Airtest: ...\airtest\core\android\static\adb\windows

  2. Enable developer options and enable USB debugging

  3. Configure the proxy and install the mitmproxy certificate in the Nox emulator. When everything is set up correctly, the App's requests appear in the capture.
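The adb version check from step 1 can be sketched in Python as follows. This is a minimal sketch, not the author's method: it treats two adb binaries as "the same version" when their file contents are identical. The function names `file_digest` and `sync_adb` are hypothetical, and the paths are passed in as arguments because the installation-specific parts are elided in this article.

```python
import hashlib
import shutil
from pathlib import Path


def file_digest(path):
    """Return the SHA-256 hex digest of a file's contents."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def sync_adb(source_adb, target_adb):
    """Overwrite target_adb with source_adb if their contents differ.

    Returns True if a copy was made, False if the files already matched.
    """
    if file_digest(source_adb) == file_digest(target_adb):
        return False
    shutil.copyfile(source_adb, target_adb)
    return True
```

It is probably safest to close both the emulator and Airtest before copying, since a running adb process may keep the file locked on Windows.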

2. Use mitmdump to find the data

  • After mitmdump is installed (it ships with mitmproxy, together with mitmweb), start the mitmproxy web interface by running this command in cmd:
    mitmweb
    
  • Wait until it has started successfully
  • Find the URL that returns the data you want to collect
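Besides browsing the capture by hand, a small mitmdump script can help narrow down which URLs carry data. The sketch below prints the URL and body length of every response whose Content-Type mentions JSON. The file name `find_json.py` and the helper `is_json` are hypothetical; `response(flow)` is the standard mitmproxy hook also used by the monitoring script later in this article.

```python
# find_json.py -- run with: mitmdump -s find_json.py


def is_json(content_type):
    """Return True if a Content-Type header value indicates JSON."""
    return 'json' in (content_type or '').lower()


def response(flow):
    """mitmproxy calls this hook once for every completed response."""
    content_type = flow.response.headers.get('content-type', '')
    if is_json(content_type):
        # Print the URL and body size so data-carrying endpoints stand out.
        print(flow.request.url, len(flow.response.text or ''))
```

Scrolling through the App while this runs should make the list endpoint show up repeatedly with a large body size.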

3. Automate clicks with Airtest

  1. Connect to the Nox emulator through Airtest and confirm that the connection succeeds

  2. Pitfalls I ran into

    If Airtest cannot find the Nox emulator's adb device to connect to, the adb versions may differ.

    If the adb versions match and the device still cannot be found, USB debugging may not be turned on.

    If the device still cannot be found with USB debugging turned on, restart the Nox emulator and Airtest.
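When working through the pitfalls above, it can help to check what adb itself reports. The following sketch runs `adb devices` and parses its output. The function names `parse_adb_devices` and `list_devices` are hypothetical, and `adb_path` should point at whichever adb binary you are checking.

```python
import subprocess


def parse_adb_devices(output):
    """Parse `adb devices` output into a list of (serial, state) tuples."""
    devices = []
    for line in output.splitlines():
        line = line.strip()
        # Skip the header line, blank lines, and adb daemon status messages.
        if not line or line.startswith('List of devices') or line.startswith('*'):
            continue
        parts = line.split()
        if len(parts) >= 2:
            devices.append((parts[0], parts[1]))
    return devices


def list_devices(adb_path='adb'):
    """Run `adb devices` and return the parsed device list."""
    result = subprocess.run(
        [adb_path, 'devices'], capture_output=True, text=True, check=True
    )
    return parse_adb_devices(result.stdout)
```

A state of `device` means adb can talk to the emulator. An empty list, or a state such as `offline` or `unauthorized`, points back to the adb version and USB debugging checks above.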

3. Code

1. Data monitoring code

  • Start it with the command mitmdump.exe -s <path to script>, for example:
    mitmdump.exe -s .\mitmproxy\gg.py

The code is as follows (example):

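# Start command:
# mitmdump.exe -s .\mitmproxy\gg.py

import json

import pymongo
# pymongo has a built-in connection pool and automatic reconnection, but the
# AutoReconnect exception still has to be caught and the operation retried.
from pymongo.errors import AutoReconnect
from retry import retry

# MongoDB connection address, database name, and collection name
MONGO_CONNECTION_STRING = 'mongodb://192.168.27.101:27017'

client = pymongo.MongoClient(MONGO_CONNECTION_STRING)
db = client['crawle_case']
collection = db['tsy_2']

# Arguments of @retry below:
#   AutoReconnect: retry when this exception is caught; this argument can also
#                  be a tuple of several exception types
#   tries: maximum number of attempts
#   delay: seconds to wait between attempts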
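@retry(AutoReconnect, tries=4, delay=1)
def save_data(data):
    """
    Save one record to MongoDB.

    The record is currently inserted as-is with insert_one(). The commented-out
    alternative uses update_one(), whose first argument is the query filter and
    whose second argument describes the fields to modify. With upsert=True, a
    new document is created from the filter and the update when no document
    matches, and the matching document is updated otherwise. The collection
    does not need to be pre-populated, and the same code can both create and
    update documents.
    """
    # Update the document if it exists, otherwise create it:
    # collection.update_one({
    #     # '游戏名ID' (game ID) keeps each record unique
    #     '游戏名ID': data.get('游戏名ID')
    # }, {
    #     '$set': data
    # }, upsert=True)
    collection.insert_one(data)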


def response(flow):
    """mitmproxy hook: called for every response that passes through the proxy."""
    url = 'https://app.taoshouyou.com/api/trades/gettradeslist'
    if flow.request.url.startswith(url):
        text = flow.response.text
        print(flow)
        data = json.loads(text)
        print(data)
        # Fall back to an empty list when 'data' or 'list' is missing or null
        items = (data.get('data') or {}).get('list') or []
        for a in items:
            wb_data = {
                '游戏名ID': a.get('id'),               # game ID
                '游戏标题': a.get('name'),             # listing title
                '游戏名': a.get('gamename'),           # game name
                '店铺': a.get('shopname'),             # shop
                '原价': a.get('officialprice'),        # original price
                '折扣': a.get('discount'),             # discount
                '现价': a.get('price'),                # current price
                '区服': a.get('areaname'),             # region / server
                '账号类型': a.get('goodsname'),        # account type
                '客服端': a.get('clientname'),         # client
                '商品url': a.get('shopUrl'),           # item URL
            }
            save_data(wb_data)
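The field-by-field mapping in `response` can also be written as a lookup table, which keeps the parsing separate from mitmproxy and MongoDB so it can be tried out offline. This is only a sketch: `FIELD_MAP` and `extract_items` are hypothetical names, and the source field names are simply the ones used in the script above.

```python
# Maps the API's field names (as used in the script above) to the keys
# stored in MongoDB.
FIELD_MAP = {
    'id': '游戏名ID',
    'name': '游戏标题',
    'gamename': '游戏名',
    'shopname': '店铺',
    'officialprice': '原价',
    'discount': '折扣',
    'price': '现价',
    'areaname': '区服',
    'goodsname': '账号类型',
    'clientname': '客服端',
    'shopUrl': '商品url',
}


def extract_items(payload):
    """Convert a decoded JSON payload into a list of records to store.

    Missing fields become None; a missing or null 'data'/'list' yields [].
    """
    items = (payload.get('data') or {}).get('list') or []
    return [
        {target: item.get(source) for source, target in FIELD_MAP.items()}
        for item in items
    ]
```

Inside `response`, the loop would then reduce to calling `save_data(record)` for each `record` in `extract_items(data)`.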

2. Airtest swipe code

The code is as follows (example):

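# -*- encoding=utf8 -*-
__author__ = "Administrator"
import time
from airtest.core.api import *

auto_setup(__file__)

# Screen coordinates (x, y) for the swipe gesture
gg1 = 426, 375     # end point, near the top of the screen
gg2 = 426, 1453    # start point, near the bottom of the screen

while True:
    # Swipe up from gg2 to gg1 to scroll the list and load more data
    swipe(gg2, gg1, duration=0.01)
    # Wait for the new data to load before swiping again
    time.sleep(1)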
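The loop above runs until it is stopped by hand. A bounded variant might look like the sketch below. The function `scroll_feed` is a hypothetical helper; it takes the swipe function as an argument so that it can be exercised without a connected device, and it assumes that function accepts the same arguments as the `swipe` call above.

```python
import time


def scroll_feed(swipe_fn, times, start=(426, 1453), end=(426, 375), pause=1.0):
    """Swipe from `start` to `end` `times` times, pausing between swipes.

    `swipe_fn` is expected to be called like the Airtest swipe above:
    swipe_fn(start, end, duration=...). Returns the number of swipes done.
    """
    done = 0
    for _ in range(times):
        swipe_fn(start, end, duration=0.01)
        done += 1
        time.sleep(pause)
    return done
```

In the Airtest script this would be called as `scroll_feed(swipe, times=50)` after `auto_setup(__file__)`, where 50 is an arbitrary example value.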

4. Execution results

  1. Start the Nox emulator and open the App
  2. Start the data monitoring code
  3. Start the Airtest swipe code
  4. Monitoring succeeds
  5. Data is collected successfully

Summary

That is all for today. This article only briefly introduces simple App data collection and is intended for personal study only.

Origin blog.csdn.net/weixin_45688123/article/details/126855427