How to obtain official account article information through UI automation?

For study and research purposes, I analyzed the articles and videos of a certain official account and tried to see whether the corresponding information could be obtained with automated methods.

There are several ways to obtain the articles of a given official account:

The first is to search for the account through Sogou's WeChat search (this method only surfaces one article per day, which makes it basically useless):

(screenshot: Sogou search results)

The second method requires you to register a WeChat subscription account yourself. Registration comes with restrictions:

1. One email can only apply for one account;

2. The same mobile phone number can be bound to 5 accounts;

3. The maximum number of personal accounts registered with the same ID card is 1;

4. The same enterprise, individual business, or other organization's materials can register at most 2 accounts;

5. The same government or media entity can register and authenticate up to 50 accounts;

6. The maximum number of accounts registered by the same overseas entity is 1.

In other words, an individual can register only one account, while a media or government entity can register up to 50. For research and study, one is enough.

Skipping the registration process: after registering and logging in, click 'Drafts' --> 'New article' --> click the 'Hyperlink' button at the top --> click 'Select another account' --> enter the name of the account whose articles you want and click Query. At this point you can see that all of that account's articles are listed:

(screenshot: the hyperlink dialog listing the queried account's articles)

For example, if we query "CCTV", the articles are displayed in reverse order of publication date. This is the data we want. Our automation only needs to simulate user actions up to this point; the rest is interface analysis:

Analyzing the network requests, the one that fetches the articles is:

https://mp.weixin.qq.com/cgi-bin/appmsg?action=list_ex&begin=35&count=5&fakeid=MTI0MDU3NDYwMQ==&type=9&query=&token=441171486&lang=zh_CN&f=json&ajax=1

(screenshot: the appmsg request and its JSON response)

The data returned by this interface is quite complete; basically every field you could want is there.
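
For illustration, here is a minimal sketch of calling this interface directly with requests. The fakeid, token, and cookie values are placeholders you would have to capture from your own logged-in session; app_msg_list is the article-list field in the JSON response:

import requests

# Placeholders: capture these from a logged-in mp.weixin.qq.com session.
FAKEID = "MTI0MDU3NDYwMQ=="  # fakeid of the target account
TOKEN = "441171486"          # login token taken from the admin console URL
COOKIE = "your-logged-in-cookie-string"

params = {
    "action": "list_ex",
    "begin": 0,      # paging offset: 0, 5, 10, ...
    "count": 5,      # articles per page
    "fakeid": FAKEID,
    "type": 9,
    "query": "",
    "token": TOKEN,
    "lang": "zh_CN",
    "f": "json",
    "ajax": 1,
}
resp = requests.get(
    "https://mp.weixin.qq.com/cgi-bin/appmsg",
    params=params,
    headers={"Cookie": COOKIE, "User-Agent": "Mozilla/5.0"},
)
data = resp.json()
# app_msg_list holds the articles; a "freq control" message means the account is banned.
for item in data.get("app_msg_list", []):
    print(item["title"], item["link"])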

What we need to do:

1. Automatically simulate user actions up to this point, capture the request, and parse the response data.

2. Call this interface directly, which requires constructing its request parameters (account id, cookie, etc.) and maintaining the login session.

Each approach has its advantages. With UI automation, you only need to enter the account name after logging in. With direct interface calls, you have to maintain the fakeid of each target account in advance, and you get blocked more easily.

The ban rules I have observed so far: each account can call the interface 60 times; at the 60th call it is banned for 1 hour. Changing IP does not help; you can simply continue after 1 hour. After repeated bans, the ban window grows longer, up to 24 hours.
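
Based on these observed limits, a minimal client-side sketch for tracking calls and ban windows per account; the class and helper names here are my own, not from the project:

from datetime import datetime, timedelta

BAN_WINDOW = timedelta(hours=1)   # observed initial ban duration
MAX_CALLS = 60                    # observed per-account call limit

class AccountLimiter:
    """Track per-account call counts and ban expiry (illustrative only)."""

    def __init__(self):
        self.calls = {}      # account -> calls since last ban
        self.banned_at = {}  # account -> time the ban started

    def usable(self, account):
        banned = self.banned_at.get(account)
        return banned is None or datetime.now() - banned >= BAN_WINDOW

    def record_call(self, account):
        self.calls[account] = self.calls.get(account, 0) + 1
        if self.calls[account] >= MAX_CALLS:
            self.banned_at[account] = datetime.now()
            self.calls[account] = 0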

You may ask: without calling the interface directly, how do we capture these requests during automation? There are two options:

  1. mitmproxy. Its advantage over Charles and Fiddler is that it can be driven from the command line or by scripts.

mitmproxy can not only capture packets like Charles, but also lets you process the captured requests in code, allowing a high degree of customization.

You can start with the documentation on the official website.

mitmproxy official website: https://www.mitmproxy.org/
mitmproxy is very powerful, but we don't need it here.
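
Even though we won't use it, here is a minimal mitmproxy addon sketch that prints the article-list responses, just to show what scripted capture looks like. Save it as addon.py and run mitmdump -s addon.py with the browser proxied through mitmproxy:

from mitmproxy import http

def response(flow: http.HTTPFlow) -> None:
    # Only look at the article-list interface analyzed above.
    if "mp.weixin.qq.com/cgi-bin/appmsg" in flow.request.pretty_url:
        print(flow.response.get_text())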

2. seleniumwire

Selenium Wire extends Selenium's Python bindings to give you access to the underlying requests made by the browser. You write code the same way you write Selenium, but you get additional APIs to inspect requests and responses and change them dynamically. Project page:

https://github.com/wkeeling/selenium-wire

  • Pure Python, user-friendly API

  • HTTP and HTTPS requests captured

  • Intercept requests and responses

  • Modify headers, parameters, body content on the fly

  • Capture websocket messages

  • HAR format supported

  • Proxy server support

seleniumwire is enough for us. 
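
A minimal sketch of how seleniumwire exposes the captured traffic; driver.requests and request.response are the library's API, while the UI steps are the ones described above and omitted here:

from seleniumwire import webdriver  # pip install selenium-wire

driver = webdriver.Chrome()
driver.get("https://mp.weixin.qq.com/")
# ... perform the UI steps above (Drafts -> New article -> Hyperlink -> query) ...

# Every request the browser made is recorded on driver.requests.
for request in driver.requests:
    if request.response and "cgi-bin/appmsg" in request.url:
        print(request.url)
        print(request.response.body[:200])  # raw bytes of the JSON article list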

After knowing how to do it, let's sort out what we need to prepare:

1. Environment installation

(1) python environment (omitted)

(2) chromedriver download:

http://chromedriver.storage.googleapis.com/index.html

(3) mysql environment (omitted)

2. Database design

(1) weichat_news: main article table


CREATE TABLE `weichat_news` (
  `id` int NOT NULL AUTO_INCREMENT,
  `news_title` varchar(500) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci NOT NULL COMMENT 'article title',
  `news_cover` varchar(500) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci NOT NULL COMMENT 'article cover image',
  `news_digest` varchar(500) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci NOT NULL COMMENT 'article digest',
  `news_content_link` varchar(500) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci NOT NULL COMMENT 'article content link',
  `account_id` int NOT NULL COMMENT 'WeChat account id',
  `weichat_name` varchar(50) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci NOT NULL COMMENT 'WeChat account name',
  `news_create_time` varchar(50) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci DEFAULT NULL COMMENT 'article creation time',
  `news_update_time` varchar(50) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci DEFAULT NULL COMMENT 'article update time',
  `update_time` varchar(50) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci DEFAULT NULL COMMENT 'row update time',
  `insert_time` varchar(50) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci DEFAULT NULL COMMENT 'article insert time',
  `is_video` int NOT NULL DEFAULT '0' COMMENT 'whether the video scan has been run',
  `have_video` int NOT NULL DEFAULT '0' COMMENT 'whether the article contains video',
  `is_content` int NOT NULL DEFAULT '0' COMMENT 'whether the content has been fetched',
  `is_push` int DEFAULT '0',
  `is_delete` int NOT NULL DEFAULT '0',
  PRIMARY KEY (`id`),
  KEY `title` (`news_title`) USING BTREE,
  KEY `link` (`news_content_link`) USING BTREE,
  KEY `idx_account_id` (`account_id`) USING BTREE,
  KEY `idx_1` (`have_video`,`is_delete`,`is_push`) USING BTREE
) ENGINE=InnoDB AUTO_INCREMENT=1 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_general_ci COMMENT='articles';

(2) weichat_account: accounts whose articles are to be fetched


CREATE TABLE `weichat_account` (
  `id` int NOT NULL AUTO_INCREMENT,
  `account` varchar(50) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci NOT NULL COMMENT 'account name',
  `collection_id` varchar(100) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci DEFAULT NULL,
  `is_delete` int DEFAULT '0',
  PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=1 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_general_ci COMMENT='accounts';

(3) run_account: account execution table


CREATE TABLE `run_account` (
  `id` int NOT NULL AUTO_INCREMENT,
  `account_id` int NOT NULL COMMENT 'account id',
  `account` varchar(50) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci NOT NULL COMMENT 'account name',
  `patch` int DEFAULT NULL,
  `run_time` varchar(50) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci DEFAULT NULL COMMENT 'run time',
  `is_delete` int DEFAULT '0',
  PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=1 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_general_ci COMMENT='run status';

(4) news_video: video table


CREATE TABLE `news_video` (
  `id` int NOT NULL AUTO_INCREMENT,
  `news_id` int NOT NULL COMMENT 'article id',
  `news_title` varchar(200) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci NOT NULL COMMENT 'article title',
  `account_id` int NOT NULL COMMENT 'official account id',
  `account` varchar(50) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci NOT NULL COMMENT 'account name',
  `original_url` varchar(500) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci NOT NULL COMMENT 'original video url',
  `cover` varchar(300) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci DEFAULT NULL,
  `vid` varchar(100) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci DEFAULT NULL,
  `width` int NOT NULL COMMENT 'video width',
  `height` int NOT NULL COMMENT 'video height',
  `video_quality_level` int NOT NULL COMMENT 'video quality level',
  `video_quality_wording` varchar(20) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci NOT NULL COMMENT 'quality label',
  `qiniu_url` varchar(500) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci DEFAULT NULL COMMENT 'Qiniu re-hosted url',
  `insert_time` varchar(50) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci DEFAULT NULL COMMENT 'insert time',
  `is_delete` int DEFAULT '0',
  PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=1 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_general_ci COMMENT='videos';

(5) news_content: news content table


CREATE TABLE `news_content` (
  `id` int NOT NULL AUTO_INCREMENT,
  `news_id` int NOT NULL,
  `news_title` varchar(100) CHARACTER SET utf8mb3 COLLATE utf8_general_ci DEFAULT NULL,
  `content` longtext CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci,
  `insert_time` datetime DEFAULT NULL,
  `source` varchar(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci DEFAULT NULL,
  `is_delete` int DEFAULT '0',
  PRIMARY KEY (`id`),
  UNIQUE KEY `news_id` (`news_id`) USING BTREE
) ENGINE=InnoDB AUTO_INCREMENT=1 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_general_ci COMMENT='article details';

(6) account_token: stores the login tokens of the accounts used for crawling


CREATE TABLE `account_token` (
  `id` int NOT NULL AUTO_INCREMENT,
  `account` varchar(50) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci NOT NULL COMMENT 'account name',
  `token` varchar(20) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci NOT NULL COMMENT 'login token',
  `update_time` varchar(50) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci DEFAULT NULL COMMENT 'token update time',
  `freq_control_time` datetime DEFAULT NULL,
  `is_delete` int DEFAULT '0',
  PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=1 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_general_ci COMMENT='tokens of the accounts used for crawling';

3. Sub-module design

(1) Obtain basic article information through automation: cover image, article link, publication time, etc.

(2) Automatically open the article link to obtain the addresses of any videos it contains. There are two types of videos in the article details:

A. Ordinary videos in the article details can generally be parsed and downloaded directly (like the regular video shown below):

(screenshot: a regular video in the article details)

B. Tencent Video embedded in the article details is delivered as m3u8; you need to download the .ts segments and assemble them into an mp4. For Tencent Video, the automation must first click the play button, otherwise the m3u8 address is never requested:

The two video types are processed differently. My approach: run every article through the automation, use selenium explicit waits, click the play button if one is found, and then capture all the request data for later analysis.
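
For the m3u8 step, a rough sketch that downloads the .ts segments listed in a playlist and concatenates them into a single file; real Tencent Video playlists may need extra handling (decryption keys, remuxing to mp4 with ffmpeg), so treat this as the basic idea only:

import requests
from urllib.parse import urljoin

def download_m3u8(m3u8_url, out_path="video.mp4"):
    """Fetch an m3u8 playlist and concatenate its .ts segments into one file."""
    playlist = requests.get(m3u8_url).text
    # Segment entries are the non-comment lines of the playlist.
    segments = [line for line in playlist.splitlines()
                if line and not line.startswith("#")]
    with open(out_path, "wb") as f:
        for seg in segments:
            seg_url = urljoin(m3u8_url, seg)  # segments may be relative paths
            f.write(requests.get(seg_url).content)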

(3) Request the original article URL directly and parse the content with bs4 and lxml (a parsing sketch follows this list):

  A. Remove unwanted formatting

  B. Re-host the images on Qiniu (and replace the image URLs in the original content after the transfer)

  C. Extract the original video tags (to be replaced later, after the video is downloaded, re-hosted on Qiniu, and transcoded)

  D. Extract other fields, such as news source, editor, and reporter
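
A minimal parsing sketch under these assumptions: WeChat article bodies sit in a div with id js_content, and images are lazy-loaded via data-src. Only step A (format stripping) and the image collection for step B are shown:

import requests
from bs4 import BeautifulSoup

def parse_article(url):
    """Fetch a WeChat article and return its cleaned body HTML plus image URLs."""
    html = requests.get(url, headers={"User-Agent": "Mozilla/5.0"}).text
    soup = BeautifulSoup(html, "lxml")
    content = soup.find("div", id="js_content")  # article body container
    if content is None:
        return None, []
    # Step A: strip inline styles we don't want to keep.
    for tag in content.find_all(style=True):
        del tag["style"]
    # Collect image URLs for the Qiniu transfer step.
    images = [img.get("data-src") or img.get("src")
              for img in content.find_all("img")]
    return str(content), images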

(4) Since accounts get banned, consider adding more accounts. Once an account is banned, log out of it, log in with a new account, and send the login QR code to DingTalk for scanning (see the sketch after this list).

  A. During automation, a banned account gets no data when querying articles, and the interface responds with the prompt "freq control".

  B. When an account is banned, insert a record into the table and update the ban time. When selecting an account to use, only pick ones whose ban time is more than 1 hour ago.

  C. In practice, which account gets used depends on who scans the QR code with their phone. The browser keeps a login cache, so until the account is banned we keep reusing that cache; this way the automation does not need a QR scan on every run. More on this later.

  D. When a login is needed, send a screenshot of the QR code to the DingTalk group. Here the automation wait time can be set longer, e.g. 10 minutes.
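
A sketch of step D, assuming a DingTalk group robot webhook. Since the webhook only accepts message payloads, not file uploads, the upload_image helper is hypothetical and stands in for uploading the screenshot somewhere group members can reach, e.g. Qiniu:

import requests

WEBHOOK = "https://oapi.dingtalk.com/robot/send?access_token=YOUR_TOKEN"  # placeholder

def push_qrcode(driver, upload_image):
    """Screenshot the login QR code and post it to the DingTalk group."""
    driver.save_screenshot("qrcode.png")    # selenium's screenshot API
    image_url = upload_image("qrcode.png")  # hypothetical helper returning a public URL
    payload = {
        "msgtype": "markdown",
        "markdown": {
            "title": "WeChat login QR code",
            "text": "Please scan to log in:\n![qrcode](%s)" % image_url,
        },
    }
    requests.post(WEBHOOK, json=payload)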

  Main module classification:

(diagram: main module structure)

Sharing part of the code:

1. Main method for fetching articles:


def get_news_main():
    """
    Fetch articles.
    :return:
    """

    # query the accounts to be crawled
    wei_account = read_sql.deal_mysql("get_account.sql")
    max_patch = read_sql.deal_mysql("get_max_patch.sql")[0][0]

    if max_patch is None:
        max_patch = 1
    else:
        if len(wei_account) == 0:
            max_patch = max_patch + 1
            wei_account = read_sql.deal_mysql("get_account_all.sql")
    for i in wei_account:
        # now() > DATE_ADD(freq_control_time, INTERVAL 1 HOUR) // fetch accounts that are not currently banned
        crawl_account_l = read_sql.deal_mysql("get_account_token.sql")
        if crawl_account_l:
            # randomly pick an account to crawl with
            crawl = random.randrange(len(crawl_account_l))
            # id of the crawling account
            crawl_account_id = crawl_account_l[crawl][0]
            # name of the crawling account
            crawl_account = crawl_account_l[crawl][1]
            # login token of the crawling account
            crawl_token = crawl_account_l[crawl][2]
            # id of the account to be crawled
            account_id = i[0]
            # name of the account to be crawled
            account_name = i[1]
            # number of pages to crawl  // 1 crawls two pages, counting starts at 0
            page = 0
            # run the UI automation
            driver = news.get_news(crawl_token, account_name, page, crawl_account_id, crawl_account)
            try:
                time.sleep(1)
                # parse the article data
                news_data, freq = analyze_news.analyze_news(driver)
                # insert the data into the database
                in_news.insert_news(account_id, account_name, news_data)
                # record the processed account in the run_account table
                in_run.insert_run_account(account_id, account_name, max_patch)
                # if the crawling account got banned, update its ban time and log it out
                if freq is True:
                    ua.update_account_token_freq(crawl_account_id)
                    # log out the banned account
                    driver = lg.login_out(driver)
            except Exception as e:
                print(e)
            finally:
                driver.quit()
        else:
            print('All crawling accounts are banned; the task cannot run right now!!')
            break

 (1) The automated fetching method:


# -*- coding = utf-8 -*-
# ------------------------------
# @time: 2022/5/5 17:11
# @Author: drew_gg
# @File: get_wei_chat_news.py
# @Software: wei_chat_news
# ------------------------------

import os
import time
import random
import configparser
from seleniumwire import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from get_news import get_login_token as gt
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.wait import WebDriverWait

pl = os.getcwd().split('wei_chat_news')
# chromedriver path
driver_path = pl[0] + "wei_chat_news\\charomedriver\\chromedriver.exe"
# main config file path
config_path = pl[0] + "wei_chat_news\\common_config\\main_config.ini"
config = configparser.ConfigParser()
config.read(config_path)


def get_news(token, account_name, page_num, crawl_account_id, crawl_account):
    """
    Fetch articles.
    :param token: login token
    :param account_name: account whose articles are to be fetched
    :param page_num: number of pages to fetch
    :param crawl_account_id: id of the account used for crawling
    :param crawl_account: account used for crawling
    :return:
    """
    # ************************************************** #
    # Due to WeChat account limits, one account can only crawl 60 pages of data before a 1-hour ban!!
    # ************************************************** #

    # seleniumwire: request-capturing extension of selenium
    # browser user-data (cache) directory (Chrome)
    profile_directory = r'--user-data-dir=%s' % config.get('wei_chat', 'chromedriver_user_data')

    # ******* configure automation options *********#
    options = webdriver.ChromeOptions()
    # load the browser cache
    options.add_argument(profile_directory)
    # avoid proxy certificate errors
    options.add_argument('--ignore-certificate-errors-spki-list')
    options.add_argument('--ignore-ssl-errors')
    # run without opening a browser window
    # options.add_argument("headless")
    # ******* configure automation options *********#

    driver = webdriver.Chrome(executable_path=driver_path, options=options)
    # maximize the browser window
    driver.maximize_window()
    wei_chat_url = config.get('wei_chat', 'wei_chat_url') + token
    # WeChat admin console URL
    driver.get(wei_chat_url)
    try:
        # click "Drafts" (草稿箱)  // if it cannot be found, assume login failed
        WebDriverWait(driver, 5, 0.2).until(EC.presence_of_element_located((By.XPATH, "//a[@title='草稿箱']"))).click()
    except Exception as e:
        # fall back to QR-code login
        driver = gt.get_login_token(driver, wei_chat_url, crawl_account_id, crawl_account)
        print(e)
    try:
        # click "Drafts"
        WebDriverWait(driver, 5, 0.2).until(EC.presence_of_element_located((By.XPATH, "//a[@title='草稿箱']"))).click()
        # click "New creation"
        WebDriverWait(driver, 5, 0.2).until(EC.presence_of_element_located((By.CLASS_NAME, "weui-desktop-card__icon-add"))).click()
        # click "New article"
        WebDriverWait(driver, 5, 0.2).until(EC.presence_of_element_located((By.CLASS_NAME, "icon-svg-editor-appmsg"))).click()
        time.sleep(1)
        # switch to the newly opened window
        driver.switch_to.window(driver.window_handles[1])
        # click "Hyperlink"
        WebDriverWait(driver, 5, 0.2).until(EC.presence_of_element_located((By.ID, "js_editor_insertlink"))).click()
        # click "Select another account" (选择其他账号)
        WebDriverWait(driver, 5, 0.2).until(EC.presence_of_element_located((By.XPATH, "//button[text()='选择其他账号']"))).click()
        # type the account name
        driver.find_element(By.XPATH, value="//input[contains(@placeholder, '微信号')]").send_keys(account_name)
        # press Enter to search
        driver.find_element(By.XPATH, value="//input[contains(@placeholder, '微信号')]").send_keys(Keys.ENTER)
        # click to confirm the account
    # swallow this exception so subsequent tasks are not affected
    except Exception as e:
        print(e)
    try:
        e_v = "//*[@class='inner_link_account_nickname'][text()='%s']" % account_name
        WebDriverWait(driver, 5, 0.2).until(EC.presence_of_element_located((By.XPATH, e_v))).click()
    except Exception:
        print('Account not found:', account_name)
        return driver
    for i in range(page_num):
        time.sleep(random.randrange(1, 3))
        try:
            # click "Next page" (下一页)
            WebDriverWait(driver, 5, 0.2).until(EC.presence_of_element_located((By.XPATH, "//a[text()='下一页']"))).click()
            try:
                # "No data" (暂无数据)  // the account may have no articles, or may be banned; return immediately!
                no_data_e = "//div[@class='weui-desktop-media-tips weui-desktop-tips'][text()='暂无数据']"
                nde = WebDriverWait(driver, 5, 0.2).until(EC.presence_of_element_located((By.XPATH, no_data_e)))
                if nde:
                    return driver
            except Exception:
                pass
        except Exception:
            return driver
    return driver

Origin: blog.csdn.net/m0_73409141/article/details/131773677