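Real-Time Scraping of WeChat Official Accounts

Many businesses need to scrape WeChat Official Accounts.
Sometimes, though, restrictions in the target app or the limits of our own skills mean we cannot simply unpack the app and reverse-engineer it.
Today we take a different approach to scraping Official Accounts.

Reading guide

Demo
Scraping approach
Source code
Key source code walkthrough
Summary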

Demo

[Demo image]

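Scraping approach

Use Appium to automate the phone and simulate a user browsing the list of followed Official Accounts.
Use mitmproxy as a man-in-the-middle proxy to intercept the traffic and parse each account's article list page.
Use Python to fetch the article content.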
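
The three steps hand work to each other through a queue: the proxy pushes article URLs, and a separate Python worker pops and fetches them. Below is a minimal sketch of such a worker loop, not the project's actual code. The names `drain_queue`, `pop_url`, and `fetch` are hypothetical, and an in-memory deque stands in for the Redis list so the snippet runs on its own.

```python
from collections import deque
from typing import Callable, Dict, Optional


def drain_queue(pop_url: Callable[[], Optional[str]],
                fetch: Callable[[str], str]) -> Dict[str, str]:
    """Pop URLs until the queue is empty and fetch each distinct URL once."""
    pages: Dict[str, str] = {}
    while True:
        url = pop_url()
        if url is None:      # queue exhausted
            break
        if url in pages:     # the same article may be pushed more than once
            continue
        pages[url] = fetch(url)
    return pages


# In-memory stand-ins, assumptions for illustration only.
queue = deque([
    "https://mp.weixin.qq.com/s?__biz=EXAMPLE&mid=1",
    "https://mp.weixin.qq.com/s?__biz=EXAMPLE&mid=2",
    "https://mp.weixin.qq.com/s?__biz=EXAMPLE&mid=1",
])


def pop_from_queue() -> Optional[str]:
    return queue.pop() if queue else None


def fake_fetch(url: str) -> str:
    return "<html>%s</html>" % url


pages = drain_queue(pop_from_queue, fake_fetch)
print(len(pages), "distinct pages fetched")
```

In a real deployment, `pop_url` could wrap `redis.Redis.rpop('wechat')` (decoding the returned bytes) and `fetch` could wrap an HTTP client of your choice.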

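Source code

WeChat Official Account scraper project repository

Key source code walkthrough

The Appium part. First we need to identify the Activity behind each screen and the buttons on each Activity that we need to tap.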

from appium import webdriver
import time
from selenium.webdriver.support import expected_conditions as EC
from appium.webdriver.common.touch_action import TouchAction
from selenium.webdriver.common.by import By
from selenium.webdriver.support.wait import WebDriverWait

# Note: desired_capabilities and TouchAction follow the older Appium-Python-Client API.
desired_caps = {
    "platformName": "Android",
    "deviceName": "a",
    "appPackage": "com.tencent.mm",
    "appActivity": "com.tencent.mm.ui.LauncherUI",
    "noReset": True,  # keep the existing WeChat login session
}
url = 'http://localhost:4723/wd/hub'

driver = webdriver.Remote(url, desired_capabilities=desired_caps)
driver.wait_activity('.ui.LauncherUI', timeout=10)
# Open the "Contacts" (通讯录) tab, then the "Official Accounts" (公众号) entry.
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.XPATH, '//*[@text="通讯录"]'))
).click()
driver.find_element(By.XPATH, '//*[@text="公众号"]').click()
driver.wait_activity('.plugin.brandservice.ui.BrandServiceIndexUI', timeout=10)
while True:
    try:
        # Resource ids such as a2y, jy and b0r likely differ between WeChat versions.
        items = driver.find_elements(By.XPATH, '//*[@resource-id="com.tencent.mm:id/a2y"]')
        for item in items:
            # Open the account's chat screen, then its profile page.
            item.click()
            driver.wait_activity('.ui.chatting.ChattingUI', timeout=10)
            driver.find_element(By.ID, 'com.tencent.mm:id/jy').click()
            driver.wait_activity('.plugin.profile.ui.ContactInfoUI', timeout=10)
            # driver.find_element(By.ID, 'com.tencent.mm:id/b0u').click()
            # Swipe up to reveal the lower part of the profile (coordinates are device-specific).
            TouchAction(driver).press(x=569, y=2000).move_to(x=390, y=792).release().perform()
            # Tap the last matching entry (presumably the history page that the proxy intercepts).
            driver.find_elements(By.XPATH, '//*[@resource-id="com.tencent.mm:id/b0r"]')[-1].click()
            driver.wait_activity('.plugin.profile.ui.ContactInfoUI', timeout=10)
            # Back out to the Official Accounts list.
            driver.back()
            driver.back()
            driver.back()
    except Exception as e:
        # Report the error instead of silently swallowing it, then retry.
        print('retrying after error:', e)

    time.sleep(1)
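
The swipe in the script uses absolute pixel coordinates, which only fit one screen size. One way to make it portable is to express the gesture as fractions of the screen and convert them at runtime. The sketch below is an illustration under that assumption, not part of the original project; `swipe_points` and its default fractions are hypothetical.

```python
from typing import Tuple

Point = Tuple[int, int]


def swipe_points(width: int, height: int,
                 start: Tuple[float, float] = (0.5, 0.85),
                 end: Tuple[float, float] = (0.5, 0.35)) -> Tuple[Point, Point]:
    """Convert fractional screen positions into pixel coordinates."""
    for fx, fy in (start, end):
        if not (0.0 <= fx <= 1.0 and 0.0 <= fy <= 1.0):
            raise ValueError("fractions must be within [0, 1]")

    def to_px(frac: Tuple[float, float]) -> Point:
        # round() avoids off-by-one results caused by float representation
        return round(width * frac[0]), round(height * frac[1])

    return to_px(start), to_px(end)


# Example with an assumed 1080x2160 screen.
start_xy, end_xy = swipe_points(1080, 2160)
print(start_xy, end_xy)
```

With a live session, the width and height could come from `driver.get_window_size()`, and the two points could be passed to the gesture call in place of the hard-coded values.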

mitmproxy man-in-the-middle code

import sys
sys.path.append('..')
sys.path.append('../..')
sys.path.append('../../..')
import json
import redis

from wechat.settings import QUEUES

QUEUE_CONF = QUEUES['tasks']
r = redis.Redis(**QUEUE_CONF)

class WeChatProxyHandler():
    # History (home) page of an Official Account.
    url = 'https://mp.weixin.qq.com/mp/profile_ext?action=home'

    def response(self, flow):
        if self.url not in flow.request.url:
            return
        for line in flow.response.text.split('\n'):
            line = line.strip()
            if 'var msgList' not in line:
                continue
            # The line looks like: var msgList = '{...}'; with quotes escaped as &quot;
            raw = line[len('var msgList = ') + 1:-2].replace('&quot;', '"')
            # Use json.loads rather than eval: never execute remote content as Python code.
            msg_list = json.loads(raw)
            urls = [item.get('app_msg_ext_info', {}).get('content_url') for item in msg_list['list']]
            # Drop empty entries and unescape any remaining "\/" sequences.
            urls = [url.replace('\\/', '/') for url in urls if url]
            if urls:  # LPUSH fails when called without values
                r.lpush('wechat', *urls)

addons = [
    WeChatProxyHandler()
]
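
The parsing step can be tried without a proxy or Redis. The following sketch isolates it in a pure function and feeds it a made-up page. `extract_article_urls` and `sample_page` are hypothetical; the sample only mirrors the `var msgList = '...';` shape that the handler above expects, and real pages may differ.

```python
import json
import re
from typing import List

# Captures the single-quoted JSON string assigned to msgList.
MSG_LIST_RE = re.compile(r"var msgList = '(.*?)';")


def extract_article_urls(page: str) -> List[str]:
    """Return the article URLs found in the page's msgList, if any."""
    match = MSG_LIST_RE.search(page)
    if match is None:
        return []
    payload = match.group(1).replace("&quot;", '"')
    data = json.loads(payload)  # json.loads also decodes "\/" into "/"
    urls = []
    for item in data.get("list", []):
        url = item.get("app_msg_ext_info", {}).get("content_url")
        if url:
            urls.append(url)
    return urls


# Made-up sample: one message with an article, one without.
sample_page = """
<script>
    var msgList = '{&quot;list&quot;:[{&quot;app_msg_ext_info&quot;:{&quot;content_url&quot;:&quot;http:\\/\\/mp.weixin.qq.com\\/s?x=1&quot;}},{&quot;comm_msg_info&quot;:{}}]}';
</script>
"""
print(extract_article_urls(sample_page))
```

Keeping the parser separate from the addon makes it easy to test against saved responses when the page format changes.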

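Summary

A man-in-the-middle proxy can do a lot for us:
- When using Splash, it can intercept and drop the requests that take a long time to load
- With JavaScript injection, it can drive automatic pagination while scraping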
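
Both ideas above can be sketched as one mitmproxy-style addon. This is an illustration under stated assumptions, not code from the project: the class name `BlockAndInject`, the blocked suffix list, and the injected scroll snippet are all hypothetical, and real pagination logic depends on the target page. Fake flow objects are used so the snippet runs without mitmproxy installed.

```python
from types import SimpleNamespace


class BlockAndInject:
    """Drop requests for heavy resources and inject a script into HTML pages."""

    BLOCKED_SUFFIXES = (".png", ".jpg", ".gif", ".woff2")  # assumed list
    SNIPPET = "<script>window.scrollTo(0, document.body.scrollHeight);</script>"

    def request(self, flow):
        # Ignore the query string when checking the file suffix.
        path = flow.request.url.split("?", 1)[0]
        if path.lower().endswith(self.BLOCKED_SUFFIXES):
            flow.kill()  # the request never reaches the server

    def response(self, flow):
        content_type = flow.response.headers.get("content-type", "")
        if "text/html" in content_type and "</body>" in flow.response.text:
            # Insert the script right before the closing body tag.
            flow.response.text = flow.response.text.replace(
                "</body>", self.SNIPPET + "</body>", 1)


addons = [BlockAndInject()]

# Fake flows standing in for mitmproxy's flow objects.
killed = []
fake_image = SimpleNamespace(
    request=SimpleNamespace(url="https://example.com/banner.png?v=1"),
    kill=lambda: killed.append(True),
)
fake_page = SimpleNamespace(
    response=SimpleNamespace(
        headers={"content-type": "text/html; charset=utf-8"},
        text="<html><body>hi</body></html>",
    )
)

addon = addons[0]
addon.request(fake_image)
addon.response(fake_page)
print(killed, fake_page.response.text)
```

Run under mitmproxy, the same class would receive real flows through the `request` and `response` hooks.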

阳光下的小树
