Scraping WebSocket data with selenium + mitmproxy (data from bet365)

The main idea: mitmproxy runs as a proxy and intercepts the WebSocket data, while selenium launches a browser that is configured to use that proxy.

[Image: the bet365 football page]

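You can see that bet365's football data is transmitted over WebSocket. My first plan was to treat it like an ordinary web scraper and parse the page, but the site's categories are extremely convoluted and the data changes quickly, so that approach was nowhere near efficient enough. That led me to put a proxy in front of the browser and intercept its traffic. Some of the key code follows.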

1. mitmproxy: set up the proxy. Start it with: mitmdump -s proxy.py

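# proxy.py
import json

from mitmproxy import http, websocket


class Counter:
    def __init__(self):
        self.num = 0

    # The site has a lot of anti-scraping measures, and plain selenium cannot open the page.
    # Misspelling "webdriver" in the responses breaks the page's check for it.
    def response(self, flow: http.HTTPFlow):
        if 'https://www.365066365.com' in flow.request.url:
            html = flow.response.text
            if html:
                flow.response.set_text(html.replace("webdriver", "webdirver"))

    # I was going through a Hong Kong VPN, so the page came back in Traditional Chinese.
    # Rewriting the lng value in the cookie takes care of that.
    def request(self, flow: http.HTTPFlow):
        if 'https://www.365066365.com' in flow.request.url:
            cookies = flow.request.headers.get("Cookie")
            if cookies:  # some requests carry no Cookie header
                flow.request.headers["Cookie"] = cookies.replace("lng=2", "lng=10")

    # mitmproxy's hook for intercepted websocket messages; here the data is simply saved locally.
    # This is the signature used by mitmproxy releases before 7. From 7 on, the hook receives
    # an http.HTTPFlow and the messages are in flow.websocket.messages.
    def websocket_message(self, flow: websocket.WebSocketFlow):
        # Binary frames are bytes, which json cannot serialise, so decode them first
        messages = [
            m.content.decode("utf-8", "replace") if isinstance(m.content, bytes) else m.content
            for m in flow.messages
        ]
        with open("source_data.json", "w", encoding="utf-8") as f:
            json.dump(messages, f, ensure_ascii=False)


addons = [
    Counter()
]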
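
The hook above rewrites the whole file every time a message arrives. As a possible alternative, here is a minimal sketch that appends one JSON line per message instead. The helper names `to_text` and `append_message` and the `from_client` field in the record are my own choices for illustration, not part of the original proxy.py.

```python
import json
from typing import Union


def to_text(content: Union[str, bytes]) -> str:
    """Return message content as str so that json can serialise it."""
    if isinstance(content, bytes):
        return content.decode("utf-8", errors="replace")
    return content


def append_message(path: str, content: Union[str, bytes], from_client: bool) -> None:
    """Append a single message to `path` as one JSON line."""
    record = {"from_client": from_client, "content": to_text(content)}
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")
```

Inside websocket_message this would presumably be called with the newest message only, for example the last element of flow.messages, so each message is written once.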

2. selenium: point the browser at the proxy and launch it. The key code is below.

# I use Firefox here because headless Firefox deployed on the server without trouble, whereas Chrome caused a pile of problems
from selenium import webdriver

if __name__ == '__main__':
    # Start the proxy before launching the browser:
    # mitmdump -s proxy.py
    option = webdriver.FirefoxOptions()
    # Proxy preferences go on the options object (Selenium 4 no longer takes a FirefoxProfile argument)
    option.set_preference('network.proxy.type', 1)  # 1 = manual proxy configuration
    option.set_preference('network.proxy.http', '127.0.0.1')
    option.set_preference('network.proxy.http_port', 8080)
    option.set_preference('network.proxy.ssl', '127.0.0.1')
    option.set_preference('network.proxy.ssl_port', 8080)
    option.add_argument('--disable-gpu')
    browser = webdriver.Firefox(options=option)
    browser.maximize_window()
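
Since the proxy has to be running before the browser starts, a small check can save some confusion. This is only a sketch: `proxy_is_up` is a hypothetical helper, and the host and port defaults are simply the values used in the preferences above.

```python
import socket


def proxy_is_up(host: str = "127.0.0.1", port: int = 8080, timeout: float = 1.0) -> bool:
    """Return True if something accepts TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timed out, or host unreachable
        return False
```

Calling it right before webdriver.Firefox(...) and stopping with a clear message when it returns False would be one way to use it. It only shows that the port is open, not that mitmdump is the process listening on it.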

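3. After that the data is there, in the file saved earlier (source_data.json in proxy.py above). The data shown below, result_data.json, is what I got after parsing it.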

[Image: the parsed data in result_data.json]
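
Before any site-specific parsing, the saved file can be loaded and inspected. The sketch below assumes the file is a JSON list of message strings, which is what proxy.py writes. `load_messages` and `summarize` are hypothetical helpers, and the actual parsing of the bet365 payloads is not shown.

```python
import json
from typing import Dict, List


def load_messages(path: str = "source_data.json") -> List[str]:
    """Load the saved messages, keeping only the string entries."""
    with open(path, encoding="utf-8") as f:
        data = json.load(f)
    return [m for m in data if isinstance(m, str)]


def summarize(messages: List[str]) -> Dict[str, int]:
    """Give a quick overview: how many messages, how many characters in total."""
    return {"count": len(messages), "total_chars": sum(len(m) for m in messages)}
```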

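Summary: this approach should work for any WebSocket, but it is too heavyweight. I am now looking into reverse-engineering the site's JS so that I can open a WebSocket connection to the server directly.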