Crawling WebSocket data with Selenium + mitmproxy

The main idea: mitmproxy opens a local proxy and intercepts the WebSocket traffic, while Selenium starts the browser and points it at that proxy.

(Screenshot: the bet365 football page)

As the screenshot shows, bet365's football data is delivered over WebSocket. My first approach was a normal web crawler that parsed the pages, but the site's category structure is convoluted and the data changes fast, so that approach was nowhere near efficient enough. Instead I proxy the browser and intercept its traffic directly. The key code follows.

1. mitmproxy sets up the proxy; start it with the command: mitmdump -s proxy.py

# proxy.py
import json

from mitmproxy import websocket, http

class Counter:
    def __init__(self):
        self.num = 0      
    # The site runs a lot of anti-crawler checks; a plain Selenium browser cannot
    # open the page. Mangling the "webdriver" string in the served HTML defeats
    # the detection script.
    def response(self, flow: http.HTTPFlow):
        if 'https://www.365066365.com' in flow.request.url:
            html = flow.response.text
            html = html.replace("webdriver", "webdirver")
            flow.response.set_text(html)
    # I browse through a Hong Kong VPN, so pages come back in Traditional Chinese;
    # rewriting the lng cookie fixes that.
    def request(self, flow: http.HTTPFlow):
        if 'https://www.365066365.com' in flow.request.url:
            cookies = flow.request.headers.get("Cookie")
            cookies = cookies.replace("lng=2", "lng=10")
            flow.request.headers["Cookie"] = cookies
    # mitmproxy's hook for intercepting WebSocket messages; here I simply save
    # the data to a local file.
    def websocket_message(self, flow: websocket.WebSocketFlow):
        with open("source_data.json", "w", encoding="utf-8") as f:
            json.dump([item.content for item in flow.messages], f, ensure_ascii=False)
addons = [
    Counter()
]
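One caveat: the hook above targets the older mitmproxy API (`websocket.WebSocketFlow`); since mitmproxy 7 the hook receives an `http.HTTPFlow` and the frames live on `flow.websocket.messages`. Also, rewriting the whole file on every frame gets slow, so appending one JSON line per frame is lighter. A minimal sketch under those assumptions (the `frame_record` helper and the `source_data.jsonl` filename are my own, not from the original code):

```python
import json

def frame_record(content, from_client):
    """Normalize one WebSocket frame into a JSON-serializable dict.
    Text frames arrive as str, binary frames as bytes."""
    if isinstance(content, bytes):
        content = content.decode("utf-8", errors="replace")
    return {"from_client": from_client, "data": content}

# Inside a mitmproxy >= 7 addon, the hook would look roughly like this
# (sketch, not verified against every mitmproxy release):
#
#     def websocket_message(self, flow):          # flow: http.HTTPFlow
#         msg = flow.websocket.messages[-1]       # the frame that just arrived
#         with open("source_data.jsonl", "a", encoding="utf-8") as f:
#             f.write(json.dumps(frame_record(msg.content, msg.from_client),
#                                ensure_ascii=False) + "\n")
```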

2. Selenium sets the browser proxy and starts the browser; the key code follows.

# I use Firefox here because headless Firefox deploys on a server without trouble,
# whereas Chrome gave me a pile of problems.
from selenium import webdriver

if __name__ == '__main__':
    # Start the proxy first: mitmdump -s proxy.py
    profile = webdriver.FirefoxProfile()
    profile.set_preference('network.proxy.type', 1)
    profile.set_preference('network.proxy.http', '127.0.0.1')
    profile.set_preference('network.proxy.http_port', 8080)
    profile.set_preference('network.proxy.ssl', '127.0.0.1')
    profile.set_preference('network.proxy.ssl_port', 8080)
    profile.update_preferences()
    option = webdriver.FirefoxOptions()
    option.add_argument('--disable-gpu')
    browser = webdriver.Firefox(profile, firefox_options=option)
    browser.maximize_window()
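Note that `FirefoxProfile` (and the `firefox_options=` keyword) are deprecated in Selenium 4; the same proxy preferences go on `Options` instead. A sketch under that assumption, where `proxy_prefs` is a helper name of my own:

```python
def proxy_prefs(host, port):
    """Firefox about:config preferences for a manual HTTP + SSL proxy.
    network.proxy.type 1 means 'manual proxy configuration'."""
    return {
        "network.proxy.type": 1,
        "network.proxy.http": host,
        "network.proxy.http_port": port,
        "network.proxy.ssl": host,
        "network.proxy.ssl_port": port,
    }

# With Selenium 4 the preferences are set on Options (sketch):
#
#     from selenium import webdriver
#     from selenium.webdriver.firefox.options import Options
#
#     opts = Options()
#     for key, value in proxy_prefs("127.0.0.1", 8080).items():
#         opts.set_preference(key, value)
#     browser = webdriver.Firefox(options=opts)
```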

3. After that the data arrives in the source_data.json file saved by the proxy. The data below is what I parsed out of it.

(Screenshot: the parsed data)

Summary: this approach should work for any WebSocket-based site. It is, however, quite heavyweight; I am still studying the site's JavaScript so I can reverse it and open a WebSocket connection to the server directly.
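For the direct-connection route, a client library such as websocket-client normally handles the protocol, but the handshake itself is simple: the server answers the client's Sec-WebSocket-Key with an accept token defined in RFC 6455. A stdlib sketch of that calculation (just the handshake math, not a full client):

```python
import base64
import hashlib

# Fixed GUID from RFC 6455, section 1.3
WS_GUID = "258EAFA5-E914-47DA-95CA-C5AB0DC85B11"

def accept_key(sec_websocket_key: str) -> str:
    """Compute the Sec-WebSocket-Accept value for a handshake key."""
    digest = hashlib.sha1((sec_websocket_key + WS_GUID).encode("ascii")).digest()
    return base64.b64encode(digest).decode("ascii")

# The sample handshake from RFC 6455:
# accept_key("dGhlIHNhbXBsZSBub25jZQ==") -> "s3pPLMBiTxaQ9kYGzzhZRbK+xOo="
```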


Origin blog.csdn.net/weixin_45673647/article/details/115006926