A handy crawler tool! Use it to process and save Ajax data in real time

We often run into this problem when writing crawlers:

A website loads its data through Ajax, but the Ajax interface is encrypted and takes real effort to crack. If we want to sidestep the cracking and still grab the data, we can fall back on a tool such as Selenium. Selenium can simulate clicks and page turns, but it has no easy way to get at the Ajax data itself, and extracting the data from the rendered HTML is tedious.

Perhaps you will think to yourself: If only I could use Selenium to drive the page and at the same time save the data requested by Ajax.

There is a way, of course: put a proxy in front of the browser, for example mitmdump, and process the traffic in real time.

But is there a good way to do it without a proxy?

Here we introduce a tool called Ajax-hook. It intercepts every Ajax request and response on the page, so we can process Ajax data in real time.

Ajax Hook

Everyone has heard of hooks, so I won't go into detail here. If the idea is new to you, search for "Hook Technology" and you will find plenty of material.

An Ajax hook, as the name implies, hooks Ajax requests. What are the two most important parts of Ajax? The Request and the Response, of course. With a hook, we can run our own processing both before a Request is sent and after its Response arrives.

The basic point of action is shown in the figure:

So how do we hook Ajax requests? We have to look at how Ajax is implemented natively. Ajax is built on the XMLHttpRequest object, so hooking Ajax's Request and Response comes down to wrapping some of that object's members, such as send and onreadystatechange.
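To make the wrapping idea concrete, here is a rough Python analogy (not the real Ajax-hook code, which is JavaScript). FakeXHR, install_hook, and captured are made-up names for illustration. The point is only that replacing send with a wrapper lets us see both the outgoing request and the returned response.

```python
class FakeXHR:
    """Toy stand-in for XMLHttpRequest; it does not make real HTTP calls."""

    def send(self, url):
        # Pretend the server echoes the URL back in a JSON-like body.
        return '{"echo": "%s"}' % url


captured = []  # (url, response) pairs seen by the hook


def install_hook(cls):
    """Replace cls.send with a wrapper that records request and response."""
    original_send = cls.send

    def hooked_send(self, url):
        # Before the request: the hook can inspect the outgoing URL here.
        response = original_send(self, url)
        # After the response: record it, then hand it back unchanged.
        captured.append((url, response))
        return response

    cls.send = hooked_send


install_hook(FakeXHR)
body = FakeXHR().send('/api/movie?page=1')
print(body)
print(captured)
```

The caller still gets the same response as before; the hook just gets a look at it on the way through.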

That sounds troublesome, but don't worry: someone has already written it and we can use it directly. The GitHub address is: https://github.com/wendux/Ajax-hook.

The internal implementation is quite simple, and I have only sketched it here. If you want to understand it in depth, you can read this article: https://www.jianshu.com/p/7337ac624b8e.

OK, how do you use this?

The author of Ajax-hook provides two main methods, proxy and hook, both of which work by hooking XMLHttpRequest.

Here is the official introduction:

Both the proxy and hook methods can intercept the global XMLHttpRequest. They differ in granularity: hook is fine-grained and can target a specific method, attribute, or callback of the XMLHttpRequest object, but it is more cumbersome to use, since business logic often ends up scattered across several callbacks, which is error-prone. proxy works at a higher level of abstraction and builds a request context, so the request information (config) is directly available in every callback, which makes it simpler and more efficient to use.

In most cases, we recommend using the proxy method unless the proxy method cannot meet your needs.

So let's look at the proxy method. Its usage is as follows:

proxy({
    // Called before the request is sent
    onRequest: (config, handler) => {
        console.log(config.url)
        handler.next(config);
    },
    // Called when the request fails, e.g. on timeout. Note: HTTP status errors are not included; a 404 still counts as a successful request
    onError: (err, handler) => {
        console.log(err.type)
        handler.next(err)
    },
    // Called after the request succeeds
    onResponse: (response, handler) => {
        console.log(response.response)
        handler.next(response)
    }
})

Ajax-hook gives us three callbacks to override: onRequest, onResponse, and onError, which run before the request is sent, after the request succeeds, and when an error occurs, respectively.

For crawling, what we want is to intercept the Response, so implementing onResponse is enough.

Look closer: onResponse receives two parameters, a response object and a handler object, both wrapped for us by Ajax-hook. We only need the content of the response: printing the response body is the same as printing the result the Ajax request returned.

OK, then let's try it.

Case Introduction

Let's take a case of my own, the link is: https://dynamic2.scrape.center/, the interface is as follows:

This is a movie data website. The data is loaded through Ajax, but the Ajax requests carry an encrypted token parameter, as shown in the figure:

Working out how this parameter is generated is not difficult, but it takes some time.

Then look at the return result of Ajax, as shown in the figure:

Clean and clear! It would be ideal to grab this data at the moment the Ajax Response arrives.

How? With the Ajax-hook just mentioned, naturally.

So let's use Ajax-hook to process this data in real time.

Actual operation

The first step is getting Ajax-hook onto the page: the library has to be loaded before we can use it. How do we load it into a page in the browser?

There are several options, such as pasting the JavaScript into the console, using Tampermonkey, or using Selenium.

Here we take the simplest route and let Selenium execute the Ajax-hook source code automatically.

First we need the Ajax-hook source code, which is on GitHub. The link is: https://raw.githubusercontent.com/wendux/Ajax-hook/master/dist/ajaxhook.min.js, as the picture shows:

Look, the amount of code is really small.

To try it out, we copy this code and paste it into the console of https://dynamic2.scrape.center/.

This gives us an ah object, which represents Ajax-hook, and we can call its proxy method.

How? Just implement onResponse and print the Response. The implementation is as follows:

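ah.proxy({
    // Called after the request succeeds
    onResponse: (response, handler) => {
        if (response.config.url.startsWith('/api/movie')) {
            console.log(response.response)
        }
        // Always pass the response on, otherwise non-matching requests never complete
        handler.next(response)
    }
})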

Paste this code into the console and run it. The Ajax Response is now hooked: whenever an Ajax request to the movie API completes, its Response is printed.

At this time, if we click to turn the page and trigger a new Ajax request, we can see that the console outputs the result of Response, as shown in the figure:

Well, now we can get Ajax data.

Data forwarding

Now the data is in the browser. How do we save it?

Saving it from inside the browser is awkward. The easiest way is to forward the data to an interface of your own and save it there.

So let's build a simple interface with Flask. Remember to lift the cross-origin restriction. The implementation is as follows:

import json
from flask import Flask, request, jsonify
from flask_cors import CORS

app = Flask(__name__)
# Allow cross-origin requests, since the page being crawled is on another origin
CORS(app)

@app.route('/receiver/movie', methods=['POST'])
def receive():
    # Parse the JSON body posted by the hook
    content = json.loads(request.data)
    print(content)
    # do something with the data here
    return jsonify({'status': True})

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=80, debug=True)

This is just a simple example: an API at /receiver/movie that accepts POST requests, prints the posted data, and returns a response.

Of course, you can do much more here, such as trimming the data and storing it in a database.
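As one possible way to do the storing part, here is a minimal sketch using SQLite from the standard library. The save_payload function, the ajax_responses table, and the sample body are all made up for illustration; the only thing taken from this article is the payload shape, a dict with url and data keys. The function accepts data as either JSON text or an already-parsed object, since which one arrives depends on how the page issued the request.

```python
import json
import sqlite3


def save_payload(conn, payload):
    """Store one forwarded payload as a row of (page_url, body)."""
    data = payload['data']
    # The browser may forward the body as JSON text or as a parsed object.
    if isinstance(data, str):
        data = json.loads(data)
    conn.execute(
        'CREATE TABLE IF NOT EXISTS ajax_responses (page_url TEXT, body TEXT)'
    )
    conn.execute(
        'INSERT INTO ajax_responses (page_url, body) VALUES (?, ?)',
        (payload['url'], json.dumps(data, ensure_ascii=False)),
    )
    conn.commit()


# Hypothetical sample payload; the real response fields may differ.
conn = sqlite3.connect(':memory:')
save_payload(conn, {
    'url': 'https://dynamic2.scrape.center/',
    'data': '{"count": 1, "results": [{"name": "example"}]}',
})
rows = conn.execute('SELECT page_url, body FROM ajax_responses').fetchall()
print(rows)
```

In the Flask view, a call like save_payload(conn, content) could take the place of the print, with conn pointing at a database file instead of ':memory:'.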

Okay, now that the server is ready, let's send the data to it from the Ajax-hook side.

Here we use the axios library, available at https://unpkg.com/[email protected]/dist/axios.min.js, which also works when executed in a browser.

After loading axios, we change the previous proxy call to the following:

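ah.proxy({
    // Called after the request succeeds
    onResponse: (response, handler) => {
        if (response.config.url.startsWith('/api/movie')) {
            // Forward the page URL and the response body to our own server
            axios.post('http://localhost/receiver/movie', {
                url: window.location.href,
                data: response.response
            })
            console.log(response.response)
        }
        // Always pass the response on, otherwise non-matching requests never complete
        handler.next(response)
    }
})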

This calls axios's post method and sends the current URL and the Response data to the server.

From now on, the response of each matching Ajax request is sent to the Flask server, which can store and process it.
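Before wiring up the browser, it can help to check the receiver with a hand-made request of the same shape. The sketch below only builds such a request with the standard library; build_forward_request is a made-up helper name and the sample body is invented. Actually sending it assumes the Flask server above is running on localhost port 80.

```python
import json
import urllib.request


def build_forward_request(receiver_url, page_url, body):
    """Build a POST request with the same JSON shape the hook sends."""
    payload = json.dumps({'url': page_url, 'data': body}).encode('utf-8')
    return urllib.request.Request(
        receiver_url,
        data=payload,
        headers={'Content-Type': 'application/json'},
        method='POST',
    )


req = build_forward_request(
    'http://localhost/receiver/movie',
    'https://dynamic2.scrape.center/',
    '{"count": 0, "results": []}',  # invented sample body
)
print(req.get_method(), req.full_url)
# To actually send it once the Flask server is running:
# urllib.request.urlopen(req)
```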

Automation

OK, now we can intercept Ajax and forward the data. The last step is to automate the crawl.

Automation is divided into three parts:

  • Open the website.
  • Inject the code for Ajax-hook, axios, and the proxy call.
  • Click the next-page button automatically to turn pages.

The second step is the most important. We put the Ajax-hook source, the axios source, and the proxy code from above into a hook.js file and run it with Selenium's execute_script.
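If the three pieces are saved as separate files, hook.js can be assembled with a few lines of Python instead of by hand. This is only a sketch: build_hook_js and the file names in the usage comment are made up, and the order matters, because the proxy code needs ah and axios to exist already.

```python
from pathlib import Path


def build_hook_js(parts, output_path):
    """Concatenate JavaScript source files, in order, into one file."""
    # A ';' between parts guards against a minified file that has no
    # trailing semicolon running into the next file.
    combined = '\n;\n'.join(
        Path(part).read_text(encoding='utf-8') for part in parts
    )
    Path(output_path).write_text(combined, encoding='utf-8')
    return combined


# Example usage with hypothetical file names:
# build_hook_js(['ajaxhook.min.js', 'axios.min.js', 'proxy.js'], 'hook.js')
```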

The other steps are simple. The final implementation is as follows:

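from selenium import webdriver
from selenium.webdriver.common.by import By
import time

browser = webdriver.Chrome()
browser.get('https://dynamic2.scrape.center/')
# Inject Ajax-hook, axios, and the proxy code into the page
with open('hook.js', encoding='utf-8') as f:
    browser.execute_script(f.read())
time.sleep(2)

for index in range(10):
    print('current page', index)
    # find_element_by_css_selector was removed in Selenium 4
    btn_next = browser.find_element(By.CSS_SELECTOR, '.btn-next')
    btn_next.click()
    time.sleep(2)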

Finally, run it.

You will see the browser open the page and then click through the next pages. Back on the Flask server side, the Ajax data has been received, as shown in the figure:

OK, that's it.

Summary

At this point, we have completed:

  • Ajax Response Hook
  • Data forwarding and receiving
  • Browser automation

When we encounter similar situations in the future, we can deal with them in the same way.

Code for this section: https://github.com/Python3WebSpider/AjaxHookSpider.

Origin blog.csdn.net/zhangge3663/article/details/108658819