Python Ajax Data Crawling: An Overview

Introduction to Ajax

Ajax (short for Asynchronous JavaScript and XML) is a technique that uses JavaScript to exchange data with the server and update parts of a page without refreshing the page or changing its URL. There are many examples of Ajax in use, such as Sina Weibo and other sites whose feeds load more content on demand.

Ajax Analysis

With this preliminary understanding of Ajax, we know that its loading process consists of three steps: sending the request, parsing the content, and rendering the page. So how do we tell whether a page loads its content dynamically through Ajax requests, and how do we find the address of those requests?

To determine whether a page is loaded via Ajax requests, we can use Chrome's developer tools. Taking the site above as an example, open the Network panel in Chrome DevTools and select the XHR filter (this filters by request type; XHR is the type used for Ajax requests), then refresh the page to see all of the current Ajax requests.

Then scroll to the bottom of the page and click "load more": a new request appears in the request list, as shown. Click it a few more times and still more new requests appear, so we can conclude that this content is loaded via Ajax.

From here, we can inspect each request's headers to find the exact source of the data. The Request URL in the image above is the source address of the content that was just loaded; opening it in a new tab, we see the following:

At first glance this looks like JSON. Parsing it confirms the result we expected, which also proves that the Request URL found earlier is indeed the source of the page's Ajax data.
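As a sketch of that parsing step, here is how the JSON returned by such an endpoint can be handled in Python. The field names `data` and `articles` are hypothetical; the real structure depends on the site you are inspecting:

```python
import json

# A hypothetical Ajax response body -- real field names depend on the site.
sample = '{"data": {"articles": [{"title": "Example post", "url": "https://example.com/a/1"}]}}'

payload = json.loads(sample)            # parse the JSON text into Python objects
articles = payload["data"]["articles"]  # drill into the (assumed) structure
for article in articles:
    print(article["title"], article["url"])
```

Once the structure is known, extracting the fields you want is just ordinary dictionary and list access.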

Ajax data acquisition

Based on the analysis above, we now have one way to obtain Ajax data: work out the pattern of the URLs that make up the Ajax requests, then request those URLs directly and parse the returned data. This approach goes straight to the data source and performs well, but the analysis cost is usually high: not every site's URL pattern is easy to work out, some are obscured by encryption mechanisms, and JavaScript-reading skills are often needed to assist the analysis.
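A minimal sketch of this direct approach, assuming the Network panel revealed a simple paginated endpoint. The URL and the `page`/`size` parameter names below are placeholders, not a real API:

```python
import requests

# Placeholder endpoint -- substitute the Request URL found in the Network panel.
BASE_URL = "https://example.com/api/articles"

def build_url(page, size=20):
    """Reconstruct the URL pattern observed for the Ajax requests (assumed here)."""
    return f"{BASE_URL}?page={page}&size={size}"

def fetch_page(page):
    """Fetch one page of Ajax data and return the parsed JSON."""
    headers = {"X-Requested-With": "XMLHttpRequest"}  # header browsers send with Ajax requests
    resp = requests.get(build_url(page), headers=headers, timeout=10)
    resp.raise_for_status()
    return resp.json()
```

If the URL pattern is this regular, a loop over `page` values is enough to collect the whole dataset.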

We therefore propose an alternative strategy: use Selenium to simulate browser behavior and obtain the dynamically rendered data. So what is Selenium? It is essentially a robot that can simulate human interactions with a browser, such as clicking, typing, and dragging. It was originally designed mainly for automated page testing, but its characteristics fit web crawling well, so it is widely used in the crawling field. From the server's point of view, it looks like a person visiting the page rather than a crawler, so it is hard to block; on the other hand, fetching Ajax data with it is costly and more complicated, with lower performance than URL analysis.

Both methods above are commonly used to obtain Ajax data. Which one to use depends on how easy it is to work out the pattern of the Ajax data source URLs: if the pattern is regular, fetch the data directly with requests; if it is more complex, consider the Selenium strategy (more notes on this will be given in a later post).


Origin: blog.csdn.net/fei347795790/article/details/91978897