029. (7.30) Ajax data crawling

When we fetch a page with requests, the result may differ from what the browser shows: the browser displays the page data normally, but the response obtained with requests does not contain it. This is because requests returns only the original HTML document, whereas the page in the browser is the result of JavaScript processing data after that document has loaded. The data can come from several sources: it may be loaded via Ajax, it may be embedded in the HTML document, or it may be computed by JavaScript using a specific algorithm.
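
A quick way to confirm this situation is to check whether text that is visible in the browser appears in the raw HTML at all. The sketch below is only illustrative: missing_from_raw_html is a hypothetical helper, and the sample HTML stands in for what requests.get(url).text might return for such a page.

```python
def missing_from_raw_html(raw_html, visible_texts):
    """Return the texts seen in the browser that are absent from the raw HTML."""
    return [text for text in visible_texts if text not in raw_html]


# Stand-in for requests.get(url).text on a page that fills its list via JavaScript.
raw_html = """
<html>
  <body>
    <h1>Latest posts</h1>
    <div id="feed"></div>
    <script src="/static/app.js"></script>
  </body>
</html>
"""

# "Latest posts" is in the document; "Hello from Ajax" is added later by a script.
missing = missing_from_raw_html(raw_html, ["Latest posts", "Hello from Ajax"])
print(missing)
```

Any text reported as missing was probably produced after the initial load, which is the cue to look for an Ajax request.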

With Ajax (Asynchronous JavaScript and XML, a technique rather than a programming language), data is loaded asynchronously. The original page does not initially contain the data. After it has loaded, the page requests data from an interface on the server, then processes that data and renders it on the page. That follow-up request is an Ajax request.

Loading data through Ajax and then rendering it lets the front end and back end be developed separately, and it relieves the server of the load of rendering every page itself. Ajax pages are the trend in Web development.

Therefore, when you encounter an Ajax page, you need to analyze the Ajax requests that the page sends to its back-end interface. If you can reproduce those requests with requests, you can crawl the data successfully.
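
As a minimal sketch of what simulating an Ajax request could look like, the code below builds the request URL and sends it with requests. The endpoint https://example.com/api/list, the page and type parameters, the header values, and the helper names are all assumptions made for illustration; the real ones have to be read from the target site.

```python
from urllib.parse import urlencode

# Headers a browser might attach to an Ajax call; the values are illustrative.
AJAX_HEADERS = {
    "X-Requested-With": "XMLHttpRequest",
    "Accept": "application/json",
    "User-Agent": "Mozilla/5.0",
}


def build_ajax_url(endpoint, params):
    """Join the interface URL and its query parameters into one request URL."""
    return f"{endpoint}?{urlencode(params)}"


def fetch_ajax_json(endpoint, params, timeout=10):
    """Send the simulated Ajax request and return the decoded JSON body."""
    # Third-party package; imported here so build_ajax_url works without it.
    import requests

    response = requests.get(
        endpoint, params=params, headers=AJAX_HEADERS, timeout=timeout
    )
    response.raise_for_status()  # fail loudly on 4xx/5xx responses
    return response.json()


# Hypothetical endpoint and parameters, as they might be copied from the browser.
url = build_ajax_url("https://example.com/api/list", {"page": 1, "type": "post"})
print(url)
```

Calling fetch_ajax_json("https://example.com/api/list", {"page": 1, "type": "post"}) would perform the actual request; it is left uncalled here because the endpoint is made up.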

Ajax analysis method

Because we want to simulate Ajax requests, we need to know where these background Ajax operations can be observed, how they are sent, and what parameters they carry.

Open the Network tab in the browser's developer tools. It records every request sent and every response received between the browser and the server while the page loads.

Ajax requests have their own request type in this list, shown as xhr.

Click on an Ajax request. In its Request Headers there is an entry X-Requested-With: XMLHttpRequest, which marks the request as an Ajax request. Click Preview to see a parsed view of the response content, which is in JSON format. The Response tab shows the raw data actually returned.
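
The header check described above can be sketched as a small function. is_ajax_request is a hypothetical helper name, and the header dictionaries are made-up samples. Note that this header is a convention added by many JavaScript libraries rather than by the browser itself, so its absence does not prove that a request is not Ajax.

```python
def is_ajax_request(headers):
    """Return True if the headers carry X-Requested-With: XMLHttpRequest."""
    # HTTP header names are case-insensitive, so normalise before looking up.
    normalised = {name.lower(): value for name, value in headers.items()}
    marker = normalised.get("x-requested-with", "")
    return marker.strip().lower() == "xmlhttprequest"


# Sample request headers, as they might be copied from the Network tab.
ajax_headers = {"Accept": "application/json", "X-Requested-With": "XMLHttpRequest"}
page_headers = {"Accept": "text/html"}

print(is_ajax_request(ajax_headers))
print(is_ajax_request(page_headers))
```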

Because the Request URL, Request Headers, Response Headers, and Response Body are all clearly visible, it becomes easy to simulate the request and extract the data.
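
One way to work with a Request URL copied from the Network tab is to split it into the interface address and its parameters, so that individual parameters can be changed between requests. This is a sketch using the standard library; the URL, the page parameter, and the helper name split_request_url are assumptions for illustration.

```python
from urllib.parse import parse_qsl, urlsplit


def split_request_url(request_url):
    """Split a copied Request URL into (endpoint, params)."""
    parts = urlsplit(request_url)
    endpoint = f"{parts.scheme}://{parts.netloc}{parts.path}"
    # parse_qsl returns (name, value) pairs; values stay as strings.
    params = dict(parse_qsl(parts.query))
    return endpoint, params


endpoint, params = split_request_url("https://example.com/api/list?page=2&type=post")
print(endpoint)
print(params)

# In this made-up case only the page number changes between successive requests.
next_params = {**params, "page": str(int(params["page"]) + 1)}
print(next_params)
```

Comparing the parameters of two or three consecutive requests in this way usually shows which ones vary and which stay fixed.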

Finally, extract the information from the response to the simulated Ajax request. Since the response is usually JSON, reading it with json is the common approach.
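
A minimal sketch of that last step follows. The response body and its field names (data, items, id, title) are invented for illustration, since every interface defines its own structure, and extract_items is a hypothetical helper.

```python
import json

# Stand-in for response.text; with requests, response.json() does the decoding.
body = (
    '{"ok": 1, "data": {"items": ['
    '{"id": 101, "title": "First post"}, '
    '{"id": 102, "title": "Second post"}]}}'
)


def extract_items(payload):
    """Yield (id, title) pairs from the decoded response."""
    # .get with defaults keeps the loop safe if a key is missing.
    for item in payload.get("data", {}).get("items", []):
        yield item.get("id"), item.get("title")


payload = json.loads(body)
rows = list(extract_items(payload))
print(rows)
```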

Examples include: crawling Sina Weibo content, crawling QQ music songs...

Origin blog.csdn.net/u013598957/article/details/107695268