Python web crawler study notes (11): Ajax data crawling

Sometimes when we use requests to fetch a page, the result differs from what we see in the browser: the browser displays the page data normally, but the response obtained with requests does not contain it. This is because requests retrieves only the original HTML document, whereas the page in the browser is the result of JavaScript processing data. That data can come from several sources: it may be loaded through Ajax, it may be embedded in the HTML document, or it may be computed by JavaScript using a specific algorithm.

In the first case, the data is loaded asynchronously. The original page does not initially contain the data. After the original page has loaded, it requests an interface on the server to obtain the data, and the data is then processed and rendered on the web page. This is in fact an Ajax request.

Following the trend of web development, more and more pages take this form. The original HTML document contains no data at all; the data is loaded through Ajax and then presented. This lets the front end and back end be developed separately and reduces the load that server-side page rendering places on the server.

Therefore, if you encounter such a page and use requests or a similar library to fetch the original page directly, you will not get any valid data. Instead, you need to analyze the Ajax requests that the page sends to the backend interface. If you can simulate those Ajax requests with requests, you can crawl the data successfully.

1. Basic introduction

Ajax stands for Asynchronous JavaScript and XML. It is not a programming language, but a technique that uses JavaScript to exchange data with the server and update part of a web page without refreshing the page or changing its URL.

On some web pages, if you keep scrolling down, a loading animation appears, and after a moment new content appears below it. This is Ajax loading in action.

Note that the page is not fully refreshed, which means the page URL has not changed, yet the page has new content, namely the additional Weibo posts that were loaded afterwards. This is the process of fetching new data through Ajax and presenting it.

2. Basic principles

With a preliminary understanding of Ajax, let's look more closely at how it works. The process from sending an Ajax request to updating the web page can be divided into the following three steps:

(1) Send the request; (2) Parse the response content; (3) Render the web page.

We will introduce these processes in detail below.

Sending a request
We know that JavaScript can implement a page's interactive features, and Ajax is no exception: it is also implemented in JavaScript. In essence, code like the following is executed:

var xmlhttp;
if (window.XMLHttpRequest) {
    // code for IE7+, Firefox, Chrome, Opera, Safari
    xmlhttp = new XMLHttpRequest();
} else {
    // code for IE6, IE5
    xmlhttp = new ActiveXObject("Microsoft.XMLHTTP");
}
xmlhttp.onreadystatechange = function() {
    if (xmlhttp.readyState == 4 && xmlhttp.status == 200) {
        document.getElementById("myDiv").innerHTML = xmlhttp.responseText;
    }
};
xmlhttp.open("POST", "/ajax/", true);
xmlhttp.send();

This is the lowest-level JavaScript implementation of Ajax. It creates a new XMLHttpRequest object, assigns a listener to its onreadystatechange property, and then calls the open() and send() methods to send a request to a URL (that is, to the server). This is like sending a request in Python and getting a response, except that here the request is sent by JavaScript. Because a listener has been set, the function assigned to onreadystatechange is triggered when the server returns a response. The response content is handled inside that function; in this example it is written into the element with the id myDiv.
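
To make the send, parse, render flow concrete, here is a minimal offline sketch in Python. It is only an analogy for what the browser does: fake_server, send_request, parse_response and render are hypothetical names, and no real HTTP request is made.

```python
import json


def fake_server(path):
    """Hypothetical stand-in for a server endpoint; returns a JSON string."""
    return json.dumps({"items": ["first post", "second post"]})


def send_request(path):
    # Step 1: send the request (a local function call here, not real HTTP).
    return fake_server(path)


def parse_response(body):
    # Step 2: parse the response content.
    return json.loads(body)["items"]


def render(items):
    # Step 3: render the parsed data into an HTML fragment.
    return "<ul>" + "".join(f"<li>{item}</li>" for item in items) + "</ul>"


fragment = render(parse_response(send_request("/ajax/")))
print(fragment)
```

In a browser, step 3 corresponds to assigning the result to innerHTML. A crawler usually needs only steps 1 and 2.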

Therefore, the real data is obtained through one Ajax request after another. To grab this data, you need to know how these requests are sent, where they are sent, and which parameters they carry. Once we know this, we can use Python to simulate the requests and get the results.
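
One way to find out where a request goes and which parameters it carries is to copy the request URL from the browser's developer tools and take it apart. The sketch below does this with the standard library, using the Weibo URL from the example later in these notes. The variable names are illustrative only.

```python
from urllib.parse import urlsplit, parse_qs

# A request URL as it might be copied from the browser's network panel.
captured_url = ('https://m.weibo.cn/api/container/getIndex'
                '?type=uid&value=2830678474&containerid=1076032830678474&page=2')

parts = urlsplit(captured_url)

# Where the request is sent: scheme, host and path without the query string.
endpoint = f'{parts.scheme}://{parts.netloc}{parts.path}'

# Which parameters are sent: parse_qs returns a list per key, keep the first value.
params = {key: values[0] for key, values in parse_qs(parts.query).items()}

print(endpoint)
print(params)
```

Comparing the parameters of several captured requests shows which values stay fixed and which change from one request to the next.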

3. Hands-on practice

We now use a program to simulate these Ajax requests and crawl the first ten pages of Weibo posts:

from urllib.parse import urlencode
import requests

# First half of the request URL; the query string is appended in get_page().
base_url = 'https://m.weibo.cn/api/container/getIndex?'

headers = {
    'Host': 'm.weibo.cn',
    'Referer': 'https://m.weibo.cn/u/2830678474',
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36',
    'X-Requested-With': 'XMLHttpRequest',
}

def get_page(page):
    # type, value and containerid are fixed; only page changes.
    params = {
        'type': 'uid',
        'value': '2830678474',
        'containerid': '1076032830678474',
        'page': page
    }
    url = base_url + urlencode(params)
    try:
        response = requests.get(url, headers=headers)
        if response.status_code == 200:
            return response.json()
    except requests.ConnectionError as e:
        print('Error', e.args)
    # Reaching this point means a non-200 status or a connection error.
    return None

First, base_url is defined to represent the first half of the request URL. Next, a parameter dictionary is constructed, in which type, value and containerid are fixed parameters and page is the variable parameter. Then the urlencode() method is called to convert the parameters into a GET query string of the form type=uid&value=2830678474&containerid=1076032830678474&page=2. After that, base_url is combined with the query string to form the full URL. We then request this URL with requests, passing the headers argument. Finally, the status code of the response is checked. If it is 200, the json() method is called to parse the content as JSON and return it; otherwise nothing is returned. If a connection error occurs, it is caught and its information is printed.
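
The URL construction step can be checked on its own, without sending any request. The sketch below isolates it in a helper. The name build_url is hypothetical and is not part of the code above.

```python
from urllib.parse import urlencode

base_url = 'https://m.weibo.cn/api/container/getIndex?'


def build_url(page):
    """Hypothetical helper: build the request URL for one page."""
    params = {
        'type': 'uid',
        'value': '2830678474',
        'containerid': '1076032830678474',
        'page': page,
    }
    # urlencode joins key=value pairs with '&' in dictionary order.
    return base_url + urlencode(params)


print(build_url(2))
# https://m.weibo.cn/api/container/getIndex?type=uid&value=2830678474&containerid=1076032830678474&page=2
```

As an alternative, requests.get() accepts a params keyword argument and performs this encoding itself, so the manual urlencode() call is a matter of preference.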

Next, we need to define a parsing method to extract the desired information from the results. For example, to save each post's id, body text, number of likes, number of comments, and number of reposts, we can iterate over cards, read each field from mblog, assign it to a new dictionary, and return it:

from pyquery import PyQuery as pq

def parse_page(json):
    if json:
        items = json.get('data').get('cards')
        for item in items:
            item = item.get('mblog')
            # A card may not contain an mblog entry; skip it to avoid an error.
            if not item:
                continue
            weibo = {}
            weibo['id'] = item.get('id')
            # The text field contains HTML; keep only the plain text.
            weibo['text'] = pq(item.get('text')).text()
            weibo['attitudes'] = item.get('attitudes_count')
            weibo['comments'] = item.get('comments_count')
            weibo['reposts'] = item.get('reposts_count')
            yield weibo

Here we use pyquery to remove the HTML tags in the body.
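
To show what removing the tags does, here is a rough standard-library sketch based on html.parser. It is a substitute for illustration, not how pyquery works internally, and its whitespace handling may differ from pyquery's text(). TextExtractor, strip_tags and the sample string are hypothetical.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects only the text nodes of an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called for text between tags; the tags themselves are ignored.
        self.chunks.append(data)


def strip_tags(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return ''.join(parser.chunks)


# Made-up sample resembling a post body that contains a link and a line break.
sample_text = 'Hello <a href="/n/someone">@someone</a> world<br />'
cleaned = strip_tags(sample_text)
print(cleaned)
```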

Finally, loop over the pages, 10 in total, and print the extracted results:

if __name__ == '__main__':
    for page in range(1, 11):
        json = get_page(page)
        results = parse_page(json)
        for result in results:
            print(result)

The output looks like this:

{'id': '4544491111844364', 'text': '老婆真好啊，今天感觉工作有点累不太开心，然后老婆晚上和我开了一下视频对我笑了笑撒了撒娇，我瞬间又开心又好了，感觉心里暖暖的，我老婆最好了！', 'attitudes': 12, 'comments': 3, 'reposts': 0}
{'id': '4543259479641769', 'text': '我老婆最好看了', 'attitudes': 12, 'comments': 0, 'reposts': 0}
{'id': '4543252072761828', 'text': '不知道大家是否已经对抖音有了一种厌倦？最早的时候我觉得内容质量还行，现在没刷几个视频，很多都是广告、带货、博人眼球、摆拍、空洞的内容，质量越来越差，越看越没劲，卸了卸了，还是撸代码好玩。', 'attitudes': 10, 'comments': 8, 'reposts': 0}
{'id': '4541086507209784', 'text': '铁窗爱情3', 'attitudes': 23, 'comments': 3, 'reposts': 1}
{'id': '4541085119162017', 'text': '即便没收费，那直播搞这个操作也是太服了。', 'attitudes': 3, 'comments': 0, 'reposts': 0}
{'id': '4539580352832042', 'text': '铁窗爱情2', 'attitudes': 3, 'comments': 4, 'reposts': 0}
{'id': '4538323178106309', 'text': '老婆返校了，但是出不来，于是就有了铁窗爱情。@长泽牙妹 北京', 'attitudes': 20, 'comments': 6, 'reposts': 0}