Sesame HTTP: Ajax result extraction

Taking Weibo as an example, I will use Python to simulate these Ajax requests and crawl the Weibo posts I have published.

1. Analyze the request

Open the XHR filter for Ajax requests and keep scrolling the page so that new Weibo content is loaded. As you can see, Ajax requests are sent continuously.

Select one of the requests and analyze its parameter information. Click the request to enter the details page, as shown in Figure 6-11.

Figure 6-11 Details page

It can be found that this is a GET request, and the request link is https://m.weibo.cn/api/container/getIndex?type=uid&value=2830678474&containerid=1076032830678474&page=2. The request has four parameters: type, value, containerid, and page.

Looking at the other requests, you can see that their type, value, and containerid stay the same. type is always uid, and value is the number in the page link, which is actually the user's id. As for containerid, it turns out to be 107603 followed by the user's id. The only value that changes is page; obviously this parameter controls paging: page=1 is the first page, page=2 the second, and so on.
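For instance, here is a minimal sketch of how these parameters could be assembled for an arbitrary user (the uid below is the one from this example; containerid is assumed to simply be 107603 plus the uid, as observed above):

from urllib.parse import urlencode

uid = '2830678474'  # the user's id, taken from the number in the page link

# containerid appears to be the fixed prefix 107603 followed by the uid
params = {
    'type': 'uid',
    'value': uid,
    'containerid': '107603' + uid,
    'page': 2  # paging parameter: 1 for the first page, 2 for the second, and so on
}
print('https://m.weibo.cn/api/container/getIndex?' + urlencode(params))
# https://m.weibo.cn/api/container/getIndex?type=uid&value=2830678474&containerid=1076032830678474&page=2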

2. Analyze the response

Then, observe the response content of this request, as shown in Figure 6-12.

Figure 6-12 Response content

This content is in JSON format, which the browser developer tools parse automatically for our convenience. It can be seen that the two most important pieces of information are cardlistInfo and cards: the former contains a fairly important field, total, which on inspection turns out to be the total number of Weibo posts, so we can estimate the number of pages from it; the latter is a list containing 10 elements. Expand one to take a look, as shown in Figure 6-13.

Figure 6-13 List content

It can be found that this element has a rather important field, mblog. Expanding it shows that it contains the information of a Weibo post, such as attitudes_count (number of likes), comments_count (number of comments), reposts_count (number of reposts), created_at (posting time), text (body of the Weibo), and so on, and they are all formatted content.

In this way, a single request to this interface gets us 10 Weibo posts, and we only need to change the page parameter between requests.

In this case, we only need to do a simple loop to get all the microblogs.
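For instance, here is a rough sketch of how the number of pages for that loop could be estimated from the total field in cardlistInfo mentioned above (the total value used here is hypothetical; each page returns 10 posts):

import math

total = 136        # hypothetical value of cardlistInfo.total (total number of Weibo posts)
per_page = 10      # each Ajax request returns 10 posts
page_count = math.ceil(total / per_page)
print(page_count)  # 14 -> looping page from 1 to 14 would cover everything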

3. Practical drills

Here we use a program to simulate these Ajax requests and crawl the first 10 pages of my Weibo posts.

First, define a method to get the result of each request. Since page is a variable parameter in the request, we pass it in as an argument of the method. The relevant code is as follows:

from urllib.parse import urlencode
import requests
base_url = 'https://m.weibo.cn/api/container/getIndex?'

headers = {
    'Host': 'm.weibo.cn',
    'Referer': 'https://m.weibo.cn/u/2830678474',
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36',
    'X-Requested-With': 'XMLHttpRequest',
}

def get_page(page):
    params = {
        'type': 'uid',
        'value': '2830678474',
        'containerid': '1076032830678474',
        'page': page
    }
    url = base_url + urlencode(params)
    try:
        response = requests.get(url, headers=headers)
        if response.status_code == 200:
            return response.json()
    except requests.ConnectionError as e:
        print('Error', e.args)

First, base_url is defined to represent the first half of the requested URL. Next, a parameter dictionary is constructed, where type, value, and containerid are fixed parameters and page is the variable one. Then the urlencode() method converts the parameters into GET query-string form, that is, something like type=uid&value=2830678474&containerid=1076032830678474&page=2. base_url is then concatenated with these parameters to form the full URL. Next, we request the link with requests, passing the headers parameter. We then check the status code of the response: if it is 200, we call the json() method to parse the content as JSON and return it; otherwise nothing is returned. If an exception occurs, we catch it and print the exception information.
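As a quick sanity check of this method, we could request the first page and count the entries under data -> cards (run together with the code above; the structure is the one shown in Figure 6-12):

first_page = get_page(1)
if first_page:
    cards = first_page.get('data', {}).get('cards', [])
    print(len(cards))  # expected to print 10, one entry per Weibo post on the page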

Then, we need to define a parsing method to extract the desired information from the result. For example, this time we want to save the id, body text, number of likes, number of comments, and number of reposts of each Weibo post. We can first traverse cards, then take the information from each mblog and assign it to a new dictionary that is yielded back:

from pyquery import PyQuery as pq

def parse_page(json):
    if json:
        items = json.get('data').get('cards')
        for item in items:
            item = item.get('mblog')
            weibo = {}
            weibo['id'] = item.get('id')
            weibo['text'] = pq(item.get('text')).text()
            weibo['attitudes'] = item.get('attitudes_count')
            weibo['comments'] = item.get('comments_count')
            weibo['reposts'] = item.get('reposts_count')
            yield weibo

Here we use pyquery to remove the HTML tags in the body.
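To see what the pq(...).text() call is doing, here is a small standalone example with a made-up HTML snippet of the kind that appears in the text field:

from pyquery import PyQuery as pq

html = 'Share a song <a href="/status/123"><span class="surl-text">Fly Away</span></a>'
print(pq(html).text())  # -> 'Share a song Fly Away': the HTML tags are stripped, only the text remains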

Finally, traverse page from 1 to 10 and print out the extracted results:

if __name__ == '__main__':
    for page in range(1, 11):
        json = get_page(page)
        results = parse_page(json)
        for result in results:
            print(result)

In addition, we can also add a method to save the result to the MongoDB database:

from pymongo import MongoClient

client = MongoClient()
db = client['weibo']
collection = db['weibo']

def save_to_mongo(result):
    # insert_one replaces the deprecated insert() in recent versions of pymongo
    if collection.insert_one(result):
        print('Saved to Mongo')
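To get the interleaved "Saved to Mongo" lines shown in the sample output below, the main loop can simply call save_to_mongo() on each parsed result, for example:

if __name__ == '__main__':
    for page in range(1, 11):
        json = get_page(page)
        results = parse_page(json)
        for result in results:
            print(result)
            save_to_mongo(result)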

This completes all functions. After running the program, the sample output is as follows:

{'id': '4134879836735238', 'text': '惊不惊喜,刺不刺激,意不意外,感不感动', 'attitudes': 3, 'comments': 1, 'reposts': 0}
Saved to Mongo
{'id': '4143853554221385', 'text': '曾经梦想仗剑走天涯,后来过安检给收走了。分享单曲 远走高飞', 'attitudes': 5, 'comments': 1, 'reposts': 0}
Saved to Mongo

Looking at MongoDB, we can see that the corresponding data has indeed been saved, as shown in Figure 6-14.

Figure 6-14 Save the result

In this way, we successfully crawled a Weibo list by analyzing Ajax and writing a crawler. The code for this section can be found at: https://github.com/Python3WebSpider/WeiboList.

The purpose of this section is to demonstrate how to simulate Ajax requests; the crawling results themselves are not the focus. There is still plenty of room for improvement in this program, such as calculating the number of pages dynamically or fetching the full text of long Weibo posts. If you are interested, you can try it out.

Through this example, we mainly learned how to analyze Ajax requests and how to simulate and crawl them with a program. With the crawling principle understood, the hands-on Ajax exercise in the next section will come much more easily.
