Crawling Baidu Images with Python: Capturing Dynamically Loaded Data

Preface

For beginners just getting started with crawlers, dynamically loaded web pages are a headache. Dynamic loading is one of the most basic anti-scraping measures used by major websites. Today I will take crawling Baidu Images as an example to give you a taste of dynamic crawling. The key lies in capturing and analyzing the Ajax requests. This article is for reference only!

This article originally appeared on Python Code Encyclopedia, by the author Python Code Madman.

Video tutorials on Python crawling, data analysis, website development, and other case studies are free to watch online:

https://space.bilibili.com/523606542 

Python learning exchange group: 1039645993

1. Crawl target:

NBA pictures on Baidu Images


2. Crawl results

[Screenshot: the downloaded image files]

3. Detailed step analysis


(1) The key to determining whether a page is dynamically loaded is to watch whether new packets appear under the XHR tab of the browser's developer tools as you scroll the page. If the number of packets here keeps growing, the page is most likely loaded via dynamic requests. By this analysis, Baidu Images is dynamically loaded.
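To see why scraping the static HTML is not enough, consider a toy illustration (the strings below are made up for demonstration and are not real Baidu responses): the initial page ships only an empty container, while the image URLs arrive later in a separate JSON response.

```python
import json

# Hypothetical snippets for illustration only -- not real Baidu responses.
static_html = '<div id="imgContainer"></div>'  # what the initial page delivers
xhr_body = '{"data": [{"thumbURL": "https://img.example.com/1.jpg"}]}'  # what an XHR returns

# The image URL is absent from the HTML and present only in the Ajax payload,
# so a plain requests.get() on the page URL would never see it.
print('thumbURL' in static_html)                  # False
print(json.loads(xhr_body)['data'][0]['thumbURL'])
```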

[Screenshot: the XHR panel in the browser's developer tools]

(2) After finding the dynamically loaded packet, we analyze its request. The difficulty lies in analyzing the query parameters. Here I suggest capturing at least two sets of packets and comparing them, to find which parameters differ between packets and how they change. (Hint: look for a query parameter named pn.) The entire pagination is in fact controlled by this single parameter. Once the packet is worked out, send the request, parse the returned data, and extract the image URLs, and you are done. (Note: image files must be written in binary mode!)
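The comparison described above can be sketched offline: build the query dictionaries for two consecutive packets and diff them. The helper below is a simplified stand-in that keeps only a few of the real parameters.

```python
def page_params(pn, word="NBA", rn=30):
    """Simplified query parameters; the real request carries many more fields."""
    return {"tn": "resultjson_com", "word": word, "rn": rn, "pn": pn}

first = page_params(30)    # second packet of results
second = page_params(60)   # third packet of results

# Diff the two packets: only the changing keys matter.
changed = sorted(k for k in first if first[k] != second[k])
print(changed)  # ['pn'] -- pn alone controls pagination
```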

[Screenshot: the request's query string parameters]
 
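The parenthetical note about binary writing deserves a two-line demonstration: a downloaded image (`r.content` in requests) is `bytes`, so the file must be opened in `'wb'` mode; text mode `'w'` would raise a `TypeError` for bytes. The byte string below is a made-up stand-in for real image data.

```python
import os
import tempfile

fake_image = b'\x89PNG\r\n\x1a\n'  # stand-in for r.content (raw bytes)

path = os.path.join(tempfile.mkdtemp(), 'demo.png')
with open(path, 'wb') as f:        # binary mode: required for bytes
    f.write(fake_image)

with open(path, 'rb') as f:
    print(f.read() == fake_image)  # True -- bytes round-trip intact
```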

4. Complete source code

This crawl requires the requests and json libraries (plus the standard time and os modules).

import requests as rq
import json
import time
import os

count = 1  # global counter used to name the downloaded files

def crawl(page):
    global count
    # Create the output directory on first run (path kept from the original post)
    if not os.path.exists('E://桌面/NBA'):
        os.mkdir('E://桌面/NBA')
    url = 'https://image.baidu.com/search/acjson?'
    header = {
        # 'Referer': 'https://image.baidu.com/search/index?ct=201326592&cl=2&st=-1&lm=-1&nc=1&ie=utf-8&tn=baiduimage&ipn=r&rps=1&pv=&fm=rs4&word',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.82 Safari/537.36'
    }
    param = {
    "tn": "resultjson_com",
    "logid": "11007362803069082764",
    "ipn": "rj",
    "ct": "201326592",
    "is": "",
    "fp": "result",
    "queryWord": "NBA",
    "cl": "2",
    "lm": "-1",
    "ie": "utf-8",
    "oe": "utf-8",
    "adpicid": "",
    "st": "-1",
    "z": "",
    "ic": "",
    "hd": "",
    "latest": "",
    "copyright": "",
    "word": "NBA",
    "s": "",
    "se": "",
    "tab": "",
    "width": "",
    "height": "",
    "face": "0",
    "istype": "2",
    "qc": "",
    "nc": "1",
    "fr": "",
    "expermode": "",
    "force": "",
    "pn": page,
    "rn": "30",
    "gsm": "1e",
    "1615565977798": "",
    }
    response = rq.get(url, headers=header, params=param)
    result = response.text
    # print(response.status_code)
    j = json.loads(result)  # parse the JSON body of the Ajax response
    # print(j)
    img_list = []
    for i in j['data']:
        # some entries are placeholders with no thumbnail, so check the key first
        if 'thumbURL' in i:
            # print(i['thumbURL'])
            img_list.append(i['thumbURL'])
    # print(len(img_list))

    for n in img_list:
        r = rq.get(n, headers=header)
        # images are bytes, so the file must be opened in binary mode ('wb')
        with open(f'E://桌面/NBA/{count}.jpg', 'wb') as f:
            f.write(r.content)
        count += 1


if __name__ == '__main__':
    # pn = 30, 60, ..., 600: fetch 20 packets of 30 results each
    for i in range(30, 601, 30):
        t1 = time.time()
        crawl(i)
        t2 = time.time()
        t = t2 - t1
        print('page {0} is over!!!  took {1:.2f}s!'.format(i//30, t))
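As a sanity check on the loop above: `range(30, 601, 30)` requests packets at offsets 30, 60, ..., 600, i.e. 20 packets of 30 thumbnails each, so at most 600 images land on disk (fewer if some entries lack a `thumbURL`). Note that if `pn` is a zero-based offset, starting the range at 30 skips the first 30 results.

```python
offsets = list(range(30, 601, 30))
print(len(offsets))              # 20 packets
print(len(offsets) * 30)         # up to 600 images in total
print(offsets[0], offsets[-1])   # 30 600
```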

Origin blog.csdn.net/m0_48405781/article/details/114839274