Preface
For beginners just getting started with crawlers, dynamically loaded web pages are a headache. Dynamic loading is one of the most basic anti-scraping measures used by major websites. Today I will take Baidu image scraping as an example to give you a taste of dynamic crawling. The key lies in capturing and analyzing the Ajax requests. This article is for reference only!
The following article is from Python Code Encyclopedia, by the author Python Code Madman.
Free video tutorials on Python crawlers, data analysis, website development, and more:
https://space.bilibili.com/523606542
Python learning exchange group: 1039645993
1. Crawl target
Baidu NBA pictures
2. Crawl results
3. Detailed step analysis
(1) The key to determining whether a page is dynamically loaded is to open the browser's developer tools and watch whether the packets under the XHR tab change as you scroll the mouse wheel. If the number of packets keeps growing, the page is most likely served by dynamic requests. Analysis shows that Baidu Images is indeed dynamically loaded.
(2) After finding the dynamically loaded packet, we analyze its request. The difficulty is analyzing the query parameters. Here I suggest capturing at least two sets of requests for the same keyword and comparing them, to find which parameters differ between packets and how they change (hint: look for a query parameter named pn). Pagination is in fact controlled by this parameter alone. Once the packet is identified, send the request, then parse the response to extract the image URLs and you are done. (Images must be written in binary mode!)
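The comparison step above can be sketched without any network traffic. Below, `build_params` is a hypothetical helper holding a trimmed-down parameter set (the real request carries many more fields, as the full source shows); diffing the parameters of two consecutive packets reveals that only `pn` changes:

```python
def build_params(page):
    # Trimmed-down query parameters for illustration only; the real
    # acjson request carries many more fields.
    return {"tn": "resultjson_com", "word": "NBA", "pn": page, "rn": 30}

# Simulate capturing two consecutive packets.
first = build_params(30)
second = build_params(60)

# The keys whose values differ between the two packets.
changed = {k for k in first if first[k] != second[k]}
print(changed)  # {'pn'}
```

Since `rn` is 30 results per page, `pn` is simply the result offset: page 1 starts at 30, page 2 at 60, and so on.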
4. Complete source code
This crawl needs the requests and json toolkits (plus time and os from the standard library).
import requests as rq
import json
import time
import os

count = 1

def crawl(page):
    global count
    # Create the output directory on the first run
    if not os.path.exists('E://桌面/NBA'):
        os.mkdir('E://桌面/NBA')
    url = 'https://image.baidu.com/search/acjson?'
    header = {
        # 'Referer': 'https://image.baidu.com/search/index?ct=201326592&cl=2&st=-1&lm=-1&nc=1&ie=utf-8&tn=baiduimage&ipn=r&rps=1&pv=&fm=rs4&word',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.82 Safari/537.36'
    }
    param = {
        "tn": "resultjson_com",
        "logid": "11007362803069082764",
        "ipn": "rj",
        "ct": "201326592",
        "is": "",
        "fp": "result",
        "queryWord": "NBA",
        "cl": "2",
        "lm": "-1",
        "ie": "utf-8",
        "oe": "utf-8",
        "adpicid": "",
        "st": "-1",
        "z": "",
        "ic": "",
        "hd": "",
        "latest": "",
        "copyright": "",
        "word": "NBA",
        "s": "",
        "se": "",
        "tab": "",
        "width": "",
        "height": "",
        "face": "0",
        "istype": "2",
        "qc": "",
        "nc": "1",
        "fr": "",
        "expermode": "",
        "force": "",
        "pn": page,  # result offset: the only parameter that changes per page
        "rn": "30",
        "gsm": "1e",
        "1615565977798": "",
    }
    # Request one page of results from the Ajax endpoint
    response = rq.get(url, headers=header, params=param)
    result = response.text
    # print(response.status_code)
    j = json.loads(result)
    # print(j)
    # Collect the thumbnail URL of every entry that has one
    img_list = []
    for i in j['data']:
        if 'thumbURL' in i:
            # print(i['thumbURL'])
            img_list.append(i['thumbURL'])
    # print(len(img_list))
    # Download each image and write it in binary mode
    for n in img_list:
        r = rq.get(n, headers=header)
        with open(f'E://桌面/NBA/{count}.jpg', 'wb') as f:
            f.write(r.content)
        count += 1

if __name__ == '__main__':
    for i in range(30, 601, 30):
        t1 = time.time()
        crawl(i)
        t2 = time.time()
        t = t2 - t1
        print('page {0} is over!!! took {1:.2f}s!'.format(i // 30, t))
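The extraction step can also be tested in isolation against a mock response, with no network access. The `thumbURL` check in the source implies that not every entry in `data` carries that key (e.g. a trailing empty dict), so the filter skips those. The URLs below are placeholders, not real Baidu responses:

```python
import json

# A hand-written mock of the acjson response shape; the real "data"
# list holds one dict per image, plus entries without a thumbURL key.
mock_response = json.dumps({
    "data": [
        {"thumbURL": "https://example.com/a.jpg", "width": 640},
        {"thumbURL": "https://example.com/b.jpg", "width": 480},
        {},  # entry with no thumbURL; the membership check skips it
    ]
})

j = json.loads(mock_response)
img_list = [i["thumbURL"] for i in j["data"] if "thumbURL" in i]
print(img_list)  # ['https://example.com/a.jpg', 'https://example.com/b.jpg']
```

Testing the parsing logic against a saved or mocked response like this is a good habit: it separates "did my extraction break?" from "did the site change or block me?".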