Crawling paginated data: KFC restaurant location information
1 Analysis
After entering a keyword and clicking the query button, the URL in the address bar stays the same as before.
This indicates that clicking the query button sends an Ajax request.
The location information refreshed on the current page must therefore be the data returned by that Ajax request.
Use a packet capture tool (here, the browser developer tools) to locate the Ajax request, and capture from it:
- request url
- request method
- Parameters carried by the request
- response data
The first time I captured packets I chose the All filter. Since the analysis shows that an Ajax request is sent here, this time I chose Fetch/XHR, which shows only Ajax requests.
Press F12 to open the developer tools, select Fetch/XHR, and click the query button to view the request.
The request method turns out to be POST.
The response body is in JSON format.
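The findings from the capture can be written down before any crawling code exists. The sketch below is only an illustration: the `captured` dictionary is a hypothetical name, and its values are the ones used in the request later in this article. It uses the standard library to show what a form-encoded POST body built from those parameters looks like, without sending anything.

```python
from urllib.parse import urlencode, parse_qs

# Values observed in the Fetch/XHR panel; "captured" is a hypothetical name.
captured = {
    "url": "http://www.kfc.com.cn/kfccda/ashx/GetStoreList.ashx?op=keyword",
    "method": "POST",
    "params": {
        "cname": "",
        "pid": "",
        "keyword": "北京",
        "pageIndex": "1",
        "pageSize": "10",
    },
}

# A form-encoded POST body is the parameters joined as key=value pairs,
# with non-ASCII characters percent-encoded as UTF-8.
body = urlencode(captured["params"])
print(body)

# Decoding the body gives back the original parameters.
decoded = parse_qs(body, keep_blank_values=True)
print(decoded["keyword"])
```

Comparing the printed body with the form data shown in the developer tools is a quick way to check that no parameter was missed.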
2 Crawl one page of data
import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36'
}
url = 'http://www.kfc.com.cn/kfccda/ashx/GetStoreList.ashx?op=keyword'
# Form parameters captured from the Ajax request
data = {
    'cname': '',
    'pid': '',
    'keyword': '北京',
    'pageIndex': '1',
    'pageSize': '10',
}
# For a POST request, the data argument carries the form parameters
response = requests.post(url=url, headers=headers, data=data)
page_text = response.json()
# Each entry in 'Table1' describes one restaurant
for dic in page_text['Table1']:
    title = dic['storeName']
    addr = dic['addressDetail']
    print(title, addr)
3 Crawl multiple pages of data
Clicking the second page changes the pageIndex parameter of the request to 2, and clicking the third page changes it to 3.
So a loop can crawl all the pages.
Each iteration only needs to change the value of pageIndex. The captured request sends it as a string, so to avoid mistakes the loop counter is converted explicitly with str().
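The per-page change can be isolated in a small helper. This is a minimal sketch, not part of the original code: `build_payload` and its arguments are hypothetical names, and the field names are the ones captured from the request.

```python
def build_payload(page, keyword="北京", page_size=10):
    """Return the form parameters for one page of results.

    pageIndex and pageSize are converted with str() so that they match
    the string values seen in the captured request.
    """
    return {
        "cname": "",
        "pid": "",
        "keyword": keyword,
        "pageIndex": str(page),
        "pageSize": str(page_size),
    }


# Only pageIndex differs between consecutive pages.
for page in range(1, 4):
    print(build_payload(page))
```

Keeping the payload in one function means the keyword or page size can later be changed in a single place.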
import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36'
}
url = 'http://www.kfc.com.cn/kfccda/ashx/GetStoreList.ashx?op=keyword'
# range(1, 9) requests pages 1 through 8
for page in range(1, 9):
    data = {
        'cname': '',
        'pid': '',
        'keyword': '北京',
        'pageIndex': str(page),
        'pageSize': '10',
    }
    # For a POST request, the data argument carries the form parameters
    response = requests.post(url=url, headers=headers, data=data)
    page_text = response.json()
    for dic in page_text['Table1']:
        title = dic['storeName']
        addr = dic['addressDetail']
        print('Page', page, ':', title, addr)
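The loop above fixes the number of pages in advance. A possible variation stops as soon as a page comes back empty. The sketch below rests on an assumption that this article does not verify: that the endpoint returns an empty 'Table1' list once pageIndex passes the last page. `crawl_all`, `fetch_page`, and `fake_fetch` are hypothetical names, and the stand-in data is made up so that the example runs without network access.

```python
def crawl_all(fetch_page, max_pages=100):
    """Collect (page, storeName, addressDetail) tuples until a page is empty.

    fetch_page is any callable that takes a page number and returns the
    decoded JSON dictionary for that page. max_pages is a safety limit.
    """
    stores = []
    for page in range(1, max_pages + 1):
        rows = fetch_page(page).get("Table1", [])
        if not rows:
            # Assumption: an empty or missing 'Table1' means no more pages.
            break
        for row in rows:
            stores.append((page, row["storeName"], row["addressDetail"]))
    return stores


def fake_fetch(page):
    """Stand-in for the real request, with made-up data for two pages."""
    pages = {
        1: [{"storeName": "Store A", "addressDetail": "Address A"}],
        2: [{"storeName": "Store B", "addressDetail": "Address B"}],
    }
    return {"Table1": pages.get(page, [])}


stores = crawl_all(fake_fetch)
for page, title, addr in stores:
    print("Page", page, ":", title, addr)
```

To run it against the real site, pass a function that performs the same requests.post call as in the loop above and returns response.json().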