requests crawler: multi-page crawl of KFC restaurant locations

Crawling paged data: KFC restaurant location information

1 Analysis

After a keyword is entered and the query runs, the URL in the address bar stays the same as before.

This indicates that pressing the query button sends an Ajax request.

  • The location information refreshed on the current page must therefore come from the response to that Ajax request


Use a packet capture tool (here, the browser's developer tools) to locate the Ajax request, then read the following from the captured packet:

  • request URL
  • request method
  • parameters carried by the request
  • response data
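
The items read from the captured packet can be written down before any request is sent. The sketch below uses only the standard library; the URL and parameter values are the ones captured for this query, and `CapturedRequest` is a hypothetical name introduced for illustration. It shows that `op=keyword` travels in the URL query string, while the remaining parameters would travel in a form-encoded POST body.

```python
from dataclasses import dataclass, field
from urllib.parse import parse_qs, urlencode, urlsplit


@dataclass
class CapturedRequest:
    """Hypothetical container for what the packet capture reveals."""
    url: str
    method: str
    params: dict = field(default_factory=dict)


captured = CapturedRequest(
    url='http://www.kfc.com.cn/kfccda/ashx/GetStoreList.ashx?op=keyword',
    method='POST',
    params={
        'cname': '',
        'pid': '',
        'keyword': '北京',
        'pageIndex': '1',
        'pageSize': '10',
    },
)

# Parameters that ride in the URL itself.
query_params = parse_qs(urlsplit(captured.url).query)

# Parameters that ride in the POST body, as a form-encoded string.
form_body = urlencode(captured.params)

print(captured.method, query_params)
print(form_body)
```

Non-ASCII values such as the keyword are percent-encoded as UTF-8 in the body, which is why the captured request may look different from the text typed into the search box.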

On the first capture I had the filter set to All. Since the analysis shows that the query sends an Ajax request, this time I selected Fetch/XHR, which shows only Ajax requests.

Open the developer tools (F12), select Fetch/XHR, and click Query to see the request.

The request method is POST.

The response is in JSON format.
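
To make the JSON shape concrete, here is a small sketch that parses a made-up response. The keys `Table1`, `storeName`, and `addressDetail` are the ones the crawler in this article reads; the two store entries are placeholder values, not real data, and `extract_stores` is a hypothetical helper.

```python
import json

# Placeholder payload: same keys the crawler reads, invented values.
sample_text = '''
{
  "Table1": [
    {"storeName": "Example Store A", "addressDetail": "1 Example Road"},
    {"storeName": "Example Store B", "addressDetail": "2 Example Road"}
  ]
}
'''


def extract_stores(payload):
    """Return (storeName, addressDetail) pairs from a parsed response."""
    return [
        (row.get('storeName', ''), row.get('addressDetail', ''))
        for row in payload.get('Table1', [])
    ]


stores = extract_stores(json.loads(sample_text))
for name, addr in stores:
    print(name, addr)
```

Using `.get` with defaults keeps the helper from raising if a key is missing, which may or may not be the behavior you want when the site changes its response format.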

2 Crawl one page of data

import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36'
}
url = 'http://www.kfc.com.cn/kfccda/ashx/GetStoreList.ashx?op=keyword'

data = {
    'cname': '',
    'pid': '',
    'keyword': '北京',  # search keyword: Beijing
    'pageIndex': '1',
    'pageSize': '10',
}
# In requests.post, the data argument carries the request's dynamic parameters
response = requests.post(url=url, headers=headers, data=data)
page_text = response.json()

for dic in page_text['Table1']:
    title = dic['storeName']
    addr = dic['addressDetail']
    print(title, addr)

3 Crawl multiple pages of data

Clicking the second page changes pageIndex in the request data to 2; clicking the third page changes it to 3.

So a loop over pageIndex can crawl every page.

Each iteration only needs to change the value of the pageIndex parameter. The captured request sends its form values as strings, so to avoid mistakes the loop counter is converted explicitly with str().
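
A minimal sketch of that idea: a hypothetical helper `build_payload` that returns the form data for a given page and converts the page number to a string. The default keyword and page size mirror the single-page request; the helper name and its parameters are assumptions made for illustration.

```python
def build_payload(page, keyword='北京', page_size=10):
    """Form data for one page; only pageIndex varies between pages."""
    return {
        'cname': '',
        'pid': '',
        'keyword': keyword,
        'pageIndex': str(page),      # explicit conversion to a string
        'pageSize': str(page_size),
    }


payloads = [build_payload(page) for page in range(1, 4)]
for payload in payloads:
    print(payload['pageIndex'], payload)
```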

import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36'
}
url = 'http://www.kfc.com.cn/kfccda/ashx/GetStoreList.ashx?op=keyword'

for page in range(1, 9):
    data = {
        'cname': '',
        'pid': '',
        'keyword': '北京',  # search keyword: Beijing
        'pageIndex': str(page),
        'pageSize': '10',
    }
    # In requests.post, the data argument carries the request's dynamic parameters
    response = requests.post(url=url, headers=headers, data=data)
    page_text = response.json()

    for dic in page_text['Table1']:
        title = dic['storeName']
        addr = dic['addressDetail']
        print(f'Page {page}:', title, addr)
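
The loop above fixes the page range at 1 through 8. As an alternative sketch, the crawl can stop when a page comes back with no stores. This assumes the endpoint returns an empty `Table1` list past the last page, which the article does not confirm. `crawl_all` and `fetch_page` are hypothetical names: `fetch_page(page)` stands for any callable that performs the POST and returns the parsed JSON, and a fake version is used here so the sketch runs offline.

```python
def crawl_all(fetch_page, max_pages=100):
    """Collect (page, storeName, addressDetail) until a page is empty."""
    results = []
    for page in range(1, max_pages + 1):
        rows = fetch_page(page).get('Table1', [])
        if not rows:  # assumed signal that the last page has been passed
            break
        for row in rows:
            results.append((page, row['storeName'], row['addressDetail']))
    return results


def fake_fetch_page(page):
    """Offline stand-in: two pages of placeholder stores, then nothing."""
    fake_pages = {
        1: [{'storeName': 'Example Store A', 'addressDetail': '1 Example Road'}],
        2: [{'storeName': 'Example Store B', 'addressDetail': '2 Example Road'}],
    }
    return {'Table1': fake_pages.get(page, [])}


all_stores = crawl_all(fake_fetch_page)
for page, name, addr in all_stores:
    print('Page', page, ':', name, addr)
```

Against the real site, `fetch_page` would wrap the `requests.post(...).json()` call from the loop above, with `pageIndex` set to `str(page)`. The `max_pages` cap is a safety limit in case the stop condition never triggers.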

Origin blog.csdn.net/qq_45842943/article/details/125952260