Python crawler practice (2)

1. Web page packet capture

1. Identify the URL, request method, and response format

import requests

# Specify the URL
post_url = 'http://www.kfc.com.cn/kfccda/ashx/GetStoreList.ashx?op=keyword'
# UA spoofing
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36'
}
# Send the POST request (data is the form payload built in the next step)
response1 = requests.post(url=post_url, data=data, headers=headers)
# Receive the response body as text
text = response1.text
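
Before writing the full crawler, it is worth confirming that the request succeeds and that the response has the expected shape. A minimal check, reusing post_url, headers, and the data payload from the next step:

print(response1.status_code)    # 200 on success
print(response1.json().keys())  # dict_keys(['Table', 'Table1'])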

2. Identify the request parameters

The payload below requests the first page of results; for this query there are 12 pages in total.

'pageIndex': the page number to request
'pageSize': the number of records per page
data = {
    'cname': '',
    'pid': '',
    'keyword': '北京',
    'pageIndex': 1,
    'pageSize': '10'
}
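
Pagination is controlled entirely by these two fields: incrementing pageIndex while keeping pageSize fixed walks through the result set. For example, fetching the second page of the same query (post_url and headers as above):

data['pageIndex'] = 2
response2 = requests.post(url=post_url, data=data, headers=headers)
print(response2.text)  # records 11-20 of the 北京 query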

2. Final code

The main difficulty is determining how many pages to loop over, since the total record count is only known after the first request.
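
The first response's Table entry carries the total record count; dividing by the page size and rounding up gives the page count. For the 北京 query above, rowcount is 113:

import math
number_page = math.ceil(113 / 10)  # -> 12, matching the 12 pages seen in the browser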

# Imports
import requests
import math

if __name__ == "__main__":
    # Specify the URL
    post_url = 'http://www.kfc.com.cn/kfccda/ashx/GetStoreList.ashx?op=keyword'
    # UA spoofing
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36'
    }
    # Build the request parameters
    kw = input('enter a query city:')
    # First request: discover the total page count
    data = {
        'cname': '',
        'pid': '',
        'keyword': kw,
        'pageIndex': 1,
        'pageSize': '10'
    }
    # Send the request
    response1 = requests.post(url=post_url, data=data, headers=headers)
    # The response looks like {"Table":[{"rowcount":113}],"Table1":[{"rown...
    # i.e. a dict containing a list, which in turn contains a dict.
    # Parse the JSON text into a dict (response.json() is safer than eval)
    dictionary = response1.json()
    # Take out [{"rowcount":113}], a single-element list
    table = dictionary['Table']
    # Take out {"rowcount":113}, a dict
    dicts = table[0]
    # Total record count; at ten records per page, divide by ten
    # and round up to get the number of pages
    number_page = math.ceil(dicts['rowcount'] / 10)
    # Crawl every page and record the results
    fileName = kw + '.txt'
    file = open(fileName, "w", encoding='utf-8')
    for i in range(1, number_page + 1):
        data = {
            'cname': '',
            'pid': '',
            'keyword': kw,
            'pageIndex': i,
            'pageSize': '10'
        }
        # Send the request
        response = requests.post(url=post_url, data=data, headers=headers)
        page_text = response.text
        file.write(page_text)
    file.close()
    print('over!!!')
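
The script writes each page's raw JSON text to the file. To store structured records instead, the entries in Table1 can be unpacked by replacing the loop body above; a minimal sketch, assuming the field names storeName and addressDetail (not shown in this article, so verify them against a real response):

for i in range(1, number_page + 1):
    data['pageIndex'] = i
    response = requests.post(url=post_url, data=data, headers=headers)
    for store in response.json()['Table1']:
        # storeName / addressDetail are assumed field names; check an actual response
        file.write(store.get('storeName', '') + '\t' + store.get('addressDetail', '') + '\n')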

3. Code running results

 

Checking against the web page confirms that the crawled data matches what the site displays.
