1. Web page packet capture
1. Know the URL, the request method, and the form of the crawled result
import requests

# specify the URL
post_url = 'http://www.kfc.com.cn/kfccda/ashx/GetStoreList.ashx?op=keyword'
# UA spoofing
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36'
}
# send the POST request; data is the form payload shown in step 2 below
response1 = requests.post(url=post_url, data=data, headers=headers)
# receive the response body as text
text = response1.text
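Before moving on, it is worth confirming that the request succeeded and that the body really is JSON text. A minimal check (a sketch; the exact Content-Type string the server returns is an assumption):

print(response1.status_code)                  # expect 200 on success
print(response1.headers.get('Content-Type'))  # expect a JSON content type, e.g. 'application/json'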
2. Know the request parameters
The form data below requests the first page of results; for the keyword 北京 there are 12 pages in total.
'pageIndex': which page to request
'pageSize': how many records per page
data = {
    'cname': '',
    'pid': '',
    'keyword': '北京',
    'pageIndex': 1,
    'pageSize': '10'
}
2. Final code
The main difficulty lies in determining how many pages to loop over: the total row count must first be read out of a probe request, then divided by the page size and rounded up. For example, 113 matching stores at 10 per page give math.ceil(113 / 10) = 12 pages. A sketch of that probe follows.
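Here is a minimal sketch of the probe, assuming the endpoint is still live and still returns the {"Table":[{"rowcount":...}]} shape shown in the comments of the final code:

import math
import requests

probe = requests.post(
    'http://www.kfc.com.cn/kfccda/ashx/GetStoreList.ashx?op=keyword',
    data={'cname': '', 'pid': '', 'keyword': '北京', 'pageIndex': 1, 'pageSize': '10'},
    headers={'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36'},
)
rowcount = probe.json()['Table'][0]['rowcount']  # e.g. 113 for 北京
print(math.ceil(rowcount / 10))                  # 113 / 10 rounded up -> 12 pages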
# imports
import requests
import math
import json
if __name__ == "__main__":
    # specify the URL
    post_url = 'http://www.kfc.com.cn/kfccda/ashx/GetStoreList.ashx?op=keyword'
    # UA spoofing
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36'
    }
    # build the request parameters
    kw = input('enter a query city:')
    # first request: solves the problem of not knowing the page count
    data = {
        'cname': '',
        'pid': '',
        'keyword': kw,
        'pageIndex': 1,
        'pageSize': '10'
    }
    # send the request
    response1 = requests.post(url=post_url, data=data, headers=headers)
    text = response1.text
#{"Table":[{"rowcount":113}],"Table1":[{"rown 获取到的内容是这样的,是字典里面嵌套了一个列表,列表中又嵌套了一个字典
# 用eval将text转成字典
dictionary = eval(text)
# 取出[{"rowcount":113}] 是个只有一个元素的列表
table = dictionary['Table']
# 取出{"rowcount":113} 是个字典
dicts = table[0]
# 取出总查询条数,每页十条记录,所以除以十向上取整得出页数
number_page = math.ceil(dicts['rowcount']/10)
    # now fetch and record every page
    fileName = kw + '.txt'
    file = open(fileName, "w", encoding='utf-8')
    for i in range(1, number_page + 1):
        data = {
            'cname': '',
            'pid': '',
            'keyword': kw,
            'pageIndex': i,
            'pageSize': '10'
        }
        # send the request
        response = requests.post(url=post_url, data=data, headers=headers)
        page_text = response.text
        # one JSON document per line keeps the file easy to parse later
        file.write(page_text + '\n')
    file.close()
    print('over!!!')
3. Code running results
Check against the store list displayed on the web page: the crawled information is consistent with it.
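To eyeball the result programmatically, the saved file can be re-parsed. A minimal sketch, assuming one JSON document per line as written above and that Table1 holds the per-store rows (as the truncated sample in the code comments suggests):

import json

with open('北京.txt', encoding='utf-8') as f:   # the file produced for kw = '北京'
    for line in f:
        page = json.loads(line)
        for store in page.get('Table1', []):    # Table1: the list of store records
            print(store)                        # print each raw row dict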