Introduction
In a web crawler implemented in Python, there are two common modules for sending network requests: urllib and requests. The urllib module is the older of the two, and it is cumbersome and inconvenient to use. When the requests module appeared, it quickly replaced urllib. Therefore, in this course we recommend that you use the requests module.
Requests: the only non-GMO HTTP library for Python, safe for human consumption.
Warning: non-professional use of other HTTP libraries may result in dangerous side effects, including: security vulnerabilities, verbose code, reinventing the wheel, gnawing through documentation, depression, headaches, and even death.
What is requests
The requests module is a native network-request module in Python. Its main function is to simulate a browser initiating a request. It is powerful, simple, and efficient to use, and it dominates the web-crawling field.
Why use the requests module
Using the urllib module brings many inconveniences, summarized as follows:
1. URL encoding must be handled manually
2. POST request parameters must be processed manually
3. Cookie and proxy handling is cumbersome
Using the requests module instead:
1. URL encoding is handled automatically
2. POST request parameters are processed automatically
3. Cookie and proxy operations are simplified
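The first point can be seen without sending anything over the network: requests builds and encodes the query string for you, while urllib makes you call urlencode yourself. A minimal sketch (the Baidu search URL and the query value are just illustrative):

```python
import urllib.parse

import requests

query = {'wd': '测试'}  # a non-ASCII search term that must be percent-encoded

# urllib: encoding is your job
manual_url = 'http://www.baidu.com/s?' + urllib.parse.urlencode(query)

# requests: pass params= and the URL is built (and encoded) for you;
# .prepare() exposes the final URL without actually sending the request
auto_url = requests.Request('GET', 'http://www.baidu.com/s', params=query).prepare().url

print(manual_url)  # http://www.baidu.com/s?wd=%E6%B5%8B%E8%AF%95
print(auto_url)    # same encoded URL, no manual urlencode call
```

The third point is similar in spirit: a requests.Session() object keeps cookies across requests automatically, and a proxy is just a proxies= keyword argument on the request call.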
How to use the requests module
Environment installation: pip install requests
Usage / coding workflow
1. Specify the URL
2. Initiate a request with the requests module
3. Extract the data from the response object
4. Persist the data
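The four steps above can be sketched as one small function (the URL and output path are supplied by the caller; nothing here is specific to any one site):

```python
import requests

def fetch_and_save(url, path):
    """Fetch `url` and persist the response body to `path`."""
    # Step 2: initiate the request (step 1, specifying the URL, is the caller's job)
    resp = requests.get(url)
    # Step 3: extract the data from the response object
    text = resp.text
    # Step 4: persistent storage
    with open(path, 'w', encoding='utf-8') as f:
        f.write(text)
    return text
```

resp.text gives the decoded body as a string; resp.content gives the raw bytes, and resp.json() parses a JSON body, which the later cases use.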
Cases: crawler programs
Case 1: Simple web page collector
import requests

wd = input('>>>')
param = {'wd': wd}
url = 'http://www.baidu.com/baidu'
# UA spoofing: pretend to be a regular browser
header = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.149 Safari/537.36 OPR/67.0.3575.115 (Edition B2)'
}
info = requests.get(url=url, params=param, headers=header)
info_text = info.text
with open(r'C:\Users\Administrator\Desktop\%s.html' % wd, 'w', encoding='utf-8') as f:
    f.write(info_text)
print('Crawl finished')
Case 2: KFC store information
import requests
import json

info = []

def kfc(num):
    url = 'http://www.kfc.com.cn/kfccda/ashx/GetStoreList.ashx'
    data = {
        'op': 'keyword',
        'cname': '',
        'pid': '',
        'keyword': '杭州',  # search keyword: Hangzhou
        'pageIndex': num,
        'pageSize': '10',
    }
    header = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36',
    }
    # The endpoint returns JSON, so parse the response directly
    req = requests.post(url=url, data=data, headers=header).json()
    info.append(req)
    print(req)

# Fetch pages 1 through 9
for i in range(1, 10):
    kfc(i)

with open(r'C:\Users\Administrator\Desktop\KFC.json', 'a', encoding='utf-8') as txt:
    json.dump(info, fp=txt, ensure_ascii=False)
print('over')
Case 3: Relevant information of cosmetics production license
import requests
import json

ids = []
info = []
url = 'http://125.35.6.84:81/xk/itownet/portalAction.do'
header = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36',
    # 'Cookie': 'JSESSIONID=02AF3EF8CBE74529A7F6231987EE1A6A; JSESSIONID=64B83D7B541CEED78E13CF74B321D7A0'
}
# First collect the ID of every licence listed on pages 1-5
for page in range(1, 6):
    data = {
        'method': 'getXkzsList',
        'on': 'true',
        'page': str(page),
        'pageSize': '15',
        'productName': '',
        'conditionType': '1',
        'applyname': '',
        'applysn': '',
    }
    req_id = requests.post(url=url, data=data, headers=header).json()
    for item in req_id['list']:
        ids.append(item['ID'])
# Then fetch the detail record for each collected ID
for j in ids:
    data = {
        'method': 'getXkzsById',
        'id': j,
    }
    req_info = requests.post(url=url, data=data, headers=header).json()
    info.append(req_info)
with open(r'C:\Users\Administrator\Desktop\juqing.json', 'a', encoding='utf-8') as txt:
    json.dump(info, txt, ensure_ascii=False)
print('over')