Day 38 Crawler: the requests module

Introduction

A web crawler written in Python has two modules available for sending network requests: urllib and requests. urllib is the older of the two and is cumbersome and inconvenient to use; when the requests module appeared, it quickly replaced urllib. For that reason, this course recommends that you use the requests module.

Requests: the only non-GMO HTTP library for Python, safe for human consumption.

Warning: recreational use of other HTTP libraries may result in dangerous side effects, including: security vulnerabilities, verbose code, reinventing the wheel, constantly reading documentation, depression, headaches, and even death.

 

What is requests?

The requests module is a third-party Python library for sending network requests; its main job is to simulate a browser initiating a request. It is powerful, and simple and efficient to use. In the field of web crawling, it dominates.

 

Why use the requests module?

When using the urllib module there are many inconveniences, summarized as follows (a sketch of what this looks like follows the list):

1. URL encoding must be handled manually

2. POST request parameters must be processed manually

3. Cookie and proxy handling is cumbersome
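
For illustration, roughly what that manual work looks like with urllib (a sketch with a placeholder URL, not from the original post):

from urllib import parse, request

# URL encoding must be done by hand
query = parse.urlencode({'wd': '杭州'})
url = 'http://example.com/s?' + query

# POST parameters must also be urlencoded, then converted to bytes
post_data = parse.urlencode({'key': 'value'}).encode('utf-8')
resp = request.urlopen(request.Request(url, data=post_data))
print(resp.read()[:100])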



 

With the requests module (see the comparison sketch after the list):

1. URL encoding is handled automatically

2. POST request parameters are processed automatically

3. Cookie and proxy handling is simplified
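
For comparison, the same work in requests is plain dicts passed as keyword arguments (again a sketch; the URL and proxy address are placeholders):

import requests

# URL encoding: requests encodes the params dict for you
requests.get('http://example.com/s', params={'wd': '杭州'})

# POST parameters: pass a plain dict, no manual encoding
requests.post('http://example.com/s', data={'key': 'value'})

# proxies (and cookies) are ordinary keyword arguments
requests.get('http://example.com/s', proxies={'http': 'http://127.0.0.1:8888'})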




How to use the requests module

 

Environment installation: pip install requests
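
A quick way to confirm the installation worked is to print the library version:

import requests
print(requests.__version__)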

Usage / coding workflow

1. Specify the URL

2. Initiate the request with the requests module

3. Extract the data from the response object

4. Persist the data to storage
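
A minimal sketch stringing the four steps together (the URL and file name are placeholders, not from the original post):

import requests

url = 'http://example.com'            # 1. specify the URL
resp = requests.get(url=url)          # 2. initiate the request
page_text = resp.text                 # 3. get the data from the response object
with open('page.html', 'w', encoding='utf-8') as f:
    f.write(page_text)                # 4. persist it to disk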

 

Cases: crawler programs

Case 1: Simple web page collector

import requests

# Read the search keyword from the console
wd = input('>>>')
param = {
    'wd': wd
}
url = 'http://www.baidu.com/baidu'
# UA spoofing: disguise the crawler as a regular browser
header = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.149 Safari/537.36 OPR/67.0.3575.115 (Edition B2)'
}
# requests URL-encodes the params dict automatically
info = requests.get(url=url, params=param, headers=header)
info_text = info.text
# Persist the page to disk, named after the keyword
with open(r'C:\Users\Administrator\Desktop\%s.html' % wd, 'w', encoding='utf-8') as f:
    f.write(info_text)
print('Crawl complete')
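
Two optional hardening steps, not part of the original case: check the HTTP status before trusting the body, and let requests re-detect the encoding when a page declares it incorrectly (a sketch reusing the names above):

info = requests.get(url=url, params=param, headers=header, timeout=10)
info.raise_for_status()                  # raise an HTTPError on 4xx/5xx responses
info.encoding = info.apparent_encoding   # re-detect the encoding from the body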

 

Case 2: KFC store information

import requests
import json

info = []


def kfc(num):
    """Fetch one page of KFC store results for the keyword."""
    url = "http://www.kfc.com.cn/kfccda/ashx/GetStoreList.ashx"
    data = {
        "op": "keyword",
        'cname': '',
        'pid': '',
        'keyword': '杭州',  # Hangzhou
        'pageIndex': num,
        'pageSize': '10',
    }
    header = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36",
    }

    # POST the form data and parse the JSON response directly
    req = requests.post(url=url, data=data, headers=header).json()
    info.append(req)
    print(req)


# Page numbers are 1-based, so request pages 1 through 9
for page in range(1, 10):
    kfc(page)

# Persist all collected pages as one JSON file
with open(r'C:\Users\Administrator\Desktop\KFC.json', 'a', encoding='utf-8') as txt:
    json.dump(info, fp=txt, ensure_ascii=False)
print('over')
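
A design note: data= sends the dict as a form-encoded body (application/x-www-form-urlencoded), which is what this .ashx endpoint expects. If an API wanted a JSON body instead, requests can serialize the dict itself; a sketch against a hypothetical endpoint:

# requests serializes the dict and sets Content-Type: application/json itself
requests.post('http://example.com/api', json={'keyword': '杭州'})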

 

Case 3: Cosmetics production license information

import requests
import json

ids = []    # license IDs collected from the list pages
info = []   # detail records for each ID
url = 'http://125.35.6.84:81/xk/itownet/portalAction.do'
header = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36",
    # 'Cookie': 'JSESSIONID=02AF3EF8CBE74529A7F6231987EE1A6A; JSESSIONID=64B83D7B541CEED78E13CF74B321D7A0'
}

# Step 1: walk the first five list pages and collect every license ID
for page in range(1, 6):
    data = {
        'method': 'getXkzsList',
        'on': 'true',
        'page': str(page),
        'pageSize': '15',
        'productName': '',
        'conditionType': '1',
        'applyname': '',
        'applysn': ''
    }

    req_id = requests.post(url=url, data=data, headers=header).json()
    for item in req_id['list']:
        ids.append(item['ID'])

# Step 2: request the detail record for each collected ID
for j in ids:
    data = {
        'method': 'getXkzsById',
        'id': j
    }
    req_info = requests.post(url=url, data=data, headers=header).json()
    info.append(req_info)

with open(r'C:\Users\Administrator\Desktop\juqing.json', 'a', encoding='utf-8') as txt:
    json.dump(info, txt, ensure_ascii=False)
print('over')
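
Since this case sends dozens of requests to the same host, a requests.Session could reuse the underlying connection and carry cookies (such as the commented-out JSESSIONID above) between calls. A sketch, not part of the original code, reusing the names above:

session = requests.Session()      # keeps connections alive and stores cookies
session.headers.update(header)    # set the User-Agent once for every request
req_info = session.post(url, data={'method': 'getXkzsById', 'id': j}).json()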

 

Source: www.cnblogs.com/ysging/p/12678581.html