[Python Web Crawler] Notes on the paid course "Master the Python Web Crawler in 150 Lectures", Chapter 6: Using Basic Crawler Libraries, Part 2 (the requests library)

The requests library is a third-party library, so it must be installed first (for example with pip install requests).

1. Sending GET/POST requests

import requests

# Add headers and query parameters

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.125 Safari/537.36'
}
url = 'https://www.baidu.com/'

kw = {'wd': '中国'}
response = requests.get(url, headers=headers, params=kw)
print(response)

# Inspect the response content
# print(response.text)  # returns the body decoded to a str (Unicode)
#
# print(response.content)  # returns the raw bytes
# print(response.content.decode('utf-8'))  # decode manually when the text is garbled

print(response.url)
print(response.encoding)    # character encoding of the response
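The params dictionary is URL-encoded and appended to the URL automatically. That encoding step can be sketched offline with requests' own PreparedRequest helper (the /s search path here is just an illustrative example, not part of the original request above):

```python
from requests.models import PreparedRequest

# Prepare a URL the same way requests.get(url, params=...) does internally
req = PreparedRequest()
req.prepare_url('https://www.baidu.com/s', {'wd': '中国'})

# Non-ASCII query values are percent-encoded as UTF-8
print(req.url)  # https://www.baidu.com/s?wd=%E4%B8%AD%E5%9B%BD
```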

Note the difference between content and text:

response.text returns the body as a str, decoded with the encoding requests detected

response.content returns the raw bytes; when the decoded text is garbled, call decode() yourself with the correct encoding
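The difference can be seen without any network request by building a Response object by hand (constructing Response directly and setting its private _content field like this is purely for illustration):

```python
import requests

# Build a Response manually just to contrast text and content
resp = requests.models.Response()
resp._content = '中国'.encode('utf-8')  # raw bytes, as received from a server
resp.encoding = 'utf-8'                 # encoding requests will use for .text

print(resp.content)                  # b'\xe4\xb8\xad\xe5\x9b\xbd'  (bytes)
print(resp.text)                     # 中国  (str, decoded with resp.encoding)
print(resp.content.decode('utf-8'))  # 中国  (manual decode for garbled cases)
```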

 

2. Proxies

Proxy IPs were already covered in the earlier chapter on the urllib library. The principle is the same in requests, but the usage is more concise: just pass the proxy dict to the proxies parameter of the request method, as follows:

import requests

url = 'http://httpbin.org/ip'
proxy = {
    'http': 'http://123.160.68.74:9999'  # sample free proxy; likely expired by now
}
resp = requests.get(url, proxies=proxy)
print(resp.text)
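The keys of the proxies dict are URL schemes: requests uses the entry whose key matches the scheme of the target URL. This can be shown offline with requests' select_proxy helper (an internal utility, so treat this as illustration only; the proxy address is the sample one above and is almost certainly dead):

```python
from requests.utils import select_proxy

proxies = {
    'http': 'http://123.160.68.74:9999',   # used for plain-HTTP targets
    'https': 'http://123.160.68.74:9999',  # used for HTTPS targets
}

# requests chooses the proxy whose key matches the request's scheme
print(select_proxy('http://httpbin.org/ip', proxies))
print(select_proxy('https://httpbin.org/ip', proxies))
```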

 

3. Cookie

If a response sets cookies, you can read them through the response's cookies attribute:
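The cookies attribute is a RequestsCookieJar. Its interface can be sketched offline (the cookie name, value, and domain below are made up; a real jar comes back on response.cookies):

```python
from requests.cookies import RequestsCookieJar

# Build a jar by hand just to show the interface
jar = RequestsCookieJar()
jar.set('session_id', 'abc123', domain='example.com', path='/')

print(jar.get('session_id'))  # abc123
print(jar.get_dict())         # {'session_id': 'abc123'}
```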

3.1 Using cookies to simulate a login

import requests

# resp = requests.get('https://www.baidu.com/')
# print(resp.cookies)
# print(resp.cookies.get_dict())

url = 'https://www.zhihu.com/hot'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.125 Safari/537.36',
    'cookie': '_zap=9cb16e80-2e5a-442a-ad83-8c4e56151274; d_c0="AHBUkeFrFhGPThKJGfZWuvXPYdeDQlxWqI4=|1586342451"; _xsrf=KGrwON9rdqf1Va6QrWyiLwNOTRoK5SPY; _ga=GA1.2.925220024.1595327992; Hm_lvt_98beee57fd2ef70ccdd5ca52b9740c49=1595327992,1595330061,1595376694; capsion_ticket="2|1:0|10:1599905310|14:capsion_ticket|44:NzhiNDdjZDFjNjBiNDAxOThhNWI3ODQ0MDJhMGQxZGU=|c18f9b858f5a3b1953d240092ab6d1be2fcdd60cb4ca8bdcb531a2161f93fb1b"; z_c0="2|1:0|10:1599905438|4:z_c0|92:Mi4xbkV5a0JRQUFBQUFBY0ZTUjRXc1dFU2NBQUFDRUFsVk5uaXVFWHdEQXZmUFJ5Y0x4WC1ySS1wQ0dYQnl5ZHh3RVhB|29705d2526c129e3642b869de321e6f086c38b17aa2c1285a131192d2de3477b"; tst=h; tshl=; q_c1=1de7075e7f0448aeb62af8961806c2f1|1599916711000|1588725781000; KLBRSID=2177cbf908056c6654e972f5ddc96dc2|1599917151|1599915145'
}
resp = requests.get(url, headers=headers)
print(resp.text)

3.2 Session: sharing cookies across requests

import requests

post_url = 'https://i.meishi.cc/login.php?redirect=https%3A%2F%2Fwww.meishij.net%2F'

post_data = {
    'username': '[email protected]',
    'password': 'wq15290884759.'
}
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36'
}

# Log in
session = requests.Session()
session.post(post_url, headers=headers, data=post_data)


# Visit the personal profile page
url = 'https://i.meishi.cc/cook.php?id=13686422'

resp = session.get(url)
print(resp.text)
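What makes this work is that a Session keeps one cookie jar across all its requests: cookies stored by the login response are sent automatically on the later GET. Sketched offline (the cookie name and value are made up):

```python
import requests

session = requests.Session()

# Simulate a cookie that a login response would have stored in the jar
session.cookies.set('login_token', 'xyz789')

# Any later request made through this session carries the same jar
print(session.cookies.get('login_token'))  # xyz789
```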

 

4. Dealing with untrusted SSL certificates

4.1 SSL certificate

An SSL certificate is a kind of digital certificate, similar to an electronic version of a driver's license, passport, or business license. Because it is configured on a server, it is also called an SSL server certificate.

An SSL certificate follows the SSL protocol and is issued by a trusted certificate authority (CA) after it has verified the server's identity; it provides server authentication and encrypts data in transit.

https://baike.baidu.com/item/SSL%E8%AF%81%E4%B9%A6/5201468?fr=aladdin

If a site's SSL certificate is not trusted, the request raises an error, as in the following example:

import requests

url = 'https://inv-veri.chinatax.gov.cn/'
resp = requests.get(url)

print(resp.text)

For this situation, adding verify=False skips certificate verification and the page can be fetched normally (note that the connection is then no longer authenticated):

import requests

url = 'https://inv-veri.chinatax.gov.cn/'
resp = requests.get(url, verify=False)

print(resp.text)

 


Origin blog.csdn.net/weixin_44566432/article/details/108561841