Table of contents
The role of requests
Send a simple GET request
Send a request with headers
Send a POST request
Using the cookies parameter
The difference between cookies and sessions
Using a proxy
Set a request timeout
SSL certificate verification
_____________________________________________________
Here we go.
A brief introduction to the requests module
1. The requests module is the most commonly used module in crawlers, and one that newcomers can pick up quickly.
A quick look at how we visit the web:
We send a request to the server, and the server responds with data; requests simulates the browser sending that request.
Simply put: it sends a network request and returns the corresponding data.
The idea, sketched out: browser → sends request → server → responds with data.
Before we can do anything, we need to know how to start. The first step:
Install the requests module:
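A sketch of the install command (assuming pip is available on your PATH):

```shell
pip install requests
```

On systems with both Python 2 and 3 installed, `python3 -m pip install requests` is the safer spelling.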
Send a simple GET request
Let's start with some simple code:
import requests

url = "https://www.baidu.com/?tn=02003390_19_hao_pg"
# (a header dict is defined here but not passed yet; see the headers section below)
header = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/113.0.0.0 Safari/537.36"
}
# Send the request
response = requests.get(url)
print(response)
response.encoding = "utf-8"
print(response.text)
Commonly used attributes and methods of the response object:
response.encoding — sets the encoding used to decode the returned data, so that it matches the encoding of the received data
response.text — the data returned by the server (e.g. HTML) as a string
response.status_code — the response status code
response.request.headers — the request headers that were sent
response.headers — all of the response headers
response.content — the returned data as raw bytes
response.content.decode() — converts the bytes to a string; equivalent to response.text
response.json() — parses JSON data into a Python data type
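The relationship between content, decode() and text can be seen without any network request, because it is just Python's bytes-to-string handling:

```python
# response.content gives raw bytes; decode() turns them into a string,
# which is what response.text does for you using response.encoding.
raw = "你好".encode("utf-8")   # what the server's raw bytes look like
print(raw)                     # b'\xe4\xbd\xa0\xe5\xa5\xbd'
print(raw.decode("utf-8"))     # 你好 — the equivalent of content.decode()
```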
Here is an example:
import requests

url = "https://www.woaifanyi.com/api/2.0/save/?ajaxtimestamp=1685158773961"
header = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/113.0.0.0 Safari/537.36"
}
data = {
    "source": "你好",
    "from": "1",
    "to": "2"
}
# Send the request
response = requests.post(url, data=data, headers=header)
print(response)
print(response.json())
print(response.request.headers)
post() is used here; it works roughly the same as get(), the main difference being the data parameter.
response.request.headers returns the request headers corresponding to this response; response.headers returns all of the response headers.
Send a request with headers
As you can see, the data returned above is only part of the HTML; the actual page contains much more.
To avoid this, we can add a request header:
import requests

url = "https://www.baidu.com/?tn=02003390_19_hao_pg"
header = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/113.0.0.0 Safari/537.36"
}
# Send the request, this time passing the headers
response = requests.get(url, headers=header)
print(response)
response.encoding = "utf-8"
print(response.text)
This way we can get the whole page, but only for static pages; it is hard to get dynamic data like this (I will post how to scrape dynamic pages later).
Send a POST request
A POST request was already demonstrated above in the json() example, so no more time is spent on it here.
Using the cookies parameter
Cookies can be understood as user information. It is like logging in to QQ: we are not asked to log in again for a while, because the QQ client holds our cookie information.
Usage: the cookie can be added to the header and sent along with the request.
The code is shown below.
Some readers may find cookies tedious to format by hand; this website can help: https://spidertools.cn/#/formatHeader
import requests

url = "http://ifanyi.iciba.com/index.php?c=trans&m=fy&client=6&auth_user=key_web_fanyi&sign=0d2d6b4f80839676"
header = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/113.0.0.0 Safari/537.36",
    "Cookie": "XXXXXXXXXXX"  # paste your own cookie string here
}
# Send the request
response = requests.get(url, headers=header)
print(response)
response.encoding = "utf-8"
# print(response.text)
# print(response.content.decode())
# print(response.status_code)
# print(response.json())
The difference between cookies and sessions
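In short: a cookie is stored on the client and sent with every request, while session state lives on the server and is looked up via a session ID (often itself carried in a cookie). On the client side, requests.Session keeps cookies across requests automatically, much like a browser does. A minimal sketch (the cookie name and value here are made up):

```python
import requests

# A Session persists cookies and default headers across requests,
# so a cookie set by one response is sent with the next request.
s = requests.Session()
s.headers.update({"User-Agent": "Mozilla/5.0"})

# Pretend the server set this cookie when we logged in (hypothetical values):
s.cookies.set("session_id", "XXXX")

# Every request made through `s` now carries session_id automatically:
# r = s.get("https://example.com/profile")
print(s.cookies.get("session_id"))   # XXXX
```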
Using a proxy
Proxy IPs are generally divided into three categories: transparent, anonymous, and high-anonymity (elite) proxies.
import requests

url = "http://ifanyi.iciba.com/index.php?c=trans&m=fy&client=6&auth_user=key_web_fanyi&sign=0d2d6b4f80839676"
header = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/113.0.0.0 Safari/537.36",
    "Cookie": "XXXXXXXXXXX"
}
proxies = {
    "http": "http://117.191.11.112",
    "https": "http://117.191.11.112"
}
# Send the request through the proxy
response = requests.get(url, headers=header, proxies=proxies)
print(response)
response.encoding = "utf-8"
When the proxy IP fails, an error will be raised.
If a proxy is slow, a request can hang for a very long time, which is annoying, so we can set a limit on how long to wait for a response.
Set a request timeout
import requests

url = "http://ifanyi.iciba.com/index.php?c=trans&m=fy&client=6&auth_user=key_web_fanyi&sign=0d2d6b4f80839676"
header = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/113.0.0.0 Safari/537.36",
    "Cookie": "XXXXXXXXXXX"
}
proxies = {
    "http": "http://117.191.11.112",
    "https": "http://117.191.11.112"
}
# Send the request, waiting at most 10 seconds for a response
response = requests.get(url, headers=header, proxies=proxies, timeout=10)
print(response)
response.encoding = "utf-8"
timeout=10 means wait at most 10 seconds; it is not forced to wait the full 10 seconds. If no response arrives within that time, requests raises a Timeout exception.
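When the timeout expires (or a dead proxy never answers), requests raises an exception rather than returning, so it is good practice to catch it. A minimal sketch, using a deliberately tiny timeout so the exception fires:

```python
import requests
from requests.exceptions import Timeout, RequestException

try:
    # a 1 ms timeout will almost certainly expire before any server answers
    response = requests.get("https://example.com", timeout=0.001)
    print(response.status_code)
except Timeout:
    print("request timed out")
except RequestException as e:
    # covers other failures too, e.g. a dead proxy (ProxyError)
    print("request failed:", e)
```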
SSL certificate verification
Some websites' certificates fail verification; passing verify=False tells requests to skip SSL certificate verification:
import requests

response = requests.get('https://inv-veri.xxxx.gov.cn/', verify=False)
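With verify=False, urllib3 (the library underneath requests) emits an InsecureRequestWarning on every request. It can be silenced, but only do so when you have consciously accepted the risk. A minimal sketch:

```python
import requests
import urllib3

# Skipping certificate verification is unsafe on untrusted networks;
# silence the warning only when you know why verification fails.
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)

# response = requests.get("https://inv-veri.xxxx.gov.cn/", verify=False)
```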