The requests module: how powerful is it? Take a look and you'll know

Table of contents

The role of requests

Send a simple GET request

Send a request with headers

Send a POST request

Using cookies

The difference between cookie and session

Use a proxy

Set request timeout

SSL certificate verification

_____________________________________________________

Here we go

A brief introduction to the requests module

The requests module is the most commonly used module in web crawling, and it is easy for beginners to pick up.

Readers who have seen other tutorials may wonder why we don't start with urllib. We start with requests for the following reasons:
1. requests is implemented on top of urllib
2. requests works the same way in Python 2 and Python 3
3. requests is simple to use

A quick look at what happens when we visit a web page

We send a request to the server, and the server responds with data; requests simulates a browser sending that request.

Put simply: send a network request, get the corresponding data back.

Before starting to use requests, keep that picture in mind: the client sends a request, the server returns a response, and requests plays the part of the browser.

Now that we know what we want to do, the first step:

Install the requests module:
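If requests is not installed yet, it can be installed with pip. A typical command line (your environment might use pip3 or a virtual environment instead):

pip install requests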

Send a simple GET request

Let's start with some simple code:

url="https://www.baidu.com/?tn=02003390_19_hao_pg"

import requests

url = "https://www.baidu.com/?tn=02003390_19_hao_pg"

# Send the request (no headers yet -- they are added in a later section)
response = requests.get(url)

print(response)               # the response object, e.g. <Response [200]>
response.encoding = "utf-8"   # decode the body as UTF-8
print(response.text)          # the response body as a string

response.encoding sets the encoding used to decode the response body, so that it matches the encoding the server actually used.

response.text returns the response body (for example, HTML) decoded as a string.

Besides text, there are also content and json() for getting the data. Other useful attributes and methods on the response object:

response.status_code: the HTTP status code of the response
response.request.headers: the request headers that were actually sent
response.headers: all of the response headers
response.content: the response body as raw bytes
response.content.decode(): equivalent to text, converts the bytes to a string
response.json(): parses a JSON response body into Python data types
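As a quick illustration of these attributes, here is a minimal sketch reusing the Baidu URL from above (the printed values will vary):

import requests

url = "https://www.baidu.com/?tn=02003390_19_hao_pg"
response = requests.get(url)

print(response.status_code)                     # HTTP status code, e.g. 200
print(response.headers)                         # response headers
print(response.request.headers)                 # request headers that were sent
print(response.content[:100])                   # first 100 bytes of the raw body
print(response.content.decode("utf-8")[:100])   # the same bytes decoded to a string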

Here is a full example that uses post() and json():

import requests

url = "https://www.woaifanyi.com/api/2.0/save/?ajaxtimestamp=1685158773961"
header = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/113.0.0.0 Safari/537.36"
}
# Form data for the POST body: translate "你好"
data = {
    "source": "你好",
    "from": "1",
    "to": "2"
}
# Send the request
response = requests.post(url, data=data, headers=header)
print(response)
print(response.json())            # the JSON body parsed into Python data
print(response.request.headers)   # the request headers that were sent

post() is used here; it works much like get(), the main difference being the data parameter, which carries the form data for the POST body.

response.request.headers returns the request headers that produced this response.

response.headers returns all of the response headers.

 

Send a request with headers

You may notice that the HTML returned by the earlier Baidu example is only a fragment; the real page contains much more.

To avoid this, we can add request headers so the request looks like it came from a real browser.

import requests

url = "https://www.baidu.com/?tn=02003390_19_hao_pg"
header = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/113.0.0.0 Safari/537.36"
}
# Send the request, this time passing the headers
response = requests.get(url, headers=header)

print(response)
response.encoding = "utf-8"
print(response.text)

This way we can get the whole page, but only for static pages; dynamically rendered data is hard to get with this alone (I will cover how to fetch dynamic pages in a later post).

Send a POST request

I already used post() in the json() example above, so there is nothing new to spend time on here.

Using cookies

Cookies can be thought of as stored user information. For example, after logging in to QQ we are not asked to log in again for a while, because the client already holds our cookie information.

Usage: the cookie can be added to the request headers and sent along with the request.

If you find it tedious to format a cookie string by hand, this site can help: https://spidertools.cn/#/formatHeader

The code is as follows:

import requests

url = "http://ifanyi.iciba.com/index.php?c=trans&m=fy&client=6&auth_user=key_web_fanyi&sign=0d2d6b4f80839676"
header = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/113.0.0.0 Safari/537.36",
    "cookie": "XXXXXXXXXXX"   # replace with your own cookie string
}

# Send the request with the cookie in the headers
response = requests.get(url, headers=header)

print(response)
response.encoding = "utf-8"
# print(response.text)
# print(response.content.decode())
# print(response.status_code)
# print(response.json())
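requests can also take cookies directly through its cookies parameter as a dict, which saves building the header string by hand. A minimal sketch (the cookie name and value below are placeholders):

import requests

url = "http://ifanyi.iciba.com/index.php?c=trans&m=fy&client=6&auth_user=key_web_fanyi&sign=0d2d6b4f80839676"
# Placeholder cookie name/value -- take the real ones from your own browser
cookies = {"session_id": "XXXXXXXXXXX"}

response = requests.get(url, cookies=cookies)
print(response.status_code)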

The difference between cookie and session

Cookie data is stored in the client's browser, while session data is stored on the server.
Cookies are not very secure: someone can read the cookies stored locally and use them to impersonate the user.
Sessions are kept on the server for a certain period of time, so as the number of visits grows they consume more server resources.
A single cookie cannot hold more than 4 KB of data, and many browsers limit a site to at most 20 cookies.
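On the client side, requests offers a Session object that keeps cookies across requests automatically, which is handy once a site sets cookies after login. A minimal sketch (the login URL and form fields below are made up purely for illustration):

import requests

# A Session stores cookies set by the server and reuses them on later requests
session = requests.Session()

# Hypothetical login endpoint and form fields, for illustration only
session.post("https://example.com/login", data={"user": "me", "password": "secret"})

# Later requests through the same session carry the login cookies
response = session.get("https://example.com/profile")
print(response.status_code)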

Use a proxy

1. It makes the server think the requests are not all coming from the same client.
2. It keeps our real IP address from being exposed and traced back to us.

Proxy IPs fall into three categories:

1. Transparent proxy (the server knows you are using a proxy and can still find your real IP)
2. Anonymous proxy (the server knows you are using a proxy but cannot find your real IP)
3. Highly anonymous (elite) proxy (the server does not know you are using a proxy and cannot find your real IP)

The code is as follows:
import requests

url = "http://ifanyi.iciba.com/index.php?c=trans&m=fy&client=6&auth_user=key_web_fanyi&sign=0d2d6b4f80839676"
header = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/113.0.0.0 Safari/537.36",
    "cookie": "XXXXXXXXXXX"
}
# Example proxy address (probably expired); a real proxy usually includes a port, e.g. http://1.2.3.4:8080
proxies = {
    "http": "http://117.191.11.112",
    "https": "http://117.191.11.112"
}

# Send the request through the proxy
response = requests.get(url, headers=header, proxies=proxies)

print(response)
response.encoding = "utf-8"

When the proxy IP no longer works, the request will raise an error.

A slow proxy can also leave the request hanging for a long time, which is annoying, so we can set a timeout to limit how long we wait for a response.

Set request timeout

import requests

url = "http://ifanyi.iciba.com/index.php?c=trans&m=fy&client=6&auth_user=key_web_fanyi&sign=0d2d6b4f80839676"
header = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/113.0.0.0 Safari/537.36",
    "cookie": "XXXXXXXXXXX"
}
proxies = {
    "http": "http://117.191.11.112",
    "https": "http://117.191.11.112"
}

# Send the request, waiting at most 10 seconds for a response
response = requests.get(url, headers=header, proxies=proxies, timeout=10)

print(response)
response.encoding = "utf-8"

timeout=10 means waiting at most 10 seconds for a response. It does not force the request to take 10 seconds: if the server answers sooner, the call returns immediately, and if nothing arrives within 10 seconds an exception is raised.
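If the timeout is exceeded, requests raises requests.exceptions.Timeout (a subclass of requests.exceptions.RequestException), so the usual pattern is to wrap the call in try/except. A minimal sketch with a deliberately tight timeout:

import requests

url = "https://www.baidu.com/"

try:
    # Deliberately tight timeout to show the failure path
    response = requests.get(url, timeout=0.001)
    print(response.status_code)
except requests.exceptions.Timeout:
    print("The request timed out")
except requests.exceptions.RequestException as e:
    # Any other network-level failure (bad proxy, connection refused, ...)
    print("Request failed:", e)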

SSL certificate verification

When visiting certain websites with the requests module, the request may fail because the site's CA certificate was not issued by a trusted root certification authority.

The workaround is to pass verify=False when making the request:

response = requests.get('https://inv-veri.xxxx.gov.cn/', verify=False)
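Note that with verify=False, requests (via urllib3) prints an InsecureRequestWarning on every call. If you understand the risk and want to silence it, urllib3 provides disable_warnings(). A minimal sketch, keeping the placeholder URL from above:

import requests
import urllib3

# Suppress the InsecureRequestWarning that verify=False triggers
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)

# Skip certificate verification (only do this for sites you trust)
response = requests.get('https://inv-veri.xxxx.gov.cn/', verify=False)
print(response.status_code)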

Summary:

The requests module can simulate a browser to fetch web pages, but it falls short on pages whose content is rendered dynamically. That is my personal summary of the module.


Origin blog.csdn.net/m0_69984273/article/details/130896941