[Python crawler] Requests: a simple and powerful HTTP request library

1 Introduction

In modern web development, HTTP communication with the server is an important task. Python's Requests library is a simple and powerful third-party library that provides a concise API that makes sending HTTP requests very easy. This tutorial will show you how to use the Python Requests library to send various types of HTTP requests and process the responses.

1.1 HTTP request and response

Before we start, let's briefly understand the basic concepts of HTTP requests and responses. HTTP is a protocol for communication between clients and servers. The client sends a request to the server, and the server returns a corresponding response. The request includes information such as method (GET, POST, etc.), URL, request header, and request body, while the response includes information such as status code, response header, and response body.

1.2 The role and advantages of the Python Requests library

The Python Requests library is a popular third-party library for sending HTTP requests and handling responses. It provides a clean API that makes sending requests very simple. Some of Python's built-in HTTP libraries (such as urllib) can also accomplish similar tasks, but the Requests library is easier to use and provides more features and flexibility.

1.3 Install the Requests library

Before starting, make sure you have a Python interpreter installed. To install the Requests library, run the following pip command:

pip install requests

After the installation is complete, we can start sending HTTP requests using the Requests library.

2. Send a GET request

A GET request is used to fetch data from the server. Below are some examples of common GET requests.

2.1 Send a basic GET request

Sending a simple GET request is very easy using the Requests library. Just provide the target URL.

import requests

response = requests.get('https://api.example.com/data')
print(response.text)

The above code will send a GET request to https://api.example.com/data and print out the content of the response.

2.2 Add query parameters

Sometimes, we need to add query parameters to the URL to get specific data. You can use the params parameter to specify query parameters.

import requests

payload = {'key1': 'value1', 'key2': 'value2'}
response = requests.get('https://api.example.com/data', params=payload)
print(response.text)

The above code will add the query parameters key1=value1 and key2=value2 to the URL.
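
To see exactly what URL Requests builds from params, the request can be prepared locally without being sent. The sketch below assumes illustrative parameter names (q and tag) that are not part of the original example.

```python
import requests

# Prepare the request locally (nothing is sent) to inspect the final URL.
# The parameter names and values are illustrative.
params = {'q': 'python requests', 'tag': ['http', 'crawler']}
prepared = requests.Request(
    'GET', 'https://api.example.com/data', params=params
).prepare()

# The space is URL-encoded and the list value becomes a repeated key.
print(prepared.url)
```

Letting Requests do the encoding is usually safer than concatenating the query string by hand, because special characters are escaped for you.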

2.3 Set the request header

The request header contains additional information about the request, such as User-Agent, Accept, etc. The headers parameter can be used to set request headers.

import requests

headers = {'User-Agent': 'Mozilla/5.0'}
response = requests.get('https://api.example.com/data', headers=headers)
print(response.text)

The above code sets the User-Agent request header to simulate a browser sending a request.
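
As a rough illustration of why the header is worth overriding, the sketch below prints the User-Agent that Requests uses by default and then shows how a per-request header replaces it. It prepares the request through a Session without sending anything.

```python
import requests

# What Requests sends when no User-Agent is set, e.g. "python-requests/2.x".
print(requests.utils.default_user_agent())

# A session merges its default headers with the per-request headers;
# the per-request value wins. Nothing is sent here.
session = requests.Session()
request = requests.Request(
    'GET', 'https://api.example.com/data',
    headers={'User-Agent': 'Mozilla/5.0'},
)
prepared = session.prepare_request(request)

print(prepared.headers['User-Agent'])   # Mozilla/5.0
print(prepared.headers['Accept'])       # session default, left untouched
```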

2.4 Handling the response

Once we get the response, we can perform various operations on it. For example, the status code, header information and content of the response can be obtained.

import requests

response = requests.get('https://api.example.com/data')
print(response.status_code)  # Print the status code
print(response.headers)      # Print the response headers
print(response.text)         # Print the response body

The above code shows how to get the status code, response header and response content.
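
Many APIs return JSON, in which case response.json() is often more convenient than response.text. The helper below is a minimal sketch; the function name summarize and the shape of the returned dictionary are assumptions made for illustration.

```python
import requests

def summarize(response):
    """Collect the commonly used parts of a requests.Response."""
    content_type = response.headers.get('Content-Type', '')
    summary = {
        'status': response.status_code,
        'ok': response.ok,              # True for status codes below 400
        'content_type': content_type,
    }
    # Only parse the body as JSON when the server says it is JSON.
    if 'application/json' in content_type:
        summary['json'] = response.json()
    else:
        summary['text'] = response.text
    return summary
```

It could be called as summarize(requests.get('https://api.example.com/data')).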

3. Send a POST request

POST requests are used to submit data to the server. Below are some examples of common POST requests.

3.1 Send a basic POST request

Sending a simple POST request is also easy using the Requests library. Just provide the destination URL and the data to send.

import requests

payload = {'key1': 'value1', 'key2': 'value2'}
response = requests.post('https://api.example.com/submit', data=payload)
print(response.text)

The above code will send a POST request to https://api.example.com/submit and print out the content of the response.

3.2 Send form data

In web development, forms are often used to collect user input data. The data parameter can be used to send form data.

import requests

data = {'username': 'john', 'password': 'secret'}
response = requests.post('https://api.example.com/login', data=data)
print(response.text)

The above code sends username and password as form data to https://api.example.com/login.

3.3 Send JSON data

In addition to sending form data, data in JSON format can also be sent. The json parameter can be used to send JSON data.

import requests

data = {'name': 'John Doe', 'age': 30}
response = requests.post('https://api.example.com/user', json=data)
print(response.text)

The above code will convert the data dictionary into JSON format and send it to https://api.example.com/user.
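
The difference between data and json is easiest to see by comparing the bodies and Content-Type headers that Requests generates. The following sketch prepares both variants locally without sending them.

```python
import json
import requests

url = 'https://api.example.com/user'   # placeholder endpoint
payload = {'name': 'John Doe', 'age': 30}

# Prepare both variants locally; nothing is sent.
as_form = requests.Request('POST', url, data=payload).prepare()
as_json = requests.Request('POST', url, json=payload).prepare()

print(as_form.headers['Content-Type'])  # application/x-www-form-urlencoded
print(as_form.body)                     # name=John+Doe&age=30

print(as_json.headers['Content-Type'])  # application/json
print(json.loads(as_json.body))         # {'name': 'John Doe', 'age': 30}
```

Which one to use depends on what the server expects: HTML-style form handlers usually read form-encoded bodies, while most JSON APIs expect the json variant.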

3.4 Handling the response

Similar to GET requests, we can also perform various operations on the responses of POST requests. For example, get status code, header information and content.

import requests

response = requests.post('https://api.example.com/submit', data={'key': 'value'})
print(response.status_code)  # Print the status code
print(response.headers)      # Print the response headers
print(response.text)         # Print the response body

The above code shows how to get the response information of the POST request.

4. Request session management

In some cases, we may need to maintain session state or handle cookies. The Requests library provides session objects to handle these situations.

4.1 Using session objects

Use the session object to share state between multiple requests. The session object can keep cookies, set request headers, etc.

import requests

session = requests.Session()
session.get('https://api.example.com/login')
response = session.get('https://api.example.com/dashboard')
print(response.text)

The above code creates a session object, requests the login page, and then sends another request to the dashboard page through the same session, so any cookies set by the first response are reused.

4.2 Keep the session state

The session object automatically persists cookies and sends them automatically on subsequent requests. This is useful for simulating user logins and making consecutive requests.

import requests

session = requests.Session()
login_data = {'username': 'john', 'password': 'secret'}
session.post('https://api.example.com/login', data=login_data)
response = session.get('https://api.example.com/dashboard')
print(response.text)

The above code maintains the session state after login and uses the same session object to send subsequent requests.

4.3 Handling Cookies

The session object also handles cookies conveniently. The cookies of the current session can be read from the cookies attribute, and custom cookies can be sent with the cookies parameter.

import requests

session = requests.Session()
session.get('https://api.example.com/login')
cookies = session.cookies.get_dict()  # Get the cookies of the current session
response = session.get('https://api.example.com/dashboard', cookies=cookies)
print(response.text)

The above code reads the current session cookies and passes them explicitly to the dashboard request. The session would also send them automatically without the cookies parameter.
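
To check what a session would actually attach to a request, the request can be prepared without sending it. In this sketch the cookie name, value, and domain are made up for illustration; a real server would normally set them through a Set-Cookie header.

```python
import requests

session = requests.Session()

# Store a cookie on the session by hand. The name, value, and domain
# are hypothetical; a real server would set them for you.
session.cookies.set('sessionid', 'abc123', domain='api.example.com', path='/')

# Prepare (but do not send) a request to see what the session attaches.
request = requests.Request('GET', 'https://api.example.com/dashboard')
prepared = session.prepare_request(request)

print(session.cookies.get_dict())    # {'sessionid': 'abc123'}
print(prepared.headers['Cookie'])    # sessionid=abc123
```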

5. Handling exceptions and errors

In HTTP requests, various exceptions and errors may occur. The Requests library provides mechanisms to handle these exceptions and errors.

5.1 Processing request timeout

To keep a request from waiting too long, the timeout parameter can be set to limit the waiting time of the request.

import requests

try:
    response = requests.get('https://api.example.com/data', timeout=5)
    print(response.text)
except requests.Timeout:
    print('Request timed out')

The above code sets the request timeout to 5 seconds and catches the Timeout exception.
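
The timeout parameter also accepts a (connect, read) tuple, which limits the time to establish the connection and the time to wait for data separately. The helper below is a sketch; the function name fetch_text and the default values are assumptions chosen for illustration.

```python
import requests

def fetch_text(url, connect_timeout=3.05, read_timeout=10):
    """Return the response body, or None if either timeout is exceeded."""
    try:
        # A (connect, read) tuple limits each phase separately.
        response = requests.get(url, timeout=(connect_timeout, read_timeout))
    except requests.exceptions.Timeout:
        # Covers both ConnectTimeout and ReadTimeout.
        return None
    return response.text
```

A short connect timeout with a longer read timeout is a common combination, since an unreachable host should fail quickly while a slow but working server may need more time to respond.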

5.2 Handling Connection Errors

If there is an error connecting to the server, the requests.ConnectionError exception can be caught.

import requests

try:
    response = requests.get('https://api.example.com/data')
    print(response.text)
except requests.ConnectionError:
    print('Connection error')

The above code catches the exception when a connection error occurs.

5.3 Handling HTTP error status codes

If the server returns an error HTTP status code, the response.raise_for_status() method can be used to raise an exception.

import requests

response = requests.get('https://api.example.com/data')
try:
    response.raise_for_status()
    print(response.text)
except requests.HTTPError:
    print('HTTP error')

The above code checks the status code of the response and raises an exception if it is a 4xx (client error) or 5xx (server error) code. Other codes, including 3xx redirects, do not raise.
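
Which status codes trigger the exception can be checked locally. The sketch below builds empty Response objects by hand instead of contacting a server; the helper name raises_http_error is an assumption for illustration.

```python
import requests

def raises_http_error(status_code):
    """Report whether raise_for_status() raises for the given status code."""
    response = requests.Response()   # an empty response object, built locally
    response.status_code = status_code
    try:
        response.raise_for_status()
    except requests.exceptions.HTTPError:
        return True
    return False

for code in (200, 302, 404, 500):
    print(code, raises_http_error(code))
# 200 False
# 302 False
# 404 True
# 500 True
```

When the exception is raised, the failed response is still available as the response attribute of the exception, which is useful for logging the status code or body.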

6. Advanced features and extensions

In addition to basic HTTP requests, the Requests library also provides some advanced functions and extensions to meet more complex needs.

6.1 File upload and download

Uploading and downloading files is easy with the Requests library.

import requests

# File upload: open the file in a with block so it is closed afterwards
with open('data.txt', 'rb') as upload:
    files = {'file': upload}
    response = requests.post('https://api.example.com/upload', files=files)

# File download
response = requests.get('https://api.example.com/download/data.txt')
with open('data.txt', 'wb') as file:
    file.write(response.content)

The above code shows an example of file upload and download.
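
Reading response.content loads the whole file into memory, which may be a problem for large downloads. A possible alternative is to stream the body in chunks, as in the sketch below. The function name download and the default chunk size are assumptions for illustration.

```python
import requests

def download(url, path, chunk_size=8192, timeout=10):
    """Stream a download to disk and return the number of bytes written."""
    written = 0
    # stream=True defers reading the body until iter_content() is called.
    with requests.get(url, stream=True, timeout=timeout) as response:
        response.raise_for_status()
        with open(path, 'wb') as file:
            for chunk in response.iter_content(chunk_size=chunk_size):
                file.write(chunk)
                written += len(chunk)
    return written
```

It could be called as download('https://api.example.com/download/data.txt', 'data.txt').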

6.2 SSL verification and certificates

The Requests library supports SSL verification and client-side certificates.

import requests

response = requests.get('https://api.example.com', verify=True)  # Enable SSL verification (the default)

# Use a client-side certificate
response = requests.get('https://api.example.com', cert=('client.crt', 'client.key'))

The above code shows how to enable SSL verification and how to present a client-side certificate to the server.

6.3 Proxy settings

If you need to send requests through a proxy server, you can use the proxies parameter to set the proxy.

import requests

proxies = {'http': 'http://127.0.0.1:8080', 'https': 'http://127.0.0.1:8080'}
response = requests.get('https://api.example.com', proxies=proxies)

The above code sends the request through the proxy server.

7. Best practices

7.1 Using session objects

Use the session object to better manage session state and share data when sending multiple related requests. This increases efficiency and reduces unnecessary duplication of operations. Especially in situations where you need to keep logged in or handle cookies, using a session object is very convenient.

import requests

session = requests.Session()
session.get('https://api.example.com/login')
# Send other requests...

7.2 Handling exceptions

When sending a request, it is inevitable to encounter some abnormal conditions, such as connection timeout, server error, etc. In order to ensure the robustness of the program, it is recommended to use exception handling mechanism to catch and handle these exceptions.

import requests

try:
    response = requests.get('https://api.example.com/data', timeout=5)
    response.raise_for_status()
    # Process the response...
except requests.exceptions.Timeout:
    print('Request timed out')
except requests.exceptions.HTTPError:
    print('HTTP error')
except requests.exceptions.RequestException as e:
    print('Request failed:', str(e))
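
The order of the except clauses above matters because the Requests exceptions form a hierarchy with RequestException at the top. The relationships can be checked directly, as this short sketch does.

```python
from requests import exceptions

# Every exception Requests raises itself derives from RequestException,
# so it must come last in a chain of except clauses.
for name in ('Timeout', 'ConnectionError', 'HTTPError', 'SSLError'):
    cls = getattr(exceptions, name)
    print(name, issubclass(cls, exceptions.RequestException))   # all True

# The two timeout flavours share the Timeout base class.
print(issubclass(exceptions.ConnectTimeout, exceptions.Timeout))   # True
print(issubclass(exceptions.ReadTimeout, exceptions.Timeout))      # True

# SSLError is a kind of ConnectionError.
print(issubclass(exceptions.SSLError, exceptions.ConnectionError))  # True
```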

7.3 Setting the timeout period

It is important to set an appropriate timeout when sending requests. Requests does not apply a timeout by default, so a request to an unresponsive server can block indefinitely. By setting the timeout parameter, you can limit the waiting time of the request.

import requests

response = requests.get('https://api.example.com/data', timeout=5)

It is recommended to set an appropriate timeout period according to the specific situation to avoid long-term blocking of requests.

7.4 Check the response status code

When processing a response, it is often necessary to check its status code. A 2xx status code indicates that the request was successful, while a 4xx or 5xx status code indicates an error that may need to be handled.

import requests

response = requests.get('https://api.example.com/data')
if response.status_code == 200:
    print('Request succeeded')
else:
    print('Request failed:', response.status_code)

Depending on the status code, different processing logic can be adopted, such as retrying the request, logging or throwing an exception.
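
For the retry case, one option is to let the transport layer retry automatically instead of writing a loop by hand. The sketch below uses the Retry class from urllib3 (the library Requests is built on) together with HTTPAdapter. The function name make_retrying_session, the retry count, the backoff factor, and the list of status codes are assumptions to be tuned for the actual service.

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

def make_retrying_session(total=3, backoff_factor=0.5):
    """Return a session that retries selected failures automatically."""
    retry = Retry(
        total=total,                            # maximum number of retries
        backoff_factor=backoff_factor,          # wait longer between attempts
        status_forcelist=[500, 502, 503, 504],  # also retry on these statuses
    )
    adapter = HTTPAdapter(max_retries=retry)
    session = requests.Session()
    session.mount('https://', adapter)
    session.mount('http://', adapter)
    return session
```

By default urllib3 only retries on status codes for idempotent methods such as GET, so POST requests are not resent unless the Retry object is configured otherwise.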

8. Frequently asked questions

8.1 Certificate verification failed

In some cases, the Requests library may raise a requests.exceptions.SSLError exception when the requested URL uses the HTTPS protocol and the certificate verification fails. This is usually because the target website's certificate is invalid or expired.

To work around this problem, you can set the verify parameter to False to skip certificate verification.

import requests

response = requests.get('https://api.example.com', verify=False)

Note that skipping certificate verification is a security risk and is recommended only for testing environments.
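
When the failure is caused by a certificate signed by a private or internal certificate authority, a safer alternative may be to keep verification on and point verify at that authority's certificate. This is a sketch: the function name make_session and the file name internal-ca.pem are hypothetical.

```python
import requests

def make_session(ca_bundle_path):
    """Return a session that verifies servers against a specific CA bundle."""
    session = requests.Session()
    # verify accepts True, False, or a path to a CA bundle file.
    session.verify = ca_bundle_path
    return session

# 'internal-ca.pem' is a hypothetical file containing the CA certificate
# that signed the server's certificate.
session = make_session('internal-ca.pem')
print(session.verify)
```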

8.2 Redirect problem

By default, the Requests library handles redirection automatically. When the server returns a redirection response, the Requests library will automatically follow the redirection and return the final response result.

import requests

response = requests.get('https://api.example.com/redirect')
print(response.url)   # Print the final URL after redirection

If you need to disable redirection, you can set the allow_redirects parameter to False.

import requests

response = requests.get('https://api.example.com/redirect', allow_redirects=False)
print(response.status_code)   # Print the status code of the redirect response
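
When redirects are followed, the intermediate responses are kept in response.history. The helper below is a small sketch that lists every hop; the function name redirect_chain is an assumption for illustration.

```python
import requests

def redirect_chain(response):
    """Return (status_code, url) for every hop, ending with the final response."""
    hops = [(r.status_code, r.url) for r in response.history]
    hops.append((response.status_code, response.url))
    return hops
```

It could be called as redirect_chain(requests.get('https://api.example.com/redirect')). If no redirect happened, the list contains only the final response.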

8.3 Chinese encoding problem

Encoding issues are sometimes encountered when processing responses containing Chinese characters. To avoid garbled characters or decoding errors, you can set response.encoding to the correct character encoding before reading response.text.

import requests

response = requests.get('https://api.example.com/data')
response.encoding = 'utf-8'  # Specify the character encoding
content = response.text

Set response.encoding to whichever character encoding the server actually uses.
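
The garbling usually comes from the way the encoding is guessed. Requests takes it from the Content-Type header and, for text responses without a charset, falls back to ISO-8859-1. The sketch below shows that lookup and what decoding with the wrong codec does to Chinese text; the header values are illustrative.

```python
import requests

# Requests picks the text encoding from the Content-Type header.
print(requests.utils.get_encoding_from_headers(
    {'content-type': 'text/html; charset=gbk'}))   # gbk

# With no charset on a text/* response it falls back to ISO-8859-1.
print(requests.utils.get_encoding_from_headers(
    {'content-type': 'text/html'}))                # ISO-8859-1

# Decoding Chinese bytes with the wrong codec produces garbled text.
raw = '中文'.encode('utf-8')
print(raw.decode('iso-8859-1'))   # garbled characters
print(raw.decode('utf-8'))        # 中文
```

If the correct encoding is not known in advance, response.apparent_encoding offers a guess based on the response body, which can then be assigned to response.encoding.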

8.4 Connection pool exhausted

When the program frequently sends a large number of requests, it may cause the problem of connection pool exhaustion. At this time, you can improve concurrency performance by increasing the size of the connection pool.

import requests

adapter = requests.adapters.HTTPAdapter(pool_connections=100, pool_maxsize=100)
session = requests.Session()
session.mount('https://', adapter)

By setting pool_connections and pool_maxsize to appropriate values, the capacity of the connection pool can be increased to meet the demands of high concurrent requests.

9. Summary

This tutorial covers basic usage and advanced features of the Python Requests library. You learned how to send GET and POST requests, handle responses, manage session state, handle exceptions and errors, and explore some advanced features. By mastering the Requests library, HTTP request and response processing in web development can be easily performed.

I hope this tutorial has given you a deeper understanding of the Python Requests library, and that you can now use it flexibly to handle various HTTP communication requirements.


Origin blog.csdn.net/mingfeng4923/article/details/131077681