[Deep Learning] pytorch——Autograd

The notes are study notes compiled by myself. If there are any mistakes, please point them out~

Deep learning column link:
http://t.csdnimg.cn/dscW7

Introduction to Autograd

autograd is the automatic differentiation engine in PyTorch and one of its core components. It provides a mechanism for computing gradients, making the training of neural networks more concise and efficient.

In deep learning, gradients are a key part of optimization algorithms such as backpropagation. By calculating the gradient of the input variable relative to the output variable, you can determine how to update the parameters of the model to minimize the loss function.

autograd works by keeping track of all operations performed on tensors and building a directed acyclic graph (DAG), called a computation graph. This graph records the dependencies between tensors as well as the gradient function of each operation. During the forward pass, autograd performs the required calculations and saves intermediate results. When the .backward() function is called, autograd automatically computes the gradients according to the computation graph and stores them in the .grad attribute of each tensor.

Using autograd is very simple. Simply set requires_grad=True on the tensors whose gradients you need, and then perform the forward and backward passes. For example:

import torch as t

x = t.tensor([2.0], requires_grad=True)
y = x**2 + 3*x + 1

y.backward()

print(x.grad)  # Output: tensor([7.])

In the above code, a tensor x is first created with requires_grad=True, indicating that we want gradients with respect to x. Then a computation graph is built by performing a series of operations on x to obtain y. Finally, y.backward() is called to perform backpropagation, and the computed gradient is read from x.grad.

autograd makes training neural networks more convenient, since there is no need to manually derive and update gradients. At the same time, it provides the flexibility and extensibility needed for more complex computation graphs and custom gradient functions.

requires_grad

requires_grad is an attribute of a tensor in PyTorch, used to specify whether the gradient of the tensor needs to be calculated. If the gradient needs to be calculated, it needs to be set to True, otherwise set to False. By default, the attribute value is False.

In deep learning, it is usually necessary to optimize the parameters of the model, so the gradients of these parameters need to be calculated. By setting a parameter tensor's requires_grad property to True, you tell PyTorch to track its computations and compute its gradient. In addition to parameter tensors, any other tensor whose gradient is needed can also be set to requires_grad=True.

It should be noted that if a tensor's requires_grad attribute is True, the computational cost increases slightly, because PyTorch needs to track the operations performed on the tensor and compute its gradient. Therefore, for tensors that do not need gradients, it is best to leave requires_grad set to False to reduce computational cost.
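
As a quick illustration (this snippet is not from the original article, just a sketch of the behavior described above), requires_grad defaults to False, and tracking propagates to any result that depends on a tracked tensor:

import torch

a = torch.ones(2, 2)                       # requires_grad defaults to False
b = torch.ones(2, 2, requires_grad=True)   # computations on b will be tracked

print(a.requires_grad)   # False
print(b.requires_grad)   # True

c = (a + b).sum()
print(c.requires_grad)   # True: c depends on b, so it is tracked

c.backward()
print(b.grad)   # tensor([[1., 1.], [1., 1.]])
print(a.grad)   # None: no gradient is computed for a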

Computational graph

Under the hood, PyTorch's autograd is built on a computation graph. A computation graph is a special directed acyclic graph (DAG) used to record the relationships between operators and variables. By convention, rectangles represent operators and ovals represent variables. In the figure below, MUL and ADD are operators, while a, b, and c are variables.
[Figure: computation graph with operator nodes MUL and ADD and variable nodes a, b, c]
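
As a small illustration (the variable names only mirror the figure; the code is not from the original article), the grad_fn attribute exposes the operator nodes that autograd records while building the graph:

import torch

a = torch.tensor([2.0], requires_grad=True)
b = torch.tensor([3.0], requires_grad=True)
c = torch.tensor([4.0], requires_grad=True)

d = a + b      # ADD operator node
z = d * c      # MUL operator node

print(d.grad_fn)   # <AddBackward0 object at ...>
print(z.grad_fn)   # <MulBackward0 object at ...>
print(z.grad_fn.next_functions)   # links back to the ADD node and to c

z.backward()
print(a.grad, b.grad, c.grad)   # tensor([4.]) tensor([4.]) tensor([5.])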

Tensor.data and tensor.detach() without gradient tracking

Both tensor.data and tensor.detach() can be used to obtain a copy of a tensor that is not tracked by autograd, but there are some subtle differences between them.

tensor.data is a property that returns a new tensor that shares data storage with the original tensor but carries no gradient information. This means that operations on the returned tensor are not tracked by autograd and do not affect the gradient of the original tensor. However, because the storage is shared, in-place modifications of the returned tensor also change the values of the original tensor, and autograd has no way of detecting this.

Here is an example description:

import torch

x = torch.tensor([2.0], requires_grad=True)
y = x**2 + 3*x + 1

z = y.data
z *= 2  # modifying z is not tracked and does not affect y's gradient

y.backward()

print(x.grad)  # Output: tensor([7.])

In the above code, we first create a tensor x with requires_grad=True, indicating that we want gradients with respect to x. Then we build the computation graph for y and assign y.data to z; operations on z are not tracked by autograd and do not affect the gradient. Finally, we call y.backward() to compute the gradient with respect to x and store it in x.grad.

tensor.detach() is a method that returns a new tensor sharing the same data as the original but detached from the computation graph. Unlike tensor.data, the tensor returned by detach() is still protected by autograd's version counter: if you modify it in place and the original value is needed for the backward pass, PyTorch raises an error instead of silently computing a wrong gradient.

The following is an example of using tensor.detach():

import torch

x = torch.tensor([2.0], requires_grad=True)
y = x**2 + 3*x + 1

z = y.detach()
z *= 2  # modifying z is not tracked and does not affect y's gradient

y.backward()

print(x.grad)  # Output: tensor([7.])

In the above code, we perform the same operations as in the previous example, this time assigning y.detach() to z; operating on z does not affect the gradient. Finally, we call y.backward() to compute the gradient with respect to x and store it in x.grad.

In summary, both tensor.data and tensor.detach() can be used to obtain a copy of a tensor that is not tracked by autograd, but tensor.detach() is the safer and recommended choice, because autograd can still detect in-place modifications of the detached tensor.
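
The sketch below (an illustration of this difference, not code from the original article) uses sigmoid, whose backward pass needs its saved output, to show how detach() catches an unsafe in-place modification while .data lets it slip through:

import torch

x = torch.tensor([2.0], requires_grad=True)
y = torch.sigmoid(x)     # sigmoid saves its output for the backward pass

z = y.detach()
z.zero_()                # in-place change is detected by the version counter
try:
    y.backward()
except RuntimeError as e:
    print('detach() caught the problem:', e)

x2 = torch.tensor([2.0], requires_grad=True)
y2 = torch.sigmoid(x2)
y2.data.zero_()          # the same change via .data goes unnoticed
y2.backward()
print(x2.grad)           # tensor([0.]) - silently wrong (should be about 0.105)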

Gradient of non-leaf nodes

During backpropagation, the gradients of non-leaf nodes are not retained by default; they are freed as soon as they have been used. There are two ways to access them:

1. Use the .retain_grad() method: call .retain_grad() on a non-leaf tensor to explicitly request that its gradient be retained. After backpropagation, the gradient of that non-leaf node can then be accessed via .grad.

import torch

x = torch.tensor([2.0], requires_grad=True)
y = x**2 + 3*x + 1

y.retain_grad()

z = y.mean()

z.backward()

grad_y = y.grad

print(grad_y)  # Output: tensor([1.])

2. Use a hook. A hook is a function that takes the gradient as its input and normally returns nothing.

import torch

def variable_hook(grad):
    print('gradient of y:', grad)

x = torch.ones(3, requires_grad=True)
w = torch.rand(3, requires_grad=True)
y = x * w
# register the hook
hook_handle = y.register_hook(variable_hook)
z = y.sum()
z.backward()

# unless you need the hook every time, remember to remove it after use
hook_handle.remove()

Summary of computational graph features

In PyTorch, a computational graph is a data structure used to represent computational processes.

Dynamic computation graph: PyTorch uses a dynamic computation graph, which means that the computation graph is dynamically constructed based on the actual execution process. This allows the computational graph to be flexibly constructed based on input data during each forward pass.
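
For example (a minimal sketch, not taken from the original article), ordinary Python control flow can decide what the graph looks like on each forward pass, because the graph is recorded while the code runs:

import torch

def forward(x):
    # the graph is rebuilt on every call, so plain Python branching is fine
    if x.sum() > 0:
        return (x * 2).sum()
    return (x ** 2).sum()

x = torch.randn(3, requires_grad=True)
loss = forward(x)   # a fresh graph is recorded during this call
loss.backward()
print(x.grad)       # either all 2s or 2*x, depending on the branch taken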

Automatic differentiation: PyTorch's calculation graph is not only used to represent the calculation process, but also supports automatic differentiation. Through computational graphs, PyTorch can automatically calculate gradients without manually writing the backpropagation algorithm. This greatly simplifies the training process of deep learning models.

Node-based representation: The computation graph consists of a series of nodes (Node) and edges (Edge), where nodes represent operations (such as tensor operations) or variables (such as weights), and edges represent the flow of data. Each node contains the information required for forward computation and backpropagation.

Leaf nodes and non-leaf nodes: In the computational graph, leaf nodes are nodes without input edges, usually representing input data or parameters that require gradients. Non-leaf nodes are nodes with input edges that represent computational operations. During backpropagation, by default only the gradients of leaf nodes are retained; the gradients of non-leaf nodes are computed as needed and then freed.
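
A quick check with the is_leaf attribute (again just an illustrative sketch, not from the original article):

import torch

x = torch.tensor([1.0], requires_grad=True)   # leaf: created directly by the user
y = x * 2                                     # non-leaf: produced by an operation

print(x.is_leaf, y.is_leaf)   # True False

y.backward()
print(x.grad)   # tensor([2.]) - retained, because x is a leaf
print(y.grad)   # None - not retained for non-leaf nodes (PyTorch also warns here)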

Delayed execution: Computational graphs in PyTorch are executed on demand. That is to say, during the forward propagation process, only the nodes that actually need to be calculated will be executed, and the nodes that do not need to be calculated will be skipped. This delayed execution improves efficiency, especially for large models and complex computational graphs.

Computational graph optimization: PyTorch uses some optimization techniques internally to improve the efficiency of computational graphs. For example, shared memory caches intermediate results to avoid repeated calculations; and merges multiple operations into one operation to reduce computing and memory overhead. These optimization techniques can increase calculation speed and reduce memory footprint.

Implement linear regression using Autograd

[Deep Learning] pytorch - Linear Regression:http://t.csdnimg.cn/7KsP3

The previous article was about manually calculating the gradient. Here we use Autograd to automatically calculate the gradient.

import torch as t
%matplotlib inline
from matplotlib import pyplot as plt
from IPython import display
import numpy as np

# Set the random seed so the output below is reproducible across machines
t.manual_seed(1000)

device = t.device('cpu')  # device used by get_fake_data (CPU, so .numpy() works below)

def get_fake_data(batch_size=8):
    '''Generate random data: y = x*2 + 3, with some noise added'''
    x = t.rand(batch_size, 1, device=device) * 5
    y = x * 2 + 3 + t.randn(batch_size, 1, device=device)
    return x, y

# Randomly initialize the parameters
w = t.rand(1, 1, requires_grad=True)
b = t.zeros(1, 1, requires_grad=True)
losses = np.zeros(500)

lr = 0.02  # learning rate

for ii in range(500):
    x, y = get_fake_data(batch_size=4)
    
    # forward: compute the loss
    y_pred = x.mm(w) + b.expand_as(y)
    loss = 0.5 * (y_pred - y) ** 2  # squared error
    loss = loss.sum()
    losses[ii] = loss.item()
    
    # backward: gradients are computed automatically
    loss.backward()
    
    # update the parameters
    w.data.sub_(lr * w.grad.data)
    b.data.sub_(lr * b.grad.data)
    
    # zero the gradients
    w.grad.data.zero_()
    b.grad.data.zero_()
    
    if ii % 50 == 0:
        # plot the current fit
        display.clear_output(wait=True)
        x = t.arange(0, 6).view(-1, 1).float()
        y = x.mm(w.data) + b.data.expand_as(x)
        plt.plot(x.numpy(), y.numpy(), color='b')  # predicted
        
        x2, y2 = get_fake_data(batch_size=100)
        plt.scatter(x2.numpy(), y2.numpy(), color='r')  # true data
        
        plt.xlim(0, 5)
        plt.ylim(0, 15)
        plt.show()
        plt.pause(0.5)

print('w: ', w.item(), 'b: ', b.item())

[Figure: fitted line (blue) against sampled data points (red)]
w: 2.036161422729492 b: 3.095750331878662

Here are the main steps of the code:

  1. Define a get_fake_data function that generates random data with noise; the true relationship of the data is y = x*2 + 3.

  2. Initialize parameters w and b, and set requires_grad=True so that gradients are calculated automatically.

  3. Perform 500 training rounds, each training round includes the following steps:

    • Get a mini-batch of training data from the get_fake_data function.
    • Forward propagation: compute the model's prediction y_pred, i.e. the linear combination of x with the parameters w and b.
    • Compute the squared-error loss.
    • Backpropagation: automatically compute the gradients of w and b.
    • Update the parameters: adjust w and b by gradient descent.
    • Zero the gradients: reset the parameters' gradients so they can be computed afresh in the next iteration.
    • Every 50 iterations, visualize the current model's predictions together with a scatter plot of newly sampled data.
  4. After training, print the final learned parameters w and b.

plt.plot(losses)
plt.ylim(0,50)

This visualizes how the loss changes over the training iterations. losses is an array of length 500 that records the loss value after each iteration. plt.plot(losses) connects these loss values into a curve against the iteration number, so the downward trend of the loss during training can be seen at a glance.

plt.ylim(0, 50) restricts the range of the y-axis so that the shape of the curve is displayed clearly.
