In-depth analysis of PyTorch Autograd: from principle to practice

This article takes a deep dive into the core principles and capabilities of Autograd in PyTorch: from basic concepts and the interaction between Tensors and Autograd, through the construction and management of computational graphs, to the details of backpropagation and gradient calculation, and finally Autograd's advanced features.

Follow TechLead for well-rounded knowledge sharing on AI. The author has 10+ years of experience in internet service architecture, AI product development, and team management. He studied at Tongji University and Fudan University, is a member of the Fudan Robot Intelligence Laboratory, an Alibaba Cloud certified senior architect, and a certified project management professional, and is the R&D lead for AI products with revenue in the hundreds of millions.

1. PyTorch and automatic differentiation (Autograd)

Automatic Differentiation (Autograd for short) is one of the core technologies in the field of deep learning and scientific computing. It not only plays a vital role in the training process of neural networks, but also plays a key role in numerical solutions to various engineering and scientific problems.

1.1 Basic principles of automatic differentiation

In mathematics, differential calculus is a method of calculating the local rate of change of a function and is widely used in physics, engineering, economics and other fields. Automatic differentiation is a technology that automatically calculates the derivative or gradient of a function through a computer program.

The key to automatic differentiation is to decompose a complex function into a composition of simple primitive operations and then apply the chain rule to compute derivatives. This approach differs from numerical differentiation (which approximates derivatives with finite differences) and symbolic differentiation (which manipulates symbolic expressions): it computes derivatives to machine precision while avoiding the expression swell of symbolic differentiation and the truncation and rounding errors of numerical differentiation.

import torch

# Example: simple automatic differentiation
x = torch.tensor(2.0, requires_grad=True)
y = x ** 2 + 3 * x + 1
y.backward()

# Print the gradient
print(x.grad)  # Should print tensor(7.), i.e. 2*x + 3 evaluated at x=2

1.2 Application of automatic differentiation in deep learning

In deep learning, the core of training a neural network is to optimize the loss function, that is, adjust the network parameters to minimize the loss. This process requires computing the gradient of the loss function with respect to the network parameters, and automatic differentiation plays a key role here.

Taking a simple linear regression model as an example, the goal is to find a set of parameters that make the model's predictions as close as possible to the actual data. In this process, automatic differentiation lets us efficiently compute the gradient of the loss function with respect to the parameters, which is then used to update the parameters via gradient descent.

# Example: gradient computation in linear regression
x_data = torch.tensor([1.0, 2.0, 3.0])
y_data = torch.tensor([2.0, 4.0, 6.0])

# Model parameter
weight = torch.tensor([1.0], requires_grad=True)

# Forward pass
def forward(x):
    return x * weight

# Loss function (mean squared error; backward() needs a scalar)
def loss(x, y):
    y_pred = forward(x)
    return ((y_pred - y) ** 2).mean()

# Compute the gradient
l = loss(x_data, y_data)
l.backward()

print(weight.grad)  # Print the gradient of the loss w.r.t. weight
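
Once the gradient is available, a single gradient-descent update could look like the following minimal sketch (the learning rate lr is an assumed hyperparameter, not part of the original example):

# One manual gradient-descent step (illustrative)
lr = 0.01
with torch.no_grad():           # parameter updates should not be tracked by Autograd
    weight -= lr * weight.grad  # move the parameter against the gradient
weight.grad.zero_()             # clear the gradient before the next iteration
print(weight)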

1.3 The importance and impact of automatic differentiation

The introduction of automatic differentiation technology greatly simplifies the gradient calculation process, allowing researchers to focus on model design and training without having to manually calculate complex derivatives. This has contributed to the rapid development of deep learning, especially when training large neural networks.

In addition, automatic differentiation has also shown its strong potential in non-deep learning fields, such as applications in physical simulation, financial engineering, and bioinformatics.

2. The core mechanism of PyTorch Autograd

PyTorch Autograd is a powerful tool that allows researchers and engineers to efficiently compute derivatives with minimal manual intervention. Understanding its core mechanism will not only help you make better use of this tool, but also help developers avoid common mistakes and improve model performance and efficiency.

2.1 Interaction between Tensor and Autograd

In PyTorch, Tensor is the cornerstone of building neural networks, and Autograd is the key to implementing neural network training. Understanding how Tensor and Autograd work together is critical to a deep understanding and effective use of PyTorch.

Tensor: the core of PyTorch

Tensors in PyTorch are similar to NumPy arrays, but they have one additional capability: gradients can be computed for them automatically through the Autograd system.

  • Tensor properties: every Tensor has a requires_grad attribute. When it is set to True, PyTorch tracks all operations on that Tensor so that gradients can be computed automatically, as sketched below.
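
As a quick illustration (a minimal sketch; the values are arbitrary), requires_grad can be set at creation time, enabled in place with requires_grad_(), or dropped with detach():

import torch

a = torch.tensor([1.0, 2.0])   # requires_grad defaults to False
print(a.requires_grad)         # False

a.requires_grad_(True)         # enable gradient tracking in place
print(a.requires_grad)         # True

b = a.detach()                 # same data, but detached from Autograd
print(b.requires_grad)         # False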

Autograd: an engine for automatic differentiation

Autograd is PyTorch's automatic differentiation engine, responsible for tracking the operations that are needed to compute gradients.

  • Computational graph: Under the hood, Autograd tracks operations by building a computational graph. This graph is a directed acyclic graph (DAG) that records all the operations involved in creating the final output Tensor.
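
One illustrative way to peek at this graph is through each result Tensor's grad_fn node, whose next_functions attribute links back to the nodes that produced its inputs (next_functions is an internal autograd attribute; the expressions below are arbitrary):

import torch

x = torch.tensor(2.0, requires_grad=True)
y = torch.tensor(3.0, requires_grad=True)
z = (x * y) + x

print(z.grad_fn)                 # AddBackward0: the last operation that produced z
print(z.grad_fn.next_functions)  # the graph nodes for the inputs of that addition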

Tensor and Autograd working together

When a Tensor is operated on and a new Tensor is generated, PyTorch automatically builds a computational graph node representing the operation.

  • Example: Tracking of simple operations

    import torch
    
    # Create a Tensor with requires_grad=True to track operations involving it
    x = torch.tensor([2.0], requires_grad=True)

    # Perform an operation
    y = x * x

    # Inspect the grad_fn attribute of y
    print(y.grad_fn)  # Shows which operation produced y
    

    Here y is obtained through a multiplication operation. PyTorch automatically tracks this operation and makes it part of the computational graph.

  • Backpropagation and gradient calculation

    When we call the .backward() method on the output Tensor, PyTorch automatically computes the gradients and stores them in the .grad attribute of each leaf Tensor that has requires_grad=True.

    # Backpropagation: compute the gradients
    y.backward()

    # Inspect the gradient of x
    print(x.grad)  # Should print tensor([4.]), since dy/dx = 2*x, which is 4 at x=2
    

2.2 Construction and management of computational graphs

In deep learning, understanding the construction and management of computational graphs is key to understanding the automatic differentiation and neural network training process. PyTorch uses dynamic computational graphs, one of its core features, which provides great flexibility and intuitiveness.

Basic concepts of computational graphs

A computational graph is a graph-based representation of the relationships that operations (such as addition and multiplication) establish between pieces of data (Tensors). In PyTorch, whenever an operation is performed on a Tensor, a node representing that operation is created and connected to the operation's input and output Tensors.

  • Node: represents an operation on data, such as addition or multiplication.
  • Edge: represents the flow of data, i.e. a Tensor.

Characteristics of dynamic computation graphs

PyTorch's computational graph is dynamic, that is, the construction of the graph occurs at runtime. This means that the graph is constructed in real time as the code is executed, potentially resulting in a new graph for each iteration.

  • Example: creating a dynamic graph

    import torch
    
    x = torch.tensor(1.0, requires_grad=True)
    y = torch.tensor(2.0, requires_grad=True)
    
    # A simple operation
    z = x * y

    # At this point a computational graph has been formed, in which z is obtained from x and y via multiplication
    

Backpropagation and computational graphs

During the training process of deep learning, backpropagation is performed through computational graphs. When the .backward() method is called, PyTorch will start from that point and propagate backward along the graph, calculating the gradient of each node.

  • Example: Backpropagation process

    # Continuing the example above
    z.backward()

    # Inspect the gradients
    print(x.grad)  # dz/dx = y, which is 2 at x=1, y=2
    print(y.grad)  # dz/dy = x, which is 1 at x=1, y=2
    

Management of computational graphs

In practical applications, the management of computational graphs is an important aspect of optimizing memory and computing efficiency.

  • Freeing the graph: by default, PyTorch frees the intermediate buffers of the computational graph as soon as .backward() returns, so each .backward() call is an independent computation; calling backward a second time through the same graph requires passing retain_graph=True (see the sketch after this list). This default keeps memory usage low across tasks that involve many iterations.

  • Disable gradient tracking: In some cases, such as during model evaluation or inference phases, gradients do not need to be calculated. Use torch.no_grad() to temporarily disable gradient calculations, improving computational efficiency and reducing memory usage.

    with torch.no_grad():
        # Inside this block, no computation is tracked for gradients
        y = x * 2
        # Here y.grad_fn is None
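
When a graph does need to be used for more than one backward pass, backward can be told to keep it. A minimal sketch of the retain_graph behavior mentioned above (the tensor values are arbitrary):

import torch

a = torch.tensor(3.0, requires_grad=True)
f = a * a

# Keep the intermediate buffers so backward can run through this graph again
f.backward(retain_graph=True)
print(a.grad)   # tensor(6.)

a.grad.zero_()  # clear the gradient so the second pass is easy to read
f.backward()    # a second backward through the same graph, allowed because it was retained
print(a.grad)   # tensor(6.) again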
    

2.3 Details of backpropagation and gradient calculation

Backpropagation is the core algorithm used in deep learning to train neural networks. In PyTorch, this process relies on the Autograd system to automatically calculate gradients. Understanding the details of backpropagation and gradient calculation is crucial, not only to help us better understand how neural networks learn, but also to guide us in more effective model design and debugging.

Basics of backpropagation

The purpose of the backpropagation algorithm is to calculate the gradient of the loss function relative to the network parameters. In PyTorch, this is typically accomplished by calling the .backward() method on the loss function.

  • Chain Rule: backpropagation is based on the chain rule, which gives the derivatives of composite functions. In a computational graph, the gradient is obtained by traversing the graph backward from output to input and multiplying the local derivatives along the path, as in the sketch below.
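
As a concrete, illustrative check of the chain rule, consider y = sin(x^2); by hand, dy/dx = cos(x^2) * 2x, and Autograd reproduces that value:

import torch
import math

x = torch.tensor(1.0, requires_grad=True)
y = torch.sin(x ** 2)   # composite function: sin applied to x squared
y.backward()

print(x.grad)                        # Autograd's result
print(math.cos(1.0 ** 2) * 2 * 1.0)  # hand-applied chain rule at x=1; should match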

PyTorch implementation of backpropagation

Here is a simple PyTorch example illustrating the basic process of backpropagation:

import torch

# Create Tensors
x = torch.tensor(1.0, requires_grad=True)
w = torch.tensor(2.0, requires_grad=True)
b = torch.tensor(3.0, requires_grad=True)

# Build a simple linear function
y = w * x + b

# Compute the loss
loss = y - 5

# Backpropagation
loss.backward()

# Check the gradients
print(x.grad)  # d(loss)/dx = w = 2
print(w.grad)  # d(loss)/dw = x = 1
print(b.grad)  # d(loss)/db = 1

In this example, the loss.backward() call triggers backpropagation through the entire computational graph, computing the gradients of loss with respect to x, w, and b.

Gradient accumulation

In PyTorch, gradients are accumulated by default: every time .backward() is called, the newly computed gradient is added to the existing value in .grad instead of replacing it, as the short sketch below demonstrates.
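
A quick demonstration of this accumulation (the values are arbitrary):

import torch

p = torch.tensor(2.0, requires_grad=True)

(p * p).backward()
print(p.grad)  # tensor(4.), since d(p^2)/dp = 2p = 4

(p * p).backward()  # a fresh graph is built, but the new gradient is added to .grad
print(p.grad)  # tensor(8.), i.e. 4 + 4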

  • Gradient clearing: In most training loops, we need to clear the gradient before each iteration step to prevent gradient accumulation from affecting the gradient calculation of the current step.
# Zero the gradients
x.grad.zero_()
w.grad.zero_()
b.grad.zero_()

# Run the forward and backward passes again
y = w * x + b
loss = y - 5
loss.backward()

# Check the gradients (same values as before, since they were zeroed first)
print(x.grad)  # d(loss)/dx = w = 2
print(w.grad)  # d(loss)/dw = x = 1
print(b.grad)  # d(loss)/db = 1

Higher-order gradients

PyTorch also supports higher-order gradient computation, in which the gradient itself is differentiated again. This is useful for some advanced optimization algorithms and for applications that need second derivatives.

# Higher-order gradients: rebuild the forward pass so a fresh graph is available
y = w * x + b
z = y * y

# First-order gradient; create_graph=True builds a graph for the gradient itself
dz_dx, = torch.autograd.grad(z, x, create_graph=True)
print(dz_dx)    # dz/dx = 2*(w*x + b)*w = 20 at x=1, w=2, b=3

# Second-order gradient
d2z_dx2, = torch.autograd.grad(dz_dx, x)
print(d2z_dx2)  # d^2z/dx^2 = 2*w^2 = 8

3. A complete guide to Autograd's features

PyTorch's Autograd system provides a set of powerful features that make it an important tool in deep learning and automatic differentiation. These features not only increase programming flexibility and efficiency, but also make complex optimization and calculations feasible.

Dynamic Graph

The Autograd system in PyTorch is based on dynamic computation graphs. This means that the computational graph is built dynamically on each execution, which provides greater flexibility compared to static graphs.

  • Example: adaptability of dynamic graphs

    import torch
    
    x = torch.tensor(1.0, requires_grad=True)
    if x > 0:
        y = x * 2
    else:
        y = x / 2
    y.backward()
    

    This code demonstrates the dynamic-graph nature of PyTorch: the computation path changes depending on the value of x, which is difficult to express in a static-graph framework.
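
Another illustrative sketch (the starting value and loop bound are arbitrary): because the graph is rebuilt on every run, even a data-dependent loop whose number of iterations is unknown in advance can be differentiated through.

import torch

t = torch.tensor(0.5, requires_grad=True)
out = t
# The number of iterations, and hence the depth of the graph, depends on the data
while out.abs() < 10:
    out = out * 3

out.backward()
print(t.grad)  # here out = t * 3**3, so d(out)/dt = 27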

Custom automatic differentiation functions

PyTorch allows users to create custom automatic differentiation functions by subclassing torch.autograd.Function, which makes it possible to implement special or complex forward and backward passes.

  • Example: Custom automatic differentiation function

    class MyReLU(torch.autograd.Function):
        @staticmethod
        def forward(ctx, input):
            ctx.save_for_backward(input)
            return input.clamp(min=0)
    
        @staticmethod
        def backward(ctx, grad_output):
            input, = ctx.saved_tensors
            grad_input = grad_output.clone()
            grad_input[input < 0] = 0
            return grad_input
    
    x = torch.tensor([-1.0, 1.0, 2.0], requires_grad=True)
    y = MyReLU.apply(x)
    y.backward(torch.tensor([1.0, 1.0, 1.0]))
    print(x.grad)  # Prints the gradients: tensor([0., 1., 1.])
    

    This example shows how to define a custom ReLU function and its gradient calculation.
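
Because it is easy to get a hand-written backward pass subtly wrong, it is worth checking it numerically. torch.autograd.gradcheck compares the analytical gradients against finite differences (a sketch with arbitrary random inputs; double precision is recommended for the check):

import torch
from torch.autograd import gradcheck

# Random double-precision inputs; gradcheck compares analytic vs. numerical gradients
inp = torch.randn(8, dtype=torch.double, requires_grad=True)
print(gradcheck(MyReLU.apply, (inp,), eps=1e-6, atol=1e-4))  # True if they agree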

requires_grad and no_grad

In PyTorch, the requires_grad attribute specifies whether gradients need to be computed for a given Tensor. The torch.no_grad() context manager temporarily disables gradient tracking, so no computational graph is built inside it.

  • Example: using requires_grad and no_grad

    x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
    
    with torch.no_grad():
        y = x * 2  # The computation of y is not tracked here
    
    z = x * 3
    z.backward(torch.tensor([1.0, 1.0, 1.0]))
    print(x.grad)  # Only the gradient from z is computed: tensor([3., 3., 3.])
    

    In this example, the computation of y does not contribute to the gradient of x because it happens inside the torch.no_grad() block.

Performance optimization and memory management

PyTorch's Autograd system also includes features for performance optimization and memory management, such as gradient checkpointing, which reduces memory usage by recomputing intermediate activations during the backward pass instead of storing them.

  • Example: gradient checkpointing

    Use torch.utils.checkpoint to reduce memory usage in large networks: the intermediate activations inside the checkpointed function are not stored during the forward pass and are recomputed when they are needed in the backward pass (newer PyTorch versions also ask for the use_reentrant flag to be chosen explicitly).

    import torch.utils.checkpoint as checkpoint
    
    def run_fn(x):
        return x * 2
    
    x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
    y = checkpoint.checkpoint(run_fn, x)
    y.backward(torch.tensor([1.0, 1.0, 1.0]))
    

    This example shows how to use gradient checkpointing to reduce memory usage.

Source: blog.csdn.net/magicyangjay111/article/details/134988525