PyTorch official (English) documentation study notes (3): torch.nn and torch.optim

1. Use of nn.Module

Every module in PyTorch subclasses nn.Module.
Every module you define yourself must also be a subclass of nn.Module.

PyTorch implements the __call__ method in nn.Module, and __call__ calls the forward function.
The main built-in methods and parameters are:
the model.state_dict() and model.parameters() methods
and the weight and bias parameters

import torch
from torch import nn

class Ethan(nn.Module):
    def __init__(self):
        super().__init__()

    def forward(self, input):
        out_put = input + 1
        return out_put

if __name__ == '__main__':
    ethan = Ethan()
    x = torch.tensor(1.0)
    output = ethan(x)
    # this line goes through PyTorch's built-in __call__(), which does a number of things,
    # one of which is calling forward()
    print(output)
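A small sketch of the built-in state_dict() and parameters() mentioned above (using a hypothetical Demo module containing one Linear layer, since the Ethan module above has no parameters of its own):

import torch
from torch import nn

class Demo(nn.Module):  # hypothetical module, only used to inspect parameters
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(2, 3)

demo = Demo()
print(demo.state_dict().keys())          # odict_keys(['fc.weight', 'fc.bias'])
for name, p in demo.named_parameters():  # parameters() yields the same tensors, just without names
    print(name, p.shape)                 # fc.weight torch.Size([3, 2]) / fc.bias torch.Size([3])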


The eval() and train() methods that come with torch.nn.Module

torch.nn.Module.eval
Sets the module in evaluation mode.
This has any effect only on certain modules. See documentations of particular modules for details of their behaviors in training/evaluation mode, if they are affected, e.g. Dropout, BatchNorm, etc.
This is equivalent with self.train(False).

torch.nn.Module.train
Sets the module in training mode.
This has any effect only on certain modules. See documentations of particular modules for details of their behaviors in training/evaluation mode, if they are affected, e.g. Dropout, BatchNorm, etc.
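A minimal sketch of the difference, using Dropout (one of the modules whose behavior depends on the mode):

import torch
from torch import nn

m = nn.Dropout(p=0.5)
x = torch.ones(1, 6)
m.train()    # training mode: roughly half the elements are zeroed, the rest are scaled by 2
print(m(x))
m.eval()     # evaluation mode: Dropout does nothing
print(m(x))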


2. torch.nn.functional.conv2d()

torch.nn.functional.conv2d(input, weight, bias=None, stride=1, padding=0, dilation=1, groups=1) → Tensor

Parameters
input – input tensor of shape (minibatch, in_channels, iH, iW)

weight – filters of shape (out_channels, in_channels/groups, kH, kW)

bias – optional bias tensor of shape (out_channels). Default: None

stride – the stride of the convolving kernel. Can be a single number or a tuple (sH, sW). Default: 1
The stride is the distance the kernel moves in one step.

padding – implicit paddings on both sides of the input. Can be a string {'valid', 'same'}, a single number or a tuple (padH, padW). Default: 0. padding='valid' is the same as no padding. padding='same' pads the input so the output has the same shape as the input. However, this mode doesn't support any stride values other than 1.
padding adds a border on all four sides of the tensor, filled with 0 by default.

WARNING: For padding='same', if the weight is even-length and dilation is odd in any dimension, a full pad() operation may be needed internally, lowering performance.

dilation – the spacing between kernel elements. Can be a single number or a tuple (dH, dW). Default: 1

groups – split input into groups; in_channels should be divisible by the number of groups. Default: 1

import torch
from torch import nn

input = torch.tensor([[1, 2, 0, 3, 1],
                      [0, 1, 2, 3, 1],
                      [1, 2, 1, 0, 0],
                      [5, 2, 3, 1, 1],
                      [2, 1, 0, 1, 1]])
kernel = torch.tensor([[1, 2, 1],
                       [0, 1, 0],
                       [2, 1, 0]])

# conv2d expects 4-D tensors of shape (minibatch, channels, H, W)
input = torch.reshape(input, (1, 1, 5, 5))
kernel = torch.reshape(kernel, (1, 1, 3, 3))

output1 = nn.functional.conv2d(input, kernel, stride=1)
print(output1)

output2 = nn.functional.conv2d(input, kernel, stride=2)
print(output2)

output3 = nn.functional.conv2d(input, kernel, stride=1, padding=1)
print(output3)
# tensor([[[[10, 12, 12],
#           [18, 16, 16],
#           [13,  9,  3]]]])
# tensor([[[[10, 12],
#           [13,  3]]]])
# tensor([[[[ 1,  3,  4, 10,  8],
#           [ 5, 10, 12, 12,  6],
#           [ 7, 18, 16, 16,  8],
#           [11, 13,  9,  3,  4],
#           [14, 13,  9,  7,  4]]]])



3. torch.nn.Conv2d() and torch.nn.ConvTranspose2d()

1. torch.nn.Conv2d()

torch.nn.Conv2d(in_channels, out_channels, kernel_size, stride=1, padding=0, 
				dilation=1, groups=1, bias=True, padding_mode='zeros', 
				device=None, dtype=None)

Parameters:
in_channels (int) – Number of channels in the input image

out_channels (int) – Number of channels produced by the convolution
When out_channels > 1, a different randomly initialized convolution kernel is generated for each output channel and then trained.

kernel_size (int or tuple) – Size of the convolving kernel

stride (int or tuple, optional) – Stride of the convolution. Default: 1

padding (int, tuple or str, optional) – Padding added to all four sides of the input. Default: 0

padding_mode (string, optional) – ‘zeros’, ‘reflect’, ‘replicate’ or ‘circular’. Default: ‘zeros’

dilation (int or tuple, optional) – Spacing between kernel elements. Default: 1
Note that dilation defaults to 1.

A kernel with kernel_size 3 and dilation 2 effectively covers a 5×5 area (this is what the usual dilation demonstration figure shows), so the value of dilation affects H_out and W_out.

groups (int, optional) – Number of blocked connections from input channels to output channels. Default: 1

bias (bool, optional) – If True, adds a learnable bias to the output. Default: True

The formulas for H_out and W_out are as follows. When padding, dilation and kernel_size are given as a single int, padding[0] = padding[1], dilation[0] = dilation[1] and kernel_size[0] = kernel_size[1] all take that int value.

$H_{out} = \left\lfloor \frac{H_{in} + 2 \times padding[0] - dilation[0] \times (kernel\_size[0] - 1) - 1}{stride[0]} + 1 \right\rfloor$

$W_{out} = \left\lfloor \frac{W_{in} + 2 \times padding[1] - dilation[1] \times (kernel\_size[1] - 1) - 1}{stride[1]} + 1 \right\rfloor$
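A quick sanity check of the formula (a small sketch; the shapes below are chosen only for illustration):

import torch
from torch import nn

# H_out = floor((32 + 2*0 - 2*(3-1) - 1)/1 + 1) = 28 for the settings below
x = torch.randn(1, 3, 32, 32)
conv = nn.Conv2d(in_channels=3, out_channels=6, kernel_size=3, stride=1, padding=0, dilation=2)
print(conv(x).shape)  # torch.Size([1, 6, 28, 28])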


This link shows the roles of the stride, padding and dilation parameters very intuitively.


import torch
from torch import nn
from torch.utils.data import DataLoader
from torch.utils.tensorboard import SummaryWriter
from torchvision import datasets, transforms

dataset = datasets.CIFAR10("./数据集", train=False, download=True,
                           transform=transforms.ToTensor())
dataLoader = DataLoader(dataset, batch_size=64, drop_last=True)

class Ethan(nn.Module):
    def __init__(self):
        super(Ethan, self).__init__()
        self.conv = nn.Conv2d(in_channels=3, out_channels=6, kernel_size=3,
                              stride=1, padding=0)

    def forward(self, input):
        output = self.conv(input)
        return output

ethan = Ethan()
step = 0
writer = SummaryWriter("./log1")
for data in dataLoader:
    imgs, targets = data
    output = ethan(imgs)
    print(imgs.shape)    # torch.Size([64, 3, 32, 32])
    print(output.shape)  # torch.Size([64, 6, 30, 30])
    # [64, 6, 30, 30] follows from the Conv2d arguments kernel_size=3, stride=1, padding=0
    writer.add_images("input", imgs, step)
    output = torch.reshape(output, (-1, 3, 30, 30))  # add_images expects 3 channels
    print(output.shape)  # torch.Size([128, 3, 30, 30])
    writer.add_images("output", output, step)
    step += 1

The result differs from run to run, because nn.Conv2d() randomly initializes a new convolution kernel each time it is constructed.
In this simple model we never train the parameters of the randomly initialized kernel.


2. torch.nn.ConvTranspose2d()

torch.nn.ConvTranspose2d(in_channels, out_channels, kernel_size, 
	stride=1, padding=0, output_padding=0, groups=1, bias=True, 
	dilation=1, padding_mode='zeros', device=None, dtype=None)

Parameters:
in_channels (int) – Number of channels in the input image

out_channels (int) – Number of channels produced by the convolution

kernel_size (int or tuple) – Size of the convolving kernel

stride (int or tuple, optional) – Stride of the convolution. Default: 1

padding (int or tuple, optional)– dilation * (kernel_size - 1) - padding zero-padding will be added to both sides of each dimension in the input. Default: 0

output_padding (int or tuple, optional) – Additional size added to one side of each dimension in the output shape. Default: 0

groups (int, optional) – Number of blocked connections from input channels to output channels. Default: 1

bias (bool, optional) – If True, adds a learnable bias to the output. Default: True

dilation (int or tuple, optional) – Spacing between kernel elements. Default: 1


The formulas for H_out and W_out are as follows:

$H_{out} = (H_{in} - 1) \times stride[0] - 2 \times padding[0] + dilation[0] \times (kernel\_size[0] - 1) + output\_padding[0] + 1$

$W_{out} = (W_{in} - 1) \times stride[1] - 2 \times padding[1] + dilation[1] \times (kernel\_size[1] - 1) + output\_padding[1] + 1$


The operation performed by a transposed convolution:
① transform the input feature map into a new, larger feature map (by inserting and padding zeros);
② convolve that new feature map with a (randomly initialized) kernel of the given size.
The key is how the new feature map is obtained, and there are two cases, stride = 1 and stride > 1. When stride > 1, (stride − 1) rows of zeros are inserted between adjacent rows of the original feature map and (stride − 1) columns of zeros between adjacent columns; when stride = 1 nothing is inserted between elements. The border is then zero-padded (for a 3×3 kernel with padding=0, for example, two rings of zeros are added around the input), and an ordinary convolution is performed on the resulting feature map.

(Animations: no_padding_no_strides_transposed, padding_strides_transposed.)

import torch
from torch import nn

input = torch.rand(1, 2, 3, 4)
model1 = nn.ConvTranspose2d(in_channels=2, out_channels=4, kernel_size=4, stride=1, padding=0)
model2 = nn.ConvTranspose2d(in_channels=2, out_channels=4, kernel_size=4, stride=1, padding=1)
output1 = model1(input)
output2 = model2(input)
print(output1.shape)  # torch.Size([1, 4, 6, 7])
print(output2.shape)  # torch.Size([1, 4, 4, 5])
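Checking these shapes against the formulas above: for output1, H_out = (3 − 1)×1 − 2×0 + 1×(4 − 1) + 0 + 1 = 6 and W_out = (4 − 1)×1 − 2×0 + 1×(4 − 1) + 0 + 1 = 7, i.e. [1, 4, 6, 7]; with padding=1 both spatial sizes shrink by 2, giving [1, 4, 4, 5].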


There are also upsampling methods:
1.
CLASS torch.nn.Upsample(size=None, scale_factor=None, mode='nearest', align_corners=None, recompute_scale_factor=None)

Parameters:
align_corners (bool, optional) – if True, the corner pixels of the input and output tensors are aligned, and thus preserving the values at those pixels. This only has effect when mode is ‘linear’, ‘bilinear’, ‘bicubic’, or ‘trilinear’. Default: False


The difference between mode='nearest' and mode='bilinear' is easy to understand by looking at the example below.

>>> input = torch.arange(1, 5, dtype=torch.float32).view(1, 1, 2, 2)
>>> input
tensor([[[[1., 2.],
          [3., 4.]]]])

>>> m = nn.Upsample(scale_factor=2, mode='nearest')
>>> m(input)
tensor([[[[1., 1., 2., 2.],
          [1., 1., 2., 2.],
          [3., 3., 4., 4.],
          [3., 3., 4., 4.]]]])

>>> m = nn.Upsample(scale_factor=2, mode='bilinear')  # align_corners=False
>>> m(input)
tensor([[[[1.0000, 1.2500, 1.7500, 2.0000],
          [1.5000, 1.7500, 2.2500, 2.5000],
          [2.5000, 2.7500, 3.2500, 3.5000],
          [3.0000, 3.2500, 3.7500, 4.0000]]]])

>>> m = nn.Upsample(scale_factor=2, mode='bilinear', align_corners=True)
>>> m(input)
tensor([[[[1.0000, 1.3333, 1.6667, 2.0000],
          [1.6667, 2.0000, 2.3333, 2.6667],
          [2.3333, 2.6667, 3.0000, 3.3333],
          [3.0000, 3.3333, 3.6667, 4.0000]]]])

>>> # Try scaling the same data in a larger tensor
>>> input_3x3 = torch.zeros(3, 3).view(1, 1, 3, 3)
>>> input_3x3[:, :, :2, :2].copy_(input)
tensor([[[[1., 2.],
          [3., 4.]]]])
>>> input_3x3
tensor([[[[1., 2., 0.],
          [3., 4., 0.],
          [0., 0., 0.]]]])

>>> m = nn.Upsample(scale_factor=2, mode='bilinear')  # align_corners=False
>>> # Notice that values in top left corner are the same with the small input (except at boundary)
>>> m(input_3x3)
tensor([[[[1.0000, 1.2500, 1.7500, 1.5000, 0.5000, 0.0000],
          [1.5000, 1.7500, 2.2500, 1.8750, 0.6250, 0.0000],
          [2.5000, 2.7500, 3.2500, 2.6250, 0.8750, 0.0000],
          [2.2500, 2.4375, 2.8125, 2.2500, 0.7500, 0.0000],
          [0.7500, 0.8125, 0.9375, 0.7500, 0.2500, 0.0000],
          [0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000]]]])

>>> m = nn.Upsample(scale_factor=2, mode='bilinear', align_corners=True)
>>> # Notice that values in top left corner are now changed
>>> m(input_3x3)
tensor([[[[1.0000, 1.4000, 1.8000, 1.6000, 0.8000, 0.0000],
          [1.8000, 2.2000, 2.6000, 2.2400, 1.1200, 0.0000],
          [2.6000, 3.0000, 3.4000, 2.8800, 1.4400, 0.0000],
          [2.4000, 2.7200, 3.0400, 2.5600, 1.2800, 0.0000],
          [1.2000, 1.3600, 1.5200, 1.2800, 0.6400, 0.0000],
          [0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000]]]])


2.
torch.nn.functional.interpolate(input, size=None, scale_factor=None, mode='nearest', align_corners=None, recompute_scale_factor=None, antialias=False)
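nn.Upsample is essentially a module wrapper around this function; a minimal sketch reproducing the mode='nearest' example above:

import torch
import torch.nn.functional as F

x = torch.arange(1, 5, dtype=torch.float32).view(1, 1, 2, 2)
print(F.interpolate(x, scale_factor=2, mode='nearest'))  # same result as nn.Upsample(scale_factor=2, mode='nearest')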


4. torch.nn.MaxPool2d()

torch.nn.MaxPool2d(kernel_size, stride=None, padding=0, dilation=1, 
					return_indices=False, ceil_mode=False)

kernel_size – the size of the window to take a max over
stride – the stride of the window. If not set, the default value is equal to kernel_size.

ceil_mode – when True, will use ceil instead of floor to compute the output shape.
In other words, ceil_mode decides what happens when the pooling window partially extends beyond the input image: with ceil_mode=True that window still produces an output value, with ceil_mode=False it is dropped.


The formulas for H_out and W_out are the same as for Conv2d (with ceil used instead of floor when ceil_mode=True):

$H_{out} = \left\lfloor \frac{H_{in} + 2 \times padding[0] - dilation[0] \times (kernel\_size[0] - 1) - 1}{stride[0]} + 1 \right\rfloor$

$W_{out} = \left\lfloor \frac{W_{in} + 2 \times padding[1] - dilation[1] \times (kernel\_size[1] - 1) - 1}{stride[1]} + 1 \right\rfloor$

import torch
from torch import nn

class Ethan(nn.Module):
    def __init__(self):
        super(Ethan, self).__init__()
        self.maxpool = nn.MaxPool2d(kernel_size=3, ceil_mode=False)

    def forward(self, input):
        output = self.maxpool(input)
        return output

input = torch.tensor([[1, 2, 0, 3, 1],
                      [0, 1, 2, 3, 1],
                      [1, 2, 1, 0, 0],
                      [5, 2, 3, 1, 1],
                      [2, 1, 0, 1, 1]], dtype=torch.float32)  # the pooling kernel cannot handle Long tensors

if torch.cuda.is_available():
    input = input.to("cuda")
input = torch.reshape(input, (-1, 1, 5, 5))
ethan = Ethan()
output = ethan(input)
print(output)
# tensor([[[[2.]]]], device='cuda:0')
# output with ceil_mode=True:
# tensor([[[[2., 3.],
#           [5., 1.]]]], device='cuda:0')

The point of max pooling: it extracts the salient features of an image while shrinking the feature maps, which greatly reduces the amount of computation needed to train the network.


5. BatchNorm2d() and LayerNorm, InstanceNorm, GroupNorm (none of them change the shape)

BatchNorm2d is usually added right after a convolutional layer to normalize the data, so that overly large activations do not make training unstable before the ReLU is applied.

torch.nn.BatchNorm2d(num_features, eps=1e-05, momentum=0.1, affine=True,
					 track_running_stats=True, device=None, dtype=None)

Parameters:
num_features – C from an expected input of size (N, C, H, W); that is, set it to the number of channels of the input

eps – a value added to the denominator for numerical stability. Default: 1e-5

momentum – the value used for the running_mean and running_var computation. Can be set to None for cumulative moving average (i.e. simple average). Default: 0.1

affine– a boolean value that when set to True, this module has learnable affine parameters. Default: True
(Setting affine=True means the module has its own learnable weight (γ) and bias (β) parameters.)

track_running_stats – a boolean value that when set to True, this module tracks the running mean and variance, and when set to False, this module does not track such statistics, and initializes statistics buffers running_mean and running_var as None. When these buffers are None, this module always uses batch statistics. in both training and eval modes. Default: True

The mathematical principle of BatchNorm2d() is:

$y = \frac{x - \mathrm{E}[x]}{\sqrt{\mathrm{Var}[x] + \epsilon}} \times \gamma + \beta$

Each channel has its own weight (γ in the formula) and bias (β in the formula) parameters, as well as its own statistics E[x] and Var[x], computed over the (N, H, W) dimensions.

See this blog for the BatchNorm2d() instance operation process.
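A small sketch verifying the formula above in training mode (with the freshly initialized γ = 1 and β = 0; note that the normalization uses the biased batch variance):

import torch
from torch import nn

x = torch.randn(4, 3, 8, 8)
bn = nn.BatchNorm2d(3)   # gamma is initialized to 1, beta to 0
bn.train()
y = bn(x)

# manual normalization over the (N, H, W) dimensions of each channel
mean = x.mean(dim=(0, 2, 3), keepdim=True)
var = x.var(dim=(0, 2, 3), unbiased=False, keepdim=True)
y_manual = (x - mean) / torch.sqrt(var + bn.eps)
print(torch.allclose(y, y_manual, atol=1e-6))  # True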

    # a helper method that builds a Conv2d block (BatchNorm2d commented out here) followed by LeakyReLU
    def _block(self, in_channels, out_channels, kernel_size, stride, padding):
        return nn.Sequential(
            nn.Conv2d(
                in_channels,
                out_channels,
                kernel_size,
                stride,
                padding,
                bias=False,
            ),
            # nn.BatchNorm2d(out_channels),
            nn.LeakyReLU(0.2),
        )


Comparison blog about BN, LN, IN, GN


Official LayerNorm documentation
CLASS torch.nn.LayerNorm(normalized_shape, eps=1e-05, elementwise_affine=True, device=None, dtype=None)


Official GroupNorm documentation
CLASS torch.nn.GroupNorm(num_groups, num_channels, eps=1e-05, affine=True, device=None, dtype=None)


Official InstanceNorm2d documentation
CLASS torch.nn.InstanceNorm2d(num_features, eps=1e-05, momentum=0.1, affine=False, track_running_stats=False, device=None, dtype=None)
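A quick sketch showing that none of these normalization layers change the tensor shape (the shapes are chosen only for illustration):

import torch
from torch import nn

x = torch.randn(2, 6, 4, 4)
print(nn.BatchNorm2d(6)(x).shape)                           # torch.Size([2, 6, 4, 4])
print(nn.LayerNorm([6, 4, 4])(x).shape)                     # torch.Size([2, 6, 4, 4])
print(nn.GroupNorm(num_groups=3, num_channels=6)(x).shape)  # torch.Size([2, 6, 4, 4])
print(nn.InstanceNorm2d(6)(x).shape)                        # torch.Size([2, 6, 4, 4])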




6. Nonlinear activation functions

nn.ReLU(inplace=False)
inplace – can optionally do the operation in-place, i.e. it decides whether the original tensor is overwritten directly. Default: False
There are also other nonlinear activation functions such as Sigmoid, Tanh and LeakyReLU. Their purpose is to give the network more nonlinear expressive power.

torch.nn.LeakyReLU(negative_slope=0.01, inplace=False)
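A minimal sketch of how ReLU and LeakyReLU treat negative values:

import torch
from torch import nn

x = torch.tensor([-2.0, -0.5, 0.0, 1.5])
print(nn.ReLU()(x))           # tensor([0.0000, 0.0000, 0.0000, 1.5000])
print(nn.LeakyReLU(0.01)(x))  # tensor([-0.0200, -0.0050, 0.0000, 1.5000])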


import torch
from torch import nn
from torch.utils.data import DataLoader
from torch.utils.tensorboard import SummaryWriter
from torchvision import datasets, transforms

class Ethan(nn.Module):
    def __init__(self):
        super(Ethan, self).__init__()
        self.sigmoid = nn.Sigmoid()

    def forward(self, input):
        output = self.sigmoid(input)
        return output

dataset = datasets.CIFAR10("./数据集", train=False, download=True, transform=transforms.ToTensor())
dataLoader = DataLoader(dataset, batch_size=64)
ethan = Ethan()

writer = SummaryWriter("./log2")
step = 0
for data in dataLoader:
    imgs, target = data
    writer.add_images("input", imgs, step)
    output = ethan(imgs)
    writer.add_images("output", output, step)
    step += 1
writer.close()


7. nn.Linear()

nn.Linear(in_features, out_features, bias=True, 
		  device=None, dtype=None)
bias controls whether the b term in the formula below is present.

Applies a linear transformation to the incoming data: $y = xA^T + b$

Note that Linear only changes the last dimension of the input.
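A small sketch of the "only the last dimension changes" behavior:

import torch
from torch import nn

m = nn.Linear(4, 5)
x = torch.randn(2, 3, 4)
print(m(x).shape)  # torch.Size([2, 3, 5]); the leading dimensions are untouched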


import torch
from torch import nn

m = nn.Linear(2, 3)
input = torch.tensor([[-0.6547, -1.5076], [-1.9709, -2.0016]])
output = m(input)
print(output.size())
print(output)

If the code above is run several times, print(output) gives a different result each time, because the Linear layer's weight and bias are randomly initialized; when the model is trained, these parameters of the Linear layer are trained as well.


'''
An aside on torch.flatten:
torch.flatten(input, start_dim=0, end_dim=-1)
Parameters:
    input (Tensor) – the input tensor.
    start_dim (int) – the first dim to flatten
    end_dim (int) – the last dim to flatten
'''
t = torch.tensor([[[1, 2],
                   [3, 4]],
                  [[5, 6],
                   [7, 8]]])
torch.flatten(t)
#tensor([1, 2, 3, 4, 5, 6, 7, 8])
torch.flatten(t, start_dim=1)
#tensor([[1, 2, 3, 4],
#        [5, 6, 7, 8]])
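The module form nn.Flatten() (used inside the Sequential model in section 10) defaults to start_dim=1, so it keeps the batch dimension; a small sketch:

import torch
from torch import nn

x = torch.randn(64, 64, 4, 4)
print(nn.Flatten()(x).shape)  # torch.Size([64, 1024]); only dims 1..-1 are flattened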


8. nn.Sequential (see section 10 for a concrete example)

torch.nn.Sequential(*args)
The two ways of constructing a Sequential shown below have the same effect; the OrderedDict form simply lets you give each operation inside a name.

from collections import OrderedDict
import torch.nn as nn

model = nn.Sequential(
          nn.Conv2d(1, 20, 5),
          nn.ReLU(),
          nn.Conv2d(20, 64, 5),
          nn.ReLU()
        )

# Using Sequential with OrderedDict. This is functionally the
# same as the above code
model = nn.Sequential(OrderedDict([
          ('conv1', nn.Conv2d(1, 20, 5)),
          ('relu1', nn.ReLU()),
          ('conv2', nn.Conv2d(20, 64, 5)),
          ('relu2', nn.ReLU())
        ]))


9. nn.Dropout

torch.nn.Dropout(p=0.5, inplace=False)
Parameters
p – probability of an element to be zeroed. Default: 0.5
inplace – If set to True, will do this operation in-place. Default: False

Shape:
Input: (*). Input can be of any shape.
Output: (*). Output is of the same shape as the input.

Furthermore, the outputs are scaled by a factor of $\frac{1}{1-p}$ during training. This means that during evaluation the module simply computes an identity function. (In other words, the elements that are not zeroed are multiplied by $\frac{1}{1-p}$.)

import torch
from torch import nn

m = nn.Dropout(p=0.5)
input = torch.randn(3, 4)
print(input)
output = m(input)
print(output)
'''
tensor([[-0.6425, -0.2633, -0.6924, -1.8469],
        [ 1.0353, -1.3861,  1.1678,  1.1759],
        [ 0.9972, -0.5695, -0.1986,  0.2483]])
tensor([[-1.2851, -0.0000, -0.0000, -0.0000],
        [ 2.0705, -2.7722,  2.3356,  0.0000],
        [ 1.9943, -1.1390, -0.0000,  0.0000]])
The elements that were not zeroed were all multiplied by 1/(1-0.5) = 2.
'''
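As noted above, in evaluation mode Dropout is simply the identity; a minimal sketch:

import torch
from torch import nn

m = nn.Dropout(p=0.5)
m.eval()
x = torch.randn(3, 4)
print(torch.equal(m(x), x))  # True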


10. The optimizer torch.optim and backward()

Every optimizer basically takes two arguments: params and the learning rate lr.
Take the CIFAR-10 classification model below as an example.

device = "cuda" if torch.cuda.is_available() else "cpu"
dataset=datasets.CIFAR10("./数据集",train=False,download=True,transform=transforms.ToTensor())
dataLoader=DataLoader(dataset,batch_size=1,drop_last=True)
class Ethan(nn.Module):
    def __init__(self):
        super(Ethan,self).__init__()
        self.model=nn.Sequential(
            nn.Conv2d(3,32,5,padding=2),
            nn.MaxPool2d(2),
            nn.Conv2d(32,32,5,padding=2),
            nn.MaxPool2d(2),
            nn.Conv2d(32,64,5,padding=2),
            nn.MaxPool2d(2),
            nn.Flatten(),
            nn.Linear(1024,64),
            nn.Linear(64,10)
        )

    def forward(self,input):
        output=self.model(input)
        return output
loss=nn.CrossEntropyLoss()
if torch.cuda.is_available():
	loss = loss.to("cuda")
ethan=Ethan().to(device)
optim=torch.optim.SGD(ethan.parameters(),lr=0.01)
for epoch in range(20):
    running_loss=0.0
    for data in dataLoader:
        img,target=data
        img=img.to(device)
        '''
        一定要有这步处理,否则会报如下错
        Input type (torch.FloatTensor) and weight type (torch.cuda.FloatTensor) should be the same
        '''
        output=ethan(img)
        if torch.cuda.is_available():
            output = output.to("cuda")
        if torch.cuda.is_available():
            target = target.to("cuda")
        result_loss=loss(output,target)
        
        optim.zero_grad()#有些调用的是model.zero_grad()
        result_loss.backward()#损失函数调用backward()把ethan模型的梯度grad参数计算出来
        optim.step()#优化器模型调用step(),对权重或特征的值进行更新
	#以随机梯度下降SGD为例:学习率lr(learning rate)来控制步幅,即:x=x-lr *x.grad
        running_loss+=result_loss
    print(running_loss)


Analysis of backward(retain_graph=True):
By default, every call to backward() frees the entire computation graph. Normally each iteration needs only one forward() and one backward(); the forward pass forward() and the backward pass backward() come in pairs, and one backward() is enough.

However, with a complicated custom loss you may need one forward() followed by several backward() calls on different losses, accumulating grad on the same network before the parameters are updated. In that case, if after the current backward() you want to call backward() again without running forward() first, the current backward() must keep the computation graph by passing retain_graph=True.
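A minimal sketch of this behavior (the tensor and values here are only for illustration):

import torch

x = torch.tensor([2.0], requires_grad=True)
y = x * x
y.backward(retain_graph=True)  # keep the graph so backward() can be called again
y.backward()                   # a second backward() on the same graph; without retain_graph above this would raise an error
print(x.grad)                  # tensor([8.]); the two gradients 4 + 4 accumulate in x.grad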


The difference between model.zero_grad() and optimizer.zero_grad():
model.zero_grad() sets the gradients of all of the model's parameters to zero. Its source code is essentially:

for p in self.parameters():
    if p.grad is not None:
        p.grad.detach_()
        p.grad.zero_()

optimizer.zero_grad() clears the gradients of all the parameters that were registered with the optimizer (everything in its param_groups). Its source code is essentially:

for group in self.param_groups:
    for p in group['params']:
        if p.grad is not None:
            p.grad.detach_()
            p.grad.zero_()


11. Using the GPU

To run code on the GPU, call .to("cuda") on the network model, the data, and the loss function.

# Method 1
device = "cuda" if torch.cuda.is_available() else "cpu"
xx = xx.to(device)
# Method 2
if torch.cuda.is_available():
    xx = xx.to("cuda")
