Getting started with PyTorch

torch — PyTorch 1.12 documentation: https://pytorch.org/docs/stable/torch.html

PyTorch is an optimized tensor library for deep learning using GPUs and CPUs.

PyTorch-based libraries: torchvision, torchaudio, pytorch-fast-transformers

1. PyTorch

1. Data reading

Combine the Dataset class with a DataLoader to obtain a data iterator. During training or prediction, the iterator produces the data required for each batch and applies the corresponding preprocessing and data augmentation.

Dataset class

By inheriting from the Dataset class you can customize the format, size and other attributes of a dataset so that it can later be used directly by the DataLoader class. Whether you use a custom dataset or an officially packaged one, it is in essence a subclass of Dataset, and when inheriting from Dataset at least the following methods must be overridden:

  • __init__(): constructor; defines how the data is read and performs any preprocessing
  • __len__(): returns the size of the dataset
  • __getitem__(): returns the data item at a given index
import torch
from torch.utils.data import Dataset, DataLoader
 
class MyDataset(Dataset):
    # constructor
    def __init__(self, data_tensor, target_tensor):
        self.data_tensor = data_tensor
        self.target_tensor = target_tensor
 
    # return the size of the dataset
    def __len__(self):
        return self.data_tensor.size(0)
 
    # return the data and label at the given index
    def __getitem__(self, index):
        return self.data_tensor[index], self.target_tensor[index]
 

DataLoader class

During training it is usually impossible to load all the data into memory at once, and a single process cannot load it efficiently either, so multi-process, iterative loading is needed; DataLoader is designed for exactly this. DataLoader is an iterator: in its most basic usage you pass in a Dataset object, and it generates batches according to the batch_size parameter, saving memory and supporting multi-process loading, data shuffling and other processing. Its main parameters are:

  • dataset: Dataset type, the input dataset (required)
  • batch_size: int, the number of samples in each batch
  • shuffle: bool, whether to reshuffle the data at the beginning of each epoch
  • num_workers: int, the number of worker processes used to load data; 0 means the data is loaded in the main process
def main():
    # generate data
    data_tensor = torch.randn(10, 3)
    target_tensor = torch.randint(2, (10,))  # labels are 0 or 1
 
    # wrap the data in a Dataset
    my_dataset = MyDataset(data_tensor, target_tensor)
    tensor_dataloader = DataLoader(dataset=my_dataset, batch_size=2, shuffle=True, num_workers=0)
    for data, target in tensor_dataloader:
        print(data, target)
    print('One batch of tensor data: ', next(iter(tensor_dataloader)))
 
main()

2. Layers

  • torch.nn.Embedding(num_embeddings, embedding_dim): num_embeddings is the size of the embedding dictionary, embedding_dim is the size of each embedding vector
  • torch.nn.LSTM(input_size, hidden_size, num_layers=1, bias=True, batch_first=False, dropout=0): input is x, (h_0, c_0); output is y, (h_n, c_n)
  • nn.CrossEntropyLoss(weight=None, size_average=None, ignore_index=-100, reduce=None, reduction='mean', label_smoothing=0.0)
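
As a quick illustration of how these layers fit together, here is a minimal sketch; the vocabulary size, dimensions, sequence length and batch size are made-up example values, and batch_first=True is used for convenience:

import torch
import torch.nn as nn

vocab_size, embed_dim, hidden_dim, num_classes = 1000, 64, 128, 2  # hypothetical sizes

embedding = nn.Embedding(vocab_size, embed_dim)            # maps token ids (N, L) to vectors (N, L, embed_dim)
lstm = nn.LSTM(embed_dim, hidden_dim, num_layers=1, batch_first=True)
classifier = nn.Linear(hidden_dim, num_classes)
criterion = nn.CrossEntropyLoss()

tokens = torch.randint(vocab_size, (4, 20))                # a batch of 4 sequences of length 20
labels = torch.randint(num_classes, (4,))

x = embedding(tokens)                                      # (4, 20, 64)
y, (h_n, c_n) = lstm(x)                                    # y: (4, 20, 128), h_n: (1, 4, 128)
logits = classifier(h_n[-1])                               # classify from the last hidden state
loss = criterion(logits, labels)
print(loss.item())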

3. Utility functions

  • torch.nn.utils.clip_grad_norm_: addresses only the exploding-gradient problem, not the vanishing-gradient problem
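
A typical usage pattern, sketched below, is to clip the total gradient norm right between loss.backward() and optimizer.step(); the toy model, data and the max_norm=1.0 threshold are arbitrary example values:

import torch

model = torch.nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
x, y = torch.randn(8, 10), torch.randn(8, 1)

optimizer.zero_grad()
loss = torch.nn.functional.mse_loss(model(x), y)
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # rescale gradients if their total norm exceeds 1.0
optimizer.step()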

4. Frequently Asked Questions

# loss.backward() and optimizer.step()

  • loss.backward(): backpropagates the loss from the output to the inputs. This step computes the gradient d(loss)/dx for every parameter x and accumulates it into x.grad, i.e. x.grad = x.grad_prev + d(loss)/dx, where x.grad_prev is the gradient accumulated in previous steps.
  • optimizer.step(): uses the optimizer to update the parameters x. Taking stochastic gradient descent (SGD) as an example, the update rule is x = x - lr * x.grad, where lr is the learning rate and the minus sign means moving in the direction opposite to the gradient.
  • optimizer.zero_grad(): clears the accumulated gradient x.grad of every parameter x handled by the optimizer; it is usually called before loss.backward(), i.e. it clears x.grad_prev.
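
Putting the three calls together, a standard training step looks like the following sketch (the toy model, loss and data are placeholders for the example):

import torch
import torch.nn as nn

model = nn.Linear(3, 2)                      # toy model, arbitrary sizes
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
criterion = nn.CrossEntropyLoss()

data = torch.randn(4, 3)
target = torch.randint(2, (4,))

optimizer.zero_grad()                        # clear the previously accumulated x.grad
loss = criterion(model(data), target)
loss.backward()                              # accumulate d(loss)/dx into x.grad for every parameter x
optimizer.step()                             # SGD update: x = x - lr * x.grad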

# model.zero_grad() vs optimizer.zero_grad()

  • model.zero_grad() sets the gradients of all of the model's parameters to 0.
  • optimizer.zero_grad() clears the gradients of all the torch.Tensor parameters handed to the optimizer.
  • When the optimizer is created with optimizer = optim.Optimizer(net.parameters()), the optimizer's param_groups contain exactly the model's parameters(), so the two calls are equivalent; this can also be seen from their source code.
  • When several models share one optimizer, the two are not equivalent, and the way gradients are cleared has to be chosen according to the situation.
  • Likewise, when one model uses several optimizers, the two are not equivalent and the gradient-clearing strategy has to be chosen accordingly.
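
The single-model / single-optimizer case can be checked with a small sketch like the one below (a toy setup, not an exhaustive comparison):

import torch
import torch.nn as nn

net = nn.Linear(3, 1)
optimizer = torch.optim.SGD(net.parameters(), lr=0.1)

net(torch.randn(5, 3)).sum().backward()      # populate the gradients

net.zero_grad()                              # here this is equivalent to optimizer.zero_grad(),
                                             # because the optimizer holds exactly net.parameters()
print(all(p.grad is None or torch.all(p.grad == 0) for p in net.parameters()))  # True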

# detach() and .data

detach() returns a new tensor that is detached from the current computation graph but still points to the storage of the original tensor. Its grad_fn is None and requires_grad is False, so no gradient is computed for it and it has no .grad; even if requires_grad is later set back to True, gradients computed through it will not flow back into the original graph.

  • Note: the returned tensor shares memory with the original tensor, so in-place modifications are visible on both tensors at once (they share the same storage), and such modifications may cause errors when backward() is later called on the graph.
  • The .data attribute behaves like detach(), but it bypasses the in-place correctness check and is therefore not safe.
  • In-place correctness check: every tensor records the in-place operations applied to it. If PyTorch detects that a tensor was saved for backward by some Function and was then modified by an in-place operation, it raises an error during backward(). This mechanism guarantees that if you use in-place operations and no error is raised during backward, the computed gradients are correct.
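
A small sketch of the behaviour described above: the detached tensor shares storage with the original, and an in-place edit to data that autograd has saved for backward trips the correctness check:

import torch

a = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
b = a.detach()               # grad_fn=None, requires_grad=False, same storage as a
b[0] = 10.0                  # the in-place edit is visible through a as well
print(a)                     # tensor([10.,  2.,  3.], requires_grad=True)

out = a.sigmoid()            # sigmoid saves its output for the backward pass
out.detach().zero_()         # corrupt the saved output through a detached view
try:
    out.sum().backward()
except RuntimeError as e:
    print(e)                 # in-place correctness check: a needed variable was modified in place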

# .item()

Returns the value stored in the tensor as a standard Python number (float or int). It can only return a single value (a scalar), not a vector, so it can only be used on tensors that contain exactly one element.
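
For example:

import torch

t = torch.tensor([3])
print(t.item())              # 3, a plain Python int

v = torch.tensor([1, 2])
# v.item()                   # would raise: only one element tensors can be converted to Python scalars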

# torch.load(parameter file, map_location='cpu')

If a model was saved while training on a GPU and is then loaded on a CPU-only machine, an error may be raised. In that case map_location is needed to dynamically remap the storages to an available device.
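
A round-trip sketch (the file name model.pth is just an example):

import torch
import torch.nn as nn

model = nn.Linear(3, 1)
torch.save(model.state_dict(), 'model.pth')                # imagine this was saved on a GPU machine

state_dict = torch.load('model.pth', map_location='cpu')   # remap any CUDA storages to the CPU
model.load_state_dict(state_dict)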

# @torch.no_grad() and model.eval()

  • model.eval() puts the model in evaluation mode (disables dropout and makes batch norm use its running statistics). It does not affect gradient computation: autograd still builds the graph and stores gradients, so the model's weights can still be updated afterwards.
  • torch.no_grad() turns off the autograd engine, so gradients are neither computed nor stored; this speeds up computation and saves GPU memory.
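
The two are therefore usually combined for inference, for example:

import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(3, 8), nn.Dropout(0.5), nn.Linear(8, 2))
data = torch.randn(4, 3)

model.eval()                        # dropout disabled, batch norm (if any) uses running statistics
with torch.no_grad():               # no graph is built, no gradients are stored
    logits = model(data)
print(logits.requires_grad)         # False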

2. Torchvision

torchvision is PyTorch's computer vision package; it serves the PyTorch deep learning framework and is mainly used to build computer vision models. torchvision is composed of the following modules:

  1. torchvision.datasets: functions for loading data and interfaces to common datasets;
  2. torchvision.models: commonly used model architectures (including pre-trained models), such as AlexNet, VGG, ResNet, etc.;
  3. torchvision.transforms: common image transformations, such as cropping, rotation, etc.;
  4. torchvision.utils: other useful utilities.
import torchvision
from PIL import Image
import torchvision.transforms as transforms

torchvision.datasets

The datasets package of torchvision provides a rich set of interfaces to image datasets. The package does not contain the dataset files themselves; instead it downloads a dataset from the network into a directory specified by the user, loads it into memory with its loader, and finally returns the loaded dataset to the user as an object.

import torchvision
 
mnist_dataset = torchvision.datasets.MNIST(root='./data',
                                           train=True,
                                           transform=None,
                                           target_transform=None,
                                           download=True)

torchvision.transforms

# image preprocessing pipeline
transform = transforms.Compose([
    transforms.Resize(96), # resize so that the shorter side is 96
    transforms.ToTensor(), # convert to a Tensor
    transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5)) # normalize
])

torchvision.models

import torch
import torchvision.models as models

# load pre-trained models
google_net = models.googlenet(pretrained=True)
resnet18 = models.resnet18()
alexnet = models.alexnet()
squeezenet = models.squeezenet1_0()
densenet = models.densenet161()
 
# extract the classification layer's input size
fc_in_features = google_net.fc.in_features
print("fc_in_features:", fc_in_features)
 
# check the classification layer's output size
fc_out_features = google_net.fc.out_features
print("fc_out_features:", fc_out_features)
 
# change the number of output classes of the pre-trained model
google_net.fc = torch.nn.Linear(fc_in_features, 10)

torchvision.utils

# stitch 32 images into one grid (8 per row)
# image_tensor: a batch of 32 images with shape (32, C, H, W)
grid_tensor = torchvision.utils.make_grid(image_tensor, nrow=8, padding=2)
grid_image = transforms.ToPILImage()(grid_tensor)
display(grid_image)  # display() is available in notebooks; use grid_image.show() in a plain script

3. PyTorch Fast Transformers

Fast Transformers for PyTorch: https://fast-transformers.github.io/

1. Transformers layers

TransformerEncoderLayer(attention, d_model, n_heads, d_ff=None, dropout=0.1, activation='relu')

  • forward(x, attn_mask=None, length_mask=None)
  • x has shape (N, L, E): batch size N, sequence length L, embedding dimension E

TransformerDecoderLayer(self_attention, cross_attention, d_model, d_ff=None, dropout=0.1, activation='relu')

  • forward(x, memory, x_mask=None, x_length_mask=None, memory_mask=None, memory_length_mask=None)
  • x has shape (N, L, E) and memory has shape (N, L', E); memory is the output of the encoder

2. Recurrent Transformers

Like an RNN, the recurrent transformer layers process only one element of the sequence at a time.

RecurrentTransformerEncoderLayer(attention, d_model, d_ff=None, dropout=0.1, activation='relu', event_dispatcher='')

  • forward(x, state=None)
  • The state is a python object that varies depending on the attention implementation (e.g., linear attention, full attention)
  • When linear attention is used, the state is built from feature maps rather than from a softmax over all previous elements.

RecurrentTransformerDecoderLayer(self_attention, cross_attention, d_model, d_ff=None, dropout=0.1, activation='relu', event_dispatcher='')

  • Attends to the previous inputs and to a precomputed memory.
  • forward(x, memory, memory_length_mask=None, state=None)
  • x has shape (N, E) and memory has shape (N, S, E); memory is the output of the encoder
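
A rough sketch of driving a recurrent encoder one time step at a time during inference, assuming (as described above) that each call takes one element of shape (N, E) and returns the output together with the updated state; the sizes and attention type are example values:

import torch
from fast_transformers.builders import RecurrentEncoderBuilder

# toy recurrent encoder: d_model = n_heads * value_dimensions = 64
encoder = RecurrentEncoderBuilder.from_kwargs(
    n_layers=2, n_heads=4,
    query_dimensions=16, value_dimensions=16,
    feed_forward_dimensions=256,
    attention_type="linear",
).get()

x = torch.randn(2, 10, 64)                        # (N, L, E) full input sequence
state, outputs = None, []
for t in range(x.shape[1]):
    y_t, state = encoder(x[:, t], state=state)    # process one element, carry the state forward
    outputs.append(y_t)
y = torch.stack(outputs, dim=1)                   # (N, L, E)
print(y.shape)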

3. Builders

from fast_transformers.builders import TransformerEncoderBuilder, RecurrentEncoderBuilder

# inside the model's __init__: build a parallel encoder for training
# and a recurrent (step-by-step) encoder for inference
if is_training:
    # encoder (training)
    self.transformer_encoder = TransformerEncoderBuilder.from_kwargs(
        n_layers=self.n_layer,
        n_heads=self.n_head,
        query_dimensions=self.d_model//self.n_head,
        value_dimensions=self.d_model//self.n_head,
        feed_forward_dimensions=2048,
        activation='gelu',
        dropout=0.1,
        attention_type="causal-linear",
    ).get()
else:
    # encoder (inference)
    print(' [o] using RNN backend.')
    self.transformer_encoder = RecurrentEncoderBuilder.from_kwargs(
        n_layers=self.n_layer,
        n_heads=self.n_head,
        query_dimensions=self.d_model//self.n_head,
        value_dimensions=self.d_model//self.n_head,
        feed_forward_dimensions=2048,
        activation='gelu',
        dropout=0.1,
        attention_type="causal-linear",
    ).get()

4. Masking

  • FullMask(mask=None, N=None, M=None, device='cpu')
  • LengthMask(lengths, max_len=None, device=None)
  • TriangularCausalMask(N, device="cpu")
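
A minimal sketch of building these masks and passing them to an encoder produced by the builder from the previous subsection; all sizes and the attention type are example values:

import torch
from fast_transformers.builders import TransformerEncoderBuilder
from fast_transformers.masking import TriangularCausalMask, LengthMask

# toy encoder: d_model = n_heads * value_dimensions = 64
encoder = TransformerEncoderBuilder.from_kwargs(
    n_layers=2, n_heads=4,
    query_dimensions=16, value_dimensions=16,
    feed_forward_dimensions=256,
    attention_type="full",
).get()

x = torch.randn(2, 10, 64)                        # (N, L, E)
attn_mask = TriangularCausalMask(10)              # each position attends only to earlier positions
length_mask = LengthMask(torch.tensor([10, 7]))   # true lengths of the two sequences

y = encoder(x, attn_mask=attn_mask, length_mask=length_mask)
print(y.shape)                                    # torch.Size([2, 10, 64])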

5. Attention
