Compilation of some theoretical knowledge from d2l [1]


1. Introduction

• Machine learning studies how computer systems use experience (usually data) to improve performance on specific tasks. It combines the ideas of statistics, data mining and optimization. Often, it is used as a means to implement artificial intelligence solutions.

• Representation learning is a kind of machine learning whose research focuses on how to automatically find appropriate ways to represent data. Deep learning is multi-level representation learning achieved by learning a hierarchy of transformations.

• Deep learning not only replaces the shallow models of traditional machine learning, but also replaces labor-intensive feature engineering.

• Much of the recent progress in deep learning has been triggered by massive amounts of data generated by cheap sensors and Internet-scale applications, as well as breakthroughs in computing power (via GPUs).

• Whole-system optimization is a key step in achieving high performance. The availability of effective open-source deep learning frameworks makes designing and implementing such systems much easier.

2. Preliminary knowledge

2.1. Data operations

• The main interface for storing and manipulating data in deep learning is the tensor (an n-dimensional array). Tensors provide a variety of functionality, including basic mathematical operations, broadcasting, indexing, slicing, memory saving, and conversion to and from other Python objects.
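
A minimal sketch of these tensor features, assuming the PyTorch edition of the book:

```python
import numpy as np
import torch

x = torch.arange(12, dtype=torch.float32).reshape(3, 4)  # a 3x4 tensor
y = torch.ones(3, 4)

print(x + y, x * y)          # elementwise math
print(x[1:3], x[:, 0])       # slicing rows / indexing a column

a = torch.arange(3).reshape(3, 1)
b = torch.arange(2).reshape(1, 2)
print(a + b)                 # broadcasting to a 3x2 result

before = id(y)
y += x                       # in-place update saves memory
print(id(y) == before)       # True: no new tensor was allocated

n = x.numpy()                # convert to a NumPy array
t = torch.tensor(n)          # and back to a tensor
```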

2.2. Data preprocessing

The pandas package is a commonly used data-analysis tool in Python, and pandas can interoperate with tensors.
When using pandas to handle missing data, we can choose between imputation and deletion depending on the situation.
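
A minimal sketch of this workflow (the tiny CSV file below is a hypothetical example with one missing numeric entry, imputed with the column mean and then converted to a tensor):

```python
import os

import pandas as pd
import torch

# Write a tiny hypothetical CSV with a missing numeric entry.
os.makedirs('data', exist_ok=True)
with open(os.path.join('data', 'house_tiny.csv'), 'w') as f:
    f.write('NumRooms,Price\n3,127500\nNA,106000\n4,178100\n')

data = pd.read_csv(os.path.join('data', 'house_tiny.csv'))
inputs, targets = data[['NumRooms']], data['Price']

# Imputation: fill missing values with the column mean (deletion would use dropna()).
inputs = inputs.fillna(inputs.mean())

# pandas DataFrames interoperate with tensors via their underlying values.
X = torch.tensor(inputs.values, dtype=torch.float32)
y = torch.tensor(targets.values, dtype=torch.float32)
```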

2.3. Linear algebra


2.4. Calculus

• Differential calculus and integral calculus are two branches of calculus, the former can be applied to optimization problems in deep learning.
• The derivative can be interpreted as the instantaneous rate of change of a function with respect to its variable; it is also the slope of the tangent line to the function's curve (see the numerical sketch after this list).
• A gradient is a vector whose components are the partial derivatives of a multivariable function with respect to all its variables.
• The chain rule can be used to differentiate composite functions.
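
A small numerical check of the derivative-as-limit interpretation, using the hypothetical function f(x) = 3x^2 - 4x, whose derivative at x = 1 is 2:

```python
def f(x):
    return 3 * x ** 2 - 4 * x

# The difference quotient (f(x+h) - f(x)) / h approaches f'(1) = 2 as h shrinks.
for h in [0.1, 0.01, 0.001, 0.0001]:
    print(f'h={h:.5f}, difference quotient={(f(1 + h) - f(1)) / h:.5f}')
```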

2.5. Automatic differentiation

Deep learning frameworks can compute derivatives automatically: we first attach gradients to the variables with respect to which we want partial derivatives, then record the computation of the target value, run its backpropagation function, and access the resulting gradients.
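
A minimal PyTorch sketch of that workflow (attach gradients, record the computation, backpropagate, read the gradients); here the target is y = 2 x·x, whose gradient is 4x:

```python
import torch

x = torch.arange(4.0, requires_grad=True)  # attach gradient storage to x
y = 2 * torch.dot(x, x)                    # record the computation of the target value
y.backward()                               # run backpropagation
print(x.grad)                              # gradient of y = 2 x.x is 4x
print(x.grad == 4 * x)
```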

2.6. Probability

• We can sample from probability distributions (see the sketch after this list).
• We can analyze multiple random variables using joint distributions, conditional distributions, Bayes' theorem, marginalization, and independence assumptions.
• Expectation and variance provide practical measures for summarizing key characteristics of probability distributions.
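
A minimal sketch of sampling (referenced in the first bullet above), drawing 1000 rolls of a fair die and comparing the empirical frequencies with the expected probability 1/6:

```python
import torch
from torch.distributions import Multinomial

fair_probs = torch.ones(6) / 6                    # a fair six-sided die
counts = Multinomial(1000, fair_probs).sample()   # 1000 draws in a single call
print(counts / 1000)                              # empirical frequencies, each close to 1/6
```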

2.7. Consult documentation

The official documentation provides extensive descriptions and examples beyond the scope of this book.
You can view the API documentation by calling the dir and help functions, or by using ? and ?? in Jupyter Notebook.
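
A quick sketch of both approaches, using torch as the example module:

```python
import torch

print(dir(torch.distributions))  # list what a module provides
help(torch.ones)                 # show the documentation of a function
# In Jupyter Notebook: torch.ones? shows the docstring, torch.ones?? shows more detail.
```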

3. Linear Neural Network

3.1. Linear regression

• The key elements in a machine learning model are the training data, loss function, optimization algorithm, and the model itself.
• Vectorization makes mathematical expressions more concise and the computation faster (see the timing sketch after this list).
• Minimizing the squared-error objective is equivalent to performing maximum likelihood estimation under an additive Gaussian noise assumption.
• The linear regression model is also a simple neural network.
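
A small sketch contrasting an explicit Python loop with a vectorized addition (the vector length is arbitrary and timings vary by machine):

```python
import time

import torch

n = 100000
a = torch.ones(n)
b = torch.ones(n)

# Elementwise addition with an explicit Python loop.
c = torch.zeros(n)
start = time.time()
for i in range(n):
    c[i] = a[i] + b[i]
print(f'loop: {time.time() - start:.5f} sec')

# Vectorized addition: a single call, far faster.
start = time.time()
d = a + b
print(f'vectorized: {time.time() - start:.5f} sec')
```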

3.2. Implementation of linear regression from scratch

• We learned how a deep network can be implemented and optimized using only tensors and automatic differentiation, without the need to define layers or complex optimizers (a minimal sketch of the pieces follows this list).
• This section only scratches the surface. In the following sections, we will describe other models based on the concepts just introduced and learn how to implement other models more concisely.
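
A minimal sketch of the core pieces used in that from-scratch implementation (model, squared loss, and minibatch stochastic gradient descent), written only with tensors and autograd and close to the book's code:

```python
import torch

def linreg(X, w, b):
    """The linear regression model."""
    return torch.matmul(X, w) + b

def squared_loss(y_hat, y):
    """Elementwise squared loss (halved)."""
    return (y_hat - y.reshape(y_hat.shape)) ** 2 / 2

def sgd(params, lr, batch_size):
    """Minibatch stochastic gradient descent."""
    with torch.no_grad():
        for param in params:
            param -= lr * param.grad / batch_size
            param.grad.zero_()
```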

3.3. Simple implementation of linear regression

We can implement the model more concisely using PyTorch's high-level API.
In PyTorch, the data module provides data processing tools, and the nn module defines a large number of neural network layers and common loss functions.
We can initialize parameters in place using methods whose names end in _.
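
A minimal sketch of this high-level-API setup; the feature dimension of 2, the learning rate of 0.03, and the omitted data pipeline are assumptions for illustration:

```python
import torch
from torch import nn

net = nn.Sequential(nn.Linear(2, 1))   # a single linear layer

# Initialize parameters in place with methods ending in "_".
net[0].weight.data.normal_(0, 0.01)
net[0].bias.data.fill_(0)

loss = nn.MSELoss()                              # squared loss from the nn module
trainer = torch.optim.SGD(net.parameters(), lr=0.03)
```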

3.4. softmax regression

• The softmax operation takes a vector and maps it to probabilities.
• Softmax regression is suitable for classification problems; it uses the softmax operation to output a probability distribution over the categories.
• Cross-entropy is a good measure of the difference between two probability distributions; it measures the number of bits required to encode the data given our model (as sketched below).
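
A minimal sketch of both operations; the cross-entropy here simply picks out the predicted probability of the true class for each example:

```python
import torch

def softmax(X):
    """Map each row of X to a probability distribution (naive version; can overflow for large inputs)."""
    X_exp = torch.exp(X)
    return X_exp / X_exp.sum(dim=1, keepdim=True)

def cross_entropy(y_hat, y):
    """Negative log-likelihood of the true classes."""
    return -torch.log(y_hat[range(len(y_hat)), y])

y_hat = softmax(torch.randn(2, 5))   # two examples, five classes
y = torch.tensor([0, 2])             # true class labels
print(cross_entropy(y_hat, y))
```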

3.5. Image classification data set


3.6. Implementation of softmax regression from scratch

• With softmax regression, we can train multi-class models.
• Training a softmax regression model is very similar to training a linear regression model: first read the data, then define the model and loss function, and then train the model with an optimization algorithm. Most common deep learning models have a similar training process.

3.7. Simple implementation of softmax regression

• Using the high-level API of the deep learning framework, we can implement softmax regression more concisely.
• From a computational perspective, implementing softmax regression is complex. In many cases, deep learning frameworks take extra precautions beyond these well-known tricks to ensure numerical stability, saving us from pitfalls we might otherwise encounter when writing a model from scratch.
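
One such precaution in PyTorch: nn.CrossEntropyLoss takes unnormalized logits and combines the log-softmax with the loss in a numerically stable way. A small sketch (the logits below are hypothetical):

```python
import torch
from torch import nn

logits = torch.tensor([[50.0, 1.0, -5.0],   # large logits would overflow a naive softmax
                       [0.2, 0.4, 0.4]])
labels = torch.tensor([0, 2])

loss = nn.CrossEntropyLoss()
print(loss(logits, labels))   # computed from the logits via a stable log-sum-exp
```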

4. Multi-layer Perceptron

4.1. Multi-layer perceptron

• The multilayer perceptron adds one or more fully connected hidden layers between the input layer and the output layer and transforms the outputs of the hidden layers with an activation function.
• Commonly used activation functions include the ReLU, sigmoid, and tanh functions.
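
A quick sketch applying the three activation functions to the same input:

```python
import torch

x = torch.linspace(-4.0, 4.0, 9)
print(torch.relu(x))      # max(0, x)
print(torch.sigmoid(x))   # squashes values into (0, 1)
print(torch.tanh(x))      # squashes values into (-1, 1)
```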

4.2. Implementation of multi-layer perceptron from scratch

It is easy to implement a simple multilayer perceptron manually. However, with a large number of layers, implementing a multilayer perceptron from scratch becomes cumbersome (e.g., naming and keeping track of the model's parameters).

4.3. Simple implementation of multi-layer perceptron

We can implement multilayer perceptrons more concisely using high-level APIs.
For the same classification problem, the implementation of a multilayer perceptron is the same as that of softmax regression, except that hidden layers with activation functions are added.
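
A minimal sketch of this: the softmax-regression network with one added hidden layer and a ReLU, assuming Fashion-MNIST-style 28×28 inputs and 10 classes:

```python
import torch
from torch import nn

net = nn.Sequential(
    nn.Flatten(),          # 28x28 image -> 784-dimensional vector
    nn.Linear(784, 256),   # hidden layer
    nn.ReLU(),             # activation function
    nn.Linear(256, 10),    # output layer, one logit per class
)

X = torch.randn(2, 1, 28, 28)   # a dummy minibatch
print(net(X).shape)             # torch.Size([2, 10])
```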

4.4. Model selection, underfitting and overfitting

Underfitting means that the model cannot continue to reduce the training error. Overfitting means that the training error is much smaller than the validation error.
Since the generalization error cannot be estimated from the training error, simply minimizing the training error does not necessarily reduce the generalization error. Machine learning models must guard against overfitting, that is, against a generalization error that is too large.
The validation set can be used for model selection, but you should not use it too casually.
We should choose a model of appropriate complexity and avoid using an insufficient number of training samples.

4.5. Weight decay


4.6. Dropout


4.7. Forward propagation, back propagation and computational graphs

• Forward propagation calculates and stores intermediate variables sequentially from the input layer to the output layer in the computational graph defined by the neural network.
• Backpropagation calculates and stores the gradients of intermediate variables and parameters of the neural network in reverse order (from output layer to input layer).
• When training a deep learning model, forward propagation and back propagation are interdependent.
• Training requires more memory than prediction.

4.8. Numerical stability and model initialization

• Vanishing and exploding gradients are common problems in deep networks. Great care needs to be taken during parameter initialization to ensure that gradients and parameters can be well controlled.
• Heuristic initialization methods are needed to ensure that the initial gradient is neither too large nor too small.
• The ReLU activation function alleviates the vanishing gradient problem, which can speed up convergence.
• Random initialization is key to ensuring that symmetries are broken before optimization.
• Xavier initialization suggests that, for each layer, the variance of the outputs should not be affected by the number of inputs, and the variance of the gradients should not be affected by the number of outputs (a usage sketch follows this list).
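
A sketch of applying Xavier initialization to the linear layers of a network with PyTorch's built-in initializer (the two-layer MLP below is a hypothetical example):

```python
import torch
from torch import nn

def init_weights(m):
    # Apply Xavier (Glorot) uniform initialization to every linear layer.
    if isinstance(m, nn.Linear):
        nn.init.xavier_uniform_(m.weight)
        nn.init.zeros_(m.bias)

net = nn.Sequential(nn.Linear(20, 256), nn.ReLU(), nn.Linear(256, 10))
net.apply(init_weights)
```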

4.9. Environmental and Distribution Shifts

In many cases, the training and test sets do not come from the same distribution. This is called distribution shift.
True risk is the expected overall loss for all data drawn from the true distribution. However, this data population is often unavailable. Empirical risk is the average loss on the training data and is used to approximate the true risk. In practice, we perform empirical risk minimization.
Covariate shift and label shift can be detected and corrected at test time, given appropriate assumptions. Failing to account for this shift can become a problem at test time.
In some cases, the environment may remember our automated actions and respond in surprising ways. We must account for this possibility when building models, continue to monitor live systems, and stay open to the possibility that our models and the environment may become entangled in unexpected ways.

4.10. Practical Kaggle Competition: Predicting House Prices


5. Deep learning computation

5.1. Layers and blocks

• A block can be made up of many layers; a block can be made up of many blocks.
• Blocks can contain code.
• Blocks are responsible for a lot of internal processing, including parameter initialization and backpropagation.
• Sequential connection of layers and blocks is handled by Sequential blocks.
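
A minimal sketch of these points: a block defined by subclassing nn.Module (its forward pass is ordinary code), chained with a plain layer by a Sequential block:

```python
import torch
from torch import nn
from torch.nn import functional as F

class MLPBlock(nn.Module):
    """A block made of two layers; its forward pass is ordinary code."""
    def __init__(self):
        super().__init__()
        self.hidden = nn.Linear(20, 64)
        self.out = nn.Linear(64, 20)

    def forward(self, X):
        return self.out(F.relu(self.hidden(X)))

# Blocks and layers compose freely; Sequential chains them in order.
net = nn.Sequential(MLPBlock(), nn.Linear(20, 10))
X = torch.randn(2, 20)
print(net(X).shape)   # torch.Size([2, 10])
```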

5.2. Parameter management

• We have several ways to access, initialize, and tie (share) model parameters.
• We can use custom initialization methods.
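
A sketch of accessing parameters, applying a custom initialization, and tying parameters by reusing a layer object (the networks below are hypothetical):

```python
import torch
from torch import nn

net = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 1))

# Access: parameters of one layer, or of individual tensors.
print(net[2].state_dict())
print(net[0].weight.data, net[0].bias.data)

# Custom initialization applied to every linear layer.
def my_init(m):
    if isinstance(m, nn.Linear):
        nn.init.normal_(m.weight, mean=0, std=0.01)
        nn.init.constant_(m.bias, 0)

net.apply(my_init)

# Tying (sharing): reusing the same layer object shares its parameters.
shared = nn.Linear(8, 8)
tied_net = nn.Sequential(nn.Linear(4, 8), nn.ReLU(),
                         shared, nn.ReLU(), shared, nn.Linear(8, 1))
```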

5.3. Delayed initialization

• Lazy initialization enables the framework to automatically infer parameter shapes, making it easy to modify the model architecture and avoiding some common mistakes.
• We can pass data through the model so that the framework finally initializes the parameters.
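
A sketch using PyTorch's lazy modules, where the input dimension is inferred the first time data flows through the model:

```python
import torch
from torch import nn

# LazyLinear leaves the input dimension unspecified until data is seen.
net = nn.Sequential(nn.LazyLinear(256), nn.ReLU(), nn.LazyLinear(10))

X = torch.randn(2, 20)
net(X)                        # the first forward pass triggers initialization
print(net[0].weight.shape)    # torch.Size([256, 20]), inferred from X
```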

5.4. Custom layers

We can design custom layers by subclassing the basic layer class. This allows us to define flexible new layers that behave differently from any existing layer in the deep learning framework.
After the custom layer is defined, we can call the custom layer in any environment and network architecture.
Layers can have local parameters, which can be created through built-in functions.
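
A sketch of a custom layer with its own local parameters, created through nn.Parameter:

```python
import torch
from torch import nn

class MyLinear(nn.Module):
    """A custom fully connected layer with local parameters."""
    def __init__(self, in_units, units):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(in_units, units))
        self.bias = nn.Parameter(torch.zeros(units))

    def forward(self, X):
        return torch.relu(torch.matmul(X, self.weight) + self.bias)

# A custom layer can be used like any built-in layer, e.g. inside Sequential.
net = nn.Sequential(MyLinear(64, 8), MyLinear(8, 1))
print(net(torch.randn(2, 64)).shape)
```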

5.5. Reading and writing files

The save and load functions can be used for file reading and writing of tensor objects.
We can save and load all parameters of the network through the parameter dictionary.
Saving the architecture must be done in code rather than in parameters.
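
A sketch of saving a tensor and a model's parameter dictionary; the file names and the small network below are hypothetical, and note that the architecture has to be rebuilt in code before the parameters are loaded:

```python
import torch
from torch import nn

x = torch.arange(4)
torch.save(x, 'x-file')               # write a tensor to disk
x2 = torch.load('x-file')             # read it back

net = nn.Sequential(nn.Linear(20, 256), nn.ReLU(), nn.Linear(256, 10))
torch.save(net.state_dict(), 'mlp.params')   # save only the parameter dictionary

# Rebuild the same architecture in code, then load the saved parameters.
clone = nn.Sequential(nn.Linear(20, 256), nn.ReLU(), nn.Linear(256, 10))
clone.load_state_dict(torch.load('mlp.params'))
clone.eval()
```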

5.6. GPU

• We can specify the device used for storage and computation, such as the CPU or the GPU. By default, data is created in main memory and computed on the CPU (a device-placement sketch follows this list).
• Deep learning frameworks require that all input data for calculations be on the same device, whether it is CPU or GPU.
• Inadvertently moving data can significantly degrade performance. A typical mistake is the following: computing the loss for every minibatch on the GPU and reporting it to the user on the command line (or logging it in a NumPy ndarray) triggers the global interpreter lock, which stalls all GPUs. It is better to allocate memory for logging inside the GPU and move only larger logs.
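
A sketch of explicit device placement, falling back to the CPU when no GPU is available:

```python
import torch
from torch import nn

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

X = torch.ones(2, 3, device=device)              # create the data directly on the chosen device
net = nn.Sequential(nn.Linear(3, 1)).to(device)  # move the model's parameters to the same device

Y = net(X)          # inputs and parameters live on the same device
print(Y.device)
# Calling Y.cpu() or Y.numpy(), or printing Y, copies data back to main memory,
# so avoid doing that for every minibatch.
```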

6. Convolutional Neural Network

6.1. From fully connected layers to convolutions

• The translation invariance of images allows us to treat local image patches in the same way regardless of their position.
• Locality means that only a small neighborhood of image pixels is needed to compute the corresponding hidden representation.
• In image processing, convolutional layers usually require far fewer parameters than fully connected layers while still yielding highly effective models.
• Convolutional neural network (CNN) is a special type of neural network that can contain multiple convolutional layers.
• Multiple input and output channels allow the model to capture multifaceted features of the image at each spatial location.

6.2. Image convolution

The core computation of a two-dimensional convolutional layer is the two-dimensional cross-correlation operation. In its simplest form, it performs a cross-correlation between the 2D input data and the convolution kernel and then adds a bias (a minimal sketch follows below).
We can design a convolution kernel to detect the edges of the image.
We can learn the parameters of the convolution kernel from the data.
When learning the convolution kernel, whether strict convolution operation or cross-correlation operation is used, the output of the convolution layer will not be greatly affected.
When we need to detect wider regions in the input features, we can build a deeper convolutional network.
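
A minimal sketch of the two-dimensional cross-correlation referenced above, applied with an edge-detection kernel:

```python
import torch

def corr2d(X, K):
    """2D cross-correlation of input X with kernel K."""
    h, w = K.shape
    Y = torch.zeros(X.shape[0] - h + 1, X.shape[1] - w + 1)
    for i in range(Y.shape[0]):
        for j in range(Y.shape[1]):
            Y[i, j] = (X[i:i + h, j:j + w] * K).sum()
    return Y

X = torch.ones(6, 8)
X[:, 2:6] = 0                      # an image with two vertical edges
K = torch.tensor([[1.0, -1.0]])    # a kernel that detects vertical edges
print(corr2d(X, K))                # nonzero exactly where the edges are
```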

6.3. Padding and strides


6.4. Multiple input and multiple output channels


6.5. Pooling layer

• For a given input element, the maximum pooling layer outputs the maximum value within the window, and the average pooling layer outputs the average value within the window.
• One of the main advantages of pooling layers is to alleviate the over-sensitivity of convolutional layers to position.
• We can specify the padding and stride of the pooling layer.
• Using a maximum pooling layer and a stride greater than 1 reduces the spatial dimensions (such as height and width).
• The number of output channels of the pooling layer is the same as the number of input channels.
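
A sketch of a maximum pooling layer with explicit padding and stride applied to a two-channel input (the tensor shapes are hypothetical); note that the channel count is preserved:

```python
import torch
from torch import nn

X = torch.arange(32, dtype=torch.float32).reshape(1, 2, 4, 4)  # batch of 1, 2 channels, 4x4

pool = nn.MaxPool2d(kernel_size=3, padding=1, stride=2)  # 3x3 window, padding 1, stride 2
Y = pool(X)
print(Y.shape)   # torch.Size([1, 2, 2, 2]): spatial size reduced, channels preserved
```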

6.6. Convolutional Neural Network (LeNet)

• Convolutional neural network (CNN) is a type of network that uses convolutional layers.
• In convolutional neural networks, we use a combination of convolutional layers, nonlinear activation functions, and pooling layers.
• To construct high-performance convolutional neural networks, we usually arrange the convolutional layers to gradually reduce the spatial resolution of their representation while increasing the number of channels.
• In traditional convolutional neural networks, the representation obtained by convolutional block encoding needs to be processed by one or more fully connected layers before output.
• LeNet is one of the earliest published convolutional neural networks.
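
A sketch of a LeNet-style architecture in this spirit (the variant used in the book, with sigmoid activations and average pooling, assuming 28×28 single-channel inputs and 10 classes):

```python
import torch
from torch import nn

net = nn.Sequential(
    nn.Conv2d(1, 6, kernel_size=5, padding=2), nn.Sigmoid(),
    nn.AvgPool2d(kernel_size=2, stride=2),
    nn.Conv2d(6, 16, kernel_size=5), nn.Sigmoid(),
    nn.AvgPool2d(kernel_size=2, stride=2),
    nn.Flatten(),
    nn.Linear(16 * 5 * 5, 120), nn.Sigmoid(),   # convolutional block output -> fully connected layers
    nn.Linear(120, 84), nn.Sigmoid(),
    nn.Linear(84, 10),
)

X = torch.randn(1, 1, 28, 28)
print(net(X).shape)   # torch.Size([1, 10])
```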

7. Modern Convolutional Neural Networks

7.1. Deep convolutional neural network (AlexNet)

• The architecture of AlexNet is similar to LeNet, but uses more convolutional layers and more parameters to fit the large-scale ImageNet dataset.
• Today, AlexNet has been surpassed by more efficient architectures, but it is a crucial step from shallow to deep networks.
• Although AlexNet's code is only a few lines longer than LeNet's, it took the academic community many years to accept the idea of deep learning and to exploit its excellent experimental results, partly because of the lack of efficient computational tools.
• Dropout, ReLU and preprocessing are other key steps to improve the performance of computer vision tasks.

7.2. Networks using blocks (VGG)


7.3. Network of Networks (NiN)


7.4. Network with parallel connections (GoogLeNet)


7.5. Batch normalization

• During the model training process, batch normalization uses the mean and standard deviation of small batches to continuously adjust the intermediate output of the neural network, making the intermediate output values of each layer of the entire neural network more stable.
• Batch normalization is used slightly differently in fully connected and convolutional layers.
• The batch normalization layer, like the dropout layer, computes differently in training mode and in prediction mode (as sketched after this list).
• Batch normalization has many beneficial side effects, mainly regularization. On the other hand, the original motivation of "reduce internal covariate shift" does not seem to be a valid explanation.
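
A sketch of both usages (BatchNorm2d after a convolutional layer, BatchNorm1d after a fully connected layer) and of the train/eval mode switch; the small network below is a hypothetical example:

```python
import torch
from torch import nn

net = nn.Sequential(
    nn.Conv2d(1, 6, kernel_size=5), nn.BatchNorm2d(6), nn.Sigmoid(),   # per-channel normalization
    nn.AvgPool2d(kernel_size=2, stride=2), nn.Flatten(),
    nn.Linear(6 * 12 * 12, 120), nn.BatchNorm1d(120), nn.Sigmoid(),    # per-feature normalization
    nn.Linear(120, 10),
)

X = torch.randn(4, 1, 28, 28)
net.train()   # training mode: normalize with minibatch statistics
net(X)
net.eval()    # prediction mode: normalize with running (moving-average) statistics
net(X)
```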

7.6. Residual Network (ResNet)

• Learning nested functions is an ideal situation for training neural networks. In deep neural networks, it is easier to learn another layer as an identity function (although this is an extreme case).
• The residual mapping makes it easier to learn the identity function, for example by pushing the parameters in the weight layer toward zero.
• An effective deep neural network can be trained using residual blocks: inputs can propagate forward faster through the residual connections between layers (a minimal residual-block sketch follows this list).
• Residual Networks (ResNet) had a profound impact on subsequent deep neural network design.
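
A sketch of a residual block in the ResNet style: the output of the weight layers is added back to the input. The optional 1×1 convolution used when shapes change is omitted here for brevity:

```python
import torch
from torch import nn
from torch.nn import functional as F

class Residual(nn.Module):
    """A basic residual block computing F(x) + x."""
    def __init__(self, num_channels):
        super().__init__()
        self.conv1 = nn.Conv2d(num_channels, num_channels, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(num_channels, num_channels, kernel_size=3, padding=1)
        self.bn1 = nn.BatchNorm2d(num_channels)
        self.bn2 = nn.BatchNorm2d(num_channels)

    def forward(self, X):
        Y = F.relu(self.bn1(self.conv1(X)))
        Y = self.bn2(self.conv2(Y))
        return F.relu(Y + X)   # the residual connection makes an identity mapping easy to learn

blk = Residual(3)
print(blk(torch.randn(4, 3, 6, 6)).shape)   # shape is preserved: torch.Size([4, 3, 6, 6])
```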

7.7. Densely connected network (DenseNet)

• In terms of cross-layer connections, unlike ResNet, which adds input and output, the densely connected network (DenseNet) concatenates input and output along the channel dimension.
• The main building blocks of DenseNet are dense blocks and transition layers.
• When building DenseNet, we need to control the dimensionality of the network by adding transition layers, thereby reducing the number of channels again.
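
A sketch of a dense block (concatenation along the channel dimension) followed by a transition layer that reduces the number of channels; the channel counts and growth rate below are hypothetical:

```python
import torch
from torch import nn

def conv_block(in_channels, out_channels):
    return nn.Sequential(
        nn.BatchNorm2d(in_channels), nn.ReLU(),
        nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=1))

class DenseBlock(nn.Module):
    def __init__(self, num_convs, in_channels, growth_rate):
        super().__init__()
        self.net = nn.ModuleList(
            conv_block(in_channels + i * growth_rate, growth_rate)
            for i in range(num_convs))

    def forward(self, X):
        for blk in self.net:
            Y = blk(X)
            X = torch.cat((X, Y), dim=1)   # concatenate along the channel dimension
        return X

def transition_block(in_channels, out_channels):
    # A 1x1 convolution shrinks the channel count; pooling halves the height and width.
    return nn.Sequential(
        nn.BatchNorm2d(in_channels), nn.ReLU(),
        nn.Conv2d(in_channels, out_channels, kernel_size=1),
        nn.AvgPool2d(kernel_size=2, stride=2))

X = torch.randn(4, 3, 8, 8)
Y = DenseBlock(2, 3, 10)(X)          # channels grow to 3 + 2 * 10 = 23
Z = transition_block(23, 10)(Y)      # back down to 10 channels, 4x4 spatial size
print(Y.shape, Z.shape)
```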

8. Recurrent Neural Network

8.1. Sequence model



Origin blog.csdn.net/CSDN_YJX/article/details/130644681