Machine Learning and Deep Learning: Some Good Articles


This post records some good articles about machine learning and deep learning:

1. CNN

This section collects links to good articles from the Internet and records the points I personally consider most important.

"Learning Convolutional Neural Networks in Half an Hour" : Read this article first. If you don't understand it, you can check the information separately, or you can use the link in the experience below.

1.1. Convolution layer:

  1. Local connectivity (a 1x1 convolution kernel can be regarded as a special fully connected layer; a fully connected layer can also be regarded as a special convolution) + weight sharing. The purpose of both is to reduce the number of parameters.
  2. It is necessary to understand the convolution kernel deeply (it is simply the weight matrix used when computing a convolutional layer's output; different kernels produce different effects after convolution), and at the same time understand how the output size of a convolutional layer is calculated, as well as the special 1x1 convolution kernel (a 1x1 convolution generally only changes the number of output channels, without changing the width and height of the output).
    References: 1. "Calculation example of convolutional layer in CNN": very concise and clear; covers the output-size calculation, a standard convolution example, a 1x1 convolution example, a fully connected layer example, and a simple implementation of a convolutional layer in TensorFlow. 2. "Basic Principles of Convolution Kernel": a very detailed official-account article. 3. "Understanding the 1x1 Convolution Kernel in Convolutional Neural Networks in One Article": gives examples of why "a 1x1 convolution kernel can be regarded as a special fully connected layer". A small sketch of both calculations follows.
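As a small supplement (not from the articles above), here is a sketch assuming PyTorch and made-up tensor sizes, showing the output-size calculation and the fact that a 1x1 convolution only changes the channel count:

import torch
import torch.nn as nn

# Input: a batch of 1 image, 3 channels, 32x32 (hypothetical sizes for illustration)
x = torch.randn(1, 3, 32, 32)

# Standard 3x3 convolution: output size = (W - K + 2P) / S + 1 = (32 - 3 + 2*1) / 1 + 1 = 32
conv3x3 = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, stride=1, padding=1)
print(conv3x3(x).shape)   # torch.Size([1, 16, 32, 32])

# 1x1 convolution: only the number of channels changes (3 -> 8), width and height stay 32x32
conv1x1 = nn.Conv2d(in_channels=3, out_channels=8, kernel_size=1)
print(conv1x1(x).shape)   # torch.Size([1, 8, 32, 32])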

1.2. Pooling layer:

1. Function? Gradually reduce the spatial size of the feature maps, which reduces the number of parameters and the amount of computation in the network, and also helps control overfitting.
2. Why is the pooling layer effective? Image features are locally invariant, i.e. the essential features are not lost through downsampling. Because of this, the image can be shrunk before further convolution, which greatly reduces the convolution computation time. The most common configuration is a 2x2 pooling window with stride 2: the image is downsampled, 75% of the activations are discarded and only the maximum of each window is kept, which also removes some noise. A minimal pooling sketch follows.
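A minimal pooling sketch (assuming PyTorch, with made-up sizes) showing the 2x2, stride-2 downsampling described above:

import torch
import torch.nn as nn

x = torch.randn(1, 16, 32, 32)                # feature map from a previous layer (hypothetical)

pool = nn.MaxPool2d(kernel_size=2, stride=2)  # 2x2 window, stride 2
y = pool(x)
print(y.shape)                                # torch.Size([1, 16, 16, 16]): width and height halved,
                                              # i.e. 75% of positions are dropped, the max of each 2x2 block is kept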

1.3. Fully connected layer:

1. Function? Convolution and pooling only extract local features; the fully connected layer fuses these local features.
2. Example? Suppose the features extracted by the convolution and pooling layers are eyes, a nose, and a mouth. Can we judge that this is a cat from these three features alone? Obviously not, because too many animals have eyes, a nose, and a mouth. We need to fuse these three features in order to finally judge that this thing is a cat rather than a dog.
3. Convolution and full connection are algorithmically interchangeable. A fully connected computation can usually be expressed as an equivalent convolution with a 1x1 kernel (see the sketch after the reference below).
Reference: "Fully connected layer and softmax"

1.4. Softmax:

1. Function? Classification.
2. How to classify? Classification is based not on raw scores but on probabilities.
3. Where do the probabilities come from? Softmax converts the scores into probabilities.
4. Where do the scores come from? They are computed by the neural network after dozens of layers of convolution operations.
5. Why does Softmax use e as the base? Exponentiation widens the gap between scores, so the largest score dominates.
Reference: "Fully Connected Layer and Softmax"
6. Softmax is actually an activation function. The Softmax function is usually described as a combination of multiple sigmoids. A sigmoid returns a value between 0 and 1, which can be interpreted as the probability that a data point belongs to a particular class; sigmoid is therefore widely used in binary classification, while softmax can be used in multi-class classification. Activation functions are divided into linear and nonlinear ones. The main role of a nonlinear activation function is to give the network nonlinear modeling power: without an activation function, the network can only express a linear map, and even with many hidden layers the whole network would be equivalent to a single-layer neural network.
For activation functions, refer to "Deep Learning - Why Use Activation Functions": the explanation is very straightforward, especially of why nonlinear activation functions are needed. "Exclusive | Deep Learning Basics - Activation Functions and When to Use Them? (with code)": explains most activation functions and why so many variants exist. "Visualization of Forward and Backpropagation in Neural Networks": includes visualization sections for forward and backward propagation. A minimal softmax sketch follows.
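To make points 3-5 concrete, here is a minimal softmax sketch in plain Python/NumPy (the scores are made up for illustration); subtracting the max before exponentiating is a standard numerical-stability trick not mentioned above:

import numpy as np

def softmax(scores):
    # subtract the max for numerical stability; exponentiating (base e) widens the gaps between scores
    exp = np.exp(scores - np.max(scores))
    return exp / exp.sum()

scores = np.array([2.0, 1.0, 0.1])   # raw scores from the last fully connected layer (made up)
probs = softmax(scores)
print(probs)         # approx. [0.659 0.242 0.099], probabilities that sum to 1
print(probs.sum())   # 1.0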


2. Activation function

For reference, see "Softmax" above.


3. Dropout

"Understanding of Regularization in Deep Learning" : It talks about regularization in a straightforward and broad way, with a clear framework. The following thoughts are from this article.
"The role of Dropout layer in deep learning" : Explain why Dropout can prevent overfitting, and explain the process, with corresponding code.

1. Dropout is actually a kind of regularization. Regularization means adding "rules and restrictions" to the model so that it generalizes well and does not overfit.
2. In a deep neural network, in theory, as long as there are enough layers and enough neurons per layer, the network can fit any data set perfectly. But if it really fits all the training data perfectly, it will inevitably overfit on the test set. The purpose of regularization is to avoid overfitting, so its idea can be summarized as: switch off some neurons by some means, thereby reducing the complexity of the network and avoiding overfitting.
3. Common regularization methods include: L0 regularization, L1 regularization, L2 regularization (the most commonly used), and dropout regularization. In addition, data augmentation, early stopping, etc. can also prevent overfitting and are also classified as regularization methods.
4. The dropout rate is a hyperparameter, which needs to be tuned for the specific network and application (see the sketch below).
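A minimal Dropout sketch (assuming PyTorch; p=0.5 is just a common default, not a recommendation from the articles above):

import torch
import torch.nn as nn

drop = nn.Dropout(p=0.5)     # the dropout rate p is the hyperparameter to tune

x = torch.ones(1, 8)
drop.train()                 # during training: randomly zero elements and rescale the rest by 1/(1-p)
print(drop(x))               # roughly half the values become 0, the rest become 2.0
drop.eval()                  # during evaluation: dropout is disabled
print(drop(x))               # all ones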


4. Backpropagation

4.1. Computational graphs: deriving backpropagation

1. "Introduction to Deep Learning - Backpropagation (1) Derivation" : From the perspective of calculation graph, it explains how to deduce backpropagation, from the initial simple + and *, to the subsequent derivation of Relu and Sigmoid.
2. "Introduction to Deep Learning" Chapter 5 Actual Combat: Handwritten Number Recognition-Error Backpropagation" : The difference from the above article is that there is an extra code practice.
3. "Forward Propagation, Back Propagation - Easy to Understand" : After reading 1 and 2, it is easy to understand. Note that for a complete weight update, the last data is still used. For example, when updating w1, the original data of w5 is used . Also pay attention to why the formula for updating the weight is so long, because the gradient direction is the fastest rising direction, and if it is the fastest falling direction, it is natural to use "—" . At the same time, attach the corresponding hand push map.

[figure: hand-derived backpropagation example]

4. "Gradient Descent, Back Propagation, Learning Rate α, Optimizer, Neural Network General Process Code Practice" : This article is strongly recommended, and is now attached below:

4.2. Gradient descent and backpropagation

1. Backpropagation is the method for computing the gradient of the loss function with respect to each parameter (i.e. finding the partial derivatives).

2. Gradient descent is the method for updating each weight W according to the gradients (partial derivatives) computed by backpropagation, so as to minimize the loss function (i.e. making the weights W better).

4.3. What is the relationship between learning rate α and gradient descent?

The learning rate α is part of the weight update formula in gradient descent.

Weight update formula in gradient descent: W := W - α * ∂Loss/∂W
The weight W is updated with this formula, where α is the learning rate.

Remark: in my view, the learning rate is essentially the step size.

4.4. What is the relationship between optimizer and gradient descent?

Gradient descent is one kind of optimizer. An optimizer is a method for minimizing the loss function, and it also incorporates the learning rate.

Besides plain gradient descent, machine learning and deep learning also commonly use optimizers such as Adam, Adagrad, and RMSProp. A minimal sketch follows.
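A small sketch (assuming PyTorch) showing that switching optimizers is just a matter of constructing a different torch.optim object over the same parameters; the model and learning rates are made up:

import torch
import torch.nn as nn

model = nn.Linear(10, 1)     # any model's parameters (hypothetical)

# Gradient descent (SGD) is one optimizer; the others only change how the gradient is used to update W
opt_sgd  = torch.optim.SGD(model.parameters(), lr=0.01)
opt_adam = torch.optim.Adam(model.parameters(), lr=0.001)
opt_rms  = torch.optim.RMSprop(model.parameters(), lr=0.001)
opt_ada  = torch.optim.Adagrad(model.parameters(), lr=0.01)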

4.5. Implementing the neural network training process in PyTorch

With the above in mind, the training process of a neural network can be divided into 5 steps:

Forward propagation to get the predicted values --> compute the loss between the predicted and true values --> clear the optimizer's gradients (optional) --> use backpropagation to compute the gradients (derivatives) of all parameters --> let the optimizer update the weights W

# Gradient descent training loop. A minimal setup (model, data, loss, optimizer) is assumed here for illustration.
import torch
import torch.nn as nn

model = nn.Linear(3, 1)                  # simple model for illustration
x = torch.randn(16, 3)                   # dummy inputs
y = torch.randn(16, 1)                   # dummy targets
criterion = nn.MSELoss()                 # loss function
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

for epoch in range(50):
    # Forward pass: compute predicted y by passing x to the model
    y_pred = model(x)

    # Compute and print loss
    loss = criterion(y_pred, y)
    print('epoch: ', epoch, ' loss: ', loss.item())

    # Zero the gradients before the backward pass
    optimizer.zero_grad()

    # Perform a backward pass (backpropagation)
    loss.backward()

    # Update the parameters
    optimizer.step()

Note: the optimizer.zero_grad() here clears the gradients. When memory is limited, the "gradient accumulation" trick can be used instead: the gradients of several small batches are accumulated and then applied as if they came from one large batch. This is equivalent to the gradient of a larger batch_size (though the effect is naturally a bit worse), and is a small trick to increase the effective batch size while reducing memory. A sketch follows.
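A minimal sketch of the gradient-accumulation trick (reusing the model/criterion/optimizer from the loop above; dataloader and accum_steps are assumed for illustration):

# Gradients of several small batches are summed before one optimizer step,
# which imitates a larger batch size when memory is limited.
accum_steps = 4

optimizer.zero_grad()
for i, (inputs, targets) in enumerate(dataloader):     # dataloader is assumed to exist
    outputs = model(inputs)
    loss = criterion(outputs, targets) / accum_steps   # average the loss over the accumulated batches
    loss.backward()                                    # gradients accumulate in .grad
    if (i + 1) % accum_steps == 0:
        optimizer.step()                               # update once every accum_steps batches
        optimizer.zero_grad()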


5. Parameter initialization

  • Introduction: the parameters of a neural network are learned by gradient-descent-based optimization, which requires an initial value for every parameter at the start of training. The choice of these initial values is critical. In general we want the data and parameters to have zero mean and the variances of each layer's inputs and outputs to stay consistent. In practice, initializing parameters from a Gaussian or uniform distribution works well. Good weight initialization benefits model performance and convergence speed, and helps a great deal against vanishing and exploding gradients.
  • Parameter initialization classification:
    • All-zero initialization: every neuron in every layer learns the same thing; there is no way to break symmetry. Why is it wrong to initialize all weights to 0? Because if all parameters are 0, all neurons produce the same output, so during backpropagation all neurons in the same layer behave identically (same gradient, same weight update), which is clearly unacceptable.
    • Random initialization: the variance of each layer's outputs changes with the number of input neurons. The drawback is that a poorly chosen random distribution makes the network hard to optimize (the gradients become tiny and the parameters barely update).
    • Xavier initialization: does not consider the effect of the activation function on the output distribution. Its idea is to make the inputs and outputs follow the same distribution as much as possible, so that the activations of later layers do not collapse towards 0.
    • He initialization: considers the effect of ReLU on the output distribution and keeps the input and output variances consistent. The idea is: in a ReLU network, assume half of the neurons in each layer are activated and the other half output 0, so to keep the variance constant it suffices to divide by 2 on top of Xavier. Recommended for ReLU networks.
    • Random initialization with BN: BN reduces the network's dependence on the scale of the initial values, so a smaller standard deviation can be used. BN is a clever (if somewhat crude) way to weaken the impact of bad initialization. What we want is for the pre-activation values to have a well-behaved distribution (such as a Gaussian), so that gradients can be computed and weights updated during backpropagation; BN forces the outputs through a Gaussian normalization followed by a linear transformation.
    • Pre-train initialization: use the parameters of a pre-trained model as the initial parameters of the model on the new task (fine-tuning).
  • Summary:
    When using the ReLU activation function (without BN), it is best to use He initialization: small random numbers drawn from a Gaussian or uniform distribution.
    When BN is used, the network's dependence on the scale of the initial parameter values is reduced, and a small standard deviation (e.g. 0.01) can be used for initialization.
    Using the parameters of a pre-trained model to initialize the parameters on a new task is also a simple and effective initialization method. A PyTorch sketch follows.
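A small sketch (assuming PyTorch) of applying He initialization, with Xavier and small-std alternatives noted in comments; the network itself is made up:

import torch.nn as nn

def init_weights(m):
    if isinstance(m, (nn.Conv2d, nn.Linear)):
        # He (Kaiming) initialization, recommended for ReLU networks
        nn.init.kaiming_normal_(m.weight, nonlinearity='relu')
        if m.bias is not None:
            nn.init.zeros_(m.bias)
        # Xavier alternative: nn.init.xavier_normal_(m.weight)
        # Small-std alternative when BN is used: nn.init.normal_(m.weight, mean=0.0, std=0.01)

model = nn.Sequential(nn.Conv2d(3, 32, 3), nn.ReLU(), nn.Flatten(), nn.Linear(32 * 30 * 30, 10))
model.apply(init_weights)   # applies init_weights to every submodule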

6. Comparing Keras and PyTorch: CNN model definition and training code

6.1. Comparison

Keras:

  • A high-level API that can run on top of TensorFlow, CNTK, Theano, or MXNet (or as tf.contrib within TensorFlow). Since its first release in March 2015 it has been favored for its ease of use and syntactic simplicity, which facilitate rapid development. It is backed by Google. More suitable for a novice like me.
  • Looking at other people's code, models are generally written in either the sequential or the functional style. I mostly use these two: if the model is simple and does not need multiple inputs and outputs, the sequential form is enough; if the model has multiple outputs, the functional form is required.
  • Training is basically just a call to fit().

PyTorch:

  • Released in October 2016, it is a lower-level API focused on working directly with array expressions. It has gained huge interest in recent years, becoming the solution of choice for academic research and for deep learning applications that require custom, optimized expressions. It is backed by Facebook.
  • You define a class that implements the structure of your neural network. Much of the source code found online is in this form. Keras also supports this style of programming, and torch also supports a sequential style, but it is not recommended for novices.
  • At the beginning of each training batch you initialize (zero) the gradients, then forward propagate, compute the loss, backpropagate, and update the weights, and repeat this loop.

For a more detailed introduction and comparison of the advantages and disadvantages of the two, refer to: "Which deep learning framework is more suitable for beginners, Keras or PyTorch?"

6.2. Model definition code

Background knowledge: the fully connected layer is called the Dense layer in Keras and the Linear layer in PyTorch.

Here is an example of defining a simple CNN:

# Keras
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, MaxPool2D, Flatten, Dense

model = Sequential()
model.add(Conv2D(32, (3, 3), activation='relu', input_shape=(32, 32, 3))) # layer 1: convolution + activation
model.add(MaxPool2D()) # layer 2: pooling
model.add(Conv2D(16, (3, 3), activation='relu')) # layer 3: convolution + activation
model.add(MaxPool2D()) # layer 4: pooling
model.add(Flatten()) # Flatten "flattens" the input, i.e. turns the multi-dimensional input into 1D; commonly used in the transition from convolutional layers to fully connected layers
model.add(Dense(10, activation='softmax')) # Dense layer, i.e. fully connected layer, with a softmax
# PyTorch

import torch
import torch.nn as nn                      # network modules (needed for nn.Module, nn.Conv2d, ...)
import torch.nn.functional as F            # activation functions
import numpy

from torchvision import transforms         # image preprocessing
from torchvision import datasets           # datasets
from torch.utils.data import DataLoader    # data loading
import matplotlib.pyplot as plt

class Net(nn.Module):
	def __init__(self):
		super(Net, self).__init__()
		self.conv1 = nn.Conv2d(3, 32, 3) # convolution
		self.conv2 = nn.Conv2d(32, 16, 3) # convolution
		self.pool = nn.MaxPool2d(2, 2) # pooling
		self.fc1 = nn.Linear(16 * 6 * 6, 10)  # Linear layer, i.e. fully connected layer
	def forward(self, x):
		x = self.pool(F.relu(self.conv1(x))) # first convolution + activation + pooling
		x = self.pool(F.relu(self.conv2(x))) # second convolution + activation + pooling
		x = x.view(-1, 16 * 6 * 6) # flatten the data to one dimension
		x = F.log_softmax(self.fc1(x), dim=-1) # fully connected + softmax
		return x

model = Net()

Note: the PyTorch code above is only a rough definition; if it is hard to follow, see the articles recommended in section 6.4.

6.3. Comparison of training codes

Training a model with Keras is super easy: a single .fit() call and you are off and running.

history = model.fit_generator(
    generator=train_generator,
    epochs=10,
    validation_data=validation_generator)

Training a model in Pytorch involves the following steps:

  • Initialize (zero) the gradients at the beginning of each batch
  • Forward propagation and loss computation
  • Backpropagation
  • Update the weights
# Loop over the dataset multiple times; gradient descent
for epoch in range(2):
	for i, data in enumerate(trainloader, 0):
		# get the inputs; data is a list [inputs, labels]
		inputs, labels = data
		# (1) initialize the gradients (zero them)
		optimizer.zero_grad()
		# (2) forward propagation and loss computation
		outputs = net(inputs)
		loss = criterion(outputs, labels)
		# (3) backpropagation
		loss.backward()
		# (4) update the weights
		optimizer.step()

It takes a lot of steps just to train!

6.4. Full set: model definition, training, and testing

1. pytorch: "CNN Realizes Handwritten Digit Recognition" : The code logic is very well defined.

2. pytorch: "Building a Simple Convolutional Neural Network with the MNIST Dataset" : Note the small differences from the above code.

3. keras: "[Deep Learning] Instantiate a Simple Convolutional Neural Network Using the mnist Dataset" : It is implemented with keras, and the code logic is pretty good.

4. tensorflow: "Defining and Training a CNN Convolutional Neural Network on the MNIST Dataset, Code + Principle + Model Changes": not recommended for beginners.


7. Summary of common evaluation indicators in machine learning

Reference: "Summary of Common Evaluation Indicators in Machine Learning"

Evaluation metrics are built on different machine learning tasks and fall into three main categories: classification, regression, and unsupervised.

The classification metrics I have encountered include Accuracy, TPR, FPR, Recall, Precision, F-score, MAP, the ROC curve, and AUC; regression metrics include MSE, MAE, etc.

[figure: summary table of common evaluation metrics]
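A small sketch (assuming scikit-learn and toy labels) computing a few of the classification metrics listed above:

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score

y_true  = [0, 1, 1, 0, 1, 0, 1, 1]                      # ground-truth labels (toy data)
y_pred  = [0, 1, 0, 0, 1, 1, 1, 1]                      # predicted labels
y_score = [0.2, 0.9, 0.4, 0.1, 0.8, 0.6, 0.7, 0.95]     # predicted probabilities, used for AUC

print('Accuracy :', accuracy_score(y_true, y_pred))
print('Precision:', precision_score(y_true, y_pred))
print('Recall   :', recall_score(y_true, y_pred))
print('F1-score :', f1_score(y_true, y_pred))
print('AUC      :', roc_auc_score(y_true, y_score))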

8. RNN

"One article to understand the basics of RNN (cyclic neural network)"
"RNN, DNN, LSTM"

RNN:

1. Better suited to time-series problems.
2. The difference from a fully connected network is essentially that an extra hidden state from the previous step is passed in.
3. What it realizes is short-term memory. (If an RNN were to implement long-term memory, the current hidden state would have to be linked to the previous n computations, and the amount of computation would grow exponentially, greatly increasing training time. Moreover, when the parameters are updated, gradients can vanish or explode; when gradients vanish, the weights cannot be updated, which is equivalent to not learning at all, so the network cannot learn information from longer sequences well. Hence an RNN only has short-term memory.)
4. Why do gradients vanish and explode? "RNN Gradient Disappearance and Gradient Explosion": explained from the two perspectives of network depth and the activation function, highly recommended!!!
5. Gradient vanishing in a CNN differs from that in an RNN, and an RNN is more prone to it: RNN weights are shared across time steps, whereas a CNN can have different weights in each layer, so during backpropagation the repeated multiplications in a CNN may cancel out. A minimal RNN sketch follows.
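A minimal RNN sketch (assuming PyTorch; the sequence sizes are made up) showing the hidden state that is passed along the time steps:

import torch
import torch.nn as nn

# Input: a batch of 4 sequences, 10 time steps, 8 features per step
x = torch.randn(4, 10, 8)

rnn = nn.RNN(input_size=8, hidden_size=16, batch_first=True)
output, h_n = rnn(x)
print(output.shape)   # torch.Size([4, 10, 16]): hidden state at every time step
print(h_n.shape)      # torch.Size([1, 4, 16]): hidden state after the last step (the "memory" carried along the sequence)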

9. LSTM

The hidden state h in a recurrent neural network stores historical information and can be regarded as a kind of memory. In a simple RNN, h changes at every time step and is overwritten, so it can be seen as short-term memory. In an LSTM, the memory cell c can capture a key piece of information at some moment and hold it over a certain time interval. The lifetime of information stored in c is longer than short-term memory but far shorter than long-term memory, hence the name Long Short-Term Memory (LSTM).

Four stages (a minimal LSTM sketch follows this list):

1. Forget gate: decide to discard information
2. Input gate: determine updated information
3. Merge past and present memories
4. Output gate
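A minimal LSTM sketch (assuming PyTorch; sizes are made up): besides the hidden state h, the gates maintain the cell state c described above.

import torch
import torch.nn as nn

x = torch.randn(4, 10, 8)                 # a batch of 4 sequences, 10 steps, 8 features
lstm = nn.LSTM(input_size=8, hidden_size=16, batch_first=True)

output, (h_n, c_n) = lstm(x)
print(output.shape)   # torch.Size([4, 10, 16])
print(h_n.shape)      # torch.Size([1, 4, 16])  hidden state h: the short-term memory
print(c_n.shape)      # torch.Size([1, 4, 16])  cell state c: the longer-lived memory controlled by the gates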

Various structures of RNN:

1.one-to-one
2.n-to-n
3.one-to-n
4.n-to-one
5.Encoder-Decoder (n-to-m)

"One Picture Really Understands LSTM and BiLSTM" : After reading a lot of information, the picture in this article can be said to be the most clearly drawn, a must-see! ! !
[figure: LSTM and BiLSTM structure diagram]

"Deep Learning Notes 9 Recurrent Neural Network (RNN, LSTM)"
"One article to understand long and short-term memory network (LSTM)"

For animated comparisons of RNN, LSTM, and GRU, be sure to read: "Super Vivid Graphical Illustration of LSTM and GRU, Understanding Recurrent Neural Networks in One Article!"!!!

10. Standardization and normalization

"Standardization and Normalization"
[figure: standardization vs. normalization]
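A small sketch (plain NumPy, toy data) of the two operations:

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 10.0])

# Standardization (z-score): zero mean, unit variance
standardized = (x - x.mean()) / x.std()

# Normalization (min-max): rescale to the range [0, 1]
normalized = (x - x.min()) / (x.max() - x.min())

print(standardized)
print(normalized)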

11. Transformer and Attention

"Detailed Transformer Model"

《https://blog.csdn.net/weixin_42118657/article/details/120164994》

But the prerequisite knowledge is:
1. RNN

2. seq2seq, Encoder/Decoder

Where they were proposed:

  • Encoder-Decoder proposed - 2014 (Bengio's team): Cho et al., Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation
  • Seq2Seq proposed - 2014 (Google): Sutskever et al., Sequence to Sequence Learning with Neural Networks
  • Attention applied in the Encoder-Decoder - 2014 (Bengio's team): Bahdanau et al., Neural Machine Translation by Jointly Learning to Align and Translate

The difference between Encoder-Decoder and Seq2Seq:

  • Seq2Seq is an application-level concept (sequence to sequence), emphasizing the application scenario.
  • Encoder-Decoder is a network-architecture-level concept, referring specifically to a structure that contains both an encoder module and a decoder module.
  • The Encoder-Decoder model is the model applied to the Seq2Seq problem; the specific methods currently used for Seq2Seq basically all fall under the Encoder-Decoder category.
  • The encoder's role is to "turn a real-world problem into a mathematical problem".
  • The decoder's role is to "solve the mathematical problem and turn it into a real-world solution".

"What is Encoder-Decoder, Seq2Seq, Attention?" " : Summed up a lot, the most important thing is that the picture is very good, and there is a WeChat article "Attention mechanism of animation diagram, let you understand it at a glance"

"Deep Learning Three: SeqtoSeq" : Essentially, the attention mechanism is the process of calculating weights! ! !

3. Attention mechanism

"Understanding the Attention Mechanism in One Article" : A very detailed tutorial, a must-see! ! !

"Detailed Attention Mechanism" : Soft Attention (Soft Attention) mechanism means that when selecting information, it does not select only one from N information, but calculates the weighted average of N input information (even rain and dew) , and then input into the neural network to calculate. In contrast, Hard Attention refers to selecting information at a certain position in the input sequence, such as randomly selecting an information or selecting the information with the highest probability. However, soft attention mechanisms are generally used to deal with neural network problems.

In the soft-attention Encoder-Decoder model (more concretely, in an English-Chinese machine translation model), the content and even the lengths of the input and output sequences differ, and attention happens between the encoder and the decoder, i.e. between the input sentence and the generated sentence. The self-attention in a self-attention model happens inside the input sequence (or inside the output sequence) and can capture dependencies between words that are far apart in the same sentence, such as syntactic features (phrase structure). A minimal self-attention sketch follows.
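A minimal sketch of scaled dot-product self-attention (assuming PyTorch; the sizes are made up), i.e. the "computing weights" step described above:

import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V):
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / d_k ** 0.5   # similarity between queries and keys
    weights = F.softmax(scores, dim=-1)             # attention weights: the weighted-average coefficients
    return weights @ V                              # weighted average of the values

# Self-attention: Q, K, V all come from the same sequence (5 tokens, dimension 8)
x = torch.randn(1, 5, 8)
out = scaled_dot_product_attention(x, x, x)
print(out.shape)    # torch.Size([1, 5, 8])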

12. Residual connections

12.1. Explanation of terms

Residual connection, skip connection, shortcut connection... these terms mean essentially the same thing and are often used interchangeably, much like a person's legal name and nickname; there is no need to distinguish them strictly, so "residual connection" is used below.

"Skip Connection - A Sharp Tool to Improve the Performance of Deep Neural Networks"
1. What is a Skip Connection?
A skip connection is a way of connecting nodes across layers in a deep neural network. In a traditional network, the signal flows from the input layer to the output layer, and each hidden layer's output passes through an activation function before being fed to the next layer. With a skip connection, the current layer's signal is additionally passed directly to a deeper layer, i.e. it "skips" the intermediate layers. Such cross-layer connections speed up information flow, mitigate vanishing gradients, and retain more information.

2. Advantages of Skip Connection
For deep neural networks, the advantages of Skip Connection are as follows:

  • 1. Solving the problem of gradient disappearance
    With the increase of the number of neural network layers, the problem of gradient disappearance becomes more serious, making it difficult for deep nodes to be effectively updated, and even the training process will completely stagnate. The Skip Connection can retain more information, so that the gradient can be propagated between different layers through cross-layer connections, thus effectively solving the problem of gradient disappearance.

  • 2. Accelerate model training
    Since Skip Connection allows the signal to be directly transmitted to the next layer at a deeper level without having to go through the middle layer, it can shorten the transmission path of the neural network, accelerate the transmission speed of information and the training speed of the entire neural network.

  • 3. Improve the generalization ability of the model
    In the training of some deep neural networks, due to the difference between the training set and the test set, the phenomenon of overfitting is caused. By adding Skip Connection, more information can be retained, thereby enhancing the generalization ability of the model and reducing the risk of overfitting.

3. How to use Skip Connection?
The following is an example of using Skip Connection:

import tensorflow as tf

def conv_block(input_tensor, filters, strides=(2, 2), activation='relu'):
    x = tf.keras.layers.Conv2D(filters, (3, 3), strides=strides, padding='same')(input_tensor)
    x = tf.keras.layers.BatchNormalization()(x)
    x = tf.keras.layers.Activation(activation)(x)

    x = tf.keras.layers.Conv2D(filters, (3, 3), padding='same')(x)
    x = tf.keras.layers.BatchNormalization()(x)

    skip = tf.keras.layers.Conv2D(filters, (1, 1), strides=strides, padding='same')(input_tensor)
    skip = tf.keras.layers.BatchNormalization()(skip)

    x = tf.keras.layers.Add()([x, skip])
    x = tf.keras.layers.Activation(activation)(x)
    return x

inputs = tf.keras.layers.Input(shape=(32, 32, 3))
x = conv_block(inputs, 32)
x = conv_block(x, 64)
x = conv_block(x, 128)
x = conv_block(x, 256)
x = tf.keras.layers.GlobalAveragePooling2D()(x)
outputs = tf.keras.layers.Dense(10, activation='softmax')(x)

model = tf.keras.models.Model(inputs, outputs)
model.summary()

4. Variants of Skip Connection
There are some variants of Skip Connection, mainly the following two types:

  • 1. Residual Connection
    Residual Connection is a special form of skip connection, first proposed by He et al. in ResNet. It was designed to solve the degradation problem in training deep networks (accuracy drops as the network gets deeper) and uses residual connections to build very deep networks. A residual connection adds a block's input to its output, so the block only needs to learn the residual between them; this lets information pass across layers, retains more of it, and improves the accuracy of the network.

  • 2. Dense Connection
    Dense Connection is the connection pattern proposed in DenseNet: each layer is connected to all subsequent layers. By forcing direct connections between earlier and later layers, all features of the earlier layers are preserved and fed into the current layer, which better preserves the information in the features, avoids feature loss, and improves the accuracy of the model.

5. Summary
As an important technological innovation in the field of modern deep learning, Skip Connection can help neural networks learn features better and improve performance. By rationally using Skip Connection and its variants, we can obtain higher accuracy, faster training speed and stronger generalization ability, which provides a more solid foundation for the application of deep neural networks in various fields.

12.2. Why is the network structure of residual connection easier to learn?

Reference link "Why is the network structure of residual connection easier to learn?" " : The reason is explained from the perspective of "explainability", the clearest version, highly recommended!
[figure: network without residual connections]
After adding the residual connection:
[figure: network with residual connections]

12.3. Principle

Residual connection: suppose some layer of the network applies an operation F to the input x, producing F(x). A normal network outputs F(x); after adding a residual connection, the output becomes x + F(x).

[figure: residual block, output = x + F(x)]

So what is the benefit of the residual structure? It is obvious: because of the added identity term, when this layer's output is differentiated with respect to x there is an extra constant term, so during backpropagation the chain of gradient multiplications does not drive the gradient to zero. In addition, it allows deeper networks to be trained and improves the generalization ability of the model.
For details, refer to "BN layer, skip/residual connection". A minimal PyTorch sketch follows.
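A minimal residual-block sketch in PyTorch (the channel sizes are made up), showing output = x + F(x):

import torch
import torch.nn as nn
import torch.nn.functional as F

class ResidualBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)

    def forward(self, x):
        out = self.conv2(F.relu(self.conv1(x)))   # F(x)
        return F.relu(x + out)                    # output = x + F(x): the residual connection

block = ResidualBlock(16)
print(block(torch.randn(1, 16, 32, 32)).shape)    # torch.Size([1, 16, 32, 32])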

13. Autoencoder

An autoencoder can be understood as a system that tries to restore its original input.

"[Machine Learning] Autoencoder - Autoencoder"

14. GAN (Generative Adversarial Network)

"The principle and implementation method of generative confrontation network"


Source: blog.csdn.net/qq_40967086/article/details/130976767