Summary of Deep Learning Basics

Neural Networks

History of the models

The MP (McCulloch-Pitts) model is the earliest neural network model; it describes the working mechanism of a single neuron. From the structure of a neuron we can see that it is an information processing unit with multiple inputs and a single output, and that its processing of information is nonlinear. On this basis, the MP model was formulated as:

$y = f\left(\sum_i w_i x_i + b\right)$

where $f$ is the activation function.

The perceptron model is very similar to the MP model; its $f$ is the sign function.

The multilayer perceptron (MLP) is a combination and stacking of such neurons.

feedforward neural network

A feedforward neural network is one form of artificial neural network. The neurons are arranged in layers; each neuron is connected only to neurons in the previous layer, receives the previous layer's output, and passes its own output on to the next layer. There is no feedback.

Feedforward neural networks are also called fully connected neural networks; the MLP and the BP neural network are common examples.

[Figure: a feedforward (fully connected) neural network]
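As a concrete illustration, here is a minimal NumPy sketch of a forward pass through such a network (the layer sizes and the choice of sigmoid are illustrative assumptions):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def mlp_forward(x, weights, biases):
    """Forward pass of a fully connected network: each layer
    computes a = f(W a + b) and feeds the result to the next layer."""
    a = x
    for W, b in zip(weights, biases):
        a = sigmoid(W @ a + b)
    return a

# Illustrative 2-3-1 network with random parameters.
rng = np.random.default_rng(0)
weights = [rng.normal(size=(3, 2)), rng.normal(size=(1, 3))]
biases = [np.zeros(3), np.zeros(1)]
print(mlp_forward(np.array([0.5, -0.2]), weights, biases))
```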

activation function

Refer to this article, which covers Sigmoid, Tanh, ReLU, LReLU, ELU, PReLU, Softmax, and Swish. Summarized as follows:

| Activation function | Advantages | Disadvantages |
| --- | --- | --- |
| Sigmoid | 1. Suitable for probability-prediction models; 2. Continuous and easy to differentiate | 1. Prone to vanishing gradients; 2. Non-zero mean output; 3. Involves exponential operations, which are computationally expensive |
| Tanh | 1. Zero mean | 1. Prone to vanishing gradients; 2. Involves exponential operations, which are computationally expensive |
| ReLU | 1. No gradient saturation for $x > 0$; 2. Fast to compute | 1. Zero gradient for negative inputs (Dead ReLU); 2. Non-zero mean |
| LReLU | 1. Solves the Dead ReLU problem; 2. Inherits all the advantages of ReLU | 1. Inherits ReLU's other drawbacks |
| ELU | 1. Solves the Dead ReLU problem; 2. Output mean close to 0; 3. Gradient closer to the natural gradient; 4. Saturates for small inputs, so it is robust to noise | 1. Computationally intensive |
| PReLU | 1. Inherits the advantages of LReLU; 2. The slope parameter is learnable | 1. Inherits LReLU's drawbacks |
| Softmax | 1. Suitable for multi-class probability prediction; 2. A smooth approximation of argmax | 1. With high-variance inputs it outputs nearly one-hot vectors, which further causes gradient dispersion |
| Swish | - | - |

Q1: Sigmoid is not zero-mean; why is this a disadvantage?

You can refer to this article. In short, the gradients of all weights in a layer share the same sign, so all parameters are updated in the same direction, producing a zig-zag ("Z-shaped") update path that slows convergence.

Q2: What is gradient dispersion?

It originates from activation function saturation: once an input falls into the saturated region of the function, the gradient becomes very small.

Q3: What is the functional form of Swish?
$f(x) = x \cdot \mathrm{sigmoid}(\beta x)$

It is a smooth function interpolating between a linear function and ReLU.
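A minimal NumPy sketch of several of the activations above (the default $\beta$ for Swish is an illustrative choice):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    return np.maximum(0.0, x)

def leaky_relu(x, alpha=0.01):
    # LReLU: a small slope alpha for x < 0 avoids the Dead ReLU problem.
    return np.where(x > 0, x, alpha * x)

def softmax(x):
    # Subtract the max for numerical stability.
    e = np.exp(x - np.max(x))
    return e / e.sum()

def swish(x, beta=1.0):
    # f(x) = x * sigmoid(beta * x): smooth, between linear and ReLU.
    return x * sigmoid(beta * x)
```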

backpropagation algorithm

Gradient descent is an iterative optimization algorithm that can be used, among other things, to solve (linear or nonlinear) least-squares problems. Its update rule is $x \leftarrow x - \gamma \cdot \nabla f(x)$, where $\gamma$ is the learning rate.
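A minimal sketch of this update rule (the quadratic objective and the step size are illustrative assumptions):

```python
def gradient_descent(grad, x0, gamma=0.1, steps=100):
    """Iterate x <- x - gamma * grad(x)."""
    x = x0
    for _ in range(steps):
        x = x - gamma * grad(x)
    return x

# Example: minimize f(x) = (x - 3)^2, whose gradient is 2(x - 3).
print(gradient_descent(lambda x: 2 * (x - 3), x0=0.0))  # converges to ~3.0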

The backpropagation algorithm is a learning algorithm for multi-layer neural networks, and it is based on the gradient descent method.

The input-output relationship of a feedforward neural network is essentially a mapping, and its information processing power comes from repeated composition of simple nonlinear functions. This is the basis for applying the BP algorithm.

The BP algorithm consists of a forward propagation pass and a backward propagation pass. In the forward pass, the input is processed layer by layer, from the input layer through the hidden layers to the output layer. Taking the loss function as the objective, backpropagation then computes, layer by layer, the partial derivatives of the objective with respect to each neuron's weights, building the gradient of the objective with respect to the weight vector as the basis for modifying the weights. Finally the weights are updated; learning ends when the error reaches the expected value.

The gradient is constructed following the chain rule; the weight update follows the gradient descent method.

The essence of backpropagation is the gradient, which reflects how a change in a variable affects the objective function. There is a lot of depth to this; for details, please refer to this video.

The BP algorithm relies on computation graphs; an example is shown below:
[Figure: example computation graph]

That is to say, during actual execution the gradient of each downstream node with respect to the current node is computed; during backpropagation the chain rule is applied, multiplying these local gradients together and substituting in the values computed in the forward pass.

Automatic differentiation is a method for computing derivatives by computer, divided into forward mode and reverse mode. Taking the function $f = g(h(x))$ as an example, its derivative is $\frac{df}{dx} = \frac{dg}{dh}\cdot\frac{dh}{dx}$. Forward mode computes the derivative from right to left, evaluating each layer's state and activation value while simultaneously computing each node's partial derivative with respect to the input variable. Reverse mode works the other way around, and it is how the backpropagation algorithm computes gradients: first compute each layer's states and activation values in a forward pass, then compute the partial derivatives backward layer by layer, i.e., the partial derivative of the objective function with respect to the current variable.

For automatic differentiation, you can refer to this video.

During this process a computation graph is built, which may be static (fixed once constructed) or dynamic (adjusted on the fly to match the function's structure).
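To make reverse mode concrete, here is a toy sketch for $f = g(h(x))$ with illustrative choices $g(u) = \sin(u)$ and $h(x) = x^2$; real frameworks build and traverse the computation graph automatically:

```python
import math

def f_and_grad(x):
    # Forward pass: compute and store each node's value.
    h = x ** 2
    f = math.sin(h)
    # Backward pass: apply the chain rule from the output back.
    df_dh = math.cos(h)      # dg/dh
    dh_dx = 2 * x            # dh/dx
    df_dx = df_dh * dh_dx    # df/dx = dg/dh * dh/dx
    return f, df_dx

print(f_and_grad(1.5))
```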

model training

data normalization

The role of normalization: bring all features to a comparable order of magnitude.

The benefits of normalization are: it facilitates subsequent data processing, and it speeds up model convergence.

This is because mismatched orders of magnitude cause two problems: large oscillations and an unstable model, and a long convergence time.

Common methods are: min-max normalization and Z-score.

$x^* = \frac{x - \min}{\max - \min}$

$x^* = \frac{x - \mu}{\sigma}$

The normalization discussed here applies to the raw input data.
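A minimal NumPy sketch of the two methods (in practice they are applied per feature):

```python
import numpy as np

def min_max(x):
    # x* = (x - min) / (max - min), maps values into [0, 1].
    return (x - x.min()) / (x.max() - x.min())

def z_score(x):
    # x* = (x - mu) / sigma, gives zero mean and unit variance.
    return (x - x.mean()) / x.std()

x = np.array([1.0, 5.0, 10.0, 20.0])
print(min_max(x), z_score(x))
```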

parameter initialization

The role of parameter initialization is to speed up the convergence of gradient descent.

Symmetric weight problem: if a layer has K hidden units and all entries of its parameter matrix take the same value, then the K mappings are identical. Such a network, however many hidden units it has, is a completely redundant representation, and in the end it can learn only one feature.

The solution to this problem is random initialization.

Common and simple random initialization methods include Gaussian initialization and uniform distribution initialization.

These two initialization methods are also flawed:

  1. If the variance is too small, the weights concentrate near 0; with the Sigmoid function this causes the gradient explosion problem;
  2. If the variance is too large, with the Sigmoid function the gradient vanishes.

All in all, with plain random initialization the weights are random and their scale is hard to control.

The corresponding solutions are variance scaling and orthogonal initialization (Gaussian initialization + singular value decomposition).
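A sketch of these ideas, assuming the commonly cited Xavier/He variance-scaling formulas and the Gaussian + SVD recipe mentioned above:

```python
import numpy as np

rng = np.random.default_rng(0)

def xavier_init(fan_in, fan_out):
    # Variance scaling for tanh/sigmoid-style activations.
    std = np.sqrt(2.0 / (fan_in + fan_out))
    return rng.normal(0.0, std, size=(fan_out, fan_in))

def he_init(fan_in, fan_out):
    # Variance scaling for ReLU-style activations.
    std = np.sqrt(2.0 / fan_in)
    return rng.normal(0.0, std, size=(fan_out, fan_in))

def orthogonal_init(fan_in, fan_out):
    # Gaussian initialization + singular value decomposition.
    a = rng.normal(size=(fan_out, fan_in))
    u, _, vt = np.linalg.svd(a, full_matrices=False)
    return u if u.shape == (fan_out, fan_in) else vt
```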

loss function

See the summary in this article.

Common ones are: the 0-1 loss, absolute-value loss, logarithmic loss, squared loss, exponential loss, hinge loss, perceptron loss, cross-entropy loss, and focal loss.

Notes:

  1. Absolute-value loss and squared loss are often used for regression problems, but they are sensitive to noise and not robust;
  2. Logarithmic loss takes the negative logarithm of the predicted confidence and is used in logistic regression; it is sensitive to noise and not robust;
  3. Exponential loss is used in AdaBoost; it is sensitive to noise and not robust;
  4. Hinge loss is used in SVMs; it requires not only correct classification but also a sufficient margin of confidence. It is insensitive to noise and quite robust;
  5. Perceptron loss drops the margin of 1, weakening the hinge loss's confidence requirement;
  6. To discuss the cross-entropy loss we must first mention KL divergence (relative entropy), which measures the gap between two distributions: $D_{KL}(p\|q) = \sum_i p(x_i)\log\frac{p(x_i)}{q(x_i)}$. Expanding this gives the cross-entropy minus the entropy of $p$; since the entropy of the true distribution is fixed, cross-entropy can be regarded as a simplified KL divergence (see the sketch below);
  7. Focal loss is an enhanced version of the cross-entropy loss; its purpose is to adaptively make the model focus on hard samples, alleviating the imbalance between positive and negative samples.
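A minimal NumPy sketch verifying the relationship between KL divergence, entropy, and cross-entropy (the two distributions are illustrative):

```python
import numpy as np

def entropy(p):
    return -np.sum(p * np.log(p))

def cross_entropy(p, q):
    # H(p, q) = -sum_i p(x_i) log q(x_i)
    return -np.sum(p * np.log(q))

def kl_divergence(p, q):
    # D_KL(p || q) = sum_i p(x_i) log(p(x_i) / q(x_i))
    return np.sum(p * np.log(p / q))

p = np.array([0.7, 0.2, 0.1])   # "true" distribution
q = np.array([0.5, 0.3, 0.2])   # predicted distribution
# D_KL(p||q) = H(p, q) - H(p): with p fixed, minimizing cross-entropy
# is equivalent to minimizing the KL divergence.
assert np.isclose(kl_divergence(p, q), cross_entropy(p, q) - entropy(p))
```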

Model optimization

So-called model optimization means finding parameters that minimize the empirical risk (or structural risk).

Traditional machine learning usually faces convex optimization problems, while deep learning faces non-convex ones. Intuitively, the difference between the two looks like this:
[Figure: convex vs. non-convex optimization problems]

Difficulties in optimization include: a huge number of parameters, which slows training; non-convexity; vanishing gradients; and poor interpretability of the parameters.

Intuitively, optimization is the search for an optimal position on the surface of the loss function. However, this surface is usually very complex; visually it looks like this:

[Figure: loss surface of a deep network]
The so-called vanishing gradient corresponds to falling into a flat region of this surface. In addition, training on such a surface easily gets stuck in a local optimum.

It is worth mentioning that skip connections can smooth the surface.
[Figure: loss surface with and without skip connections]

For common optimization algorithms you can refer to this article, which mainly introduces BGD, SGD, MBGD, SGD with Momentum, Nesterov accelerated gradient, AdaGrad, AdaDelta, RMSprop, and Adam.

Notes:

  1. BGD, SGD, and MBGD are all gradient descent methods; they differ in what the gradient is computed over: the entire training set, a single sample, or a mini-batch of samples;
  2. Momentum keeps the gradient descent update rule, but the current gradient is combined with the (decayed) gradient from the previous step;
  3. AdaGrad, AdaDelta, RMSprop, and Adam are all adaptive algorithms that dynamically rescale the gradient;
  4. AdaGrad simply divides the gradient by $\sqrt{\sum_{i=1}^t g_i^2 + \epsilon}$; the benefit is that the denominator is small in the early stages (fast progress) and large later on (slower, finer steps);
  5. AdaDelta no longer simply accumulates squared gradients, but uses an exponentially weighted average $n_t = \nu \times n_{t-1} + (1-\nu) \times g_t^2$. In addition, AdaDelta replaces the learning rate with an exponentially weighted average of past squared updates, $E[\Delta\theta^2]_{t-1} = \rho E[\Delta\theta^2]_{t-2} + (1-\rho)\Delta\theta_{t-1}^2$, so no learning rate needs to be set by hand;
  6. RMSprop is a simplified version of AdaDelta that retains an explicit learning rate;
  7. Adam is RMSprop with a momentum term (see the sketch below).
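A minimal sketch of the momentum and Adam update rules (the hyperparameter values are the commonly used defaults, shown for illustration):

```python
import numpy as np

def momentum_step(theta, grad, v, lr=0.01, mu=0.9):
    # v accumulates an exponentially decaying average of past gradients.
    v = mu * v + grad
    return theta - lr * v, v

def adam_step(theta, grad, m, n, t, lr=0.001, b1=0.9, b2=0.999, eps=1e-8):
    # m: momentum term; n: weighted average of squared gradients (RMSprop part).
    m = b1 * m + (1 - b1) * grad
    n = b2 * n + (1 - b2) * grad ** 2
    m_hat = m / (1 - b1 ** t)   # bias correction; t counts steps from 1
    n_hat = n / (1 - b2 ** t)
    return theta - lr * m_hat / (np.sqrt(n_hat) + eps), m, n
```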

Data augmentation/model generalization

Data augmentation helps prevent overfitting and enhances generalization.

Common methods are: translation, flipping, scaling, rotation, adding noise, and changing focus.

Model overfitting is a common problem, and solutions exist at two levels: the data level and the algorithm level. At the data level we use data augmentation; at the algorithm level we use regularization, dropout, batch normalization (BN), early stopping, weight decay, and so on.
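A minimal NumPy sketch of a few of the augmentations listed above (the noise scale is an illustrative choice):

```python
import numpy as np

rng = np.random.default_rng(0)

def horizontal_flip(img):
    return img[:, ::-1]              # flip along the width axis

def translate(img, dx):
    return np.roll(img, dx, axis=1)  # circular shift along the width

def add_noise(img, scale=0.05):
    return img + rng.normal(0.0, scale, size=img.shape)

img = rng.random((32, 32))           # toy grayscale image
augmented = [horizontal_flip(img), translate(img, 4), add_noise(img)]
```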

convolutional neural network

The characteristics of convolutional neural networks are:

  1. local connectivity;
  2. weight sharing;
  3. translation invariance.

Characteristics 1 and 2 solve the problem of too many parameters in fully connected networks.

convolution kernel

The convolution operation is similar to the mathematical "*" operation (flip, translate, take dot products), but convolution in deep learning omits the flip (strictly speaking, it computes a cross-correlation).

Traditional convolution kernels have fixed values, such as the Sobel operator and the Gaussian operator, designed to extract specific features from an image or perform specific operations on it. Convolution in deep learning still focuses on feature extraction, but the kernel values become learnable.

Next, we introduce the basic parameters of convolution and simple numerical calculations.

  1. kernel size k;
  2. stride s;
  3. padding p;
  4. dilation factor d.

Size calculation:
$W_{out} = \frac{W_{in} + 2p - k}{s} + 1$ (and similarly for $H$).
If a dilation factor is introduced, the effective kernel size becomes $k' = d \times (k-1) + 1$.

Parameter calculation:
$N = (k \times k \times C_{in} + 1) \times C_{out}$
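The two formulas as a small sketch (the helper names are illustrative):

```python
def conv_output_size(w_in, k, s=1, p=0, d=1):
    # Effective kernel size with dilation: k' = d * (k - 1) + 1
    k_eff = d * (k - 1) + 1
    return (w_in + 2 * p - k_eff) // s + 1

def conv_param_count(k, c_in, c_out):
    # (k * k * C_in + 1) * C_out; the +1 is the per-filter bias.
    return (k * k * c_in + 1) * c_out

print(conv_output_size(224, k=3, s=1, p=1))  # 224 ("same" convolution)
print(conv_param_count(3, 64, 128))          # 73856
```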

Activation and pooling layers

The activation layer simply applies one of the common activation functions.

For the pooling layer, max pooling and average pooling are common.

There is no padding in the pooling layer, so its size calculation formula is $W_{out} = \frac{W_{in} - k}{s} + 1$.

It is worth mentioning that a pooling layer has 0 parameters.
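A minimal NumPy sketch of max pooling (the 2×2 window with stride 2 is a common, illustrative configuration):

```python
import numpy as np

def max_pool2d(x, k=2, s=2):
    """Max pooling over a 2-D feature map: no padding, no parameters."""
    h_out = (x.shape[0] - k) // s + 1
    w_out = (x.shape[1] - k) // s + 1
    out = np.empty((h_out, w_out))
    for i in range(h_out):
        for j in range(w_out):
            out[i, j] = x[i * s:i * s + k, j * s:j * s + k].max()
    return out

x = np.arange(16.0).reshape(4, 4)
print(max_pool2d(x))   # [[ 5.  7.] [13. 15.]]
```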

The convolutional layer, activation layer, and pooling layer constitute the basic structure of a convolutional neural network.

Common CNN models include LeNet, AlexNet, GoogLeNet, VGG, ResNet...

recurrent neural network

The shortcomings of traditional feedforward neural networks when processing sequential data are:

  1. The input (and likewise the output) must have a fixed size, which is too restrictive;
  2. Weak memory: contextual information is not sufficiently taken into account.

RNN

[Figure: an RNN unrolled over time]
The parameters of every time step are shared: on the one hand to reduce the parameter count, and on the other to achieve translation invariance along the sequence.
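A minimal NumPy sketch of one RNN step with shared parameters (the dimensions are illustrative, and the tanh cell is the standard vanilla-RNN choice):

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_h = 4, 8   # illustrative input / hidden sizes

# One set of parameters, shared across all time steps.
W_x = rng.normal(scale=0.1, size=(d_h, d_in))
W_h = rng.normal(scale=0.1, size=(d_h, d_h))
b = np.zeros(d_h)

def rnn_step(x_t, h_prev):
    # h_t = tanh(W_x x_t + W_h h_{t-1} + b)
    return np.tanh(W_x @ x_t + W_h @ h_prev + b)

h = np.zeros(d_h)
for x_t in rng.normal(size=(5, d_in)):   # a length-5 toy sequence
    h = rnn_step(x_t, h)
```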

Common RNN structures are:
[Figure: common RNN structures]

  1. Music generation is One2Many; the output of each step is also fed as the input of the next step;
  2. Sentiment classification is Many2One, a very standard structure;
  3. Named entity recognition is Many2Many (type 2), a very standard structure;
  4. Machine translation is Many2Many (type 1); its structure is shown below. This model is also called Seq2Seq/Encoder-Decoder:
    [Figure: the Seq2Seq/Encoder-Decoder structure]

BPTT (backpropagation through time)

Because the same parameters are used at every time step, taking the Many2Many model as an example, the gradient with respect to each parameter must be computed at every time step and then summed.

Taking the third time step as an example, the formula is:

[Figures: BPTT gradient formulas, expanded at the third time step]

It can be seen that no matter which parameter we differentiate with respect to, the chain of multiplied hidden-layer derivatives grows with the number of time steps. This easily leads to vanishing or exploding gradients.

This is the so-called long-range dependency problem (gradient vanishing/explosion). Mathematically, its essence is that the activation function saturates or its derivative stays near 0.

In my view, vanishing gradients are the more likely case: the values in each neuron keep accumulating, which easily drives them large and saturates the activation function. Intuitively, it means the memory has reached the upper limit of its capacity.

LSTM

The figure below should be of the One2Many type, since there are no additional inputs.
[Figure: LSTM cell structure]
Another way of expressing it is:
[Figure: alternative depiction of the LSTM cell]
Description:

  1. The output of the previous time step participates in computing the three gates and the current candidate memory cell;
  2. When an input arrives, it is first transformed into a candidate memory cell; what happens next is determined together with the gate functions.
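A minimal NumPy sketch of one LSTM step following the standard gate equations (the packed weight layout and the dimensions are illustrative choices):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM step. W maps [h_prev; x_t] to the three gates plus the
    candidate memory cell (packed order: forget, input, output, candidate)."""
    z = W @ np.concatenate([h_prev, x_t]) + b
    d = h_prev.size
    f = sigmoid(z[0 * d:1 * d])   # forget gate: how much old memory to keep
    i = sigmoid(z[1 * d:2 * d])   # input gate: how much new content to write
    o = sigmoid(z[2 * d:3 * d])   # output gate
    g = np.tanh(z[3 * d:4 * d])   # candidate memory cell from the new input
    c = f * c_prev + i * g        # memory update: old memory can be forgotten
    h = o * np.tanh(c)            # new hidden state
    return h, c

# Illustrative sizes and random parameters.
d_h, d_in = 8, 4
rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(4 * d_h, d_h + d_in))
b = np.zeros(4 * d_h)
h, c = lstm_step(rng.normal(size=d_in), np.zeros(d_h), np.zeros(d_h), W, b)
```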

Why can LSTM solve the problem of gradient disappearance?
The essential cause of vanishing gradients here is that the activation function saturates because the memory cell's value grows too large. LSTM keeps updating (and partially forgetting) the memory cell, ensuring it does not grow without bound. Intuitively, it is because memory can forget.

GRU

[Figure: GRU cell structure]
As can be seen, in both GRU and LSTM the gates are computed jointly from the hidden state and the input.

The computation of the candidate hidden state uses the reset gate, which controls how much the hidden state of the previous moment contributes to the state at this moment;

The computation of the hidden state uses the update gate, which interpolates between the previous hidden state and the candidate hidden state.
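A minimal NumPy sketch of one GRU step following the standard equations (parameter shapes are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev, W_r, W_z, W_h, b_r, b_z, b_h):
    """One GRU step; each W_* maps the concatenation [h; x] to hidden size."""
    hx = np.concatenate([h_prev, x_t])
    r = sigmoid(W_r @ hx + b_r)   # reset gate
    z = sigmoid(W_z @ hx + b_z)   # update gate
    # Candidate hidden state: the reset gate scales the previous hidden state.
    h_tilde = np.tanh(W_h @ np.concatenate([r * h_prev, x_t]) + b_h)
    # Update gate interpolates between previous and candidate hidden state.
    return z * h_prev + (1 - z) * h_tilde
```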

GRU performs similarly to LSTM; the difference is that it has slightly fewer parameters.

transfer learning

Transfer learning refers to learning a model on one task and then using it to help solve other, related tasks.

Transfer learning plays a large role in deep learning, because networks often require a great deal of data and are expensive to train.

Common approaches to transfer learning include:

  1. Train a feature extraction module: Word2Vec in the text domain, ResNet in the image domain, I3D in the video domain (the extracted features are then used for downstream tasks);
  2. Train a model on a related task, then fine-tune it on the target task.

Common application areas are:

  1. semi-supervised learning;
  2. few-shot learning;
  3. multi-task learning.

Same task, different domains; same domain, different tasks.

There are many large-scale datasets in CV. We hope to train models on these data and extend the acquired knowledge to a target task; this is exactly what transfer learning sets out to do.

Generally speaking, a neural network is divided into two parts: encoder (feature extraction) and decoder (decision making).

The common method is pre-trained model + fine-tuning.

Using a pre-trained model means initializing the current model with weights trained elsewhere (the decoder part remains randomly initialized, because the label spaces are inconsistent).

Fine-tuning updates some of this model's parameters. We generally assume the initialization already lies near a good solution, so the amount of training is deliberately limited. There are three common methods:

  1. Reduce the number of epochs and the learning rate for some layers;
  2. Freeze some layers;
  3. Add constraints during training: keep the outputs close / keep the parameters close (conservative learning).

It is worth mentioning, for the second method, that because neural networks are usually hierarchical (the bottom layers learn low-level features while the upper layers are more semantic), the lower layers can simply be frozen.
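A sketch of pre-training + fine-tuning with frozen layers, assuming PyTorch and a torchvision ResNet; the class count, learning rate, and freezing everything except the new head are illustrative choices:

```python
import torch
import torch.nn as nn
from torchvision import models

# Load a model pre-trained on ImageNet (the encoder's initialization).
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)

# Freeze the existing layers: they already encode generic features.
for param in model.parameters():
    param.requires_grad = False

# Replace the decoder (classification head): the label space differs from
# ImageNet, so it stays randomly initialized. 10 classes is illustrative.
model.fc = nn.Linear(model.fc.in_features, 10)

# Fine-tune only the new head, with a small learning rate.
optimizer = torch.optim.SGD(model.fc.parameters(), lr=1e-3, momentum=0.9)
```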

In fact, transfer learning essentially just speeds up model convergence. The reason accuracy can improve a little is that the pre-trained model has learned more than could be learned from the small dataset of the current task; if the current task's dataset is fairly large, the accuracy gain will not be obvious.


Source: blog.csdn.net/weixin_46365033/article/details/125420367