[Study Notes] Deep Learning Knowledge Points 1

1. MLP

1. A Multilayer Perceptron (MLP), also called an Artificial Neural Network (ANN), can have multiple hidden layers (Hidden Layer) between the input layer (Input Layer) and the output layer (Output Layer).
Adjacent layers of a multilayer perceptron are fully connected. The bottom layer is the input layer, the middle layers are hidden layers, and the last layer is the output layer.
How are the neurons of the hidden layer computed? The hidden layer is fully connected to the input layer. If the input is represented by a vector X, the output of the hidden layer is f(W1X + b1), where W1 is the weight matrix (also called the connection coefficients), b1 is the bias, and the function f is commonly the sigmoid function or the tanh function.
2. Why use an activation function?

  1. Without an activation function, the output of each layer is a linear function of the previous layer's input, so no matter how many layers the network has, the overall output is still a linear combination of the inputs.
  2. An activation function introduces nonlinearity into the neurons, so the neural network can approximate arbitrary nonlinear functions and can therefore be applied to many more nonlinear models.

3. The activation function needs to have the following properties:

  1. A continuous and differentiable nonlinearity (non-differentiability at a small number of points is acceptable). A differentiable activation function allows the network parameters to be learned directly with numerical optimization methods.
  2. The activation function and its derivative should be as simple as possible, which helps the computational efficiency of the network.
  3. The range of the derivative of the activation function should lie in an appropriate interval, neither too large nor too small; otherwise the efficiency and stability of training suffer.

4. What is the relationship between the hidden layer and the output layer?
The mapping from the hidden layer to the output layer can be regarded as multi-class logistic regression, i.e. softmax regression, so the output of the output layer is softmax(W2X1 + b2), where X1 = f(W1X + b1) is the output of the hidden layer.
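
To make the two formulas above concrete, here is a minimal NumPy sketch of a single-hidden-layer MLP forward pass. The layer sizes and the choice of tanh for f are illustrative assumptions, not anything prescribed by the text.

```python
import numpy as np

def softmax(z):
    # Subtract the max before exponentiating for numerical stability.
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

# Illustrative sizes: 4 input features, 8 hidden units, 3 output classes.
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(8, 4)), np.zeros(8)   # hidden-layer weights and bias
W2, b2 = rng.normal(size=(3, 8)), np.zeros(3)   # output-layer weights and bias

X = rng.normal(size=4)        # one input vector
X1 = np.tanh(W1 @ X + b1)     # hidden layer: f(W1 X + b1), with f = tanh
Y = softmax(W2 @ X1 + b2)     # output layer: softmax(W2 X1 + b2)

print(Y, Y.sum())             # class probabilities that sum to 1
```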

2. FLOPS

FLOPS is short for floating point operations per second, i.e. the number of floating-point operations per second. It is understood as computation speed and is a metric for measuring hardware performance.
FLOPs is short for floating point operations (the lowercase s marks the plural), i.e. the number of floating-point operations. It is understood as the amount of computation and can be used to measure the complexity of an algorithm/model.
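
As a rough illustration of "FLOPs as the amount of computation", here is a small sketch that counts the floating-point operations of a single fully connected layer. The convention of counting each multiply and each add separately (rather than one fused multiply-accumulate) is an assumption; published counts differ depending on the convention used.

```python
def dense_flops(in_features: int, out_features: int, batch_size: int = 1) -> int:
    """Approximate FLOPs of y = W x + b, counting multiplies and adds separately."""
    mults = in_features * out_features        # one multiply per weight
    adds = (in_features - 1) * out_features   # additions inside each dot product
    bias = out_features                       # adding the bias
    return batch_size * (mults + adds + bias)

# Example: a 784 -> 256 fully connected layer on one input.
print(dense_flops(784, 256))   # 401408 under this counting convention
```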
Reference article: Introduction and calculation of FLOPs in deep learning (note the distinction between FLOPS)

3. Ablation Experiment

Ablation experiment: when the authors propose a new scheme that changes several conditions/parameters at the same time, an ablation experiment removes or varies one condition/parameter at a time while keeping the others fixed and observes the results, in order to find out exactly which condition/parameter has the greater influence on the outcome.
Example: in an object detection system, adding methods A, B, and C together achieves a good result, but at this point you do not know which of A, B, and C is responsible for the improvement. So you keep A and B, remove C, and run the experiment to see what role C plays in the whole system, and ultimately determine which method has the greater impact on the outcome.
Summary: an ablation experiment is essentially the "controlled variable method".
Reference article: What is the ablation experiment in deep learning?

4. GELU activation function

[Figure: overview of activation functions]

Image source: From ReLU to GELU, an overview of the activation functions of neural networks.
An activation function (Activation Function) is a function that runs on the neurons of an artificial neural network and is responsible for mapping the neuron's input to its output. Activation functions are essential for neural network models to learn and understand very complex, nonlinear functions; they introduce nonlinear characteristics into the network.

The activation function sits between the output of one hidden layer and the input of the next hidden layer.

4.1 Commonly used activation functions

sigmoid function

It is a very commonly used nonlinear activation function. Its mathematical form is as follows:

sigmoid(x) = 1 / (1 + e^(-x))

The image is as follows:
[Figure: sigmoid curve]

Features: the sigmoid function transforms a continuous real-valued input into an output between 0 and 1. A very large negative input gives an output close to 0, and a very large positive input gives an output close to 1, which makes it suitable for binary classification.

Shortcomings:

  1. In deep neural networks, backpropagating gradients through sigmoid can cause gradient explosion or gradient vanishing; the probability of explosion is small, while vanishing is much more likely. Concretely: if the network weights are initialized in [0, 1], the derivation of backpropagation shows that the gradient is multiplied by at most 0.25 each time it passes backward through a sigmoid layer, so with many hidden layers the gradient becomes very small and approaches 0, i.e. the vanishing-gradient phenomenon appears (see the sketch after this list). If the weights are initialized in the interval (1, +∞), gradient explosion can occur instead.
  2. The computation involves exponentiation, which takes time to evaluate; for a relatively large deep network this increases training time.
  3. The output of sigmoid is not zero-mean, so the neurons of the next layer receive the non-zero-mean signal of the previous layer as input. One consequence: if x > 0 elementwise, then for f(x) = wᵀx + b the local gradient with respect to w has the same sign in every component, so during backpropagation w is updated either entirely in the positive direction or entirely in the negative direction, a binding (zig-zag) effect that slows convergence.
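
A minimal sketch of the sigmoid and its derivative, illustrating the point in item 1 above: the derivative never exceeds 0.25, which is the best-case factor applied to the gradient at each sigmoid layer during backpropagation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)      # derivative: sigmoid(x) * (1 - sigmoid(x))

x = np.linspace(-10, 10, 1001)
print(sigmoid(x).min(), sigmoid(x).max())   # outputs squashed into (0, 1)
print(sigmoid_grad(x).max())                # 0.25, reached at x = 0
```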

tanh function

The mathematical form is as follows:

tanh(x) = (e^x - e^(-x)) / (e^x + e^(-x))

It can be written in terms of the sigmoid function: tanh(x) = 2·sigmoid(2x) - 1
The image is as follows:
[Figure: tanh curve]
Features:

  1. Partially solves sigmoid's non-zero-centered output problem (tanh is zero-centered); its derivative lies in (0, 1], a larger range than sigmoid's (0, 0.25], so the vanishing-gradient problem is alleviated.
  2. Converges much faster than sigmoid.

Shortcomings:

  1. Exponentiation is involved, so the computational cost is high, and the vanishing-gradient problem still exists.
  2. Some neurons may rarely be updated: when the input is very large or very small, the output is nearly flat and the gradient is tiny, so the corresponding parameters hardly change.
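
A short numerical check (a self-contained sketch with its own sigmoid definition) of the identity tanh(x) = 2·sigmoid(2x) - 1 quoted above, and of the derivative ranges compared in the feature list:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.linspace(-5, 5, 1001)

# Identity from the text: tanh(x) = 2 * sigmoid(2x) - 1
print(np.allclose(np.tanh(x), 2 * sigmoid(2 * x) - 1))   # True

# Derivative ranges: tanh' = 1 - tanh^2 peaks at 1, sigmoid' peaks at 0.25
print((1 - np.tanh(x) ** 2).max())             # 1.0 at x = 0
print((sigmoid(x) * (1 - sigmoid(x))).max())   # 0.25 at x = 0
```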

ReLU function

It is a maximum value function and is currently the most commonly used activation function.

The mathematical form is as follows:

ReLU(x) = max(0, x)

The image is as follows:
[Figure: ReLU curve]

Features:

  1. When the input is positive, there is no gradient saturation problem.
  2. Computation is much faster: ReLU involves only a simple linear threshold (no exponentials), so it is faster to evaluate than the previous two functions.

Shortcomings:

  1. Dead ReLU problem: when the input is negative, ReLU outputs zero. This is not an issue in the forward pass, but during backpropagation the gradient for negative inputs is exactly 0, so the affected neurons stop learning (see the sketch after this list). The sigmoid and tanh functions suffer from a related saturation problem.
  2. ReLU is not a zero-centered function.
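
A minimal sketch of ReLU and its gradient, showing the "dead" region where negative inputs receive exactly zero gradient (treating the gradient at x = 0 as 0 is a common convention, assumed here):

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def relu_grad(x):
    # Subgradient: 1 for positive inputs, 0 for negative inputs (the dead region).
    return (x > 0).astype(float)

x = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])
print(relu(x))        # [0.  0.  0.  0.5 3. ]
print(relu_grad(x))   # [0. 0. 0. 1. 1.] -> no learning signal for x <= 0
```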

Leaky ReLU


The mathematical form is as follows:

LeakyReLU(x) = x      if x > 0
LeakyReLU(x) = αx     if x ≤ 0   (α is a small constant, typically around 0.01)

The image is as follows:
[Figure: Leaky ReLU curve]

Leaky ReLU is an activation function designed specifically to solve the Dead ReLU problem: it gives negative inputs a very small linear component (e.g. 0.01x) instead of zero, which removes the zero-gradient region for negative values. The leak expands the range of the ReLU function; the slope is usually around 0.01, and the range of Leaky ReLU is (-∞, +∞). In theory, Leaky ReLU has all the advantages of ReLU and does not suffer from the Dead ReLU problem, but in practice it has not been conclusively shown that Leaky ReLU is always better than ReLU.

PReLU function

The mathematical form is as follows:

PReLU(x) = x      if x > 0
PReLU(x) = αx     if x ≤ 0   (α is a learnable parameter)

The function image is as follows:
[Figure: PReLU curve]

The PReLU function is also intended to solve the Dead ReLU problem caused by ReLU. Unlike Leaky ReLU, the slope parameter α of the negative half-axis of PReLU is learned during training rather than set by hand as a constant.
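
The two variants differ only in where the negative-side slope α comes from: a fixed hand-chosen constant for Leaky ReLU, a parameter learned by backpropagation for PReLU. A small sketch (the values 0.01 and 0.25 are illustrative):

```python
import numpy as np

def leaky_relu(x, alpha=0.01):
    # Leaky ReLU: alpha is a fixed constant, typically around 0.01.
    return np.where(x > 0, x, alpha * x)

def prelu(x, alpha):
    # PReLU: same formula, but alpha would be a parameter updated during training.
    return np.where(x > 0, x, alpha * x)

x = np.array([-2.0, -0.1, 0.0, 0.1, 2.0])
print(leaky_relu(x))          # negative inputs scaled by 0.01 instead of zeroed
print(prelu(x, alpha=0.25))   # e.g. a slope of 0.25 learned for this layer
```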

ELU function

The mathematical form is as follows:

ELU(x) = x               if x > 0
ELU(x) = α(e^x - 1)      if x ≤ 0

The function image is as follows:
[Figure: ELU curve]

Unlike the Leaky ReLU and PReLU activation functions, the negative half-axis of ELU is an exponential curve rather than a straight line.

Under what circumstances is it suitable to use ELU?

  1. ELU pushes the mean of the activation outputs toward zero, which makes the normal gradient closer to the unit natural gradient and speeds up learning.

  2. ELU saturates to a negative value for large negative inputs, which reduces the variation and information propagated forward.

Shortcomings:

It requires computing an exponential, so it is computationally less efficient.
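
A minimal ELU sketch showing the exponential negative half and how it saturates toward -α for large negative inputs (α = 1.0 is a common default, assumed here):

```python
import numpy as np

def elu(x, alpha=1.0):
    # x for x > 0, alpha * (exp(x) - 1) for x <= 0
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

x = np.array([-10.0, -2.0, -0.5, 0.0, 2.0])
print(elu(x))   # large negative inputs saturate near -alpha; positive inputs pass through
```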

SELU function

The mathematical form is as follows:

SELU(x) = λx               if x > 0
SELU(x) = λα(e^x - 1)      if x ≤ 0

where λ = 1.0507, α = 1.6733

The function image is as follows:
[Figure: SELU curve]
Advantages:

  1. Internal normalization is faster than external normalization, which means the network can converge faster
  2. No chance of vanishing or exploding gradients

Disadvantages:
This activation function is relatively new, and more papers are needed to comparatively explore its application in architectures such as CNN and RNN.
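
A sketch of SELU with the constants quoted above, plus a quick numerical check of the self-normalizing idea: feeding in standard-normal values yields activations with mean close to 0 and variance close to 1. This check is illustrative only, not a proof of the convergence claims.

```python
import numpy as np

LAMBDA, ALPHA = 1.0507, 1.6733   # constants quoted in the text

def selu(x):
    return LAMBDA * np.where(x > 0, x, ALPHA * (np.exp(x) - 1.0))

rng = np.random.default_rng(0)
z = rng.standard_normal(1_000_000)   # inputs with mean 0, variance 1
a = selu(z)
print(a.mean(), a.var())             # both come out close to 0 and 1 respectively
```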

GELU function

The GELU (Gaussian Error Linear Unit) activation function has recently been used in Transformer models (Google's BERT and OpenAI's GPT-2).
The mathematical form (tanh approximation) is as follows:

GELU(x) ≈ 0.5x(1 + tanh(√(2/π)(x + 0.044715x³)))

As the formula shows, this is a combination of the hyperbolic tangent function tanh with an approximation; the exact definition is GELU(x) = x·Φ(x), where Φ is the standard normal cumulative distribution function.

The function image is as follows:
[Figure: GELU curve]
As the plot shows, when x is greater than 0 the output is close to x, except that in the interval (0, 1) the curve bends slightly toward the y-axis.

Advantages:

  1. Seems to be the current best in NLP, especially in Transformer models
  2. Can avoid the gradient vanishing problem
  3. Although it was proposed in 2016, it is still a fairly novel activation function in practical applications
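
A sketch comparing the tanh approximation above with the exact definition GELU(x) = x·Φ(x), with Φ built from the error function (this assumes SciPy is available for erf):

```python
import numpy as np
from scipy.special import erf   # used to build the standard normal CDF

def gelu_exact(x):
    return x * 0.5 * (1.0 + erf(x / np.sqrt(2.0)))   # x * Phi(x)

def gelu_tanh(x):
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

x = np.linspace(-5, 5, 1001)
print(np.abs(gelu_exact(x) - gelu_tanh(x)).max())   # tiny gap: the approximation is very close
```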

Below is a comparison of the curves of the major activation functions discussed above:

[Figure: curves of the major activation functions]
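
A minimal matplotlib sketch to regenerate a comparison plot like the one referenced above, using the same simple definitions as the earlier snippets (the chosen x-range and the 0.01 Leaky ReLU slope are illustrative):

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(-4, 4, 401)
funcs = {
    "sigmoid":    1 / (1 + np.exp(-x)),
    "tanh":       np.tanh(x),
    "ReLU":       np.maximum(0, x),
    "Leaky ReLU": np.where(x > 0, x, 0.01 * x),
    "ELU":        np.where(x > 0, x, np.exp(x) - 1),
    "GELU":       0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3))),
}

for name, y in funcs.items():
    plt.plot(x, y, label=name)
plt.legend()
plt.grid(True)
plt.title("Common activation functions")
plt.show()
```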

Reference articles:
Summary of 10 common activation functions (Activation Function) in deep learning
From ReLU to GELU, an overview of the activation functions of neural networks


Original post: blog.csdn.net/qq_45746168/article/details/129342916