Study Notes: Deep Learning (1) - Basic Concepts and Activation Functions

Study time: 2022.04.08~2022.04.09

1. Basic concept of neural network

1.1 What is a neural network

An Artificial Neural Network (ANN), or neural network for short, is a mathematical model that imitates the behavioral characteristics of biological neural networks and performs distributed, parallel information processing.

Such a network processes information by adjusting the interconnections among a large number of internal nodes, and its behavior depends on the complexity of the system.

In a biological neural network, the structure of a single neuron:

[Figure: structure of a single biological neuron]

1.2 From neuron model to perceptron model

1.2.1 MP neuron model

In 1943, psychologist Warren McCulloch and mathematician Walter Pitts jointly proposed the McCulloch-Pitts (MP) neuron model, based on biological neural networks. The basic idea is to abstract and simplify the characteristic components of biological neurons:

  • Each neuron is an information processing unit with multiple input and single output;
  • There are two types of neuron input: excitatory input and inhibitory input;
  • Neurons have spatial integration characteristics and threshold characteristics;
  • There is a fixed time lag between neuron input and output, mainly determined by synaptic delay;
  • Temporal integration and the refractory period are neglected;
  • The neuron itself is time-invariant, that is, its synaptic delay and synaptic strength are constant.

The McCulloch-Pitts model formula is as follows:
$$O_j(t+1) = f\left\{\left[\sum_{i=1}^{n}\omega_{ij}\chi_i(t)\right] - T_j\right\}$$
$O_j$ is the output signal;

$\chi_i$ is the input signal applied to the $i$-th input terminal (synapse), $i = 1, 2, \dots, n$;

$\omega_{ij}$ is the corresponding synaptic connection weight coefficient, a proportional coefficient that models the strength of synaptic transmission;

$\sum_{i=1}^{n}$ represents the spatial summation of the postsynaptic signals;

$T_j$ is the threshold of the neuron;

$f$ is the response function of the neuron. Its roles are: ① gate whether the input activates the output; ② transform between input and output; ③ map an input from a possibly unbounded domain into a bounded output range.

Operating rule: time is discrete, $t = 0, 1, 2, \dots$. At time $t$ the neuron receives excitatory inputs $\chi_i$; if the membrane potential $\sum_{i=1}^{n}\omega_{ij}\chi_i(t)$ is equal to or greater than the threshold $T_j$ and the inhibitory input is 0, then at time $t+1$ the neuron output $O_j(t+1)$ is 1; otherwise it is 0.
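As an illustration only, here is a minimal NumPy sketch of the rule above; the weights, threshold, and inputs are made-up values for demonstration:

```python
import numpy as np

def mp_neuron(x, w, threshold, inhibitory_active=False):
    """McCulloch-Pitts neuron: output 1 at time t+1 if the weighted sum of
    excitatory inputs reaches the threshold and no inhibitory input fires."""
    if inhibitory_active:           # any active inhibitory input forces the output to 0
        return 0
    return int(np.dot(w, x) >= threshold)

# Example: a 3-input neuron with unit weights and threshold 2 (an AND-like gate)
x = np.array([1, 1, 0])
w = np.array([1, 1, 1])
print(mp_neuron(x, w, threshold=2))  # 1, since 1 + 1 + 0 >= 2
```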

[Figure: MP neuron model]

1.2.2 Perceptron model

However, the MP model lacks a learning mechanism, which is crucial to artificial intelligence. Such a mechanism was described by "Hebb's law" (Donald Hebb, The Organization of Behavior, 1949): roughly speaking, the more often two nerve cells communicate, the more efficient their connection becomes, and vice versa.

Inspired by Hebb's foundational work, psychologist Frank Rosenblatt, working at the Cornell Aeronautical Laboratory, proposed the "Perceptron" model in 1957.

[Figure: perceptron model]

This is the first model to define a neural network precisely in algorithmic terms. The perceptron consists of two layers of neurons: the input layer receives external signals, and the output layer is made up of MP neurons, i.e., threshold logic units, also known as Processing Elements (PE).

Rosenblatt gave a simple and intuitive learning scheme for the perceptron. Given a training set of input-output examples, the perceptron "learns" a function: for each example, if the perceptron's output is too low compared with the example's target, increase its weights; if it is too high, decrease them. The algorithm is as follows (a minimal code sketch follows the list):

  1. Initialize the weight coefficient;

  2. For an input value of an instance in the training set, compute the output value of the perceptron;

  3. If the output value of the perceptron is different from the default correct output value in the instance:

    (1) If the output value should be 0 but is actually 1, reduce the weights connected to inputs whose value is 1;

    (2) If the output value should be 1 but is actually 0, increase the weights connected to inputs whose value is 1;

  4. Do the same for the next example in the training set, repeat steps 2~3 until the perceptron no longer makes mistakes;

  5. Feed the weighted sum to the activation function (or transfer function) to turn it into the output signal.

    The purpose of introducing the activation function is to introduce nonlinearity into the model. Without an activation function, each layer is equivalent to a matrix multiplication; even after stacking several layers, the result is still nothing more than a matrix multiplication.
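A minimal sketch of the update rule described above, assuming binary inputs and targets, a simple threshold output, and an illustrative learning rate and dataset:

```python
import numpy as np

def train_perceptron(X, y, lr=1.0, epochs=10):
    """Rosenblatt-style updates: raise weights when the output is too low,
    lower them when it is too high, until no mistakes remain."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        errors = 0
        for xi, target in zip(X, y):
            out = int(np.dot(w, xi) + b > 0)       # step activation
            if out != target:
                w += lr * (target - out) * xi      # +xi if output too low, -xi if too high
                b += lr * (target - out)
                errors += 1
        if errors == 0:                            # stop once no more mistakes are made
            break
    return w, b

# Learn the (linearly separable) OR function
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 1, 1, 1])
w, b = train_perceptron(X, y)
print(w, b)
```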


A perceptron is essentially a tool for making decisions through weighted computation functions.

A single-layer perceptron, a multi-layer perceptron, and a multi-layer perceptron with one hidden layer are shown below:

[Figures: single-layer perceptron; multi-layer perceptron; multi-layer perceptron with one hidden layer]

1.3 Activation function

1.3.1 Definition of activation function

The activation function (Activation Function) activates a certain part of the neurons in the neural network during operation, and transmits the activation information to the next layer of the neural network.

Each neuron node in the network takes the output values of the neurons in the previous layer as its input and passes its own output on to the next layer; input-layer nodes pass the input attribute values directly to the next layer (hidden or output layer). In a multi-layer neural network, there is a functional relationship between the output of an upper-layer node and the input of a lower-layer node; this function is called the activation function (also known as the transfer function).


1.3.2 Why use activation function?

Using non-linear activation functions, it is possible to generate non-linear mappings from input to output.

Each layer of a neural network computes a linear (weighted) sum of its inputs, so without an activation function the output of each layer is only a linear transformation of the previous layer's output. No matter how complex the network is or how many layers it has, the final output would just be a linear combination of the inputs, and a purely linear model cannot solve more complex problems. Common activation functions are nonlinear, so introducing them injects nonlinearity into the neurons; this allows the neural network to approximate arbitrary nonlinear functions and to be applied to many more nonlinear models.
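A quick numerical illustration of this point: two stacked layers without an activation function collapse into a single matrix multiplication (the random matrices here are only for demonstration):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(4,))          # input vector
W1 = rng.normal(size=(5, 4))       # "layer 1" weights
W2 = rng.normal(size=(3, 5))       # "layer 2" weights

two_layers = W2 @ (W1 @ x)         # two linear layers, no activation in between
one_layer = (W2 @ W1) @ x          # a single equivalent linear layer
print(np.allclose(two_layers, one_layer))  # True: depth adds nothing without nonlinearity
```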

1.3.3 The role of the activation function

  • Increase the nonlinear segmentation capability of the model;

  • Improve model robustness: it can fit data in various situations;

  • Alleviate the problem of Vanishing Gradients;

    If the input fed into Sigmoid is particularly large or small, the corresponding local gradient is approximately 0; even if the gradient passed back from the previous step is large, the gradients of the neuron's weight (w) and bias will approach 0, and the parameters cannot be effectively updated.

    Gradient explosion: the gradient is the direction and magnitude computed during training, which the neural network uses to update its weights. In deep networks or recurrent neural networks, gradient errors can accumulate during updates, resulting in very large gradients. This in turn leads to massive updates of the network weights and makes the network unstable. In extreme cases, weight values become so large that they overflow and produce NaN values; this is gradient explosion.

  • Speed ​​up model convergence, etc.

    Neuron death: although the ReLU function improves computational efficiency, it can also hinder training. The input to the activation function usually includes a bias term; if the bias becomes so negative that the input to the activation function is always negative, then the gradient flowing back through this point is always 0, and the corresponding weight and bias parameters can no longer be updated. If the activation function's input is negative for all training samples, the neuron can no longer learn; this is the neuron "death" problem.

1.3.4 Classification of activation functions

Activation functions are mainly divided into saturated activation functions (Saturated) and non-saturated functions (One-sided / Non-Saturated).

  • Suppose $h(x)$ is an activation function:

    • When $x$ approaches negative infinity and the derivative of the activation function approaches 0 ($\lim_{x\to-\infty} h'(x) = 0$), the activation function is left-saturated; when $x$ approaches positive infinity and the derivative approaches 0 ($\lim_{x\to+\infty} h'(x) = 0$), it is right-saturated. A function that is both left-saturated and right-saturated is a saturated function.

    • A function that does not satisfy the conditions for a saturated function is called a non-saturated activation function.

[Figure: saturated vs. non-saturated activation functions]
Saturated activation function
1. Sigmoid function

The Sigmoid function, also called the Logistic function, has been widely used, but due to some of its own defects, it is rarely used now.

The expression and graph of the function are as follows:
$$f(x) = \frac{1}{1+e^{-x}}$$
[Figure: Sigmoid function curve]
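As a quick numerical check of the formula (and of the derivative $f'(x) = f(x)(1-f(x))$ mentioned below), a minimal NumPy sketch with illustrative inputs:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)            # f'(x) = f(x)(1 - f(x)), maximum 0.25 at x = 0

x = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])
print(sigmoid(x))        # large negatives -> ~0, large positives -> ~1
print(sigmoid_grad(x))   # ~0 at the tails (saturation), 0.25 at x = 0
```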

  • advantage:
    • It maps a continuous real-valued input to an output between 0 and 1; in particular, very large negative inputs give outputs close to 0 and very large positive inputs give outputs close to 1;
    • Its output is smooth and stable, so it is suitable for use in the output layer;
    • Derivation is easy: $f'(x) = f(x)(1-f(x))$; the derivative reaches its maximum value of 0.25 at $x = 0$.
  • shortcoming:
    • Its output is not centered on 0;
    • When the input is very large or very small, the derivative is very small, which makes parameter updates very slow;
    • Its analytical form contains an exponential operation, which is relatively time-consuming for a computer to evaluate;
    • During backpropagation in deep networks it can easily cause gradient explosion and vanishing gradients; the probability of gradient explosion is very small, while vanishing gradients are relatively likely.
2. TanH function

The TanH (hyperbolic tangent) activation function solves the problem of the Sigmoid function's non-zero-centered output, but the problems of vanishing gradients and exponential operations still remain.

The expression and graph of the function are as follows:
$$\tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}} = \frac{2}{1+e^{-2x}} - 1$$
[Figure: TanH function curve]

We can see that the Tanh function can be regarded as an enlarged and shifted Sigmoid (Logistic) function, with range (−1, 1). The relationship between Tanh and Sigmoid is as follows:
$$f(x) = 2\,\mathrm{sigmoid}(2x) - 1$$
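A quick numerical check of this relationship and of the derivative $f'(x) = 1-(f(x))^2$, using illustrative inputs:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.linspace(-3.0, 3.0, 7)
t = np.tanh(x)
print(np.allclose(t, 2 * sigmoid(2 * x) - 1))   # True: tanh is a rescaled, shifted sigmoid
print(1 - t ** 2)                               # derivative f'(x) = 1 - f(x)^2, largest at x = 0
```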

  • advantage:
    • Converges faster than the Sigmoid function;
    • Compared with the Sigmoid function, its output is centered on 0;
    • Derivation is easy: $f'(x) = 1-(f(x))^2$.
  • shortcoming:
    • The activation function involves a large amount of computation, including exponential operations;
    • It still does not fix the biggest problem of the Sigmoid function: vanishing gradients caused by saturation.
  • Note: In a typical binary classification problem, the tanh function is used for the hidden layers and the sigmoid function for the output layer, but this is not fixed and should be tuned to the specific problem.
3. Softmax function

Softmax is a "soft" (smoothed) version of max. Softmax is the activation function used for multi-class classification problems, in which class membership must be assigned over more than two labels.

For any real vector of length K, Softmax maps it to a real vector of length K whose elements lie in the range (0, 1) and sum to 1.

The expression and image of the function are as follows:

$$S_i = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}$$

where $z_i$ and $z_j$ are the output values of the $i$-th and $j$-th nodes, and $K$ is the number of classes.

[Figure: Softmax function illustration]
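A minimal sketch of the formula; subtracting the maximum before exponentiating is a standard trick to avoid overflow and does not change the result:

```python
import numpy as np

def softmax(z):
    z = z - np.max(z)                # shift for numerical stability; the output is unchanged
    e = np.exp(z)
    return e / np.sum(e)

logits = np.array([2.0, 1.0, 0.1])   # illustrative node outputs z_i
p = softmax(logits)
print(p)            # approximately [0.659, 0.242, 0.099]
print(p.sum())      # 1.0: the outputs form a probability distribution
```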
  • advantage:
    • The denominator of the Softmax function combines all factors of the original output value, which means that various probabilities obtained by the Softmax function are related to each other;
  • shortcoming:
    • The exponential operations are computationally expensive, and very large inputs can overflow in practice (a common remedy is to subtract the maximum input value before exponentiating, which leaves the result unchanged);
    • It is normally used only in the output layer of a classifier, not as a hidden-layer activation.
non-saturated activation function
1. ReLU function

The ReLU (Rectified Linear Unit) function is a piecewise linear function that alleviates the vanishing gradient problem of the Sigmoid and TanH functions and is widely used in today's deep neural networks. It is essentially a ramp function.

The ReLU function is simply a maximum function. Note that it is not differentiable everywhere (at x = 0), but we can use a sub-gradient there. Although ReLU is simple, it is an important achievement of recent years.

The expression and graph of the function are as follows:
$$f(x) = \max(0, x) = \begin{cases}0 & x\le 0\\ x & x>0\end{cases}$$
[Figure: ReLU function curve]
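A minimal sketch of ReLU and the sub-gradient typically used at x = 0 (taking the gradient to be 0 there is a common convention, not the only one):

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def relu_grad(x):
    return (x > 0).astype(float)     # sub-gradient: 1 for x > 0, 0 otherwise (0 chosen at x = 0)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(x))       # [0.  0.  0.  0.5 2. ]
print(relu_grad(x))  # [0. 0. 0. 1. 1.]
```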

  • advantage:
    • Compared with Sigmoid and tanh, ReLU can quickly converge in SGD;
    • Sigmoid and tanh involve many expensive operations (such as exponentials), whereas ReLU can be implemented much more simply;
    • Since the gradient is always 1 when x > 0, the vanishing gradient and exploding gradient problems are effectively alleviated;
    • It can also perform better without unsupervised pre-training;
    • Provides the sparse expression ability of the neural network (Relu will make the output of some neurons 0, which causes the sparsity of the network, reduces the interdependence of parameters, and alleviates the occurrence of overfitting problems);
    • It is considered to have biological plausibility (Biological Plausibility), such as unilateral inhibition, wide excitation boundary (that is, the degree of excitation can be very high).
  • shortcoming:
    • The output of ReLU is not zero-centered;
    • As training progresses, neurons may die and their weights can no longer be updated; once this happens, the gradient flowing through the neuron will always be 0 from that point on. In other words, ReLU neurons can die irreversibly during training;
    • Dead ReLU Problem refers to the fact that some neurons may never be activated, causing the corresponding parameters to never be updated.
2. LReLU function, PReLU function

To address the Dead ReLU problem and vanishing gradients, it was proposed to set the negative half of ReLU (when x < 0) to $\gamma x$ instead of 0 (usually $\gamma = 0.01$), giving the LReLU function (Leaky ReLU).

The expression of the function is as follows:
$$f(x) = \max(\gamma x, x) = \begin{cases}\gamma x & x\le 0\\ x & x>0\end{cases}$$
LReLU is an extension of ReLU that addresses the problems above. It can also be extended from another angle: instead of multiplying x by a fixed constant, x is multiplied by a learnable coefficient, which appears to work better than LReLU. This extension is the PReLU function (Parametric ReLU), a parametric ReLU. Here $\gamma_i$ is the coefficient that gives the slope of the function for $x \le 0$.

A learnable coefficient is introduced and updated by backpropagation. Different neurons can have different coefficients, where $i$ corresponds to the $i$-th neuron; this lets each neuron choose the best slope in the negative region, so it can behave like ReLU or Leaky ReLU. If $\gamma_i = 0$, PReLU degenerates into ReLU; if $\gamma_i$ is a small constant, PReLU can be regarded as Leaky ReLU. PReLU allows different neurons to have different coefficients, or a group of neurons can share one coefficient.

The expression of the function is as follows:
$$f(x) = \max(\gamma_i x, x) = \begin{cases}\gamma_i x & x\le 0\\ x & x>0\end{cases}$$
The function image is as follows:

[Figure: Leaky ReLU / PReLU curves]
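A minimal sketch of both variants: in Leaky ReLU the slope γ is a fixed constant (commonly 0.01), while in PReLU it is a learnable parameter updated by backpropagation; here it is simply passed in for illustration:

```python
import numpy as np

def leaky_relu(x, gamma=0.01):
    return np.where(x > 0, x, gamma * x)       # fixed small slope on the negative side

def prelu(x, gamma):
    """gamma may be a scalar shared by a group of neurons or a per-neuron array."""
    return np.where(x > 0, x, gamma * x)       # same form, but gamma is learned during training

x = np.array([-3.0, -1.0, 0.0, 1.0, 3.0])
print(leaky_relu(x))         # [-0.03 -0.01  0.    1.    3.  ]
print(prelu(x, gamma=0.25))  # [-0.75 -0.25  0.    1.    3.  ]
```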
  • advantage:
    • The leaky slope expands the output range of ReLU: the range of Leaky ReLU is (−∞, +∞);
  • shortcoming:
    • Although Leaky ReLU has all the characteristics of the ReLU activation function (such as efficient calculation, fast convergence, and no saturation in the positive area), it does not fully prove that Leaky ReLU is always better than ReLU in actual operation;
    • In many cases, it is better to use ReLU, but you can experiment with Leaky ReLU or Parametric ReLU to see if one is better for your problem.
3. ELU function, SELU function

ELU (Exponential Linear Unit) was also proposed to address the problems in the negative part of ReLU. It was proposed by Djork-Arné Clevert et al. and has been shown to be highly robust to noise. For inputs x less than zero, the ELU activation produces its output through an exponential calculation. Compared with ReLU, ELU has negative values, which pushes the mean of the activations closer to zero. Mean activations close to zero make learning faster because they bring the gradient closer to the natural gradient.

The expression and image of the function are as follows:
$$f(x) = \begin{cases}\alpha(e^x-1) & x\le 0\\ x & x>0\end{cases}$$
[Figure: ELU function curve]

The SNN in the Self-Normalizing Neural Networks (SNNs) paper is based on the scaled exponential linear unit (SELU), which induces self-normalizing properties (such as variance stabilization) and thereby avoids exploding and vanishing gradients. The SELU function simply multiplies the ELU function by a coefficient $\lambda$.

The expression and graph of the function are as follows:
$$f(x) = \lambda\cdot \mathrm{ELU}(x) = \lambda\begin{cases}\alpha(e^x-1) & x\le 0\\ x & x>0\end{cases}$$
[Figure: SELU function curve]
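A minimal sketch of both functions; the SELU constants below (λ ≈ 1.0507, α ≈ 1.6733) are the approximate values published in the SNN paper, and the inputs are illustrative:

```python
import numpy as np

def elu(x, alpha=1.0):
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

def selu(x, lam=1.0507, alpha=1.6733):
    return lam * np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

x = np.array([-3.0, -1.0, 0.0, 1.0, 3.0])
print(elu(x))    # negative inputs saturate toward -alpha instead of dying at 0
print(selu(x))   # scaled version used by self-normalizing networks
```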

  • advantage:
    • ELU has no Dead ReLU problem, and the average value of the output is close to 0, centered on 0;
    • By reducing the effect of bias shift, ELU brings the ordinary gradient closer to the unit natural gradient, pushing the mean activation toward zero and accelerating learning;
    • ELU saturates to a negative value for large negative inputs, which reduces the variation and information that is propagated forward.
  • shortcoming:
    • The calculation intensity is higher and the calculation amount is large. At present, there is no sufficient evidence in practice that ELU is always better than ReLU.
4. Swish function

The Swish activation function is also called a self-gated activation function. $\beta$ is a learnable parameter or a fixed hyperparameter, and $\mathrm{sigmoid}(\beta x) \in (0,1)$ can be regarded as a soft gating mechanism. When $\mathrm{sigmoid}(\beta x)$ is close to 1, the gate is "open" and the output of the activation function is approximately x itself; when $\mathrm{sigmoid}(\beta x)$ is close to 0, the gate is "closed" and the output is close to 0.

When $\beta = 0$, the Swish function reduces to the linear function $\frac{x}{2}$; when $\beta = 1$, Swish is approximately linear for $x > 0$ and approximately saturated for $x < 0$, with a certain non-monotonicity; as $\beta$ tends to positive infinity, $\mathrm{sigmoid}(\beta x)$ tends to a discrete 0–1 step function and Swish approximates the ReLU function. The Swish function can therefore be regarded as a nonlinear interpolation between the linear function and ReLU, with the degree of interpolation controlled by the parameter $\beta$.

The expression and graph of the function are as follows:
$$f(x) = x\cdot \mathrm{sigmoid}(\beta x) = \frac{x}{1+e^{-\beta x}}$$
[Figure: Swish function curves for different β]
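A minimal sketch illustrating the limiting cases discussed above (β = 0 gives x/2; a large β approaches ReLU); the inputs and β values are illustrative:

```python
import numpy as np

def swish(x, beta=1.0):
    return x / (1.0 + np.exp(-beta * x))       # x * sigmoid(beta * x)

x = np.array([-3.0, -1.0, 0.0, 1.0, 3.0])
print(swish(x, beta=0.0))    # equals x / 2
print(swish(x, beta=1.0))    # smooth and slightly non-monotonic for negative x
print(swish(x, beta=50.0))   # approaches ReLU
```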

  • advantage:
    • Like ReLU, Swish is unbounded above and bounded below; compared with ReLU, Swish adds smoothness and non-monotonicity, which gives it better performance on ImageNet.
  • shortcoming:
    • Exponential functions are introduced to increase the amount of computation.
5. Mish function

A paper by Diganta Misra titled "Mish: A Self Regularized Non-Monotonic Neural Activation Function" introduces a new deep learning activation function, the Mish activation function, which in the paper's benchmarks improves on Swish (+0.494%) and ReLU (+1.671%).

The expression and graph of the function are as follows:
$$f(x) = x\cdot\tanh(\ln(1+e^x))$$
[Figure: Mish function curve]
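A minimal sketch of the formula; $\ln(1+e^x)$ is the Softplus function discussed below, written here with np.log1p for numerical safety, and the inputs are illustrative:

```python
import numpy as np

def mish(x):
    softplus = np.log1p(np.exp(x))     # ln(1 + e^x)
    return x * np.tanh(softplus)

x = np.array([-3.0, -1.0, 0.0, 1.0, 3.0])
print(mish(x))   # small negative inputs keep small negative outputs (non-monotonic)
```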

  • advantage:
    • Unbounded above: this prevents saturation and hence vanishing gradients;
    • Bounded below: this improves the regularization effect of the network;
    • Smooth: first, unlike ReLU, Mish is continuous and smooth at 0, which avoids some unpredictable problems; second, it makes the network easier to optimize and improves generalization performance;
    • Non-monotonic: small negative inputs can be retained as negative outputs, improving the expressiveness and gradient flow of the network.
  • shortcoming:
    • Exponential functions are introduced to increase the amount of computation.
6. Softplus function

The Softplus function is an antiderivative of the Sigmoid function; that is, the derivative of the Softplus function is the Sigmoid function. Softplus can be seen as a smooth version of the ReLU function.

The expression and image of the function are as follows:
$$f(x) = \ln(1+e^x),\qquad f'(x) = \mathrm{sigmoid}(x)$$
[Figure: Softplus and ReLU curves]

The "+1" inside the logarithm of the Softplus function ensures non-negativity. Softplus can be seen as a smooth version of the hard non-negative rectifier max(0, x); in the figure, the red curve is ReLU.

The derivative of the Softplus function is exactly the Logistic function. Although Softplus also has one-sided suppression and a wide excitation boundary, it does not produce sparse activations.
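A quick numerical check that the derivative of Softplus is the Sigmoid function; the finite-difference comparison is only for illustration:

```python
import numpy as np

def softplus(x):
    return np.log1p(np.exp(x))          # ln(1 + e^x)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.linspace(-5.0, 5.0, 1001)
numeric_grad = np.gradient(softplus(x), x)               # finite-difference derivative
print(np.allclose(numeric_grad, sigmoid(x), atol=1e-3))  # True: d/dx softplus(x) = sigmoid(x)
```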

7. MaxOut function

Maxout can be seen as adding an extra activation-function layer to a deep network, with a parameter k. Compared with ReLU, Sigmoid, etc., this layer is special in that it adds k linear "pieces" (neurons) and then outputs the largest of their activation values. ("Maxout Networks", Goodfellow et al., ICML 2013)

Suppose the input feature vector of a certain layer of the network is $X = (x_1, x_2, \dots, x_d)$, that is, the input consists of $d$ neurons. Each neuron of the Maxout hidden layer is computed as follows:
$$h_i(x) = \max_{j\in[1,k]} z_{ij}$$
This is the computation for the $i$-th neuron of the Maxout hidden layer. Here $k$ is the number of linear pieces per Maxout unit, which we set by hand; just as dropout has its own parameter $p$ (the dropout probability of each neuron), Maxout's parameter is $k$. The value $z$ in the formula is computed as:
$$z_{ij} = x^T W_{\cdot ij} + b_{ij}$$
The weight $W$ is a three-dimensional tensor of size $(d, m, k)$ and $b$ is a two-dimensional matrix of size $(m, k)$; these are the parameters we need to learn. If we set $k = 1$, the network reduces to the ordinary MLP we have seen before.

A common hidden-layer node output is: $h_i(x) = \mathrm{sigmoid}(x^T W_{\cdot i} + b_i)$

In a traditional MLP there is only one set of parameters between layer $i$ and layer $i+1$. With Maxout, the usual activation function such as ReLU or sigmoid is dropped; instead, $k$ sets of parameters $W$ and $b$ are introduced, and the $z$ with the largest value is taken as the activation of the next layer of neurons. In this way, $\max_{j\in[1,k]} z_{ij}$ plays the role of the activation function.

[Figure: Maxout network structure]
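A minimal sketch of a Maxout forward pass using the shapes described above (d inputs, m output units, k pieces); the random weights are only for demonstration:

```python
import numpy as np

def maxout_forward(x, W, b):
    """x: (d,), W: (d, m, k), b: (m, k). Returns (m,): the max over the k linear pieces."""
    z = np.einsum('d,dmk->mk', x, W) + b    # z[i, j] = x^T W[:, i, j] + b[i, j]
    return z.max(axis=1)                    # h_i(x) = max_j z_{ij}

rng = np.random.default_rng(0)
d, m, k = 4, 3, 2
x = rng.normal(size=d)
W = rng.normal(size=(d, m, k))
b = rng.normal(size=(m, k))
print(maxout_forward(x, W, b).shape)        # (3,): one output per Maxout unit
```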
  • advantage:
    • Maxout has a very strong fitting ability and can fit any convex function;
    • Maxout has all the advantages of ReLU (linearity, non-saturation) without some of ReLU's disadvantages, such as neuron death;
    • The experimental results show that the combination of Maxout and Dropout can play a better effect.
  • shortcoming:
    • As can be seen from the formula above, each Maxout neuron has k sets of (W, b) parameters, so the number of parameters is multiplied by k, which leads to a surge in the total number of parameters.
8. Other functions
  • Step function: $f(x) = \begin{cases}0 & x\le 0\\ 1 & x>0\end{cases}$
  • Sgn (sign) function: $f(x) = \begin{cases}-1 & x\le 0\\ 1 & x>0\end{cases}$
  • Linear function: $f(x) = x$
  • Ramp (saturated linear) function: $f(x) = \begin{cases}0 & x<0\\ x & 0\le x\le 1\\ 1 & x>1\end{cases}$
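For completeness, a minimal NumPy sketch of these simple functions, with illustrative inputs:

```python
import numpy as np

def step(x):                         # 0 for x <= 0, 1 for x > 0
    return (x > 0).astype(float)

def sgn(x):                          # -1 for x <= 0, 1 for x > 0
    return np.where(x > 0, 1.0, -1.0)

def linear(x):                       # identity
    return x

def ramp(x):                         # clip to [0, 1]
    return np.clip(x, 0.0, 1.0)

x = np.array([-1.5, 0.0, 0.5, 2.0])
print(step(x), sgn(x), linear(x), ramp(x))
```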
function summary
[Figure: summary of common activation functions]

1.3.5 How to choose the activation function

  1. Deep learning often requires a lot of time to process a large amount of data, and the convergence speed of the model is particularly important. Therefore, in general, training deep learning networks should use zero-centered data (which can be achieved through data preprocessing) and zero-centered output as much as possible. Therefore, try to choose an activation function with zero-centered characteristics to speed up the convergence speed of the model;

  2. Be careful using the sigmoid function except in binary classification problems;

  3. When the input data features differ significantly, tanh works very well, and the differences between features are continuously amplified during training. In most other cases, however, it performs worse than ReLU and Maxout;

  4. If you don't know which activation function to use, then please choose ReLU first;

  5. If you use ReLU, pay attention to the Dead ReLU problem: choose the learning rate carefully to avoid large gradients killing too many neurons. If the Dead ReLU problem occurs, try Leaky ReLU, PReLU, ELU, Maxout, etc.; they may work well.

1.4 Neural network structure

A neural network generally consists of an input layer, a hidden layer (also called an intermediate layer) and an output layer, where the hidden layer has one or more layers. Each layer can have several nodes. The connection status of nodes between layers is reflected by weights.

  • Input layer: the input terminal of information;
  • Hidden layer: the processing end of information, used to simulate a calculation process;
  • Output layer: the output end of the information, which is the result we want (there can be more than one).
[Figure: neural network structure with input, hidden, and output layers]

If there is only one hidden layer, this is a traditional shallow neural network; with multiple hidden layers, it is a deep neural network.


Origin blog.csdn.net/Morganfs/article/details/124071564