Activation functions

The role of the activation function

Generally speaking, the activation function is a very important part of a neuron. To enhance the representation and learning ability of the network, the activation function of a neural network is nonlinear, and it usually has the following properties:

  • Continuous and differentiable (non-differentiability at a few points is allowed), so that numerical optimization methods can be used directly to learn the network parameters;
  • The activation function and its derivative should be as simple as possible; an overly complex function hurts the computational efficiency of the network;
  • The range of the activation function's derivative should fall within an appropriate interval, neither too large nor too small, otherwise the efficiency and stability of training will suffer.

Basic activation functions

1. Sigmoid function

The Sigmoid function, also called the Logistic function, is used for the output of hidden-layer neurons. Its value range is (0, 1): it maps any real number into the interval (0, 1), so it can be used for binary classification. It works well when the features are complex or the differences between them are not particularly large. Sigmoid is a very common activation function, and its expression is as follows:
$$\sigma(x) = \frac{1}{1 + e^{-x}}$$
(Figure: the S-shaped Sigmoid curve, saturating at 0 for large negative inputs and at 1 for large positive inputs.)

Under what circumstances is it appropriate to use the Sigmoid activation function?

  • The output of the Sigmoid function ranges from 0 to 1. Since the output is bounded between 0 and 1, it normalizes the output of each neuron;
  • It is well suited to models that output predicted probabilities, since probabilities also lie between 0 and 1 (a small sketch follows this list);
  • The gradient is smooth, avoiding "jumps" in the output values;
  • The function is differentiable, so the slope of the Sigmoid curve can be found at any point.
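
As a small illustration of the probability-output use case above, here is a minimal NumPy sketch (the logits are made-up values for illustration, not from the original article) that maps raw binary-classifier scores to probabilities with the Sigmoid and thresholds them at 0.5:

```python
import numpy as np

def sigmoid(x):
    """Sigmoid / Logistic function: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

# Hypothetical raw scores (logits) from the output layer of a binary classifier.
logits = np.array([-3.0, -0.5, 0.0, 1.2, 4.0])

# Sigmoid squashes the logits into (0, 1), so they can be read as P(class = 1).
probs = sigmoid(logits)
labels = (probs >= 0.5).astype(int)   # threshold at 0.5 for the final class label

print(probs)    # approx. [0.047 0.378 0.5   0.769 0.982]
print(labels)   # [0 0 1 1 1]
```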

The shortcomings of the Sigmoid activation function:

  • Gradient vanishing: the rate of change flattens out as the Sigmoid output approaches 0 or 1, i.e. the gradient of the Sigmoid tends to 0. During backpropagation, neurons whose output is close to 0 or 1 (called saturated neurons) therefore receive a gradient close to 0, so their weights are barely updated, and the weights of neurons connected to them are also updated very slowly. This is the vanishing gradient problem: if a large network contains many saturated Sigmoid neurons, it effectively cannot propagate useful gradients backwards.
  • Not zero-centered: the Sigmoid output is always greater than 0. A non-zero-centered output makes the inputs of neurons in the next layer biased (bias shift), which further slows the convergence of gradient descent.
  • Computationally expensive: compared with other nonlinear activation functions, the exp() function is expensive to compute.
  • Possible gradient explosion: the maximum derivative of the Sigmoid is 1/4, and its input is typically wx, so the backpropagated gradient is also multiplied by w; gradient explosion can therefore only occur when |w| > 4 (see the numerical sketch below).
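
To make the saturation and the 1/4 bound concrete, here is a small NumPy sketch (an illustration based on the definitions above, not code from the original post) that evaluates the Sigmoid derivative σ'(x) = σ(x)(1 − σ(x)) at a few points:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    """Derivative of the Sigmoid: sigma'(x) = sigma(x) * (1 - sigma(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)

x = np.array([-10.0, -5.0, 0.0, 5.0, 10.0])
print(sigmoid_grad(x))
# approx. [4.5e-05 6.6e-03 2.5e-01 6.6e-03 4.5e-05]
#
# The derivative peaks at 1/4 (at x = 0) and collapses toward 0 in the
# saturated regions -- the vanishing-gradient behaviour described above.
# Backprop through a Sigmoid multiplies the upstream gradient by at most
# 1/4 * |w|, so |w| must exceed 4 before that factor can be greater than 1,
# which is the |w| > 4 condition for gradient explosion in the last bullet.
```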

2. Tanh (hyperbolic tangent activation function)

The Tanh activation function is similar to the Sigmoid function: it also takes a real-valued input, but it compresses it into the interval from -1 to 1. Unlike Sigmoid, the output of the Tanh function is zero-centered, because its range is (-1, 1).
$$\tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}} = 2\,\sigma(2x) - 1$$
(Figure: the Tanh curve, an S-shaped curve through the origin that saturates at -1 and 1.)

You can think of the Tanh function as a rescaled and shifted Sigmoid (tanh(x) = 2σ(2x) − 1). In practice, the Tanh function is used in preference to the Sigmoid function. Negative inputs are mapped to negative outputs, inputs near zero map close to zero, and positive inputs are mapped to positive outputs:

  • When the input is very large or very small, the output is nearly flat and the gradient is small, which is bad for weight updates. The main difference from Sigmoid is the output interval: Tanh's output interval is (-1, 1) and the whole function is zero-centered, which is better than the Sigmoid function;
  • In the Tanh graph, negative inputs are mapped strongly negative, and inputs near zero are mapped close to zero.

The shortcomings of Tanh:

  • Similar to Sigmoid, the Tanh function also suffers from vanishing gradients, so it likewise "kills" the gradient when saturated (when x is very large or very small);
  • The exponential function it relies on is time-consuming to compute (a small sketch of Tanh follows this list).
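
A short NumPy check (an illustrative sketch, not code from the original article) of the two claims above: Tanh can be written as a rescaled Sigmoid, and its outputs are zero-centered while Sigmoid's are always positive:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.linspace(-3.0, 3.0, 7)   # symmetric sample points around 0

# Tanh expressed through the Sigmoid: tanh(x) = 2 * sigmoid(2x) - 1,
# i.e. Tanh is a rescaled and shifted Sigmoid.
tanh_from_sigmoid = 2.0 * sigmoid(2.0 * x) - 1.0
print(np.allclose(tanh_from_sigmoid, np.tanh(x)))   # True

# Zero-centered outputs: Tanh maps a symmetric input range to values that
# average to ~0, while Sigmoid outputs are always positive.
print(np.tanh(x).mean())     # ~0.0
print(sigmoid(x).mean())     # 0.5, and every individual output is > 0
```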

3. ReLU activation function

The ReLU function, also known as the Rectified Linear Unit, is a piecewise linear function. It alleviates the vanishing gradient problem of the Sigmoid and Tanh functions and is widely used in current deep neural networks. The ReLU function is essentially a ramp function; its formula and graph are as follows:

$$\mathrm{ReLU}(x) = \max(0, x)$$
(Figure: the ReLU curve, 0 for negative inputs and the identity for positive inputs.)

The ReLU function is a popular activation function in deep learning. Compared with the Sigmoid and Tanh functions, it has the following advantages:

  • When the input is positive, the derivative is 1, which alleviates the vanishing gradient problem to a certain extent and accelerates the convergence of gradient descent;
  • It is much faster to compute: ReLU involves only a simple thresholding operation, so it is cheaper to evaluate than Sigmoid and Tanh;
  • It is considered biologically plausible (Biological Plausibility), e.g. unilateral inhibition and a wide excitation boundary (the degree of excitation can be very high).

Disadvantages of the ReLU function:

  • Dead ReLU problem: ReLU outputs 0 whenever the input is negative. This is not a problem during forward propagation (some regions are sensitive, others are not), but during backpropagation the gradient for negative inputs is exactly zero. ReLU neurons are therefore prone to "dying" during training: if, after an unfortunate parameter update, a ReLU neuron is never activated on any training example, the gradient of its parameters stays 0 and it can never be activated during the rest of training. This is called the dead ReLU problem, and it can occur in any hidden layer.
  • Not zero-centered: similar to the Sigmoid activation function, the output of ReLU (0 or a positive number) is not zero-centered, which introduces a bias shift into the next layer and affects the efficiency of gradient descent.
  • ReLU is very sensitive to the learning rate: if it is too large, many neurons can "die" (see the sketch after this list).
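
The sketch below (a minimal NumPy illustration with made-up pre-activation values, not code from the original post) shows the ReLU forward pass, its gradient, and why a neuron whose pre-activation is negative on every example receives no learning signal:

```python
import numpy as np

def relu(x):
    """ReLU(x) = max(0, x): identity for positive inputs, 0 otherwise."""
    return np.maximum(0.0, x)

def relu_grad(x):
    """ReLU derivative: 1 for x > 0, 0 for x < 0 (taken as 0 at x = 0 here)."""
    return (x > 0).astype(float)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(x))        # [0.  0.  0.  0.5 2. ]
print(relu_grad(x))   # [0. 0. 0. 1. 1.]

# Dead ReLU illustration: if a neuron's pre-activation is negative on every
# training example, both its output and its gradient are 0 everywhere, so its
# weights never receive an update and the neuron stays inactive.
pre_activations = np.array([-1.3, -0.2, -4.0, -0.7])   # hypothetical values
print(relu_grad(pre_activations).sum())                # 0.0 -> no learning signal
```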

4. Leaky ReLU

To address the zero-gradient problem of ReLU for x < 0, we can use Leaky ReLU, which tries to fix the dead ReLU problem. Let's take a closer look at Leaky ReLU.
$$\mathrm{LeakyReLU}(x) = \begin{cases} x, & x > 0 \\ \alpha x, & x \le 0 \end{cases} \qquad (\alpha \approx 0.01)$$
(Figure: the Leaky ReLU curve, the identity for x > 0 and a small negative slope for x < 0.)

Why is using Leaky ReLU better than ReLU?

  • Leaky ReLU fixes the zero-gradient problem for negative inputs by giving them a very small linear component of x (e.g. 0.01x), so that for x < 0 the gradient is a small positive constant instead of 0. This alleviates the dead ReLU problem to a certain extent;
  • The leak extends the range of the ReLU function; the value of α is usually around 0.01;
  • The range of Leaky ReLU is (−∞, +∞).

Although Leaky ReLU keeps all the desirable characteristics of ReLU (computational efficiency, fast convergence, no saturation in the positive region), in practice it has not been shown that Leaky ReLU is always better than ReLU.
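
For comparison, here is a minimal Leaky ReLU sketch (illustrative, with α = 0.01 as an assumed default) showing that negative inputs keep a small nonzero gradient:

```python
import numpy as np

def leaky_relu(x, alpha=0.01):
    """Leaky ReLU: x for x > 0, alpha * x otherwise (alpha is typically ~0.01)."""
    return np.where(x > 0, x, alpha * x)

def leaky_relu_grad(x, alpha=0.01):
    """Gradient: 1 for x > 0, alpha otherwise -- never exactly 0 for negative x."""
    return np.where(x > 0, 1.0, alpha)

x = np.array([-4.0, -1.0, 0.0, 1.0, 4.0])
print(leaky_relu(x))        # [-0.04 -0.01  0.    1.    4.  ]
print(leaky_relu_grad(x))   # [0.01 0.01 0.01 1.   1.  ]

# Unlike plain ReLU, a neuron that only sees negative pre-activations still
# passes a small gradient (alpha), so its weights can keep updating.
```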

5. Softmax activation function

Softmax is the activation function used for multi-class classification problems, where class membership must be assigned over more than two labels. For any real vector of length K, Softmax compresses it into a real vector of length K whose values lie in (0, 1) and sum to 1.
$$\mathrm{Softmax}(z)_i = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}, \qquad i = 1, \dots, K$$
Softmax is different from the ordinary max function: max outputs only the largest value, while Softmax ensures that smaller values still receive a smaller probability instead of being discarded outright. We can think of it as the probabilistic or "soft" version of the argmax function.

The denominator of the Softmax function sums over all of the original output values, which means the probabilities produced by Softmax are coupled to each other.
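
Because the denominator couples all the outputs, a minimal NumPy sketch (the logits are made up for illustration) shows the computation, including the common max-subtraction trick used to keep the exponentials numerically stable:

```python
import numpy as np

def softmax(z):
    """Softmax: exp(z_i) / sum_j exp(z_j), computed with the usual
    max-subtraction trick so large logits do not overflow exp()."""
    z = z - np.max(z)          # shifting by a constant does not change the result
    e = np.exp(z)
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.1])   # hypothetical scores for 3 classes
p = softmax(logits)
print(p)             # approx. [0.659 0.242 0.099] -- smaller scores keep some probability
print(p.sum())       # 1.0 -- the shared denominator couples all outputs
print(np.argmax(p))  # 0   -- the "hard" argmax keeps only the winner
```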

Shortcomings of the Softmax activation function:

  • The exponentials can overflow numerically when the logits are large, so implementations usually subtract the maximum logit before exponentiating (as in the sketch above);
  • Because the outputs are coupled and always sum to 1, Softmax assumes the classes are mutually exclusive; for multi-label problems, an independent Sigmoid per class is used instead.


Source: blog.csdn.net/CSTGYinZong/article/details/128496064