Artificial Intelligence Study Notes (1): Activation Functions

  Activation functions are essential for artificial neural network models to learn and represent complex, nonlinear functions: they introduce nonlinearity into the network. As shown in Figure 1, a neuron first computes a weighted sum of its inputs and then applies a function to that sum; this function is the activation function. The activation function is introduced to add nonlinearity to the neural network model. Without it, each layer is equivalent to a matrix multiplication, and even stacking several layers amounts to nothing more than a single matrix multiplication.

Figure 1 Working principle of neural network
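
To make the "stacked linear layers collapse into one matrix multiplication" point concrete, here is a minimal NumPy sketch (the shapes and variable names are purely illustrative):

```python
import numpy as np

# Two linear layers with no activation collapse into a single matrix
# multiplication, while inserting a nonlinearity (tanh here) breaks
# that equivalence.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 3))    # 4 samples, 3 input features
W1 = rng.normal(size=(3, 5))   # first-layer weights
W2 = rng.normal(size=(5, 2))   # second-layer weights

two_linear_layers = (x @ W1) @ W2   # stack of two linear layers
single_matrix = x @ (W1 @ W2)       # one equivalent matrix product
print(np.allclose(two_linear_layers, single_matrix))  # True

with_activation = np.tanh(x @ W1) @ W2  # no single matrix reproduces this
```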

Generally speaking, common activation functions are:

1. Sigmoid activation function

Figure 2 sigmoid activation function

The graph of the sigmoid function looks like an S-shaped curve. The function expression is:

f(x) = 1 / (1 + e^(-x))

The Sigmoid activation function has the following advantages:

The output of the sigmoid function ranges from 0 to 1; since every output value is clamped to this interval, it normalizes the output of each neuron;

It is well suited to models that output predicted probabilities, because probabilities also range from 0 to 1;

The gradient is smooth, avoiding "jumps" in the output values;

The function is differentiable, which means the slope of the sigmoid curve can be found at any point;

It gives clear-cut predictions, i.e. outputs very close to 1 or 0.

The sigmoid activation function has the following disadvantages:

It is prone to vanishing gradients;

The output is not centered on 0, which reduces the efficiency of weight updates;

The sigmoid function requires exponential calculations, which makes it slower to compute.
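
As a quick illustration of the vanishing-gradient issue mentioned above, here is a small NumPy sketch of the sigmoid and its derivative (function names are mine):

```python
import numpy as np

def sigmoid(x):
    """Sigmoid: 1 / (1 + e^(-x))."""
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    """Derivative: sigma(x) * (1 - sigma(x)); its maximum is 0.25 at x = 0."""
    s = sigmoid(x)
    return s * (1.0 - s)

for v in (-10.0, -2.0, 0.0, 2.0, 10.0):
    print(f"x = {v:6.1f}  sigmoid = {sigmoid(v):.5f}  gradient = {sigmoid_grad(v):.5f}")
# For |x| around 10 the gradient is ~0.00005: this is the vanishing-gradient issue,
# and the output is also never centered on 0.
```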

2. Tanh / hyperbolic tangent activation function

Figure 3 tanh activation function

The graph of the tanh activation function is also S-shaped, and the expression is:

tanh(x) = (e^x - e^(-x)) / (e^x + e^(-x))

tanh is the hyperbolic tangent function. Its curve is quite similar to that of the sigmoid function, but it has some advantages over sigmoid.

Figure 4 Comparison between tanh and sigmoid

First, as with sigmoid, when the input is very large or very small the output is almost flat and the gradient is small, which is not conducive to weight updates. The difference between the two lies in the output interval: tanh outputs values in (-1, 1), and the whole function is centered on 0, which is better than the sigmoid function;

In the tanh graph, negative inputs are mapped strongly to negative outputs, and inputs near zero are mapped to outputs near zero.

Note: In typical binary classification problems, the tanh function is often used for the hidden layers and the sigmoid function for the output layer, but this is not fixed and should be tuned for the specific problem.
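
To see the zero-centered property concretely, here is a small NumPy sketch comparing the two on the same inputs (the identity in the last line, tanh(x) = 2 * sigmoid(2x) - 1, is a standard relation, not something from the text above):

```python
import numpy as np

x = np.linspace(-4.0, 4.0, 9)
sig = 1.0 / (1.0 + np.exp(-x))   # outputs in (0, 1), mean near 0.5
tanh = np.tanh(x)                # outputs in (-1, 1), mean near 0.0

print("mean of sigmoid outputs:", sig.mean())
print("mean of tanh outputs:   ", tanh.mean())

# tanh is a shifted and rescaled sigmoid: tanh(x) = 2 * sigmoid(2x) - 1
print(np.allclose(tanh, 2.0 / (1.0 + np.exp(-2.0 * x)) - 1.0))  # True
```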

3. ReLU activation function

Figure 5 relu activation function

The ReLU activation function is shown in the figure above, and the function expression is:

f(x) = max(0, x)

The ReLU function is a popular activation function in deep learning. Compared with the sigmoid function and tanh function, it has the following advantages:

When the input is positive, there is no gradient saturation problem.

Computation is much faster. The ReLU function involves only a simple threshold on a linear relationship, so it can be calculated faster than sigmoid and tanh.

Of course, it also has disadvantages:

Dead ReLU problem. When the input is negative, ReLU outputs zero. This is not a problem during forward propagation by itself (some regions are simply inactive), but during backpropagation the gradient for negative inputs is exactly zero, so the corresponding weights stop being updated; the sigmoid and tanh functions suffer from a similar saturation problem for large-magnitude inputs;

The output of the ReLU function is either 0 or positive, which means that ReLU is not a 0-centered function.
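
A minimal NumPy sketch of ReLU and its (sub)gradient, which makes the Dead ReLU behaviour visible (function names are mine):

```python
import numpy as np

def relu(x):
    """ReLU: max(0, x), applied elementwise."""
    return np.maximum(0.0, x)

def relu_grad(x):
    """Subgradient: 1 for positive inputs, 0 otherwise."""
    return (x > 0).astype(float)

x = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])
print(relu(x))       # [0.   0.   0.   0.5  3. ]
print(relu_grad(x))  # [0. 0. 0. 1. 1.]  <- zero gradient on the negative side ("dead" units)
```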

4. Leaky ReLU activation function

Figure 6 Comparison diagram of leaky relu activation function and relu activation function

Leaky ReLU is an activation function designed specifically to solve the Dead ReLU problem.

Its expression is:

f(x) = x, if x > 0; f(x) = a * x, if x <= 0

where a is a constant, generally 0.01.

Leaky ReLU addresses the zero-gradient problem for negative values by giving negative inputs a very small linear component of x (0.01x);

The leak helps to expand the range of the ReLU function; usually the value of a is around 0.01;

The output range of Leaky ReLU is (-infinity, +infinity).

Note: In theory, Leaky ReLU has all the advantages of ReLU and avoids the Dead ReLU problem, but in practice it has not been conclusively shown that Leaky ReLU is always better than ReLU.
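
A corresponding NumPy sketch of Leaky ReLU, using a = 0.01 as suggested above (function names are mine):

```python
import numpy as np

def leaky_relu(x, a=0.01):
    """Leaky ReLU: x for x > 0, a * x otherwise."""
    return np.where(x > 0, x, a * x)

def leaky_relu_grad(x, a=0.01):
    """Gradient: 1 for positive inputs, a elsewhere -- never exactly zero."""
    return np.where(x > 0, 1.0, a)

x = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])
print(leaky_relu(x))       # [-0.03  -0.005  0.     0.5    3.   ]
print(leaky_relu_grad(x))  # [0.01 0.01 0.01 1.   1.  ]
```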

5. ELU activation function

Figure 7 ELU activation function

ELU was also proposed to address the problems of ReLU. Compared to ReLU, ELU can take negative values, which pushes the mean of the activations closer to zero. Mean activations close to zero enable faster learning because they bring the gradient closer to the natural gradient.

Its expression is:

f(x) = x, if x > 0; f(x) = a * (e^x - 1), if x <= 0

where a is a constant, generally 0.1.

Obviously, ELU has all the advantages of ReLU, and:

There is no Dead ReLU problem, and the mean of the output is close to 0 (zero-centered);

By reducing the bias shift effect, ELU brings the ordinary gradient closer to the unit natural gradient, so the mean activation moves toward zero and learning is accelerated;

ELU saturates to a negative value for large negative inputs, which reduces the variation and information propagated forward.

A slight drawback is that it is more computationally intensive. As with Leaky ReLU, although it is theoretically better than ReLU, there is currently no strong practical evidence that ELU is always better than ReLU.
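
A short NumPy sketch of ELU, keeping a as a parameter (the default a = 1.0 below is a common library choice and is my assumption, not taken from the text above):

```python
import numpy as np

def elu(x, a=1.0):
    """ELU: x for x > 0, a * (e^x - 1) for x <= 0."""
    # np.minimum keeps the exp argument non-positive, so the unused branch
    # cannot overflow for large positive x.
    return np.where(x > 0, x, a * (np.exp(np.minimum(x, 0.0)) - 1.0))

x = np.array([-5.0, -1.0, 0.0, 1.0, 5.0])
print(elu(x))
# Negative inputs saturate smoothly toward -a instead of being cut to 0,
# so the mean activation sits closer to zero than with ReLU.
```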

6. Swish activation function

Figure 8 Swish activation function

Function expression: y = x * sigmoid(x)

The design of Swish was inspired by the use of sigmoid gating in LSTMs and highway networks. Using the same value as both the gate and the gated input simplifies the gating mechanism; this is called self-gating.

The advantage of self-gating is that it only requires a simple scalar input, while ordinary gating requires multiple scalar inputs. This enables self-gated activation functions such as Swish to easily replace activation functions that take a single scalar input (such as ReLU) without changing the hidden capacity or number of parameters.

The main advantages of the Swish activation function are as follows:

"Unboundedness" helps to prevent gradients gradually approaching 0 and causing saturation during slow training; (At the same time, boundedness is also advantageous, because bounded activation functions can have strong regularization and large negative Input problems can also be solved);

Derivative constant > 0;

Smoothness plays an important role in optimization and generalization.
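
A one-line NumPy sketch of the y = x * sigmoid(x) form given above (the function name is mine):

```python
import numpy as np

def swish(x):
    """Swish (self-gated): x * sigmoid(x)."""
    return x / (1.0 + np.exp(-x))

x = np.array([-5.0, -1.0, 0.0, 1.0, 5.0])
print(swish(x))
# Unbounded above (approaches x for large positive x), bounded below
# (approaches 0 for large negative x), and smooth everywhere.
```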

7. Softplus activation function

Figure 9 softplus activation function

The expression of the Softplus function is:

f(x) = ln(1 + e^x)

and its derivative is exactly the logistic/sigmoid function.

The Softplus function is similar to the ReLU function but comparatively smooth; like ReLU, it suppresses the negative side. Its output range is (0, +inf).
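
A small NumPy sketch of Softplus and its derivative (the rewriting max(x, 0) + log1p(e^(-|x|)) is just a numerically stable form of ln(1 + e^x); function names are mine):

```python
import numpy as np

def softplus(x):
    """Softplus: ln(1 + e^x), a smooth approximation of ReLU."""
    # Equivalent stable form: max(x, 0) + log1p(exp(-|x|)) avoids overflow.
    return np.maximum(x, 0.0) + np.log1p(np.exp(-np.abs(x)))

def softplus_grad(x):
    """The derivative of Softplus is exactly the sigmoid function."""
    return 1.0 / (1.0 + np.exp(-x))

x = np.array([-3.0, 0.0, 3.0])
print(softplus(x))       # always positive; smooth near 0, unlike ReLU
print(softplus_grad(x))  # sigmoid values in (0, 1)
```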


Source: blog.csdn.net/qq_45198339/article/details/128655855