Neural Network 02 (activation function)

1. Activation function

The activation function is applied inside each neuron; its essence is to introduce nonlinearity into the neural network. With activation functions, a neural network can fit all kinds of curves.

  • Without an activation function, the output of each layer is a linear function of the previous layer's input. No matter how many layers the network has, the output is still just a linear combination of the inputs;
  • By introducing a nonlinear activation function, the output is no longer a plain linear combination of the inputs, so the network can approximate arbitrary functions (a small numerical check follows this list).
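
As a minimal sketch of the first point, the snippet below stacks two purely linear layers (with made-up random weights W1, W2 and biases b1, b2) and shows that they collapse into a single linear layer:

import numpy as np

# Two "layers" with no activation function: y = W2 @ (W1 @ x + b1) + b2
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)
x = rng.normal(size=3)

two_layers = W2 @ (W1 @ x + b1) + b2

# The same mapping written as a single linear layer W @ x + b
W, b = W2 @ W1, W2 @ b1 + b2
one_layer = W @ x + b

print(np.allclose(two_layers, one_layer))  # True: without an activation, depth adds no expressive power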

The roles of the activation function:

  • Increase the model's nonlinear separation capability
  • Improve model robustness
  • Alleviate the vanishing gradient problem
  • Accelerate model convergence, etc.

Commonly used activation functions

1.1 Sigmoid/logistic function

The sigmoid (logistic) function, σ(x) = 1 / (1 + e^(-x)), is differentiable everywhere in its domain, and its derivative approaches 0 on both sides.

When the input x is very large or very small, the gradient is close to 0, and the network parameters can hardly be trained effectively. This phenomenon is called the vanishing gradient.

Generally speaking, a sigmoid network starts to suffer from vanishing gradients within about 5 layers. Moreover, the function is not centered at 0, so in practice it is rarely used in hidden layers; sigmoid is generally only used in the output layer for binary classification.

# Import the required packages
import tensorflow as tf
import tensorflow.keras as keras
import matplotlib.pyplot as plt
import numpy as np
# Define the range of x values
x = np.linspace(-10, 10, 100)
# Compute sigmoid directly with TensorFlow
y = tf.nn.sigmoid(x)
# Plot the curve
plt.plot(x, y)
plt.grid()
plt.show()
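
As a quick numerical check of the vanishing-gradient behaviour described above, the sketch below uses tf.GradientTape to evaluate the sigmoid derivative at a few points; at |x| = 10 the gradient is essentially zero.

import tensorflow as tf

# Evaluate the derivative of sigmoid at a few points via automatic differentiation
x = tf.constant([-10.0, -5.0, 0.0, 5.0, 10.0])
with tf.GradientTape() as tape:
    tape.watch(x)
    y = tf.nn.sigmoid(x)
# Element-wise derivative: largest at x = 0 (0.25), vanishingly small at |x| = 10
print(tape.gradient(y, x).numpy())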

1.2 tanh (hyperbolic tangent curve)

tanh, defined as tanh(x) = (e^x - e^(-x)) / (e^x + e^(-x)), is also a very common activation function. Compared with sigmoid it is centered at 0, which makes it converge faster than sigmoid (the tanh curve is steeper) and reduces the number of iterations needed. However, as the plot shows, the derivative of tanh also approaches 0 on both sides, so it can likewise cause vanishing gradients.

# Import the required packages
import tensorflow as tf
import tensorflow.keras as keras
import matplotlib.pyplot as plt
import numpy as np
# Define the range of x values
x = np.linspace(-10, 10, 100)
# Compute tanh directly with TensorFlow
y = tf.nn.tanh(x)
# Plot the curve
plt.plot(x, y)
plt.grid()
plt.show()

1.3 ReLU

ReLU, f(x) = max(0, x), is currently the most commonly used activation function. As the plot shows, the ReLU derivative is 0 when x < 0, and there is no saturation when x > 0. ReLU therefore keeps the gradient from decaying for x > 0, which alleviates the vanishing gradient problem. However, as training progresses, some inputs fall into the region below 0, and the corresponding weights can no longer be updated. This phenomenon is called "neuron death" (the dying ReLU problem).

ReLU passes through only values greater than 0 and zeroes out the rest, so if your input contains negative values that need to be preserved, ReLU is not a good fit. For image inputs ReLU is very commonly used, because pixel values fed into the network lie in [0, 255].

Compared with sigmoid, the advantages of ReLU are:

  • The sigmoid function is expensive to compute (it requires an exponential), and computing its gradient during backpropagation involves division, which adds further cost; the ReLU activation function saves much of this computation.
  • With sigmoid, gradients easily vanish during backpropagation, making it impossible to train deep networks.
  • ReLU sets the output of some neurons to 0, which makes the network sparse, reduces the interdependence between parameters, and alleviates over-fitting (a quick check of this sparsity follows the list).
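
A minimal sketch of that sparsity effect: roughly half of standard-normal inputs are negative, and ReLU maps all of them to exactly 0.

import tensorflow as tf

# Fraction of activations that ReLU zeroes out for standard-normal inputs (~0.5)
x = tf.random.normal([10000])
y = tf.nn.relu(x)
sparsity = tf.reduce_mean(tf.cast(tf.equal(y, 0.0), tf.float32))
print(sparsity.numpy())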
     
# Import the required packages
import tensorflow as tf
import tensorflow.keras as keras
import matplotlib.pyplot as plt
import numpy as np
# Define the range of x values
x = np.linspace(-10, 10, 100)
# Compute ReLU directly with TensorFlow
y = tf.nn.relu(x)
# Plot the curve
plt.plot(x, y)
plt.grid()
plt.show()

1.4 Leaky ReLU
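
Leaky ReLU keeps a small slope α for negative inputs (f(x) = αx for x ≤ 0 and f(x) = x for x > 0), so the gradient is never exactly 0 and neurons are less likely to "die". A minimal plotting sketch in the same style as the other activations, with α = 0.1 chosen purely as an example value:

# Import the required packages
import tensorflow as tf
import matplotlib.pyplot as plt
import numpy as np
# Define the range of x values
x = np.linspace(-10, 10, 100)
# Leaky ReLU with a small negative slope (alpha = 0.1 is an example value)
y = tf.nn.leaky_relu(x, alpha=0.1)
# Plot the curve
plt.plot(x, y)
plt.grid()
plt.show()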

1.5 Softmax

Softmax is used for multi-class classification. It is the generalization of the binary-classification sigmoid function to multiple classes, and its purpose is to present the multi-class results in the form of probabilities.

Put simply, softmax maps the logits output by the network into values in (0, 1) whose sum is 1 (so they satisfy the properties of a probability distribution). We can then interpret them as probabilities and select the class with the highest probability (i.e., the largest value) as the predicted target class.

Logits are the raw, unnormalized scores produced by the neural network's output layer. In deep learning, logits are commonly used in multi-class classification problems, where each logit corresponds to one possible class. They are the model's scores for each class, and the model makes its classification decision based on these scores.

Typically, the last layer of the network generates the logits, and applying the softmax activation function converts them into a probability distribution over the classes: softmax maps the logits to probability values that sum to one. The class with the highest probability can then be selected as the final classification result.

# Import the required packages
import tensorflow as tf
import tensorflow.keras as keras
import matplotlib.pyplot as plt
import numpy as np
# Raw scores (logits)
x = tf.constant([0.2, 0.02, 0.15, 0.15, 1.3, 0.5, 0.06, 1.1, 0.05, 3.75])
# Feed them into softmax to compute the classification probabilities
y = tf.nn.softmax(x)
# Print the result
print(y)
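
The same numbers can be reproduced by hand from the definition softmax(x_i) = exp(x_i) / Σ_j exp(x_j), which makes the "exponentiate, then normalize" step explicit:

import numpy as np

# Manual softmax: exponentiate the scores, then divide by their sum
x = np.array([0.2, 0.02, 0.15, 0.15, 1.3, 0.5, 0.06, 1.1, 0.05, 3.75])
exp_x = np.exp(x)
probs = exp_x / exp_x.sum()
print(probs)        # matches tf.nn.softmax(x) above
print(probs.sum())  # 1.0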

1.6 Other activation functions

2. How to choose an activation function

2.1 Hidden layer

  • Prefer the ReLU activation function
  • If ReLU does not work well, try other activations such as Leaky ReLU
  • When using ReLU, watch out for the Dead ReLU problem: avoid large gradients that kill too many neurons
  • Avoid the sigmoid activation function; try tanh instead


2.2 Output layer

  • For binary classification problems, use the sigmoid activation function
  • For multi-class classification problems, use the softmax activation function
  • For regression problems, use the identity (linear) activation function (a Keras sketch applying these rules follows this list)
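
A minimal sketch of how these guidelines translate into a Keras model, assuming a made-up example task with 20 input features and 10 output classes: ReLU in the hidden layers, softmax in the output layer.

# Import the required packages
import tensorflow as tf
import tensorflow.keras as keras

# Example classifier: ReLU in the hidden layers, softmax in the 10-class output layer
model = keras.Sequential([
    keras.Input(shape=(20,)),                      # 20 input features (example value)
    keras.layers.Dense(64, activation="relu"),     # hidden layer -> ReLU
    keras.layers.Dense(32, activation="relu"),     # hidden layer -> ReLU
    keras.layers.Dense(10, activation="softmax"),  # multi-class output -> softmax
])
model.summary()

For a binary classifier, the last layer would instead be Dense(1, activation="sigmoid"); for a regression output, Dense(1) with no activation (the identity).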



 
