Sigmoid, tanh and ReLU activation functions in deep learning

Three commonly used nonlinear activation functions are sigmoid, tanh and ReLU:

sigmoid: y = 1 / (1 + e^(-x))

tanh: y = (e^x - e^(-x)) / (e^x + e^(-x))

ReLU: y = max(0, x)
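
As a concrete reference, here is a minimal NumPy sketch of the three functions exactly as defined above (the function names and test values are illustrative, not from the original post):

    import numpy as np

    def sigmoid(x):
        # y = 1 / (1 + e^(-x))
        return 1.0 / (1.0 + np.exp(-x))

    def tanh(x):
        # y = (e^x - e^(-x)) / (e^x + e^(-x)); equivalent to np.tanh(x)
        return (np.exp(x) - np.exp(-x)) / (np.exp(x) + np.exp(-x))

    def relu(x):
        # y = max(0, x), applied element-wise
        return np.maximum(0.0, x)

    x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
    print(sigmoid(x))  # values lie in (0, 1)
    print(tanh(x))     # values lie in (-1, 1)
    print(relu(x))     # negative inputs are clipped to 0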

In hidden layers, tanh is generally superior to sigmoid. It can be regarded as a shifted and rescaled version of sigmoid; its advantage is that its output range is [-1, 1], so its outputs are centred around 0, whereas sigmoid's outputs are centred around 0.5. The effect is similar to centring the data.

In the output layer, however, sigmoid may be better than tanh, because we usually want the output to be a probability between 0 and 1. For binary classification, for example, sigmoid is a natural choice as the output-layer activation function.

In practice, especially when training deep networks, sigmoid and tanh saturate for inputs of large magnitude, which slows training down. For this reason the activation function in deep networks is mostly ReLU, while shallow networks can still use sigmoid or tanh.
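
The saturation is easy to see by evaluating both functions at a few increasingly large inputs: the outputs flatten out, so the gradients passed back through them shrink toward zero (a small illustrative sketch, not from the original post):

    import numpy as np

    x = np.array([0.0, 2.0, 5.0, 10.0])
    print(1.0 / (1.0 + np.exp(-x)))  # sigmoid: ~[0.5, 0.881, 0.993, 0.99995] -> nearly flat past x = 5
    print(np.tanh(x))                # tanh:    ~[0.0, 0.964, 0.9999, 1.0]    -> saturates even faster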

 

To understand how gradient descent works in backpropagation, let's derive the derivatives of the three functions:

1. Derivative of sigmoid

The sigmoid function is defined as y = 1 / (1 + e^(-x)) = (1 + e^(-x))^(-1)

Related derivative formulas: (x^n)' = n * x^(n-1) and (e^x)' = e^x

Applying the chain rule, the derivation process is:

    dy/dx = -1 * (1 + e^(-x))^(-2) * e^(-x) * (-1)
          = e^(-x) * (1 + e^(-x))^(-2)
          = (1 + e^(-x) - 1) / (1 + e^(-x))^2
          = (1 + e^(-x))^(-1) - (1 + e^(-x))^(-2)
          = y - y^2
          = y * (1 - y)
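
A quick finite-difference check of this result (a minimal sketch; the test point and step size are arbitrary):

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    x, h = 0.7, 1e-6
    y = sigmoid(x)
    analytic = y * (1.0 - y)                                # dy/dx = y * (1 - y)
    numeric = (sigmoid(x + h) - sigmoid(x - h)) / (2 * h)   # central difference
    print(analytic, numeric)  # the two values agree to roughly 8 decimal places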

2. Derivative of tanh

The tanh function is defined as y = (e^x - e^(-x)) / (e^x + e^(-x))

Related derivative formula: (u / v)' = (u'v - uv') / v^2

Applying the chain rule, the derivation process is:

    dy/dx = ((e^x - e^(-x))' * (e^x + e^(-x)) - (e^x - e^(-x)) * (e^x + e^(-x))') / (e^x + e^(-x))^2
          = ((e^x - (-1) * e^(-x)) * (e^x + e^(-x)) - (e^x - e^(-x)) * (e^x + (-1) * e^(-x))) / (e^x + e^(-x))^2
          = ((e^x + e^(-x))^2 - (e^x - e^(-x))^2) / (e^x + e^(-x))^2
          = 1 - ((e^x - e^(-x)) / (e^x + e^(-x)))^2
          = 1 - y^2
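
The same kind of finite-difference check works here too (again a minimal sketch with an arbitrary test point):

    import numpy as np

    x, h = 0.7, 1e-6
    y = np.tanh(x)
    analytic = 1.0 - y ** 2                                 # dy/dx = 1 - y^2
    numeric = (np.tanh(x + h) - np.tanh(x - h)) / (2 * h)   # central difference
    print(analytic, numeric)  # both values match closely (about 0.635 at x = 0.7)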

3. Derivative of ReLU

The ReLU function is defined as y = max (0, x)

It follows directly that when x < 0, dy/dx = 0, and when x >= 0, dy/dx = 1.
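
In code the ReLU gradient is usually written as an element-wise mask; following the text's convention, x = 0 is given gradient 1 (a minimal sketch):

    import numpy as np

    def relu(x):
        return np.maximum(0.0, x)

    def relu_grad(x):
        # dy/dx = 0 for x < 0 and 1 for x >= 0, as derived above
        return np.where(x >= 0, 1.0, 0.0)

    x = np.array([-3.0, -0.1, 0.0, 0.1, 3.0])
    print(relu(x))       # [0.  0.  0.  0.1 3. ]
    print(relu_grad(x))  # [0. 0. 1. 1. 1.]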

 

Next, let's take a closer look at ReLU.

In deep neural networks, the rectified linear unit (ReLU) is usually chosen as the activation function of the neurons. ReLU has its origins in neuroscience: in 2001, Dayan and Abbott built a more biologically accurate model of how brain neurons are activated by incoming signals, as shown in the figure below:

In that figure, the horizontal axis is the stimulus current and the vertical axis is the neuron's firing rate. In the same year, Attwell and other neuroscientists studied the brain's energy consumption and inferred that neurons work in a sparse, distributed way; in 2003, Lennie and other neuroscientists estimated that only about 1-4% of the brain's neurons are active at the same time, further suggesting that neural activity is sparse.

So how does ReLU mimic the way neurons work?

 

As the figure above shows, ReLU is in fact a piecewise linear function: it maps all negative values to 0 and leaves positive values unchanged. This property is called one-sided suppression, and it is what gives the neurons in a neural network sparse activation. In deep neural networks in particular (such as CNNs), after adding N layers the activation rate of ReLU neurons will in theory drop by a factor of 2^N. One might ask why the ReLU curve has to look exactly like this. In fact it does not: as long as the one-sided suppression is preserved, whether the function is mirrored or rotated by 180°, the net effect on the neuron's input is only an extra constant factor and does not change the model's training result. It is probably defined this way to match the biological picture and make it easier to understand.
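
This one-sided suppression is easy to observe directly: passing zero-centred random pre-activations through ReLU silences roughly half of the units in a single layer (a toy illustration; the batch and layer sizes are made up):

    import numpy as np

    rng = np.random.default_rng(0)
    pre_activations = rng.standard_normal((1000, 256))  # toy batch of zero-mean pre-activations
    activations = np.maximum(0.0, pre_activations)      # ReLU: negative values -> 0
    print(f"fraction of silenced units: {np.mean(activations == 0.0):.2f}")  # about 0.50 per layer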

What is this sparsity good for? When our brain is working, only some neurons are active at any moment while the rest are suppressed. Similarly, when training a deep classification model, only a handful of features are usually relevant to the target, so a model made sparse by ReLU can mine those relevant features more effectively and fit the training data better.

Compared with other activation functions, ReLU has several advantages: (1) compared with linear functions, ReLU has stronger expressive power, especially in deep network models; (2) compared with the other nonlinear functions, ReLU's gradient is constant over the non-negative interval, so it avoids the vanishing gradient problem and keeps the model's convergence speed stable. (Note) Vanishing gradient problem: when the gradient is less than 1, the error between the prediction and the ground truth decays each time it is propagated back through a layer; if sigmoid is used as the activation function in a deep model, this effect is especially pronounced and can cause convergence to stall.
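
The note above can be illustrated with a back-of-the-envelope calculation (ignoring the weight matrices): the sigmoid derivative is at most 0.25, so after N layers the gradient factor contributed by the activations is at most 0.25^N, whereas ReLU contributes a factor of 1 on its active path:

    # per-layer gradient factor: sigmoid'(x) peaks at 0.25 (at x = 0), ReLU' is 1 where active
    for n_layers in (5, 10, 20):
        print(f"{n_layers} layers: sigmoid <= {0.25 ** n_layers:.1e}, ReLU = {1.0 ** n_layers:.1f}")
    # 5 layers: sigmoid <= 9.8e-04, ReLU = 1.0
    # 10 layers: sigmoid <= 9.5e-07, ReLU = 1.0
    # 20 layers: sigmoid <= 9.1e-13, ReLU = 1.0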

 

Source: www.cnblogs.com/booturbo/p/12691358.html