Hypothesis function, activation function, output function

Sigmoid

y = \frac{1}{1 + e^{-x}}
y \in (0, 1)

Properties: continuous, monotonic, and differentiable.
Advantages: it can be used directly in the output layer.
Disadvantages:

  1. It saturates easily, causing vanishing gradients during backpropagation (see the numpy sketch below);
  2. The neuron outputs are not zero-mean, which is unfavourable for training.
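
To illustrate the saturation problem, here is a minimal numpy sketch (not part of the original post) of the sigmoid and its derivative σ(x)(1 − σ(x)):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    # derivative sigmoid(x) * (1 - sigmoid(x)); its maximum is 0.25 at x = 0
    s = sigmoid(x)
    return s * (1.0 - s)

x = np.array([-10.0, -5.0, 0.0, 5.0, 10.0])
print(sigmoid_grad(x))  # gradients in the tails are ~0, so deep stacks of sigmoids learn slowly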

Unit step function

Discontinuous at 0, and its gradient is 0 everywhere else, so it cannot be trained with gradient-based methods.
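
For reference, numpy ships a step function, np.heaviside, whose second argument is the value returned at exactly 0:

import numpy as np

x = np.array([-2.0, 0.0, 2.0])
print(np.heaviside(x, 0.5))  # [0.  0.5 1. ] -- the gradient is 0 away from the origin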

ReLU

y = \max(0, x) = \begin{cases} x, & x > 0 \\ 0, & x \le 0 \end{cases}

Advantages:

  1. It has one-sided inhibition: the excitation range is wide and activations are sparse, similar to the firing pattern of biological neurons;
  2. The gradient does not saturate easily;
  3. It is fast to compute and to differentiate.

Disadvantages:

  1. When a neuron's input stays negative, its output and gradient are both 0, so the neuron can "die" and never be activated again (see the sketch below).

In TensorFlow: tf.nn.relu(features, name=None)
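
A minimal numpy sketch (an illustration, not the TensorFlow implementation) of the forward pass and its gradient, showing why a neuron whose input stays negative stops learning:

import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def relu_grad(x):
    # 1 for x > 0, exactly 0 for x <= 0: negative pre-activations receive no updates
    return (x > 0).astype(float)

x = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])
print(relu(x))       # 0, 0, 0, 0.5, 3
print(relu_grad(x))  # 0, 0, 0, 1, 1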

LReLU (leaky ReLU)

y = \begin{cases} x, & x > 0 \\ 0.01x, & x \le 0 \end{cases}

In TensorFlow: tf.nn.leaky_relu(features, alpha=0.2, name=None) (note that the default alpha here is 0.2, not the 0.01 used in the formula above)
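
A one-line numpy equivalent (a sketch, using the 0.01 slope from the formula above):

import numpy as np

def leaky_relu(x, alpha=0.01):
    # a small non-zero slope for x <= 0 keeps a gradient flowing, avoiding dead neurons
    return np.where(x > 0, x, alpha * x)

print(leaky_relu(np.array([-3.0, -0.5, 0.5, 3.0])))  # -0.03, -0.005, 0.5, 3.0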

PReLU (Parametric ReLU)

y = \begin{cases} x, & x > 0 \\ \alpha x, & x \le 0 \end{cases}
The difference from ReLU is that for negative inputs the output is still a linear function of the input, but the slope α is not fixed: it is adjusted continuously during training. This keeps neurons active, with a non-zero gradient, even when the input is negative.
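
A toy numpy sketch (assumed setup, not from the original post) of how the slope α can be updated by gradient descent like any other weight:

import numpy as np

def prelu(x, alpha):
    return np.where(x > 0, x, alpha * x)

def dprelu_dalpha(x):
    # the output depends on alpha only where x <= 0, and there d(output)/d(alpha) = x
    return np.where(x > 0, 0.0, x)

alpha, lr = 0.25, 0.1                       # assumed initial slope and learning rate
x = np.array([-2.0, -1.0, 0.5, 3.0])
upstream = np.ones_like(x)                  # pretend dL/dy = 1 for every element
alpha -= lr * np.sum(upstream * dprelu_dalpha(x))   # alpha is updated like any other weight
print(alpha, prelu(x, alpha))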

CReLU(Concatenated Rectified Linear Units)

CReLU concatenates ReLU(x) and ReLU(−x) along the channel dimension, so the output has twice the depth of the input.

In TensorFlow: tf.nn.crelu(features, name=None)
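
A minimal numpy sketch of that behaviour (an illustration, not the TensorFlow implementation):

import numpy as np

def crelu(x, axis=-1):
    # keep the positive part and the (rectified) negative part as separate channels
    return np.concatenate([np.maximum(0.0, x), np.maximum(0.0, -x)], axis=axis)

x = np.array([[-1.0, 2.0, -3.0]])
print(crelu(x))        # [[0. 2. 0. 1. 0. 3.]]
print(crelu(x).shape)  # (1, 6): the channel dimension is doubled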

ELU (Exponential Linear Unit)

f(x) = \begin{cases} x, & x > 0 \\ \alpha(e^x - 1), & x \le 0 \end{cases}, \quad \alpha > 0

f'(x) = \begin{cases} 1, & x > 0 \\ f(x) + \alpha, & x \le 0 \end{cases}, \quad \alpha > 0

Here α is a tunable parameter that controls where the negative part of the ELU saturates. The linear part on the right alleviates vanishing gradients, while the soft saturation on the left makes the ELU more robust to input changes and noise. Because the mean output of the ELU is close to zero, convergence is faster.

In TensorFlow: tf.nn.elu(features, name=None)
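
A small numpy sketch (not from the original post) that also checks the identity f'(x) = f(x) + α on the negative side:

import numpy as np

def elu(x, alpha=1.0):
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

def elu_grad(x, alpha=1.0):
    # for x <= 0 the derivative is alpha * e^x, which equals f(x) + alpha
    return np.where(x > 0, 1.0, alpha * np.exp(x))

x = np.array([-2.0, -1.0, -0.1])
print(np.allclose(elu_grad(x), elu(x) + 1.0))  # True (with alpha = 1)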

SELU (Scaled Exponential Linear Unit)

f(x) = \lambda \begin{cases} x, & x > 0 \\ \alpha(e^x - 1), & x \le 0 \end{cases}, \quad \alpha > 0

With this activation the distribution of activations is automatically driven towards zero mean and unit variance (self-normalization), which keeps gradients from exploding or vanishing during training; the reported effect is better than Batch Normalization. SELU is essentially an ELU multiplied by λ, and the key point is that λ > 1. Earlier activations (ReLU, PReLU, ELU) all have a gentle slope on the negative half-axis, which can shrink the activation variance when it is too large and thus prevent exploding gradients, but their slope on the positive half-axis is simply 1. The positive half-axis slope of SELU is greater than 1, so the variance can also be increased when it is too small, which prevents vanishing gradients. The activation therefore has a fixed point: once the network is deep, the output of every layer tends to zero mean and unit variance.

In TensorFlow: tf.nn.selu(features, name=None)
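
A minimal numpy sketch; the constants below are the fixed-point values from the SELU paper ("Self-Normalizing Neural Networks", Klambauer et al., 2017), which are also the defaults used by tf.nn.selu:

import numpy as np

LAMBDA = 1.0507009873554805   # "scale" in the paper / TensorFlow implementation
ALPHA = 1.6732632423543772

def selu(x):
    return LAMBDA * np.where(x > 0, x, ALPHA * (np.exp(x) - 1.0))

# feeding standard-normal inputs through SELU keeps the mean ~0 and the variance ~1
z = np.random.randn(1_000_000)
y = selu(z)
print(y.mean(), y.var())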

Tanh

y = \frac{e^x - e^{-x}}{e^x + e^{-x}}
y \in (-1, 1)

Advantages: the neuron output is zero-mean, which helps model training (see the quick check below).
Disadvantages:

  1. It saturates easily, causing vanishing gradients during backpropagation;
  2. The exponentials make it relatively expensive to compute.
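
Tanh is a rescaled, zero-centred sigmoid, tanh(x) = 2·sigmoid(2x) − 1; a quick numpy check (a sketch, not from the original post):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    return (np.exp(x) - np.exp(-x)) / (np.exp(x) + np.exp(-x))

x = np.linspace(-5, 5, 11)
print(np.allclose(tanh(x), 2.0 * sigmoid(2.0 * x) - 1.0))  # True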

Softmax

Softmax is the Soft version of ArgMax
https://www.zhihu.com/question/294679135/answer/527393818
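
The standard definition is softmax(x)_i = e^{x_i} / \sum_j e^{x_j}. A minimal numpy sketch (with the usual max-subtraction for numerical stability):

import numpy as np

def softmax(x):
    # subtracting the max does not change the result but avoids overflow in exp
    e = np.exp(x - np.max(x))
    return e / e.sum()

x = np.array([2.0, 1.0, 0.1])
p = softmax(x)
print(p, p.sum())                    # probabilities that sum to 1
print(np.argmax(x) == np.argmax(p))  # True: softmax preserves the argmax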

# Plotting script comparing the activation curves defined above
import numpy as np
import matplotlib.pyplot as plt


def tanh(x):
    return (np.exp(x) - np.exp(-x)) / (np.exp(x) + np.exp(-x))


def sigmoid(x):
    return 1 / (1 + np.exp(-x))


def relu(x):
    # vectorised ReLU: element-wise max(0, x)
    return np.maximum(0, x)


def lrelu(x):
    # leaky ReLU with a 0.01 slope on the negative side
    return np.where(x > 0, x, 0.01 * x)


fig = plt.figure(figsize=(6, 4))
ax = fig.add_subplot(111)

x = np.linspace(-10, 10)
y_tanh = tanh(x)
y_sigmoid = sigmoid(x)
y_relu = relu(x)
y_lrelu = lrelu(x)

ax.spines['top'].set_color('none')
ax.spines['right'].set_color('none')

ax.xaxis.set_ticks_position('bottom')
ax.spines['bottom'].set_position(('data', 0))
ax.set_xticks([-10, -5, 0, 5, 10])
ax.yaxis.set_ticks_position('left')
ax.spines['left'].set_position(('data', 0))
ax.set_yticks([-1, -0.5, 0.5, 1])

# plt.plot(x, y_tanh, label="tanh", color="red")
# plt.plot(x, y_sigmoid, label="Sigmoid", color="black")
# plt.plot(x, y_relu, label="ReLU", color="green")
plt.plot(x, y_lrelu, label="LReLU", color="red")
plt.legend()
plt.show()

Source: blog.csdn.net/weixin_38052918/article/details/107725993