Rectifier (neural networks)

https://en.wikipedia.org/wiki/Rectifier_(neural_networks)

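The rectified linear unit (ReLU), also called the linear rectification function, is an activation function commonly used in artificial neural networks. The term usually refers to the nonlinear functions represented by the ramp function and its variants.

Commonly used rectifiers include the ramp function $f(x) = \max(0, x)$ and the leaky rectifier (Leaky ReLU), where $x$ is the input to a neuron. Rectification is considered to have some biological grounding, and because in practice it usually works better than other common activation functions (such as the logistic function), it is widely used in today's deep neural networks, for example in computer vision tasks such as image recognition.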

In the context of artificial neural networks, the rectifier is an activation function defined as the positive part of its argument:

$f(x) = x^{+} = \max(0, x),$

where $x$ is the input to a neuron. This is also known as a ramp function and is analogous to half-wave rectification in electrical engineering. This activation function was first introduced to a dynamical network by Hahnloser et al. in 2000 with strong biological motivations and mathematical justifications. In 2011 it was first demonstrated to enable better training of deeper networks than the activation functions widely used before then, e.g., the logistic sigmoid (which is inspired by probability theory; see logistic regression) and its more practical counterpart, the hyperbolic tangent. As of 2017, the rectifier is the most popular activation function for deep neural networks.

A unit employing the rectifier is also called a rectified linear unit (ReLU).

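rectifier ['rektɪfaɪə]: n. rectifier; one who corrects
neuron ['njʊərɒn]: n. neuron, nerve cell
ramp [ræmp]: n. slope, incline; swindle vi. to spread rampantly, to rush about wildly; to swindle vt. to swindle; to give a slope to
analogous [ə'næləgəs]: adj. similar, comparable; (biology) analogous in function
rectification [,rektɪfɪ'keɪʃən]: n. correction; (chemistry) rectification; (electrical) rectification; (mathematics) finding the length of a curve
dynamical [daɪ'næmɪkl]: adj. of dynamics (same as dynamic); energetic, forceful
biological [baɪə(ʊ)'lɒdʒɪk(ə)l]: adj. biological, relating to biology
mathematic [,mæθə'mætɪk]: adj. mathematical; precise, rigorous, certain
justification [dʒʌstɪfɪ'keɪʃ(ə)n]: n. reason, defense, vindication, exoneration
sigmoid ['sɪgmɒɪd]: adj. of the sigmoid colon; C-shaped, S-shaped n. sigmoid colon (same as sigmoidal); S-shaped curve
counterpart ['kaʊntəpɑːt]: n. copy, duplicate; matching item; a person or thing closely resembling another
hyperbolic [,haɪpə'bɒlɪk]: adj. of a hyperbola; exaggerated
tangent [ˈtændʒənt]: adj. tangent, touching; digressing n. tangent line; tangent (trigonometry)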

In the usual sense, the rectifier refers to the ramp function from mathematics, that is,

$f(x) = \max(0, x)$

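In a neural network, the rectifier serves as a neuron's activation function and defines the neuron's nonlinear output after the linear transformation $\mathbf {w} ^{T}\mathbf {x} + b$. For an input vector $\mathbf {x}$ arriving from the previous layer, a neuron that uses the rectifier activation outputs

${\max(0, \mathbf {w} ^{T}\mathbf {x} +b)}$

to the next layer, or as the output of the whole network (depending on where the neuron sits in the network structure).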
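The neuron output described above can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the names `relu` and `neuron_output` and the sample weights are assumptions chosen for the example.

```python
def relu(z):
    """Ramp function: the positive part of z."""
    return max(0.0, z)


def neuron_output(w, x, b):
    """Output of a single ReLU neuron: max(0, w^T x + b)."""
    # Linear transformation w^T x + b, then the rectifier.
    pre_activation = sum(wi * xi for wi, xi in zip(w, x)) + b
    return relu(pre_activation)


# Illustrative (made-up) weights and input.
w = [0.5, -1.0]
x = [2.0, 1.0]
positive_case = neuron_output(w, x, b=0.25)   # pre-activation is 0.25
negative_case = neuron_output(w, x, b=-1.0)   # pre-activation is -1.0
print(positive_case, negative_case)
```

With these values the first call passes the positive pre-activation through unchanged, while the second clips the negative pre-activation to zero.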


Plot of the rectifier (blue) and softplus (green) functions near $x = 0$

1. Variants

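Beyond the ramp function, several rectifier variants are also widely used in deep learning, such as the leaky rectifier (Leaky ReLU), the randomized leaky rectifier (Randomized Leaky ReLU), and the noisy rectifier (Noisy ReLU).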

1.1 Leaky ReLUs

When the input $x$ is negative, the leaky rectifier (Leaky ReLU) has a constant gradient $\lambda \in (0,1)$ rather than 0. When the input is positive, it coincides with the ordinary ramp function.

$f(x)= {\begin{cases} x&{\text{if }}x>0\\ \lambda x&{\text{if }}x\leq 0 \end{cases}}$

In deep learning, if $\lambda$ is made a variable that can be learned through backpropagation, the leaky rectifier is called a parametric rectifier (Parametric ReLU).

Leaky ReLUs allow a small, positive gradient when the unit is not active.

Parametric ReLUs (PReLUs) take this idea further by making the coefficient of leakage into a parameter that is learned along with the other neural network parameters.

$f(x)= \begin{cases} x&{\text{if }}x>0\\ ax&{\text{otherwise}} \end{cases}$

Note that for $a\leq 1$, this is equivalent to

$f(x)=\max(x, ax)$

and thus has a relation to “maxout” networks.
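The equivalence between the piecewise definition and $\max(x, ax)$ can be checked numerically. The sketch below is illustrative only: the function names and the default slope `a=0.01` are assumptions, not values taken from the text.

```python
def leaky_relu(x, a=0.01):
    """Piecewise form: x for x > 0, a * x otherwise."""
    return x if x > 0 else a * x


def leaky_relu_max_form(x, a=0.01):
    """Single-expression form, valid only when a <= 1."""
    return max(x, a * x)


samples = [-3.0, -0.5, 0.0, 0.5, 3.0]

# For slopes a <= 1 the two forms agree on every sample.
forms_agree = all(
    leaky_relu(x, a) == leaky_relu_max_form(x, a)
    for a in (0.01, 0.25, 1.0)
    for x in samples
)

# For a > 1 the max form picks a * x on positive inputs, so it differs.
piecewise_value = leaky_relu(1.0, a=2.0)
max_form_value = leaky_relu_max_form(1.0, a=2.0)
print(forms_agree, piecewise_value, max_form_value)
```

The last two values show why the condition $a\leq 1$ matters: with a slope of 2 the max form no longer reproduces the piecewise definition for positive inputs.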

1.2 Randomized Leaky ReLUs

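The randomized leaky rectifier (Randomized Leaky ReLU, RReLU) was first proposed and used in the Kaggle National Data Science Bowl (NDSB) competition. Unlike the ordinary leaky rectifier, its slope $\lambda$ on the negative input range is a random variable drawn from a continuous uniform distribution $U(l,u)$, that is,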

$f(x)={ \begin{cases} x&{\text{if }}x>0 \\ \lambda x&{\text{if }}x\leq 0 \end{cases} }$

where $\lambda \sim U(l,u)$, $l<u$, and $l,u\in [0,1)$.
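A minimal sketch of the sampling step follows. The function name `rrelu`, the default bounds `lower=0.125` and `upper=1/3`, and the seeded generator are assumptions made for illustration; the text only requires $l<u$ with both in $[0,1)$.

```python
import random


def rrelu(x, lower=0.125, upper=1.0 / 3.0, rng=random):
    """Randomized leaky ReLU: the negative-side slope is drawn from U(lower, upper)."""
    if not (0.0 <= lower < upper < 1.0):
        raise ValueError("expected 0 <= lower < upper < 1")
    if x > 0:
        return x
    lam = rng.uniform(lower, upper)  # a fresh slope for each call
    return lam * x


rng = random.Random(0)  # seeded only so the example is repeatable
negative_outputs = [rrelu(-2.0, rng=rng) for _ in range(5)]
positive_output = rrelu(2.0, rng=rng)
print(negative_outputs, positive_output)
```

Each negative input gets its own slope, so repeated calls with the same input generally return different values, while positive inputs pass through unchanged.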

1.3 Noisy ReLUs

Rectified linear units can be extended to include Gaussian noise, making them noisy ReLUs. For a neuron input $x$, the noisy rectifier adds normally distributed uncertainty, giving

$f(x)=\max(0,x+Y)$

with the random variable $Y\sim {\mathcal {N}}(0,\sigma (x))$. Noisy ReLUs have been used with some success in restricted Boltzmann machines for computer vision tasks.
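The noisy rectifier can be sketched as below. The text does not say whether $\sigma(x)$ denotes a standard deviation or a variance, nor how it depends on $x$; this example assumes it is a standard deviation supplied by a caller-provided function, and the name `noisy_relu` and the constant default of 0.1 are hypothetical.

```python
import random


def noisy_relu(x, sigma=lambda x: 0.1, rng=random):
    """Noisy ReLU: max(0, x + Y) with Y drawn from a zero-mean Gaussian.

    `sigma` maps the input x to the standard deviation of the noise
    (an assumption; the source only writes N(0, sigma(x))).
    """
    noise = rng.gauss(0.0, sigma(x))
    return max(0.0, x + noise)


rng = random.Random(0)  # seeded only so the example is repeatable
noisy_samples = [noisy_relu(0.05, rng=rng) for _ in range(1000)]

# With zero noise the function reduces to the plain rectifier.
noiseless_positive = noisy_relu(1.5, sigma=lambda x: 0.0, rng=rng)
noiseless_negative = noisy_relu(-1.5, sigma=lambda x: 0.0, rng=rng)
print(min(noisy_samples), noiseless_positive, noiseless_negative)
```

Because the noise is added before the rectifier, the output is still never negative, but an input near zero can be switched on or off from one call to the next.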

1.4 ELUs

Exponential linear units (ELUs) try to make the mean activations closer to zero, which speeds up learning. It has been shown that ELUs can obtain higher classification accuracy than ReLUs.

$f(x)= \begin{cases} x&{\text{if }}x>0\\ a(e^{x}-1)&{\text{otherwise}} \end{cases}$

Here $a$ is a hyper-parameter to be tuned, subject to the constraint $a\geq 0$.
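The ELU definition above translates directly into code. In this sketch the function name `elu` and the default `a=1.0` are assumptions; `math.expm1` is used in place of `exp(x) - 1` only because it is more accurate for small `x`.

```python
import math


def elu(x, a=1.0):
    """Exponential linear unit: x for x > 0, a * (e^x - 1) otherwise."""
    if a < 0:
        raise ValueError("a must be non-negative")
    # math.expm1(x) computes e^x - 1 accurately near zero.
    return x if x > 0 else a * math.expm1(x)


values = {x: elu(x) for x in (-50.0, -1.0, 0.0, 2.0)}
print(values)
```

Unlike the plain rectifier, the output for negative inputs is nonzero and saturates toward $-a$, which is what pulls the mean activation closer to zero.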

2. Advantages

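Compared with traditional neural network activation functions, such as the logistic sigmoid and hyperbolic functions like tanh, the rectifier has advantages in the following respects: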

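Biological inspiration: Research on the brain suggests that information coding in biological neurons is usually distributed and sparse. Typically only about 1%-4% of the neurons in the brain are active at the same time. Using rectification together with regularization makes it possible to tune the activity (that is, the fraction of positive outputs) of the neurons in an artificial network. By contrast, the logistic function reaches $\frac {1}{2}$ at input 0, already a half-saturated steady state, which matches what biology expects of a simulated neural network less well. It should be noted, however, that in a network using rectified linear units roughly 50% of the neurons are typically in the active state.
More efficient gradient descent and backpropagation: the rectifier mitigates the vanishing gradient problem that affects saturating activation functions.
Simpler computation: there is no expensive operation such as the exponential found in other activation functions, and the sparsity of activations lowers the overall computational cost of the network.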
Biological plausibility: One-sided, compared to the antisymmetry of tanh.
Sparse activation: For example, in a randomly initialized network, only about 50% of hidden units are activated (having a non-zero output).
Better gradient propagation: Fewer vanishing gradient problems compared to sigmoidal activation functions that saturate in both directions.
Efficient computation: Only comparison, addition and multiplication.
Scale-invariant: $\max(0, ax) = a\max(0, x){\text{ for }} a\geq 0$

Rectifying activation functions were used to separate specific excitation and unspecific inhibition in the Neural Abstraction Pyramid, which was trained in a supervised way to learn several computer vision tasks. In 2011, the use of the rectifier as a non-linearity was shown to enable training deep supervised neural networks without requiring unsupervised pre-training. Rectified linear units, compared to the sigmoid function or similar activation functions, allow for faster and more effective training of deep neural architectures on large and complex datasets.
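The sparse-activation and scale-invariance points above can be illustrated with a short experiment. It is only a sketch: drawing pre-activations from a standard normal distribution is an assumed stand-in for a randomly initialized network, and the sample size, seed, and scale factor are arbitrary.

```python
import random


def relu(z):
    return max(0.0, z)


rng = random.Random(0)  # seeded only so the example is repeatable

# Assumption: pre-activations of a randomly initialized layer are
# roughly symmetric around zero, modeled here as N(0, 1) samples.
pre_activations = [rng.gauss(0.0, 1.0) for _ in range(10_000)]

# Sparse activation: fraction of units with a non-zero output.
active_fraction = sum(relu(z) > 0 for z in pre_activations) / len(pre_activations)
print(f"active fraction: {active_fraction:.3f}")

# Scale invariance: max(0, a*z) == a * max(0, z) for a >= 0.
a = 3.0
scale_invariant = all(relu(a * z) == a * relu(z) for z in pre_activations)
print("scale invariant:", scale_invariant)
```

The active fraction comes out close to one half because about half of a symmetric distribution lies above zero; a trained network with biases or regularization can be sparser or denser than this.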

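biological [baɪə(ʊ)'lɒdʒɪk(ə)l]: adj. biological, relating to biology
plausibility [,plɔzə'bɪləti]: n. plausibility, apparent reasonableness; speciousness
antisymmetry: n. antisymmetry
propagation [,prɒpə'ɡeɪʃən]: n. spreading, transmission; breeding, multiplication
vanish ['vænɪʃ]: vi. to disappear, to vanish suddenly; to become zero vt. to make disappear n. (phonetics) off-glide
saturate ['sætʃəreɪt]: vt. to soak, to drench; to saturate; to fill adj. soaked; saturated; deeply colored
excitation [,eksaɪ'teɪʃ(ə)n]: n. excitation, stimulation, arousal
inhibition [ɪn(h)ɪ'bɪʃ(ə)n]: n. inhibition, suppression, prohibition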

3. Potential problems

Non-differentiable at zero; however, it is differentiable everywhere else, and the value of the derivative at zero can be arbitrarily chosen to be 0 or 1.
Not zero-centered
Unbounded
Dying ReLU problem: ReLU neurons can sometimes be pushed into states in which they become inactive for essentially all inputs. In this state, no gradients flow backward through the neuron, and so the neuron becomes stuck in a perpetually inactive state and “dies.” This is a form of the vanishing gradient problem. In some cases, large numbers of neurons in a network can become stuck in dead states, effectively decreasing the model capacity. This problem typically arises when the learning rate is set too high. It may be mitigated by using Leaky ReLUs instead, which assign a small positive slope to the left of $x = 0$.
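The dying ReLU problem can be made concrete by looking at the derivative. In the sketch below, the neuron with weight 1 and bias -100 is a hypothetical example of a unit that has been pushed into the inactive region; the function names, the choice of 0 for the derivative at zero, and the leak slope 0.01 are likewise assumptions.

```python
def relu_grad(z, at_zero=0.0):
    """Derivative of max(0, z); the value at z == 0 is a free choice (0 or 1)."""
    if z > 0:
        return 1.0
    if z < 0:
        return 0.0
    return at_zero


def leaky_relu_grad(z, a=0.01):
    """Derivative of the leaky rectifier with slope a on the negative side."""
    return 1.0 if z > 0 else a


# Hypothetical neuron whose bias has been pushed far into the negative range.
w, b = 1.0, -100.0
inputs = [-2.0, -0.5, 0.0, 0.5, 2.0]

relu_grads = [relu_grad(w * x + b) for x in inputs]
leaky_grads = [leaky_relu_grad(w * x + b) for x in inputs]
print(relu_grads)   # every entry is zero: no gradient reaches w or b
print(leaky_grads)  # every entry is the small leak slope
```

With the plain rectifier every local gradient is zero, so gradient descent cannot move the weights back into the active region; the leaky variant keeps a small gradient flowing, which is why it is suggested as a mitigation.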

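essentially [ɪ'senʃ(ə)lɪ]: adv. in essence, fundamentally
stuck [stʌk]: v. (past tense of stick) stabbed adj. jammed, unable to move, trapped, bogged down, stumped, unable to continue
perpetually [pɚ'pɛtʃʊəli]: adv. eternally, constantly
slope [sləʊp]: n. slope, incline, gradient; (military) the slope-arms position vi. to slope; to slip away vt. to tilt, to cause to slope; to shoulder (a rifle)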

4. Softplus

A smooth approximation to the rectifier is the analytic function

$f(x) = \log(1+e^{x}),$

which is called the softplus or SmoothReLU function. The derivative of softplus is $f'(x) = {\frac {e^{x}}{1+e^{x}}} = {\frac {1}{1+e^{-x}}}$, the logistic function. The logistic function is a smooth approximation of the derivative of the rectifier, the Heaviside step function.
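The claim that the derivative of softplus is the logistic function can be checked with a finite-difference approximation. This is a sketch under stated assumptions: the function names are illustrative, and softplus is rewritten as $\max(x, 0) + \log(1 + e^{-|x|})$, an algebraically equivalent form chosen here only to avoid overflow for large $|x|$.

```python
import math


def softplus(x):
    """log(1 + e^x), computed in an overflow-safe but equivalent form."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def logistic(x):
    """1 / (1 + e^-x), branching on the sign of x to avoid overflow."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def numeric_derivative(f, x, h=1e-6):
    """Central finite-difference approximation of f'(x)."""
    return (f(x + h) - f(x - h)) / (2.0 * h)


points = [-5.0, -1.0, 0.0, 1.0, 5.0]
gaps = [abs(numeric_derivative(softplus, x) - logistic(x)) for x in points]
print(max(gaps))  # small: the numerical slope matches the logistic function
```

For large positive inputs softplus is nearly the identity and for large negative inputs it is nearly zero, which is the sense in which it smooths the rectifier.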

The multivariable generalization of single-variable softplus is the LogSumExp with the first argument set to zero:

$\mathrm {LSE_{0}} ^{+}(x_{1},...,x_{n}):=\mathrm {LSE} (0,x_{1},...,x_{n})=\log \left(1+e^{x_{1}}+\cdots +e^{x_{n}}\right).$

The LogSumExp function itself is:

$\mathrm {LSE} (x_{1},\dots ,x_{n})=\log \left(e^{x_{1}}+\cdots +e^{x_{n}}\right),$

and its gradient is the softmax; the softmax with the first argument set to zero is the multivariable generalization of the logistic function. Both LogSumExp and softmax are used in machine learning.
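The relationships among LogSumExp, softmax, softplus, and the logistic function can be verified numerically. The sketch below uses only the standard library; the helper names (`logsumexp`, `softmax`, `lse0_plus`, `numeric_gradient`) and the test point are assumptions, and subtracting the maximum before exponentiating is a common stabilization trick rather than part of the definitions above.

```python
import math


def logsumexp(xs):
    """log(e^{x_1} + ... + e^{x_n}), shifted by max(xs) to avoid overflow."""
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))


def softmax(xs):
    """Gradient of logsumexp: e^{x_i} / sum_j e^{x_j}."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def lse0_plus(xs):
    """LogSumExp with a leading zero argument; equals softplus when n == 1."""
    return logsumexp([0.0] + list(xs))


def numeric_gradient(f, xs, h=1e-6):
    """Central finite-difference gradient of f at xs."""
    grad = []
    for i in range(len(xs)):
        up, down = list(xs), list(xs)
        up[i] += h
        down[i] -= h
        grad.append((f(up) - f(down)) / (2.0 * h))
    return grad


point = [0.5, -1.0, 2.0]
analytic = softmax(point)
numeric = numeric_gradient(logsumexp, point)
max_gap = max(abs(a - n) for a, n in zip(analytic, numeric))
print(max_gap)  # small: softmax matches the numerical gradient of LogSumExp
```

For a single variable, `lse0_plus([x])` reduces to $\log(1 + e^{x})$, and the second component of `softmax([0.0, x])` reduces to the logistic function of $x$, mirroring the two generalizations stated above.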