[Activation function] GELU activation function

1. Introduction

        GELU (Gaussian Error Linear Unit) is an activation function based on the Gaussian error function. Compared with activation functions such as ReLU, GELU is smoother, which helps improve convergence speed and performance during training.

import numpy as np
import torch

# Definition of the GELU activation function (tanh approximation)
def gelu(x):
    return 0.5 * x * (1 + torch.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

2. Formula

$\operatorname{GELU}(x)=0.5 x\left(1+\tanh \left(\sqrt{\frac{2}{\pi}}\left(x+0.044715 x^3\right)\right)\right)$

where $\sqrt{\frac{2}{\pi}}$ and $0.044715$ are the two adjustment coefficients of the GELU function. This formula is the commonly used tanh approximation of the exact definition $\operatorname{GELU}(x)=x\,\Phi(x)$, where $\Phi(x)$ is the cumulative distribution function of the standard Gaussian.
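
        As a quick sanity check, the tanh approximation can be compared with an exact erf-based implementation. The sketch below is only illustrative (the function names gelu_exact and gelu_tanh are ours, not from the paper):

import math
import torch

def gelu_exact(x):
    # Exact GELU: x * Phi(x), with Phi the standard normal CDF expressed via erf
    return 0.5 * x * (1 + torch.erf(x / math.sqrt(2)))

def gelu_tanh(x):
    # Tanh approximation from the formula above
    return 0.5 * x * (1 + torch.tanh(math.sqrt(2 / math.pi) * (x + 0.044715 * x**3)))

x = torch.linspace(-5, 5, 101)
print(torch.max(torch.abs(gelu_exact(x) - gelu_tanh(x))))  # the two curves agree closely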

3. Image
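
        A simple way to visualize the GELU curve and compare it with ReLU is to plot both over a range of inputs. This is a minimal sketch, assuming numpy and matplotlib are available:

import math
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(-4, 4, 400)
gelu_y = 0.5 * x * (1 + np.tanh(math.sqrt(2 / math.pi) * (x + 0.044715 * x**3)))
relu_y = np.maximum(0, x)

plt.plot(x, gelu_y, label="GELU")
plt.plot(x, relu_y, "--", label="ReLU")
plt.xlabel("x")
plt.ylabel("activation")
plt.legend()
plt.show()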

4. Features

  • Nonlinearity: GELU introduces a nonlinear transformation, which enables the neural network to learn more complex mappings and helps improve the expressive power of the model.

  • Smoothness: GELU is a smooth function with continuous derivatives and no gradient truncation, which contributes to stable gradient propagation. It is also smoother than ReLU near 0, so training tends to converge more easily (see the gradient sketch after this list).

  • Preventing neuron "death": Unlike ReLU, which outputs exactly zero for all negative inputs, GELU applies a smooth nonlinearity in the negative range, which helps prevent the "dying neuron" problem.

  • Gaussian distribution: When the input is close to 0, the output of the GELU activation function stays close to a Gaussian distribution, which helps improve the generalization ability of the neural network and makes it easier for the model to adapt to different data distributions.

  • Computational cost: The GELU activation function is relatively expensive to compute, involving operations such as exponentials, square roots, and the hyperbolic tangent, so it may incur noticeable overhead when computing resources are limited.

  • Tends toward linearity: For inputs of large magnitude, the output of the GELU function is nearly linear (it approaches the identity for large positive inputs and 0 for large negative inputs), which may cause some nonlinear features to be lost.
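
        The following sketch illustrates the smoothness and negative-range behavior described above by comparing the outputs and gradients of GELU and ReLU with torch.autograd (the sample values are purely illustrative):

import torch
import torch.nn.functional as F

x = torch.tensor([-3.0, -1.0, -0.1, 0.0, 0.1, 1.0, 3.0], requires_grad=True)

# GELU: nonzero outputs and gradients in the negative range
y = F.gelu(x)
y.sum().backward()
print("GELU outputs  :", y.detach())
print("GELU gradients:", x.grad)

x.grad = None

# ReLU: outputs and gradients are exactly zero for negative inputs
y = F.relu(x)
y.sum().backward()
print("ReLU outputs  :", y.detach())
print("ReLU gradients:", x.grad)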

        The GELU activation function is commonly used in Transformer models such as BERT and GPT.
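
        For example, the position-wise feed-forward block of a Transformer typically applies GELU between two linear layers. The following is a minimal PyTorch sketch (the dimensions d_model and d_ff are illustrative, not tied to any specific model):

import torch
import torch.nn as nn

class FeedForward(nn.Module):
    """Transformer-style position-wise feed-forward block with GELU."""
    def __init__(self, d_model=512, d_ff=2048):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.GELU(),              # GELU between the two linear layers
            nn.Linear(d_ff, d_model),
        )

    def forward(self, x):
        return self.net(x)

# Usage: a batch of 2 sequences, 10 tokens each, model dimension 512
x = torch.randn(2, 10, 512)
print(FeedForward()(x).shape)  # torch.Size([2, 10, 512])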

        Compared to the Sigmoid and Tanh activation functions, ReLU and GELU are generally more effective and efficient because they suffer far less from the vanishing gradient problem. Vanishing gradients usually occur in deep neural networks: the gradient values become progressively smaller during backpropagation, so the network's weights can hardly be updated, which hurts training. ReLU and GELU exhibit almost no vanishing gradient and can therefore better support the training and optimization of deep neural networks.
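
        The saturation behind this can be seen by comparing the gradients of Sigmoid and GELU as the input magnitude grows. A small illustrative sketch:

import torch

x = torch.tensor([0.0, 2.0, 5.0, 10.0], requires_grad=True)

# Sigmoid gradients shrink toward 0 as |x| grows (saturation)
torch.sigmoid(x).sum().backward()
print("Sigmoid gradients:", x.grad)

x.grad = None

# GELU gradients approach 1 for large positive x, so the signal is preserved
torch.nn.functional.gelu(x).sum().backward()
print("GELU gradients   :", x.grad)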

        The difference between ReLU and GELU lies in their shape and computational cost. ReLU is a very simple function: it returns 0 for negative inputs and the input itself for positive inputs, so it consists of a single piecewise linear transformation. However, ReLU has a drawback: when the input is negative, the output is always 0, which can cause neurons to "die" and thus reduce the expressive power of the model. The GELU function is a smooth, continuous curve that sits between a Sigmoid-style gate and ReLU in shape; being smoother than ReLU, it can alleviate the neuron-death problem to a certain extent. However, because GELU involves more complex calculations such as exponentials (via erf or tanh), it is usually slower than ReLU in practice, as the rough timing sketch below suggests.
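
        A rough way to see the cost difference is to time both activations on a large CPU tensor. This is a crude sketch; the actual numbers depend heavily on hardware, backend, and tensor size:

import time
import torch
import torch.nn.functional as F

x = torch.randn(10_000_000)

def avg_time(fn, runs=10):
    # Average wall-clock time of fn(x) over several runs
    start = time.perf_counter()
    for _ in range(runs):
        fn(x)
    return (time.perf_counter() - start) / runs

print("ReLU:", avg_time(F.relu))
print("GELU:", avg_time(F.gelu))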

        In summary, ReLU and GELU are both commonly used activation functions. Each has its own advantages and disadvantages, and they suit different types of neural networks and machine learning problems. Generally speaking, ReLU is more suitable for convolutional neural networks (CNNs), while GELU is more suitable for fully connected (feed-forward) networks.

Paper link:

[1606.08415] Gaussian Error Linear Units (GELUs), https://arxiv.org/abs/1606.08415

For more deep learning content, please visit my homepage. The following are quick links:

[Activation Function] Several activation functions you must know in deep learning: Sigmoid, Tanh, ReLU, LeakyReLU and ELU activation functions (2024 latest compilation) - CSDN Blog
