Attention mechanism + ReLU activation function: the Adaptively Parametric ReLU (APReLU) activation function

This article reviews some traditional activation functions and attention mechanisms, and then interprets an "activation function under an attention mechanism", namely the Adaptively Parametric Rectifier Linear Unit (APReLU).


1. Activation functions

The activation function is one of the core components of an artificial neural network; its role is to give the network its nonlinearity. We first review some of the most common activation functions, including the Sigmoid, Tanh, and ReLU activation functions, shown in the figure below.

The output ranges of the Sigmoid and Tanh activation functions are (0, 1) and (-1, 1), respectively, and their gradients never exceed one. When a network has many layers, it may therefore suffer from the vanishing gradient problem. The gradient of the ReLU activation function is either zero or one, which largely avoids both vanishing and exploding gradients, so ReLU has been widely used in recent years.
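For reference, here is a minimal NumPy sketch of these three activation functions (just an illustration, not tied to any particular framework); the comments note the gradient behaviour discussed above.

```python
import numpy as np

def sigmoid(x):
    # output in (0, 1); gradient sigmoid(x) * (1 - sigmoid(x)) never exceeds 0.25
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    # output in (-1, 1); gradient 1 - tanh(x)**2 never exceeds 1
    return np.tanh(x)

def relu(x):
    # gradient is 0 for x < 0 and 1 for x > 0, so it neither vanishes nor explodes
    return np.maximum(x, 0.0)
```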

However, the ReLU activation function still has a drawback. If, during training, all of the features arriving at a ReLU happen to be less than zero, its output is all zeros, and that part of the network can no longer be trained. To avoid this, researchers proposed the leaky ReLU activation function, which does not set negative features to zero but instead multiplies them by a small coefficient, such as 0.1 or 0.01.

In leaky ReLU, the value of this coefficient is set manually. A manually chosen coefficient may not be optimal, so Kaiming He et al. proposed the Parametric ReLU (PReLU) activation function, in which the coefficient becomes a trainable parameter that is learned by gradient descent together with the other parameters of the network. However, PReLU has one characteristic: once training is finished, its coefficient becomes a fixed value. In other words, the coefficient of PReLU is the same for all test samples.
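The following is a minimal PyTorch sketch of this family of units (not the authors' code; the class name, the per-channel parameterization, and the initial value 0.25 are illustrative choices). With a fixed slope `a` it behaves like leaky ReLU; by making `a` a trainable parameter it becomes a PReLU.

```python
import torch
import torch.nn as nn

class SimplePReLU(nn.Module):
    """f(x) = x for x >= 0 and f(x) = a * x for x < 0.

    In leaky ReLU the slope a is fixed by hand (e.g. 0.01 or 0.1); in PReLU it
    is a trainable parameter (here one slope per channel) learned by gradient
    descent together with the other network parameters."""

    def __init__(self, num_channels, init=0.25):
        super().__init__()
        self.a = nn.Parameter(torch.full((num_channels,), float(init)))

    def forward(self, x):  # x: (batch, channels, ...)
        a = self.a.view(1, -1, *([1] * (x.dim() - 2)))
        return torch.where(x >= 0, x, a * x)
```

Note that after training, `self.a` is frozen, so the same coefficient is applied to every test sample; this is exactly the limitation discussed next.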

We have now briefly introduced several common activation functions. What is wrong with them? Consider an artificial neural network that uses one of the activation functions above, or a combination of them: once training is complete, the nonlinear transformation it applies is identical for every test sample. That is, all test samples go through exactly the same nonlinear transformation, which is a rather inflexible approach.

As shown below, the scatter plot on the left represents the original feature space, the scatter plot on the right represents the high-level feature space learned by the artificial neural network, the dots and small squares represent samples of two different categories, and F, G and H denote nonlinear functions. All of these samples are transformed from the original feature space to the high-level feature space by the same nonlinear functions. In other words, the "=" in the figure means that the nonlinear transformations these samples experience are exactly the same.

So, can we set the parameters of the activation function individually according to the characteristics of each sample, so that each sample experiences its own nonlinear transformation? The APReLU activation function, introduced in the rest of this article, does exactly that.


2. Attention mechanisms

The APReLU activation function introduced in this article borrows ideas from the classic Squeeze-and-Excitation Network (SENet), a classic deep neural network built on the attention mechanism. SENet works as shown below:

Let us first explain the idea behind SENet. For many samples, the importance of the different feature channels is likely to vary. For example, for sample A, feature channel 1 may be very important while feature channel 2 is unimportant; for sample B, feature channel 1 may be unimportant while feature channel 2 is important. In this case, for sample A we should focus on feature channel 1 (i.e. give channel 1 a higher weight); conversely, for sample B we should focus on feature channel 2 (i.e. give channel 2 a higher weight).

To this end, SENet learns a set of weighting coefficients through a small fully connected network and uses them to weight the individual channels of the original feature map. In this way, each sample (whether a training or a test sample) obtains its own unique set of weights for its feature channels. This is in fact an attention mechanism: it notices the important feature channels and then assigns them higher weights. A minimal sketch of such a block follows.
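Below is a minimal PyTorch sketch of a Squeeze-and-Excitation block in this spirit; the reduction ratio of 16 is a commonly used default and is an assumption here, not something prescribed by this article.

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-Excitation: global average pooling ('squeeze'), a small
    fully connected bottleneck ending in a sigmoid ('excitation'), and a
    per-channel reweighting of the input feature map."""

    def __init__(self, channels, reduction=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):  # x: (batch, channels, H, W)
        w = x.mean(dim=(2, 3))   # squeeze: one statistic per channel
        w = self.fc(w)           # excitation: per-sample, per-channel weights in (0, 1)
        return x * w.unsqueeze(-1).unsqueeze(-1)  # reweight the feature channels
```

Because the weights are computed from the sample's own feature map, every sample receives its own set of channel weights.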


3. The Adaptively Parametric ReLU (APReLU) activation function

The APReLU activation function is, in essence, an integration of SENet and the PReLU activation function. In SENet, the set of weights learned by the small fully connected network is used to weight the individual feature channels. The APReLU activation function also obtains a set of weights through a small fully connected network, but uses these weights as the coefficients of a PReLU, i.e. as the slopes of the negative part. The basic principle of the APReLU activation function is shown in the figure below.

We can see that in the APReLU activation function, the form of the nonlinear transformation is exactly the same as in PReLU. The only difference is that the weighting coefficient applied to the negative part of the features is learned by a small fully connected network. When an artificial neural network uses the APReLU activation function, each sample can have its own unique weighting coefficient, i.e. its own unique nonlinear transformation (as shown below). At the same time, the input and output feature maps of APReLU have the same size, which means that APReLU can easily be embedded into existing deep learning algorithms.
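As an illustration, here is a minimal PyTorch sketch of this idea. It follows the description above (global statistics of the positive and negative parts feed a small fully connected network that outputs one negative-part slope per channel), but the exact layer sizes and the use of batch normalization are my assumptions and may differ from the configuration in the paper.

```python
import torch
import torch.nn as nn

class APReLU(nn.Module):
    """Adaptively Parametric ReLU (sketch): a PReLU whose negative-part slope
    is predicted per sample and per channel by a small SENet-style fully
    connected network, so each sample gets its own nonlinear transformation.
    Input and output feature maps have the same size."""

    def __init__(self, channels):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(2 * channels, channels),
            nn.BatchNorm1d(channels),
            nn.ReLU(inplace=True),
            nn.Linear(channels, channels),
            nn.BatchNorm1d(channels),
            nn.Sigmoid(),
        )

    def forward(self, x):  # x: (batch, channels, H, W)
        pos = torch.clamp(x, min=0)   # positive part of the features
        neg = torch.clamp(x, max=0)   # negative part of the features
        # global average pooling of both parts -> (batch, 2 * channels)
        stats = torch.cat([pos.mean(dim=(2, 3)), neg.mean(dim=(2, 3))], dim=1)
        alpha = self.fc(stats)        # per-sample, per-channel slope in (0, 1)
        alpha = alpha.unsqueeze(-1).unsqueeze(-1)
        return pos + alpha * neg      # same form as PReLU, but with adaptive slopes
```

Unlike PReLU, the slope here is recomputed from each sample's own features at inference time, so two different test samples generally experience two different nonlinear transformations.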

In summary, the APReLU activation function allows each sample to have its own unique nonlinear transformation. It provides a more flexible way of applying nonlinearity and has the potential to improve the accuracy of pattern recognition.

 


References

Zhao M, Zhong S, Fu X, et al. Deep residual networks with adaptively parametric rectifier linear units for fault diagnosis[J]. IEEE Transactions on Industrial Electronics, 2020, DOI: 10.1109/TIE.2020.2972458. 

https://ieeexplore.ieee.org/document/8998530/

 
