Cause of the 'Dying ReLU' Problem

Original address: https://www.quora.com/What-is-the-dying-ReLU-problem-in-neural-networks

Translator's note: After reading a number of Chinese blog posts on the advantages and disadvantages of activation functions, I found that very few of them explain the 'dying ReLU' phenomenon, so I turned to foreign forums for an answer. This translation is the result, and the explanation makes sense.


Assume the inputs to a neural network follow some fixed distribution. Then, for a fixed set of parameters w, the inputs to a particular ReLU unit (its pre-activations w·x) also follow some distribution. Suppose this ReLU's input follows a low-variance Gaussian centered at +0.1.

In this scenario:

  • The input to most ReLUs is positive, so
  • most inputs produce a positive value through the ReLU (the ReLU is open), so
  • most inputs let a gradient flow back through the ReLU during backpropagation, so
  • the weights w feeding the ReLU are generally updated by stochastic gradient descent (SGD); a small sketch of this scenario follows the list.
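
To make this concrete, here is a minimal sketch of the first scenario (the +0.1 mean, the 0.05 standard deviation, and the sample size are illustrative assumptions, not numbers from the original post):

```python
import numpy as np

# First scenario: ReLU inputs drawn from a low-variance Gaussian centered at +0.1.
rng = np.random.default_rng(0)
pre_act = rng.normal(loc=0.1, scale=0.05, size=10_000)   # ReLU inputs (pre-activations)

out = np.maximum(pre_act, 0.0)   # forward pass: ReLU output
grad_flows = pre_act > 0         # backward pass: gradient passes only where the input > 0

print(f"fraction of inputs where the ReLU is open: {grad_flows.mean():.3f}")  # ~0.98
```

Almost all inputs are positive, so the unit stays open and its weights keep receiving gradient updates.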

Now suppose that during one SGD step a huge gradient flows back through the ReLU. Since the ReLU is open, this huge gradient is passed on to the weights w, causing a huge change in w. That means the distribution of the ReLU's inputs changes as well; suppose it is now a low-variance Gaussian centered at -0.1.

In this scenario:

  • The input to most ReLUs is now negative, so
  • most inputs produce 0 through the ReLU (the ReLU is closed), so
  • for most inputs, no gradient flows back through the ReLU during backpropagation, so
  • the weights w feeding the ReLU are generally no longer updated by SGD; the sketch after this list demonstrates it.
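
The flipped scenario can be checked the same way. The PyTorch sketch below does not simulate the huge gradient step itself; the layer size, the 0.02 weight scale, and the -0.1 bias are assumptions chosen only so that the pre-activations land in a low-variance Gaussian centered at -0.1:

```python
import torch

torch.manual_seed(0)
lin = torch.nn.Linear(10, 1)
with torch.no_grad():
    lin.weight.mul_(0.02)   # shrink the weights -> low-variance pre-activations
    lin.bias.fill_(-0.1)    # shift the center of the pre-activations to -0.1

x = torch.randn(1000, 10)
out = torch.relu(lin(x))
out.sum().backward()

print((out > 0).float().mean())      # ~0: the ReLU is closed for essentially every sample
print(lin.weight.grad.abs().max())   # ~0: essentially no gradient reaches the weights
```

Because the gradient on the weights is (almost) exactly zero, SGD leaves them where they are, which is exactly the 'not updated' situation described above.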

What happened? The distribution of the ReLU's input shifted by only a small amount (a change of -0.2), yet it caused a qualitative change in the ReLU's behavior: the distribution crossed the 0 boundary, and the ReLU is now almost permanently closed. Worse, once the ReLU is closed, the weights w feeding it receive no gradient and are never updated, so the unit cannot recover. This is the so-called 'dying ReLU'.

(Translator: the original thread also discusses whether a dead neuron can be 'resurrected'; that part is not translated.)

Mathematically, this follows from the definition of ReLU:

r(x) = max(x, 0)

The derivative is as follows

∇_x r(x) = 1(x > 0)

So if the ReLU is closed during the forward pass (its input is negative), it is also closed during the backward pass: no gradient flows through it.
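
A small NumPy sketch of this forward/backward rule (the helper names are hypothetical, written here only to illustrate the formula above):

```python
import numpy as np

def relu_forward(x):
    # r(x) = max(x, 0)
    return np.maximum(x, 0.0)

def relu_backward(x, upstream_grad):
    # dr/dx is 1 where x > 0 and 0 otherwise, so a closed ReLU blocks the upstream gradient
    return upstream_grad * (x > 0)

x = np.array([-0.2, -0.1, 0.3])
print(relu_forward(x))                     # [0.  0.  0.3]
print(relu_backward(x, np.ones_like(x)))   # [0. 0. 1.]
```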

I'm not sure how often dying ReLUs happen in practice, but it is clearly something to watch for. Hopefully you can also see why a large learning rate may be the culprit here: during backpropagation, a large gradient update can push the weights w so far that the ReLU's input distribution falls below 0.
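
As a hedged illustration of the learning-rate point (a toy one-unit model with made-up data and step sizes, not anything from the original discussion), the run below shows a small learning rate converging normally, while a large one overshoots in a single step, pushes the pre-activation far below zero, and then never recovers because every subsequent gradient is zero:

```python
import torch

torch.manual_seed(0)
x = torch.full((32, 1), 1.0)   # constant positive input
y = torch.full((32, 1), 0.5)   # a target the unit can fit while staying open

for lr in (0.1, 20.0):
    w = torch.tensor([[2.0]], requires_grad=True)
    b = torch.tensor([0.0], requires_grad=True)
    for step in range(100):
        out = torch.relu(x @ w + b)
        loss = ((out - y) ** 2).mean()
        loss.backward()
        with torch.no_grad():
            w -= lr * w.grad   # plain SGD update
            b -= lr * b.grad
        w.grad.zero_()
        b.grad.zero_()
    pre_act = (x @ w + b)[0].item()
    print(f"lr={lr}: final pre-activation = {pre_act:.1f}, final loss = {loss.item():.3f}")

# lr=0.1 settles near a pre-activation of 0.5 with ~0 loss;
# lr=20.0 jumps to a large negative pre-activation on the first step and the unit stays dead.
```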
