Building a simple model
gradient descent method
Expanding f in a first-order Taylor series around w_k shows that, for a sufficiently small learning rate, f(w_{k+1}) < f(w_k), i.e., each gradient-descent step decreases the objective.
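A minimal sketch of the standard argument, using the usual update rule with learning rate η (a symbol introduced here, not in the original notes):

$$
w_{k+1} = w_k - \eta \nabla f(w_k), \qquad
f(w_{k+1}) \approx f(w_k) + \nabla f(w_k)^{\top}(w_{k+1}-w_k)
= f(w_k) - \eta \,\lVert \nabla f(w_k) \rVert^2 \le f(w_k)
$$

So for a sufficiently small η, each gradient-descent step does not increase f.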
BP algorithm based on the simple model
Define the objective function:
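The notes do not show the formula itself; a squared-error objective over the N training samples is a typical choice for this kind of model and would read (this specific form is an assumption, not necessarily the author's):

$$
E(w) = \frac{1}{2}\sum_{i=1}^{N}\bigl(y_i - \hat{y}_i(w)\bigr)^2
$$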
algorithm flow:
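A minimal runnable sketch of the flow for a one-hidden-layer network with sigmoid activations and a squared-error objective; the layer sizes, learning rate, and toy data are illustrative assumptions, not values from the original notes.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Toy data (assumed for illustration): 2 inputs, 1 target per sample
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = np.array([[0.], [1.], [1.], [0.]])

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(2, 3)), np.zeros(3)   # input -> hidden
W2, b2 = rng.normal(size=(3, 1)), np.zeros(1)   # hidden -> output
eta = 0.5                                       # learning rate (assumed)

for step in range(10000):
    # Forward pass
    h = sigmoid(X @ W1 + b1)        # hidden activations
    y_hat = sigmoid(h @ W2 + b2)    # network output

    # Backward pass for E = 1/2 * sum((y_hat - y)^2):
    # propagate the error from the output layer back to the hidden layer
    delta_out = (y_hat - y) * y_hat * (1 - y_hat)
    delta_hid = (delta_out @ W2.T) * h * (1 - h)

    # Gradient-descent updates of all parameters
    W2 -= eta * h.T @ delta_out
    b2 -= eta * delta_out.sum(axis=0)
    W1 -= eta * X.T @ delta_hid
    b1 -= eta * delta_hid.sum(axis=0)

print(np.round(y_hat, 2))   # predictions after training
```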
activation function
step function
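Its standard form, with the threshold at 0 (a common convention):

$$
\mathrm{step}(x) = \begin{cases} 1, & x \ge 0 \\ 0, & x < 0 \end{cases}
$$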
sigmoid
Obtained by smoothing the step function so that it becomes continuous and differentiable.
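The usual form and its derivative:

$$
\sigma(x) = \frac{1}{1+e^{-x}}, \qquad \sigma'(x) = \sigma(x)\bigl(1-\sigma(x)\bigr)
$$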
tanh
For both of the functions above, when x is very large the output barely changes and is bounded above. As a result, when a neuron's computed input x is large, the signal passed backward through it is strongly compressed, which is the so-called vanishing-gradient problem. In addition, exploding gradients can also occur.
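A small numeric illustration of this saturation effect with the sigmoid; the input value and the number of layers are arbitrary choices for demonstration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1 - s)

# At large |x| the sigmoid saturates and its gradient is almost zero.
print(sigmoid_grad(0.0))    # 0.25  (the maximum possible value)
print(sigmoid_grad(10.0))   # ~4.5e-05

# Backpropagating through many saturated layers multiplies these small
# factors together, so the signal reaching early layers nearly vanishes.
print(sigmoid_grad(10.0) ** 10)   # ~3.7e-44
```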
ReLU
For this function, when x < 0 the output is set to 0; that is, a neuron whose computed value is negative is deactivated, which reduces the amount of computation during training. Only neurons with x > 0 remain active, and no upper bound is imposed, which helps avoid vanishing and exploding gradients, while the resulting sparsity helps prevent overfitting.
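The function and its derivative:

$$
\mathrm{ReLU}(x) = \max(0, x), \qquad
\mathrm{ReLU}'(x) = \begin{cases} 1, & x > 0 \\ 0, & x < 0 \end{cases}
$$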
Leaky ReLU
Unlike ReLU, it does not completely deactivate neurons with x < 0; it gives them a small negative-side slope, so their activity is merely reduced.
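A common form, where the small slope α (e.g. 0.01) is a typical choice rather than a value from the notes:

$$
\mathrm{LeakyReLU}(x) = \begin{cases} x, & x > 0 \\ \alpha x, & x \le 0 \end{cases}
$$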
General BP algorithm
(The partial derivatives are computed from the output layer backward toward the input layer, by repeated application of the chain rule.)
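For example, for a weight w^{(1)} in the first layer of a two-layer network, the chain rule unrolls from the output backward (scalar case for simplicity; this generic notation is assumed, not from the notes):

$$
\frac{\partial E}{\partial w^{(1)}}
= \frac{\partial E}{\partial a^{(2)}}
\cdot \frac{\partial a^{(2)}}{\partial z^{(2)}}
\cdot \frac{\partial z^{(2)}}{\partial a^{(1)}}
\cdot \frac{\partial a^{(1)}}{\partial z^{(1)}}
\cdot \frac{\partial z^{(1)}}{\partial w^{(1)}}
$$

where z^{(l)} is the weighted input of layer l and a^{(l)} its activation; each factor is obtained starting from the output-layer error and moving one layer back at a time.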