Advantages of the cross-entropy loss function (reproduced)

When a saturating activation function such as the sigmoid is used and the loss is the mean squared error, the gradient of the loss with respect to the weights of the last layer is given by:

$$z = \sum_j w_j x_j + b, \qquad a = \sigma(z), \qquad C = \frac{1}{2}(a - y)^2$$

$$\frac{\partial C}{\partial w_j} = (a - y)\,\sigma'(z)\,x_j$$

$$\frac{\partial C}{\partial b} = (a - y)\,\sigma'(z)$$

As the formula shows, the gradient is proportional to the derivative of the activation function in the last layer. Therefore, if the initial output value is large, the derivative of the activation function is small (the sigmoid is saturated), so the magnitude of the whole gradient update is small and convergence takes a long time. If the initial output value is small, the update rate is larger and convergence is faster, so the learning speed is unstable: it depends on where the output starts. The gradient is also proportional to the deviation of the output from the true value.
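To make this concrete, here is a minimal numeric sketch (the single-neuron setup, variable names, and numbers are chosen for illustration and are not from the original post) showing how the MSE gradient collapses when the sigmoid starts out saturated:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Single neuron: a = sigmoid(w*x + b), MSE loss C = 0.5 * (a - y)**2.
# dC/dw = (a - y) * sigmoid'(z) * x,   dC/db = (a - y) * sigmoid'(z)
def mse_gradients(w, b, x, y):
    z = w * x + b
    a = sigmoid(z)
    sigma_prime = a * (1.0 - a)      # sigmoid'(z) = sigmoid(z) * (1 - sigmoid(z))
    dC_dw = (a - y) * sigma_prime * x
    dC_db = (a - y) * sigma_prime
    return dC_dw, dC_db

# Saturated start (illustrative values): the output is close to 1 while the
# target is 0, so sigmoid'(z) is tiny and both gradients are tiny.
print(mse_gradients(w=5.0, b=0.0, x=1.0, y=0.0))
```

With w = 5 the neuron starts at a ≈ 0.993 while the target is 0, yet both gradients come out around 0.007, so each update barely moves the weights.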

Now look at what happens when the loss is the cross-entropy:

$$C = -\big[\, y \ln a + (1 - y)\,\ln(1 - a) \,\big]$$

$$\frac{\partial C}{\partial w_j} = x_j\,(a - y)$$

$$\frac{\partial C}{\partial b} = a - y$$

In this case the gradient of the loss with respect to the last-layer weights no longer involves the derivative of the activation function; it is only proportional to the difference between the output and the true value, so convergence is faster. And since back-propagation is a chain of multiplications, the update of the entire weight matrix is accelerated.

The reason is that the factor $\sigma'(z)$ cancels when the cross-entropy is differentiated, since $\sigma'(z) = \sigma(z)\big(1 - \sigma(z)\big) = a(1 - a)$:

$$\frac{\partial C}{\partial w_j} = \frac{\partial C}{\partial a}\,\frac{\partial a}{\partial z}\,\frac{\partial z}{\partial w_j} = \frac{a - y}{a(1 - a)}\cdot a(1 - a)\cdot x_j = x_j\,(a - y)$$
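For comparison, the same saturated starting point under the cross-entropy loss gives a gradient of roughly a − y ≈ 0.99, about 150 times larger than the MSE gradient above. A minimal sketch (same illustrative single-neuron setup as before, not code from the original post):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Cross-entropy loss C = -[y*ln(a) + (1-y)*ln(1-a)] with a = sigmoid(w*x + b).
# The sigmoid'(z) factor cancels, so dC/dw = (a - y) * x and dC/db = (a - y).
def cross_entropy_gradients(w, b, x, y):
    z = w * x + b
    a = sigmoid(z)
    dC_dw = (a - y) * x
    dC_db = a - y
    return dC_dw, dC_db

# Same saturated start as before (a ~ 0.993, target 0): the gradient is
# ~0.993 instead of ~0.007, so learning is not slowed down by saturation.
print(cross_entropy_gradients(w=5.0, b=0.0, x=1.0, y=0.0))
```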

In addition, for multi-class classification the cross-entropy loss has an even simpler derivative: the loss depends only on the predicted probability of the correct class, and the derivative of the loss with respect to the input of the softmax activation is very easy to compute.
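A short sketch of this multi-class case (the logits and one-hot target below are made-up example values), showing that the loss picks out only the correct-class probability and that the gradient with respect to the softmax input is simply p − y:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))        # shift for numerical stability
    return e / e.sum()

z = np.array([2.0, 1.0, 0.1])        # example logits (illustrative only)
y = np.array([0.0, 1.0, 0.0])        # one-hot target, class 1 is correct

p = softmax(z)
loss = -np.log(p[1])                 # only the correct-class probability appears
grad_z = p - y                       # dL/dz for softmax + cross-entropy

print(loss, grad_z)
```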


----------------
Disclaimer: This is an original article by the CSDN blogger "without it, only hand-cooked Seoul", licensed under the CC 4.0 BY-SA agreement. Please attach the original source link and this statement when reposting.
Original link: https://blog.csdn.net/qq_42422981/article/details/90645074
