Source: https://www.jianshu.com/p/c02a1fbffad6
A Simple Derivation of the Cross-Entropy Loss Gradient through Softmax
I wanted to write a guide to the softmax derivation, not only to clarify my own thinking, but hopefully to help others as well.
Softmax is frequently used as the output layer of neural networks for classification tasks. Deriving its gradient is a key step in back-propagation; working through the derivation gives a deeper understanding of back-propagation and more insight into how gradients propagate.
The softmax function
The softmax ("soft maximum") function is generally used in neural networks as the output layer for classification tasks. Its output can be interpreted as the probability of each class. For example, in a three-class task, softmax turns the network's raw outputs into probabilities for the three classes, according to their relative sizes, and these probabilities sum to 1.
The softmax function has the form:

S_i = e^{z_i} / Σ_k e^{z_k}

where S_i is the softmax value for the i-th output neuron and z_i is that neuron's raw output.
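As a quick sanity check, here is a minimal plain-Python sketch of softmax (the post itself shows no code, so the function name and values are my own). Subtracting max(z) before exponentiating is a standard trick: it leaves the result unchanged but avoids overflow in exp().

```python
import math

def softmax(z):
    """Numerically stable softmax: S_i = e^{z_i} / sum_k e^{z_k}.
    Subtracting max(z) does not change the ratio but keeps exp() bounded."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([1.0, 2.0, 3.0])
print(probs)        # three probabilities, largest input gets the largest share
print(sum(probs))   # sums to 1
```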
OK, before starting the derivation, let's fix a consistent notation for the network, so that no symbol appears out of nowhere halfway through and derails the derivation.
First, the output of a neuron. For a single neuron in the output layer:

z_i = Σ_j w_{ij} x_j + b_i

where w_{ij} is the weight connecting the j-th input to the i-th neuron and b_i is a bias term. z_i denotes the i-th raw output of the network.
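The weighted sum above can be sketched in a few lines of Python (the weights, inputs, and biases below are made-up illustration values, not from the post):

```python
def neuron_outputs(W, x, b):
    """z_i = sum_j W[i][j] * x[j] + b[i] for each output neuron i."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) + b_i
            for row, b_i in zip(W, b)]

# 3 output neurons, 2 inputs (arbitrary example values)
W = [[0.2, -0.5], [1.0, 0.3], [-0.7, 0.8]]
x = [1.0, 2.0]
b = [0.1, 0.0, -0.1]
z = neuron_outputs(W, x, b)
print(z)   # raw outputs z_1, z_2, z_3, before softmax
```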
Adding the softmax function on top of this output, it becomes:

a_i = e^{z_i} / Σ_k e^{z_k}

where a_i is the value of the i-th output after softmax is applied.
The loss function
Back-propagation requires a loss function, which measures the error between the network's estimate and the true value; knowing the error tells us how to adjust the network's weights.
The loss function can take many forms. Here we use the cross-entropy loss, mainly because its derivative turns out simple and cheap to compute, and because cross-entropy avoids the slow-learning problem that some other loss functions suffer from. The cross-entropy loss is:

C = -Σ_i y_i ln a_i

where y_i is the true label for class i.
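In code, the loss is a one-liner (example values are my own; with a one-hot y, only the true class's term survives the sum):

```python
import math

def cross_entropy(y, a):
    """C = -sum_i y_i * ln(a_i). With one-hot y this reduces to
    -ln(a_true_class)."""
    return -sum(y_i * math.log(a_i) for y_i, a_i in zip(y, a))

y = [0.0, 1.0, 0.0]          # true class is index 1 (one-hot)
a = [0.2, 0.7, 0.1]          # softmax outputs
print(cross_entropy(y, a))   # -ln(0.7)
```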
There are several levels of function nesting here, but don't worry: the derivation below proceeds step by step. I strongly recommend working it out on paper; sometimes reading alone just leaves you confused, while deriving it yourself is far more conducive to understanding.
Final preparations

When I first tried to follow the softmax derivation, I sometimes got stuck halfway, mainly because I had forgotten some basic differentiation rules. So, for anyone who has likewise forgotten them, here are the two rules the derivation needs: the derivative of the exponential, (e^x)' = e^x, and the quotient rule, (f/g)' = (f'g - f g') / g^2.
The derivation process
Well then, let's get started.
First of all, we must be clear about what we are looking for: the gradient of the loss with respect to the raw neuron output z_i, namely ∂C/∂z_i.
By the chain rule for composite functions:

∂C/∂z_i = Σ_j (∂C/∂a_j)(∂a_j/∂z_i)

You may wonder why a_j appears here rather than a_i. Look at the softmax formula again: because its denominator contains every z, each output a_j depends on z_i, not just a_i. All of them must therefore be included in the sum, and computing ∂a_j/∂z_i later requires treating the cases i = j and i ≠ j separately.
Let's work through the two factors one by one. The first is straightforward:

∂C/∂a_j = ∂(-Σ_k y_k ln a_k)/∂a_j = -y_j / a_j
The second factor is a bit more involved, so we split it into two cases.

If i = j, the quotient rule gives:

∂a_j/∂z_i = ∂/∂z_i (e^{z_i} / Σ_k e^{z_k}) = (e^{z_i} Σ_k e^{z_k} - e^{z_i} e^{z_i}) / (Σ_k e^{z_k})^2 = a_i (1 - a_i)

If i ≠ j, only the denominator depends on z_i:

∂a_j/∂z_i = ∂/∂z_i (e^{z_j} / Σ_k e^{z_k}) = -e^{z_j} e^{z_i} / (Σ_k e^{z_k})^2 = -a_j a_i
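These two cases together form the Jacobian of softmax. A small sketch can verify them against a finite-difference estimate (function names and test values are my own):

```python
import math

def softmax(z):
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    s = sum(exps)
    return [e / s for e in exps]

def softmax_jacobian(a):
    """J[i][j] = da_j/dz_i, from the two cases just derived:
    a_i*(1 - a_i) when i == j, and -a_j*a_i when i != j."""
    n = len(a)
    return [[a[i] * (1.0 - a[i]) if i == j else -a[j] * a[i]
             for j in range(n)] for i in range(n)]

z = [0.5, 1.5, -0.3]
a = softmax(z)
J = softmax_jacobian(a)

# Numerically check one off-diagonal entry, da_0/dz_1:
eps = 1e-6
z2 = z[:]
z2[1] += eps
num = (softmax(z2)[0] - a[0]) / eps
print(J[1][0], num)   # analytic vs. numerical, should agree closely
```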
OK, now let's combine the pieces:

∂C/∂z_i = Σ_j (∂C/∂a_j)(∂a_j/∂z_i)
        = (-y_i / a_i) · a_i (1 - a_i) + Σ_{j≠i} (-y_j / a_j)(-a_j a_i)
        = -y_i + y_i a_i + Σ_{j≠i} y_j a_i
        = -y_i + a_i Σ_j y_j
The final result looks much simpler. Finally, for a classification problem the labels are one-hot: exactly one y_i equals 1 and all the others are 0, so Σ_j y_j = 1 and the gradient reduces to:

∂C/∂z_i = a_i - y_i
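This elegant result is easy to verify end to end: compute a − y analytically, then compare against a central-difference estimate of ∂C/∂z_i (a minimal sketch with made-up values; names are my own):

```python
import math

def softmax(z):
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    s = sum(exps)
    return [e / s for e in exps]

def loss(z, y):
    """Cross-entropy of softmax(z) against labels y."""
    a = softmax(z)
    return -sum(y_i * math.log(a_i) for y_i, a_i in zip(y, a))

z = [0.2, -1.0, 1.3]
y = [0.0, 0.0, 1.0]        # one-hot label: true class is index 2

a = softmax(z)
grad = [a_i - y_i for a_i, y_i in zip(a, y)]   # the result just derived

# Numerical check with central differences:
eps = 1e-6
for i in range(len(z)):
    zp = z[:]; zp[i] += eps
    zm = z[:]; zm[i] -= eps
    num = (loss(zp, y) - loss(zm, y)) / (2 * eps)
    print(grad[i], num)    # the two columns should agree
```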