sigmoid and softmax
Both sigmoid and softmax are commonly used as the output unit of a neural network.
principle
sigmoid
A sigmoid output unit predicts the value of a binary variable $y$, and is defined as follows:
$$\hat{y}=\sigma(\omega^{T}h+b)=\frac{1}{1+\exp\left(-(\omega^{T}h+b)\right)}$$
Maximum likelihood is typically used for learning, because the maximum-likelihood cost function is $-\log P(y\mid x)$: the $\log$ in the cost cancels the $\exp$ inside the sigmoid, so the gradient saturates (becomes very small) only when the argument of $\sigma$ is very large in magnitude.
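To illustrate this cancellation, here is a minimal sketch (function names are illustrative, not from the text) that computes the sigmoid negative log-likelihood through softplus, so that the $\log$ and $\exp$ cancel analytically and nothing overflows:

```python
import math

def sigmoid(x):
    """Numerically stable logistic sigmoid: sigma(x) = 1 / (1 + exp(-x))."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def nll_sigmoid(z, y):
    """-log P(y | x) for a sigmoid output with logit z and label y in {0, 1}.

    Written via softplus(s) = log(1 + exp(s)): the log cancels the exp
    inside the sigmoid, so the loss stays well-behaved for extreme z.
    """
    s = z if y == 0 else -z                     # loss = softplus(s)
    return max(s, 0.0) + math.log1p(math.exp(-abs(s)))
```

With $z=-100$ and $y=1$ the loss is roughly $100$ and the gradient is large; only when $z$ is very large with the correct sign does the loss, and hence the gradient, approach zero.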
softmax
The softmax output unit is the most common choice for multi-class classification.
$$z=W^{T}h+b$$
$$y_{i}=\mathrm{softmax}(z)_{i}=\frac{\exp(z_{i})}{\sum_{j}\exp(z_{j})}$$
Similarly, the $\log$ in the log-likelihood cancels the $\exp$ in softmax; other objective functions (such as squared error) lack this cancellation and can fail to learn effectively.
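A sketch of the softmax itself; subtracting the maximum logit exploits the shift invariance $\mathrm{softmax}(z)=\mathrm{softmax}(z-c)$ so that $\exp$ never overflows (the function name is illustrative):

```python
import math

def softmax(z):
    """Softmax over a list of logits, with the max-subtraction trick.

    Softmax is shift-invariant, so subtracting max(z) changes nothing
    mathematically but keeps every exp argument <= 0, avoiding overflow.
    """
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    s = sum(exps)
    return [e / s for e in exps]
```

Even `softmax([1000.0, 1000.0])` evaluates cleanly to `[0.5, 0.5]`, where the naive formula would overflow.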
Cross entropy and softmax loss
Cross-entropy measures the distance between the predicted distribution and the true distribution of the samples: the closer the two distributions, the smaller the cross-entropy. The formula is as follows:
$$H(p,q)=-\sum_{i}^{n}p_{i}\log(q_{i})$$
where $p$ is the true distribution of the samples, $q$ is the predicted distribution, and $n$ is the number of samples. In cross-entropy form, the softmax loss function can be written as the formula below, where $\hat{y_{i}}$ is the true label of the training data:
$$L=-\sum_{i}^{n}\hat{y_{i}}\log(y_{i})$$
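The loss $L$ above can be computed directly from a one-hot true label $\hat{y}$ and the predicted probabilities $y$; a minimal sketch (names are illustrative):

```python
import math

def cross_entropy(y_true, y_pred):
    """L = -sum_i y_true[i] * log(y_pred[i]).

    y_true is the one-hot true label (hat{y} in the text) and y_pred
    the predicted softmax probabilities; terms where y_true[i] == 0
    vanish, so only the log-probability of the true class contributes.
    """
    return -sum(t * math.log(p) for t, p in zip(y_true, y_pred) if t > 0)
```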
softmax backpropagation
By the chain rule:
$$\frac{\partial L}{\partial z_{i}}=\sum_{j}\frac{\partial L}{\partial y_{j}}\frac{\partial y_{j}}{\partial z_{i}}$$
where the derivative of $L$ with respect to $y_{j}$ is:
$$\frac{\partial L}{\partial y_{j}}=\frac{\partial\left[-\sum_{k}\hat{y_{k}}\log(y_{k})\right]}{\partial y_{j}}=-\frac{\hat{y_{j}}}{y_{j}}$$
The derivative of $y_{j}$ with respect to $z_{i}$ must be considered in two cases:
- When $j = i$:
$$\frac{ \partial y_{j}} {\partial z_{i}}=\frac{ \partial \left[ \frac{e^{z_{i}}}{\sum_{k}e^{z_{k}}}\right]} {\partial z_{i}}$$
$$=\frac{e^{z_{i}}\sum_{k}e^{z_{k}}-(e^{z_{i}})^{2}}{(\sum_{k}e^{z_{k}})^{2}}$$
$$=\frac{e^{z_{i}}}{\sum_{k}e^{z_{k}}}\left(1-\frac{e^{z_{i}}}{\sum_{k}e^{z_{k}}}\right)$$
$$=y_{i}(1-y_{i})$$
- When $j \neq i$:
$$\frac{\partial y_{j}}{\partial z_{i}}=\frac{\partial\left[\frac{e^{z_{j}}}{\sum_{k}e^{z_{k}}}\right]}{\partial z_{i}}$$
$$=\frac{0-e^{z_{j}}e^{z_{i}}}{(\sum_{k}e^{z_{k}})^{2}}$$
$$=-y_{j}y_{i}$$
Therefore, substituting both cases into the chain rule (and using $\sum_{j}\hat{y_{j}}=1$):
$$\frac{\partial L}{\partial z_{i}}=-\hat{y_{i}}(1-y_{i})+\sum_{j\neq i}\hat{y_{j}}y_{i}=y_{i}\sum_{j}\hat{y_{j}}-\hat{y_{i}}=y_{i}-\hat{y_{i}}$$
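The resulting gradient $\partial L/\partial z_{i}=y_{i}-\hat{y_{i}}$ can be checked numerically against a central finite difference; a self-contained sketch (helper names are illustrative):

```python
import math

def softmax(z):
    """Softmax with max-subtraction for numerical stability."""
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [x / s for x in e]

def loss(z, y_true):
    """Cross-entropy L = -sum_i y_true[i] * log(softmax(z)[i])."""
    y = softmax(z)
    return -sum(t * math.log(p) for t, p in zip(y_true, y) if t > 0)

def analytic_grad(z, y_true):
    """dL/dz_i = y_i - y_true_i, combining the two cases above."""
    return [p - t for p, t in zip(softmax(z), y_true)]

def numeric_grad(z, y_true, eps=1e-6):
    """Central finite difference, for verification only."""
    g = []
    for i in range(len(z)):
        zp, zm = z[:], z[:]
        zp[i] += eps
        zm[i] -= eps
        g.append((loss(zp, y_true) - loss(zm, y_true)) / (2 * eps))
    return g
```

For any logits and one-hot label the two gradients agree to within finite-difference error, confirming the derivation.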