An intuitive look at why classification problems use the cross-entropy loss instead of the mean squared error loss

Blog: blog.shinelee.me | cnblogs | CSDN

Cross-entropy loss and mean squared error loss

In classification problems, the final layer of a typical network is a softmax layer, as in the figure from the link below (traditional machine learning models can be viewed analogously):

https://stats.stackexchange.com/questions/273465/neural-network-softmax-activation

Suppose there are \(K\) classes in total. The network outputs \([\hat{y}_1, \dots, \hat{y}_K]\), one probability per class, and the label is \([y_1, \dots, y_K]\). For a sample belonging to class \(p\), the label has \(y_p = 1\), while \(y_1, \dots, y_{p-1}, y_{p+1}, \dots, y_K\) are all 0.

For this sample, the cross-entropy loss is
\[\begin{aligned} L &= -(y_1 \log \hat{y}_1 + \dots + y_K \log \hat{y}_K) \\ &= -y_p \log \hat{y}_p \\ &= -\log \hat{y}_p \end{aligned}\]
and the mean squared error (MSE) loss is
\[\begin{aligned} L &= (y_1 - \hat{y}_1)^2 + \dots + (y_K - \hat{y}_K)^2 \\ &= (1 - \hat{y}_p)^2 + (\hat{y}_1^2 + \dots + \hat{y}_{p-1}^2 + \hat{y}_{p+1}^2 + \dots + \hat{y}_K^2) \end{aligned}\]
The loss over \(m\) samples is
\[\ell = \frac{1}{m} \sum_{i=1}^{m} L_i\]
To compare the cross-entropy loss with the mean squared error loss, it is enough to examine the single-sample loss; a quick numeric illustration follows, and the analysis then proceeds from two angles.
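
As a concrete illustration, here is a minimal NumPy sketch (not code from the original post; the class count and values are illustrative) evaluating the two per-sample losses above on a one-hot label:

```python
import numpy as np

# Minimal sketch: per-sample cross-entropy and MSE for a one-hot label.
# K = 3 classes, true class p = 0; the numbers are illustrative only.
y = np.array([1.0, 0.0, 0.0])        # one-hot label, y_p = 1
y_hat = np.array([0.7, 0.2, 0.1])    # softmax output, sums to 1

L_ce = -np.sum(y * np.log(y_hat))    # reduces to -log(y_hat[0]) ≈ 0.357
L_mse = np.sum((y - y_hat) ** 2)     # (1 - 0.7)^2 + 0.2^2 + 0.1^2 = 0.14
print(L_ce, L_mse)

# For m samples, the batch loss is simply the mean of the per-sample losses L_i.
```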

The loss-function angle

The loss function is the baton that directs learning: parameters that make the loss smaller are better parameters.

Therefore, the choice and design of the loss function should express the properties and preferences you want the model to have.

Comparing the cross-entropy loss with the mean squared error loss, both attain their minimum value of 0 when \(\hat{y}_p = y_p = 1\). In practice, however, \(\hat{y}_p\) can only approach 1, never reach it exactly. When \(\hat{y}_p < 1\):

  • The cross-entropy loss involves only the labeled class: the closer \(\hat{y}_p\) is to 1, the better.
  • The mean squared error loss depends not only on \(\hat{y}_p\) but also on the other terms; it prefers \(\hat{y}_1, \dots, \hat{y}_{p-1}, \hat{y}_{p+1}, \dots, \hat{y}_K\) to be as equal as possible, attaining its minimum when each equals \(\frac{1-\hat{y}_p}{K-1}\) (see the short derivation after this list).
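
A short justification of that last claim, which the argument above takes for granted: write each remaining probability as the average plus a deviation \(\epsilon_k\), with \(\sum_{k \neq p} \epsilon_k = 0\). Then
\[\sum_{k \neq p} \hat{y}_k^2 = \sum_{k \neq p} \left( \frac{1-\hat{y}_p}{K-1} + \epsilon_k \right)^2 = \frac{(1-\hat{y}_p)^2}{K-1} + \sum_{k \neq p} \epsilon_k^2 \ge \frac{(1-\hat{y}_p)^2}{K-1},\]
with equality exactly when every \(\epsilon_k = 0\), i.e. when the remaining probability mass \(1-\hat{y}_p\) is spread evenly over the other \(K-1\) classes.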

In classification problems, we lack a prior on the correlations between categories.

Although we know, for instance, that a "cat" is more similar to a "tiger" than to a "dog", such inter-class relationships are hard to quantify when samples are labeled, so labels are simply one-hot.

In this setting, the mean squared error loss can give the wrong guidance. For example, in a 3-class problem over cat, tiger, and dog with label \([1, 0, 0]\) (cat), the mean squared error considers the prediction \([0.8, 0.1, 0.1]\) better than \([0.8, 0.15, 0.05]\), i.e. it treats an even split as better than a skewed one, which contradicts our common sense.
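
A quick numeric check of this example (a sketch, not code from the original post):

```python
import numpy as np

# Classes are (cat, tiger, dog); the label is "cat".
label = np.array([1.0, 0.0, 0.0])
even = np.array([0.80, 0.10, 0.10])    # spreads the remaining mass evenly
skewed = np.array([0.80, 0.15, 0.05])  # leans toward the similar class "tiger"

def mse(y, y_hat):
    return np.sum((y - y_hat) ** 2)

def cross_entropy(y, y_hat):
    return -np.sum(y * np.log(y_hat))

print(mse(label, even), mse(label, skewed))              # 0.060 < 0.065: MSE prefers the even split
print(cross_entropy(label, even), cross_entropy(label, skewed))  # both ≈ 0.223: CE is indifferent
```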

For the cross-entropy loss, since the inter-class similarity matrix is hard to quantify anyway, it is reasonable to focus only on the class the sample belongs to and simply require \(\hat{y}_p\) to be as close to 1 as possible, which appears more sensible here.

The softmax backpropagation angle

The role of softmax is to map several real numbers in \((-\infty, +\infty)\) to values in \((0, 1)\) that sum to 1, giving them a probabilistic interpretation.

Let the input of the softmax function be \(z\) and its output be \(\hat{y}\). For node \(p\),
\[\hat{y}_p = \frac{e^{z_p}}{\sum_{k=1}^{K} e^{z_k}}\]
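
A minimal, numerically stable softmax sketch in NumPy (an assumed implementation, not code from the original post):

```python
import numpy as np

def softmax(z):
    """Map real-valued scores to probabilities in (0, 1) that sum to 1."""
    z = np.asarray(z, dtype=float)
    e = np.exp(z - z.max())      # shifting by the max changes nothing (softmax is shift-invariant)
    return e / e.sum()

z = np.array([2.0, 1.0, -1.0])
y_hat = softmax(z)
print(y_hat, y_hat.sum())        # e.g. [0.705 0.259 0.035], sums to 1.0
```
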
\(\hat{y}_p\) depends not only on \(z_p\) but also on \(\{z_k \mid k \neq p\}\). Looking only at \(z_p\), we have
\[\frac{\partial \hat{y}_p}{\partial z_p} = \hat{y}_p (1 - \hat{y}_p)\]
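
For completeness (the original states this derivative without derivation), it follows from the quotient rule applied to the softmax expression above:
\[\frac{\partial \hat{y}_p}{\partial z_p} = \frac{e^{z_p} \sum_{k=1}^{K} e^{z_k} - e^{z_p} \cdot e^{z_p}}{\left( \sum_{k=1}^{K} e^{z_k} \right)^2} = \hat{y}_p - \hat{y}_p^2 = \hat{y}_p (1 - \hat{y}_p)\]
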
\(\hat{y}_p\) is the probability assigned to the correct class: 0 means the classification is completely wrong, and the closer it is to 1, the more correct. By the chain rule, the derivative of the loss with respect to the weights feeding into \(z_p\) contains the factor \(\hat{y}_p (1 - \hat{y}_p)\). When \(\hat{y}_p = 0\) the classification is wrong, yet this factor is 0 and the weights are not updated, which is clearly undesirable: the worse the classification, the more the weights should be updated.

For the cross-entropy loss,
\[\frac{\partial L}{\partial \hat{y}_p} = -\frac{1}{\hat{y}_p}\]
so
\[\frac{\partial L}{\partial z_p} = \frac{\partial L}{\partial \hat{y}_p} \cdot \frac{\partial \hat{y}_p}{\partial z_p} = -\frac{1}{\hat{y}_p} \cdot \hat{y}_p (1 - \hat{y}_p) = \hat{y}_p - 1\]
The \(\hat{y}_p\) in the factor \(\hat{y}_p (1 - \hat{y}_p)\) is cancelled exactly, so the situation above cannot occur; moreover, the closer \(\hat{y}_p\) is to 1, the closer the derivative is to 0, i.e. the more correct the classification, the less the weights need to be updated. This matches our expectation.
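
A small finite-difference check of this gradient (a sketch using the softmax defined earlier; the inputs are illustrative):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def ce_loss(z, p):
    return -np.log(softmax(z)[p])   # cross-entropy for true class p

z, p, eps = np.array([2.0, 1.0, -1.0]), 0, 1e-6
analytic = softmax(z)[p] - 1.0      # the closed form derived above

z_plus, z_minus = z.copy(), z.copy()
z_plus[p] += eps
z_minus[p] -= eps
numeric = (ce_loss(z_plus, p) - ce_loss(z_minus, p)) / (2 * eps)

print(analytic, numeric)            # the two values should agree to several decimals
```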

For the mean squared error loss, on the other hand,
\[\frac{\partial L}{\partial \hat{y}_p} = -2(1 - \hat{y}_p) = 2(\hat{y}_p - 1)\]
so (considering, as before, only the dependence through \(\hat{y}_p\))
\[\frac{\partial L}{\partial z_p} = \frac{\partial L}{\partial \hat{y}_p} \cdot \frac{\partial \hat{y}_p}{\partial z_p} = -2 \hat{y}_p (1 - \hat{y}_p)^2\]
Clearly, the situation described above does occur: when \(\hat{y}_p = 0\) the classification is wrong, yet the derivative is 0 and the weights are not updated.
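
Evaluating the two gradient expressions derived above across values of \(\hat{y}_p\) (a simple sketch that just plugs into the formulas) makes the difference concrete:

```python
# dL/dz_p per the formulas derived above:
#   cross-entropy:       y_hat_p - 1
#   mean squared error: -2 * y_hat_p * (1 - y_hat_p)^2
for y_hat_p in [0.01, 0.1, 0.5, 0.9, 0.99]:
    grad_ce = y_hat_p - 1.0
    grad_mse = -2.0 * y_hat_p * (1.0 - y_hat_p) ** 2
    print(f"y_hat_p={y_hat_p:.2f}  CE grad={grad_ce:+.3f}  MSE grad={grad_mse:+.4f}")
# Near y_hat_p = 0 (a badly wrong prediction) the CE gradient stays close to -1,
# while the MSE gradient is close to 0, so MSE barely updates the weights.
```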

To sum up, for classification problems, whether viewed from the loss-function angle or from the softmax backpropagation angle, the cross-entropy loss is a better choice than the mean squared error loss.



Originally published at www.cnblogs.com/shine-lee/p/12032066.html