Loss function (Loss)

If we define a machine learning model, such as a three-layer neural network, we need it to fit the provided training data as closely as possible. But how do we judge whether the model fits the data well enough? We need a corresponding measure of its goodness of fit, and the function used for this is called the loss function. When the value of the loss function decreases, we consider the model to have taken another step toward fitting the data. The model fits the training data set best when the loss reaches its minimum, that is, when the average value of the loss function over the given data set is minimized.
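As a rough sketch of this idea (the model, data, and squared-error loss below are made up purely for illustration), the fit of a model on a data set can be scored by its average loss:

```python
import numpy as np

def average_loss(model_fn, xs, ys, loss_fn):
    """Average per-example loss of model_fn over a data set."""
    losses = [loss_fn(y, model_fn(x)) for x, y in zip(xs, ys)]
    return float(np.mean(losses))

# Toy data that follows y = 2x, and an imperfect linear model.
xs = np.array([1.0, 2.0, 3.0, 4.0])
ys = np.array([2.0, 4.0, 6.0, 8.0])
model = lambda x: 1.9 * x                        # hypothetical model
squared_error = lambda y, y_hat: (y - y_hat) ** 2

print(average_loss(model, xs, ys, squared_error))  # > 0; shrinks as the model improves
```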

Cross Entropy loss function (Cross Entropy)

In physics, "entropy" is used to describe the degree of disorder exhibited by a thermodynamic system. Shannon introduced this concept into the field of information theory and proposed the concept of "information entropy" to measure the uncertainty of information through a logarithmic function.

Cross entropy is an important concept in information theory, mainly used to measure the difference between two probability distributions. Assume $p$ and $q$ are two probability distributions of the data $x$. The cross entropy of $p$ represented by $q$ is calculated as follows:
$$H(p,q)=-\sum_x p(x)\log q(x)$$
Cross entropy characterizes the distance between two probability distributions: it depicts how difficult it is to express the probability distribution $p$ using the probability distribution $q$. From the formula it is not hard to see that the smaller the cross entropy, the closer the two probability distributions $p$ and $q$ are.
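A minimal NumPy sketch of this formula (the distributions below are made up for illustration):

```python
import numpy as np

def cross_entropy(p, q):
    """H(p, q) = -sum_x p(x) * log(q(x)) for discrete distributions."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return float(-np.sum(p * np.log(q)))

p = [0.7, 0.2, 0.1]            # the "true" distribution
q_near = [0.6, 0.3, 0.1]       # closer to p
q_far = [0.1, 0.2, 0.7]        # farther from p

print(cross_entropy(p, q_near))   # smaller value
print(cross_entropy(p, q_far))    # larger value
```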

Here we again take the three-class classification problem as an example, and assume the data $x$ belongs to category 1. Denote the class distribution probability of $x$ by $y$; clearly $y=(1,0,0)$ represents the actual class distribution probability of $x$. Denote by $\hat{y}$ the class distribution probability predicted by the model.

So for the data $x$, the cross-entropy loss function between its actual class distribution probability $y$ and the class distribution probability $\hat y$ predicted by the model is defined as:
$$cross\ entropy=-y\times \log\left( \hat{y} \right)$$
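A small sketch of this definition (the predicted distribution below is made up): when $y$ is the one-hot vector $(1,0,0)$, the product reduces to minus the log of the probability assigned to the true class.

```python
import numpy as np

y = np.array([1.0, 0.0, 0.0])          # actual class distribution (class 1)
y_hat = np.array([0.7, 0.2, 0.1])      # hypothetical model prediction

loss = -np.sum(y * np.log(y_hat))      # -y . log(y_hat)
print(loss, -np.log(y_hat[0]))         # both equal -log(0.7)
```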

Obviously, a good neural network should ensure that, for each input data point, the gap between the class distribution probability predicted by the network and the actual class distribution probability is as small as possible; that is, the smaller the cross entropy, the better. Cross entropy can therefore be used as a loss function to train neural networks.

[Figure: a three-class classification example — input $x$, intermediate values $(z_1,z_2,z_3)$, softmax outputs $(\hat y_1,\hat y_2,\hat y_3)$, and the cross-entropy loss]
The figure above gives an example of three-class classification. Since the input data $x$ belongs to category 1, its actual class probability distribution is $y=(y_1,y_2,y_3)=(1,0,0)$. After transformation by the neural network, the intermediate predicted values $(z_1,z_2,z_3)$ of the input $x$ for the three categories are obtained. Then, through the softmax mapping, the class distribution probability predicted by the neural network for $x$ is obtained: $\hat y=(\hat y_1,\hat y_2,\hat y_3)$. According to the previous introduction, $\hat y_1,\hat y_2,\hat y_3$ are probability values in the range $(0,1)$. Since the sample $x$ belongs to the first category, we hope the predicted value $\hat y_1$ is much larger than the values of $\hat y_2$ and $\hat y_3$. To obtain such a neural network, the following cross-entropy loss function can be used to optimize the model parameters during training:
$$cross\ entropy=-\left( y_1\times \log\left( \hat{y}_1 \right) +y_2\times \log\left( \hat{y}_2 \right) +y_3\times \log\left( \hat{y}_3 \right) \right)$$
In the above formula, $y_2$ and $y_3$ are both 0 and $y_1$ is 1, so the cross-entropy loss function simplifies to:
$$-y_1\times \log\left( \hat{y}_1 \right) = -\log\left( \hat{y}_1 \right)$$
In neural network training, the error (i.e., the loss) between the actual class probability distribution of the input data and the class probability distribution predicted by the model is propagated from the output end to the input end in order to optimize the model parameters.
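A sketch of the pipeline described above, with made-up intermediate values $(z_1,z_2,z_3)$: softmax turns them into a probability vector, and the cross entropy against the one-hot label reduces to $-\log(\hat y_1)$.

```python
import numpy as np

def softmax(z):
    """Map intermediate values z to a probability distribution."""
    e = np.exp(z - np.max(z))          # subtract max for numerical stability
    return e / np.sum(e)

z = np.array([1.2, 0.4, -0.3])         # hypothetical (z1, z2, z3)
y = np.array([1.0, 0.0, 0.0])          # x belongs to category 1

y_hat = softmax(z)
loss = -np.sum(y * np.log(y_hat))      # full cross-entropy
print(y_hat)                           # probabilities in (0, 1), summing to 1
print(loss, -np.log(y_hat[0]))         # identical, since y is one-hot
```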
The following briefly derives how the error computed from the cross entropy is passed from $\hat y_1$ to $z_1$, $z_2$ and $z_3$ (the case of $z_3$ is the same as that of $z_2$).
$$\frac{\partial \hat{y}_1}{\partial z_1}=\frac{\partial \left( \frac{e^{z_1}}{\sum_k{e^{z_k}}} \right)}{\partial z_1}=\frac{\left( e^{z_1} \right)'\times \sum_k{e^{z_k}}-e^{z_1}\times e^{z_1}}{\left( \sum_k{e^{z_k}} \right)^2}=\frac{e^{z_1}}{\sum_k{e^{z_k}}}-\frac{e^{z_1}}{\sum_k{e^{z_k}}}\times \frac{e^{z_1}}{\sum_k{e^{z_k}}}=\hat{y}_1\left( 1-\hat{y}_1 \right)$$
Since the derivative of the cross-entropy loss function $-\log\left( \hat{y}_1 \right)$ with respect to $\hat y_1$ is $-\frac{1}{\hat{y}_1}$, multiplying $\hat{y}_1\left( 1-\hat{y}_1 \right)$ by $-\frac{1}{\hat{y}_1}$ gives $\hat{y}_1-1$. This shows that once the model's predicted output $\hat y_1$ is obtained, subtracting 1 from it is exactly the partial derivative of the cross-entropy loss function with respect to $z_1$.
$$\frac{\partial \hat{y}_1}{\partial z_2}=\frac{\partial \left( \frac{e^{z_1}}{\sum_k{e^{z_k}}} \right)}{\partial z_2}=\frac{0\times \sum_k{e^{z_k}}-e^{z_1}\times e^{z_2}}{\left( \sum_k{e^{z_k}} \right)^2}=-\frac{e^{z_1}}{\sum_k{e^{z_k}}}\times \frac{e^{z_2}}{\sum_k{e^{z_k}}}=-\hat{y}_1\hat{y}_2$$
In the same way, multiplying $-\hat{y}_1\hat{y}_2$ by the derivative of the cross-entropy loss function, $-\frac{1}{\hat{y}_1}$, gives $\hat{y}_2$. This means that for output nodes other than the first one, the model's predicted output itself is the partial derivative of the cross-entropy loss function with respect to that node. After the partial derivatives with respect to $z_1$, $z_2$ and $z_3$ have been obtained, the loss error can be propagated to the input end through the chain rule (described later).
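The derivation above says the gradient of the cross-entropy loss with respect to each $z$ is simply $\hat y$ minus the one-hot label. A short sketch that checks this against finite differences (the $z$ values are made up):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / np.sum(e)

def loss(z, y):
    """Cross-entropy of softmax(z) against a one-hot label y."""
    return -np.sum(y * np.log(softmax(z)))

z = np.array([1.2, 0.4, -0.3])         # hypothetical intermediate values
y = np.array([1.0, 0.0, 0.0])          # true class is the first one

analytic = softmax(z) - y              # y_hat - y, as derived above

eps = 1e-6
numeric = np.array([
    (loss(z + eps * np.eye(3)[i], y) - loss(z - eps * np.eye(3)[i], y)) / (2 * eps)
    for i in range(3)
])
print(analytic)
print(numeric)                         # agrees with the analytic gradient
```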

In the above example, assume that the predicted intermediate values $(z_1,z_2,z_3)$, after the Softmax mapping, result in $(0.34, 0.46, 0.20)$. Since the input data $x$ belongs to the first category, this output is obviously not ideal and the model parameters need to be optimized. If the cross-entropy loss function is chosen to optimize the model, the partial derivative values at the $(z_1,z_2,z_3)$ layer are $(0.34-1, 0.46, 0.20)=(-0.66, 0.46, 0.20)$.
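These numbers can be reproduced directly from the rule derived above (gradient = predicted probabilities minus the one-hot label):

```python
import numpy as np

y_hat = np.array([0.34, 0.46, 0.20])   # softmax outputs from the example
y = np.array([1.0, 0.0, 0.0])          # x belongs to the first category

grad_z = y_hat - y
print(grad_z)                          # approximately (-0.66, 0.46, 0.20), matching the example
```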

It can be seen that combining Softmax with the cross-entropy loss brings great convenience to the partial derivative calculation, which propagates the loss error from the output end to the input end in order to optimize the model parameters. Because the cross entropy is combined with the Softmax function here, this combination is also called Softmax loss (Softmax with cross-entropy loss).
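A sketch of this convenience in code (the function name is hypothetical): a fused "softmax with cross-entropy" routine can return both the loss and the gradient with respect to $z$ in one pass, using log-softmax for numerical stability.

```python
import numpy as np

def softmax_cross_entropy(z, y):
    """Return (loss, dloss/dz) for intermediate values z and one-hot label y."""
    z_shift = z - np.max(z)                             # stabilize the exponentials
    log_probs = z_shift - np.log(np.sum(np.exp(z_shift)))
    loss = -np.sum(y * log_probs)                       # cross-entropy loss
    grad = np.exp(log_probs) - y                        # y_hat - y
    return loss, grad

loss, grad = softmax_cross_entropy(np.array([1.2, 0.4, -0.3]),
                                   np.array([1.0, 0.0, 0.0]))
print(loss, grad)
```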

Mean Square Error (MSE)

Mean square error loss, also called quadratic loss or L2 loss, is often used in regression tasks. The mean square error function measures the quality of a model by computing the square of the distance (i.e., the error) between the predicted value and the actual value: the closer the predicted value is to the true value, the smaller the mean square error between the two.

Suppose there are $n$ training samples $x_i$; the true output of each training sample $x_i$ is $y_i$, and the model's predicted value for $x_i$ is $\hat y_i$. The mean square error loss of the model over the $n$ training samples can be defined as follows:
$$MSE=\frac{1}{n}\sum_{i=1}^n{\left( y_i-\hat{y}_i \right)^2}$$
Assume that the true target value is 100 and the predicted value ranges from -10000 to 10000. The MSE curve is drawn in the figure below. It can be seen that the closer the predicted value is to 100, the smaller the MSE loss. The MSE loss ranges from 0 to $\infty$.

[Figure: MSE loss as a function of the predicted value, minimized at the true target value 100]
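A minimal sketch of the MSE formula and of the behavior shown in the figure (the prediction values are made up): with a true target of 100, the loss shrinks toward 0 as the prediction approaches 100.

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean square error over n samples."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean((y_true - y_pred) ** 2))

# Single target value 100, predictions moving toward it.
for pred in (-10000.0, -100.0, 0.0, 50.0, 90.0, 100.0):
    print(pred, mse([100.0], [pred]))   # loss decreases toward 0 as pred -> 100
```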
