[Deep Learning Theory] (1) Loss function

Hello everyone, I recently worked through Stanford's CS231N computer vision open course. It was so good that I would like to share my notes with you.

Knowing the score of an image for each category, we want the score for the correct class to be the largest. How do we measure this quantitatively? That is the role of the loss function. By comparing the scores with the true label and constructing a loss function, we can quantitatively measure how well the model classifies, and then carry out the subsequent optimization and evaluation of the model.

After constructing the loss function, our goal is to minimize its value. Gradient descent does this by taking the partial derivative of the loss function with respect to each weight.

The loss function formula is as follows: f(x_{i}, W) is the vector of scores of image x_{i} for each category, y_{i} is its true label, and L_{i} is the loss of that image. Averaging the losses over all data points gives the final loss function L to be optimized.

L = \frac{1}{N}\sum_{i}L_{i}(f(x_{i}, W), y_{i})

Here are some typical loss functions


1. Hinge loss function (hinge loss)

The hinge loss function formula is as follows: s_{j} is the score of the image for a wrong class j, and s_{y_{i}} is its score for the correct class.

L_{i} = \sum_{j\neq y_{i}}max(0, s_{j}-s_{y_{i}}+1)

For example, take three images (a cat, a car, and a frog), each scored against the three classes cat, car, and frog; the scores appear in the calculations below.

To calculate the hinge loss of the cat image: take its scores for the two wrong classes (car and frog), subtract its score for the correct class (cat), add 1, compare each result with 0 and keep the maximum, then sum the terms. This gives the hinge loss value of the cat image.

Loss for the cat image: max[0, (5.1 - 3.2 + 1)] + max[0, (-1.7 - 3.2 + 1)] = 2.9 + 0 = 2.9

Loss for the car image: max[0, (1.3 - 4.9 + 1)] + max[0, (2.0 - 4.9 + 1)] = 0 + 0 = 0

Loss for the frog image (whose correct-class score is -3.1): max[0, (2.2 + 3.1 + 1)] + max[0, (2.5 + 3.1 + 1)] = 6.3 + 6.6 = 12.9

Averaging the hinge loss values of these three images gives the final loss function value L = (2.9 + 0 + 12.9) / 3 ≈ 5.27
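A short NumPy sketch of this computation. The score matrix is reconstructed from the calculations above (the column assignment of the wrong-class scores is an assumption; it does not affect the loss):

```python
import numpy as np

# Rows are images (cat, car, frog), columns are classes (cat, car, frog).
scores = np.array([[3.2, 5.1, -1.7],
                   [1.3, 4.9,  2.0],
                   [2.2, 2.5, -3.1]])
labels = np.array([0, 1, 2])  # index of the correct class for each image

def hinge_loss(scores, labels, margin=1.0):
    """Per-image multiclass SVM (hinge) loss and the averaged total loss."""
    n = scores.shape[0]
    correct_scores = scores[np.arange(n), labels][:, None]
    margins = np.maximum(0.0, scores - correct_scores + margin)
    margins[np.arange(n), labels] = 0.0   # the correct class is not counted
    per_image = margins.sum(axis=1)
    return per_image, per_image.mean()

per_image, total = hinge_loss(scores, labels)
print(per_image)   # ≈ [2.9, 0.0, 12.9]
print(total)       # ≈ 5.27
```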


Features of the hinge loss function

(1) Take the car image as an example: its score for the correct class (car) is 4.9. As long as a wrong-class score is below 3.9, i.e. more than the margin of 1 below 4.9, that class contributes 0 to the loss. The hinge loss only penalizes wrong classes whose scores come close to (or exceed) the correct class's score.

(2) The minimum value of the hinge loss is 0 and the maximum is positive infinity, because a wrong-class score can be arbitrarily large.

(3) At the beginning of training, the weights are randomly initialized, so the scores of an image for all categories are roughly equal. The hinge loss at that point is approximately the number of wrong classes, C - 1. With three categories, L_{i} = max[0, (score - score + 1)] + max[0, (score - score + 1)] = 2 (a quick numerical check appears after this list).

(4) If each term of the hinge loss is squared, as below, we obtain the squared hinge loss. Large violations are penalized much more heavily, so optimization concentrates on the examples with particularly large loss. For the frog image above, the loss jumps from 12.9 to 6.3^2 + 6.6^2 = 83.25.

L_{i}=\sum_{j\neq y_{i}}max(0, s_{j}-s_{y_{i}}+1)^{2}

(5) The same loss value can be produced by many different sets of weights. In particular, if the hinge loss equals 0, there are many weight settings that keep it at 0: for example, multiplying all the weights by 2 leaves the loss equal to 0, because every margin only gets wider (also checked below).
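A quick numeric check of features (3) and (5), using the same three-class scores as above:

```python
import numpy as np

margin = 1.0

# (3) Roughly equal scores at initialization -> per-image loss ≈ C - 1 = 2.
init_scores = np.array([0.1, 0.1, 0.1])            # (cat, car, frog), correct class: cat
loss_init = np.maximum(0.0, init_scores[1:] - init_scores[0] + margin).sum()
print(loss_init)                                   # 2.0

# (5) The car image already has loss 0; doubling all the scores (as doubling W would)
# only widens the margins, so the loss stays 0.
for scale in (1.0, 2.0):
    s = scale * np.array([1.3, 4.9, 2.0])          # (cat, car, frog), correct class: car
    loss = max(0.0, s[0] - s[1] + margin) + max(0.0, s[2] - s[1] + margin)
    print(loss)                                    # 0.0 both times
```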


2. Regularization

If one loss value can correspond to multiple sets of weights, how do we choose among them? This is where regularization comes in.

A regularization term R(W) is added after the previously constructed loss function, with \lambda denoting the regularization strength. Regularization makes the model simpler by keeping the parameters and weights small.
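Putting the data term and the regularization term together, in the same notation as the per-image loss above, the regularized objective is commonly written as:

L = \frac{1}{N}\sum_{i}L_{i}(f(x_{i}, W), y_{i}) + \lambda R(W)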

L2 regularization: square every weight value and sum the squares. R(W)=\sum_{k}\sum_{l}W_{k,l}^{2}

L1 regularization: take the absolute value of every weight and sum the absolute values. R(W)=\sum_{k}\sum_{l}|W_{k,l}|

The elastic net considers both L1 and L2 regularization, using a parameter \beta to weight how much attention is paid to each. R(W)=\sum_{k}\sum_{l}\left(\beta W_{k,l}^{2}+|W_{k,l}|\right)

The purpose of regularization is to prevent overfitting: it keeps the weights small and the model simple, which gives the model stronger generalization ability.

In addition to adding a regularization term directly to the loss function, there are other methods to prevent overfitting, such as Dropout and Batch Normalization.


For example:

Suppose w1 and w2 are two sets of weights and x is the input image, and both give the same linear score, w1*x = w2*x = 1. How should we choose between the two sets of weights?

With L2 regularization, R(w1) = 1 while R(w2) = (0.25^2) × 4 = 0.25, so w2 has the clearly smaller penalty. L2 regularization prefers weights that are spread out, so that every weight element contributes a little to the result rather than one element doing all the work. With L1 regularization, the two sets of weights score exactly the same (both sum to 1), so L1 cannot distinguish them here.
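A minimal check of this comparison in code, assuming the concrete values implied by the example (x = [1, 1, 1, 1], w1 = [1, 0, 0, 0], w2 = [0.25, 0.25, 0.25, 0.25]):

```python
import numpy as np

x  = np.array([1.0, 1.0, 1.0, 1.0])        # input (assumed from the example)
w1 = np.array([1.0, 0.0, 0.0, 0.0])        # all the weight on a single element
w2 = np.array([0.25, 0.25, 0.25, 0.25])    # weight spread over every element

print(w1 @ x, w2 @ x)                          # 1.0 1.0  -> identical scores, identical data loss
print(np.sum(w1**2), np.sum(w2**2))            # 1.0 0.25 -> L2 prefers the spread-out w2
print(np.sum(np.abs(w1)), np.sum(np.abs(w2)))  # 1.0 1.0  -> L1 cannot tell them apart here
```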


3. Softmax classifier

Now that we have the score of an image for each category, what we really want is the probability that the image belongs to each category: every probability should lie between 0 and 1, and the probabilities should sum to 1. The softmax steps are as follows:

(1) First apply the exponential function e^{x} to the image's score for each category. This turns negative scores into positive numbers while preserving their order (monotonicity).

(2) Normalize the exponentiated scores. The resulting probabilities lie between 0 and 1 and sum to 1. In the example, the image's score for the car class is the highest, and after the conversion its probability is also the highest.
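A minimal softmax sketch; the scores are borrowed from the car image in the hinge-loss example above, purely as an assumed illustration:

```python
import numpy as np

def softmax(scores):
    # Subtracting the max before exponentiating is a standard trick to avoid
    # overflow; it does not change the resulting probabilities.
    exp_scores = np.exp(scores - np.max(scores))
    return exp_scores / exp_scores.sum()

scores = np.array([1.3, 4.9, 2.0])   # (cat, car, frog) scores for the car image
probs = softmax(scores)
print(probs)         # ≈ [0.025, 0.924, 0.051]: the car class keeps the highest value
print(probs.sum())   # 1.0
```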

The role of softmax is to turn scores into probabilities while preserving their relative order. Softmax itself has no weights to learn; it is just a fixed calculation.

Softmax also pulls the correct and wrong categories apart. Because it exponentiates the scores, the ratio between two output probabilities is e raised to the difference of their scores: a large gap between the top score and the rest becomes an even larger gap in probability, and even a small gap in scores gets amplified.


4. Cross-entropy loss function

Entropy is a measure of disorder borrowed from physics. The smaller the entropy, the more sharply the probability is concentrated on the correct class and away from the wrong ones.

After calculating the probability that the image belongs to each category, we construct the cross-entropy loss function, also known as the (negative) log-likelihood loss.

The formula is: Li = -log(the probability that the image belongs to the correct class)

L_{i} = -log(\frac{e^{s_{y_{i}}}}{\sum_{j}e^{s_{j}}})

It can be seen from the formula that the closer the probability of an image belonging to the correct classification is to 1, the closer the value of the cross-entropy loss function is to 0.

At this point, the value of the cross-entropy loss depends only on the probability assigned to the correct class; the probabilities of the wrong classes enter only indirectly, through the softmax normalization.
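A small sketch of this loss, again using the assumed car-image scores; it combines the softmax above with -log of the correct-class probability:

```python
import numpy as np

def cross_entropy_loss(scores, correct_class):
    """L_i = -log(softmax probability of the correct class)."""
    shifted = scores - np.max(scores)                      # numerical stability
    log_probs = shifted - np.log(np.sum(np.exp(shifted)))  # log of the softmax output
    return -log_probs[correct_class]

scores = np.array([1.3, 4.9, 2.0])    # (cat, car, frog), correct class: car (index 1)
print(cross_entropy_loss(scores, 1))  # ≈ 0.079: correct-class probability ≈ 0.92, so the loss is small
```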


Maximum likelihood estimation

Maximum likelihood estimation maximizes the joint probability of the event that all images are classified correctly.

The L_{i} calculated above is the cross-entropy loss of a single image. Now suppose we have a thousand images: multiplying together the probabilities that each one belongs to its correct class gives the probability that all images are classified correctly, and the resulting number is extremely small. To deal with this, we take a logarithm of the product.

Just change p_{1} \times p_{2} \times p_{3} \times ... to -log(p_{1} \times p_{2} \times p_{3} \times ...) = -[log(p_{1}) + log(p_{2}) + log(p_{3}) + ...], which turns the maximization of a product into the minimization of a sum.
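A quick illustration of why the log is taken, using made-up probabilities: the raw product underflows to zero long before the sum of logs runs into any trouble.

```python
import numpy as np

# Pretend 10,000 images are each given probability 0.9 for their correct class.
probs = np.full(10_000, 0.9)

print(np.prod(probs))           # 0.0 -- the product underflows in float64
print(-np.sum(np.log(probs)))   # ≈ 1053.6 -- the negative log-likelihood is well behaved
```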


(1) The closer the probability of correct classification of each image is to 1, the closer the final cross entropy loss function result is to 0

(2) At the beginning of training, the probabilities of an image belonging to each category are almost equal. What is the cross-entropy loss of an image at that point?

Assuming there are C categories in total, the cross-entropy loss of the image is: L_{i} = -log(\frac{1}{C}) = -log(1) + log(C) = log(C)
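This makes a handy sanity check at the start of training; a one-line verification:

```python
import numpy as np

C = 3                          # number of categories
uniform_prob = 1.0 / C         # roughly what an untrained model assigns to every class
print(-np.log(uniform_prob))   # ≈ 1.0986
print(np.log(C))               # the same value: the expected initial loss is log(C)
```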


5. Summary

We now have a training set and get the score for each image in each category.

Using the hinge loss function, we compare each wrong-class score against the correct-class score and compute the hinge loss value.

After this score is transformed by the Softmax function, the cross-entropy loss function can be calculated.

To make the model simpler, a regularization term can be added after the loss function.


Origin blog.csdn.net/dgvv4/article/details/123463216