Loss function | BCE Loss (Binary Cross-Entropy Loss)

Image binary classification → multi-label classification

Binary classification is the first problem every AI beginner encounters, for example cat-vs-dog classification or spam filtering. In binary classification there are only two kinds of samples, positive and negative; by convention, positive samples get the label y = 1 and negative samples get the label y = 0. For example, for the picture below, judge whether there is a person in it.
[Figure: an example image that contains a person]
The label of that picture is therefore y = 1, and we can design the model output around that. Because binary classification has only positive and negative samples, and their probabilities sum to 1, there is no need to predict a whole vector; a single probability value is enough. The usual practice is to pass the model output through a Sigmoid activation function and then compute the loss with the cross-entropy loss function, that is,
            LOSS = −( y·log(p(x)) + (1 − y)·log(1 − p(x)) )

where p(x) is the model output and y is the ground truth label.
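
As a minimal numeric sketch of this formula (the values y = 1 and p(x) = 0.8 are made up for illustration):

import math
y, p = 1.0, 0.8                                         # ground-truth label and predicted probability
loss = -(y * math.log(p) + (1 - y) * math.log(1 - p))   # binary cross-entropy for a single sample
print(loss)                                             # ≈ 0.223; the loss grows as p moves away from y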

The nature of Sigmoid and Softmax and their corresponding loss functions and tasks

I have seen a good explanation of the Sigmoid activation function and the Softmax function, which I will share here:
[Figures: an explanation of the properties of Sigmoid and Softmax and the tasks they correspond to]
With the above explanation in mind, the picture becomes clearer: for binary classification we use the Sigmoid activation function with the BCE loss; for multi-class classification we use the Softmax activation function with the multi-class cross-entropy loss; and for multi-label classification we again use the Sigmoid activation function with the BCE loss.
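
To make the difference concrete, here is a minimal sketch (the logit values are made up) of how the two activation functions treat the same model output:

import torch
logits = torch.tensor([2.0, -1.0, 0.5])        # raw scores for three classes / labels
probs_softmax = torch.softmax(logits, dim=0)   # a distribution over mutually exclusive classes
probs_sigmoid = torch.sigmoid(logits)          # one independent probability per label
print(probs_softmax, probs_softmax.sum())      # the Softmax probabilities sum to 1
print(probs_sigmoid)                           # the Sigmoid probabilities need not sum to 1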

I cannot help feeling here that deep learning is not pure alchemy: loss functions, activation functions, and model structures are all designed by experts with statistics and the target scenario in mind.

The BCE loss function for multi-label classification tasks

Now let me change the question: is there a person in the picture, and is there a mobile phone (multi-label classification)? There are then four possible labels:

| Label | Meaning |
| ------ | ------ |
| (0, 0) | No person and no mobile phone in the picture |
| (0, 1) | No person in the picture, but there is a mobile phone |
| (1, 0) | A person in the picture, but no mobile phone |
| (1, 1) | Both a person and a mobile phone in the picture |

By analogy, this can be extended to 2^n label combinations (for n labels). Obviously, the problem has changed from ordinary binary classification to multi-label classification. How should the output and the loss function be defined for a multi-label classification problem?
Because multi-label classification involves multiple categories, the model cannot simply output a single value; it should output a vector. And we cannot simply normalize that output with Softmax into probabilities that sum to 1, because the categories are not mutually exclusive: they are allowed to appear at the same time. Instead, we can apply the Sigmoid activation function to each element of the output vector separately, converting each one into a probability value in [0, 1].
For the loss function, a relatively simple idea is to apply the cross-entropy loss to each element of the output vector separately and then take the average. This is exactly the BCE we are discussing today, and PyTorch's official implementation follows the same idea.
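A minimal sketch of that idea (not the actual library code; the tensor values are made up):

import torch
from torch import nn
probs  = torch.tensor([[0.9, 0.2], [0.4, 0.7]])   # per-element probabilities after Sigmoid
target = torch.tensor([[1., 0.], [0., 1.]])       # multi-label ground truth
# element-wise binary cross-entropy, then the mean over all elements
manual  = -(target * probs.log() + (1 - target) * (1 - probs).log()).mean()
builtin = nn.BCELoss()(probs, target)
print(manual, builtin)                            # the two values should match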

BCE code and examples in PyTorch

For example, suppose the model output is:

>>> import torch
>>> output = torch.randn(3,3)
>>> output
tensor([[-0.8858,  0.3241,  0.9456],
        [ 1.4887,  1.8076, -0.0565],
        [-1.6529, -1.8539,  0.6756]])

First, convert all elements of the output vector into probability values in [0, 1]:

>>> from torch import nn
>>> active_func = nn.Sigmoid()
>>> output = active_func(output)
>>> output
tensor([[0.2920, 0.5803, 0.7202],
        [0.8159, 0.8591, 0.4859],
        [0.1607, 0.1354, 0.6627]])
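
Note that, unlike with Softmax, the three probabilities in each row are not forced to sum to 1, because the labels are allowed to co-occur:

>>> output.sum(dim=1)   # roughly 1.59, 2.16 and 0.96 for the rows above; none equals 1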

Suppose the label corresponding to the input data is:

>>> target = torch.FloatTensor([[0,1,1],[1,1,1],[0,0,0]])
>>> target
tensor([[0., 1., 1.],
        [1., 1., 1.],
        [0., 0., 0.]])

Calculate the loss with the BCE loss function:

>>> criterion = nn.BCELoss()
>>> loss = criterion(output, target)
>>> loss
tensor(0.4114)
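
As a side note, PyTorch also provides nn.BCEWithLogitsLoss, which fuses the Sigmoid and the BCE computation into one numerically more stable step; a minimal sketch with fresh random logits (since `output` above has already been passed through Sigmoid):

>>> logits = torch.randn(3, 3)                        # raw (pre-Sigmoid) model outputs
>>> loss_a = nn.BCELoss()(torch.sigmoid(logits), target)
>>> loss_b = nn.BCEWithLogitsLoss()(logits, target)   # Sigmoid fused into the loss
>>> torch.allclose(loss_a, loss_b)                    # expected: True, up to floating-point error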

Summary

After the above analysis: BCE is primarily suited to binary classification, and a multi-label classification task can be understood as a stack of independent binary classification tasks, so BCE also applies to multi-label classification with only a small modification. Before using BCE, the model output must be mapped into [0, 1], for example with the Sigmoid activation function. We also looked at the statistical nature of the Sigmoid and Softmax activation functions: Sigmoid corresponds to a Bernoulli distribution (a binomial distribution with a single trial), while Softmax corresponds to a multinomial (categorical) distribution. That is why Sigmoid is usually used for binary and multi-label classification, while Softmax is used for multi-class classification.

Origin: blog.csdn.net/Just_do_myself/article/details/123393900