Cross entropy and torch.nn.CrossEntropyLoss() study notes


Foreword

Defining the softmax operation and the cross-entropy loss function separately may cause numerical instability. Therefore, PyTorch provides a single function that combines the softmax operation with the cross-entropy loss calculation, which has better numerical stability.
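
A minimal sketch of that instability (the logit values below are arbitrary, chosen only to trigger underflow): computing softmax first and then taking the log can underflow a small probability to 0, giving log(0) = -inf, while the fused log-softmax stays finite.

import torch
import torch.nn.functional as F

# Arbitrary logits chosen so that one softmax probability underflows to 0 in float32
logits = torch.tensor([[0.0, 200.0]])

# Two-step computation: softmax, then log -> the underflowed probability becomes log(0) = -inf
print(torch.log(torch.softmax(logits, dim=1)))  # tensor([[-inf, 0.]])

# Fused, numerically stable computation (the form used inside CrossEntropyLoss)
print(F.log_softmax(logits, dim=1))             # tensor([[-200., 0.]])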


1. What is cross entropy?

Cross entropy is mainly used to measure how close the actual output is to the desired output.
Why do we say that? For example, in classification training, if a sample belongs to the k-th category, then the output value of the node corresponding to that category should be 1 and the outputs of the other nodes should be 0, i.e. [0,0,1,0,…,0,0]. This array is also the label of the sample and is the output we most expect from the neural network. In other words, cross entropy measures the difference between the network's output and the label, and this difference is used to update the network parameters through backpropagation.

Principle of cross entropy

Before talking about cross entropy, let's first talk about information content and entropy.

Information content: it is used to measure the uncertainty of an event; the greater the probability of an event occurring, the smaller its uncertainty and the smaller the amount of information it carries. Suppose $X$ is a discrete random variable whose set of possible values is $\mathcal{X}$ and whose probability distribution function is $p(x) = P(X = x),\ x \in \mathcal{X}$. We define the information content of the event $X = x_0$ as $I(x_0) = -\log(p(x_0))$. When $p(x_0) = 1$, the information content is 0, which means that the occurrence of this event adds no information.

Entropy: it is used to measure the degree of disorder of a system, and represents the expected amount of information in the system; the larger this expectation, the greater the uncertainty of the system.

For example: suppose Xiao Ming and Xiao Wang go target shooting. Each shot is a 0-1 outcome, with X taking the values {hit, miss}. Before any shot is fired, we know that the prior probabilities of Xiao Ming and Xiao Wang hitting the target are 10% and 99.9% respectively. From the definition of information content above, we can obtain the information content of each possible shooting result. But to further measure the uncertainty of Xiao Ming's shooting result, we need the concept of entropy: we take the expectation of the information content over all possible outcomes, and the result measures the uncertainty of Xiao Ming's shooting:
$H_A(x) = -[p(x_A)\log(p(x_A)) + (1 - p(x_A))\log(1 - p(x_A))] = 0.4690$
Correspondingly, Xiao Wang's entropy (the uncertainty of shooting) is:

$H_B(x) = -[p(x_B)\log(p(x_B)) + (1 - p(x_B))\log(1 - p(x_B))] = 0.0114$

The uncertainty of Xiao Ming's shooting result is 0.4690; the uncertainty of Xiao Wang's shooting result is 0.0114. The lower the uncertainty, the more certain the result.
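
As a quick check of the two values above (a short sketch; the 0.4690 and 0.0114 figures imply a base-2 logarithm):

import math

def binary_entropy(p_hit):
    # Entropy of a two-outcome (hit/miss) event, in bits (base-2 logarithm)
    return -(p_hit * math.log2(p_hit) + (1 - p_hit) * math.log2(1 - p_hit))

print(round(binary_entropy(0.1), 4))    # Xiao Ming, p(hit) = 10%   -> 0.469
print(round(binary_entropy(0.999), 4))  # Xiao Wang, p(hit) = 99.9% -> 0.0114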

Cross entropy: it mainly describes the distance between the actual output (probability) and the expected output (probability); the smaller the cross entropy, the closer the two probability distributions are. Suppose the probability distribution $p(x)$ is the expected output, the probability distribution $q(x)$ is the actual output, and $H(p,q)$ is the cross entropy. Then
$H(p,q) = -\sum_x [p(x)\log q(x) + (1 - p(x))\log(1 - q(x))]$
How is this formula applied? For example, suppose $N = 3$, the expected output is $p = (1, 0, 0)$, and the actual outputs are $q_1 = (0.5, 0.2, 0.3)$ and $q_2 = (0.8, 0.1, 0.1)$.
So:
$H(p, q_1) = -(1\log 0.5 + 0\log 0.2 + 0\log 0.3 + 0\log 0.5 + 1\log 0.8 + 1\log 0.7) = 0.55$
$H(p, q_2) = -(1\log 0.8 + 0\log 0.1 + 0\log 0.1 + 0\log 0.2 + 1\log 0.9 + 1\log 0.9) = 0.19$
It can be seen from the above that $q_2$ is closer to $p$, and its cross entropy is also smaller.
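
Both values can be reproduced with a short sketch (the 0.55 and 0.19 figures imply a base-10 logarithm):

import math

def cross_entropy(p, q):
    # Cross entropy in the element-wise form used above, with a base-10 logarithm
    return -sum(pi * math.log10(qi) + (1 - pi) * math.log10(1 - qi)
                for pi, qi in zip(p, q))

p  = (1, 0, 0)
q1 = (0.5, 0.2, 0.3)
q2 = (0.8, 0.1, 0.1)
print(round(cross_entropy(p, q1), 2))  # 0.55
print(round(cross_entropy(p, q2), 2))  # 0.19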

2. The CrossEntropyLoss() function in PyTorch

To state the conclusion up front: CrossEntropyLoss() does the same thing as first applying LogSoftmax() and then NLLLoss(). The difference between the two approaches was introduced at the beginning of the article.

1. Softmax()

Applies the Softmax function to an n-dimensional input tensor, scaling it such that the elements of the n-dimensional output tensor lie in the range [0,1] and sum to 1.
Softmax is defined as:
$\mathrm{Softmax}(x_i) = \frac{\exp(x_i)}{\sum_j \exp(x_j)}$

The test code is as follows:

import torch
import torch.nn as nn

x_input=torch.Tensor([[1,2,3]]) 
print('x_input:\n',x_input) 

# Apply softmax to the input; each row of the output now sums to 1
softmax_func=nn.Softmax(dim=1)
soft_output=softmax_func(x_input)
print('soft_output:\n',soft_output)

output:

x_input:
 tensor([[1., 2., 3.]])
soft_output:
 tensor([[0.0900, 0.2447, 0.6652]])
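
As a sanity check, the same numbers fall out of applying the Softmax definition by hand, reusing x_input from the snippet above:

# Manual softmax following the definition exp(x_i) / sum_j exp(x_j)
manual_soft = torch.exp(x_input) / torch.exp(x_input).sum(dim=1, keepdim=True)
print(manual_soft)        # tensor([[0.0900, 0.2447, 0.6652]])
print(manual_soft.sum())  # tensor(1.) -- the row sums to 1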

2. LogSoftmax()

LogSoftmax is the natural logarithm of Softmax.
LogSoftmax is defined as:
$\mathrm{LogSoftmax}(x_i) = \log\left(\frac{\exp(x_i)}{\sum_j \exp(x_j)}\right)$

The test code is as follows:

# Take the log of the softmax output
log_output=torch.log(soft_output)
print('log_output:\n',log_output)

# Compare the combination of softmax and log with nn.LogSoftmax: the two outputs are identical
logsoftmax_func=nn.LogSoftmax(dim=1)
logsoftmax_output=logsoftmax_func(x_input)
print('logsoftmax_output:\n',logsoftmax_output)

output:

log_output:
 tensor([[-2.4076, -1.4076, -0.4076]])
logsoftmax_output:
 tensor([[-2.4076, -1.4076, -0.4076]])
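
Mathematically, log-softmax is equal to x − logsumexp(x), which is the numerically stable form mentioned in the foreword; a quick check of that identity on the same input:

# log_softmax(x) == x - logsumexp(x)
print(x_input - torch.logsumexp(x_input, dim=1, keepdim=True))
# tensor([[-2.4076, -1.4076, -0.4076]])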

3. NLLLoss()

NLLLoss (negative log-likelihood loss) takes log-probabilities and a target class index as input, and returns the negative of the log-probability at the target position (averaged over the batch by default).
The test code is as follows:

target = torch.Tensor([2]).long()
print('target:\n',target)

# The default reduction for nn.NLLLoss in PyTorch is 'mean', i.e. the loss is averaged over the batch
nllloss_func=nn.NLLLoss()
nlloss_output=nllloss_func(logsoftmax_output,target)
print('nlloss_output:\n',nlloss_output)

output:

target:
 tensor([2])
nlloss_output:
 tensor(0.4076)

What nn.NLLLoss() does is take the value of logsoftmax_output at the position given by target and multiply it by -1. For example, if target=0, it takes the value at index 0 of logsoftmax_output and multiplies it by -1 to produce the loss.
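
That behaviour is easy to verify by indexing logsoftmax_output by hand, using the same target=2 as above:

# Take the log-probability at the target index and multiply it by -1
manual_nll = -logsoftmax_output[0, target.item()]
print(manual_nll)  # tensor(0.4076), identical to nlloss_output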

4. CrossEntropyLoss()

For a raw input vector x and a target class, CrossEntropyLoss computes $\mathrm{loss}(x, \mathrm{class}) = -\log\frac{\exp(x[\mathrm{class}])}{\sum_j \exp(x[j])} = -x[\mathrm{class}] + \log\sum_j \exp(x[j])$.

# Use nn.CrossEntropyLoss directly on the raw input and check whether it matches the LogSoftmax + NLLLoss computation
crossentropyloss=nn.CrossEntropyLoss()
crossentropyloss_output=crossentropyloss(x_input,target)
print('crossentropyloss_output:\n',crossentropyloss_output)

output:

crossentropyloss_output:
 tensor(0.4076)

It can be seen that the output of CrossEntropyLoss() is the same as that of LogSoftmax() + NLLLoss().
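
The equivalence is not specific to this single example; a minimal sketch checking it on a random batch (the batch size and class count here are arbitrary):

# Random batch of raw logits and integer class targets for a 5-class problem
batch_logits  = torch.randn(4, 5)
batch_targets = torch.randint(0, 5, (4,))

two_step = nn.NLLLoss()(nn.LogSoftmax(dim=1)(batch_logits), batch_targets)
one_step = nn.CrossEntropyLoss()(batch_logits, batch_targets)
print(torch.allclose(two_step, one_step))  # True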


