Why can cross entropy be used as a loss function for machine learning and deep learning?

Information amount and information entropy in information theory

Information amount (self-information)

Whether in primitive times or today, people receive and send a great deal of information every day, which costs a lot of energy. How can we reduce that cost? The answer is to skip the trivial and prioritize the important information. The question then becomes: how do we tell important information from unimportant information? Before information theory was invented, people had no way to measure the amount of information contained in an event with a precise numerical value. For example, how much information is contained in the news that I got a paper accepted at a top conference? This is the quantification of information. Shannon, the father of information theory, proposed that the amount of information contained in an event can be quantified on this principle: an unlikely event, when it occurs, provides more information than a very likely one. Take the example above: if I have had papers accepted at top conferences before, then the probability of being accepted this time is already high. The acceptance is within my teacher's and classmates' expectations, nothing remarkable, and there is little new information to be gained from it. If instead I have never published a paper and have had zero output in three years, then by past statistics my probability of acceptance is nearly 0, and a sudden acceptance is a big deal: there must be a lot of unknown information behind it worth studying. Did I ride on the coattails of a big name? Did inspiration strike? And so on. Following this logic, we can use mathematical and statistical tools (functions, probability distributions, etc.) to obtain a quantitative relationship between the probability of an event and the amount of information it contains:

           I(x) = -\log(p(x))
This is the Shannon information content, also known as self-information, and it represents how much information an event provides (if the logarithm in the formula has base 2, the amount of information is measured in bits). Here p(x) is the probability that event x occurs. From the expression of I(x) we can see that p(x) lies in [0, 1], and as p(x) increases, I(x) decreases.
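As a quick sketch of this idea, the helper below computes self-information directly from the formula (the probabilities are made-up numbers for illustration only):

```python
import math

def self_information(p, base=2):
    """Shannon information content I(x) = -log(p(x)); in bits when base=2."""
    return -math.log(p, base)

# The rarer the event, the more information it carries.
print(self_information(0.9))   # ~0.15 bits: an expected event tells us little
print(self_information(0.01))  # ~6.64 bits: a surprising event tells us a lot
```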

Information entropy

However, self-information only measures the amount of information in a single event, while the whole system follows a distribution (a collection of many events). How do we represent the information of the system's entire event set, that is, how do we quantify the amount of information it contains? In information theory, information entropy is used to quantify the information contained in the entire event set of a system, that is, to quantify its probability distribution. The formula is:
           
          H(X) = \mathbb{E}_{X \sim p}[I(X)] = \mathbb{E}_{X \sim p}[-\log(p(X))] = -\sum_{i=1}^{N} p(x_i)\log(p(x_i))
             
So information entropy (the amount of information of the whole system, i.e. of the probability distribution over the system's event set) is the expected value of the Shannon information content: the weighted average of the information contained in each event, as above. Readers who find this unclear can review the probability distribution of discrete random variables in probability theory.
The meaning of information entropy: from a coding point of view, information entropy measures, under the true distribution p, i.e. p(X), the code length required to identify a sample (the information content above is measured in bits, and the code length here is the expected minimum number of bits needed to encode one data point, i.e. one event), that is, the average minimum code length.
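As a small sketch, the entropy of a discrete distribution can be computed straight from the definition above (the distributions here are made-up examples):

```python
import math

def entropy(probs, base=2):
    """H(X) = -sum p(x) * log p(x); terms with p(x) == 0 contribute 0 by convention."""
    return -sum(p * math.log(p, base) for p in probs if p > 0)

# A uniform 4-outcome distribution needs 2 bits per outcome on average.
print(entropy([0.25, 0.25, 0.25, 0.25]))   # 2.0
# A heavily skewed distribution needs far fewer bits on average.
print(entropy([0.97, 0.01, 0.01, 0.01]))   # ~0.24
```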

Relative entropy (KL divergence) and cross entropy

Relative entropy, as the name suggests, is an entropy-like quantity that relates two different distributions. Since information entropy is the expectation of the information content under a single distribution, we can build up to relative entropy starting from information entropy. For example, consider an information set S_1 containing 4 letters (A, B, C, D) whose true distribution is P = (1/2, 1/2, 0, 0): A and B each occur with probability 1/2, while C and D occur with probability 0. Using the formula for information entropy, the information entropy of S_1 is:
                 
                 H(P) = -(\frac{1}{2}\log_2\frac{1}{2} + \frac{1}{2}\log_2\frac{1}{2}) = 1
                 
C and D occur with probability 0, and by convention 0 \cdot \log(0) = 0, so these terms contribute nothing to the calculation.
Now give the set S_1 a different distribution Q = (1/4, 1/4, 1/4, 1/4); Q is not the true distribution of S_1. Encoding S_1 according to Q, while the events still occur with their true probabilities from P, gives:
           
           H(P, Q) = -(\frac{1}{2}\log_2\frac{1}{4} + \frac{1}{2}\log_2\frac{1}{4} + 0\cdot\log_2\frac{1}{4} + 0\cdot\log_2\frac{1}{4}) = 2

Although C and D have nonzero probability under the distribution Q, they never occur under the real distribution P. This is exactly the situation in a classification task: P is the true label distribution and Q is the predicted distribution. When we use cross entropy to compute the loss, the calculation takes the form:

           -\sum_{i=1}^{N} P(x_i)\log Q(x_i)
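The numbers in the example above can be reproduced with a small sketch (using the same P and Q as in the text):

```python
import math

def cross_entropy(p, q, base=2):
    """H(P, Q) = -sum p_i * log q_i; terms with p_i == 0 contribute 0 by convention."""
    return -sum(pi * math.log(qi, base) for pi, qi in zip(p, q) if pi > 0)

P = [0.5, 0.5, 0.0, 0.0]      # true distribution of A, B, C, D
Q = [0.25, 0.25, 0.25, 0.25]  # assumed (wrong) distribution

print(cross_entropy(P, P))  # 1.0 -> H(P), the entropy of the true distribution
print(cross_entropy(P, Q))  # 2.0 -> H(P, Q), encoding with Q costs one extra bit
```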

In fact, by Gibbs' inequality, H(P, Q) \geq H(P) always holds, with equality exactly when Q equals the true distribution P. The difference between the average code length obtained with Q and the average code length obtained with P (the extra coding cost) is called relative entropy, also known as KL divergence:

           KL(P||Q) = H(P, Q) - H(P) = -\sum_{i=1}^{N} p_i\log(q_i) + \sum_{i=1}^{N} p_i\log(p_i) = \sum_{i=1}^{N} p_i\log(\frac{p_i}{q_i})

H(P, Q) in the formula above is the cross entropy. In information-theoretic coding, it describes the average code length (the expected amount of information) required to encode a character set whose true distribution is P using the distribution Q.
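Continuing the sketch above, the KL divergence is simply the gap between the two average code lengths:

```python
import math

def kl_divergence(p, q, base=2):
    """KL(P || Q) = sum p_i * log(p_i / q_i), skipping terms with p_i == 0."""
    return sum(pi * math.log(pi / qi, base) for pi, qi in zip(p, q) if pi > 0)

P = [0.5, 0.5, 0.0, 0.0]
Q = [0.25, 0.25, 0.25, 0.25]

# KL(P || Q) = H(P, Q) - H(P) = 2 - 1 = 1 extra bit from using Q instead of P.
print(kl_divergence(P, Q))  # 1.0
```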

Why cross entropy can be used as a loss function

From the derivation above: if P is the true probability distribution of the data and Q is the distribution predicted by the machine learning model, then the KL divergence (relative entropy) measures the difference between the two distributions, i.e. the loss incurred by using the distribution Q to represent the true distribution P. Therefore, the KL divergence could serve as the objective function of machine learning: by reducing the KL divergence as much as possible during training, the difference between the two distributions shrinks, and the predicted distribution Q moves closer to the true distribution P. As the formula above shows, however, the cross entropy differs from the KL divergence only by the H(P) term, and in machine learning H(P) is fixed (it is the information entropy of the data's true distribution), i.e. a constant. So minimizing the KL divergence is equivalent to minimizing H(P, Q), and this is why cross entropy can be used as a loss function for classification algorithms.
In training a machine learning algorithm, the cross-entropy loss produced by a single sample x_i is:
  
                  loss(x_i) = -p(x_i)\log(q(x_i))

where p(x_i) is the true label of the data and q(x_i) is the model's prediction for sample x_i.

Then, the cross-entropy loss of the entire batch or data set is:
  
                 loss(X) = -\sum_{i=1}^{N} p(x_i)\log(q(x_i))

where p(x_i) is the true label of the data, q(x_i) is the model's prediction for sample x_i, and N is the total number of samples in the data set or the batch size.
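A minimal NumPy sketch of the batch loss above, assuming one-hot labels and predictions that are already probabilities (the numbers are illustrative; the natural logarithm is used here, as is common in deep learning frameworks, and many frameworks also divide by N to average):

```python
import numpy as np

def cross_entropy_loss(p, q, eps=1e-12):
    """loss(X) = -sum_i p(x_i) * log q(x_i) over the batch."""
    q = np.clip(q, eps, 1.0)          # avoid log(0)
    return -np.sum(p * np.log(q))

# Batch of 2 samples, 3 classes; labels are one-hot, predictions sum to 1 per row.
p = np.array([[1, 0, 0],
              [0, 1, 0]], dtype=float)
q = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.8, 0.1]])
print(cross_entropy_loss(p, q))  # ~0.58 = -log(0.7) - log(0.8)
```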

In typical classification tasks, for binary classification q(x_i) is computed by the sigmoid activation function; for multi-class classification, q(x_i) is computed by the softmax activation function. For more details, see the post on activation functions and loss functions for binary and multi-class classification tasks.
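As a rough sketch, this is how the predicted probabilities q(x_i) are obtained from a model's raw scores (the logits below are made-up numbers):

```python
import numpy as np

def sigmoid(z):
    """Binary classification: map a single logit to P(class = 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    """Multi-class classification: map a logit vector to a probability distribution."""
    z = z - np.max(z)   # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

print(sigmoid(0.8))                        # ~0.69
print(softmax(np.array([2.0, 1.0, 0.1])))  # ~[0.66, 0.24, 0.10]
```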

Summary

From the perspective of information theory, this post explains what information content (self-information) and information entropy are, derives relative entropy (KL divergence) and cross entropy from them, and finally uses the meaning of KL divergence to explain why cross entropy can be used as a loss function in machine learning. I write this to record what I have learned, and I hope it helps anyone passing by.

Source: blog.csdn.net/Just_do_myself/article/details/123455479