Amount of Information, Entropy, Cross Entropy, KL Divergence, Cross Entropy Loss

1. Amount of Information

  • Amount of information: a measure of how difficult it is for an event to occur
    • A low-probability event is harder to occur, so it carries a larger amount of information
    • A high-probability event is easier to occur, so it carries a smaller amount of information

Formula for the amount of information: $I(x) := \log_2\left(\frac{1}{p(x)}\right) = -\log_2(p(x))$

Property: for independent events A and B, $p(AB) = p(A)\,p(B)$, and the amount of information of the two events occurring at the same time equals the sum of the amounts of information of the two events: $I(AB) = I(A) + I(B)$

$\rightarrow I(AB) = \log_2\left(\frac{1}{p(AB)}\right) = \log_2\left(\frac{1}{p(A)\,p(B)}\right) = \log_2\left(\frac{1}{p(A)}\right) + \log_2\left(\frac{1}{p(B)}\right) = I(A) + I(B)$

Note that $0 \le p(x) \le 1$, so $I(x) \ge 0$.

Example 1: flip a coin with probability of heads $p(A) = 0.5$ and probability of tails $p(B) = 0.5$

$\rightarrow I(A) = -\log_2(0.5) = 1, \quad I(B) = -\log_2(0.5) = 1$

Example 2: flip a coin with probability of heads $p(A) = 0.2$ and probability of tails $p(B) = 0.8$

$\rightarrow I(A) = -\log_2(0.2) \approx 2.32, \quad I(B) = -\log_2(0.8) \approx 0.32$

Conclusion: low-probability events carry a large amount of information, while high-probability events carry a small amount of information.
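
The two coin examples can be reproduced with a minimal Python sketch (the `information` helper below is written just for illustration, not taken from any library):

```python
import math

def information(p: float) -> float:
    """Amount of information, in bits, of an event with probability p."""
    return -math.log2(p)

# Example 1: fair coin -- both outcomes carry 1 bit
print(information(0.5))            # 1.0

# Example 2: biased coin -- the rare outcome carries more information
print(information(0.2))            # ~2.32
print(information(0.8))            # ~0.32

# Additivity for independent events: I(AB) = I(A) + I(B)
p_a, p_b = 0.5, 0.2
assert math.isclose(information(p_a * p_b), information(p_a) + information(p_b))
```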


2. Entropy

Definition: the expected amount of information of a probability distribution: $H(p) := E(I(x))$. (It can also be understood as the total amount of information of the system as a whole, where the system consists of all possible events; for example, when flipping a coin, heads and tails together make up the whole system.)

Role: used to evaluate the degree of uncertainty of a probability model

  • The greater the uncertainty, the greater the entropy
  • The smaller the uncertainty, the smaller the entropy

Formula: $H(p) = \sum_i p_i I_i^p = -\sum_i p_i \log_2(p_i)$

Example 1: toss a coin with probability of heads $p(A) = 0.5$ and probability of tails $p(B) = 0.5$

$$
\begin{aligned}
H(p) &= \sum_i p_i I_i^p \\
&= p(A) \cdot \log_2(1/p(A)) + p(B) \cdot \log_2(1/p(B)) \\
&= 0.5 \cdot \log_2(1/0.5) + 0.5 \cdot \log_2(1/0.5) \\
&= 0.5 \cdot 1 + 0.5 \cdot 1 \\
&= 1
\end{aligned}
$$

Example 2: flip a coin with probability of heads $p(A) = 0.2$ and probability of tails $p(B) = 0.8$

$$
\begin{aligned}
H(p) &= \sum_i p_i I_i^p \\
&= p(A) \cdot \log_2(1/p(A)) + p(B) \cdot \log_2(1/p(B)) \\
&= 0.2 \cdot \log_2(1/0.2) + 0.8 \cdot \log_2(1/0.8) \\
&\approx 0.2 \cdot 2.32 + 0.8 \cdot 0.32 \\
&\approx 0.72
\end{aligned}
$$

Conclusion:
If the probability distribution is uniform, the resulting random variable is more uncertain and the entropy is larger.
If the probability mass is concentrated, the resulting random variable is more certain and the entropy is smaller.
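
A minimal sketch of the entropy formula (again a hypothetical `entropy` helper, written for illustration) reproduces both coin examples:

```python
import math

def entropy(probs) -> float:
    """Shannon entropy in bits: H(p) = -sum(p_i * log2(p_i))."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))   # 1.0   -- uniform distribution, maximum uncertainty
print(entropy([0.2, 0.8]))   # ~0.72 -- concentrated distribution, lower entropy
```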


3. Cross Entropy

Suppose the true probability distribution is $p$ and the predicted probability distribution (estimated probability distribution) is $q$.
Definition: the estimate of the average amount of information of the true distribution $p$ made using the predicted distribution $q$ is called the cross entropy.

Formula: $H(p, q) = \sum_i p_i I_i^q = -\sum_i p_i \log_2(q_i)$

Example 1: toss a coin whose true probability of heads is $p(A) = 0.5$ and true probability of tails is $p(B) = 0.5$; the estimated probability of heads is $q(A) = 0.2$ and the estimated probability of tails is $q(B) = 0.8$

$$
\begin{aligned}
H(p, q) &= -\sum_i p_i \log_2(q_i) \\
&= p(A) \cdot \log_2(1/q(A)) + p(B) \cdot \log_2(1/q(B)) \\
&= 0.5 \cdot \log_2(1/0.2) + 0.5 \cdot \log_2(1/0.8) \\
&\approx 0.5 \cdot 2.32 + 0.5 \cdot 0.32 \\
&\approx 1.32
\end{aligned}
$$

Example 2: toss a coin whose true probability of heads is $p(A) = 0.5$ and true probability of tails is $p(B) = 0.5$; the estimated probability of heads is $q(A) = 0.4$ and the estimated probability of tails is $q(B) = 0.6$

$$
\begin{aligned}
H(p, q) &= -\sum_i p_i \log_2(q_i) \\
&= p(A) \cdot \log_2(1/q(A)) + p(B) \cdot \log_2(1/q(B)) \\
&= 0.5 \cdot \log_2(1/0.4) + 0.5 \cdot \log_2(1/0.6) \\
&\approx 0.5 \cdot 1.32 + 0.5 \cdot 0.74 \\
&\approx 1.03
\end{aligned}
$$

Conclusion:
(1) The closer the estimated probability distribution is to the true probability distribution, the smaller the cross entropy.
(2) The cross entropy is always greater than or equal to the entropy (by Gibbs' inequality), with equality only when $q$ is identical to $p$.
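
A short sketch (hypothetical `cross_entropy` helper) reproduces the two estimates and shows that the cross entropy bottoms out at the entropy $H(p)$ when $q = p$:

```python
import math

def cross_entropy(p, q) -> float:
    """Cross entropy in bits: H(p, q) = -sum(p_i * log2(q_i))."""
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.5]                        # true distribution (fair coin)
print(cross_entropy(p, [0.2, 0.8]))   # ~1.32
print(cross_entropy(p, [0.4, 0.6]))   # ~1.03 -- q closer to p, smaller cross entropy
print(cross_entropy(p, p))            # 1.0 = H(p), the lower bound (Gibbs' inequality)
```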


4. Relative Entropy (KL Divergence)

KL divergence is named after Kullback and Leibler and is also known as relative entropy.

Role: Used to measure the difference between two probability distributions

Formula:

$$
\begin{aligned}
D(p\|q) &= H(p, q) - H(p) \\
&= \sum_i p_i \log_2(1/q_i) - \sum_i p_i \log_2(1/p_i) \\
&= \sum_i p_i \left[\log_2(1/q_i) - \log_2(1/p_i)\right] \\
&= \sum_i p_i \left[I_q - I_p\right] \quad\quad \# \; I_q - I_p \text{ is the difference in the amounts of information} \\
&= \sum_i p_i \log_2(p_i/q_i)
\end{aligned}
$$

Important properties:
(1) By Gibbs' inequality: $D(p\|q) \ge 0$; when the distribution $q$ is exactly the same as the distribution $p$, $D(p\|q) = 0$.


(2) $D(p\|q)$ and $D(q\|p)$ are different, that is, $D(p\|q) \neq D(q\|p)$

  • $D(p\|q)$ measures the gap between the estimated distribution $q$ and the true distribution $p$, taking $p$ (the true distribution) as the reference
  • $D(q\|p)$ measures the gap between the distribution $p$ and the distribution $q$, taking $q$ as the reference
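
A minimal numerical sketch of the KL formula above (the `kl_divergence` helper is hypothetical, written for illustration) confirms both properties:

```python
import math

def kl_divergence(p, q) -> float:
    """KL divergence in bits: D(p||q) = sum(p_i * log2(p_i / q_i))."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p, q = [0.5, 0.5], [0.2, 0.8]
print(kl_divergence(p, q))   # ~0.32 = H(p, q) - H(p) = 1.32 - 1.0
print(kl_divergence(q, p))   # ~0.28 -- D(p||q) != D(q||p): KL divergence is asymmetric
print(kl_divergence(p, p))   # 0.0   -- zero exactly when q equals p
```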

5. Cross Entropy Loss

From the above, the KL divergence $D(p\|q)$ measures the gap between the predicted distribution $q$ and the true distribution $p$, so we can directly define the loss function as the KL divergence: $Loss = D(p\|q)$.
We want the model's predicted distribution $q$ to be exactly the same as the true distribution $p$, that is: $Loss = D(p\|q) = 0$.

Loss function: $$Loss = D(p\|q) = H(p, q) - H(p) = \sum_i p_i \log_2(1/q_i) - \sum_i p_i \log_2(1/p_i) \tag{1}$$

For classification problems, the true distribution is a single-point (one-hot) distribution: the probability of the true class is 1 and the probability of every other class is 0, for example:

| Category    | class1 | class2 | class3 | class4 |
| ----------- | ------ | ------ | ------ | ------ |
| Probability | 0      | 0      | 1      | 0      |

$p_{class1} = p_{class2} = p_{class4} = 0, \quad p_{class3} = 1, \quad \log_2(1/p_{class3}) = \log_2(1) = 0$

Therefore, $H(p) = \sum_i p_i \log_2(1/p_i) = 0$

The loss function (1) can then be further simplified as: $$Loss = D(p\|q) = H(p, q) - H(p) = H(p, q) \tag{2}$$

$H(p, q)$ is the cross entropy, so this loss function is also called the cross-entropy loss function:
$$Cross\_Entropy\_Loss = H(p, q) = -\sum_i p_i \log_2(q_i) \tag{3}$$

And because the true distribution is a single-point distribution, the probability of the true class is $p_{class} = 1$ and the probability of every other class is $p_{\bar{class}} = 0$, so:

$$Cross\_Entropy\_Loss = H(p, q) = -\log_2(q_{class})$$
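
A minimal sketch of this one-hot case, using hypothetical predicted probabilities for the four classes in the table above. Note that deep-learning frameworks typically use the natural logarithm rather than log base 2, which only rescales the loss by a constant factor:

```python
import math

def cross_entropy_loss(q, true_class: int) -> float:
    """Cross-entropy loss, in bits, for a one-hot target: -log2(q[true_class])."""
    return -math.log2(q[true_class])

# Hypothetical predicted probabilities for class1..class4 (e.g. a softmax output);
# the true class is class3 (index 2), matching the table above.
q = [0.1, 0.2, 0.6, 0.1]
print(cross_entropy_loss(q, true_class=2))   # ~0.74 -- approaches 0 as q[2] approaches 1
```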
