Deep learning: an overview of common entropy concepts and entropy calculation


1. Overview of entropy

Entropy in thermodynamics:

  • A physical quantity that expresses the degree of disorder of a system's molecular state.

Entropy in information theory:

  • Describes the amount of uncertainty of an information source.
  • It was proposed by the American mathematician Claude Shannon, the founder of information theory.
  • The entropy concepts used most often are: information entropy, cross entropy, relative entropy (KL divergence), JS divergence, joint entropy, conditional entropy, and mutual information.

2. Introduction to common entropy

2.1 Information entropy

We know that information is what removes random uncertainty about things.

So, first we need to determine how to measure the uncertainty of source information.

The uncertainty function f(p) of a source symbol with probability p should satisfy two conditions:

  • f(p) is a monotonically decreasing function of the probability p.
  • The uncertainty of two independent symbols should equal the sum of their individual uncertainties, that is, f(p_1 p_2) = f(p_1) + f(p_2).

The logarithmic function satisfies both conditions:

f(p) = \log\frac{1}{p} = -\log p

Shannon information entropy: the average uncertainty over all possible outcomes of the source should be considered. If the source symbol has n possible values U_1, ..., U_i, ..., U_n (the states the source may present), with corresponding probabilities P_1, ..., P_i, ..., P_n, and the outcomes are independent of each other, then the average uncertainty of the source is the statistical average (expectation E) of the single-symbol uncertainty -\log p_i:


H(U) = E[-\log p_i] = \sum_{i=1}^{n} p_i \log\frac{1}{p_i}
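
For example, for a fair coin with p_1 = p_2 = 1/2:

H(U) = \frac{1}{2}\log 2 + \frac{1}{2}\log 2 = \log 2

which equals 1 bit when the logarithm is taken base 2, the maximum possible entropy for a source with two outcomes.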

2.2 Cross entropy:

Cross entropy definition:

  • Cross entropy is an important concept in information theory. It characterizes the difference between two probability distributions P and Q (here P is the real distribution and Q is the distribution predicted by the model).
  • The greater the cross entropy, the larger the difference between the two distributions.

Cross entropy formula:
H(P,Q) = -\sum_{x \in X} P(x)\log Q(x) = \sum_{x \in X} P(x)\log\frac{1}{Q(x)}
Cross entropy is widely used in deep learning: it typically serves as the loss function of a neural network, measuring the difference between the distribution predicted by the model and the real distribution.
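
As a minimal sketch of that usage (the helper name and the example arrays below are illustrative, not from any particular library):

import numpy as np

def cross_entropy(p, q, eps=1e-12):
    # H(P, Q) = -sum_x P(x) * log Q(x), natural log; eps guards against log(0)
    q = np.clip(q, eps, 1.0)
    return -np.sum(p * np.log(q))

p = np.array([0.0, 1.0, 0.0])   # one-hot "real" distribution
q = np.array([0.1, 0.7, 0.2])   # distribution predicted by the model
print(cross_entropy(p, q))      # about 0.357, i.e. -log(0.7)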

2.3 Relative entropy:

Relative entropy is also known as KL divergence (Kullback-Leibler divergence, abbreviated KLD), information divergence, or information gain.

Definition of relative entropy:

  • It is the difference between the cross entropy and the information entropy; it indicates the extra information needed when the distribution Q is used to approximate the real distribution P.
  • The larger the value, the larger the gap between the two distributions.

The calculation formula is:
D_{KL}(P||Q) = \sum_{x \in X} P(x)\log\frac{1}{Q(x)} - \sum_{x \in X} P(x)\log\frac{1}{P(x)} = \sum_{x \in X} P(x)\log\frac{P(x)}{Q(x)}
Relative entropy example:

Assume that a character transmitter randomly sends the two characters 0 and 1, and its true sending probability distribution is A. There are two observed probability distributions, B and C, as follows:

  • A(0)=1/2 A(1)=1/2
  • B(0)=1/4 B(1)=3/4
  • C(0)=1/8 C(1)=7/8

Question: Which of B and C is closer to the actual distribution A?

Answer:

We can use relative entropy to measure how close each observed distribution is to A:

D_{KL}(A||B) = \sum_{x \in X} A(x)\log\frac{A(x)}{B(x)} = A(0)\log\frac{A(0)}{B(0)} + A(1)\log\frac{A(1)}{B(1)}

D_{KL}(A||C) = \sum_{x \in X} A(x)\log\frac{A(x)}{C(x)} = A(0)\log\frac{A(0)}{C(0)} + A(1)\log\frac{A(1)}{C(1)}
The calculation result is:

D_{KL}(A||B) = \frac{1}{2}\log\frac{1/2}{1/4} + \frac{1}{2}\log\frac{1/2}{3/4}

D_{KL}(A||C) = \frac{1}{2}\log\frac{1/2}{1/8} + \frac{1}{2}\log\frac{1/2}{7/8}
Conclusion: B is closer to A than C.
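
A quick numerical check of this example (a sketch using natural logarithms):

import numpy as np

def kl_divergence(p, q):
    # D_KL(P || Q) = sum_x P(x) * log(P(x) / Q(x))
    p, q = np.asarray(p, float), np.asarray(q, float)
    return np.sum(p * np.log(p / q))

A = [1/2, 1/2]
B = [1/4, 3/4]
C = [1/8, 7/8]
print(kl_divergence(A, B))   # about 0.144
print(kl_divergence(A, C))   # about 0.413, so B is indeed closer to A
print(kl_divergence(B, A))   # about 0.131, not 0.144: see the asymmetry property below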

Properties of relative entropy:

  • Asymmetry:

D_{KL}(A||B) \neq D_{KL}(B||A)

  • Non-negativity:

D_{KL}(A||B) \geq 0, with equality only when the two distributions are identical

Some people call relative entropy (KL divergence) the "KL distance", but because of its asymmetry it is not a true distance metric.

2.4 JS divergence (Jensen-Shannon divergence)
  • Because KL divergence is not symmetric, JS divergence improves on it. For two distributions P_1 and P_2, the JS divergence formula is:

JS(P_1||P_2) = \frac{1}{2} KL(P_1||\frac{P_1+P_2}{2}) + \frac{1}{2} KL(P_2||\frac{P_1+P_2}{2})

  • Like KL divergence, it measures how similar two distributions are, but it is symmetric, as sketched below.
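
A minimal sketch of the JS divergence, reusing the kl_divergence helper from the check above (both inputs are assumed to be probability vectors with nonzero entries):

import numpy as np

def kl_divergence(p, q):
    p, q = np.asarray(p, float), np.asarray(q, float)
    return np.sum(p * np.log(p / q))

def js_divergence(p1, p2):
    # JS(P1 || P2) = 1/2 KL(P1 || M) + 1/2 KL(P2 || M), with M = (P1 + P2) / 2
    p1, p2 = np.asarray(p1, float), np.asarray(p2, float)
    m = (p1 + p2) / 2
    return 0.5 * kl_divergence(p1, m) + 0.5 * kl_divergence(p2, m)

print(js_divergence([1/2, 1/2], [1/4, 3/4]))   # symmetric: both prints give
print(js_divergence([1/4, 3/4], [1/2, 1/2]))   # the same value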
2.5 Joint entropy
  • Joint entropy is also called compound entropy (Joint Entropy).

  • Denoted H(X,Y), it is the entropy of the joint distribution of two random variables X and Y; the standard formula is given below.
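
For two discrete random variables X and Y with joint distribution p(x, y), the joint entropy is computed as:

H(X,Y) = -\sum_{x}\sum_{y} p(x,y)\log p(x,y)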

2.6 Conditional entropy

Conditional entropy H(X|Y) represents the remaining uncertainty of the random variable X when the random variable Y is known.

H(X|Y) = H(X,Y) - H(Y): the joint entropy of (X, Y) minus the entropy contained in Y alone.

The derivation process:

  • Assuming that y = y_j is known, then

H(x|y_j) = -\sum_{i=1}^{n} p(x_i|y_j)\log p(x_i|y_j)

  • For the various possible values of y, a weighted average is taken according to their probabilities of occurrence, that is,

H(x|y) = -\sum_{i=1}^{n}\sum_{j=1}^{m} p(y_j)\,p(x_i|y_j)\log p(x_i|y_j)

H(x|y) = -\sum_{i=1}^{n}\sum_{j=1}^{m} p(x_i,y_j)\log\frac{p(x_i,y_j)}{p(y_j)}

H(x|y) = H(x,y) - H(y)
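
A small numerical sketch of the identity H(X|Y) = H(X,Y) - H(Y); the joint probability table below is made up purely for illustration:

import numpy as np

# hypothetical joint distribution p(x, y): rows index x, columns index y
p_xy = np.array([[0.25, 0.25],
                 [0.10, 0.40]])

def entropy(p):
    p = p[p > 0]                      # skip zero-probability entries
    return -np.sum(p * np.log(p))

H_xy = entropy(p_xy.ravel())          # joint entropy H(X, Y)
H_y = entropy(p_xy.sum(axis=0))       # entropy of the marginal p(y)
print("H(X|Y) =", H_xy - H_y)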

2.7 Mutual Information
  • Mutual information can be viewed as the amount of information that one random variable contains about another random variable,
  • or, equivalently, as the reduction in the uncertainty of one random variable due to knowledge of the other.

The derivation process:

I(X;Y) = H(X) - H(X|Y)

I(X;Y) = H(X) + H(Y) - H(X,Y)

I(X;Y) = \sum_{x} p(x)\log\frac{1}{p(x)} + \sum_{y} p(y)\log\frac{1}{p(y)} - \sum_{x,y} p(x,y)\log\frac{1}{p(x,y)}

This gives:

I(X;Y) = \sum_{x,y} p(x,y)\log\frac{p(x,y)}{p(x)p(y)}
That is, the mutual information I(X;Y) is the relative entropy (KL divergence) between the joint distribution p(x,y) and the product of the marginals p(x)p(y).
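
A minimal sketch computing I(X;Y) directly from a joint table via this KL form (the table is the same made-up example used above; Example 3 in Section 3 instead applies I(X;Y) = H(X) + H(Y) - H(X,Y) to observed samples):

import numpy as np

p_xy = np.array([[0.25, 0.25],          # hypothetical joint distribution p(x, y)
                 [0.10, 0.40]])
p_x = p_xy.sum(axis=1, keepdims=True)   # marginal p(x), shape (2, 1)
p_y = p_xy.sum(axis=0, keepdims=True)   # marginal p(y), shape (1, 2)

mi = np.sum(p_xy * np.log(p_xy / (p_x * p_y)))
print("I(X;Y) =", mi)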

Venn diagram illustrating the relationships among H(X), H(Y), H(X,Y), H(X|Y), H(Y|X) and I(X;Y) (image omitted).

3. Entropy calculation

# -*- coding: utf-8 -*-
# Demo: computing Shannon entropy (Examples 1 and 2 take two different kinds of input)
# and mutual information (Example 3). log defaults to the natural logarithm.

import numpy as np
from math import log

# Example 1: compute Shannon entropy (the probability distribution is given)
print("Example 1:")
def calc_ent(x):
    """Entropy of a known probability distribution x (natural log)."""
    ent = 0.0
    for p in x:
        ent -= p * np.log(p)
    return ent

x1 = np.array([0.4, 0.2, 0.2, 0.2])
x2 = np.array([1])
x3 = np.array([0.2, 0.2, 0.2, 0.2, 0.2])
print("Entropy of x1:", calc_ent(x1))
print("Entropy of x2:", calc_ent(x2))
print("Entropy of x3:", calc_ent(x3))
print("")

# Example 2: compute Shannon entropy (the input is an observed sequence of symbols)
print("Example 2:")
def calcShannonEnt(dataSet):
    """Entropy estimated from the symbol frequencies in dataSet (natural log)."""
    length = float(len(dataSet))
    dataDict = {}
    for data in dataSet:
        dataDict[data] = dataDict.get(data, 0) + 1
    return sum(-d / length * log(d / length) for d in dataDict.values())

print("Entropy of x1:", calcShannonEnt(['A', 'B', 'C', 'D', 'A']))
print("Entropy of x2:", calcShannonEnt(['A', 'A', 'A', 'A', 'A']))
print("Entropy of x3:", calcShannonEnt(['A', 'B', 'C', 'D', 'E']))


# Example 3: compute mutual information (input: observed symbol sequences; the joint
# observations are the element-wise pairs of x4 and x5, written out by hand)
print("")
print("Example 3:")
Ent_x4 = calcShannonEnt(['3', '4', '5', '5', '3', '2', '2', '6', '6', '1'])
Ent_x5 = calcShannonEnt(['7', '2', '1', '3', '2', '8', '9', '1', '2', '0'])
Ent_x4x5 = calcShannonEnt(['37', '42', '51', '53', '32', '28', '29', '61', '62', '10'])
MI_x4_x5 = Ent_x4 + Ent_x5 - Ent_x4x5
print("Mutual information between x4 and x5:", MI_x4_x5)

Source: blog.csdn.net/yt266666/article/details/127284543