Information entropy and information difference

Measure of information

Information entropy

American mathematician Claude Shannon is known as the "father of information theory". Shannon's paper "A Mathematical Theory of Communication", published in the Bell System Technical Journal in 1948, is usually regarded as the beginning of modern information theory research. The paper builds in part on earlier work published in the 1920s by Harry Nyquist and Ralph Hartley. In this paper, Shannon gave the definition of information entropy:

 

{\displaystyle H(X)=\mathbb {E} _{X}[I(x)]=\sum _{x\in {\mathcal {X}}}p(x)\log _{2}\left({\frac {1}{p(x)}}\right)}

where {\mathcal {X}} is the finite set of events x and X is a random variable defined on {\mathcal {X}}. Information entropy is a measure of the uncertainty of a random event.

Information entropy is closely related to thermodynamic entropy in physics:

{\displaystyle S(X)=k_{B}H(X)}

where S(X) is the thermodynamic entropy, H(X) is the information entropy, and k_{B} is Boltzmann's constant. In fact, this relationship is the generalized Boltzmann entropy formula, or the expression for thermodynamic entropy in the canonical ensemble. It can be seen that the work of Boltzmann and Gibbs on entropy in statistical physics inspired the entropy of information theory.

Information entropy is the lower bound on the compression rate in the source coding theorem: if the average amount of information used in encoding is less than the information entropy, information must be lost. On the basis of the law of large numbers and the asymptotic equipartition property, Shannon defined typical sequences and the typical set, the collection of all typical sequences. Because the probability that an independent and identically distributed sequence of length n belongs to the typical set approaches 1 as n grows, it suffices to give the typical sequences of a memoryless source a uniquely decodable encoding and to encode the remaining sequences arbitrarily in order to achieve almost lossless compression.
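As a rough numerical illustration of the asymptotic equipartition property, the following Python sketch (the three-symbol source and all variable names are chosen here for illustration, not taken from the text) draws i.i.d. sequences and shows that the per-symbol information -(1/n) log2 p(x_1, ..., x_n) concentrates around H(X) as n grows:

    import math
    import random

    # A memoryless source over a three-symbol alphabet (illustrative choice).
    probs = {1: 1/5, 2: 2/5, 3: 2/5}
    H = sum(p * math.log2(1 / p) for p in probs.values())  # entropy, about 1.522 bits

    def per_symbol_information(n):
        """Draw an i.i.d. sequence of length n and return -(1/n) * log2 of its probability."""
        symbols, weights = zip(*probs.items())
        seq = random.choices(symbols, weights=weights, k=n)
        return -sum(math.log2(probs[s]) for s in seq) / n

    random.seed(0)
    for n in (10, 100, 10_000):
        print(n, round(per_symbol_information(n), 3), "vs H(X) =", round(H, 3))
    # As n grows the value approaches H(X), so roughly 2^(n*H(X)) codewords cover
    # the typical set, which is the idea behind almost lossless compression.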

Example

Consider a die with three faces, labeled 1, 2, 3. Let X be the number thrown. The probability of each face is

{\displaystyle {\begin{aligned}\mathbb {P} (X=1)&=1/5,\\\mathbb {P} (X=2)&=2/5,\\\mathbb {P} (X=3)&=2/5.\end{aligned}}}

Then

{\displaystyle H(X)={\frac {1}{5}}\log _{2}(5)+{\frac {2}{5}}\log _{2}\left({\frac {5}{2}}\right)+{\frac {2}{5}}\log _{2}\left({\frac {5}{2}}\right)\approx 1.522.}
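The value can be checked with a few lines of Python (a minimal sketch; the helper name entropy is only for illustration):

    import math

    def entropy(probabilities):
        """Shannon entropy in bits: the sum of p * log2(1/p) over all outcomes."""
        return sum(p * math.log2(1 / p) for p in probabilities if p > 0)

    print(entropy([1/5, 2/5, 2/5]))  # prints approximately 1.5219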

Joint entropy and conditional entropy

Joint entropy applies the definition of entropy to the joint distribution:

{\displaystyle H(X,Y)=\sum _{x\in {\mathcal {X}}}\sum _{y\in {\mathcal {Y}}}p(x,y)\log \left({\frac {1}{p(x,y)}}\right).}

Conditional entropy, as the name implies, is computed from the conditional probability p(y|x):

{\displaystyle H(Y|X)=\sum _{x\in {\mathcal {X}}}\sum _{y\in {\mathcal {Y}}}p(x,y)\log \left({\frac {1}{p(y|x)}}\right).}

By the product rule of probability, p(x,y)=p(y|x)p(x). Substituting this into the definition of joint entropy separates out the conditional entropy, which gives the relationship between joint entropy and conditional entropy:

{\displaystyle H(X,Y)=H(X)+H(Y|X)=H(Y)+H(X|Y)=H(Y,X).}
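This relationship can be verified numerically on any joint distribution. The sketch below (the joint table is an arbitrary example chosen for illustration) computes H(X,Y), H(X), and H(Y|X) directly from the definitions above and confirms that H(X,Y) = H(X) + H(Y|X):

    import math

    # An arbitrary joint distribution p(x, y) over X in {0, 1} and Y in {0, 1}.
    p_xy = {(0, 0): 0.30, (0, 1): 0.20, (1, 0): 0.10, (1, 1): 0.40}

    # Marginal distribution p(x).
    p_x = {}
    for (x, y), p in p_xy.items():
        p_x[x] = p_x.get(x, 0.0) + p

    # Joint entropy H(X, Y) and entropy H(X).
    H_xy = sum(p * math.log2(1 / p) for p in p_xy.values())
    H_x = sum(p * math.log2(1 / p) for p in p_x.values())

    # Conditional entropy H(Y|X), using p(y|x) = p(x, y) / p(x).
    H_y_given_x = sum(p * math.log2(p_x[x] / p) for (x, y), p in p_xy.items())

    print(H_xy, H_x + H_y_given_x)  # the two values agree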

Chain rule

The relationship between joint entropy and conditional entropy can be extended further. Suppose there are n random variables X_{i}, i=1,2,...,n. The conditional entropy can be separated out repeatedly, as follows:

{\displaystyle {\begin{aligned}H(X_{1},X_{2},...,X_{n})&=H(X_{1})+H(X_{2},...,X_{n}|X_{1})\\&=H(X_{1})+H(X_{2}|X_{1})+H(X_{3},...,X_{n}|X_{1},X_{2})\\&=H(X_{1})+\sum _{i=2}^{n}H(X_{i}|X_{1},...,X_{i-1}).\end{aligned}}}

Its intuitive meaning is as follows: suppose you receive a sequence of numbers {X_{1},X_{2},...,X_{n}}, receiving X_{1} first, then X_{2}, and so on. After receiving X_{1}, the total amount of information is H(X_{1}); after receiving X_{2}, the total amount of information is H(X_{1})+H(X_{2}|X_{1}); and after X_{n}, the total amount of information is H(X_{1},...,X_{n}). This receiving process therefore gives the chain rule.
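A small sketch of the chain rule for three variables (the joint distribution below is arbitrary and only for illustration): each conditional entropy is computed directly from the conditional probabilities, and their sum matches the joint entropy.

    import math
    from itertools import product

    # An arbitrary joint distribution over three binary variables (X1, X2, X3).
    outcomes = list(product((0, 1), repeat=3))
    weights = (0.10, 0.05, 0.15, 0.20, 0.05, 0.10, 0.05, 0.30)
    p = dict(zip(outcomes, weights))

    def H(dist):
        """Entropy in bits of a dictionary {outcome: probability}."""
        return sum(q * math.log2(1 / q) for q in dist.values() if q > 0)

    def marginal(k):
        """Marginal distribution of the first k variables."""
        m = {}
        for bits, q in p.items():
            m[bits[:k]] = m.get(bits[:k], 0.0) + q
        return m

    def cond_entropy(k):
        """H(X_k | X_1, ..., X_{k-1}), computed from p(x_k | x_1, ..., x_{k-1})."""
        prev, curr = marginal(k - 1), marginal(k)
        return sum(q * math.log2(prev[bits[:k - 1]] / q) for bits, q in curr.items() if q > 0)

    lhs = H(marginal(3))                                      # H(X1, X2, X3)
    rhs = H(marginal(1)) + cond_entropy(2) + cond_entropy(3)  # chain-rule expansion
    print(round(lhs, 6), round(rhs, 6))                       # equal, as the chain rule states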

Mutual information

Mutual information is another useful measure of information; it describes the dependence between two random variables. The mutual information of X and Y is defined as:

{\displaystyle I(X;Y)=H(X)-H(X|Y)=H(X)+H(Y)-H(X,Y)=H(Y)-H(Y|X)=I(Y;X).}

Its meaning is how much information about X is contained in Y. Before obtaining Y, the uncertainty about X is H(X); after obtaining Y, the uncertainty is H(X|Y). So once Y is obtained, an amount of uncertainty H(X)-H(X|Y) is eliminated, which is the amount of information that Y provides about X.

If X and Y are independent, then H(X,Y)=H(X)+H(Y), so I(X;Y)=0.

Moreover, because H(X|Y) ≤ H(X),

{\displaystyle I(X;Y)\leq \min(H(X),H(Y)),}

where equality holds when Y=g(X) and g is a bijective function.
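These identities make mutual information easy to compute from a joint distribution table. A minimal Python sketch (the two example tables are arbitrary illustrations) uses I(X;Y) = H(X) + H(Y) - H(X,Y) and shows that an independent pair gives I(X;Y) = 0:

    import math

    def entropy(values):
        """Shannon entropy in bits of a collection of probabilities."""
        return sum(p * math.log2(1 / p) for p in values if p > 0)

    def mutual_information(p_xy):
        """I(X;Y) = H(X) + H(Y) - H(X,Y) for a dict {(x, y): probability}."""
        p_x, p_y = {}, {}
        for (x, y), p in p_xy.items():
            p_x[x] = p_x.get(x, 0.0) + p
            p_y[y] = p_y.get(y, 0.0) + p
        return entropy(p_x.values()) + entropy(p_y.values()) - entropy(p_xy.values())

    # A dependent pair: Y usually copies X, so the mutual information is positive.
    dependent = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}
    # An independent pair: p(x, y) = p(x) * p(y), so the mutual information is 0.
    independent = {(0, 0): 0.25, (0, 1): 0.25, (1, 0): 0.25, (1, 1): 0.25}

    print(mutual_information(dependent))    # about 0.278 bits
    print(mutual_information(independent))  # 0.0 (up to floating-point rounding)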

Mutual information is closely related to the G-test and to Pearson's chi-squared test.

Applications

Information theory is widely used in fields such as data compression, channel coding and error-correcting codes, cryptography, communications, statistical inference, and machine learning.
