B - Probability theory - Entropy and information gain

More tutorials on machine learning, Python, Go, data structures and algorithms, web crawling, and artificial intelligence are continuously updated at: https://www.cnblogs.com/nickchen121/

Entropy and information gain

First, entropy (Entropy)

Entropy is a measure of the uncertainty of a random variable. Suppose a discrete random variable \(X\) can take \(n\) values, with probability distribution
\[P(X=x_i) = p_i, \quad i = 1,2,\ldots,n\]
Then the entropy of \(X\) is defined as
\[H(X) = -\sum_{i=1}^n p_i \log{p_i}\]
Since entropy depends only on the distribution of \(X\) and not on the actual values that \(X\) takes, it can also be written as
\[H(p) = -\sum_{i=1}^n p_i \log{p_i}\]
The larger the entropy, the greater the uncertainty of the random variable, and \(0 \leq H(p) \leq \log{n}\).
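
As a quick illustration of the definition (a minimal sketch of my own, not from the original post; the function name `entropy` and the choice of base-2 logarithms are arbitrary), the entropy of any discrete distribution can be computed directly from its probability vector:

import numpy as np

def entropy(p, base=2):
    """Entropy of a discrete distribution given as a vector of probabilities."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                               # terms with p_i = 0 contribute nothing
    return -np.sum(p * np.log(p)) / np.log(base)

print(entropy([0.5, 0.25, 0.125, 0.125]))  # 1.75
print(entropy([0.25, 0.25, 0.25, 0.25]))   # 2.0, the maximum log2(n) for n = 4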

When the random variable takes only the two values \(0\) and \(1\), the distribution of \(X\) is
\[P(X=1) = p, \quad P(X=0) = 1-p, \quad 0 \leq p \leq 1\]
and the entropy is
\[H(p) = -p\log_2{p} - (1-p)\log_2{(1-p)}\]
For this Bernoulli random variable, the relationship between entropy and probability is plotted below.

import numpy as np
import matplotlib.pyplot as plt
from matplotlib.font_manager import FontProperties
%matplotlib inline

# Chinese font for the plot title (path is specific to macOS)
font = FontProperties(fname='/Library/Fonts/Heiti.ttc')

# Entropy of a Bernoulli(p) variable for p in (0, 1)
p = np.arange(0.01, 1, 0.01)
entro = -p*np.log2(p) - (1-p)*np.log2(1-p)

plt.plot(p, entro)
plt.title('伯努利分布时熵和概率的关系', fontproperties=font)  # entropy vs. probability for a Bernoulli distribution
plt.xlabel('p')
plt.ylabel('H(p)')
plt.show()


When \(p = 0\) or \(p = 1\), the entropy is \(0\) and the random variable has no uncertainty; when \(p = 0.5\), the entropy reaches its maximum and the uncertainty of the random variable is greatest.

Second, conditional entropy (Conditional Entropy)

Suppose the random variables \((X, Y)\) have the joint probability distribution
\[P(X=x_i, Y=y_j), \quad i = 1,2,\ldots,n; \quad j = 1,2,\ldots,m\]
The conditional entropy \(H(Y|X)\) represents the uncertainty of the random variable \(Y\) given the random variable \(X\), and is defined as
\[H(Y|X) = \sum_{i=1}^n P(X=x_i) H(Y|X=x_i)\]
From this formula, conditional entropy can be understood as the amount of information obtained about another variable when a certain piece of information is already known.
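
The definition translates into code directly. Below is a small sketch with a made-up 2x2 joint distribution (neither the table nor the variable names come from the post):

import numpy as np

# Hypothetical joint distribution p(X = x_i, Y = y_j): rows index X, columns index Y.
p_xy = np.array([[0.25, 0.25],
                 [0.40, 0.10]])

def H(p):
    """Entropy (in bits) of a probability vector."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

p_x = p_xy.sum(axis=1)   # marginal P(X = x_i)

# H(Y|X) = sum_i P(X = x_i) * H(Y | X = x_i)
H_Y_given_X = sum(p_x[i] * H(p_xy[i] / p_x[i]) for i in range(len(p_x)))
print(H_Y_given_X)       # ~0.86 bits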

When the probabilities in the entropy and the conditional entropy are estimated from data, the corresponding quantities are called the empirical entropy and the empirical conditional entropy, respectively.

Third, joint entropy (Joint Entropy)

Suppose the random variables \((X, Y)\) have the joint probability distribution
\[P(X=x_i, Y=y_j) = p_{ij}, \quad i = 1,2,\ldots,n; \quad j = 1,2,\ldots,m\]
The joint entropy measures the uncertainty of a random system with a joint distribution, and is defined as
\[H(X,Y) = -\sum_{i=1}^n \sum_{j=1}^m p(X=x_i, Y=y_j) \log{p(X=x_i, Y=y_j)}\]
The joint entropy can be simplified as follows:
\[\begin{align} H(X,Y) & = -\sum_{i=1}^n \sum_{j=1}^m p(X=x_i, Y=y_j) \log{p(X=x_i, Y=y_j)} \\ & = -\sum_{i=1}^n \sum_{j=1}^m p(X=x_i, Y=y_j) \log{\left[p(X=x_i)\, p(Y=y_j|X=x_i)\right]} \\ & = -\sum_{i=1}^n \sum_{j=1}^m p(X=x_i, Y=y_j) \log{p(X=x_i)} - \sum_{i=1}^n \sum_{j=1}^m p(X=x_i, Y=y_j) \log{p(Y=y_j|X=x_i)} \\ & = -\sum_{i=1}^n p(X=x_i) \log{p(X=x_i)} - \sum_{i=1}^n \sum_{j=1}^m p(X=x_i, Y=y_j) \log{p(Y=y_j|X=x_i)} \\ & = H(X) + H(Y|X) \end{align}\]
By the same reasoning, \(H(X,Y) = H(Y) + H(X|Y)\). That is, for a random system of two random variables, one can first observe one random variable to obtain its information, and then, given that amount of information, observe the amount of information of the second random variable; it makes no difference to the total amount of information obtained which random variable is observed first.
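
To make the identity concrete, here is a numerical check (my own sketch, reusing the same hypothetical joint table as in the earlier conditional-entropy sketch) that \(H(X,Y) = H(X) + H(Y|X)\):

import numpy as np

# Same hypothetical 2x2 joint distribution as in the conditional-entropy sketch above.
p_xy = np.array([[0.25, 0.25],
                 [0.40, 0.10]])

def H(p):
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

p_x = p_xy.sum(axis=1)                                                    # marginal P(X = x_i)
H_X = H(p_x)                                                              # H(X)
H_Y_given_X = sum(p_x[i] * H(p_xy[i] / p_x[i]) for i in range(len(p_x)))  # H(Y|X)
H_XY = H(p_xy)                                                            # joint entropy H(X, Y)

print(H_XY, H_X + H_Y_given_X)   # the two values agree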

Similarly, for a random system containing \(n\) mutually independent random variables \((X_1, X_2, \ldots, X_n)\), the joint entropy is
\[H(X_1, X_2, \ldots, X_n) = \sum_{i=1}^n H(X_i)\]
It can be seen that even for a random system containing \(n\) independent random variables, the order in which the random variables are observed has no effect on the amount of information obtained.

Fourth, relative entropy (Relative Entropy)

Relative entropy is sometimes referred to as the KL divergence (Kullback-Leibler divergence).

Let \(p(X)\) and \(q(X)\) be two probability distributions over the values of a discrete random variable \(X\). The relative entropy of \(p\) with respect to \(q\) is:
\[\begin{align} D_{KL}(p||q) & = \sum_{i=1}^n p(X=x_i) \log{\frac{p(X=x_i)}{q(X=x_i)}} \\ & = E_{p(X=x_i)} \log{\frac{p(X=x_i)}{q(X=x_i)}} \end{align}\]
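
A minimal sketch of the definition in code (the two example distributions are my own, picked only to illustrate the properties listed in the next subsection):

import numpy as np

def kl_divergence(p, q):
    """Relative entropy D_KL(p || q) of discrete distributions, in bits."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0                 # terms with p_i = 0 contribute nothing
    return np.sum(p[mask] * np.log2(p[mask] / q[mask]))

p = [0.9, 0.1]
q = [0.5, 0.5]
print(kl_divergence(p, q))   # ~0.53 bits
print(kl_divergence(q, p))   # ~0.74 bits: not symmetric
print(kl_divergence(p, p))   # 0.0 when the two distributions are identical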

4.1 Properties of relative entropy

  1. If \(p(x)\) and \(q(x)\) are the same distribution, the relative entropy is equal to \(0\).
  2. \(D_{KL}(p||q) \neq D_{KL}(q||p)\), i.e. relative entropy is not symmetric.
  3. \(D_{KL}(p||q) \geq 0\) (this can be proved using Jensen's inequality):
    \[\begin{align} D_{KL}(p||q) & = \sum_{i=1}^n p(X=x_i)\log{\frac{p(X=x_i)}{q(X=x_i)}} \\ & = -\sum_{i=1}^n p(X=x_i)\log{\frac{q(X=x_i)}{p(X=x_i)}} \\ & = -E_{p(X=x_i)}\left[\log{\frac{q(X=x_i)}{p(X=x_i)}}\right] \\ & \geq -\log{E_{p(X=x_i)}\left[\frac{q(X=x_i)}{p(X=x_i)}\right]} \\ & = -\log\sum_{i=1}^n p(X=x_i)\frac{q(X=x_i)}{p(X=x_i)} \\ & = -\log\sum_{i=1}^n q(X=x_i) \end{align}\]
    Since \(\sum_{i=1}^n q(X=x_i) = 1\), the last line equals \(-\log 1 = 0\), which proves \(D_{KL}(p||q) \geq 0\).
  4. Relative entropy can be used to measure the difference between two probability distributions: the formula above is the expectation, taken with respect to \(p\), of the logarithm of the ratio of \(p\) to \(q\).

Fifth, cross entropy (Cross Entropy)

Definition: for two probability distributions \(p(x)\) and \(q(x)\) over the same event space, the cross entropy is the average number of bits needed to uniquely identify an event from the set when the coding is based on an "unnatural" probability distribution \(q(x)\) (as opposed to the "true" distribution \(p(x)\)); in other words, the amount of effort needed to eliminate the uncertainty of the system when using the strategy specified by the non-true distribution \(q(x)\).

Suppose a random variable \(X\) can take \(n\) values, and there are two probability distributions over these values: \(p(X=x_i)\), the true distribution, and \(q(X=x_i)\), the non-true distribution. If the true distribution \(p(X=x_i)\) is used to measure the expected code length (average code length) needed to identify a sample, we get:
\[\begin{align} H(p) & = \sum_{i=1}^n p(X=x_i)\log{\frac{1}{p(X=x_i)}} \\ & = -\sum_{i=1}^n p(X=x_i)\log{p(X=x_i)} \end{align}\]
If the non-true distribution \(q(X=x_i)\) is used to represent the average code length under the true distribution \(p(X=x_i)\), we get:
\[H(p,q) = \sum_{i=1}^n p(X=x_i)\log{\frac{1}{q(X=x_i)}}\]
Because the samples encoded with \(q(X=x_i)\) actually come from the distribution \(p(X=x_i)\), the probabilities in \(H(p,q)\) are \(p(X=x_i)\). In this case \(H(p,q)\) is called the cross entropy.

For example, consider a random variable \(X\) with true distribution \(p(X) = ({\frac{1}{2}}, {\frac{1}{4}}, {\frac{1}{8}}, {\frac{1}{8}})\) and non-true distribution \(q(X) = ({\frac{1}{4}}, {\frac{1}{4}}, {\frac{1}{4}}, {\frac{1}{4}})\). Then \(H(p) = 1.75\) bits, which is the shortest average code length, and the cross entropy is
\[H(p,q) = {\frac{1}{2}}\log_2{4} + {\frac{1}{4}}\log_2{4} + {\frac{1}{8}}\log_2{4} + {\frac{1}{8}}\log_2{4} = 2 \text{ bits}\]
It can be seen that the average code length obtained from the non-true distribution \(q(X=x_i)\) is greater than the average code length obtained from the true distribution \(p(X=x_i)\). But is it always greater, as in this example?
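
The two figures in this example (1.75 bits and 2 bits) are easy to check numerically; the short sketch below is my own:

import numpy as np

p = np.array([1/2, 1/4, 1/8, 1/8])   # true distribution
q = np.array([1/4, 1/4, 1/4, 1/4])   # non-true distribution

H_p  = -np.sum(p * np.log2(p))       # entropy of the true distribution
H_pq = -np.sum(p * np.log2(q))       # cross entropy H(p, q)

print(H_p, H_pq)                     # 1.75 2.0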

Sixth, the relationship between relative entropy, cross entropy and entropy

Let us now simplify the relative entropy formula:
\[\begin{align} D_{KL}(p||q) & = \sum_{i=1}^n p(X=x_i)\log{\frac{p(X=x_i)}{q(X=x_i)}} \\ & = \sum_{i=1}^n \left[ p(X=x_i)\log{p(X=x_i)} - p(X=x_i)\log{q(X=x_i)} \right] \end{align}\]
Comparing this with the formulas for entropy and cross entropy,
\[\begin{align} \text{entropy} & = H(p) \\ & = -\sum_{i=1}^n p(X=x_i)\log{p(X=x_i)} \end{align}\]
\[\begin{align} \text{cross entropy} & = H(p,q) \\ & = \sum_{i=1}^n p(X=x_i)\log{\frac{1}{q(X=x_i)}} \\ & = -\sum_{i=1}^n p(X=x_i)\log{q(X=x_i)} \end{align}\]
we obtain
\[D_{KL}(p||q) = H(p,q) - H(p)\]
From this equation we can see that the relative entropy is exactly the number of extra bits by which the average code length obtained from the non-true distribution \(q(x)\) exceeds the average code length obtained from the true distribution \(p(x)\).

Moreover, since \(D_{KL}(p||q) \geq 0\), we have \(H(p,q) \geq H(p)\), and when \(p(x) = q(x)\) the cross entropy equals the entropy.
Also, when \(H(p)\) is constant (note: in machine learning the distribution of the training data is fixed), minimizing the relative entropy \(D_{KL}(p||q)\) is equivalent to minimizing the cross entropy \(H(p,q)\), which in turn is equivalent to maximum likelihood estimation.
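
Continuing the same example as before, the identity \(D_{KL}(p||q) = H(p,q) - H(p)\) can be verified numerically (my own sketch):

import numpy as np

p = np.array([1/2, 1/4, 1/8, 1/8])   # true distribution from the earlier example
q = np.array([1/4, 1/4, 1/4, 1/4])   # non-true distribution

H_p  = -np.sum(p * np.log2(p))       # H(p) = 1.75 bits
H_pq = -np.sum(p * np.log2(q))       # H(p, q) = 2 bits
D_kl = np.sum(p * np.log2(p / q))    # D_KL(p || q)

print(D_kl, H_pq - H_p)              # both are 0.25 bits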

Seventh, information gain (Information Gain)

For random variables \((X, Y)\), the information gain represents the degree to which knowing the information of the feature \(X\) reduces the uncertainty of the information of the class \(Y\).

The information gain of feature \(A\) for the training set \(D\), denoted \(g(D,A)\), is defined as the difference between the empirical entropy \(H(D)\) of the set \(D\) and the empirical conditional entropy \(H(D|A)\) of \(D\) given the feature \(A\):
\[g(D,A) = H(D) - H(D|A)\]
Here \(H(D)\) represents the uncertainty of classifying the data set \(D\); \(H(D|A)\) represents the uncertainty of classifying the data set \(D\) given the feature \(A\); and \(g(D,A)\) represents the degree to which the feature \(A\) reduces the uncertainty of classifying the data set \(D\). It can thus be seen that for a data set \(D\), the information gain depends on the feature: different features often have different information gains, and features with larger information gain have stronger classification ability.
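
As a sketch of how this is computed in practice (the toy feature `A` and labels `y` below are made up, not from the post), the information gain is the empirical entropy of the labels minus the empirical conditional entropy given the feature:

import numpy as np
from collections import Counter

def empirical_entropy(labels):
    """Empirical entropy of a list of class labels, in bits."""
    n = len(labels)
    return -sum((c / n) * np.log2(c / n) for c in Counter(labels).values())

def information_gain(feature, labels):
    """g(D, A) = H(D) - H(D|A), with probabilities estimated from counts."""
    n = len(labels)
    h_d_given_a = 0.0
    for value in set(feature):
        subset = [y for x, y in zip(feature, labels) if x == value]
        h_d_given_a += (len(subset) / n) * empirical_entropy(subset)
    return empirical_entropy(labels) - h_d_given_a

# Toy data set D: feature A with values 'a'/'b', binary class labels.
A = ['a', 'a', 'a', 'b', 'b', 'b']
y = ['yes', 'yes', 'no', 'no', 'no', 'no']
print(information_gain(A, y))   # ~0.46 bits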

Eighth, information gain ratio (Information Gain Ratio)

For random variables \((X, Y)\), the information gain ratio of the feature \(A\) with respect to the data set \(D\), denoted \(g_R(D,A)\), is defined as
\[g_R(D,A) = {\frac{g(D,A)}{H_A(D)}}\]
where the entropy of the feature is \(H_A(D) = -\sum_{i=1}^n {\frac{|D_i|}{|D|}}\log_2{\frac{|D_i|}{|D|}}\) and \(n\) is the number of values taken by the feature \(A\).
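
Extending the previous sketch (same toy data, my own naming), the gain ratio divides the information gain by the entropy \(H_A(D)\) of the feature's own value distribution; note that `empirical_entropy(feature)` applied to the feature column is exactly \(H_A(D)\):

import numpy as np
from collections import Counter

def empirical_entropy(values):
    n = len(values)
    return -sum((c / n) * np.log2(c / n) for c in Counter(values).values())

def information_gain(feature, labels):
    n = len(labels)
    h_d_given_a = 0.0
    for value in set(feature):
        subset = [y for x, y in zip(feature, labels) if x == value]
        h_d_given_a += (len(subset) / n) * empirical_entropy(subset)
    return empirical_entropy(labels) - h_d_given_a

def information_gain_ratio(feature, labels):
    """g_R(D, A) = g(D, A) / H_A(D)."""
    return information_gain(feature, labels) / empirical_entropy(feature)

A = ['a', 'a', 'a', 'b', 'b', 'b']
y = ['yes', 'yes', 'no', 'no', 'no', 'no']
print(information_gain_ratio(A, y))   # gain ~0.46 divided by H_A(D) = 1.0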

Ninth, understanding entropy and information gain with one diagram

[Figure: diagram relating entropy, conditional entropy, joint entropy and information gain]

For random variables \((X, Y)\): \(H(X)\) denotes the entropy of \(X\), \(H(Y)\) denotes the entropy of \(Y\), \(H(X|Y)\) denotes the conditional entropy of \(X\) given \(Y\), \(H(Y|X)\) denotes the conditional entropy of \(Y\) given \(X\), \(I(X,Y)\) denotes the information gain, and \(H(X,Y)\) denotes the joint entropy.
