[Machine Learning] Basic Probability Theory and Information Theory


Introduction

Probability theory is a framework for representing statements of uncertainty. It not only provides methods to quantify uncertainty, but also provides axioms for deriving new uncertainty statements.

In the field of artificial intelligence, probability theory is used in two ways. First, the laws of probability tell us how AI systems should reason, so algorithms are designed to compute or approximate expressions derived from probability theory. Second, probability and statistics can be used to analyze and evaluate the behavior of AI systems.

Probability theory allows us to make uncertain statements and to reason in the presence of uncertainty, while information theory allows us to quantify the amount of uncertainty in a probability distribution.

Machine learning often has to deal with uncertain quantities, and sometimes may also deal with random quantities. Uncertainty comes from three sources: the inherent randomness of the system being modeled, incomplete observations, and incomplete modeling.

Probability Theory

Probability

  • Bayesian probability: regards probability as a degree of belief, i.e. how certain we are that an event will occur.

  • Frequentist probability: regards probability as the long-run frequency with which an event occurs.

  • Random variable: a variable that can take different values at random. A random variable is usually written as a plain (upright) lowercase letter, and the values it can take as italic lowercase letters; for example, the random variable $\mathrm{x}$ may take the values $x_1, x_2$.

    If the random variable is vector-valued, it is written in bold as $\mathbf{x}$ and its possible values as $\boldsymbol{x}$. On its own, a random variable is only a description of possible states; it must be paired with a probability distribution that specifies how likely each state is.

    Random variables may be discrete or continuous. A discrete random variable has a finite or countably infinite number of states; the states need not be integers and may simply be named states with no numerical value. A continuous random variable takes real values.

Probability Distributions

Discrete Random Variables and Probability Mass Functions

For a discrete random variable, the probability distribution can be described by a probability mass function (PMF), usually written with a capital $P$. The PMF maps each state a random variable can take to the probability of the variable taking that state. For example, the probability that the random variable $\mathrm{x}$ takes the value $x$ is written $P(\mathrm{x} = x)$.

In general, the probability distribution that a random variable $\mathrm{x}$ follows is written $\mathrm{x} \sim P(\mathrm{x})$. A probability mass function $P(\mathrm{x})$ must satisfy the following conditions:

⧫ the domain of $P$ must be the set of all possible states of $\mathrm{x}$;

⧫ $\forall x \in \mathrm{x},\ 0 \leq P(x) \leq 1$;

⧫ $\sum_{x \in \mathrm{x}} P(x) = 1$ (normalization).

Suppose a random variable $\mathrm{x}$ has $k$ states and follows a uniform distribution. Its PMF can be set to
$$P(\mathrm{x} = x_i) = \frac{1}{k}$$
for all $i$.
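As a quick sanity check, here is a minimal sketch (not from the original post; the number of states $k = 6$ is just an illustrative choice) that builds the PMF of a uniform discrete variable and verifies the conditions above.

import numpy as np

k = 6                                   # number of states, e.g. a fair die (illustrative choice)
pmf = np.full(k, 1 / k)                 # P(x = x_i) = 1/k for every state

print(np.all((0 <= pmf) & (pmf <= 1)))  # True: every probability lies in [0, 1]
print(np.isclose(pmf.sum(), 1.0))       # True: the probabilities sum to 1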

  • Joint probability distribution

    The probability mass function applied to several random variables at once, e.g. $P(\mathrm{x} = x, \mathrm{y} = y)$ denotes the probability that $\mathrm{x} = x$ and $\mathrm{y} = y$ occur simultaneously.

Continuous Random Variables and Probability Density Functions

For a continuous random variable, the probability distribution is described by a probability density function (PDF), usually written with a lowercase $p$, which must satisfy:

⧫ the domain of $p$ must be the set of all possible states of $\mathrm{x}$;

⧫ $\forall x \in \mathrm{x},\ p(x) \geq 0$ (note that $p(x) \leq 1$ is not required);

⧫ $\displaystyle \int_{-\infty}^{\infty} p(x)\,\mathrm{d}x = 1$.

A PDF does not give the probability of an individual state directly; integrating the PDF over a set of points gives the true probability mass of that set.

For example, the uniform distribution on $(a, b)$ has density
$$U(x; a, b) = \frac{1}{b - a}$$
for $x \in (a, b)$, and $0$ elsewhere.

Marginal Probability Distribution

Definition: given the joint probability distribution of a set of variables, the probability distribution over a subset of them is called the marginal probability distribution.

Suppose we have discrete random variables $\mathrm{x}, \mathrm{y}$ and know $P(\mathrm{x}, \mathrm{y})$. Then
$$\forall x \in \mathrm{x},\ P(\mathrm{x} = x) = \sum_{y} P(\mathrm{x} = x, \mathrm{y} = y)$$
For continuous variables, the sum is replaced by an integral:
$$p(x) = \int p(x, y)\,\mathrm{d}y$$
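To make the marginalization concrete, here is a small sketch (the $2 \times 3$ joint table below is made up for illustration) that recovers the marginals by summing the joint distribution over the other variable.

import numpy as np

# hypothetical joint distribution P(x, y): rows index x, columns index y
P_xy = np.array([[0.10, 0.20, 0.10],
                 [0.25, 0.05, 0.30]])

P_x = P_xy.sum(axis=1)  # P(x) = sum_y P(x, y)
P_y = P_xy.sum(axis=0)  # P(y) = sum_x P(x, y)
print(P_x)  # [0.4 0.6]
print(P_y)  # [0.35 0.25 0.4]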

Conditional Probability Distribution

Definition : The probability of an event occurring given the occurrence of other events .

For example, given $\mathrm{x} = x$, the probability that $\mathrm{y} = y$ occurs is
$$P(\mathrm{y} = y \mid \mathrm{x} = x) = \frac{P(\mathrm{y} = y, \mathrm{x} = x)}{P(\mathrm{x} = x)}$$
which is defined only when $P(\mathrm{x} = x) > 0$.
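Continuing with the same kind of made-up joint table, the sketch below computes $P(\mathrm{y} \mid \mathrm{x} = x_0)$ by dividing the corresponding row of the joint by the marginal $P(\mathrm{x} = x_0)$.

import numpy as np

P_xy = np.array([[0.10, 0.20, 0.10],   # hypothetical joint P(x, y)
                 [0.25, 0.05, 0.30]])

P_x = P_xy.sum(axis=1)            # marginal P(x)
P_y_given_x0 = P_xy[0] / P_x[0]   # P(y | x = x_0) = P(x_0, y) / P(x_0)
print(P_y_given_x0)               # [0.25 0.5  0.25]
print(P_y_given_x0.sum())         # 1.0 -- a valid distribution over y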

The Chain Rule of Conditional Probability

The joint probability distribution of any multidimensional random variable can be decomposed into a product of conditional distributions, each over a single variable:
$$P(\mathrm{x}^{(1)}, \cdots, \mathrm{x}^{(n)}) = P(\mathrm{x}^{(1)}) \prod_{i = 2}^n P(\mathrm{x}^{(i)} \mid \mathrm{x}^{(1)}, \cdots, \mathrm{x}^{(i - 1)})$$
For instance, with three variables the rule gives
$$P(a, b, c) = P(a)\, P(b \mid a)\, P(c \mid a, b)$$
and the same kind of decomposition follows directly from repeated application of the definition of conditional probability:
$$\begin{aligned} P(a, b, c) &= P(a \mid b, c)\, P(b, c) \\ P(b, c) &= P(b \mid c)\, P(c) \\ P(a, b, c) &= P(a \mid b, c)\, P(b \mid c)\, P(c) \end{aligned}$$
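A minimal numeric check of the chain rule (the joint distribution over three binary variables is generated randomly, so the specific numbers are arbitrary): the product of the conditionals reproduces the joint probability.

import numpy as np

rng = np.random.default_rng(0)
P_abc = rng.random((2, 2, 2))
P_abc /= P_abc.sum()                        # a random but valid joint P(a, b, c)

a, b, c = 1, 0, 1                           # an arbitrary joint state
P_a = P_abc.sum(axis=(1, 2))                # marginal P(a)
P_ab = P_abc.sum(axis=2)                    # marginal P(a, b)
P_b_given_a = P_ab[a, b] / P_a[a]           # P(b | a)
P_c_given_ab = P_abc[a, b, c] / P_ab[a, b]  # P(c | a, b)

print(np.isclose(P_abc[a, b, c], P_a[a] * P_b_given_a * P_c_given_ab))  # True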

Independence and Conditional Independence

Two random variables $\mathrm{x}$ and $\mathrm{y}$ are said to be independent if their joint probability distribution can be expressed as the product of their individual distributions, i.e.
$$\forall x \in \mathrm{x}, y \in \mathrm{y},\ P(\mathrm{x} = x, \mathrm{y} = y) = P(\mathrm{x} = x)\,P(\mathrm{y} = y)$$
If the conditional joint distribution of $\mathrm{x}$ and $\mathrm{y}$ factorizes in this way for every value $z$ of a random variable $\mathrm{z}$, then $\mathrm{x}$ and $\mathrm{y}$ are conditionally independent given $\mathrm{z}$:
$$\forall x \in \mathrm{x}, y \in \mathrm{y}, z \in \mathrm{z},\ P(\mathrm{x} = x, \mathrm{y} = y \mid \mathrm{z} = z) = P(\mathrm{x} = x \mid \mathrm{z} = z)\,P(\mathrm{y} = y \mid \mathrm{z} = z)$$
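The factorization above can be checked directly on a small table; in the sketch below (both joint tables are made up) the first joint is independent by construction and the second is not.

import numpy as np

P_xy = np.outer([0.3, 0.7], [0.4, 0.6])           # joint built as a product -> independent
P_x, P_y = P_xy.sum(axis=1), P_xy.sum(axis=0)
print(np.allclose(P_xy, np.outer(P_x, P_y)))      # True

P_dep = np.array([[0.30, 0.10],
                  [0.10, 0.50]])                  # a dependent joint distribution
P_x, P_y = P_dep.sum(axis=1), P_dep.sum(axis=0)
print(np.allclose(P_dep, np.outer(P_x, P_y)))     # False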

Expectation, Variance, and Covariance

Expectation

Expectation: the expectation of a function $f$ with respect to a probability distribution $P(\mathrm{x})$ (or $p(\mathrm{x})$) is the average value that $f(x)$ takes when $x$ is drawn from $P$. For a discrete variable,
$$\mathbb{E}_{x \sim P}[f(x)] = \sum_x P(x)f(x)$$
For a continuous variable, the sum is replaced by an integral: $\mathbb{E}_{x \sim p}[f(x)] = \int p(x)f(x)\,\mathrm{d}x$.
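For a discrete distribution the expectation is just a probability-weighted sum; here is a minimal sketch with a made-up PMF and $f(x) = x^2$.

import numpy as np

x = np.array([0, 1, 2, 3])          # states of x
P = np.array([0.1, 0.2, 0.3, 0.4])  # a hypothetical PMF P(x)
f = x ** 2                          # f(x) = x^2

print(np.sum(P * f))                # E_{x~P}[f(x)] = sum_x P(x) f(x) = 5.0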

Variance

Variance: measures how much the values of $f(x)$ vary when $x$ is sampled from its probability distribution:
$$\mathrm{Var}(f(x)) = \mathbb{E}\left[(f(x) - \mathbb{E}[f(x)])^2\right]$$
Its square root is called the standard deviation.

Covariance

Covariance: gives, in some sense, both the strength of the linear dependence between two variables and the scale of those variables:
$$\mathrm{Cov}(f(x), g(y)) = \mathbb{E}\left\{ (f(x) - \mathbb{E}[f(x)])\,(g(y) - \mathbb{E}[g(y)]) \right\}$$
A covariance with large absolute value means the variables change a lot and are far from their respective means at the same time. If the covariance is positive, the two variables tend to take relatively large values simultaneously; if it is negative, one variable tends to take relatively large values when the other takes relatively small values, and vice versa.

The correlation coefficient normalizes the contribution of each variable, so it measures only how related the variables are and is not affected by their scales.

  • Covariance matrix

    The covariance matrix of a random vector $\boldsymbol{x} \in \mathbb{R}^n$ is an $n \times n$ matrix satisfying
    $$\mathrm{Cov}(\mathbf{x})_{i,j} = \mathrm{Cov}(\mathrm{x}_i, \mathrm{x}_j)$$
    where the diagonal elements are the variances $\mathrm{Cov}(\mathrm{x}_i, \mathrm{x}_i) = \mathrm{Var}(\mathrm{x}_i)$.

import numpy as np

x = np.array([1, 2, 3, 4, 5, 6, 7, 8])
y = np.array([8, 7, 6, 5, 4, 3, 2, 1])
Mean = np.mean(x)
Var = np.var(x)                 # biased (population) variance: divides by N
Var_unbias = np.var(x, ddof=1)  # unbiased sample variance: divides by N - ddof
Cov = np.cov(x, y)              # covariance matrix; np.cov uses ddof=1 by default
print(Mean)  # 4.5
print(Var)  # 5.25
print(Var_unbias)  # 6.0
print(Cov)
# [[ 6. -6.]
#  [-6.  6.]]

Common Probability Distributions

Bernoulli distribution

The distribution of a single binary random variable, also called the two-point distribution. The parameter $\phi$ gives the probability that the random variable equals $1$:
$$\begin{aligned} P(\mathrm{x} = 1) &= \phi \\ P(\mathrm{x} = 0) &= 1 - \phi \\ P(\mathrm{x} = x) &= \phi^x (1 - \phi)^{1 - x} \\ \mathbb{E}_{\mathrm{x}}[\mathrm{x}] &= \phi \\ \mathrm{Var}_{\mathrm{x}}(\mathrm{x}) &= \phi (1 - \phi) \end{aligned}$$
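These identities can be checked with scipy.stats.bernoulli; the value $\phi = 0.3$ below is just an example.

from scipy.stats import bernoulli

phi = 0.3
rv = bernoulli(phi)
print(rv.pmf(1), rv.pmf(0))  # 0.3 0.7
print(rv.mean(), rv.var())   # ~0.3 0.21  (= phi and phi * (1 - phi))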

Gaussian distribution

Also called the normal distribution:
$$\mathcal{N}(x; \mu, \sigma^2) = \sqrt{\frac{1}{2\pi\sigma^2}}\, e^{-\frac{(x - \mu)^2}{2\sigma^2}}$$
Sometimes $\beta = \frac{1}{\sigma^2}$ is used to denote the precision of the distribution. By the central limit theorem, the sum of many independent random variables is approximately normally distributed, so noise is often modeled as normally distributed.
(Figure: density curve of the Gaussian distribution.)
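The density formula can be checked against scipy.stats.norm; the sketch below (with $\mu = 0$, $\sigma = 1$ chosen for illustration) confirms that the hand-written expression matches the library.

import numpy as np
from scipy.stats import norm

mu, sigma = 0.0, 1.0
x = np.linspace(-3, 3, 7)
manual = np.sqrt(1 / (2 * np.pi * sigma ** 2)) * np.exp(-(x - mu) ** 2 / (2 * sigma ** 2))
print(np.allclose(manual, norm.pdf(x, loc=mu, scale=sigma)))  # True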

Exponential distribution

$$f(x) = \begin{cases} \lambda e^{-\lambda x}, & x > 0 \\ 0, & \text{otherwise} \end{cases}$$

Equivalently, using an indicator function,
$$p(x; \lambda) = \lambda\, \boldsymbol{1}_{x \geq 0}\, e^{-\lambda x}$$
The exponential distribution uses the indicator function $\boldsymbol{1}_{x \geq 0}$ to assign probability $0$ to negative values of $x$.

Laplace distribution

The Laplace distribution lets us place a sharp peak of probability mass at an arbitrary point $\mu$:
$$p(x; \mu, \gamma) = \frac{1}{2\gamma}\, e^{-\frac{\left| x - \mu \right|}{\gamma}}$$
(Figure: density curve of the Laplace distribution.)
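Both densities are also available in scipy.stats as expon and laplace; note that scipy parameterizes the exponential by scale = 1/λ. A minimal check against the formulas above (λ = 2, μ = 0, γ = 1 are illustrative choices):

import numpy as np
from scipy.stats import expon, laplace

lam, mu, gamma = 2.0, 0.0, 1.0
x = np.linspace(-2, 4, 13)

exp_manual = np.where(x >= 0, lam * np.exp(-lam * x), 0.0)  # lambda * 1_{x>=0} * e^{-lambda x}
lap_manual = np.exp(-np.abs(x - mu) / gamma) / (2 * gamma)  # Laplace density

print(np.allclose(exp_manual, expon.pdf(x, scale=1 / lam)))          # True
print(np.allclose(lap_manual, laplace.pdf(x, loc=mu, scale=gamma)))  # True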

Common functions

Sigmoid function

$$\sigma(x) = \frac{1}{1 + e^{-x}}$$

(Figure: the sigmoid function.)

This function is commonly used to produce the parameter $\phi$ of a Bernoulli distribution, since its range is $(0, 1)$. The sigmoid saturates when its argument is very large or very small, meaning the function becomes very flat there and insensitive to small changes in its input.

Softplus function

$$\zeta(x) = \log (1 + e^x)$$

The name comes from its relation to another function, the ReLU,
$$x^+ = \max\{0, x\}$$
of which softplus is a smoothed ("soft") version. The softplus function can be used to produce the $\sigma$ (or precision $\beta$) parameter of a normal distribution, since its range is $(0, \infty)$.

(Figure: the softplus function.)
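A small numeric sketch of both functions, checking the stated ranges and the identity $\zeta(x) - \zeta(-x) = x$ (a standard property of softplus, not discussed above):

import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def softplus(x):
    return np.log1p(np.exp(x))  # log(1 + e^x), via log1p for better accuracy

x = np.linspace(-5, 5, 11)
print(np.all((sigmoid(x) > 0) & (sigmoid(x) < 1)))  # True: range of sigmoid is (0, 1)
print(np.all(softplus(x) > 0))                      # True: range of softplus is (0, inf)
print(np.allclose(softplus(x) - softplus(-x), x))   # True: zeta(x) - zeta(-x) = x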

Bayes' Rule

Given $P(\mathrm{y} \mid \mathrm{x})$, we can compute $P(\mathrm{x} \mid \mathrm{y})$:
$$P(\mathrm{x} \mid \mathrm{y}) = \frac{P(\mathrm{y} \mid \mathrm{x})\,P(\mathrm{x})}{P(\mathrm{y})} = \frac{P(\mathrm{y} \mid \mathrm{x})\,P(\mathrm{x})}{\sum_x P(\mathrm{y} \mid \mathrm{x} = x)\, P(\mathrm{x} = x)}$$
so we do not need to know $P(\mathrm{y})$ in advance; it is obtained by marginalizing over $\mathrm{x}$.
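A worked numeric example (all values invented for illustration): suppose a test has $P(\mathrm{y}=1 \mid \mathrm{x}=1) = 0.9$, a false-positive rate $P(\mathrm{y}=1 \mid \mathrm{x}=0) = 0.05$, and a prior $P(\mathrm{x}=1) = 0.01$. Bayes' rule gives the posterior $P(\mathrm{x}=1 \mid \mathrm{y}=1)$, with $P(\mathrm{y}=1)$ obtained by marginalizing over $\mathrm{x}$.

P_x1 = 0.01           # prior P(x = 1)
P_y1_given_x1 = 0.90  # likelihood P(y = 1 | x = 1)
P_y1_given_x0 = 0.05  # false-positive rate P(y = 1 | x = 0)

P_y1 = P_y1_given_x1 * P_x1 + P_y1_given_x0 * (1 - P_x1)  # sum_x P(y = 1 | x) P(x)
print(P_y1_given_x1 * P_x1 / P_y1)  # posterior P(x = 1 | y = 1) ~= 0.154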

Information Theory

Basic idea: a less likely event carries more information than a more likely event.

A measure of information should satisfy three conditions:

  • Events that are very likely to occur should carry little information; in the extreme case, an event that is guaranteed to occur should carry no information at all.
  • Less likely events should carry more information.
  • Independent events should have additive information. For example, finding out that a tossed coin has come up heads twice should convey twice as much information as finding out that it has come up heads once.

Self-information: the self-information of an event $\mathrm{x} = x$ is
$$I(x) = -\log P(x)$$
Here $\log$ denotes the natural logarithm with base $e$, so the unit of information is the nat. If base $2$ is used instead, the unit is the bit.

Shannon entropy: quantifies the total amount of uncertainty in an entire probability distribution; in coding theory it corresponds to the optimal expected code length.
$$H(\mathrm{x}) = -\mathbb{E}_{x \sim P}[\log P(x)] = -\sum_{x} P(x)\log P(x)$$
(Figure: plot of $-p \log p$ in nats, produced by the code below.)

import math
import random
import numpy as np
import matplotlib.pyplot as plt

# Shannon entropy plot: the contribution -p*log(p) of a single state, in nats
p = np.linspace(1e-6, 1 - 1e-6, 100)
entropy = - p * np.log(p)
plt.figure(figsize=(6, 4))
plt.plot(p, entropy)
plt.xlabel('p')
plt.ylabel('entropy(nats)')
plt.savefig('entropy.jpg', dpi=300)
plt.show()


def Shannon_Entropy(string):
    """
    Estimate the optimal code length (entropy in bits per character) of a string,
    using the empirical frequency of each character as its probability.
    :param string: input text
    :return: entropy in bits per character
    """
    entropy = 0
    for ch in range(0, 256):
        Px = string.count(chr(ch)) / len(string)
        if Px > 0:
            entropy += -Px * math.log(Px, 2)
    return entropy


message = "".join([chr(random.randint(0, 64)) for i in range(100)])
print(message)
# .1??@41-'7;<;!
# / +
# =;9!<"6#0%3
print(Shannon_Entropy(message))
# 5.419819909835648

Joint entropy: the entropy of several random variables considered jointly:
$$H(X, Y) = -\sum_{x,y} P(x, y) \log P(x, y)$$
Conditional entropy: the entropy of one variable given that another has been observed:
$$H(X \mid Y) = - \sum_y P(y) \sum_x P(x \mid y) \log P(x \mid y)$$
Mutual information: the part of the information of two variables that overlaps:
$$I(X, Y) = H(X) + H(Y) - H(X, Y)$$
Variation of information: the part of the information of two variables that does not overlap:
$$V(X, Y) = H(X, Y) - I(X, Y)$$
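The sketch below computes all four quantities from a small made-up joint table (natural logarithm, so the units are nats) and checks that the conditional entropy agrees with the identity $H(X \mid Y) = H(X, Y) - H(Y)$.

import numpy as np

P_xy = np.array([[0.10, 0.20, 0.10],  # hypothetical joint P(x, y); rows index x, columns index y
                 [0.25, 0.05, 0.30]])
P_x = P_xy.sum(axis=1)
P_y = P_xy.sum(axis=0)

H_xy = -np.sum(P_xy * np.log(P_xy))  # joint entropy H(X, Y)
H_x = -np.sum(P_x * np.log(P_x))     # H(X)
H_y = -np.sum(P_y * np.log(P_y))     # H(Y)

P_x_given_y = P_xy / P_y             # P(x | y): each column sums to 1
H_x_given_y = -np.sum(P_y * np.sum(P_x_given_y * np.log(P_x_given_y), axis=0))

I_xy = H_x + H_y - H_xy              # mutual information I(X, Y)
V_xy = H_xy - I_xy                   # variation of information V(X, Y)
print(np.isclose(H_x_given_y, H_xy - H_y))  # True
print(H_xy, H_x_given_y, I_xy, V_xy)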
KL divergence: measures the gap between two distributions $P(\mathrm{x})$ and $Q(\mathrm{x})$ over the same random variable:
$$D_{\rm KL}(P \mid\mid Q) = \mathbb{E}_{\mathrm{x} \sim P}\left[\log \frac{P(x)}{Q(x)} \right] = \sum_x P(x) \log \frac{P(x)}{Q(x)}$$

$$D_{\rm KL}(Q \mid\mid P) = \mathbb{E}_{\mathrm{x} \sim Q}\left[\log \frac{Q(x)}{P(x)} \right] = \sum_x Q(x) \log \frac{Q(x)}{P(x)}$$

Note that the KL divergence is not symmetric: in general $D_{\rm KL}(P \mid\mid Q) \neq D_{\rm KL}(Q \mid\mid P)$.

★ Note: for discrete random variables, $D_{\rm KL}(P \mid\mid Q)$ measures the amount of extra information needed to send a message containing symbols drawn from distribution $P$ when we use a code that was designed to minimize the message length for distribution $Q$.

from scipy.stats import entropy  # scipy's entropy(p, q) computes the KL divergence D(P || Q)
import numpy as np


def KL_Divergence(p, q):
    """D(P || Q)"""
    return np.sum(np.where(p != 0, p * np.log(p / q), 0))


p = np.array([0.1, 0.9])
q = np.array([0.1, 0.9])

print(KL_Divergence(p, q))  # 0.0

print(entropy(p, q))  # 0.0
# the two distributions are identical, so the divergence is zero
from scipy.stats import norm, entropy
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(1, 10, 1000)
y1 = norm.pdf(x, 3, 0.4)
y2 = norm.pdf(x, 6, 0.4)
P = y1 + y2  # construct a bimodal, unnormalized p(x); scipy's entropy() normalizes internally

KL_PQ = []
KL_QP = []
Q_list = []
for mu in np.linspace(0, 10, 100):
    for sigma in np.linspace(0.1, 5, 50):
        Q = norm.pdf(x, mu, sigma)
        Q_list.append(Q)
        KL_PQ.append(entropy(P, Q))
        KL_QP.append(entropy(Q, P))
# find the Q that minimizes KL(P || Q)
KL_PQ_min = np.argmin(KL_PQ)
# find the Q that minimizes KL(Q || P)
KL_QP_min = np.argmin(KL_QP)

plt.rcParams['font.family'] = ["Times New Roman"]
fig, axes = plt.subplots(1, 2, figsize=(10, 4))
axes[0].plot(x, P / 2, 'g-', label="$\\bf P(x)$")
axes[0].plot(x, Q_list[KL_PQ_min], 'b-.', label="$\\bf {Q^{*}(x)}$")
axes[0].set_ylim(0, 1.2)
axes[0].set_xlabel("$\\bf x$")
axes[0].set_ylabel("$\\bf P(x)$")
axes[0].set_title("$\\bf \\rm \\arg\\min_{Q}(KL(P||Q))$")
axes[0].legend()

axes[1].plot(x, P / 2, 'g-', label="$\\bf P(x)$")
axes[1].plot(x, Q_list[KL_QP_min], 'r-.', label="$\\bf {Q^{*}(x)}$")
axes[1].set_ylim(0, 1.2)
axes[1].set_xlabel("$\\bf x$")
axes[1].set_ylabel("$\\bf P(x)$")
axes[1].set_title("$\\bf \\rm \\arg\\min_{Q}(KL(Q||P))$")
axes[1].legend()

plt.savefig("./images/KL.jpg", dpi=300)
plt.show()

(Figure: the target $P(x)$ together with the single Gaussian $Q^*(x)$ that minimizes $\mathrm{KL}(P \mid\mid Q)$ (left) and the one that minimizes $\mathrm{KL}(Q \mid\mid P)$ (right), produced by the code above.)

Cross-entropy: a quantity closely related to the KL divergence. It measures the distance between the actual output distribution $Q$ and the expected output distribution $P$:
$$\begin{aligned} H(P, Q) &= H(P) + D_{\rm KL}(P \mid\mid Q) = -\mathbb{E}_{x \sim P}[\log Q(x)] \\ &= -\sum_{x} P(x)\log Q(x) \end{aligned}$$
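A quick numeric check of the identity $H(P, Q) = H(P) + D_{\rm KL}(P \mid\mid Q)$ on two made-up discrete distributions:

import numpy as np

P = np.array([0.2, 0.5, 0.3])
Q = np.array([0.1, 0.6, 0.3])

H_P = -np.sum(P * np.log(P))         # entropy H(P)
D_KL = np.sum(P * np.log(P / Q))     # KL divergence D(P || Q)
H_PQ = -np.sum(P * np.log(Q))        # cross-entropy H(P, Q)
print(np.isclose(H_PQ, H_P + D_KL))  # True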
For binary classification, the cross-entropy is computed as
$$H(P, Q) = -\frac{1}{N}\sum_x \left[ P(x) \log Q(x) + (1 - P(x))\log (1 - Q(x)) \right]$$

Here $P(x)$ is the expected (true) label of sample $x$, $1$ for the positive class and $0$ for the negative class; $Q(x)$ is the probability predicted by the model that sample $x$ is positive; and $N$ is the number of samples.

Suppose there are $3$ samples whose true class labels are $P = (1, 0, 1)$, and the predicted probabilities that each sample is positive are $Q1 = (0.8, 0.1, 0.7)$; a second, worse prediction $Q2 = (0.1, 0.7, 0.3)$ is included for comparison. The cross-entropy between the expected output and the actual output is computed as follows.

from sklearn.metrics import log_loss
import numpy as np

P = np.array([1, 0, 1])
Q1 = np.array([0.8, 0.1, 0.7])
Q2 = np.array([0.1, 0.7, 0.3])
# binary_ent = 0.0
# for i in range(P.size):
#     binary_ent += P[i] * np.log(Q1[i]) + (1 - P[i]) * np.log(1 - Q1[i])
# print(-binary_ent)  # 0.6851790109107685
binary_ent1 = - P * np.log(Q1) - (1 - P) * np.log(1 - Q1)
binary_ent2 = - P * np.log(Q2) - (1 - P) * np.log(1 - Q2)

print(binary_ent1.sum() / P.size)  # 0.22839300363692283
print(binary_ent2.sum() / P.size)  # 1.5701769005486392
# np.log is the natural logarithm

print(log_loss(P, Q1))
print(log_loss(P, Q2))
# 0.22839300363692283
# 1.5701769005486392

The results show that the cross-entropy of prediction $Q1$ is clearly smaller than that of $Q2$, meaning $Q1$ is closer to the expected output.

For multi-class classification, the cross-entropy is computed as
$$H(P, Q) = -\frac{1}{N}\sum_{i=1}^N \sum_{j=1}^M P(x_{i,j})\log(Q(x_{i,j}))$$
where $P(x_{i,j})$ is $1$ if the true class label of sample $x_i$ is $j$ and $0$ otherwise, $Q(x_{i,j})$ is the predicted probability that sample $x_i$ belongs to class $j$, $N$ is the number of samples, and $M$ is the number of class labels.

Suppose there are $3$ samples and $10$ class labels $0 \sim 9$. $P$ holds the true label of each sample as a vector; $Q$ holds the model's class predictions as a $3 \times 10$ matrix whose $i$-th row is the vector of predicted class probabilities for sample $i$. The cross-entropy is computed as follows.

import numpy as np
from sklearn.metrics import log_loss
from sklearn.preprocessing import LabelBinarizer

P = np.array(['1', '5', '9'])  # true labels of the samples

Q = np.array([
    [0.1, 0.7, 0, 0, 0.01, 0.19, 0, 0, 0, 0],
    [0, 0.1, 0, 0.1, 0, 0.8, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0.1, 0, 0.9]
])  # Q[i] is a vector; Q[i][j] is the predicted probability that sample P[i] belongs to class j

labels = ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9']  # the set of class labels
sklearn_multi_ent = log_loss(P, Q, labels=labels)
print(sklearn_multi_ent)  # 0.22839300363693002

# compute according to the formula
# first binarize the class labels to 0/1 (needed for both numeric and non-numeric labels)
LB = LabelBinarizer()
LB.fit(labels)
bin_P = LB.transform(P)  # convert the true labels to one-hot vectors
print(bin_P)
# [[0 1 0 0 0 0 0 0 0 0]
#  [0 0 0 0 0 1 0 0 0 0]
#  [0 0 0 0 0 0 0 0 0 1]]

N = P.size
M = len(labels)
eps = 1e-15  # clipping threshold for the predicted probabilities

multi_ent = 0.0
for i in range(N):
    for j in range(M):
        if Q[i, j] > 1 - eps:
            Q[i, j] = 1 - eps
        if Q[i, j] < eps:
            Q[i, j] = eps
        multi_ent += -bin_P[i, j] * np.log(Q[i, j])

print(multi_ent / N)  # 0.22839300363692283

If $P$ is the expected (true) distribution and $Q$ is the model's (actual) distribution, minimizing the cross-entropy $H(P, Q)$ drives the model distribution toward the true distribution.

