Statistical Learning II.7 Generalized Linear Models 1: The Exponential Family
This part introduces generalized linear models, a class of supervised learning methods commonly used to build classifiers. Given data $\{(X_i,Y_i)\}_{i=1}^N$, a generalized linear model typically assumes that $Y_i$ follows a distribution from some exponential family. We therefore first introduce exponential families, and then discuss the generalized linear models derived from different choices of exponential family.
Definition of the exponential family
Let $p(x\mid\theta)$ denote a density function. We say it belongs to an exponential family if

$$p(x\mid\theta) = h(x)\exp\left(\theta^T \phi(x) - A(\theta)\right)$$
Since the density must integrate to one,

$$\int p(x\mid\theta)\,dx = \int h(x)\exp\left(\theta^T \phi(x) - A(\theta)\right)dx = \exp(-A(\theta))\int h(x)\exp\left(\theta^T \phi(x)\right)dx = 1$$

and therefore

$$A(\theta) = \log Z(\theta), \qquad Z(\theta) = \int h(x)\exp\left(\theta^T\phi(x)\right)dx$$
Here $\theta$ is called the natural parameter, $\phi(X)$ is a sufficient statistic of the family (by the Fisher–Neyman factorization theorem), $Z(\theta)$ is the partition function, and $A(\theta)$ is the cumulant function. If $\phi(X)=X$, the family is called a natural exponential family.
Another form of the exponential family is

$$p(x\mid\theta) = h(x)\exp\left(\eta(\theta)^T \phi(x) - A(\eta(\theta))\right)$$

If $\dim(\theta) < \dim(\eta(\theta))$, the family is called a curved exponential family; in that case there are more sufficient statistics than parameters. If $\dim(\theta) = \dim(\eta(\theta))$, the family is said to be in canonical form.
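As a quick numerical check of the definition (a sketch using the Poisson distribution, which is not discussed in this text): $p(x\mid\lambda)=\lambda^x e^{-\lambda}/x!$ fits the form above with $h(x)=1/x!$, $\phi(x)=x$, $\theta=\log\lambda$, and $A(\theta)=e^\theta$, and we can verify that $\exp(-A(\theta))\,Z(\theta)=1$:

```python
import math

# Poisson as an exponential family (standard fact, used here only as an illustration):
# h(x) = 1/x!, phi(x) = x, theta = log(lambda), A(theta) = exp(theta)
lam = 2.5
theta = math.log(lam)
A = math.exp(theta)  # cumulant function A(theta) = e^theta = lambda

# Z(theta) = sum_x h(x) exp(theta * x); truncate the infinite sum at a large x
Z = sum(math.exp(theta * x) / math.factorial(x) for x in range(100))

# Normalization: exp(-A(theta)) * Z(theta) should equal 1
assert abs(math.exp(-A) * Z - 1.0) < 1e-10
```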
Examples of exponential families
The Bernoulli distribution
$$p(x\mid\mu) = \mu^x(1-\mu)^{1-x} = \exp\left(\phi(x)^T\theta\right)$$

where

$$\phi(x) = [1_{x=0},\ 1_{x=1}]^T, \qquad \theta = [\log(1-\mu),\ \log\mu]^T$$
This is not a good representation: since $x \in \{0,1\}$, we have $1^T\phi(x)=1$, i.e. the two components of $\phi(x)$ are linearly dependent, so estimation yields only one independent equation for $\theta$. A better representation is
$$p(x\mid\mu) = (1-\mu)\exp\left[x\log\left(\frac{\mu}{1-\mu}\right)\right] = \exp\left(\phi(x)^T\theta - A(\theta)\right)$$

where

$$\phi(x) = x, \qquad \theta = \log\left(\frac{\mu}{1-\mu}\right), \qquad A(\theta) = -\log(1-\mu) = \log(1+e^\theta)$$
Here $\theta$ is called the log-odds ratio. The map from the natural parameter back to $\mu$ is the sigmoid function:

$$\mu = \mathrm{sigm}(\theta) = \frac{1}{1+e^{-\theta}}$$
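A minimal numerical sketch of this correspondence (the helper names `logit` and `sigmoid` are our own):

```python
import math

def logit(mu):
    """Natural parameter of the Bernoulli: the log-odds ratio."""
    return math.log(mu / (1.0 - mu))

def sigmoid(theta):
    """Map the natural parameter back to the mean mu."""
    return 1.0 / (1.0 + math.exp(-theta))

mu = 0.3
theta = logit(mu)  # log(0.3 / 0.7) ≈ -0.8473
assert abs(sigmoid(theta) - mu) < 1e-12  # sigmoid inverts the log-odds
```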
The Multinoulli distribution
$$p(x\mid\mu_1,\cdots,\mu_K) = \prod_{k=1}^K \mu_k^{x_k} = \exp\left[\sum_{k=1}^{K-1} x_k\log\left(\frac{\mu_k}{\mu_K}\right) + \log\mu_K\right]$$

where

$$\sum_{k=1}^K \mu_k = 1$$
Hence

$$p(x\mid\theta) = h(x)\exp\left(\theta^T\phi(x) - A(\theta)\right)$$

where

$$\theta = \left[\log\frac{\mu_1}{\mu_K},\cdots,\log\frac{\mu_{K-1}}{\mu_K}\right]^T, \qquad \phi(x) = [1_{x=1},\cdots,1_{x=K-1}]^T$$

$$A(\theta) = \log\left(1+\sum_{k=1}^{K-1} e^{\theta_k}\right)$$
The map from the natural parameters back to $\mu$ is

$$\begin{cases} \mu_k = \dfrac{e^{\theta_k}}{1+\sum_{j=1}^{K-1}e^{\theta_j}}, & k=1,\cdots,K-1 \\ \mu_K = \dfrac{1}{1+\sum_{j=1}^{K-1}e^{\theta_j}} \end{cases}$$
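A small sketch of this recovery (the function name is our own):

```python
import math

def multinoulli_mean(theta):
    """Recover (mu_1, ..., mu_K) from the K-1 natural parameters theta."""
    denom = 1.0 + sum(math.exp(t) for t in theta)
    mus = [math.exp(t) / denom for t in theta]  # mu_1 ... mu_{K-1}
    mus.append(1.0 / denom)                     # mu_K
    return mus

# Natural parameters computed from mu = (0.2, 0.3, 0.5): theta_k = log(mu_k / mu_K)
theta = [math.log(0.2 / 0.5), math.log(0.3 / 0.5)]
mus = multinoulli_mean(theta)
assert abs(sum(mus) - 1.0) < 1e-12
assert all(abs(m - t) < 1e-12 for m, t in zip(mus, [0.2, 0.3, 0.5]))
```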
Properties of the exponential family
Property 1

$$\frac{dA}{d\theta} = E[\phi(X)]$$
This follows by direct computation of the derivative; the next two properties are proved the same way:

$$\frac{dA}{d\theta} = \frac{d}{d\theta}\log\int h(x)\exp\left(\theta^T\phi(x)\right)dx = \frac{\int \phi(x)h(x)\exp\left(\theta^T\phi(x)\right)dx}{Z(\theta)} = \int \phi(x)\,p(x\mid\theta)\,dx$$
Property 2 (scalar $\theta$)

$$\frac{d^2A}{d\theta^2} = \mathrm{Var}[\phi(X)]$$

Property 3 (vector $\theta$)

$$\nabla^2 A(\theta) = \mathrm{Cov}(\phi(X))$$
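These properties can be checked numerically on the Bernoulli family, where $A(\theta)=\log(1+e^\theta)$, $E[\phi(X)]=\mu$, and $\mathrm{Var}[\phi(X)]=\mu(1-\mu)$ (a finite-difference sketch, our own code):

```python
import math

def A(theta):
    # Bernoulli cumulant function
    return math.log(1.0 + math.exp(theta))

theta, h = 0.7, 1e-4
mu = 1.0 / (1.0 + math.exp(-theta))  # mean of phi(X) = X

# Property 1: dA/dtheta = E[phi(X)] = mu  (central finite difference)
dA = (A(theta + h) - A(theta - h)) / (2 * h)
assert abs(dA - mu) < 1e-7

# Property 2: d^2A/dtheta^2 = Var[phi(X)] = mu * (1 - mu)
d2A = (A(theta + h) - 2 * A(theta) + A(theta - h)) / h**2
assert abs(d2A - mu * (1 - mu)) < 1e-5
```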
MLE for the exponential family
The moment matching equation of the exponential family MLE
Suppose $X_1,\cdots,X_N \sim_{iid} p(x\mid\theta)$. The likelihood is
$$L(\theta\mid X_1,\cdots,X_N) = \left[\prod_{i=1}^N h(X_i)\right]\exp\left(\theta^T\sum_{i=1}^N \phi(X_i) - NA(\theta)\right)$$

and the log-likelihood is

$$\log L(\theta\mid X_1,\cdots,X_N) = \log\left[\prod_{i=1}^N h(X_i)\right] + \theta^T\sum_{i=1}^N \phi(X_i) - NA(\theta)$$
The MLE satisfies

$$\nabla\log L(\theta\mid X_1,\cdots,X_N) = \sum_{i=1}^N \phi(X_i) - N\nabla A(\theta) = \sum_{i=1}^N \phi(X_i) - NE[\phi(X)] = 0$$

that is,

$$E[\phi(X)] = \frac{1}{N}\sum_{i=1}^N \phi(X_i)$$

Since $\phi(X)$ is a sufficient statistic of the exponential family, this is called the moment matching equation: the sample mean of the sufficient statistic equals its theoretical mean.
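For the Bernoulli family, moment matching gives $\hat\mu = \bar X$ and hence $\hat\theta = \log(\hat\mu/(1-\hat\mu))$ (a sketch with made-up data):

```python
import math

# Moment matching for the Bernoulli family: E[phi(X)] = mu, so the MLE
# sets mu_hat to the sample mean and theta_hat to its log-odds.
samples = [1, 0, 1, 1, 0, 1, 0, 1]  # made-up data
mu_hat = sum(samples) / len(samples)           # 5/8 = 0.625
theta_hat = math.log(mu_hat / (1.0 - mu_hat))  # MLE of the natural parameter

# Sanity check: mapping theta_hat back through the sigmoid recovers mu_hat
assert abs(1.0 / (1.0 + math.exp(-theta_hat)) - mu_hat) < 1e-12
```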
Bayesian methods for the exponential family
The exponential family is a conjugate family
Write the likelihood in the following form:

$$L(\theta\mid X_1,\cdots,X_N) \propto g(\theta)^N e^{\eta(\theta)^T s_N}, \qquad s_N = \sum_{i=1}^N s(X_i)$$
Take a prior from the same exponential family,

$$p(\theta\mid\nu_0,\tau_0) \propto g(\theta)^{\nu_0} e^{\eta(\theta)^T \tau_0}$$

Then the posterior is

$$p(\theta\mid\nu_0+N,\ \tau_0+s_N) \propto g(\theta)^{\nu_0+N} e^{\eta(\theta)^T(\tau_0+s_N)}$$
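As a concrete instance (standard Beta–Bernoulli conjugacy, not spelled out in the original text), a Bernoulli likelihood with a Beta prior yields a Beta posterior whose hyperparameters are updated by simple counts:

```python
# Beta-Bernoulli conjugate update: prior Beta(a, b) on mu, Bernoulli data.
# Posterior is Beta(a + #ones, b + #zeros) -- the counts play the role of
# the accumulated sufficient statistic s_N in the general update above.
a0, b0 = 2.0, 2.0                   # made-up prior hyperparameters
samples = [1, 0, 1, 1, 0, 1, 0, 1]  # made-up data

ones = sum(samples)
zeros = len(samples) - ones
a_post, b_post = a0 + ones, b0 + zeros  # Beta(7, 5)

posterior_mean = a_post / (a_post + b_post)  # 7/12 ≈ 0.5833
```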