Statistical Inference (3): Exponential Family

1. Exponential family

  • Definition

    • PDF: $p(y;x)=\exp\left(\lambda(x)^T t(y)-\alpha(x)+\beta(y)\right)$, written $y\sim \varepsilon\left(x;\lambda(\cdot),t(\cdot),\beta(\cdot)\right)$
    • natural statistic: $t(y)$
    • natural parameter: $\lambda(x)$
    • log-partition function: $\alpha(x)$
    • partition function: $Z(x)=\exp(\alpha(x))$
    • base measure: $\exp(\beta(y))$
  • Regularity condition (regular): the family is regular if the support of every distribution $p(y;x)$ in the family does not depend on $x$

    • In essence, this guarantees that differentiation and integration can be interchanged, as required by the CRB regularity condition:
      $$\mathbb{E}\left[\frac{\partial}{\partial x}\ln p(y;x)\right]=\int\frac{\partial}{\partial x}p(y;x)\,dy = \frac{\partial}{\partial x}\int_a^b p(y;x)\,dy = 0$$
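
As a quick numerical illustration of this zero-mean score property (a minimal sketch of my own, using the model $y\sim\mathcal{N}(x,1)$, whose score is $y-x$):

```python
import numpy as np

# For y ~ N(x, 1) the score is d/dx ln p(y; x) = y - x.
rng = np.random.default_rng(0)
x = 1.3
y = rng.normal(loc=x, scale=1.0, size=200_000)

print((y - x).mean())  # close to 0: E[d/dx ln p(y;x)] = 0
```
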
  • An exponential family can arise in several ways

    • Many distributions can themselves be written in exponential-family form

      • Bernoulli distribution: $y\sim \mathcal{B}(x)$ (a numerical check of this decomposition follows the Gaussian example below)

      $$p(y;x)=x^y (1-x)^{1-y} \\ \ln p(y;x)=\left(\ln\frac{x}{1-x}\right)y-\left(-\ln(1-x)\right)$$

      • Gaussian: $y=[y_1,y_2]^T$ with $y_i\sim \mathcal{N}(x,1)$ i.i.d.

      $$p(y;x)=\frac{1}{2\pi}\exp\left((y_1+y_2)x-x^2-\frac{y_1^2+y_2^2}{2}\right)$$
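
As noted above, here is a quick numerical check of the Bernoulli decomposition (a minimal sketch; the helper names `p_direct` and `p_expfam` are mine): the density computed directly and the one assembled from $\lambda(x)$, $t(y)$, $\alpha(x)$, $\beta(y)$ coincide.

```python
import numpy as np

def p_direct(y, x):
    # Bernoulli pmf written directly
    return x**y * (1 - x)**(1 - y)

def p_expfam(y, x):
    lam = np.log(x / (1 - x))   # natural parameter lambda(x)
    t = y                       # natural statistic t(y) = y
    alpha = -np.log(1 - x)      # log-partition alpha(x)
    beta = 0.0                  # beta(y) = 0 here
    return np.exp(lam * t - alpha + beta)

for x in (0.2, 0.5, 0.9):
    for y in (0, 1):
        assert np.isclose(p_direct(y, x), p_expfam(y, x))
```
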

    • Geometric mean of several distributions (shown here for two)
      $$p(y;x)=\frac{p_1^x(y)\,p_2^{1-x}(y)}{Z(x)} \\ \ln p(y;x)=x\ln\frac{p_1(y)}{p_2(y)}-\ln Z(x)+\ln p_2(y)$$

      • For example, $p_1(y)\sim \mathcal{B}\left(\frac{1}{1+e^{-1}}\right)$, $p_2(y)\sim \mathcal{B}(1/2)$:
        $$p(y;x)\propto\left(\frac{1}{1+e^{-1}}\right)^{xy}\left(\frac{e^{-1}}{1+e^{-1}}\right)^{x(1-y)}(1/2)^{1-x}\ \Rightarrow\ p(y;x)\sim \mathcal{B}\left(\frac{1}{1+e^{-x}}\right),\quad \frac{p(y=1;x)}{p(y=0;x)}=e^x$$
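
A small numerical check of this example (a minimal sketch; the function name is mine): normalizing the geometric mean of $\mathcal{B}(\frac{1}{1+e^{-1}})$ and $\mathcal{B}(1/2)$ indeed gives $\mathcal{B}(\frac{1}{1+e^{-x}})$, with odds ratio $e^x$.

```python
import numpy as np

def geometric_mixture(p1, p2, x):
    # Unnormalized p1(y)^x * p2(y)^(1-x), then normalize over y in {0, 1}
    unnorm = np.array([p1[y]**x * p2[y]**(1 - x) for y in (0, 1)])
    return unnorm / unnorm.sum()

p1 = {1: 1 / (1 + np.exp(-1)), 0: np.exp(-1) / (1 + np.exp(-1))}  # B(1/(1+e^{-1}))
p2 = {1: 0.5, 0: 0.5}                                             # B(1/2)

for x in (-2.0, 0.3, 1.7):
    p = geometric_mixture(p1, p2, x)
    assert np.isclose(p[1], 1 / (1 + np.exp(-x)))  # matches B(1/(1+e^{-x}))
    assert np.isclose(p[1] / p[0], np.exp(x))      # odds ratio e^x
```
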
    • Tilting
      $$p(y;x)=\frac{p(y)\,e^{xy}}{Z(x)} \\ \ln p(y;x)=xy-\ln Z(x)+\ln p(y)$$

      • For example, if $p(y)\sim \mathcal{N}(0,1)$, then $p(y;x)\sim \mathcal{N}(x,1)$
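
The Gaussian tilting example can be verified on a grid (a minimal sketch of my own; the grid bounds and step are arbitrary choices): multiplying the $\mathcal{N}(0,1)$ density by $e^{xy}$ and renormalizing reproduces the $\mathcal{N}(x,1)$ density.

```python
import numpy as np

def gauss_pdf(y, mean):
    return np.exp(-(y - mean)**2 / 2) / np.sqrt(2 * np.pi)

x = 0.8
y = np.linspace(-10, 10, 20001)
dy = y[1] - y[0]

tilted = gauss_pdf(y, 0.0) * np.exp(x * y)  # p(y) * e^{xy}
tilted /= tilted.sum() * dy                 # divide by a numerical Z(x)

assert np.allclose(tilted, gauss_pdf(y, x), atol=1e-6)
```
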
  • linear exponential family

    • Definition: $\lambda(x)=x$, i.e. $\ln p(y;x)=x\,t(y)-\alpha(x)+\beta(y)$
    • Properties: $\dot{\alpha}(x)=\mathbb{E}[t(y)],\quad \ddot{\alpha}(x)=\mathbb{E}[t^2(y)]-\mathbb{E}[t(y)]^2=\mathrm{Var}(t(y))=J_y(x)$

    Proof:
    $$\begin{aligned} Z(x) &= e^{\alpha(x)} = \int e^{x\,t(y)+\beta(y)}\,dy \\ \dot{\alpha}(x)\,e^{\alpha(x)} &= \int t(y)\,e^{x\,t(y)+\beta(y)}\,dy \ \Rightarrow\ \dot{\alpha}(x)=\int t(y)\,p(y;x)\,dy=\mathbb{E}[t(y)] \end{aligned}$$

    Differentiating once more,
    $$\ddot{\alpha}(x)=\int t(y)\,p(y;x)\,\big(t(y)-\dot{\alpha}(x)\big)\,dy=\mathrm{Var}(t(y)), \qquad J_y(x)=\mathbb{E}\left[-\frac{\partial^2}{\partial x^2}\ln p(y;x)\right]=\ddot{\alpha}(x)$$
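
These identities are easy to check numerically for a concrete linear family (a minimal sketch of my own, using the Bernoulli in its natural parameterization, where $\alpha(x)=\ln(1+e^x)$, $\mathbb{E}[t(y)]=\sigma(x)$ and $\mathrm{Var}(t(y))=\sigma(x)(1-\sigma(x))$ with $\sigma$ the sigmoid):

```python
import numpy as np

def alpha(x):
    # Log-partition of the Bernoulli in natural form: p(y; x) = exp(x*y - alpha(x))
    return np.log(1 + np.exp(x))

x, h = 0.7, 1e-4
mean = 1 / (1 + np.exp(-x))   # E[t(y)] = sigmoid(x)
var = mean * (1 - mean)       # Var(t(y)) = J_y(x)

alpha_dot = (alpha(x + h) - alpha(x - h)) / (2 * h)
alpha_ddot = (alpha(x + h) - 2 * alpha(x) + alpha(x - h)) / h**2

assert np.isclose(alpha_dot, mean, atol=1e-6)
assert np.isclose(alpha_ddot, var, atol=1e-5)
```
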

  • Exponential families and efficient estimators

    • Necessary condition: if an efficient estimator exists, then $p(y;x)$ can be written in exponential-family form, with
      $$\lambda(x)=\int^x J_y(u)\,du, \qquad \alpha(x)=\int^x u\,J_y(u)\,du$$

    Proof:
    $$\begin{aligned} \hat{x}_{eff}(y) &= x+\frac{1}{J_y(x)}\frac{\partial}{\partial x}\ln p(y;x) \quad\text{(CRB attained with equality)} \\ \frac{\partial}{\partial x}\ln p(y;x) &= J_y(x)\,\hat{x}_{eff}(y)-x\,J_y(x) \\ \ln p(y;x) &= \hat{x}_{eff}(y)\int^x J_y(u)\,du-\int^x u\,J_y(u)\,du+\beta(y) \end{aligned}$$
    which is of exponential-family form with $t(y)=\hat{x}_{eff}(y)$, $\lambda(x)=\int^x J_y(u)\,du$, $\alpha(x)=\int^x u\,J_y(u)\,du$.

    • Sufficient condition: for a linear exponential family, if $J_y(x)$ does not depend on $x$, i.e. $J_y(x)$ equals a constant, then an efficient estimator exists

    Proof: let $J_y(x)=J$. Then
    $$\ddot{\alpha}(x)=J, \qquad \dot{\alpha}(x)=Jx-c \\ \hat{x}_{eff}(y)=x+\frac{1}{J}\frac{\partial}{\partial x}\ln p(y;x)=x+\frac{1}{J}\big(t(y)-\dot{\alpha}(x)\big)=x+\frac{1}{J}(t(y)-Jx+c)=\frac{t(y)}{J}+\frac{c}{J}$$
    which does not depend on $x$, so it is a valid estimator. Since
    $$\frac{\partial}{\partial x}\ln p(y;x)\Big|_{x=\hat{x}_{ML}}=0=t(y)-\dot{\alpha}(x)\Big|_{x=\hat{x}_{ML}}$$

    substituting $t(y)=\dot{\alpha}(\hat{x}_{ML})$ gives
    $$\hat{x}_{eff}(y)=c/J+\frac{1}{J}\dot{\alpha}(x)\Big|_{x=\hat{x}_{ML}}=\hat{x}_{ML}(y)$$
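
A concrete instance of the sufficient condition (a minimal sketch under my own choice of model: $y_1,\dots,y_N$ i.i.d. $\mathcal{N}(x,1)$, for which $J_y(x)=N$ is constant in $x$): the sample mean is both the efficient and the ML estimate, and its variance attains the CRB $1/J=1/N$.

```python
import numpy as np

rng = np.random.default_rng(1)
x_true, N, trials = 2.0, 25, 200_000

# y_i ~ N(x, 1) i.i.d.; t(y) = sum(y_i), J_y(x) = N (constant in x)
y = rng.normal(loc=x_true, scale=1.0, size=(trials, N))
x_hat = y.mean(axis=1)  # efficient estimator = ML estimate = sample mean

print(x_hat.mean())  # ~ x_true (unbiased)
print(x_hat.var())   # ~ 1/N = 0.04, the Cramer-Rao bound
```
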

2. Sufficient statistics

2.1 Non-Bayesian case

  • Definition: $t(y)$ is a sufficient statistic for the family $p_{\mathsf{y}}(\cdot;x)$ if $p(y\mid t(y);x)$ does not depend on $x$

Theorem 1 (likelihood characterization):

$t(y)$ is sufficient w.r.t. $p(y;x)$ $\iff$ $\dfrac{p_{y}(y;x)}{p_t(t(y);x)}$ does not depend on $x$, for all $x$ and $y$

Proof: omitted.

Theorem 2 (Neyman factorization theorem):

$t(y)$ is sufficient w.r.t. $p(y;x)$ $\iff$ there exist $a(\cdot,\cdot)$ and $b(\cdot)$ such that $p(y;x)=a\left(t(y),x\right)\cdot b(y)$

Proof: omitted.

  • Minimal sufficient statistic: $t^*$ is minimal if for every other sufficient statistic $t$ there exists a function $g(\cdot)$ such that $t^*=g(t)$
  • Complete: $t^*$ is complete if for every function $\phi(\cdot)$, $\mathbb{E}[\phi(t^*(y))]=0\ \ \forall x \iff \phi(\cdot)\equiv 0$

Theorem: complete $\Longrightarrow$ minimal

Proof: suppose $t$ is complete and $s$ is minimal; then there exists $g$ such that $s=g(t)$, and by the tower property $\mathbb{E}[t]=\mathbb{E}\left[\mathbb{E}[t\mid s]\right]$.

Since $s$ is sufficient, $\mathbb{E}[t\mid s]=f(s)=f(g(t))=\tilde{f}(t)$ does not depend on $x$.

Let $\phi(t)=t-\tilde{f}(t)$; then $\mathbb{E}[\phi(t)]=0$ for all $x$.

By the definition of completeness, $\phi(t)\equiv 0 \Longrightarrow t=\tilde{f}(t)=f(s)$.

Hence $t$ is a function of the minimal statistic $s$, which in turn is a function of every sufficient statistic, so $t$ is also minimal.
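
To see the sufficiency definition in action before moving on (a minimal sketch of my own, for $N$ i.i.d. Bernoulli($x$) observations with $t(y)=\sum_i y_i$): conditioned on $t(y)$, every sequence with that sum is equally likely regardless of $x$, so $p(y\mid t(y);x)$ does not depend on $x$.

```python
import numpy as np
from itertools import product

def p_seq(y, x):
    # i.i.d. Bernoulli(x) probability of a binary sequence y
    k = sum(y)
    return x**k * (1 - x)**(len(y) - k)

N = 4
for x in (0.2, 0.7):
    for t in range(N + 1):
        seqs = [y for y in product((0, 1), repeat=N) if sum(y) == t]
        probs = np.array([p_seq(y, x) for y in seqs])
        cond = probs / probs.sum()  # p(y | t(y) = t; x)
        assert np.allclose(cond, 1 / len(seqs))  # uniform, independent of x
```
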

2.2 Bayesian case

  • Definition: $t(y)$ is a sufficient statistic for the joint distribution $p_{\mathsf{y,x}}(\cdot,\cdot)$ if $p_{\mathsf{y|t,x}}(y\mid t(y),x)=p_{\mathsf{y|t}}(y\mid t(y))$, i.e. it does not depend on $x$

Theorem (belief characterization):

$t(y)$ is sufficient w.r.t. $p(y,x)$ $\iff$ $p(x\mid y)=p(x\mid t(y))$, for all $x$ and $y$

Proof: omitted.

Theorem (Neyman factorization theorem):

$t(y)$ is sufficient w.r.t. $p(y,x)$ $\iff$ $p(y\mid x)=p(t(y)\mid x)\cdot p(y\mid t(y))$, for all $x$ and $y$

Proof: omitted.
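
A small numerical illustration of the belief characterization (a minimal sketch; the two-point prior on $x$ and the Bernoulli model are my own choices): the posterior over $x$ computed from the full sequence $y$ equals the posterior computed from $t(y)=\sum_i y_i$ alone.

```python
import numpy as np
from math import comb

xs = np.array([0.3, 0.8])    # two-point prior support for x
prior = np.array([0.5, 0.5])
y = (1, 0, 1, 1, 0)          # observed sequence; t(y) = 3, N = 5
k, N = sum(y), len(y)

# Posterior from the full sequence y
like_y = xs**k * (1 - xs)**(N - k)
post_y = prior * like_y / np.sum(prior * like_y)

# Posterior from t(y) alone (Binomial likelihood)
like_t = comb(N, k) * xs**k * (1 - xs)**(N - k)
post_t = prior * like_t / np.sum(prior * like_t)

assert np.allclose(post_y, post_t)  # p(x | y) = p(x | t(y))
```
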

3. Conjugate priors

  • Idea: given a model $p_\mathsf{y|x}$, look for a family of priors $p_\mathsf{x}$ such that the induced posterior $p_\mathsf{x|y}$ also lies in that family
  • Definition: a family of distributions $q(\cdot;\theta)$ is conjugate to a model $p_{y|x}$ if
    • $p_{y|x}(y_1,\dots,y_N\mid x) \propto q(x;\theta)$, viewed as a function of $x$, for some $\theta$
    • $q(x;\theta_1)\,q(x;\theta_2)\propto q(x;\theta_3)$ for some $\theta_3$
  • Theorem: if for sample size $N$ the joint model $p^N_{y|x}(\cdot)$ admits a sufficient statistic whose dimension does not depend on $N$, then a conjugate prior exists for this model
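
For instance (a minimal sketch of the standard Beta–Bernoulli case; the grid check and parameter values are my own), the Beta family is conjugate to the Bernoulli model: the likelihood of $N$ observations is proportional to a Beta density in $x$, so the posterior stays a Beta with updated parameters.

```python
import numpy as np
from scipy.stats import beta

# Prior x ~ Beta(a, b); observations y_i ~ Bernoulli(x) i.i.d.
a, b = 2.0, 3.0
y = np.array([1, 1, 0, 1, 0, 1, 1])
k, N = int(y.sum()), y.size

# Conjugate update: posterior is Beta(a + k, b + N - k)
a_post, b_post = a + k, b + N - k

# Compare with a direct grid computation of prior * likelihood, normalized
x = np.linspace(1e-6, 1 - 1e-6, 100_001)
dx = x[1] - x[0]
unnorm = beta.pdf(x, a, b) * x**k * (1 - x)**(N - k)
unnorm /= unnorm.sum() * dx

assert np.allclose(unnorm, beta.pdf(x, a_post, b_post), atol=1e-3)
```
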

For the other parts of this series, see:
Statistical Inference (1): Hypothesis Test
Statistical Inference (2): Estimation Problem
Statistical Inference (3): Exponential Family
Statistical Inference (4): Information Geometry
Statistical Inference (5): EM algorithm
Statistical Inference (6): Modeling
Statistical Inference (7): Typical Sequence
Statistical Inference (8): Model Selection
Statistical Inference (9): Graphical models
Statistical Inference (10): Elimination algorithm
Statistical Inference (11): Sum-product algorithm



Reposted from blog.csdn.net/weixin_41024483/article/details/104165233