Statistical Inference (7): Typical Sequence

1. Some theorems

Markov inequality: for a random variable $\mathsf{x}\ge0$ and any $\mu>0$,
$$
\mathbb{P}(\mathsf{x}\ge\mu)\le \frac{\mathbb{E}[\mathsf{x}]}{\mu}
$$
Proof: omitted…
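As a quick sanity check of the inequality, the sketch below evaluates both sides exactly for a small discrete distribution. The pmf and the helper name `markov_check` are made up for illustration and do not come from the original notes.

```python
from fractions import Fraction

def markov_check(pmf, mu):
    """Return (P(x >= mu), E[x] / mu) for a nonnegative discrete r.v. given as {value: prob}."""
    tail = sum(p for x, p in pmf.items() if x >= mu)
    mean = sum(x * p for x, p in pmf.items())
    return tail, mean / mu

# Hypothetical pmf on {0, 1, 2, 10}, chosen only for illustration.
pmf = {0: Fraction(1, 2), 1: Fraction(1, 4), 2: Fraction(1, 5), 10: Fraction(1, 20)}

for mu in (1, 2, 5, 10):
    tail, bound = markov_check(pmf, mu)
    print(f"mu={mu}: P(x>=mu)={float(tail):.3f} <= E[x]/mu={float(bound):.3f}")
```

Exact rational arithmetic is used so that the comparison is not blurred by floating-point rounding.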

Weak law of large numbers (WLLN): let $\vec{y}=[y_1,y_2,...,y_N]^T$ with $y_i \sim p$ i.i.d. Then
$$
\lim_{N\to\infty}\mathbb{P}(|L_p(\vec{y})+H(p)|>\varepsilon)=0, \ \ \forall \varepsilon>0
$$
where $L_p(\vec{y})=\frac{1}{N}\log p_{\mathsf{y}}(\vec{y})=\frac{1}{N}\sum_{n=1}^{N}\log p(y_n)$ is the normalized log-likelihood.
Proof: omitted…

2. Typical set

  • Definition: $\mathcal{T}_\varepsilon(p;N)=\{\vec{y}\in\mathcal{Y}^N:|L_p(\vec{y})+H(p)|<\varepsilon\}$

  • Properties

    • WLLN $\Longrightarrow P\left(\vec{y}\in\mathcal{T}_\varepsilon(p;N)\right)\simeq1$ for large $N$
    • $L_p(\vec{y})\simeq -H(p) \Longrightarrow p_y(\vec{y})\simeq 2^{-NH(p)}$
    • $\Longrightarrow |\mathcal{T}_\varepsilon(p;N)|\simeq 2^{NH(p)}$
    • When $p$ is not uniform, $\frac{|\mathcal{T}_\varepsilon(p;N)|}{|\mathcal{Y}^N|}\to0$: the typical sequences make up a vanishing fraction of all possible sequences, yet their total probability tends to 1.
  • Theorem

    Asymptotic Equipartition Property (AEP)

    $$
    \lim_{N\to\infty}P(\mathcal{T}_\varepsilon(p;N))=1
    $$

    $$
    2^{-N(H(p)+\epsilon)} \leq p_{\mathrm{y}}(\boldsymbol{y}) \leq 2^{-N(H(p)-\epsilon)}, \quad \forall \boldsymbol{y} \in \mathcal{T}_{\epsilon}(p ; N)
    $$

    • for sufficiently large $N$

    $$
    (1-\epsilon) 2^{N(H(p)-\epsilon)} \leq\left|\mathcal{T}_{\epsilon}(p ; N)\right| \leq 2^{N(H(p)+\epsilon)}
    $$

    Proof:
    $$
    \begin{aligned}\left|\mathcal{T}_{\epsilon}(p ; N)\right| &=\sum_{\boldsymbol{y} \in \mathcal{T}_{\epsilon}(p ; N)} 1 \\ &=2^{N(H(p)+\epsilon)} \sum_{\boldsymbol{y} \in \mathcal{T}_{\epsilon}(p ; N)} 2^{-N(H(p)+\epsilon)} \\ & \leq 2^{N(H(p)+\epsilon)} \sum_{\boldsymbol{y} \in \mathcal{T}_{\epsilon}(p ; N)} p_{\mathbf{y}}(\boldsymbol{y}) \\ &=2^{N(H(p)+\epsilon)} P\left\{\mathcal{T}_{\epsilon}(p ; N)\right\} \\ & \leq 2^{N(H(p)+\epsilon)} \end{aligned}
    $$
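The proof above covers only the upper bound. The lower bound can be sketched in the same way, assuming $N$ is large enough that $P\{\mathcal{T}_{\epsilon}(p;N)\}\ge 1-\epsilon$ (which the first AEP statement guarantees) and using the upper bound on $p_{\mathbf{y}}(\boldsymbol{y})$ inside the typical set:

$$
\begin{aligned}
1-\epsilon &\leq P\left\{\mathcal{T}_{\epsilon}(p ; N)\right\}=\sum_{\boldsymbol{y} \in \mathcal{T}_{\epsilon}(p ; N)} p_{\mathbf{y}}(\boldsymbol{y}) \\
&\leq \sum_{\boldsymbol{y} \in \mathcal{T}_{\epsilon}(p ; N)} 2^{-N(H(p)-\epsilon)} \\
&=\left|\mathcal{T}_{\epsilon}(p ; N)\right| 2^{-N(H(p)-\epsilon)}
\end{aligned}
$$

Rearranging gives $\left|\mathcal{T}_{\epsilon}(p ; N)\right| \geq (1-\epsilon) 2^{N(H(p)-\epsilon)}$.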
    (Figure: typical set)
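The properties of the typical set can be checked exactly for a binary source. The sketch below assumes $p=\text{Bernoulli}(\theta)$, so every sequence with $k$ ones has the same $L_p$ and the typical set can be counted with binomial coefficients. The function name and the parameter values are illustrative choices, not part of the original notes.

```python
from math import comb, log2

def typical_set_stats(theta, N, eps):
    """Exact entropy, size and probability of T_eps(p; N) for p = Bernoulli(theta)."""
    H = -theta * log2(theta) - (1 - theta) * log2(1 - theta)
    size, prob = 0, 0.0
    for k in range(N + 1):  # k = number of ones; all such sequences share L_p
        L = (k * log2(theta) + (N - k) * log2(1 - theta)) / N
        if abs(L + H) < eps:
            size += comb(N, k)
            prob += comb(N, k) * theta**k * (1 - theta) ** (N - k)
    return H, size, prob

theta, eps = 0.2, 0.1  # illustrative values
for N in (50, 200, 800):
    H, size, prob = typical_set_stats(theta, N, eps)
    print(f"N={N}: P(T)={prob:.4f}, |T|/2^N={size / 2**N:.3e}, "
          f"log2|T|/N={log2(size) / N:.3f} (H={H:.3f})")
```

As $N$ grows, the printed probability should approach 1 while the fraction $|\mathcal{T}|/2^N$ shrinks, and $\frac{1}{N}\log_2|\mathcal{T}|$ stays within $\epsilon$ of $H(p)$.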

3. Divergence $\varepsilon$-typical set

  • WLLN: let $\vec{y}=[y_1,y_2,...,y_N]^T$ with $y_i \sim p$ i.i.d.
    $$
    L_{p | q}(\boldsymbol{y})=\frac{1}{N} \log \frac{p_{\mathbf{y}}(\boldsymbol{y})}{q_{\mathbf{y}}(\boldsymbol{y})}=\frac{1}{N} \sum_{n=1}^{N} \log \frac{p\left(y_{n}\right)}{q\left(y_{n}\right)}
    $$
    $$
    \lim_{N \rightarrow \infty} \mathbb{P}\left(\left|L_{p | q}(\boldsymbol{y})-D(p \| q)\right|>\epsilon\right)=0
    $$
    Remark: the previous section considered only the sample mean of $\log p(y_n)$; here a second distribution $q$ is taken into account as well.

  • Definition: let $\vec{\boldsymbol{y}}=[y_1,y_2,...,y_N]^T$ with $y_i \sim p$ i.i.d.
    $$
    \mathcal{T}_{\epsilon}(p | q ; N)=\left\{\boldsymbol{y} \in \mathcal{Y}^{N}:\left|L_{p | q}(\boldsymbol{y})-D(p \| q)\right| \leq \epsilon\right\}
    $$

  • Properties

    • WLLN $\Longrightarrow q_{\mathbf{y}}(\boldsymbol{y}) \approx p_{\mathbf{y}}(\boldsymbol{y}) 2^{-N D(p \| q)}$
    • $Q\left\{\mathcal{T}_{\epsilon}(p | q ; N)\right\} \approx 2^{-N D(p \| q)} \to0$
    • Remark: a sequence that is typical for $p$ may be atypical for $q$; when $N$ is large, the typical sets of different distributions are essentially disjoint ("orthogonal").
  • Theorem
    $$
    (1-\epsilon) 2^{-N(D(p \| q)+\epsilon)} \leq Q\left\{\mathcal{T}_{\epsilon}(p | q ; N)\right\} \leq 2^{-N(D(p \| q)-\epsilon)}
    $$
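The theorem can be checked exactly in the binary case. The sketch below assumes $p=\text{Bernoulli}(t_p)$ and $q=\text{Bernoulli}(t_q)$, for which $L_{p|q}$ depends only on the number of ones, and computes $Q\{\mathcal{T}_\epsilon(p|q;N)\}$ by summing binomial terms. The function name and the numeric values are illustrative.

```python
from math import comb, log2

def q_prob_of_p_typical(tp, tq, N, eps):
    """Exact (D(p||q), Q{T}, P{T}) for T = T_eps(p|q; N), p = Bernoulli(tp), q = Bernoulli(tq)."""
    D = tp * log2(tp / tq) + (1 - tp) * log2((1 - tp) / (1 - tq))
    Q = P = 0.0
    for k in range(N + 1):  # k = number of ones in the sequence
        L = (k * log2(tp / tq) + (N - k) * log2((1 - tp) / (1 - tq))) / N
        if abs(L - D) <= eps:
            Q += comb(N, k) * tq**k * (1 - tq) ** (N - k)
            P += comb(N, k) * tp**k * (1 - tp) ** (N - k)
    return D, Q, P

eps = 0.1  # illustrative tolerance
for N in (50, 200, 400):
    D, Q, P = q_prob_of_p_typical(0.5, 0.2, N, eps)
    print(f"N={N}: P(T)={P:.4f}, -log2(Q(T))/N={-log2(Q) / N:.4f}, D(p||q)={D:.4f}")
```

The set carries almost all of the probability under $p$, while its probability under $q$ decays at an exponential rate within $\epsilon$ of $D(p\|q)$.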

4. Large deviation of sample averages

Theorem (Cramér's Theorem): let $\vec{\boldsymbol{y}}=[y_1,y_2,...,y_N]^T$ with $y_i \sim q$ i.i.d., with mean $\mu<\infty$, and let $\gamma>\mu$. Then
$$
\lim _{N \rightarrow \infty}-\frac{1}{N} \log \mathbb{P}\left(\frac{1}{N} \sum_{n=1}^{N} y_{n} \geq \gamma\right)=E_{C}(\gamma)
$$
where $E_C(\gamma)$ is referred to as the Chernoff exponent,
$$
E_{C}(\gamma) \triangleq D(p(\cdot ; x) \| q),\ \ \ p(y ; x)=q(y) e^{x y-\alpha(x)}
$$
with $x>0$ chosen such that
$$
\mathbb{E}_{p(\cdot;x)}[y]=\gamma
$$
Proof:

  1. By the Markov inequality, for any $x>0$,
     $$
     \begin{aligned} \mathbb{P}\left(\frac{1}{N} \sum_{n=1}^{N} y_{n} \geq \gamma\right) &=\mathbb{P}\left(e^{x \sum_{n=1}^{N} y_{n}} \geq e^{N x \gamma}\right) \\ & \leq e^{-N x \gamma} \mathbb{E}\left[e^{x \sum_{n=1}^{N} y_{n}}\right] \\ &=e^{-N x \gamma}\left(\mathbb{E}\left[e^{x y}\right]\right)^{N} \\ &=e^{-N\left(x \gamma-\alpha\left(x\right)\right)} \end{aligned}
     $$
     where $\alpha(x)=\log \mathbb{E}_q\left[e^{xy}\right]$. The bound is tightest at the $x_*$ that maximizes the exponent, which gives $\mathbb{P}\left(\frac{1}{N} \sum_{n=1}^{N} y_{n} \geq \gamma\right) \leq e^{-N\left(x_{*} \gamma-\alpha\left(x_{*}\right)\right)}$.
  2. $\varphi(x)=x\gamma-\alpha(x)$ is concave, and its maximum is attained at the $x_*$ satisfying $\mathbb{E}_{p\left(\cdot ; x_{*}\right)}[y]=\dot{\alpha}\left(x_{*}\right)=\gamma$.
  3. One can show that $x_{*} \gamma-\alpha\left(x_{*}\right)=x_{*} \dot{\alpha}\left(x_{*}\right)-\alpha\left(x_{*}\right)=D\left(p\left(\cdot ; x_{*}\right) \| q\right)$.
  4. Hence $\mathbb{P}\left(\frac{1}{N} \sum_{n=1}^{N} y_{n} \geq \gamma\right) \leq e^{-N E_{C}(\gamma)}$.
  5. Proof of the lower bound: omitted for now…
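The identity in step 3 can be worked out directly from the exponential-family form $p(y;x)=q(y)e^{xy-\alpha(x)}$, taking logarithms to base $e$ and using $\mathbb{E}_{p(\cdot;x)}[y]=\dot{\alpha}(x)$:

$$
\begin{aligned}
D\left(p\left(\cdot ; x_{*}\right) \| q\right) &=\mathbb{E}_{p\left(\cdot ; x_{*}\right)}\left[\log \frac{p\left(y ; x_{*}\right)}{q(y)}\right] \\
&=\mathbb{E}_{p\left(\cdot ; x_{*}\right)}\left[x_{*} y-\alpha\left(x_{*}\right)\right] \\
&=x_{*} \dot{\alpha}\left(x_{*}\right)-\alpha\left(x_{*}\right) \\
&=x_{*} \gamma-\alpha\left(x_{*}\right)
\end{aligned}
$$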

Two facts used above, with $p(y;x)=q(y)\exp(xy-\alpha(x))$:

  1. $D(p(\cdot;x)\|q)$ increases monotonically with $x$ (for $x>0$)
  2. $\mathbb{E}_{p(\cdot;x)}[y]$ increases monotonically with $x$

Remarks:

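  1. The theorem can equivalently be read as $\mathbb{P}\left(\frac{1}{N} \sum_{n=1}^{N} y_{n} \geq \gamma\right) \cong 2^{-N E_{\mathrm{C}}(\gamma)}$ (with $E_{\mathrm{C}}$ measured in bits; with natural logarithms the base is $e$, as in the proof).
  2. Geometrically, $q$ is projected onto the convex set of distributions defined by $\mathbb{E}[y]=\frac{1}{N}\sum_{n=1}^{N} y_{n} \geq \gamma$. The projection lands exactly on the boundary $\mathbb{E}[y]=\gamma$ (a linear family), and the projection of $q$ onto a linear family is precisely the exponential-family expression $p(y;x)=q(y)e^{xy-\alpha(x)}$ given above.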
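Cramér's theorem can be illustrated numerically for a Bernoulli source. The sketch below assumes $y_n\sim\text{Bernoulli}(q)$, solves $\dot\alpha(x_*)=\gamma$ by bisection, and compares the resulting exponent with the exact tail probability. All function names and parameter values are illustrative, and natural logarithms are used throughout.

```python
from math import comb, exp, log

def alpha(x, q):
    """Log-partition function alpha(x) = log E_q[e^{xy}] for y ~ Bernoulli(q)."""
    return log(1 - q + q * exp(x))

def tilted_mean(x, q):
    """E_{p(.;x)}[y] = d alpha / dx for the tilted Bernoulli distribution."""
    return q * exp(x) / (1 - q + q * exp(x))

def chernoff_exponent(q, gamma, iters=100):
    """Bisection for tilted_mean(x*) = gamma; returns (x*, x* gamma - alpha(x*))."""
    lo, hi = 0.0, 50.0
    for _ in range(iters):
        mid = (lo + hi) / 2
        if tilted_mean(mid, q) < gamma:
            lo = mid
        else:
            hi = mid
    x = (lo + hi) / 2
    return x, x * gamma - alpha(x, q)

def kl_bernoulli(a, b):
    """D(Bernoulli(a) || Bernoulli(b)) in nats."""
    return a * log(a / b) + (1 - a) * log((1 - a) / (1 - b))

def tail_prob(q, gamma, N):
    """Exact P((1/N) sum y_n >= gamma) for i.i.d. Bernoulli(q)."""
    return sum(comb(N, k) * q**k * (1 - q) ** (N - k)
               for k in range(N + 1) if k >= gamma * N)

q, gamma = 0.3, 0.5  # illustrative values with gamma > mean
x_star, E_C = chernoff_exponent(q, gamma)
print(f"x*={x_star:.4f}, E_C={E_C:.4f}, D(p(.;x*)||q)={kl_bernoulli(gamma, q):.4f}")
for N in (50, 100, 200, 400):
    print(f"N={N}: -log(P)/N={-log(tail_prob(q, gamma, N)) / N:.4f}")
```

The tilted distribution here is Bernoulli with mean $\gamma$, so the exponent should agree with $D(\text{Bernoulli}(\gamma)\|\text{Bernoulli}(q))$, and the empirical rate should approach it from above as $N$ grows.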

(Figure: Cramér's theorem)

5. Types and type classes

  • Definition: $\vec{\boldsymbol{y}}=[y_1,y_2,...,y_N]^T$ (regardless of which distribution actually generated it)

    • The type (essentially the empirical distribution) is defined as

    $$
    \hat{p}(b ; \mathbf{y})=\frac{1}{N} \sum_{n=1}^{N} \mathbb{1}_{b}\left(y_{n}\right)=\frac{N_{b}(\mathbf{y})}{N}
    $$

    • $\mathcal{P}_{N}^{y}$ denotes the set of all possible types of sequences of length $N$
    • type class: $\mathcal{T}_{N}^{y}(p)=\left\{\mathbf{y} \in \mathcal{Y}^{N}: \hat{p}(\cdot ; \mathbf{y}) \equiv p(\cdot)\right\},\ \ \ p\in\mathcal{P}_{N}^{y}$
  • Exponential Rate Notation: $f(N) \doteq 2^{N \alpha}$ means
    $$
    \lim _{N \rightarrow \infty} \frac{\log f(N)}{N}=\alpha
    $$
    Remark: $\alpha$ describes the order in $N$ of the exponent (log, linear, quadratic …).

  • Properties

    • $\left|\mathcal{P}_{N}^{y}\right| \leq(N+1)^{|\mathcal{Y}|}$
    • $q^{N}(\mathbf{y})=2^{-N(D(\hat{p}(\cdot ; \mathbf{y}) \| q)+H(\hat{p}(\cdot ; \mathbf{y})))}$
      $p^{N}(\mathbf{y})=2^{-N H(p)} \quad \text { for } \mathbf{y} \in \mathcal{T}_{N}^{y}(p)$
    • $c N^{-|\mathcal{Y}|} 2^{N H(p)} \leq\left|\mathcal{T}_{N}^{y}(p)\right| \leq 2^{N H(p)}$
  • Theorem
    $$
    c N^{-|\mathcal{Y}|} 2^{-N D(p \| q)} \leq Q\left\{\mathcal{T}_{N}^{y}(p)\right\} \leq 2^{-N D(p \| q)}
    $$
    $$
    Q\left\{\mathcal{T}_{N}^{y}(p)\right\} \doteq 2^{-N D(p \| q)}
    $$
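For a small alphabet and short sequences, types and type classes can be enumerated by brute force. The sketch below assumes a three-letter alphabet and a made-up distribution `q`; it groups all sequences by type, then checks the count of types, the identity for $q^N(\mathbf{y})$, and the upper bound on the type-class size. The helper names are illustrative.

```python
from collections import Counter
from itertools import product
from math import log2

def type_of(y, alphabet):
    """Empirical distribution (type) of the sequence y, as a tuple ordered like alphabet."""
    counts = Counter(y)
    return tuple(counts[b] / len(y) for b in alphabet)

def entropy(p):
    """H(p) in bits, with 0 log 0 = 0."""
    return -sum(pb * log2(pb) for pb in p if pb > 0)

def kl(p, q):
    """D(p || q) in bits, with 0 log 0 = 0."""
    return sum(pb * log2(pb / qb) for pb, qb in zip(p, q) if pb > 0)

alphabet = "abc"
q = (0.5, 0.3, 0.2)  # hypothetical generating distribution
N = 6

# Group all |Y|^N sequences by type: each group is one type class.
classes = Counter(type_of(y, alphabet) for y in product(alphabet, repeat=N))
print(len(classes), "types; bound (N+1)^|Y| =", (N + 1) ** len(alphabet))

# q^N(y) depends on y only through its type.
y = "aabacb"
p_hat = type_of(y, alphabet)
prob = 1.0
for s in y:
    prob *= q[alphabet.index(s)]
print(prob, 2 ** (-N * (kl(p_hat, q) + entropy(p_hat))))

# Size of the largest type class versus the bound 2^{N H(p)}.
p_big, size_big = classes.most_common(1)[0]
print(size_big, "<=", 2 ** (N * entropy(p_big)))
```

Brute-force enumeration only works for tiny $N$, but it makes the polynomial number of types and the exponential size of each class easy to see.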

6. Large Deviation Analysis via Types

  • Definition: $\mathcal{R}=\left\{\mathbf{y} \in \mathcal{Y}^{N}: \hat{p}(\cdot ; \mathbf{y}) \in \mathcal{S} \cap \mathcal{P}_{N}^{y}\right\}$

Sanov's Theorem:
$$
\begin{gathered}
Q\left\{\mathcal{S} \cap \mathcal{P}_{N}^{y}\right\} \leq(N+1)^{|\mathcal{Y}|} 2^{-N D\left(p_{*} \| q\right)} \\
Q\left\{\mathcal{S} \cap \mathcal{P}_{N}^{y}\right\} \mathrel{\dot{\leq}} 2^{-N D\left(p_{*} \| q\right)} \\
p_{*}=\underset{p \in \mathcal{S}}{\arg \min } D(p \| q)
\end{gathered}
$$
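A binary example makes the bound concrete. The sketch below assumes the alphabet $\{0,1\}$, $q=\text{Bernoulli}(q_1)$ and $\mathcal{S}=\{p: p(1)\ge\gamma\}$ with $\gamma>q_1$, in which case the minimizer is $p_*=\text{Bernoulli}(\gamma)$ because $D(\text{Bernoulli}(a)\|\text{Bernoulli}(q_1))$ increases in $a$ for $a>q_1$. The names and values are illustrative.

```python
from math import comb, log2

def kl_bin(a, b):
    """D(Bernoulli(a) || Bernoulli(b)) in bits, with 0 log 0 = 0."""
    total = 0.0
    if a > 0:
        total += a * log2(a / b)
    if a < 1:
        total += (1 - a) * log2((1 - a) / (1 - b))
    return total

def sanov_check(q1, gamma, N):
    """Exact Q{types with p(1) >= gamma} and the bound (N+1)^|Y| 2^{-N D(p*||q)}, |Y| = 2."""
    prob = sum(comb(N, k) * q1**k * (1 - q1) ** (N - k)
               for k in range(N + 1) if k / N >= gamma)
    bound = (N + 1) ** 2 * 2 ** (-N * kl_bin(gamma, q1))
    return prob, bound

q1, gamma = 0.3, 0.6  # illustrative values with gamma > q1
for N in (20, 100, 400):
    prob, bound = sanov_check(q1, gamma, N)
    print(f"N={N}: Q={prob:.3e} <= bound={bound:.3e}, "
          f"-log2(Q)/N={-log2(prob) / N:.4f}, D(p*||q)={kl_bin(gamma, q1):.4f}")
```

The polynomial factor $(N+1)^{|\mathcal{Y}|}$ makes the bound loose for small $N$, but it does not affect the exponential rate.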

7. Asymptotics of hypothesis testing

  • LRT: $L(\boldsymbol{y})=\frac{1}{N} \log \frac{p_{1}^{N}(\boldsymbol{y})}{p_{0}^{N}(\boldsymbol{y})}=\frac{1}{N} \sum_{n=1}^{N} \log \frac{p_{1}\left(y_{n}\right)}{p_{0}\left(y_{n}\right)} \gtrless \gamma$
  • $P_{F}=\mathbb{P}_{0}\left\{\frac{1}{N} \sum_{n=1}^{N} t_{n} \geq \gamma\right\} \approx 2^{-N D\left(p^{*} \| p_{0}^{\prime}\right)}$, where $t_n=\log \frac{p_{1}\left(y_{n}\right)}{p_{0}\left(y_{n}\right)}$
  • $P_{M}=1-P_{D} \approx 2^{-N D\left(p^{*} \| p_{1}^{\prime}\right)}$
  • $D\left(p^{*} \| p_{0}^{\prime}\right)-D\left(p^{*} \| p_{1}^{\prime}\right)=\int p^{*}(t) \log \frac{p_{1}^{\prime}(t)}{p_{0}^{\prime}(t)} \mathrm{d} t=\int p^{*}(t) t \mathrm{d} t=\mathbb{E}_{p^{*}}[\mathrm{t}]=\gamma$
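The trade-off between the two error exponents can be tabulated for Bernoulli hypotheses. The sketch below assumes $p_0$ and $p_1$ are Bernoulli with different parameters, so $t$ is in one-to-one correspondence with $y$ and the divergences can be computed on $y$ directly; $p^*$ is then the Bernoulli distribution whose mean of $t$ equals $\gamma$. The function names and values are illustrative, and the exponent interpretation assumes $-D(p_0\|p_1)<\gamma<D(p_1\|p_0)$.

```python
from math import log2

def kl_bin(a, b):
    """D(Bernoulli(a) || Bernoulli(b)) in bits, for 0 < a, b < 1."""
    return a * log2(a / b) + (1 - a) * log2((1 - a) / (1 - b))

def lrt_exponents(p0, p1, gamma):
    """Return (D(p*||p0), D(p*||p1)) where p* = Bernoulli(r) satisfies E_{p*}[t] = gamma."""
    t1 = log2(p1 / p0)              # value of t when y = 1
    t0 = log2((1 - p1) / (1 - p0))  # value of t when y = 0
    r = (gamma - t0) / (t1 - t0)    # solves r * t1 + (1 - r) * t0 = gamma
    if not 0 < r < 1:
        raise ValueError("gamma must lie strictly between the two values of t")
    return kl_bin(r, p0), kl_bin(r, p1)

p0, p1 = 0.3, 0.6  # hypothetical Bernoulli parameters under the two hypotheses
for gamma in (-0.1, 0.0, 0.1):
    d0, d1 = lrt_exponents(p0, p1, gamma)
    print(f"gamma={gamma:+.1f}: D(p*||p0)={d0:.4f}, D(p*||p1)={d1:.4f}, "
          f"difference={d0 - d1:+.4f}")
```

Raising $\gamma$ increases the false-alarm exponent and lowers the miss exponent, with the difference always equal to $\gamma$.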

(Figure: asymptotics of hypothesis testing)

8. Asymptotics of parameter estimation

Strong Law of Large Numbers (SLLN):
$$
\mathbb{P}\left(\lim _{N \rightarrow \infty} \frac{1}{N} \sum_{n=1}^{N} w_{n}=\mu\right)=1
$$
Central Limit Theorem (CLT):
$$
\lim _{N \rightarrow \infty} \mathbb{P}\left(\frac{1}{\sqrt{N}} \sum_{n=1}^{N}\left(\frac{w_{n}-\mu}{\sigma}\right) \leq b\right)=\Phi(b)
$$
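The following three modes of convergence are listed in order of decreasing strength:

  1. Convergence with probability 1 (SLLN): $\mathsf{x}_{N} \stackrel{w . p .1}{\longrightarrow} a$
  2. Convergence in probability (WLLN): $\lim_{N\to\infty}\mathbb{P}\left(|\mathsf{x}_{N}-a|>\varepsilon\right)=0, \ \forall \varepsilon>0$
  3. Convergence in distribution: $\mathsf{x}_{N} \stackrel{d}{\longrightarrow} p$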
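The CLT statement can be checked without simulation for Bernoulli variables, because the distribution of the sum is binomial. The sketch below assumes $w_n\sim\text{Bernoulli}(\theta)$ and compares the exact probability with $\Phi(b)$; the function names and parameter values are illustrative.

```python
from math import comb, erf, sqrt

def Phi(b):
    """Standard normal CDF."""
    return 0.5 * (1 + erf(b / sqrt(2)))

def normalized_sum_cdf(theta, N, b):
    """Exact P((1/sqrt(N)) sum (w_n - mu)/sigma <= b) for w_n ~ Bernoulli(theta)."""
    mu, sigma = theta, sqrt(theta * (1 - theta))
    return sum(comb(N, k) * theta**k * (1 - theta) ** (N - k)
               for k in range(N + 1)
               if (k - N * mu) / (sigma * sqrt(N)) <= b)

theta, b = 0.3, 1.0  # illustrative values
for N in (10, 100, 1000):
    print(f"N={N}: exact={normalized_sum_cdf(theta, N, b):.4f}, Phi(b)={Phi(b):.4f}")
```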
  • Asymptotics of ML Estimation

    Theorem:
    $$
    \hat{x}_{N}=\arg \max _{x} L_{N}(x ; \mathbf{y})
    $$
    Under certain mild conditions,
    $$
    \begin{array}{c}{\hat{x}_{N} \stackrel{w . p .1}{\longrightarrow} x_{0}} \\ {\sqrt{N}\left(\hat{x}_{N}-x_{0}\right) \stackrel{d}{\longrightarrow} \mathcal{N}\left(0, J_{y}\left(x_{0}\right)^{-1}\right)}\end{array}
    $$
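A small simulation can illustrate the theorem in the simplest case. The sketch below assumes a Bernoulli model with true parameter $x_0$, for which the ML estimate is the sample mean and the Fisher information is $J(x_0)=1/(x_0(1-x_0))$. The function name, seed and sample sizes are illustrative choices.

```python
import random
from statistics import mean, pvariance

def ml_estimates(x0, N, trials, seed=0):
    """Simulate the Bernoulli ML estimate (the sample mean of N draws) `trials` times."""
    rng = random.Random(seed)
    return [sum(rng.random() < x0 for _ in range(N)) / N for _ in range(trials)]

x0, N, trials = 0.3, 200, 2000  # illustrative values
est = ml_estimates(x0, N, trials)
scaled = [N**0.5 * (e - x0) for e in est]  # sqrt(N) * (x_hat - x0)
inv_fisher = x0 * (1 - x0)                 # 1 / J(x0) for the Bernoulli model

print(f"mean of estimates: {mean(est):.4f} (x0 = {x0})")
print(f"variance of sqrt(N)(x_hat - x0): {pvariance(scaled):.4f} (1/J = {inv_fisher:.4f})")
```

The average of the estimates should sit close to $x_0$, and the variance of the scaled error should be close to $J(x_0)^{-1}$, in line with the asymptotic normality statement.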


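For the other posts in this series, see:
Statistical Inference (1): Hypothesis Test
Statistical Inference (2): Estimation Problem
Statistical Inference (3): Exponential Family
Statistical Inference (4): Information Geometry
Statistical Inference (5): EM algorithm
Statistical Inference (6): Modeling
Statistical Inference (7): Typical Sequence
Statistical Inference (8): Model Selection
Statistical Inference (9): Graphical models
Statistical Inference (10): Elimination algorithm
Statistical Inference (11): Sum-product algorithm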

Reposted from blog.csdn.net/weixin_41024483/article/details/104165242