Sparse Bayesian Learning 3


FSBL: The Fast Sparse Bayesian Learning Algorithm

Previously we derived
$$\boxed{ \begin{aligned} \mathcal L( \mathbf t \vert \sigma^2, \boldsymbol \alpha) = -\frac{1}{2} \log\Big\vert \sigma^2 \mathbf{I}+\mathbf{\Phi}\mathbf{\Lambda}_{\alpha}\mathbf{\Phi}^{\rm T} \Big\vert -\frac{1}{2} \mathbf{t}^{\rm T} \Big( \sigma^2 \mathbf{I}+\mathbf{\Phi}\mathbf{\Lambda}_{\alpha}\mathbf{\Phi}^{\rm T} \Big)^{-1} \mathbf{t} \end{aligned} }$$

$$\boxed{ \begin{aligned} \mathbf \Sigma_w^{-1} &= \frac{1}{\sigma^2}\mathbf{\Phi}^{\rm T}\mathbf{\Phi}+\mathbf{\Lambda}_{\alpha}^{-1} \\ \bm\mu_w &= \sigma^{-2} \mathbf \Sigma_w \mathbf{\Phi}^{\rm T}\mathbf t \end{aligned} }$$

The sparse Bayesian learning updates proceed as follows:
$$\begin{aligned} \alpha_i^{\text{new}} &= \frac{\gamma_i}{\mu_i^2} \\ (\sigma^2)^{\text{new}} &= \frac{\Vert \mathbf t - \mathbf{\Phi}\bm\mu_w\Vert^2}{N-\sum_i \gamma_i} \\ \gamma_i &= 1- [\mathbf\Sigma_w]_{i,i} \cdot \alpha_i \end{aligned}$$
In practice, most elements of $\bm \alpha$ grow very large, which adds substantial cost to the iterative updates. If the effectively infinite elements are merely frozen rather than deleted, the resulting ill-conditioned $\mathbf \Sigma$ matrix makes further updates of $\bm \alpha$ meaningless.

The log marginal likelihood obtained in the previous section is
$$\begin{aligned} \mathcal L( \mathbf t \vert \sigma^2, \boldsymbol \alpha) = {-\frac{N}{2}} \log (2\pi)-\frac{1}{2} \log\Big\vert \mathbf{\Sigma}_t \Big\vert -\frac{1}{2} \mathbf{t}^{\rm T} \mathbf{\Sigma}^{-1}_t \mathbf{t} \end{aligned}$$
where $\mathbf{\Sigma}_t=\sigma^2 \mathbf{I}+\mathbf{\Phi}\mathbf{\Lambda}_{\alpha}\mathbf{\Phi}^{\rm T}$. Let $\mathbf C=\mathbf{\Sigma}_t$ and $\mathbf A^{-1} = \mathbf{\Lambda}_{\alpha}$.

Rearranging gives
$$\boxed{ \begin{aligned} \mathcal L( \mathbf t \vert \sigma^2, \boldsymbol \alpha) = -\frac{1}{2} \bigg[ N\log (2\pi)+ \log\Big\vert \mathbf C \Big\vert + \mathbf{t}^{\rm T} \mathbf{C}^{-1} \mathbf{t} \bigg] \end{aligned} }$$
where $\mathbf C=\sigma^2 \mathbf{I}+\mathbf{\Phi}\mathbf A^{-1}\mathbf{\Phi}^{\rm T}$.

To reduce the computational cost, we decompose $\mathbf C$ as follows:

$$\begin{aligned} \mathbf C& = \sigma^2 \mathbf{I}+\mathbf{\Phi}\mathbf A^{-1}\mathbf{\Phi}^{\rm T} \\ &= \sigma^2 \mathbf{I}+\sum_{m\neq i}\alpha_m^{-1}{\bm \phi_m} \bm \phi_m^{\rm T}+ \alpha_i^{-1}{\bm \phi_i} \bm \phi_i^{\rm T}\\ &=\mathbf C_{-i}+\alpha_i^{-1}{\bm \phi_i} \bm \phi_i^{\rm T} \end{aligned}$$
where $\mathbf C_{-i}$ is $\mathbf C$ with the contribution of basis vector $i$ removed. The determinant appearing in $\mathcal L( \mathbf t \vert \sigma^2, \boldsymbol \alpha)$ is then (enter the famous Matrix Determinant Lemma):
$$\begin{aligned} \Big\vert \mathbf C \Big\vert &= \Big\vert \mathbf C_{-i}+\alpha_i^{-1}{\bm \phi_i} \bm \phi_i^{\rm T} \Big\vert \\ &= \Big\vert \mathbf C_{-i}\Big\vert \cdot \Big\vert 1 +\alpha_i^{-1} \bm \phi_i^{\rm T } \mathbf C_{-i}^{-1} {\bm \phi_i} \Big\vert \end{aligned}$$
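As a quick numerical sanity check of this identity (random positive-definite $\mathbf C_{-i}$ and random $\bm \phi_i$; all sizes and values are illustrative):

```python
import numpy as np

# Verify the matrix determinant lemma: |C_{-i} + a^{-1} phi phi^T|
#   = |C_{-i}| * (1 + a^{-1} phi^T C_{-i}^{-1} phi)
rng = np.random.default_rng(1)
N = 6
B = rng.standard_normal((N, N))
C_minus = B @ B.T + N * np.eye(N)       # C_{-i}: symmetric positive definite
phi = rng.standard_normal(N)
alpha_inv = 0.5                          # alpha_i^{-1}

C = C_minus + alpha_inv * np.outer(phi, phi)
lhs = np.linalg.det(C)
rhs = np.linalg.det(C_minus) * (1.0 + alpha_inv * phi @ np.linalg.solve(C_minus, phi))
assert np.isclose(lhs, rhs)
```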
Next, for the inverse, we use the Sherman–Morrison formula
$$(\mathbf P+\rho \mathbf a\mathbf a^{H})^{-1} =\mathbf P^{-1} -\rho \mathbf P^{-1}\mathbf a ( 1+ \rho\mathbf a^{H} \mathbf P^{-1}\mathbf a )^{-1} \mathbf a^{H} \mathbf P^{-1}$$
which gives
$$\begin{aligned} \mathbf C^{-1} &= \bigg(\mathbf C_{-i}+\alpha_i^{-1}{\bm \phi_i} \bm \phi_i^{\rm T}\bigg)^{-1} \\ &=\mathbf C_{-i}^{-1}- \alpha_i^{-1}\mathbf C_{-i}^{-1} \bm \phi_i \bigg(1+\alpha_i^{-1}\bm \phi_i^{\rm T}\mathbf C_{-i}^{-1}\bm \phi_i \bigg)^{-1} \bm \phi_i^{\rm T}\mathbf C_{-i}^{-1} \\ &=\mathbf C_{-i}^{-1}- \frac{\mathbf C_{-i}^{-1} \bm \phi_i \bm \phi_i^{\rm T}\mathbf C_{-i}^{-1} }{\alpha_i+\bm \phi_i^{\rm T}\mathbf C_{-i}^{-1}\bm \phi_i} \end{aligned}$$
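This rank-one inverse update can likewise be checked numerically (again on illustrative random data):

```python
import numpy as np

# Verify Sherman-Morrison: (C_{-i} + a^{-1} phi phi^T)^{-1}
#   = C_{-i}^{-1} - C_{-i}^{-1} phi phi^T C_{-i}^{-1} / (a + phi^T C_{-i}^{-1} phi)
rng = np.random.default_rng(2)
N = 5
B = rng.standard_normal((N, N))
C_minus = B @ B.T + N * np.eye(N)       # symmetric positive definite
phi = rng.standard_normal(N)
alpha = 2.0

C_inv = np.linalg.inv(C_minus + np.outer(phi, phi) / alpha)
Cm_inv = np.linalg.inv(C_minus)
v = Cm_inv @ phi
rhs = Cm_inv - np.outer(v, v) / (alpha + phi @ v)
assert np.allclose(C_inv, rhs)
```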
Then, substituting $1 +\alpha_i^{-1} \bm \phi_i^{\rm T } \mathbf C_{-i}^{-1} {\bm \phi_i} =\alpha_i^{-1}(\alpha_i+\bm \phi_i^{\rm T } \mathbf C_{-i}^{-1} {\bm \phi_i} )$, $\mathcal L( \boldsymbol \alpha)$ can be written as
$$\begin{aligned} \mathcal L( \boldsymbol \alpha) &= -\frac{1}{2} \bigg[ N\log (2\pi)+ \log\Big\vert \mathbf C \Big\vert + \mathbf{t}^{\rm T} \mathbf{C}^{-1} \mathbf{t} \bigg] \\ &= -\frac{1}{2} \bigg[ N\log (2\pi)+ \log \Big\vert \mathbf C_{-i}\Big\vert + \log \Big( 1 +\alpha_i^{-1} \bm \phi_i^{\rm T } \mathbf C_{-i}^{-1} {\bm \phi_i} \Big)+ \mathbf{t}^{\rm T} \bigg(\mathbf C_{-i}^{-1}- \frac{\mathbf C_{-i}^{-1} \bm \phi_i \bm \phi_i^{\rm T}\mathbf C_{-i}^{-1} }{\alpha_i+\bm \phi_i^{\rm T}\mathbf C_{-i}^{-1}\bm \phi_i} \bigg) \mathbf{t} \bigg] \\ &= -\frac{1}{2} \bigg[ N\log (2\pi)+ \log \Big\vert \mathbf C_{-i}\Big\vert - \log\alpha_i+\log \Big( \alpha_i+\bm \phi_i^{\rm T } \mathbf C_{-i}^{-1} {\bm \phi_i} \Big) + \mathbf{t}^{\rm T} \mathbf C_{-i}^{-1}\mathbf{t}- \frac{\mathbf{t}^{\rm T} \mathbf C_{-i}^{-1} \bm \phi_i \bm \phi_i^{\rm T}\mathbf C_{-i}^{-1} \mathbf{t}}{\alpha_i+\bm \phi_i^{\rm T}\mathbf C_{-i}^{-1}\bm \phi_i } \bigg] \\ &= -\frac{1}{2} \bigg[ N\log (2\pi)+ \log \Big\vert \mathbf C_{-i}\Big\vert - \log\alpha_i+\log \Big( \alpha_i+\bm \phi_i^{\rm T } \mathbf C_{-i}^{-1} {\bm \phi_i} \Big) + \mathbf{t}^{\rm T} \mathbf C_{-i}^{-1}\mathbf{t} - \frac{(\bm \phi_i^{\rm T}\mathbf C_{-i}^{-1} \mathbf{t})^2}{\alpha_i+\bm \phi_i^{\rm T}\mathbf C_{-i}^{-1}\bm \phi_i } \bigg] \end{aligned}$$
Define
$$\begin{aligned} s_i &= \bm \phi_i^{\rm T } \mathbf C_{-i}^{-1} {\bm \phi_i} \\ q_i &= \bm \phi_i^{\rm T}\mathbf C_{-i}^{-1} \mathbf{t} \end{aligned}$$
so that $\mathcal L( \boldsymbol \alpha)$ becomes
$$\begin{aligned} \mathcal L( \boldsymbol \alpha) &= -\frac{1}{2} \bigg[ N\log (2\pi)+ \log \Big\vert \mathbf C_{-i}\Big\vert + \mathbf{t}^{\rm T} \mathbf C_{-i}^{-1}\mathbf{t} - \log\alpha_i+\log \Big( \alpha_i+s_i \Big) - \frac{q_i^2}{\alpha_i+ s_i } \bigg] \\ &=\mathcal L( \boldsymbol \alpha_{-i})+ \frac{1}{2} \bigg[ \log\alpha_i-\log \Big( \alpha_i+s_i \Big) + \frac{q_i^2}{\alpha_i+ s_i } \bigg] \\ &= \mathcal L( \boldsymbol \alpha_{-i})+ \ell(\alpha_i) \end{aligned}$$

Here the two scalar quantities have the following interpretations:

  • The sparsity factor $s_i$ can be viewed as measuring the extent to which basis vector $\bm \phi_i$ overlaps with the basis vectors already present in the model.
  • The quality factor $q_i$ measures how well $\bm \phi_i$ aligns with the error of the model that excludes it, i.e., how much adding $\bm \phi_i$ would increase $\mathcal L( \boldsymbol \alpha)$.

The analysis above (differentiate, combine over a common denominator, set the numerator to zero, and keep only solutions in the valid range for a variance) shows that $\mathcal L( \boldsymbol \alpha)$ has a unique maximum with respect to $\alpha_i$:

$$\begin{aligned} \alpha_i &= \frac{s_i^2}{q_i^2-s_i},\quad \quad q_i^2>s_i \\ \alpha_i &= \infty,\quad \quad\quad\quad q_i^2\leq s_i \end{aligned}$$
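The maximizer in the $q_i^2 > s_i$ case is easy to check numerically against $\ell(\alpha_i)$ from the previous decomposition (the values of $s_i$ and $q_i$ below are arbitrary, chosen only so that $q_i^2 > s_i$):

```python
import numpy as np

# l(alpha) = 0.5*[log(alpha) - log(alpha + s) + q^2/(alpha + s)]
def ell(alpha, s, q):
    return 0.5 * (np.log(alpha) - np.log(alpha + s) + q**2 / (alpha + s))

s, q = 1.5, 2.0                      # q^2 = 4 > s, so a finite maximizer exists
alpha_star = s**2 / (q**2 - s)       # predicted maximizer s^2/(q^2 - s)
grid = np.linspace(0.01, 50.0, 5000)
# the grid maximum should land on the predicted alpha*
assert abs(grid[np.argmax(ell(grid, s, q))] - alpha_star) < 0.05
```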

Iterating this procedure — computing $s_i$ and $q_i$ for every basis vector $\bm \phi_i$ and updating the parameter vector $\bm \alpha$ — the fast RVM-based sparse Bayesian algorithm can be summarized as follows:

  • 1) For the sparse regression problem, initialize $\sigma^2$.

  • 2) Initialize with a single basis vector $\bm \phi_i$, setting
    $$\alpha_i = \frac{s_i^2}{q_i^2-s_i},\quad q_i^2>s_i$$
    At this point the initial $\mathbf C =\sigma^2\mathbf I+ \alpha_i^{-1}{\bm \phi_i} \bm \phi_i^{\rm T}$, so $\mathbf C_{-i}=\sigma^2\mathbf I \rightarrow \mathbf C_{-i}^{-1}=\sigma^{-2}\mathbf I$. The hyperparameter is therefore
    $$\begin{aligned} \alpha_i &= \frac{s_i^2}{q_i^2-s_i}=\frac{(\sigma^{-2}\bm \phi_i^{\rm T } {\bm \phi_i})^2}{(\sigma^{-2}\bm \phi_i^{\rm T } \mathbf{t})^2- \sigma^{-2}\bm \phi_i^{\rm T } {\bm \phi_i}} \\ &=\frac{\Vert {\bm \phi_i} \Vert^2}{(\bm \phi_i^{\rm T } \mathbf{t})^2/\Vert \bm \phi_i \Vert^2- \sigma^{2}} \end{aligned}$$
    All other $\alpha_m$ are in principle infinite.

  • 3) Explicitly compute the initial mean $\bm \mu$ and covariance $\mathbf \Sigma$ (initially scalars):
    $$\begin{aligned} \mathbf \Sigma &= \Big[ {\sigma^{-2}}\mathbf{\Phi}^{\rm T}\mathbf{\Phi}+\mathbf{A}\Big] ^{-1}\\ \bm\mu &= \sigma^{-2} \mathbf \Sigma \mathbf{\Phi}^{\rm T}\mathbf t \end{aligned}$$
    and at the same time initialize $s_m$ and $q_m$ for all the other basis vectors $\bm \phi_m$.

  • 4) Select a candidate basis vector $\bm \phi_i$ from the full set of $M$ basis vectors.

  • 5) Compute $\theta_i =q_i^2-s_i$:

    • If $\theta_i >0$ and $\alpha_i < \infty$, re-estimate $\alpha_i$.
    • If $\theta_i >0$ and $\alpha_i = \infty$, add basis vector $\bm \phi_i$ and estimate $\alpha_i$.
    • If $\theta_i \leq 0$ and $\alpha_i < \infty$, delete basis vector $\bm \phi_i$ and set $\alpha_i =\infty$.
  • 6) Estimate the noise level and update its variance:
    $$\sigma^2= \frac{\Vert \mathbf t - \mathbf{\Phi}\bm\mu\Vert^2}{N-\sum_i \gamma_i} =\frac{\Vert \mathbf t - \mathbf{\Phi}\bm\mu\Vert^2}{N-M+\sum_m \alpha_m [\mathbf\Sigma]_{m,m}}$$

  • 7) Update the mean $\bm \mu$ and covariance $\mathbf \Sigma$ again, together with all $s_m$ and $q_m$.

  • 8) Check for convergence; if converged, the algorithm terminates, otherwise return to step (4) and continue iterating.
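Putting steps (1)–(8) together, here is a compact numpy sketch. It is deliberately not the optimized implementation: it rebuilds $\mathbf C^{-1}$ from scratch each iteration instead of using the rank-one updates derived above, cycles through candidate bases rather than selecting greedily, and runs a fixed number of iterations in place of a convergence test. All names and the synthetic test problem are illustrative.

```python
import numpy as np

def fast_rvm(Phi, t, sigma2, n_iter=300):
    """Sketch of steps (1)-(8); candidate selection and stopping are simplified."""
    N, M = Phi.shape
    norms2 = np.sum(Phi**2, axis=0)
    # step 2: seed with the basis vector best aligned with t (assumes proj > sigma2)
    proj = (Phi.T @ t)**2 / norms2
    i0 = int(np.argmax(proj))
    alpha = np.full(M, np.inf)
    alpha[i0] = norms2[i0] / (proj[i0] - sigma2)
    for it in range(n_iter):
        act = np.flatnonzero(np.isfinite(alpha))
        Pa = Phi[:, act]
        # C = sigma2*I + Phi A^{-1} Phi^T over the active set (no rank-one updates here)
        C_inv = np.linalg.inv(sigma2 * np.eye(N) + Pa @ np.diag(1.0 / alpha[act]) @ Pa.T)
        S = np.einsum('nm,nk,km->m', Phi, C_inv, Phi)   # phi_m^T C^{-1} phi_m
        Q = Phi.T @ C_inv @ t                           # phi_m^T C^{-1} t
        # leave-one-out s_m, q_m for bases already in the model
        s, q = S.copy(), Q.copy()
        s[act] = alpha[act] * S[act] / (alpha[act] - S[act])
        q[act] = alpha[act] * Q[act] / (alpha[act] - S[act])
        m = it % M                                      # step 4: cycle candidates
        theta = q[m]**2 - s[m]                          # step 5
        if theta > 0:
            alpha[m] = s[m]**2 / theta                  # re-estimate or add
        elif np.isfinite(alpha[m]) and len(act) > 1:
            alpha[m] = np.inf                           # delete
        # step 6/7: posterior and noise update on the active set
        act = np.flatnonzero(np.isfinite(alpha))
        Pa = Phi[:, act]
        Sg = np.linalg.inv(Pa.T @ Pa / sigma2 + np.diag(alpha[act]))
        mu = Sg @ Pa.T @ t / sigma2
        gamma = 1.0 - alpha[act] * np.diag(Sg)
        sigma2 = np.sum((t - Pa @ mu)**2) / (N - gamma.sum())
    w = np.zeros(M)
    w[act] = mu
    return w, act

# small synthetic problem: two relevant basis vectors out of ten
rng = np.random.default_rng(0)
N, M = 50, 10
Phi = rng.standard_normal((N, M))
w_true = np.zeros(M)
w_true[2], w_true[7] = 1.5, -2.0
t = Phi @ w_true + 0.1 * rng.standard_normal(N)
w, act = fast_rvm(Phi, t, sigma2=0.1)
```

Note the key practical difference from the first loop on this page: basis vectors with $\alpha_i = \infty$ are genuinely absent from $\mathbf C$ and from the posterior computation, so the per-iteration cost scales with the (small) active set rather than with $M$.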


Reposted from blog.csdn.net/qq_23947237/article/details/90440265