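Feng, J., Xu, H., & Yan, S. (2013). Online robust PCA via stochastic optimization. In Advances in Neural Information Processing Systems (pp. 404-412).
These are my notes on this NIPS conference paper; they mainly expand on the theory and methods it presents. My academic background is limited, so corrections are welcome if you spot any errors.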

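Abstract: RPCA is a typical batch optimization method, and it needs to load all samples into memory during optimization. This prevents it from processing big data efficiently. The paper develops an Online RPCA (ORPCA) algorithm that processes one sample at a time, so that its memory cost is independent of the number of samples, which greatly improves computational and storage efficiency. The proposed method is based on stochastic optimization of an equivalent reformulation of batch RPCA. Indeed, ORPCA produces a sequence of subspace estimates that converges to the optimum of its batch counterpart, and hence is provably robust to sparse corruption. In addition, ORPCA applies naturally to dynamic subspace tracking. Simulations on subspace recovery and tracking demonstrate its advantages in robustness and efficiency.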

1 Introduction

Omitted.

2 Related Work

Omitted.

3 Problem Formulation

3.1 Notation

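Vectors are denoted by bold lowercase letters. $\mathbf{x} \in \mathbb{R}^p$ is the true, noise-free sample, $\mathbf{e} \in \mathbb{R}^p$ is the noise, and $\mathbf{z} \in \mathbb{R}^p$ is the observed sample, with $\mathbf{z} = \mathbf{x} + \mathbf{e}$. Here $p$ is the ambient dimension of the samples, $r$ is the intrinsic dimension of the subspace underlying $\{\mathbf{x}_i\}_{i=1}^{n}$, $n$ is the number of samples, and $t$ is the sample index. Matrices are denoted by bold uppercase letters. $\mathbf{Z} \in \mathbb{R}^{p \times n}$ is the observed data matrix, whose column $\mathbf{z}_i$ is the $i$-th sample. For any real matrix $\mathbf{E}$, $||\mathbf{E}||_{\mathrm{F}}$ denotes the Frobenius norm, $||\mathbf{E}||_1 = \sum_{i,j} |E_{ij}|$ denotes the $\ell_1$ norm of $\mathbf{E} \in \mathbb{R}^{p \times n}$ viewed as a long vector, and $||\mathbf{E}||_* = \sum_i \sigma_i(\mathbf{E})$ denotes the nuclear norm, that is, the sum of the singular values.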
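As a quick illustration of the three norms just defined, here is a minimal NumPy sketch. The matrix `E` is a made-up toy example, not data from the paper.

```python
import numpy as np

# Hypothetical toy matrix, used only to illustrate the three norms.
E = np.array([[3.0, 0.0],
              [0.0, -4.0]])

fro_norm = np.linalg.norm(E, "fro")   # square root of the sum of squared entries
l1_norm = np.abs(E).sum()             # entrywise l1 norm (matrix viewed as a long vector)
nuc_norm = np.linalg.norm(E, "nuc")   # nuclear norm: sum of the singular values

print(fro_norm, l1_norm, nuc_norm)
```

Note that the entrywise $\ell_1$ norm is computed by hand here: `np.linalg.norm(E, 1)` returns the maximum absolute column sum, which is a different quantity.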

3.2 Objective Function

Robust PCA (RPCA) aims to accurately estimate the subspace underlying the observed samples, even when the samples are corrupted by gross but sparse noise. One of the most popular RPCA methods, Principal Component Pursuit (PCP) [1], decomposes the sample matrix $\mathbf{Z}$ into a low-rank component $\mathbf{X}$, which represents the low-dimensional subspace, plus an overall sparse matrix $\mathbf{E}$, which represents the sparse corruption. Under suitable conditions, PCP guarantees that the two components $\mathbf{X}$ and $\mathbf{E}$ can be recovered exactly by solving

$\begin{equation} \min_{\mathbf{X},\mathbf{E}} \ \frac{1}{2} || \mathbf{Z} - \mathbf{X} - \mathbf{E} ||_\mathrm{F}^2 + \lambda_1 || \mathbf{X} ||_* + \lambda_2 || \mathbf{E} ||_1 . \tag{1} \end{equation}$

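Iterative optimization methods such as Accelerated Proximal Gradient (APG) [2] or the Augmented Lagrangian Multiplier (ALM) method [3] are usually used to solve this problem. However, these methods are implemented in batch form: in each iteration they perform an SVD over all samples. As a result, the storage cost becomes prohibitive on big data such as web data or large-scale image collections.

The paper considers an online implementation of PCP. The main difficulty is that the nuclear norm tightly couples all samples, so the samples cannot be considered separately as in typical online optimization problems. To overcome this, the paper uses an equivalent form of the nuclear norm for a matrix $\mathbf{X}$ whose rank is at most $r$, as stated in [4]: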

$\begin{equation} ||\mathbf{X}||_* = \inf_{\mathbf{L}\in\mathbb{R}^{p \times r}, \mathbf{R}\in\mathbb{R}^{n \times r}} \left\{ \frac{1}{2} ||\mathbf{L}||_\mathrm{F}^2 + \frac{1}{2} ||\mathbf{R}||_\mathrm{F}^2 : \mathbf{X} = \mathbf{L} \mathbf{R}^\mathrm{T} \right\}. \tag{2} \end{equation}$
That is, the nuclear norm can be expressed explicitly in terms of a low-rank factorization. This form was first proposed in [5] and has been applied successfully in [6, 7]. In this formulation, $\mathbf{L}\in\mathbb{R}^{p \times r}$ can be viewed as a basis of the low-rank subspace, and $\mathbf{R}\in\mathbb{R}^{n \times r}$ holds the coefficients of the samples with respect to that basis. The RPCA problem can therefore be rewritten as

$\begin{equation} \min_{\mathbf{X}, \mathbf{L}\in\mathbb{R}^{p \times r}, \mathbf{R}\in\mathbb{R}^{n \times r}, \mathbf{E}} \ \frac{1}{2} || \mathbf{Z} - \mathbf{X} - \mathbf{E} ||_\mathrm{F}^2 + \frac{\lambda_1}{2} \left( ||\mathbf{L}||_\mathrm{F}^2 + ||\mathbf{R}||_\mathrm{F}^2 \right) + \lambda_2 || \mathbf{E} ||_1 , \ \mathrm{s.t.} \ \mathbf{X} = \mathbf{L}\mathbf{R}^\mathrm{T} . \tag{3} \end{equation}$
Substituting $\mathbf{L}\mathbf{R}^\mathrm{T}$ for $\mathbf{X}$ removes the equality constraint, and the problem above becomes equivalent to

$\begin{equation} \min_{\mathbf{L}\in\mathbb{R}^{p \times r}, \mathbf{R}\in\mathbb{R}^{n \times r}, \mathbf{E}} \ \frac{1}{2} || \mathbf{Z} - \mathbf{L}\mathbf{R}^\mathrm{T} - \mathbf{E} ||_\mathrm{F}^2 + \frac{\lambda_1}{2} \left( ||\mathbf{L}||_\mathrm{F}^2 + ||\mathbf{R}||_\mathrm{F}^2 \right) + \lambda_2 || \mathbf{E} ||_1 . \tag{4} \end{equation}$

Although this objective is not jointly convex in $\mathbf{L}$ and $\mathbf{R}$, the paper proves that its local minima are globally optimal solutions of the original problem.
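The factorized form of the nuclear norm in (2) can be checked numerically. The sketch below is only an illustration under assumed toy sizes: it builds a rank-$r$ matrix, takes the balanced factors $\mathbf{L} = \mathbf{U}\sqrt{\mathbf{\Sigma}}$ and $\mathbf{R} = \mathbf{V}\sqrt{\mathbf{\Sigma}}$ from the SVD (a standard choice that attains the infimum), and compares the factor cost with the nuclear norm. The rescaling constant `c` is an arbitrary value chosen to show that an unbalanced factorization of the same matrix costs more.

```python
import numpy as np

rng = np.random.default_rng(0)
p, n, r = 8, 6, 2                      # hypothetical toy sizes

# Build a rank-r matrix X.
X = rng.standard_normal((p, r)) @ rng.standard_normal((r, n))

# Thin SVD, truncated to the r leading singular triplets.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
U, s, Vt = U[:, :r], s[:r], Vt[:r, :]

# Balanced factors: L = U sqrt(S), R = V sqrt(S), so that X = L R^T.
L = U * np.sqrt(s)
R = Vt.T * np.sqrt(s)

nuclear = np.linalg.norm(X, "nuc")
factor_cost = 0.5 * (np.linalg.norm(L, "fro") ** 2 + np.linalg.norm(R, "fro") ** 2)

# An unbalanced factorization (c*L, R/c) of the same X has a larger cost.
c = 3.0
unbalanced_cost = 0.5 * (np.linalg.norm(c * L, "fro") ** 2
                         + np.linalg.norm(R / c, "fro") ** 2)

print(nuclear, factor_cost, unbalanced_cost)
```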

Given a finite sample set $\mathbf{Z}=[\mathbf{z}_1, \cdots, \mathbf{z}_n] \in \mathbb{R}^{p \times n}$, solving the problem above amounts to minimizing the empirical cost function

$\begin{equation} f_n (\mathbf{L}) \triangleq \frac{1}{n} \sum_{i=1}^{n} \ell (\mathbf{z}_i, \mathbf{L}) + \frac{\lambda_1}{2n} ||\mathbf{L}||_\mathrm{F}^2, \tag{5} \end{equation}$
where the loss function for each sample is defined as

$\begin{equation} \ell(\mathbf{z}_i, \mathbf{L}) \triangleq \min_{\mathbf{r},\mathbf{e}} \ \frac{1}{2} ||\mathbf{z}_i -\mathbf{L}\mathbf{r} - \mathbf{e}||_2^2 + \frac{\lambda_1}{2} ||\mathbf{r}||_2^2 + \lambda_2 ||\mathbf{e}||_1 . \tag{6} \end{equation}$
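This loss measures the representation error of a sample $\mathbf{z}$ on a fixed basis $\mathbf{L}$, where the coefficients $\mathbf{r}$ and the sparse noise $\mathbf{e}$ of each sample are obtained by minimizing the loss. In stochastic optimization, one is usually interested in minimizing the expected loss over all samples [16]

$\begin{equation} f(\mathbf{L}) \triangleq \mathbb{E}_{\mathbf{z}} [\ell (\mathbf{z}, \mathbf{L})] = \lim_{n \rightarrow \infty} f_n (\mathbf{L}) , \tag{7} \end{equation}$
where the expectation is taken with respect to the distribution of the samples $\mathbf{z}$. The paper first builds a surrogate function that approximates this expected cost and then optimizes the surrogate in an online fashion.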

4 Stochastic Optimization for Online RPCA

The main idea is to design a stochastic optimization algorithm that minimizes the cost function while processing one sample per time instance. The coefficients $\mathbf{r}$, the noise $\mathbf{e}$, and the basis $\mathbf{L}$ are optimized alternately. At time instance $t$, the basis estimate $\mathbf{L}_t$ is obtained by minimizing the cumulative loss with respect to the previously estimated coefficients $\{\mathbf{r}_i\}_{i=1}^t$ and sparse noise $\{\mathbf{e}_i\}_{i=1}^t$. The objective function for updating $\mathbf{L}_t$ is defined as

$\begin{equation} g_t(\mathbf{L}) \triangleq \frac{1}{t} \sum_{i=1}^{t} \left( \frac{1}{2} ||\mathbf{z}_i -\mathbf{L}\mathbf{r}_i - \mathbf{e}_i||_2^2 + \frac{\lambda_1}{2} ||\mathbf{r}_i||_2^2 + \lambda_2 ||\mathbf{e}_i||_1 \right) + \frac{\lambda_1}{2t} ||\mathbf{L}||_\mathrm{F}^2. \tag{8} \end{equation}$
This is a surrogate for the empirical cost function $f_t(\mathbf{L})$, and it can be shown to be an upper bound: $g_t(\mathbf{L}) \geq f_t(\mathbf{L})$.
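One way to see the bound, using only definitions (5), (6), and (8): for each $i$, the stored pair $(\mathbf{r}_i, \mathbf{e}_i)$ is just one admissible choice in the minimization that defines $\ell(\mathbf{z}_i, \mathbf{L})$, so the minimum can only be smaller. Written out as a short worked step:

$\begin{equation} \begin{aligned} \ell(\mathbf{z}_i, \mathbf{L}) &= \min_{\mathbf{r},\mathbf{e}} \ \frac{1}{2} ||\mathbf{z}_i -\mathbf{L}\mathbf{r} - \mathbf{e}||_2^2 + \frac{\lambda_1}{2} ||\mathbf{r}||_2^2 + \lambda_2 ||\mathbf{e}||_1 \\ &\leq \frac{1}{2} ||\mathbf{z}_i -\mathbf{L}\mathbf{r}_i - \mathbf{e}_i||_2^2 + \frac{\lambda_1}{2} ||\mathbf{r}_i||_2^2 + \lambda_2 ||\mathbf{e}_i||_1 , \\ f_t(\mathbf{L}) &= \frac{1}{t} \sum_{i=1}^{t} \ell(\mathbf{z}_i, \mathbf{L}) + \frac{\lambda_1}{2t} ||\mathbf{L}||_\mathrm{F}^2 \ \leq \ g_t(\mathbf{L}) . \end{aligned} \end{equation}$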

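The algorithm is summarized in Algorithm 1. The first subproblem, step 2), is a small-scale convex problem that can be solved efficiently; see the Appendix for the derivation. To update the basis $\mathbf{L}$, block coordinate descent [8] is used: each column of $\mathbf{L}$ is updated in turn while the other columns are held fixed.

The theoretical analysis that follows in the paper is omitted here; see the original paper for details.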

Algorithm 1 Stochastic Optimization for Online RPCA
Input: observed data $\{ \mathbf{z}_1, \cdots, \mathbf{z}_T \}$; regularization parameters $\lambda_1, \lambda_2$; initial values $\mathbf{L}_0 \in \mathbb{R}^{p \times r}$, $\mathbf{r}_0 \in \mathbb{R}^{r}$, $\mathbf{e}_0 \in \mathbb{R}^{p}$; maximum number of iterations $T$.
for $t$ = 1 to $T$ do
$\quad$ 1) Reveal the sample $\mathbf{z}_t$;
$\quad$ 2) Project the new sample:

$\begin{equation} \{\mathbf{r}_t, \mathbf{e}_t \} = \arg\min \ \frac{1}{2} ||\mathbf{z}_t -\mathbf{L}_{t-1}\mathbf{r} - \mathbf{e}||_2^2 + \frac{\lambda_1}{2} ||\mathbf{r}||_2^2 + \lambda_2 ||\mathbf{e}||_1 . \tag{9} \end{equation}$

$\quad$ 3) Update the accumulation matrices (starting from $\mathbf{A}_0 = \mathbf{0}$, $\mathbf{B}_0 = \mathbf{0}$): $\mathbf{A}_t \leftarrow \mathbf{A}_{t-1} + \mathbf{r}_t\mathbf{r}_t^{\mathrm{T}}$, $\mathbf{B}_t \leftarrow \mathbf{B}_{t-1} + (\mathbf{z}_t - \mathbf{e}_t ) \mathbf{r}_t^{\mathrm{T}}$;

$\quad$ 4) Compute $\mathbf{L}_t$ with $\mathbf{L}_{t-1}$ as a warm start, using Algorithm 2:

$\begin{equation} \mathbf{L}_t \triangleq \arg\min \ \frac{1}{2} \mathrm{tr} \left[ \mathbf{L} (\mathbf{A}_t + \lambda_1 \mathbf{I}) \mathbf{L}^{\mathrm{T}} \right] - \mathrm{tr} (\mathbf{L}^{\mathrm{T}} \mathbf{B}_t). \tag{10} \end{equation}$
end for
Return $\mathbf{X}_T=\mathbf{L}_T \mathbf{R}_T^{\mathrm{T}}$ (the low-rank data matrix, where the $t$-th row of $\mathbf{R}_T$ is $\mathbf{r}_t^{\mathrm{T}}$) and $\mathbf{E}_T$ (the sparse noise matrix, whose $t$-th column is $\mathbf{e}_t$).
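The loop in Algorithm 1 can be sketched in NumPy as follows. This is a minimal illustration rather than the authors' implementation, and several details are assumptions: the function names (`soft_threshold`, `solve_r_e`, `update_basis`, `orpca`) are hypothetical, $\mathbf{L}_0$ is drawn at random, step 2) is solved by a fixed number of alternations between the closed-form updates derived in the Appendix, a single sweep of Algorithm 2 is run per sample, and the toy data and the setting $\lambda_1 = \lambda_2 = 1/\sqrt{p}$ are chosen only for the demonstration.

```python
import numpy as np

def soft_threshold(v, sigma):
    """Entrywise shrinkage: sign(v) * max(|v| - sigma, 0)."""
    return np.sign(v) * np.maximum(np.abs(v) - sigma, 0.0)

def solve_r_e(z, L, lam1, lam2, n_iter=50):
    """Step 2: alternate the closed-form updates of r and e for one sample."""
    G = L.T @ L + lam1 * np.eye(L.shape[1])
    e = np.zeros_like(z)
    for _ in range(n_iter):                    # fixed iteration count: an assumption
        r = np.linalg.solve(G, L.T @ (z - e))  # ridge-regression update of r
        e = soft_threshold(z - L @ r, lam2)    # shrinkage update of e
    return r, e

def update_basis(L, A, B, lam1):
    """Step 4: one sweep of block coordinate descent over the columns of L."""
    A_tilde = A + lam1 * np.eye(A.shape[0])
    L = L.copy()
    for j in range(L.shape[1]):
        L[:, j] += (B[:, j] - L @ A_tilde[:, j]) / A_tilde[j, j]
    return L

def orpca(Z, rank, lam1, lam2, seed=0):
    """Process the columns of Z one at a time; return L, R, E."""
    p, T = Z.shape
    rng = np.random.default_rng(seed)
    L = rng.standard_normal((p, rank))         # random L_0: an assumption
    A = np.zeros((rank, rank))                 # A_0 = 0
    B = np.zeros((p, rank))                    # B_0 = 0
    R = np.zeros((T, rank))
    E = np.zeros((p, T))
    for t in range(T):
        z = Z[:, t]
        r_t, e_t = solve_r_e(z, L, lam1, lam2)
        A += np.outer(r_t, r_t)
        B += np.outer(z - e_t, r_t)
        L = update_basis(L, A, B, lam1)
        R[t], E[:, t] = r_t, e_t
    return L, R, E

# Toy usage: low-rank data plus roughly 5% gross sparse corruption (hypothetical).
rng = np.random.default_rng(1)
p, T, rank = 20, 200, 2
X_true = rng.standard_normal((p, rank)) @ rng.standard_normal((rank, T))
S_true = (rng.random((p, T)) < 0.05) * rng.uniform(-5.0, 5.0, (p, T))
Z = X_true + S_true
lam = 1.0 / np.sqrt(p)                         # parameter choice: an assumption
L, R, E = orpca(Z, rank, lam1=lam, lam2=lam)
print(L.shape, R.shape, E.shape)
```

Only `L`, `A`, and `B` are needed to continue the recursion, so the memory required by the update itself does not grow with the number of samples; `R` and `E` are stored here only to form the returned matrices.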

Algorithm 2 Basis Update
Input: $\mathbf{L} = [\mathbf{l}_1,\cdots,\mathbf{l}_r] \in \mathbb{R}^{p \times r}$, $\mathbf{A}=[\mathbf{a}_1,\cdots,\mathbf{a}_r] \in \mathbb{R}^{r \times r}$, $\mathbf{B}=[\mathbf{b}_1,\cdots,\mathbf{b}_r] \in \mathbb{R}^{p \times r}$.
for $j$ = 1 to $r$ do

$\begin{equation} \begin{gathered} \tilde{\mathbf{A}} \leftarrow \mathbf{A} + \lambda_1 \mathbf{I}, \\ \mathbf{l}_j \leftarrow \frac{1}{\tilde{\mathbf{A}}_{j,j}} (\mathbf{b}_j - \mathbf{L} \tilde{\mathbf{a}}_j) + \mathbf{l}_j . \end{gathered} \tag{11} \end{equation}$
end for
Return $\mathbf{L}$.
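As a sanity check on the column update (11), the sketch below repeats the sweep many times on synthetic accumulators and compares the result with the stationary point of the quadratic objective, $\mathbf{L} = \mathbf{B}(\mathbf{A} + \lambda_1 \mathbf{I})^{-1}$, obtained by setting the gradient to zero. The function name `update_basis`, the random data, and the number of sweeps are assumptions made for this illustration only.

```python
import numpy as np

def update_basis(L, A, B, lam1):
    """One sweep of the column-wise update (11)."""
    A_tilde = A + lam1 * np.eye(A.shape[0])
    L = L.copy()
    for j in range(L.shape[1]):
        L[:, j] += (B[:, j] - L @ A_tilde[:, j]) / A_tilde[j, j]
    return L

rng = np.random.default_rng(0)
p, r, t, lam1 = 6, 3, 50, 0.5                  # hypothetical toy sizes

# Synthetic accumulators: A = sum_i r_i r_i^T, B = sum_i y_i r_i^T.
Rm = rng.standard_normal((t, r))
Y = rng.standard_normal((p, t))
A = Rm.T @ Rm
B = Y @ Rm

# Stationary point: L (A + lam1 I) = B, solved via the transposed system.
L_closed = np.linalg.solve(A + lam1 * np.eye(r), B.T).T

L = rng.standard_normal((p, r))
for _ in range(200):                           # repeated sweeps, for this check only
    L = update_basis(L, A, B, lam1)

gap = np.linalg.norm(L - L_closed)
print(gap)
```

Each sweep is a Gauss-Seidel pass on a symmetric positive definite system, which is why repeated sweeps approach the closed-form solution while never requiring a matrix inverse.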

5 Experiments

Omitted.

Appendix

Derivation of step 2) of Algorithm 1. Update formula for $\mathbf{r}$:

$\begin{equation} \begin{gathered} \mathcal{L} = \frac{1}{2} ||\mathbf{z}_t -\mathbf{L}_{t-1}\mathbf{r} - \mathbf{e}||_2^2 + \frac{\lambda_1}{2} ||\mathbf{r}||_2^2 + \lambda_2 ||\mathbf{e}||_1 , \\ \frac{\partial \mathcal{L}}{\partial \mathbf{r}} = \mathbf{L}_{t-1}^{\mathrm{T}} (\mathbf{L}_{t-1}\mathbf{r} + \mathbf{e} - \mathbf{z}_t) + \lambda_1 \mathbf{r} = 0 , \\ (\mathbf{L}_{t-1}^{\mathrm{T}} \mathbf{L}_{t-1} + \lambda_1 \mathbf{I}) \mathbf{r} = \mathbf{L}_{t-1}^{\mathrm{T}} (\mathbf{z}_t - \mathbf{e}), \\ \mathbf{r}^* = (\mathbf{L}_{t-1}^{\mathrm{T}} \mathbf{L}_{t-1} + \lambda_1 \mathbf{I})^{-1} \mathbf{L}_{t-1}^{\mathrm{T}} (\mathbf{z}_t - \mathbf{e}) . \end{gathered} \tag{12} \end{equation}$

Update formula for $\mathbf{e}$:

$\begin{equation} \begin{gathered} \arg\min_{\mathbf{e}} \ \frac{1}{2} || \mathbf{e} - (\mathbf{z}_t - \mathbf{L}_{t-1}\mathbf{r}) ||_2^2 + \lambda_2 ||\mathbf{e}||_1, \\ \mathbf{e} = S_{\lambda_2} (\mathbf{z}_t - \mathbf{L}_{t-1} \mathbf{r}), \end{gathered} \tag{13} \end{equation}$
where $S_{\sigma} (x) = \mathrm{sign}(x) \cdot \max(|x|-\sigma, 0)$ is the shrinkage (soft-thresholding) function, applied entrywise; it gives the closed-form minimizer of the $\ell_1$-regularized problem above. Because $\mathbf{r}^*$ depends on $\mathbf{e}$ and $\mathbf{e}$ depends on $\mathbf{r}$, the two updates are alternated until convergence.
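To illustrate that the shrinkage function solves the scalar problem $\min_e \frac{1}{2}(e - v)^2 + \sigma |e|$, the sketch below compares it against a brute-force grid search. The test values in `v`, the threshold `sigma`, and the grid resolution are arbitrary choices for this check.

```python
import numpy as np

def soft_threshold(v, sigma):
    """Entrywise shrinkage: sign(v) * max(|v| - sigma, 0)."""
    return np.sign(v) * np.maximum(np.abs(v) - sigma, 0.0)

sigma = 1.0
v = np.array([-3.0, -0.5, 0.0, 0.4, 2.5])    # hypothetical residual entries
e_closed = soft_threshold(v, sigma)

# Brute force: minimize 0.5*(e - v_k)^2 + sigma*|e| over a fine grid, per entry.
grid = np.linspace(-5.0, 5.0, 100001)         # grid step of 1e-4
cost = 0.5 * (grid[None, :] - v[:, None]) ** 2 + sigma * np.abs(grid[None, :])
e_grid = grid[np.argmin(cost, axis=1)]

max_diff = np.max(np.abs(e_grid - e_closed))
print(e_closed, max_diff)
```

Entries with $|v| \le \sigma$ are set to zero, which is what makes the estimated noise $\mathbf{e}$ sparse; larger entries are shrunk toward zero by $\sigma$.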

Derivation of Algorithm 2. For a single sample, the part of the objective that depends on $\mathbf{L}$ is

$\begin{equation} \begin{aligned} \mathcal{F} &= \frac{1}{2} ||\mathbf{L}\mathbf{r}_t + \mathbf{e}_t - \mathbf{z}_t||_2^2 + \frac{\lambda_1}{2} || \mathbf{L} ||_{\mathrm{F}}^2 \\ &= \frac{1}{2} \left[ \mathrm{tr} \left( (\mathbf{L}\mathbf{r}_t) (\mathbf{L}\mathbf{r}_t)^{\mathrm{T}} \right) + 2 \, \mathrm{tr} \left( \mathbf{L}\mathbf{r}_t (\mathbf{e}_t - \mathbf{z}_t)^{\mathrm{T}} \right) + \mathrm{tr} \left( (\mathbf{e}_t - \mathbf{z}_t) (\mathbf{e}_t - \mathbf{z}_t)^{\mathrm{T}} \right) \right] + \frac{\lambda_1}{2} \mathrm{tr} \left( \mathbf{L} \mathbf{L}^{\mathrm{T}} \right) \\ &= \frac{1}{2} \mathrm{tr} \left( \mathbf{L}\mathbf{r}_t \mathbf{r}_t^{\mathrm{T}} \mathbf{L}^{\mathrm{T}} + \lambda_1 \mathbf{L} \mathbf{L}^{\mathrm{T}} \right) + \mathrm{tr} \left( \mathbf{L}\mathbf{r}_t (\mathbf{e}_t - \mathbf{z}_t)^{\mathrm{T}} \right) + \mathrm{const} \\ &= \frac{1}{2} \mathrm{tr} \left[ \mathbf{L} (\mathbf{r}_t \mathbf{r}_t^{\mathrm{T}} + \lambda_1 \mathbf{I}) \mathbf{L}^{\mathrm{T}} \right] - \mathrm{tr} \left[ \mathbf{L}^{\mathrm{T}} (\mathbf{z}_t - \mathbf{e}_t) \mathbf{r}_t^{\mathrm{T}} \right] + \mathrm{const} , \end{aligned} \tag{14} \end{equation}$
where the constant collects the term $\frac{1}{2}||\mathbf{e}_t - \mathbf{z}_t||_2^2$, which does not depend on $\mathbf{L}$. Summing the data terms over the samples $i = 1, \cdots, t$ while keeping a single regularizer $\frac{\lambda_1}{2}||\mathbf{L}||_{\mathrm{F}}^2$, as in $t \cdot g_t(\mathbf{L})$ from (8), and using $\mathbf{A}_t = \sum_{i=1}^{t} \mathbf{r}_i \mathbf{r}_i^{\mathrm{T}}$ and $\mathbf{B}_t = \sum_{i=1}^{t} (\mathbf{z}_i - \mathbf{e}_i) \mathbf{r}_i^{\mathrm{T}}$, gives (up to a constant)

$\begin{equation} \begin{gathered} \mathcal{F} = \frac{1}{2} \mathrm{tr} \left( \mathbf{L} (\mathbf{A}_t + \lambda_1 \mathbf{I}) \mathbf{L}^{\mathrm{T}} \right) - \mathrm{tr} \left( \mathbf{L}^{\mathrm{T}} \mathbf{B}_t \right) , \\ \frac{\partial \mathcal{F}}{\partial \mathbf{L}} = \mathbf{L} (\mathbf{A}_t + \lambda_1 \mathbf{I}) - \mathbf{B}_t . \end{gathered} \tag{15} \end{equation}$
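Applying block coordinate descent, the update for each column of $\mathbf{L}$ is

$\begin{equation} \mathbf{l}_j \leftarrow \mathbf{l}_j - \frac{1}{\tilde{\mathbf{A}}_{j,j}} (\mathbf{L} \tilde{\mathbf{a}}_j - \mathbf{b}_j) , \tag{16} \end{equation}$
where $\tilde{\mathbf{A}} = \mathbf{A}_t + \lambda_1 \mathbf{I}$, and $\tilde{\mathbf{a}}_j$ and $\mathbf{b}_j$ are the $j$-th columns of $\tilde{\mathbf{A}}$ and $\mathbf{B}_t$. This is the same update as (11).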
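The column update can be obtained from the gradient in (15) in two short steps. The $j$-th column of the gradient is $\mathbf{L}\tilde{\mathbf{a}}_j - \mathbf{b}_j$ with $\tilde{\mathbf{A}} = \mathbf{A}_t + \lambda_1 \mathbf{I}$; setting it to zero and solving for $\mathbf{l}_j$ while the other columns are held fixed gives the exact minimizer along that block. The superscript "new" below is notation introduced only for this worked step:

$\begin{equation} \begin{aligned} \mathbf{0} &= \sum_{k=1}^{r} \mathbf{l}_k \tilde{\mathbf{A}}_{k,j} - \mathbf{b}_j = \mathbf{l}_j^{\mathrm{new}} \tilde{\mathbf{A}}_{j,j} + \sum_{k \neq j} \mathbf{l}_k \tilde{\mathbf{A}}_{k,j} - \mathbf{b}_j , \\ \mathbf{l}_j^{\mathrm{new}} &= \frac{1}{\tilde{\mathbf{A}}_{j,j}} \left( \mathbf{b}_j - \sum_{k \neq j} \mathbf{l}_k \tilde{\mathbf{A}}_{k,j} \right) = \frac{1}{\tilde{\mathbf{A}}_{j,j}} \left( \mathbf{b}_j - \mathbf{L}\tilde{\mathbf{a}}_j + \mathbf{l}_j \tilde{\mathbf{A}}_{j,j} \right) = \mathbf{l}_j - \frac{1}{\tilde{\mathbf{A}}_{j,j}} \left( \mathbf{L}\tilde{\mathbf{a}}_j - \mathbf{b}_j \right) . \end{aligned} \end{equation}$

Since $\tilde{\mathbf{A}}_{j,j} \geq \lambda_1 > 0$, the division is always well defined.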

[1] E. J. Candes, X. Li, Y. Ma, and J. Wright. Robust principal component analysis? arXiv:0912.3599, 2009.
[2] Z. Lin, A. Ganesh, J. Wright, L. Wu, M. Chen, and Y. Ma. Fast convex optimization algorithms for exact recovery of a corrupted low-rank matrix. Computational Advances in Multi-Sensor Adaptive Processing (CAMSAP), 2009.
[3] Z. Lin, M. Chen, and Y. Ma. The augmented Lagrange multiplier method for exact recovery of corrupted low-rank matrices. arXiv:1009.5055, 2010.
[4] B. Recht, M. Fazel, and P. A. Parrilo. Guaranteed minimum-rank solutions of linear matrix equations via nuclear norm minimization. SIAM Review, 52(3):471–501, 2010.
[5] S. Burer and R. Monteiro. A nonlinear programming algorithm for solving semidefinite programs via low-rank factorization. Math. Program., 2003.
[6] B. Recht, M. Fazel, and P. A. Parrilo. Guaranteed minimum-rank solutions of linear matrix equations via nuclear norm minimization. SIAM Review, 52(3):471–501, 2010.
[7] Jasson Rennie and Nathan Srebro. Fast maximum margin matrix factorization for collaborative prediction. In ICML, 2005.
[8] D. P. Bertsekas. Nonlinear programming. Athena Scientific, 1999.

Notes: Online Robust PCA via Stochastic Optimization