CUR Decomposition

Prerequisite: Matrix Multiplication Sampling

Computing a matrix product the usual way:

$A\in M_{m\times n},\ B\in M_{n\times p}$
$AB=\sum_{k=1}^n A(:,k)B(k,:)$
For square matrices this costs $O(n^3)$, which is too slow; we want to estimate the product $AB$ by random sampling to reduce the time complexity.

Single sample:

Randomly pick one column of $A$ together with the corresponding row of $B$.
Suppose column $k$ of $A$ (and row $k$ of $B$) is picked with probability $p_k$, where $p_1+\cdots+p_n=1$, and define
$X=\dfrac{1}{p_k}A(:,k)B(k,:)$
Then
$E(X)=\sum_{k=1}^n p_k \cdot \frac{1}{p_k}A(:,k)B(k,:)=AB$
$Var(X) \xlongequal{def} \sum_{ij}Var(X_{ij}) = \sum_{ij}E(X_{ij}^2)-\sum_{ij}E^2(X_{ij})$

Note:
Under this definition of the variance of a matrix, we have $Var(X)=E\big(\|X-E(X)\|_F^2\big)$

First term:
$$\begin{aligned} \sum_{ij}E(X_{ij}^2) & = \sum_{ij}\sum_k p_k \cdot \dfrac{1}{p_k^2}A_{ik}^2 B_{kj}^2 \\ & = \sum_{k} \dfrac{1}{p_k} \Big(\sum_{i} A_{ik}^2 \Big) \Big( \sum_j B_{kj}^2 \Big) \\ & = \sum_{k} \dfrac{1}{p_k} \|A(:,k)\|^2 \|B(k,:)\|^2 \\ & \xlongequal{p_k=\frac{\|A(:,k)\|^2}{\|A\|_F^2}} \|A\|_F^2\sum_k\|B(k,:)\|^2 \\ & = \|A\|_F^2 \|B\|_F^2 \end{aligned}$$

Note:
For the second-to-last equality: by the Cauchy–Schwarz inequality, the sum is actually minimized when $p_k \propto \|A(:,k)\| \|B(k,:)\|$.
For computational convenience, however, we take $p_k \propto \|A(:,k)\|^2$, i.e. $p_k=\dfrac{\|A(:,k)\|^2}{\|A\|_F^2}$ (length-squared sampling).

Second term:
$\sum_{ij}E^2(X_{ij}) = \|E(X)\|_F^2 = \|AB\|_F^2$
Therefore
$Var(X) = \|A\|_F^2 \|B\|_F^2 - \|AB\|_F^2 \le \|A\|_F^2 \|B\|_F^2$
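
To make the single-sample estimator concrete, here is a minimal NumPy sketch (the helper name `one_sample` is my own): it draws one index $k$ with length-squared probabilities, forms the unbiased estimate $X$, and checks empirically that averaging many independent copies recovers $AB$.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, p = 50, 30, 40
A = rng.standard_normal((m, n))
B = rng.standard_normal((n, p))

# Length-squared sampling: p_k proportional to ||A(:,k)||^2
probs = np.sum(A**2, axis=0) / np.linalg.norm(A, 'fro')**2

def one_sample(A, B, probs, rng):
    """One unbiased single-sample estimate X = A(:,k) B(k,:) / p_k."""
    k = rng.choice(A.shape[1], p=probs)
    return np.outer(A[:, k], B[k, :]) / probs[k]

# Averaging N independent copies of X converges to AB (rate ~ 1/sqrt(N)).
N = 5000
X_bar = sum(one_sample(A, B, probs, rng) for _ in range(N)) / N
print(np.linalg.norm(A @ B - X_bar, 'fro') / np.linalg.norm(A @ B, 'fro'))
```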

Multiple samples:

Core idea: reduce the variance by sampling multiple times.
Randomly pick $s$ columns of $A$ to form $C \in M_{m\times s}$, and the corresponding $s$ rows of $B$ to form $R \in M_{s\times p}$.
Each sampled column and row is also scaled by an appropriate coefficient; concretely:
$C = \begin{bmatrix} \dfrac{A(:,k_1)}{\sqrt {sp_{k_1}}},\cdots,\dfrac{A(:,k_s)}{\sqrt {sp_{k_s}}} \end{bmatrix}$
$R = \begin{bmatrix} \dfrac{B(k_1,:)}{\sqrt {sp_{k_1}}} \\ \vdots \\ \dfrac{B(k_s,:)}{\sqrt {sp_{k_s}}} \end{bmatrix}$
where $k_1,\cdots,k_s$ are $s$ indices drawn independently from $\{1,2,\cdots,n\}$ according to the probabilities $p_k$
Then
$$\begin{aligned} CR & = \frac{1}{s} \sum_i \dfrac{1}{p_{k_i}}A(:,k_i)B(k_i,:) \\ & = \frac{1}{s}\sum_i X_i \end{aligned}$$
So $CR$ is just the average of $s$ independent copies of $X$, and accordingly
$Var(CR) = Var\Big(\frac{1}{s}\sum_i X_i\Big) = \frac{1}{s^2} \sum_i Var(X_i) \le \frac{1}{s} \|A\|_F^2 \|B\|_F^2$
Final result:
$Var(CR) = E\left(\|AB-CR\|_F^2\right) \le \frac{1}{s} \|A\|_F^2 \|B\|_F^2$
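
A minimal sketch of the multi-sample version (the helper name `sampled_product` is my own): it builds the scaled $C$ and $R$ above and compares the squared Frobenius error against the $\frac{1}{s}\|A\|_F^2\|B\|_F^2$ bound. The bound holds in expectation, so a single run fluctuates around it.

```python
import numpy as np

def sampled_product(A, B, s, rng):
    """Approximate AB by CR using s length-squared samples."""
    probs = np.sum(A**2, axis=0) / np.linalg.norm(A, 'fro')**2
    ks = rng.choice(A.shape[1], size=s, p=probs)   # i.i.d., with replacement
    scale = np.sqrt(s * probs[ks])
    C = A[:, ks] / scale                           # m x s, scaled columns of A
    R = B[ks, :] / scale[:, None]                  # s x p, scaled rows of B
    return C, R

rng = np.random.default_rng(0)
A = rng.standard_normal((100, 200))
B = rng.standard_normal((200, 80))
C, R = sampled_product(A, B, s=50, rng=rng)

err = np.linalg.norm(A @ B - C @ R, 'fro')**2
bound = np.linalg.norm(A, 'fro')**2 * np.linalg.norm(B, 'fro')**2 / 50
print(err, bound)   # E(err) <= bound; for Gaussian A, B the two are close
```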

CUR Decomposition

Motivation: Sketch of a Matrix

Let $A \in M_{m\times n}$
The time complexity of computing $Ax$ is $O(mn)$
But if we decompose $A$ into $CUR$, where $C \in M_{m\times s}, U \in M_{s \times r}, R \in M_{r\times n}$, then the time complexity becomes $O(ms+sr+rn)$, which is $O(m+n)$ if $s$ and $r$ are $O(1)$.
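
The speedup comes purely from associativity: evaluate $C(U(Rx))$ from right to left, so the $m\times n$ product $CUR$ is never materialized. A small illustration with random factors, just to show the call pattern:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, s, r = 2000, 3000, 20, 10
C = rng.standard_normal((m, s))
U = rng.standard_normal((s, r))
R = rng.standard_normal((r, n))
x = rng.standard_normal(n)

# Never form the m x n matrix CUR; multiply right to left instead:
y = C @ (U @ (R @ x))   # O(rn) + O(sr) + O(ms) = O(ms + sr + rn) work
```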

Comparison with SVD

Pros
  • faster
  • preserves the actual data $\Rightarrow$ easier to interpret than the linear combinations of the data produced by SVD
  • preserves some properties of the data, such as sparsity
Cons
  • less accurate approximation
  • weaker error bound

Intuitions

First Try: Identity Matrix

Just let $B=I$ and repeat the matrix multiplication sampling process.
However, $I$ is full rank and every row carries the same weight, so sampling loses too much information, and the resulting error bound is too weak:
$E\left( \|AI-CR\|_F^2 \right) \le \frac{1}{s} \|A\|_F^2 \|I\|_F^2 = \frac{n}{s} \|A\|_F^2$
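
A quick numerical illustration of why $B=I$ fails (variable names are my own): $CR$ is supported on only $s$ of the $n$ columns, so most of $A$ is simply missed.

```python
import numpy as np

rng = np.random.default_rng(0)
n, s = 200, 20
A = rng.standard_normal((50, n))
I = np.eye(n)

probs = np.sum(A**2, axis=0) / np.linalg.norm(A, 'fro')**2
ks = rng.choice(n, size=s, p=probs)
scale = np.sqrt(s * probs[ks])
C = A[:, ks] / scale
R = I[ks, :] / scale[:, None]

# CR is nonzero in at most s of the n columns, so the error is huge:
ratio = np.linalg.norm(A - C @ R, 'fro')**2 / np.linalg.norm(A, 'fro')**2
print(ratio)   # expectation is (n-1)/s ~ 10: worse than approximating A by 0
```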

Projection Matrix

We want a low-rank $B$, so that sampling won't cause information loss.
Consider the projection matrix $P = R^T(RR^T)^{-1}R$, where $R \in M_{r\times n}$ is a matrix of $r$ rows of $A$ picked according to length-squared sampling. $P$ acts like a pseudo-identity.

Note: Properties of the projection matrix
$Px=x$ if $x$ is in the row space of $R$
$Px=0$ if $x$ is orthogonal to the row space of $R$
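
These two properties are easy to sanity-check numerically. The sketch below (my own construction) builds $P$ from a random full-row-rank $R$ and verifies idempotence plus the two cases above:

```python
import numpy as np

rng = np.random.default_rng(0)
r, n = 3, 8
R = rng.standard_normal((r, n))          # full row rank with probability 1
P = R.T @ np.linalg.inv(R @ R.T) @ R     # projection onto the row space of R

x_in = R.T @ rng.standard_normal(r)      # lies in the row space of R
x_any = rng.standard_normal(n)
x_perp = x_any - P @ x_any               # component orthogonal to the row space

print(np.allclose(P @ P, P))             # idempotent: P^2 = P
print(np.allclose(P @ x_in, x_in))       # P fixes the row space
print(np.allclose(P @ x_perp, 0))        # P annihilates the orthogonal part
```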

Main Theorem

Let $A\in M_{m\times n},\ r,s \in \mathbb N^+$
Let $C\in M_{m\times s}$ be a matrix of $s$ columns of $A$ picked according to length-squared sampling, and $R \in M_{r\times n}$ be a matrix of $r$ rows of $A$ picked, again, according to length-squared sampling.
Then we can find a $U \in M_{s\times r}$, so that
$E\left(\|A-CUR\|_2^2\right) \le \|A\|_F^2 \left( \frac{1}{\sqrt{r}} + \frac{r}{s} \right)$

Note:
Unlike in matrix multiplication sampling, the sampled rows need not correspond to the sampled columns.

And we find $U$ in this way:
We want the rows of $UR$ to be the rows of $P=R^T(RR^T)^{-1}R$ corresponding to the columns sampled into $C$ (scaled the same way as those columns), so we pick those rows of $R^T$ and right-multiply the resulting matrix by $(RR^T)^{-1}$; this gives $U$.

Proof

$\|A-CUR\|_2 \le \|A-AP\|_2 + \|AP-CUR\|_2$

Note:
The two-norm of a matrix $A$ is $\|A\|_2 \xlongequal{def} \max\limits_{\|x\|=1}\|Ax\|$,
which satisfies the triangle inequality $\|A+B\|_2\le\|A\|_2+\|B\|_2$

first part:
$\|A-AP\|_2^2=\max\limits_{\|x\|=1} \|Ax-APx\|^2$
For $x$ in the row space of $R$, $\|Ax-APx\| = 0$, so it suffices to consider unit vectors $x \perp$ the row space of $R$ (for which $Rx=0$, hence $R^TRx=0$):
$$\begin{aligned} \|Ax-APx\|^2 & = \|Ax\|^2 \\ & = x^TA^TAx=x^T(A^TA-R^TR)x \\ & \le \|x\|\,\|(A^TA-R^TR)x\| \\ & \le \|A^TA-R^TR\|_2 \le \|A^TA-R^TR\|_F \end{aligned}$$
so, taking expectations, $E\left(\|A-AP\|_2^2\right) \le E\|A^TA-R^TR\|_F \le \sqrt{E\left(\|A^TA-R^TR\|_F^2\right)} \le \dfrac{\|A\|_F^2}{\sqrt r}$

Note:
The last step uses $E\left(\|AB-CR\|_F^2\right) \le \dfrac{1}{s} \|A\|_F^2 \|B\|_F^2$ with $A^T$ in place of $A$, $A$ in place of $B$, and $r$ samples in place of $s$, which gives $E\left(\|A^TA-R^TR\|_F^2\right) \le \dfrac{\|A\|_F^4}{r}$; the middle step is Jensen's inequality.

second part:
$$\begin{aligned} E\left(\|AP-CUR\|_2^2\right) & \le E\left(\|AP-CUR\|_F^2\right)\\ & \le \dfrac{1}{s} \|A\|_F^2 \|P\|_F^2 \\ & \le \frac{r}{s}\|A\|_F^2 \end{aligned}$$

Note:
The last step comes from $\|P\|_F^2 \le r$.
That is because $\|P\|_F^2$ equals the sum of the squared singular values of $P$; since $P$ projects onto the row space of $R$, it has at most $r$ nonzero singular values, each at most $1$ (as $\|Pv\| \le \|v\|$ for any $v$, $P$ being a projection).

Computation Step by Step

  1. Pick $s$ columns of $A$ according to length-squared sampling; they form $C$.
  2. Pick $r$ rows of $A$ according to length-squared sampling; they form $R$.
  3. Compute $(RR^T)^{-1}$.
  4. Pick the $s$ rows of $R^T$ corresponding to the columns chosen in step 1, and right-multiply the resulting matrix by $(RR^T)^{-1}$; this forms $U$ (see the sketch below).
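
Putting the four steps together, here is a minimal runnable sketch. The function name `cur_decomposition` is my own, `np.linalg.pinv` stands in for the inverse to cover the singular case discussed in the note below, and the $1/\sqrt{sp_k}$ scaling of $U$'s rows mirrors the scaling baked into $C$, per the sampled-product construction.

```python
import numpy as np

def cur_decomposition(A, s, r, rng):
    """Minimal sketch of steps 1-4 above; not a tuned implementation."""
    fro2 = np.linalg.norm(A, 'fro')**2
    col_p = np.sum(A**2, axis=0) / fro2   # length-squared column probabilities
    row_p = np.sum(A**2, axis=1) / fro2   # length-squared row probabilities

    # Step 1: s scaled columns of A form C (m x s)
    cols = rng.choice(A.shape[1], size=s, p=col_p)
    C = A[:, cols] / np.sqrt(s * col_p[cols])

    # Step 2: r scaled rows of A form R (r x n)
    rows = rng.choice(A.shape[0], size=r, p=row_p)
    R = A[rows, :] / np.sqrt(r * row_p[rows])[:, None]

    # Steps 3-4: U = (rows of R^T matching step 1, scaled like C) (R R^T)^+
    W = R[:, cols].T / np.sqrt(s * col_p[cols])[:, None]   # s x r
    U = W @ np.linalg.pinv(R @ R.T)                        # s x r
    return C, U, R

rng = np.random.default_rng(0)
A = rng.standard_normal((200, 5)) @ rng.standard_normal((5, 300))  # low rank
C, U, R = cur_decomposition(A, s=60, r=30, rng=rng)

err2 = np.linalg.norm(A - C @ U @ R, 2)**2
bound = np.linalg.norm(A, 'fro')**2 * (1/np.sqrt(30) + 30/60)
print(err2, bound)   # the theorem bounds err2 by bound in expectation
```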

Note:
What if the matrix $RR^T$ doesn't have an inverse?
Use the pseudoinverse, which can be represented and computed via a matrix decomposition such as the SVD.
As for why the pseudoinverse works, see this passage on the intuition behind the pseudoinverse.
In any case, $RR^T$ is an $r\times r$ matrix, so this step is computationally cheap.
