CUR Decomposition

Prerequisite: Matrix Multiplication Sampling

Computing a matrix product the usual way:

$A\in M_{m\times n},\ B\in M_{n\times p}$
$AB=\sum_{k=1}^n A(:,k)B(k,:)$
For square matrices this costs $O(n^3)$, which is too slow; we want to estimate the product $AB$ by random sampling to reduce the time complexity.

Single sample:

Randomly pick one column of $A$ together with the corresponding row of $B$.
Suppose column $k$ of $A$ (and row $k$ of $B$) is picked with probability $p_k$, where $p_1+\cdots+p_n=1$, and define
$X=\dfrac{1}{p_k}A(:,k)B(k,:)$
Then
$E(X)=\sum_{k=1}^n p_k \cdot \frac{1}{p_k}A(:,k)B(k,:)=AB$
$Var(X) \xlongequal{def} \sum_{ij}Var(X_{ij}) = \sum_{ij}E(X_{ij}^2)-\sum_{ij}E^2(X_{ij})$

Note:
Under this definition of the variance of a matrix, we have $Var(X)=E\big(\|X-E(X)\|_F^2\big)$

First term:
$$\begin{aligned} \sum_{ij}E(X_{ij}^2) & = \sum_{ij}\sum_k p_k \cdot \dfrac{1}{p_k^2}A_{ik}^2 B_{kj}^2 \\ & = \sum_{k} \dfrac{1}{p_k} \Big(\sum_{i} A_{ik}^2 \Big) \Big( \sum_j B_{kj}^2 \Big) \\ & = \sum_{k} \dfrac{1}{p_k} \|A(:,k)\|^2 \|B(k,:)\|^2 \\ & \xlongequal{p_k=\frac{\|A(:,k)\|^2}{\|A\|_F^2}} \|A\|_F^2\sum_k\|B(k,:)\|^2 \\ & = \|A\|_F^2 \|B\|_F^2 \end{aligned}$$

Note:
For the second-to-last equality: by the Cauchy–Schwarz inequality, the sum is actually minimized when $p_k \propto \|A(:,k)\| \|B(k,:)\|$.
For computational convenience, however, we take $p_k \propto \|A(:,k)\|^2$, i.e. $p_k=\dfrac{\|A(:,k)\|^2}{\|A\|_F^2}$ (length-squared sampling).

Second term:
$\sum_{ij}E^2(X_{ij}) = \|E(X)\|_F^2 = \|AB\|_F^2$
Therefore
$Var(X) = \|A\|_F^2 \|B\|_F^2 - \|AB\|_F^2 \le \|A\|_F^2 \|B\|_F^2$
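
To make the single-sample estimator concrete, here is a minimal NumPy sketch (the helper name `one_sample` is my own): it draws one index $k$ with length-squared probabilities, forms the unbiased estimate $X$, and checks empirically that averaging many independent copies recovers $AB$.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, p = 50, 30, 40
A = rng.standard_normal((m, n))
B = rng.standard_normal((n, p))

# Length-squared sampling: p_k proportional to ||A(:,k)||^2
probs = np.sum(A**2, axis=0) / np.linalg.norm(A, 'fro')**2

def one_sample(A, B, probs, rng):
    """One unbiased single-sample estimate X = A(:,k) B(k,:) / p_k."""
    k = rng.choice(A.shape[1], p=probs)
    return np.outer(A[:, k], B[k, :]) / probs[k]

# Averaging N independent copies of X converges to AB (rate ~ 1/sqrt(N)).
N = 5000
X_bar = sum(one_sample(A, B, probs, rng) for _ in range(N)) / N
print(np.linalg.norm(A @ B - X_bar, 'fro') / np.linalg.norm(A @ B, 'fro'))
```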

Multiple samples:

Core idea: reduce the variance by sampling multiple times.
Randomly pick $s$ columns of $A$ to form $C \in M_{m\times s}$, and the corresponding $s$ rows of $B$ to form $R \in M_{s\times p}$.
Each sampled column and row is also scaled by an appropriate coefficient; concretely:
$C = \begin{bmatrix} \dfrac{A(:,k_1)}{\sqrt {sp_{k_1}}},\cdots,\dfrac{A(:,k_s)}{\sqrt {sp_{k_s}}} \end{bmatrix}$
$R = \begin{bmatrix} \dfrac{B(k_1,:)}{\sqrt {sp_{k_1}}} \\ \vdots \\ \dfrac{B(k_s,:)}{\sqrt {sp_{k_s}}} \end{bmatrix}$
where $k_1,\cdots,k_s$ are $s$ indices drawn independently from $\{1,2,\cdots,n\}$ according to the probabilities $p_k$
Then
$$\begin{aligned} CR & = \frac{1}{s} \sum_i \dfrac{1}{p_{k_i}}A(:,k_i)B(k_i,:) \\ & = \frac{1}{s}\sum_i X_i \end{aligned}$$
So $CR$ is just the average of $s$ independent copies of $X$, and accordingly
$Var(CR) = Var\Big(\frac{1}{s}\sum_i X_i\Big) = \frac{1}{s^2} \sum_i Var(X_i) \le \frac{1}{s} \|A\|_F^2 \|B\|_F^2$
Final result:
$Var(CR) = E\left(\|AB-CR\|_F^2\right) \le \frac{1}{s} \|A\|_F^2 \|B\|_F^2$
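
A minimal sketch of the multi-sample version (the helper name `sampled_product` is my own): it builds the scaled $C$ and $R$ above and compares the squared Frobenius error against the $\frac{1}{s}\|A\|_F^2\|B\|_F^2$ bound. The bound holds in expectation, so a single run fluctuates around it.

```python
import numpy as np

def sampled_product(A, B, s, rng):
    """Approximate AB by CR using s length-squared samples."""
    probs = np.sum(A**2, axis=0) / np.linalg.norm(A, 'fro')**2
    ks = rng.choice(A.shape[1], size=s, p=probs)   # i.i.d., with replacement
    scale = np.sqrt(s * probs[ks])
    C = A[:, ks] / scale                           # m x s, scaled columns of A
    R = B[ks, :] / scale[:, None]                  # s x p, scaled rows of B
    return C, R

rng = np.random.default_rng(0)
A = rng.standard_normal((100, 200))
B = rng.standard_normal((200, 80))
C, R = sampled_product(A, B, s=50, rng=rng)

err = np.linalg.norm(A @ B - C @ R, 'fro')**2
bound = np.linalg.norm(A, 'fro')**2 * np.linalg.norm(B, 'fro')**2 / 50
print(err, bound)   # E(err) <= bound; for Gaussian A, B the two are close
```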

CUR Decomposition

Motivation: Sketch of a Matrix

Let $A \in M_{m\times n}$
The time complexity of computing $Ax$ is $O(mn)$
But if we decompose $A$ into $CUR$, where $C \in M_{m\times s}, U \in M_{s \times r}, R \in M_{r\times n}$, then the time complexity becomes $O(ms+sr+rn)$, which is $O(m+n)$ if $s$ and $r$ are $O(1)$.
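
The speedup comes purely from associativity: evaluate $C(U(Rx))$ from right to left, so the $m\times n$ product $CUR$ is never materialized. A small illustration with random factors, just to show the call pattern:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, s, r = 2000, 3000, 20, 10
C = rng.standard_normal((m, s))
U = rng.standard_normal((s, r))
R = rng.standard_normal((r, n))
x = rng.standard_normal(n)

# Never form the m x n matrix CUR; multiply right to left instead:
y = C @ (U @ (R @ x))   # O(rn) + O(sr) + O(ms) = O(ms + sr + rn) work
```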

Comparison with SVD

Pros
  • faster
  • preserves the actual data $\Rightarrow$ easier to interpret than the linear combinations of the data produced by SVD
  • preserves some properties of the data, such as sparsity
Cons
  • less accurate approximation
  • weaker error bound

Intuitions

First Try: Identity Matrix

Just let $B=I$ and repeat the matrix multiplication sampling process.
However, $I$ is full rank and every row carries the same weight, so sampling loses too much information, and the resulting error bound is too weak:
$E\left( \|AI-CR\|_F^2 \right) \le \frac{1}{s} \|A\|_F^2 \|I\|_F^2 = \frac{n}{s} \|A\|_F^2$
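
A quick numerical illustration of why $B=I$ fails (variable names are my own): $CR$ is supported on only $s$ of the $n$ columns, so most of $A$ is simply missed.

```python
import numpy as np

rng = np.random.default_rng(0)
n, s = 200, 20
A = rng.standard_normal((50, n))
I = np.eye(n)

probs = np.sum(A**2, axis=0) / np.linalg.norm(A, 'fro')**2
ks = rng.choice(n, size=s, p=probs)
scale = np.sqrt(s * probs[ks])
C = A[:, ks] / scale
R = I[ks, :] / scale[:, None]

# CR is nonzero in at most s of the n columns, so the error is huge:
ratio = np.linalg.norm(A - C @ R, 'fro')**2 / np.linalg.norm(A, 'fro')**2
print(ratio)   # expectation is (n-1)/s ~ 10: worse than approximating A by 0
```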

Projection Matrix

We want a low-rank $B$, so that sampling won't cause information loss.
Consider the projection matrix $P = R^T(RR^T)^{-1}R$, where $R \in M_{r\times n}$ is a matrix of $r$ rows of $A$ picked according to length-squared sampling. $P$ acts like a pseudo-identity.

Note: Properties of the projection matrix
$Px=x$ if $x$ is in the row space of $R$
$Px=0$ if $x$ is orthogonal to the row space of $R$
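
These two properties are easy to sanity-check numerically. The sketch below (my own construction) builds $P$ from a random full-row-rank $R$ and verifies idempotence plus the two cases above:

```python
import numpy as np

rng = np.random.default_rng(0)
r, n = 3, 8
R = rng.standard_normal((r, n))          # full row rank with probability 1
P = R.T @ np.linalg.inv(R @ R.T) @ R     # projection onto the row space of R

x_in = R.T @ rng.standard_normal(r)      # lies in the row space of R
x_any = rng.standard_normal(n)
x_perp = x_any - P @ x_any               # component orthogonal to the row space

print(np.allclose(P @ P, P))             # idempotent: P^2 = P
print(np.allclose(P @ x_in, x_in))       # P fixes the row space
print(np.allclose(P @ x_perp, 0))        # P annihilates the orthogonal part
```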

Main Theorem

Let $A\in M_{m\times n},\ r,s \in \mathbb N^+$
Let $C\in M_{m\times s}$ be a matrix of $s$ columns of $A$ picked according to length-squared sampling, and $R \in M_{r\times n}$ be a matrix of $r$ rows of $A$ picked, again, according to length-squared sampling.
Then we can find a $U \in M_{s\times r}$, so that
$E\left(\|A-CUR\|_2^2\right) \le \|A\|_F^2 \left( \frac{1}{\sqrt{r}} + \frac{r}{s} \right)$

Note:
Unlike in matrix multiplication sampling, the sampled rows need not correspond to the sampled columns.

And we find $U$ in this way:
We want the rows of $UR$ to be the rows of $P=R^T(RR^T)^{-1}R$ corresponding to the columns sampled into $C$ (scaled the same way as those columns), so we pick those rows of $R^T$ and right-multiply the resulting matrix by $(RR^T)^{-1}$; this gives $U$.

Proof

$\|A-CUR\|_2 \le \|A-AP\|_2 + \|AP-CUR\|_2$

Note:
The two-norm of a matrix $A$ is $\|A\|_2 \xlongequal{def} \max\limits_{\|x\|=1}\|Ax\|$,
which satisfies the triangle inequality $\|A+B\|_2\le\|A\|_2+\|B\|_2$

first part:
$\|A-AP\|_2^2=\max\limits_{\|x\|=1} \|Ax-APx\|^2$
For $x$ in the row space of $R$, $\|Ax-APx\| = 0$, so it suffices to consider unit vectors $x \perp$ the row space of $R$ (for which $Rx=0$, hence $R^TRx=0$):
$$\begin{aligned} \|Ax-APx\|^2 & = \|Ax\|^2 \\ & = x^TA^TAx=x^T(A^TA-R^TR)x \\ & \le \|x\|\,\|(A^TA-R^TR)x\| \\ & \le \|A^TA-R^TR\|_2 \le \|A^TA-R^TR\|_F \end{aligned}$$
so, taking expectations, $E\left(\|A-AP\|_2^2\right) \le E\|A^TA-R^TR\|_F \le \sqrt{E\left(\|A^TA-R^TR\|_F^2\right)} \le \dfrac{\|A\|_F^2}{\sqrt r}$

Note:
The last step uses $E\left(\|AB-CR\|_F^2\right) \le \dfrac{1}{s} \|A\|_F^2 \|B\|_F^2$ with $A^T$ in place of $A$, $A$ in place of $B$, and $r$ samples in place of $s$, which gives $E\left(\|A^TA-R^TR\|_F^2\right) \le \dfrac{\|A\|_F^4}{r}$; the middle step is Jensen's inequality.

second part:
$$\begin{aligned} E\left(\|AP-CUR\|_2^2\right) & \le E\left(\|AP-CUR\|_F^2\right)\\ & \le \dfrac{1}{s} \|A\|_F^2 \|P\|_F^2 \\ & \le \frac{r}{s}\|A\|_F^2 \end{aligned}$$

Note:
The last step comes from $\|P\|_F^2 \le r$.
That is because $\|P\|_F^2$ equals the sum of the squared singular values of $P$; since $P$ projects onto the row space of $R$, it has at most $r$ nonzero singular values, each at most $1$ (as $\|Pv\| \le \|v\|$ for any $v$, $P$ being a projection).

Computation Step by Step

  1. Pick $s$ columns of $A$ according to length-squared sampling; they form $C$.
  2. Pick $r$ rows of $A$ according to length-squared sampling; they form $R$.
  3. Compute $(RR^T)^{-1}$.
  4. Pick the $s$ rows of $R^T$ corresponding to the columns chosen in step 1, and right-multiply the resulting matrix by $(RR^T)^{-1}$; this forms $U$ (see the sketch below).
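
Putting the four steps together, here is a minimal runnable sketch. The function name `cur_decomposition` is my own, `np.linalg.pinv` stands in for the inverse to cover the singular case discussed in the note below, and the $1/\sqrt{sp_k}$ scaling of $U$'s rows mirrors the scaling baked into $C$, per the sampled-product construction.

```python
import numpy as np

def cur_decomposition(A, s, r, rng):
    """Minimal sketch of steps 1-4 above; not a tuned implementation."""
    fro2 = np.linalg.norm(A, 'fro')**2
    col_p = np.sum(A**2, axis=0) / fro2   # length-squared column probabilities
    row_p = np.sum(A**2, axis=1) / fro2   # length-squared row probabilities

    # Step 1: s scaled columns of A form C (m x s)
    cols = rng.choice(A.shape[1], size=s, p=col_p)
    C = A[:, cols] / np.sqrt(s * col_p[cols])

    # Step 2: r scaled rows of A form R (r x n)
    rows = rng.choice(A.shape[0], size=r, p=row_p)
    R = A[rows, :] / np.sqrt(r * row_p[rows])[:, None]

    # Steps 3-4: U = (rows of R^T matching step 1, scaled like C) (R R^T)^+
    W = R[:, cols].T / np.sqrt(s * col_p[cols])[:, None]   # s x r
    U = W @ np.linalg.pinv(R @ R.T)                        # s x r
    return C, U, R

rng = np.random.default_rng(0)
A = rng.standard_normal((200, 5)) @ rng.standard_normal((5, 300))  # low rank
C, U, R = cur_decomposition(A, s=60, r=30, rng=rng)

err2 = np.linalg.norm(A - C @ U @ R, 2)**2
bound = np.linalg.norm(A, 'fro')**2 * (1/np.sqrt(30) + 30/60)
print(err2, bound)   # the theorem bounds err2 by bound in expectation
```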

Note:
What if the matrix $RR^T$ doesn't have an inverse?
Use the pseudoinverse, which can be represented and computed via a matrix decomposition such as the SVD.
As for why the pseudoinverse works, see this passage on the intuition behind the pseudoinverse.
In any case, $RR^T$ is an $r\times r$ matrix, so this step is computationally cheap.
