SVD vs PCA vs 1bitMC

Eigendecomposition

For any real symmetric $d \times d$ matrix $A$, we can find its eigenvalues $\lambda_1 \ge \lambda_2 \ge ... \ge \lambda_d$ and corresponding orthonormal eigenvectors $x_1, ..., x_d$, such that:

$$Ax_1 = \lambda_1 x_1$$

$$Ax_2 = \lambda_2 x_2$$

$$\vdots$$

$$Ax_d = \lambda_d x_d$$

  • Suppose only $r$ of the eigenvalues are non-zero ($r = \operatorname{rank}(A)$), so we have

$$Ax_1 = \lambda_1 x_1$$

$$Ax_2 = \lambda_2 x_2$$

$$\vdots$$

$$Ax_r = \lambda_r x_r$$

which can be written in a compact form (for the orthogonal eigenvector matrix $U$, $U^T = U^{-1}$ and $UU^T = I$):

$$A=U\Sigma U^T \tag{1}$$

$$\Sigma=U^TAU \tag{2}$$
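As a quick sanity check, here is a minimal numpy sketch (the matrix and all variable names are made up for illustration) that eigendecomposes a random real symmetric matrix and verifies formulas (1) and (2):

```python
# Eigendecompose a random real symmetric matrix and check A = U Sigma U^T and Sigma = U^T A U.
import numpy as np

rng = np.random.default_rng(0)
d = 4
B = rng.standard_normal((d, d))
A = (B + B.T) / 2                        # real symmetric d x d matrix

lam, U = np.linalg.eigh(A)               # eigenvalues (ascending) and orthonormal eigenvectors
lam, U = lam[::-1], U[:, ::-1]           # reorder so that lambda_1 >= ... >= lambda_d
Sigma = np.diag(lam)

print(np.allclose(A, U @ Sigma @ U.T))   # formula (1)
print(np.allclose(Sigma, U.T @ A @ U))   # formula (2)
print(np.allclose(U @ U.T, np.eye(d)))   # U U^T = I, i.e. U^T = U^{-1}
```
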
Now use the matrix $A$ above to transform a vector $x$:

$$\begin{aligned} Ax & =U\Sigma U^Tx \\ & = \left[ \begin{matrix} x_1 & x_2 & \cdots & x_r \end{matrix} \right] \left[ \begin{matrix} \lambda_1 & & \\ & \ddots & \\ & & \lambda_r \end{matrix} \right] \left[ \begin{matrix} x_1^T \\ x_2^T\\ \vdots\\ x_r^T \end{matrix} \right] x \end{aligned}$$

Start the calculation from right to left:

  1. Re-orient: applying $U^T$ to $x$ is a dot product of $x$ with each row of $U^T$, i.e. $x$ is projected onto each orthonormal basis vector (the new coordinates), which are the rows of $U^T$. Geometrically, $U^Tx$ rotates $x$ into the new coordinate system (orthonormal basis) given by $U^T$.
  2. $\Sigma U^T x$ uses $\Sigma$ to stretch or shrink the rotated $x$ along each new coordinate. (If some $\lambda_i = 0$, that dimension is removed.)
  3. Re-orient (back to the original): by the same reasoning as for $U^T$, multiplying by $U$ rotates the result back to the original coordinate system.
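The three steps above can be checked numerically; the following sketch (with made-up data and names) applies them one at a time and compares the result against $Ax$:

```python
# Reading Ax = U Sigma U^T x from right to left, step by step.
import numpy as np

rng = np.random.default_rng(0)
B = rng.standard_normal((4, 4))
A = (B + B.T) / 2                  # real symmetric matrix
lam, U = np.linalg.eigh(A)
Sigma = np.diag(lam)

x = rng.standard_normal(4)
step1 = U.T @ x                    # 1. re-orient: coordinates of x in the eigenvector basis
step2 = Sigma @ step1              # 2. stretch/shrink each coordinate by its eigenvalue
step3 = U @ step2                  # 3. rotate back to the original coordinate system
print(np.allclose(step3, A @ x))   # the three steps reproduce Ax
```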

PCA

Main idea: find a coordinate transformation matrix P applied to the data such that
(An intuitive view: we want the projected values to be as spread out as possible. The more spread out the data, the stronger the separability; and the stronger the separability, the more completely the probability distribution is preserved.)

  1. The variability of the transformed data is explained as much as possible along the new coordinates (the information loss after dimensionality reduction is as small as possible, preserving the probability distribution of the original samples as much as possible).

  2. The new coordinates should be orthogonal to each other to avoid redundancy (the bases after dimensionality reduction are mutually orthogonal).

  3. So the final objective function is:
    $$\argmin_{P \in R^{m,d},\, U\in R^{d,m}} \sum^{n}_{i=1} ||x_i - UPx_i||^2_2 \tag{*}$$
    where d is the original dimension of the data and m is the new dimension of the transformed data ($r=\operatorname{rank}(A) \le m$), $x_1,..., x_n \in R^d$, $P\in R^{m, d}$ is the transformation into the new coordinates, and $U \in R^{d,m}$ (in the end $U = P^T$) is the transformation back to the original coordinates.

  4. To solve the above problem, the essential idea is to maximize the variance of the transformed data points projected onto the new coordinates, and at the same time minimize the covariance between any two different coordinates (orthogonal <==> covariance is 0). Equivalently, we can diagonalize the symmetric covariance matrix of the transformed data (a short numerical sketch follows this list).
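A hedged numpy sketch of objective (*) (the data, dimensions, and variable names below are made up for illustration): keeping the top-$m$ eigenvectors of the covariance matrix as the rows of $P$, with $U = P^T$, minimizes the reconstruction error, and the residual equals $n$ times the sum of the discarded eigenvalues.

```python
# Sketch of objective (*): P has the top-m eigenvectors of the covariance as rows, U = P^T.
import numpy as np

rng = np.random.default_rng(0)
d, n, m = 5, 500, 2
X = rng.standard_normal((d, 3)) @ rng.standard_normal((3, n))   # d x n data, columns are x_i
X = X - X.mean(axis=1, keepdims=True)                           # center each feature (row)

A = X @ X.T / n                           # d x d covariance matrix
lam, vecs = np.linalg.eigh(A)
lam, vecs = lam[::-1], vecs[:, ::-1]      # eigenpairs in decreasing order

P = vecs[:, :m].T                         # m x d: rows are the top-m eigenvectors
U = P.T                                   # d x m: maps projections back to R^d
recon_err = np.sum((X - U @ P @ X) ** 2)  # sum_i ||x_i - U P x_i||^2

# The minimum of (*) equals n times the sum of the discarded eigenvalues.
print(recon_err, n * lam[m:].sum())
```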

  • Example (Let d =2).

Put the data into a matrix X:

$$X=\left[ \begin{matrix} x_1, x_2, ...,x_n \end{matrix} \right] =\left[ \begin{matrix} a_1 & a_2 & ... &a_n\\ b_1 & b_2 & ... &b_n \end{matrix} \right] \in R^{d=2,\, n}$$

Suppose the data has been centered along each feature (each row of X has zero mean). The covariance matrix is then:

$$A=\frac{1}{n}XX^T= \left[ \begin{matrix} \frac{1}{n}\sum_{i=1}^{n} a^2_i & \frac{1}{n}\sum_{i=1}^{n} a_ib_i\\ \frac{1}{n}\sum_{i=1}^{n} a_ib_i & \frac{1}{n}\sum_{i=1}^{n} b^2_i \end{matrix} \right]$$

The data transformed by matrix $P$ is $Y = PX$, where $P\in R^{m,\,d=2}$. The covariance matrix of the transformed data is:

$$\begin{aligned} \frac{1}{n}YY^T & = \frac{1}{n}(PX)(PX)^T \\ & = \frac{1}{n}PXX^TP^T \\ & = P\left(\frac{1}{n}XX^T\right)P^T \\ & = PAP^T \end{aligned}$$

Once the minimization (*) is done, $\frac{1}{n}YY^T$ will look like $\Sigma$. The underlying technique used to achieve this is formula (2), so that
$$\begin{aligned} PAP^T & = \Sigma \\ & = \left[ \begin{matrix} \lambda_1 & & \\ & \ddots & \\ & & \lambda_m \end{matrix} \right] \end{aligned}$$

equivalently, $A=P^T \Sigma P$,

where $\lambda_1 =\sigma_1^2 \ge \lambda_2=\sigma_2^2 \ge ... \ge \lambda_m=\sigma_m^2$, and their corresponding eigenvectors form the rows of the transformation matrix $P=\left[ \begin{matrix} u_1^T\\ u_2^T\\ \vdots\\ u_m^T \end{matrix} \right] \in R^{m,\, d=2}$. Depending on the problem, we can keep only the top $r \le m$ rows of P.
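The d = 2 example above can be reproduced numerically; in the sketch below (with made-up correlated data), $PAP^T$ comes out diagonal and $A = P^T\Sigma P$ holds:

```python
# d = 2 sketch: diagonalize the covariance matrix A with the eigenvector matrix P.
import numpy as np

rng = np.random.default_rng(1)
n = 1000
a = rng.standard_normal(n)
b = 0.8 * a + 0.3 * rng.standard_normal(n)   # second feature correlated with the first
X = np.vstack([a, b])                        # 2 x n, columns are data points
X = X - X.mean(axis=1, keepdims=True)        # center each feature

A = X @ X.T / n                              # 2 x 2 covariance matrix
lam, vecs = np.linalg.eigh(A)
lam, vecs = lam[::-1], vecs[:, ::-1]         # lambda_1 >= lambda_2
P = vecs.T                                   # rows of P are the eigenvectors u_i^T

Sigma = P @ A @ P.T                          # should be diag(lambda_1, lambda_2)
print(np.round(Sigma, 6))
print(np.allclose(A, P.T @ Sigma @ P))       # equivalently, A = P^T Sigma P
```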


SVD

Here $A$ is in general not a symmetric matrix, but an $m \times n$ matrix.
Note (SVD): the original feature dimension is n (the number of columns); the new feature dimension is m.
Note (PCA above): the original feature dimension is the number of rows of X (d in the notation above).

Formula of the Singular Value Decomposition:
$$A_{m,n} = U \Sigma V^T$$

The main work of SVD is to solve the following two tasks:

  1. First, find a set of orthonormal basis vectors of the n-dimensional space.
  2. Second, after applying A, the resulting vectors in the m-dimensional space should still be orthogonal.

Suppose we already have such an orthonormal basis of the n-dimensional space, $V_{n,n} = \left[ v_1, v_2, ..., v_n \right]$, where $v_i \perp v_j$ for $i \ne j$ (note: some of the $Av_i$ can be zero vectors).

Applying A to these basis vectors, we get:
$$\left[ Av_1, Av_2, ..., Av_n \right]$$

  1. To make the resulting vectors orthogonal as well, we need the following to hold:
    $$\begin{aligned} Av_i \cdot Av_j & = (Av_i)^T Av_j \\ & = v_i^T A^T A v_j \\ & = 0 \end{aligned}$$
    Therefore, if the $v_i$ are eigenvectors of $A^TA$, the resulting vectors are indeed orthogonal, because:
    $$\begin{aligned} Av_i \cdot Av_j & = (Av_i)^T Av_j \\ & = v_i^T (A^T A v_j) \\ & = \lambda_j v_i^T v_j \\ & = 0 \end{aligned}$$

  2. To scale the resulting orthogonal vectors to unit length, we have
    $$\begin{aligned} u_i & = \frac{Av_i}{|Av_i|} \\ & = \frac{Av_i}{\sqrt{|Av_i|^2}} \\ & = \frac{Av_i}{\sqrt{(Av_i)^TAv_i}} \\ & = \frac{Av_i}{\sqrt{v_i^T(A^TAv_i)}} \\ & = \frac{Av_i}{\sqrt{\lambda_i v_i^Tv_i}} \quad (\text{note: } v_i^Tv_i=1) \\ & = \frac{Av_i}{\sqrt{\lambda_i}} \end{aligned}$$

So $u_i \sqrt{\lambda_i}= u_i \sigma_i = Av_i$, where $\sigma_i = \sqrt{\lambda_i}$ is called a singular value, $1 \le i \le r$, $r=\operatorname{rank}(A)$.

  3. In the end, we expand $\left[ u_1, u_2, ..., u_r\right]$ to $\left[ u_1, u_2, ..., u_r \,|\, u_{r+1}, ..., u_m\right]$ and pick $\left[ v_{r+1}, v_{r+2}, ..., v_{n}\right]$ from the null space of $A$, where $Av_i=0$ for $i>r$ and $\sigma_i=0$ (a numerical sketch of this construction follows below).
    [Figure omitted: in the original figure, k corresponds to r here.]
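Here is a hedged numerical sketch of this construction (the matrix and all names are made up): take the eigenvectors $v_i$ of $A^TA$, set $\sigma_i = \sqrt{\lambda_i}$ and $u_i = Av_i/\sigma_i$, and compare against numpy's SVD.

```python
# Build a thin SVD of A from the eigendecomposition of A^T A.
import numpy as np

rng = np.random.default_rng(2)
m, n = 5, 3
A = rng.standard_normal((m, 2)) @ rng.standard_normal((2, n))   # m x n matrix with rank r = 2

lam, V = np.linalg.eigh(A.T @ A)            # eigenpairs of A^T A (ascending order)
lam, V = lam[::-1], V[:, ::-1]              # sort so sigma_1 >= sigma_2 >= ...
sigma = np.sqrt(np.clip(lam, 0, None))      # singular values sigma_i = sqrt(lambda_i)

r = int(np.sum(sigma > 1e-10))              # numerical rank
U_r = (A @ V[:, :r]) / sigma[:r]            # u_i = A v_i / sigma_i

print(np.allclose(U_r.T @ U_r, np.eye(r)))                             # the u_i are orthonormal
print(np.allclose(A, U_r @ np.diag(sigma[:r]) @ V[:, :r].T))           # A = U_r Sigma_r V_r^T (thin SVD)
print(np.allclose(np.linalg.svd(A, compute_uv=False)[:r], sigma[:r]))  # matches numpy's singular values
```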


$$A=XY$$

Take-away message

PCA just computes the left or right singular matrix of the SVD (depending on how you define the covariance matrix).
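A quick numerical check of this claim, under the conventions used above (centered $X \in R^{d,n}$, covariance $\frac{1}{n}XX^T$; the data below is made up): the eigenvectors of the covariance matrix coincide, up to sign, with the left singular vectors of $X$.

```python
# PCA directions vs. left singular vectors of the centered data matrix.
import numpy as np

rng = np.random.default_rng(3)
d, n = 4, 300
X = rng.standard_normal((d, n))
X = X - X.mean(axis=1, keepdims=True)            # center each feature

lam, E = np.linalg.eigh(X @ X.T / n)             # eigenvectors of the covariance matrix
lam, E = lam[::-1], E[:, ::-1]

U, S, Vt = np.linalg.svd(X, full_matrices=False)
print(np.allclose(np.abs(U.T @ E), np.eye(d)))   # same directions up to sign
print(np.allclose(S**2 / n, lam))                # eigenvalues = squared singular values / n
```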

