Machine Learning (13): The Principle of PCA Dimensionality Reduction

Recently I came across a question: what is the relationship between PCA and SVD? I vaguely remembered that when I implemented PCA, SVD was clearly involved, yet SVD (singular value decomposition) and PCA (which uses eigenvalue decomposition) seem quite different. So I dug into the topic, collected some material, and wrote up a summary of what I learned so I don't forget it later.

Reference: https://blog.csdn.net/dark_scope/article/details/53150883

Simple derivation of PCA

PCA has two easy-to-understand interpretations: 1) maximize the variance of the data after projection (spread the data out as much as possible); 2) minimize the loss (reconstruction error) caused by the projection. Both approaches lead to the same result in the end.
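As a quick numerical illustration of this equivalence (my addition, not from the original post), the sketch below scans candidate unit directions for a toy 2-D dataset and checks that the direction of maximum projected variance is also the direction of minimum reconstruction error:

```python
import numpy as np

# Toy 2-D data, centered, for checking that the two PCA views agree.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2)) @ np.array([[3.0, 0.0], [1.0, 0.5]])
X = X - X.mean(axis=0)

# Scan candidate unit directions u and evaluate both criteria.
angles = np.linspace(0, np.pi, 1000)
best_var, best_err = None, None
for t in angles:
    u = np.array([np.cos(t), np.sin(t)])                # unit vector
    proj = X @ u                                        # scalar projections
    var = (proj ** 2).mean()                            # projected variance (data is centered)
    err = ((X - np.outer(proj, u)) ** 2).sum(1).mean()  # mean squared reconstruction error
    if best_var is None or var > best_var[0]:
        best_var = (var, u)
    if best_err is None or err < best_err[0]:
        best_err = (err, u)

# The max-variance direction and the min-error direction coincide.
print(best_var[1], best_err[1])
```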

The following picture is probably the best illustration of the second interpretation of PCA (ref: svd, pca, relation).

Vector projection

Assume the angle between two non-zero vectors a and b is θ. Then $|b|\cos\theta$ is called the projection (scalar projection) of vector b onto the direction of vector a.
Introducing the unit vector $\hat{a} = a/|a|$, this projection can be written as the inner product $b^T\hat{a} = |b|\cos\theta$.
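As a quick check (not in the original post), here is a minimal numpy illustration of the scalar and vector projection just defined, for two arbitrary example vectors:

```python
import numpy as np

a = np.array([3.0, 4.0])
b = np.array([2.0, 1.0])

a_hat = a / np.linalg.norm(a)          # unit vector in the direction of a
scalar_proj = b @ a_hat                # |b| * cos(theta)
vector_proj = scalar_proj * a_hat      # projection of b as a vector along a

# Same value via the cosine definition, for comparison.
cos_theta = (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))
print(scalar_proj, np.linalg.norm(b) * cos_theta)  # both are 2.0
```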

(Figure: decentralized data points and the two rotated, mutually perpendicular projection axes)
The data shown in the figure has been decentralized (the center point is the origin). This step is done simply via $x_i = x_i - \bar{x}$, where $\bar{x}$ is the sample mean. For convenience, $x_i$ below always denotes the decentralized result.
It can be seen that the so-called dimensionality reduction of PCA amounts to finding a new coordinate system (the two rotated lines are perpendicular, so we can describe it with an orthonormal basis $\{u_j\},\ j = 1, \ldots, n$) and then discarding some of these dimensions while keeping the resulting error small enough.
Suppose the projection direction we are looking for is $u_j$ ($u_j$ is a unit vector, i.e. $u_j^T u_j = 1$). The projection of a point $x_i$ onto this direction is $(x_i^T u_j)\,u_j$.

Writing each decentralized point in the new basis, $x_i = \sum_{j=1}^{n} (x_i^T u_j)\,u_j$, and keeping only a subset $S$ of the directions, the average squared error of the approximation is

$J = \frac{1}{M}\sum_{i=1}^{M}\sum_{j \notin S} (x_i^T u_j)^2 = \sum_{j \notin S} u_j^T \Big(\frac{1}{M} X X^T\Big) u_j,$

where $X$ is the matrix whose columns are the decentralized samples $x_i$. Minimizing $J$ under the constraint $u_j^T u_j = 1$ (with Lagrange multipliers) gives $\frac{1}{M} X X^T u_j = \lambda_j u_j$: the optimal directions are eigenvectors, and each dropped direction contributes exactly its eigenvalue $\lambda_j$ to the error.
So to minimize $J$, we simply drop the dimensions corresponding to the $t$ smallest eigenvalues in the new coordinate system.
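The claim that the dropped error equals the sum of the dropped eigenvalues can be verified numerically. Below is a small check I added (not from the original post); note that here samples are stored as rows, so the post's $XX^T$ corresponds to `X.T @ X` in this layout:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 5)) @ rng.normal(size=(5, 5))  # M x n, samples as rows
X = X - X.mean(axis=0)                                    # decentralize
M = X.shape[0]

C = X.T @ X / M                       # (1/M) X X^T in the post's column-sample notation
eigvals, eigvecs = np.linalg.eigh(C)  # ascending eigenvalues, orthonormal eigenvectors

t = 2                                 # drop the t smallest-eigenvalue directions
U_keep = eigvecs[:, t:]               # keep the n - t largest
X_hat = X @ U_keep @ U_keep.T         # project onto the kept subspace and reconstruct

J = ((X - X_hat) ** 2).sum(axis=1).mean()  # average squared reconstruction error
print(J, eigvals[:t].sum())                # the two numbers agree
```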
Now, looking back at the PCA process, everything lines up. First, recall what eigenvalue decomposition means (the explanation below is quoted from a Zhihu answer):

Multiplying a vector x by a matrix A is a linear transformation of x (some combination of rotation and stretching). When the effect of this transformation is the same as multiplying x by a constant c (i.e. pure stretching, no rotation), x is an eigenvector of A and c is the corresponding eigenvalue.

We compute eigenvalues and eigenvectors to find out which vectors are merely stretched by the matrix (the eigenvectors) and by how much they are stretched (the eigenvalues). The point is to see clearly in which directions a matrix exerts its greatest effect, and to study the transformation direction by direction according to the eigenvectors (those with the largest eigenvalues are usually the ones of interest).



Author: Rex
Link: https://www.zhihu.com/question/20507061/answer/16610027
Source: Zhihu
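To make the $Ax = cx$ picture above concrete, here is a small numpy check (my addition, not part of the quoted answer) that each eigenvector is only stretched by the matrix:

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])

eigvals, eigvecs = np.linalg.eig(A)

# For each eigenpair, A @ v points in the same direction as v: it is only stretched.
for lam, v in zip(eigvals, eigvecs.T):
    print(A @ v, lam * v)   # the two vectors match
```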
In summary, the PCA procedure is (a minimal code sketch follows this list):
  1. Decentralize the data (subtract the sample mean).
  2. Compute $XX^T$. Note: whether you divide by the sample size $M$ or by $M-1$ here has no effect on the eigenvectors obtained.
  3. Perform an eigendecomposition of $XX^T$.
  4. Map the data onto the eigenvectors with the largest eigenvalues (i.e. drop the dimensions with small eigenvalues).
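A minimal numpy sketch of these four steps (my own illustration of the procedure described above, assuming samples are stored as rows of `X`, so $XX^T$ from the list becomes `X.T @ X` here):

```python
import numpy as np

def pca_eig(X, k):
    """PCA via eigendecomposition. X: (M, n) with samples in rows; keep k dims."""
    # 1. decentralize the data
    X_centered = X - X.mean(axis=0)
    M = X_centered.shape[0]

    # 2. compute the (scaled) scatter matrix; dividing by M or M-1
    #    only rescales the eigenvalues, the eigenvectors are unchanged
    C = X_centered.T @ X_centered / (M - 1)

    # 3. eigendecomposition (eigh: symmetric matrix, ascending eigenvalues)
    eigvals, eigvecs = np.linalg.eigh(C)

    # 4. keep the k eigenvectors with the largest eigenvalues and project
    U = eigvecs[:, ::-1][:, :k]
    return X_centered @ U, U

# Example: reduce 4-dimensional data to 2 dimensions.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
Z, U = pca_eig(X, k=2)
print(Z.shape)   # (100, 2)
```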

Remaining problem

Having read this far, some readers may ask: why do I remember that the standard procedure computes the covariance matrix of the data?
Let's look at the formula for the covariance matrix:

$C = \frac{1}{M-1}\sum_{i=1}^{M}(x_i - \bar{x})(x_i - \bar{x})^T$

Since the data has already been decentralized ($\bar{x} = 0$), this is just $\frac{1}{M-1} X X^T$: exactly the matrix we eigendecomposed above, up to a constant factor that (as noted in step 2) does not change the eigenvectors. So the two descriptions of PCA are the same.
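As a final sanity check (my addition), the manually computed $\frac{1}{M-1}XX^T$ on centered data matches numpy's built-in covariance:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 3))          # M = 100 samples as rows, n = 3 features
Xc = X - X.mean(axis=0)                # decentralize
M = Xc.shape[0]

C_manual = Xc.T @ Xc / (M - 1)         # (1/(M-1)) X X^T in the post's notation
C_numpy = np.cov(X, rowvar=False)      # numpy's covariance matrix (columns = variables)

print(np.allclose(C_manual, C_numpy))  # True: they are the same matrix
```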