This article walks through the mathematical derivation of PCA, which can be carried out with basic linear algebra.
Refer to deeplearningbook.org, Section 2.12 Example: Principal Components Analysis
Refer to Chapter 16, Principal Component Analysis, of Li Hang's Statistical Learning Methods
The content of this article is as follows:
Knowledge points used
PCA mathematical derivation
PCA decentralization
Calculation method based on singular value decomposition
Conclusion
Let's first talk about two linear algebra knowledge points used:
Knowledge points used
1. The sum of the diagonal elements of the matrix (the trace operator)
The sum of the diagonal elements of a matrix (the trace operator), denoted $\mathrm{Tr}$, is defined as follows:

$$\mathrm{Tr}(A) = \sum_i A_{ii}$$
It has the following properties:
1. The trace of a matrix is equal to the trace of its transpose: $\mathrm{Tr}(A) = \mathrm{Tr}(A^\top)$
2. Cyclic permutation: $\mathrm{Tr}(ABC) = \mathrm{Tr}(CAB) = \mathrm{Tr}(BCA)$
2. The Frobenius norm of a matrix:

$$\|A\|_F = \sqrt{\sum_{i,j} A_{ij}^2}$$

It has the following property: $\|A\|_F = \sqrt{\mathrm{Tr}\left(A A^\top\right)}$
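As a quick numerical check, here is a minimal NumPy sketch of both knowledge points (the matrix shapes and random seed are chosen arbitrarily for illustration):

```python
import numpy as np

# Arbitrary matrices whose product is square, so the trace is defined.
rng = np.random.default_rng(0)
A = rng.normal(size=(3, 4))
B = rng.normal(size=(4, 5))
C = rng.normal(size=(5, 3))

# 1. Tr(A) = Tr(A^T), and the cyclic property Tr(ABC) = Tr(CAB) = Tr(BCA).
S = A @ B @ C                      # 3x3 square matrix
print(np.isclose(np.trace(S), np.trace(S.T)))
print(np.isclose(np.trace(A @ B @ C), np.trace(C @ A @ B)))
print(np.isclose(np.trace(A @ B @ C), np.trace(B @ C @ A)))

# 2. Frobenius norm: sqrt of the sum of squared entries = sqrt(Tr(A A^T)).
print(np.isclose(np.linalg.norm(A, 'fro'), np.sqrt(np.trace(A @ A.T))))
```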
Well, after finishing the two small knowledge points, we will start PCA~
PCA mathematical derivation
We have $m$ points $\{x^{(1)}, \dots, x^{(m)}\}$ in space, each point an $n$-dimensional vector. Storing these points requires $m \times n$ units of memory. If memory is limited, can we use less space to store roughly the same information, losing as little of it as possible? One method is dimensionality reduction: for each point $x^{(i)} \in \mathbb{R}^n$ we find a corresponding code vector $c^{(i)} \in \mathbb{R}^l$ with $l < n$, so the storage required for the original data is reduced. Expressed mathematically, $f(x^{(i)}) = c^{(i)}$ and $x^{(i)} \approx g(f(x^{(i)}))$. For ease of reading, we will use $x$ and $c$ below to denote the original vector of an arbitrary point and its vector after dimension reduction; the function $f$ is the encoding function and the function $g$ is the decoding function.
PCA provides such a dimensionality reduction method.
The official derivation begins~
The derivation of PCA starts with the decoding function. To keep the decoding function as simple as possible, we choose matrix multiplication: $g(c) = Dc$, where $D \in \mathbb{R}^{n \times l}$. Without any restrictions on $D$ it is difficult to compute an optimal one, so we add some restrictions and see whether we still get the effect we want. In fact, we can require that the columns of $D$ are mutually orthogonal and that each column is a unit vector. After imposing this restriction, calculating the optimal code becomes much simpler; the restriction can be written as $D^\top D = I_l$.
1. Objective function:
Suppose $g(c) = Dc$, where $D \in \mathbb{R}^{n \times l}$ and $c \in \mathbb{R}^l$; our goal is to find the optimal $D$:

$$D^* = \operatorname*{arg\,min}_{D} \sqrt{\sum_{i,j}\left(x^{(i)}_j - g\big(f(x^{(i)})\big)_j\right)^2} \quad \text{subject to } D^\top D = I_l$$
2. Express the function $f$ in terms of $D$: because $c = f(x)$, this means expressing $c$ in terms of $D$ and $x$:
The formula in step 1 above is an optimization problem. Since we want the optimal $D$, the formula should contain only the known $x^{(i)}$ and the unknown $D$. Right now it still contains the unknown function $f$, so we need to express the function $f$, i.e. $c$, using the quantities we already have, $x$ and $D$.
How do we express it? Our goal has not changed: we can decompose the formula in step 1 over the individual samples and find the optimal code $c^*$ for each sample; in the process of solving for the optimal $c^*$, we treat $x$ and $D$ as known. The optimization problem to solve is:

$$c^* = \operatorname*{arg\,min}_{c} \left\|x - g(c)\right\|_2^2$$
$\|x - g(c)\|_2^2$ can be converted into matrix form:

$$\left(x - g(c)\right)^\top \left(x - g(c)\right)$$
Then expand:

$$= x^\top x - x^\top g(c) - g(c)^\top x + g(c)^\top g(c)$$
Because $x^\top g(c)$ and $g(c)^\top x$ are both real numbers and are equal (each is the transpose of the other), this can be simplified to:

$$= x^\top x - 2\, x^\top g(c) + g(c)^\top g(c)$$
Omitting the term $x^\top x$ (because it does not depend on $c$ and has nothing to do with the optimization), we get:

$$c^* = \operatorname*{arg\,min}_{c} \left(-2\, x^\top g(c) + g(c)^\top g(c)\right)$$
$g(c)$ is now replaced by its definition $Dc$:

$$c^* = \operatorname*{arg\,min}_{c} \left(-2\, x^\top D c + c^\top D^\top D c\right)$$
Since multiplying by the identity matrix changes nothing, and $D^\top D = I_l$ by our restriction, we have:

$$c^* = \operatorname*{arg\,min}_{c} \left(-2\, x^\top D c + c^\top c\right)$$
We use matrix calculus to differentiate with respect to $c$ and set the derivative to 0:

$$\nabla_c \left(-2\, x^\top D c + c^\top c\right) = -2\, D^\top x + 2c = 0 \quad \Rightarrow \quad c = D^\top x$$
We have found the optimal code $c = D^\top x$ (so the encoding function is $f(x) = D^\top x$); next we substitute this expression back into the objective function.
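Here is a small NumPy sketch of this result (the dimensions $n = 5$, $l = 2$ and the random seed are arbitrary): for a $D$ with orthonormal columns, the code $c = D^\top x$ should give a smaller reconstruction error $\|x - Dc\|_2^2$ than randomly drawn codes.

```python
import numpy as np

rng = np.random.default_rng(1)
n, l = 5, 2

# Build a D with orthonormal columns via QR decomposition.
D, _ = np.linalg.qr(rng.normal(size=(n, l)))
x = rng.normal(size=n)

c_opt = D.T @ x                                  # the derived optimal code
err_opt = np.sum((x - D @ c_opt) ** 2)

# Compare against many random codes: none should do better.
errs = [np.sum((x - D @ rng.normal(size=l)) ** 2) for _ in range(1000)]
print(err_opt <= min(errs))                      # expected: True
```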
3. Go back to the objective function in step 1
Convert the norm into the squared $L^2$ norm, and assume for now that $D$ reduces to a single vector, which we write as $d$ (i.e. $l = 1$); then:

$$d^* = \operatorname*{arg\,min}_{d} \sum_i \left\|x^{(i)} - d\, d^\top x^{(i)}\right\|_2^2 \quad \text{subject to } \|d\|_2 = 1$$

(We first prove the case $l = 1$ here; the derivation for larger $l$ is similar.)
Since $d^\top x^{(i)}$ is a real number, we can shift it to the left of the vector $d$:

$$d^* = \operatorname*{arg\,min}_{d} \sum_i \left\|x^{(i)} - d^\top x^{(i)}\, d\right\|_2^2$$
Since it is a real number, its transpose is itself, $d^\top x^{(i)} = x^{(i)\top} d$:

$$d^* = \operatorname*{arg\,min}_{d} \sum_i \left\|x^{(i)} - x^{(i)\top} d\, d\right\|_2^2$$
We can get rid of the summation by stacking the $m$ points as the rows of a design matrix $X \in \mathbb{R}^{m \times n}$ (so that $X_{i,:} = x^{(i)\top}$) and using the Frobenius norm we talked about earlier:

$$d^* = \operatorname*{arg\,min}_{d} \left\|X - X d d^\top\right\|_F^2 \quad \text{subject to } d^\top d = 1$$
Let's ignore the constraint for the moment and rewrite the previous formula in trace form, using the Frobenius-norm property from the knowledge points above:

$$\operatorname*{arg\,min}_{d} \left\|X - X d d^\top\right\|_F^2 = \operatorname*{arg\,min}_{d} \mathrm{Tr}\left(\left(X - X d d^\top\right)^\top \left(X - X d d^\top\right)\right)$$
Then we expand the matrix multiplication and drop $\mathrm{Tr}(X^\top X)$, which does not depend on $d$:

$$= \operatorname*{arg\,min}_{d} \left[-\mathrm{Tr}\left(X^\top X d d^\top\right) - \mathrm{Tr}\left(d d^\top X^\top X\right) + \mathrm{Tr}\left(d d^\top X^\top X d d^\top\right)\right]$$
We use the cyclic-permutation property of the trace from earlier (taking $A$ as $d d^\top$ and $B$ as $X^\top X$, so that $\mathrm{Tr}(d d^\top X^\top X) = \mathrm{Tr}(X^\top X d d^\top)$) to get:

$$= \operatorname*{arg\,min}_{d} \left[-2\,\mathrm{Tr}\left(X^\top X d d^\top\right) + \mathrm{Tr}\left(d d^\top X^\top X d d^\top\right)\right]$$
In the same way, this time treating $d d^\top$ as $A$ and $X^\top X d d^\top$ as $B$, we get:

$$= \operatorname*{arg\,min}_{d} \left[-2\,\mathrm{Tr}\left(X^\top X d d^\top\right) + \mathrm{Tr}\left(X^\top X d d^\top d d^\top\right)\right]$$
Don't forget that we have another constraint, $d^\top d = 1$, which can be substituted in:

$$= \operatorname*{arg\,min}_{d} \left[-2\,\mathrm{Tr}\left(X^\top X d d^\top\right) + \mathrm{Tr}\left(X^\top X d d^\top\right)\right] = \operatorname*{arg\,min}_{d} \left[-\mathrm{Tr}\left(X^\top X d d^\top\right)\right] = \operatorname*{arg\,max}_{d} \mathrm{Tr}\left(X^\top X d d^\top\right)$$
As mentioned earlier (taking $X^\top X d$ as $A$ and $d^\top$ as $B$), we can get:

$$= \operatorname*{arg\,max}_{d} \mathrm{Tr}\left(d^\top X^\top X d\right)$$
Because $d^\top X^\top X d$ is a real number, we can remove the $\mathrm{Tr}$ in front of it:

$$d^* = \operatorname*{arg\,max}_{d}\; d^\top X^\top X d$$
subject to $d^\top d = 1$
So far, it has been simplified to the form we want. $X^\top X$ is a real symmetric matrix, so the unit eigenvector corresponding to its largest eigenvalue maximizes $d^\top X^\top X d$. The specific proof will not be written here, but you can search it on the Internet~
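Instead of a formal proof, here is a small numerical check (the data shape and random seed are arbitrary): among unit vectors, the eigenvector of $X^\top X$ belonging to the largest eigenvalue should attain the largest value of $d^\top X^\top X d$.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 5))
G = X.T @ X                                   # real symmetric matrix

eigvals, eigvecs = np.linalg.eigh(G)          # eigh returns eigenvalues in ascending order
d_star = eigvecs[:, -1]                       # unit eigenvector of the largest eigenvalue
best = d_star @ G @ d_star

# No random unit vector should exceed it.
for _ in range(1000):
    d = rng.normal(size=5)
    d /= np.linalg.norm(d)
    assert d @ G @ d <= best + 1e-9
print("top eigenvector attains the maximum:", np.isclose(best, eigvals[-1]))
```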
What we proved is the special case $l = 1$; for $l = 2$ we can easily deduce:

$$d_1^*, d_2^* = \operatorname*{arg\,max}_{d_1, d_2} \left[d_1^\top X^\top X d_1 + d_2^\top X^\top X d_2\right] \quad \text{subject to } d_1^\top d_1 = 1,\; d_2^\top d_2 = 1,\; d_1^\top d_2 = 0$$

The $d_1$ that maximizes $d_1^\top X^\top X d_1$ is the eigenvector of $X^\top X$ corresponding to the largest eigenvalue.
On the basis of $d_1$ reaching that maximum, the optimal $d_2$ is the unit vector orthogonal to $d_1$ that maximizes $d_2^\top X^\top X d_2$, namely the eigenvector corresponding to the second-largest eigenvalue.
When $l > 2$, the argument is the same as for $l = 2$: each additional column of $D$ is the unit eigenvector corresponding to the next-largest eigenvalue.
PCA decentralization
In the mathematical derivation we only needed the eigenvectors corresponding to the first $l$ largest eigenvalues of $X^\top X$. But there is a problem we have to consider: the values in different feature dimensions can differ greatly. If a certain feature (a column of the matrix) takes particularly large values, it will dominate the overall error calculation, and the directions of the new low-dimensional space will tend to follow that largest feature while ignoring features with relatively small values. Since we do not know the importance of each feature before running PCA, this is likely to result in a large loss of information. So decentralization is necessary.
Decentralization means that from each column of the data matrix we subtract the mean of that column.
Assuming the matrix after decentralization is $X$, PCA then takes the eigenvectors of $X^\top X$ corresponding to the first $l$ largest eigenvalues.
In fact, $\frac{1}{m-1} X^\top X$ is the covariance matrix of the data.
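Putting the pieces together, here is a minimal PCA sketch along the lines of the derivation above: decentralize the data, form the covariance matrix, and take the eigenvectors of its $l$ largest eigenvalues as the columns of $D$. The data shape, the deliberately large-scale last column, and $l = 2$ are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 6)) * np.array([1, 1, 1, 1, 1, 10.0])  # one large-scale column
l = 2

Xc = X - X.mean(axis=0)                       # decentralization: subtract column means
cov = (Xc.T @ Xc) / (Xc.shape[0] - 1)         # covariance matrix of the data

eigvals, eigvecs = np.linalg.eigh(cov)        # ascending eigenvalue order
D = eigvecs[:, ::-1][:, :l]                   # eigenvectors of the l largest eigenvalues

codes = Xc @ D                                # encoding: c = D^T x for every row
X_hat = codes @ D.T + X.mean(axis=0)          # decoding back to the original space
print("reconstruction error:", np.linalg.norm(X - X_hat))
```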
Calculation method based on singular value decomposition
Traditional principal component analysis is carried out through the eigenvalue decomposition of the covariance matrix of the data; a commonly used method nowadays is through the singular value decomposition of the data matrix.
I won't say much about finding the principal components through the eigenvalue decomposition of the covariance matrix; instead, let's talk about finding them through singular value decomposition. For background, refer to the article Singular Value Decomposition (SVD).
We know that for a real matrix $A \in \mathbb{R}^{m \times n}$ with rank $r$, the matrix has a truncated singular value decomposition:

$$A \approx U_k \Sigma_k V_k^\top, \qquad k \le r$$

where $U_k \in \mathbb{R}^{m \times k}$ and $V_k \in \mathbb{R}^{n \times k}$ have orthonormal columns and $\Sigma_k$ is the $k \times k$ diagonal matrix of the $k$ largest singular values.
Define a new matrix

$$X' = \frac{1}{\sqrt{m-1}}\, X$$

(here $X$ is the matrix after decentralization, with one sample per row). So

$$X'^\top X' = \left(\frac{1}{\sqrt{m-1}} X\right)^\top \left(\frac{1}{\sqrt{m-1}} X\right) = \frac{1}{m-1} X^\top X$$

It can be seen that $X'^\top X'$ is exactly the covariance matrix $S_X$ of $X$.
Principal component analysis comes down to finding the eigenvalues and corresponding unit eigenvectors of the covariance matrix $S_X$, so the problem is transformed into finding the eigenvalues and corresponding unit eigenvectors of the matrix $X'^\top X'$.
Assuming the truncated singular value decomposition of $X'$ is $X' = U \Sigma V^\top$, the columns of $V$ are the unit eigenvectors of $X'^\top X'$, so the first $k$ columns of $V$ are the first $k$ principal components of $X$. Therefore, finding the principal components can be realized through the singular value decomposition of $X'$.
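A minimal sketch of this SVD route (data shape, seed, and $k = 2$ are arbitrary): build $X'$ from the decentralized data, take its SVD, and read the principal component directions off the columns of $V$; the squared singular values are the eigenvalues of the covariance matrix.

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(200, 6))
k = 2

Xc = X - X.mean(axis=0)                       # decentralized data matrix (rows are samples)
m = Xc.shape[0]
Xp = Xc / np.sqrt(m - 1)                      # the new matrix X'; X'^T X' is the covariance of X

U, s, Vt = np.linalg.svd(Xp, full_matrices=False)
V_k = Vt[:k].T                                # first k columns of V: the first k principal components
print("principal directions:\n", V_k)
print("eigenvalues of the covariance matrix:", s[:k] ** 2)  # squared singular values
```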
Conclusion
PCA is a dimensionality reduction method. Take the eigenvectors corresponding to the first $l$ largest eigenvalues of the covariance matrix of the data, and put these eigenvectors together as the columns of the decoding matrix $D$. The encoding matrix is then $D^\top$: the code of a point is $c = D^\top x$ and its reconstruction is $D D^\top x$.
For the implementation, one can compute the eigendecomposition of the covariance matrix of the original data, or compute the singular value decomposition of the decentralized original data matrix.
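As a final sanity check, here is a short sketch comparing the two implementations on the same (arbitrarily generated) data; since eigenvectors are only determined up to sign, the columns are sign-aligned before comparison.

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.normal(size=(150, 4))
l = 2
Xc = X - X.mean(axis=0)

# Method 1: eigendecomposition of the covariance matrix.
cov = Xc.T @ Xc / (Xc.shape[0] - 1)
w, V = np.linalg.eigh(cov)
D_eig = V[:, ::-1][:, :l]

# Method 2: SVD of the decentralized data matrix.
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
D_svd = Vt[:l].T

# Eigenvectors are only defined up to sign, so align signs before comparing.
signs = np.sign(np.sum(D_eig * D_svd, axis=0))
print(np.allclose(D_eig, D_svd * signs))      # expected: True
```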
Whew, it's finally over~