Principal Component Analysis (PCA)

This article focuses on the mathematical derivation of PCA, which needs only some basic linear algebra.

Reference: deeplearningbook.org, Section 2.12, "Example: Principal Components Analysis"

Reference: Chapter 16, "Principal Component Analysis", of Li Hang's Statistical Learning Methods

The content of this article is as follows:

Linear algebra prerequisites

Mathematical derivation of PCA

Centering the data

Computation via singular value decomposition

Conclusion


Let's first go over the two linear algebra facts we will use:

Linear algebra prerequisites

1. The trace operator (sum of diagonal elements)

The trace of a matrix, written Tr, is the sum of its diagonal elements:

Tr(\boldsymbol{A})=\sum_{i}\boldsymbol{A}_{i,i}

It has the following properties:

(1) The trace of a matrix equals the trace of its transpose: Tr(\boldsymbol{A})=Tr(\boldsymbol{A}^{T})

(2) Invariance under cyclic permutation: Tr(\boldsymbol{ABC})=Tr(\boldsymbol{CAB})=Tr(\boldsymbol{BCA}), and in particular Tr(\boldsymbol{AB})=Tr(\boldsymbol{BA})

 

2. The Frobenius norm of a matrix:

\left \| \boldsymbol{A} \right \|_{F}=\sqrt{\sum_{i,j}\boldsymbol{A}_{i,j}^{2}}

It has the following property:

\left \| \boldsymbol{A} \right \|_{F}=\sqrt{Tr(\boldsymbol{A}\boldsymbol{A}^{T})}=\sqrt{Tr(\boldsymbol{A}^{T}\boldsymbol{A})}
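As a quick sanity check of these two facts, here is a minimal NumPy sketch (NumPy is just an assumed tool here; any numerical library would do):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 4))
B = rng.standard_normal((4, 3))
C = rng.standard_normal((3, 3))

# Trace equals the trace of the transpose (square matrix).
assert np.isclose(np.trace(C), np.trace(C.T))

# Cyclic permutation: Tr(AB) = Tr(BA).
assert np.isclose(np.trace(A @ B), np.trace(B @ A))

# Frobenius norm: sqrt of sum of squared entries, also sqrt(Tr(A A^T)).
assert np.isclose(np.linalg.norm(A, 'fro'), np.sqrt(np.trace(A @ A.T)))
```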

Well, with those two small facts out of the way, let's start on PCA~

Mathematical derivation of PCA

Suppose we have m points \{\boldsymbol{x}^{(1)},\boldsymbol{x}^{(2)},\cdots ,\boldsymbol{x}^{(m)} \} in \mathbb{R}^{n}, each point being an n-dimensional vector. Storing these points takes m\times n units of memory. If space is limited, can we store the same information in less space while losing as little of it as possible? One approach is dimensionality reduction: for each point \boldsymbol{x}^{(i)}\in \mathbb{R}^{n} we find a corresponding code \boldsymbol{c}^{(i)}\in \mathbb{R}^{l} with l < n, so that the storage needed for the original data is reduced. Written mathematically, we want f(\boldsymbol{x})=\boldsymbol{c} and \boldsymbol{x}\approx g(f(\boldsymbol{x})), where \boldsymbol{x} stands for \boldsymbol{x}^{(i)} and \boldsymbol{c} for \boldsymbol{c}^{(i)}; for readability we will simply write \boldsymbol{x} and \boldsymbol{c} from here on for the original vector of any point and its reduced vector. The function f is the encoding function and g is the decoding function.

PCA provides such a dimensionality reduction method.

The official derivation begins~

The derivation of PCA starts from the decoding function. To keep the decoder as simple as possible, we choose matrix multiplication: g(\boldsymbol{c})=\boldsymbol{D}\boldsymbol{c}, where \boldsymbol{D}\in \mathbb{R}^{n\times l}. Without any restriction, the optimal \boldsymbol{D} is hard to compute, so let's impose some constraints on \boldsymbol{D} and see whether we can still get the effect we want. Specifically, we require the columns of \boldsymbol{D} to be mutually orthogonal unit vectors. With this restriction, computing the optimal \boldsymbol{D} becomes much simpler, and we can actually find it.

1. The objective function:

Suppose f(\boldsymbol{x})=\boldsymbol{c} and g(\boldsymbol{c})=\boldsymbol{D}\boldsymbol{c}. Our goal is to find the optimal \boldsymbol{D}:

\boldsymbol{D}^{*}=argmin_{\boldsymbol{D}}\sum_{i}\left \| \boldsymbol{x}^{(i)}-g(f(\boldsymbol{x}^{(i)})) \right \|_{2}  subject to \boldsymbol{D}^{T}\boldsymbol{D}=\boldsymbol{I}_{l}

2. Express the function f in terms of \boldsymbol{D}; since f(\boldsymbol{x})=\boldsymbol{c}, this means expressing \boldsymbol{c} in terms of \boldsymbol{D}:

The formula in step 1 is an optimization problem. Since we are solving for the best \boldsymbol{D}, the formula should contain only the known \boldsymbol{x} and the unknown \boldsymbol{D}. At the moment it still contains the unknown function f, so we need to express f, that is \boldsymbol{c}, in terms of the available \boldsymbol{x} and \boldsymbol{D}.

How do we do that? Our goal has not changed, so we can break the formula from step 1 down to the individual samples and find the optimal \boldsymbol{c} for each one; while solving for the optimal \boldsymbol{c}, we treat \boldsymbol{x} and \boldsymbol{D} as known. The optimization problem for \boldsymbol{c} is:

\boldsymbol{c}^{*}=argmin_{\boldsymbol{c}}\left \| \boldsymbol{x}-g(\boldsymbol{c}) \right \|_{2}

Since minimizing the L^{2} norm is equivalent to minimizing its square, this can be written in matrix form:

(\boldsymbol{x}-g(\boldsymbol{c}))^{T}(\boldsymbol{x}-g(\boldsymbol{c}))

Expanding:

\boldsymbol{x}^{T}\boldsymbol{x}-\boldsymbol{x}^{T}g(\boldsymbol{c})-g(\boldsymbol{c})^{T}\boldsymbol{x}+g(\boldsymbol{c})^{T}g(\boldsymbol{c})

Because \boldsymbol{x}^{T}g(\boldsymbol{c}) and g(\boldsymbol{c})^{T}\boldsymbol{x} are both real numbers and are transposes of each other, they are equal, so this simplifies to:

\boldsymbol{x}^{T}\boldsymbol{x}-2\boldsymbol{x}^{T}g(\boldsymbol{c})+g(\boldsymbol{c})^{T}g(\boldsymbol{c})

Dropping the \boldsymbol{x}^{T}\boldsymbol{x} term (it does not depend on \boldsymbol{c} and therefore does not affect the optimization), we get:

\boldsymbol{c}^{*}=argmin_{\boldsymbol{c}}\left ( -2\boldsymbol{x}^{T}g(\boldsymbol{c})+g(\boldsymbol{c})^{T}g(\boldsymbol{c}) \right )

Substituting g(\boldsymbol{c})=\boldsymbol{D}\boldsymbol{c}:

\boldsymbol{c}^{*}=argmin_{\boldsymbol{c}}\left ( -2\boldsymbol{x}^{T}\boldsymbol{D}\boldsymbol{c}+\boldsymbol{c}^{T}\boldsymbol{D}^{T}\boldsymbol{D}\boldsymbol{c} \right )

Since \boldsymbol{D}^{T}\boldsymbol{D}=\boldsymbol{I}_{l} (the columns of \boldsymbol{D} are orthonormal) and multiplying by the identity matrix changes nothing, this becomes:

\boldsymbol{c}^{*}=argmin_{\boldsymbol{c}}\left ( -2\boldsymbol{x}^{T}\boldsymbol{D}\boldsymbol{c}+\boldsymbol{c}^{T}\boldsymbol{c} \right )

Using matrix calculus, we differentiate with respect to \boldsymbol{c} and set the gradient to zero:

\nabla_{\boldsymbol{c}}\left ( -2\boldsymbol{x}^{T}\boldsymbol{D}\boldsymbol{c}+\boldsymbol{c}^{T}\boldsymbol{c} \right )=-2\boldsymbol{D}^{T}\boldsymbol{x}+2\boldsymbol{c}=0

\boldsymbol{c}=\boldsymbol{D}^{T}\boldsymbol{x}

 

We have found the optimal \boldsymbol{c}, namely f(\boldsymbol{x})=\boldsymbol{D}^{T}\boldsymbol{x}; next we substitute this expression back into the objective function.
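To make the encode/decode pair concrete, here is a minimal NumPy sketch (the orthonormal D here is just a random basis obtained via QR, purely for illustration; it is not yet the optimal PCA basis):

```python
import numpy as np

rng = np.random.default_rng(0)
n, l = 5, 2                      # original and reduced dimensions
x = rng.standard_normal(n)       # one data point

# Any matrix with orthonormal columns plays the role of D here.
D, _ = np.linalg.qr(rng.standard_normal((n, l)))
assert np.allclose(D.T @ D, np.eye(l))   # D^T D = I_l

c = D.T @ x          # encoder: f(x) = D^T x
x_hat = D @ c        # decoder: g(c) = D c
# x_hat is the reconstruction D D^T x; it only approximates x.
```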

 3. Go back to the objective function in step 1

With f(\boldsymbol{x})=\boldsymbol{D}^{T}\boldsymbol{x}, the reconstruction is g(f(\boldsymbol{x}))=\boldsymbol{D}\boldsymbol{D}^{T}\boldsymbol{x}. Replace the norm with its square, and assume l=1, so that \boldsymbol{D} becomes a single vector, which we write as \boldsymbol{d}. Then:

\boldsymbol{d}^{*}=argmin_{d}\sum_{i}\left \| \boldsymbol{x}^{(i)}-\boldsymbol{d}\boldsymbol{d}^{T}\boldsymbol{x}^{(i)} \right \|_{2}^{2}  subject to \boldsymbol{d}^{T}\boldsymbol{d}=1

(We prove the l=1 case here first; the derivation for l\geq 2 is similar.)

Since \boldsymbol{d}^{T}\boldsymbol{x}^{(i)} is a real number (a scalar), we can move it to the left of the vector \boldsymbol{d}:

\boldsymbol{d}^{*}=argmin_{d}\sum_{i}\left \| \boldsymbol{x}^{(i)}-\boldsymbol{d}^{T}\boldsymbol{x}^{(i)}\boldsymbol{d} \right \|_{2}^{2}

And since \boldsymbol{d}^{T}\boldsymbol{x}^{(i)} is a real number, it equals its own transpose:

\boldsymbol{d}^{*}=argmin_{d}\sum_{i}\left \| \boldsymbol{x}^{(i)}-\boldsymbol{x}^{(i)T}\boldsymbol{d}\boldsymbol{d} \right \|_{2}^{2}

We can absorb the \sum over samples by stacking all the points into a single matrix \boldsymbol{X}\in \mathbb{R}^{m\times n} with rows \boldsymbol{X}_{i,:}=\boldsymbol{x}^{(i)T}, and use the Frobenius norm we introduced earlier:

\boldsymbol{d}^{*}=argmin_{d}\left \| \boldsymbol{X}-\boldsymbol{X}\boldsymbol{d}\boldsymbol{d}^{T} \right \|_{F}^{2}  subject to \boldsymbol{d}^{T}\boldsymbol{d}=1

Setting the constraint \boldsymbol{d}^{T}\boldsymbol{d}=1 aside for the moment, we rewrite the previous formula in trace form, using the facts from the knowledge points above:

\boldsymbol{d}^{*}=argmin_{d} Tr\left ( (\boldsymbol{X}-\boldsymbol{X}\boldsymbol{d}\boldsymbol{d}^{T})^{T}(\boldsymbol{X}-\boldsymbol{X}\boldsymbol{d}\boldsymbol{d}^{T}) \right )

Expanding the matrix product and using the linearity of the trace (the Tr(\boldsymbol{X}^{T}\boldsymbol{X}) term does not depend on \boldsymbol{d} and can be dropped):

= argmin_{d} Tr\left ( \boldsymbol{X}^{T}\boldsymbol{X}-\boldsymbol{X}^{T}\boldsymbol{X}\boldsymbol{d}\boldsymbol{d}^{T}-\boldsymbol{d}\boldsymbol{d}^{T}\boldsymbol{X}^{T}\boldsymbol{X}+\boldsymbol{d}\boldsymbol{d}^{T}\boldsymbol{X}^{T}\boldsymbol{X}\boldsymbol{d}\boldsymbol{d}^{T} \right )

= argmin_{d}\left ( -Tr(\boldsymbol{X}^{T}\boldsymbol{X}\boldsymbol{d}\boldsymbol{d}^{T})-Tr(\boldsymbol{d}\boldsymbol{d}^{T}\boldsymbol{X}^{T}\boldsymbol{X})+Tr(\boldsymbol{d}\boldsymbol{d}^{T}\boldsymbol{X}^{T}\boldsymbol{X}\boldsymbol{d}\boldsymbol{d}^{T}) \right )

Using the earlier fact Tr(\boldsymbol{AB})=Tr(\boldsymbol{BA}) (take \boldsymbol{X}^{T}\boldsymbol{X} as \boldsymbol{A} and \boldsymbol{d}\boldsymbol{d}^{T} as \boldsymbol{B}), the two middle terms combine:

= argmin_{d}\left ( -2Tr(\boldsymbol{X}^{T}\boldsymbol{X}\boldsymbol{d}\boldsymbol{d}^{T})+Tr(\boldsymbol{d}\boldsymbol{d}^{T}\boldsymbol{X}^{T}\boldsymbol{X}\boldsymbol{d}\boldsymbol{d}^{T}) \right )

In the same way, this time taking \boldsymbol{X}^{T}\boldsymbol{X}\boldsymbol{d}\boldsymbol{d}^{T} as \boldsymbol{A} and \boldsymbol{d}\boldsymbol{d}^{T} as \boldsymbol{B}, we get:

= argmin_{d}\left ( -2Tr(\boldsymbol{X}^{T}\boldsymbol{X}\boldsymbol{d}\boldsymbol{d}^{T})+Tr(\boldsymbol{X}^{T}\boldsymbol{X}\boldsymbol{d}\boldsymbol{d}^{T}\boldsymbol{d}\boldsymbol{d}^{T}) \right )

Don't forget that we still have the constraint \boldsymbol{d}^{T}\boldsymbol{d}=1; substituting it in:

= argmin_{d}\left ( -2Tr(\boldsymbol{X}^{T}\boldsymbol{X}\boldsymbol{d}\boldsymbol{d}^{T})+Tr(\boldsymbol{X}^{T}\boldsymbol{X}\boldsymbol{d}\boldsymbol{d}^{T}) \right )  subject to \boldsymbol{d}^{T}\boldsymbol{d}=1

= argmin_{d}\left ( -Tr(\boldsymbol{X}^{T}\boldsymbol{X}\boldsymbol{d}\boldsymbol{d}^{T}) \right )  subject to \boldsymbol{d}^{T}\boldsymbol{d}=1

= argmax_{d} Tr(\boldsymbol{X}^{T}\boldsymbol{X}\boldsymbol{d}\boldsymbol{d}^{T})  subject to \boldsymbol{d}^{T}\boldsymbol{d}=1

 

 

As mentioned earlier, Tr(\boldsymbol{AB})=Tr(\boldsymbol{BA}) (take \boldsymbol{X}^{T}\boldsymbol{X}\boldsymbol{d} as \boldsymbol{A} and \boldsymbol{d}^{T} as \boldsymbol{B}) gives:

=argmax_{d} Tr(\boldsymbol{d}^{T}\boldsymbol{X}^{T}\boldsymbol{X}\boldsymbol{d})  subject to \boldsymbol{d}^{T}\boldsymbol{d}=1

Because \boldsymbol{d}^{T}\boldsymbol{X}^{T}\boldsymbol{X}\boldsymbol{d} is a real number, the Tr can be dropped:

=argmax_{d} (\boldsymbol{d}^{T}\boldsymbol{X}^{T}\boldsymbol{X}\boldsymbol{d})  subject to \boldsymbol{d}^{T}\boldsymbol{d}=1

We have now simplified the problem to the form we want. \boldsymbol{X}^{T}\boldsymbol{X} is a real symmetric matrix, so the \boldsymbol{d} that maximizes \boldsymbol{d}^{T}\boldsymbol{X}^{T}\boldsymbol{X}\boldsymbol{d} is the unit eigenvector corresponding to its largest eigenvalue. The detailed proof is omitted here; it is easy to find online~
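A quick numerical check of this claim, as a sketch in NumPy (the data matrix X and candidate directions d are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5))          # m = 100 points in R^5

# Eigendecomposition of the real symmetric matrix X^T X.
# np.linalg.eigh returns eigenvalues in ascending order.
eigvals, eigvecs = np.linalg.eigh(X.T @ X)
d_best = eigvecs[:, -1]                     # eigenvector of the largest eigenvalue

def objective(d):
    return d @ (X.T @ X) @ d                # d^T X^T X d

# No random unit vector should beat the top eigenvector.
for _ in range(1000):
    d = rng.standard_normal(5)
    d /= np.linalg.norm(d)                  # enforce d^T d = 1
    assert objective(d) <= objective(d_best) + 1e-9
```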

What we proved above is the special case l=1. For l=2, with \boldsymbol{D}=[\boldsymbol{d}_{1},\boldsymbol{d}_{2} ], we can similarly deduce:

\boldsymbol{D}^{*}=argmax_{d1,d2} (\boldsymbol{d}_{1}^{T}\boldsymbol{X}^{T}\boldsymbol{X}\boldsymbol{d}_{1} + \boldsymbol{d}_{2}^{T}\boldsymbol{X}^{T}\boldsymbol{X}\boldsymbol{d}_{2} )

subject to \boldsymbol{d}_{1}^{T}\boldsymbol{d}_{1}=1 , \boldsymbol{d}_{2}^{T}\boldsymbol{d}_{2}=1 , \boldsymbol{d}_{2}^{T}\boldsymbol{d}_{1}=0 , \boldsymbol{d}_{1}^{T}\boldsymbol{d}_{2}=0

The \boldsymbol{d}_{1} that maximizes \boldsymbol{d}_{1}^{T}\boldsymbol{X}^{T}\boldsymbol{X}\boldsymbol{d}_{1} is the eigenvector corresponding to the largest eigenvalue of \boldsymbol{X}^{T}\boldsymbol{X}.

Given that \boldsymbol{d}_{1}^{T}\boldsymbol{X}^{T}\boldsymbol{X}\boldsymbol{d}_{1} is already at its maximum, the optimal \boldsymbol{d}_{2} is a unit vector orthogonal to \boldsymbol{d}_{1}, and the \boldsymbol{d}_{2} that maximizes \boldsymbol{d}_{2}^{T}\boldsymbol{X}^{T}\boldsymbol{X}\boldsymbol{d}_{2} under that constraint is the eigenvector of the second largest eigenvalue of \boldsymbol{X}^{T}\boldsymbol{X}.

The case l>2 works the same way as l=2: take the eigenvectors of the l largest eigenvalues.

Centering the data

From the derivation above, all we need are the eigenvectors corresponding to the l largest eigenvalues of \boldsymbol{X}^{T}\boldsymbol{X}. But there is one issue we still have to consider: the values in different feature dimensions can differ greatly. If one feature (one column of the matrix) takes particularly large values, it will carry a large share of the total error, and the directions of the projected low-dimensional space will be pulled toward that dominant feature while features with relatively small values are ignored. Since we do not know the importance of each feature before running PCA, this can easily cause a large loss of information. That is why centering the data is necessary.

Centering means subtracting from each column of the matrix the mean of that column.

If the centered matrix is \boldsymbol{Y}, then PCA takes the eigenvectors corresponding to the l largest eigenvalues of \boldsymbol{Y}^{T}\boldsymbol{Y}.

In fact, \frac{1}{m-1}\boldsymbol{Y}^{T}\boldsymbol{Y} is the covariance matrix of \boldsymbol{X}.
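A small sketch of the centering step and of the covariance identity above (np.cov with rowvar=False uses the same 1/(m-1) normalization; the data is made up):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 100, 4
X = rng.standard_normal((m, n)) * [1.0, 10.0, 0.1, 5.0]  # columns on different scales

Y = X - X.mean(axis=0)                 # centering: subtract each column's mean
cov = (Y.T @ Y) / (m - 1)              # (1/(m-1)) Y^T Y

# This matches NumPy's covariance of the columns of X.
assert np.allclose(cov, np.cov(X, rowvar=False))
```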

Computation via singular value decomposition

Traditionally, principal component analysis is carried out via the eigenvalue decomposition of the data's covariance matrix; in practice, the commonly used approach is the singular value decomposition of the data matrix.

I won't say much about finding the principal components through the eigenvalue decomposition of the covariance matrix; instead, let's look at finding them via singular value decomposition. For background, see the article on Singular Value Decomposition (SVD).

We know that for an m\times n real matrix \boldsymbol{A} of rank r, and for 0<k<r, the matrix \boldsymbol{A} has a truncated singular value decomposition:

\boldsymbol{A_{m\times n}} \approx \boldsymbol{U}_{m\times k}\boldsymbol{D}_{k\times k}\boldsymbol{V}_{n\times k}^{T}

Define a new m\times n matrix \boldsymbol{X}^{'}:

\boldsymbol{X}^{'}=\frac{1}{\sqrt{m-1}}\boldsymbol{Y}  (\boldsymbol{Y} is the centered matrix)

So \boldsymbol{X}^{'T}\boldsymbol{X}^{'}=(\frac{1}{\sqrt{m-1}}\boldsymbol{Y})^{T}(\frac{1}{\sqrt{m-1}}\boldsymbol{Y})=\frac{1}{m-1}\boldsymbol{Y}^{T}\boldsymbol{Y}

So \boldsymbol{X}^{'T}\boldsymbol{X}^{'} is exactly the covariance matrix of \boldsymbol{X}, which we denote S_{\boldsymbol{X}}.

Principal component analysis comes down to finding the eigenvalues and corresponding unit eigenvectors of the covariance matrix S_{\boldsymbol{X}}, so the problem becomes finding the eigenvalues and corresponding unit eigenvectors of the matrix \boldsymbol{X}^{'T}\boldsymbol{X}^{'}.

Suppose the truncated singular value decomposition of \boldsymbol{X}^{'} is \boldsymbol{X}^{'}\approx \boldsymbol{U}_{m\times k}\boldsymbol{D}_{k\times k}\boldsymbol{V}_{n\times k}^{T}. Then the column vectors of \boldsymbol{V} are the first k principal components of \boldsymbol{X}. In other words, the principal components of \boldsymbol{X} can be obtained from the singular value decomposition of \boldsymbol{X}^{'}.
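Here is a minimal sketch of this SVD route, checked against a direct eigendecomposition of the covariance matrix (eigenvectors are only determined up to sign, so the comparison allows a sign flip; the data and shapes are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, k = 200, 6, 3
X = rng.standard_normal((m, n)) @ rng.standard_normal((n, n))  # correlated data

Y = X - X.mean(axis=0)                  # centered matrix
X_prime = Y / np.sqrt(m - 1)            # X' = Y / sqrt(m-1)

# SVD route: columns of V are the principal directions.
U, s, Vt = np.linalg.svd(X_prime, full_matrices=False)
V_svd = Vt[:k].T                        # first k right singular vectors

# Eigendecomposition route: top-k eigenvectors of the covariance matrix.
eigvals, eigvecs = np.linalg.eigh(X_prime.T @ X_prime)
V_eig = eigvecs[:, ::-1][:, :k]         # reorder eigenvalues to descending

# The two routes agree up to the sign of each column.
for j in range(k):
    assert np.isclose(abs(V_svd[:, j] @ V_eig[:, j]), 1.0)
```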

Conclusion

PCA is a dimensionality reduction method. Take the eigenvectors corresponding to the l largest eigenvalues of the covariance matrix \frac{1}{m-1}\boldsymbol{Y}^{T}\boldsymbol{Y} of \boldsymbol{X}, and stack these eigenvectors as columns to obtain the decoding matrix \boldsymbol{D}; the encoding matrix is then \boldsymbol{D}^{T}.

In practice, one can either compute the eigendecomposition of the covariance matrix of the original data matrix, or compute the singular value decomposition of the centered data matrix.
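Putting everything together, here is a compact sketch of PCA as an encode/decode pair (the function and variable names such as pca_fit are my own, for illustration only, not a reference implementation):

```python
import numpy as np

def pca_fit(X, l):
    """Return (D, mean): decoding matrix with the top-l principal directions as columns."""
    mean = X.mean(axis=0)
    Y = X - mean                                   # center the data
    _, _, Vt = np.linalg.svd(Y / np.sqrt(len(X) - 1), full_matrices=False)
    D = Vt[:l].T                                   # n x l, orthonormal columns
    return D, mean

def pca_encode(X, D, mean):
    return (X - mean) @ D                          # c = D^T (x - mean), row-wise

def pca_decode(C, D, mean):
    return C @ D.T + mean                          # x_hat = D c + mean

rng = np.random.default_rng(0)
X = rng.standard_normal((300, 5)) @ rng.standard_normal((5, 5))
D, mu = pca_fit(X, l=2)
X_hat = pca_decode(pca_encode(X, D, mu), D, mu)
print("reconstruction error:", np.linalg.norm(X - X_hat))
```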

Huh, it's finally over~


Source: blog.csdn.net/qq_32103261/article/details/120592736