Principal component analysis \ 2.28

Principal component analysis (PCA)

1. Concept

Principal component analysis is a mathematical method for dimensionality reduction. It recombines a large number of correlated original variables into a new, smaller set of uncorrelated composite variables (the principal components), which then replace the original variables.

2. Uses and classification

Principal component analysis is used to reduce the number of variables in a problem and so make the problem easier to solve. PCA can also quantify how important each principal component is to the problem, as the share of the total variance it accounts for (its contribution rate).

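There are currently two main methods. One is based on the eigenvalue decomposition of the covariance matrix, and the other is based on singular value decomposition (SVD). In practice the SVD is often applied directly to the zero-centered data matrix, which avoids forming the covariance matrix explicitly.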
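To make the relationship between the two methods concrete, here is a minimal NumPy sketch. It assumes the SVD is applied to the zero-centered data matrix, and that the data is laid out with variables in rows and observations in columns; the random data, the seed and all variable names are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)            # fixed seed, made-up data
X = rng.normal(size=(3, 50))              # 3 variables (rows), 50 observations (columns)
Xc = X - X.mean(axis=1, keepdims=True)    # zero-center each variable
m = Xc.shape[1]

# Method 1: eigenvalue decomposition of the covariance matrix.
C = Xc @ Xc.T / m
eigvals, eigvecs = np.linalg.eigh(C)      # eigh returns eigenvalues in ascending order
eigvals = eigvals[::-1]                   # reorder to descending
eigvecs = eigvecs[:, ::-1]

# Method 2: SVD of the centered data; the covariance matrix is never formed.
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
svd_vals = s**2 / m                       # squared singular values / m

# Both routes should agree: same variances, same directions up to sign.
same_variances = np.allclose(eigvals, svd_vals)
same_directions = np.allclose(np.abs(np.sum(U * eigvecs, axis=0)), 1.0)
print(same_variances, same_directions)
```

The columns of `U` play the role of the eigenvectors. Their signs may differ from those returned by `eigh`, which is why the comparison uses absolute values.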

3. Steps of PCA by eigenvalue decomposition of the covariance matrix

The following walks through the eigenvalue decomposition approach.

Suppose there are n variables, each with m observations.
1) Arrange the original data into a matrix X with n rows and m columns

2) Zero-center each row of X (each row represents one attribute field), that is, subtract the mean of that row from every entry in it

3) Find the covariance matrix $C = \frac{1}{m} X X^\mathsf{T}$

4) Find the eigenvalues and corresponding eigenvectors of the covariance matrix

5) Sort the eigenvectors by their eigenvalues from largest to smallest, arrange them as the rows of a matrix from top to bottom, and take the first k rows to form a matrix P

6) $Y = PX$ is the data after dimensionality reduction to k dimensions
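The six steps above can be sketched in NumPy roughly as follows. This is only an illustration under stated assumptions: the function name `pca_eig` and the small data set are made up, and `numpy.linalg.eigh` is used in place of a general eigenvalue routine because the covariance matrix is symmetric.

```python
import numpy as np

def pca_eig(X, k):
    """Reduce an (n variables x m observations) matrix X to k dimensions."""
    X = np.asarray(X, dtype=float)
    m = X.shape[1]
    # Step 2: zero-center each row (one row per variable).
    Xc = X - X.mean(axis=1, keepdims=True)
    # Step 3: covariance matrix C = (1/m) X X^T, shape (n, n).
    C = Xc @ Xc.T / m
    # Step 4: eigenvalues and eigenvectors (eigenvectors are the columns).
    eigvals, eigvecs = np.linalg.eigh(C)
    # Step 5: sort by eigenvalue, largest first, put the eigenvectors
    # in rows and keep the first k rows as P.
    order = np.argsort(eigvals)[::-1]
    P = eigvecs[:, order].T[:k]
    # Step 6: project the centered data onto the k directions.
    Y = P @ Xc
    return Y, P, eigvals[order]

# Made-up data: 2 variables (rows), 5 observations (columns).
X = np.array([[-1.0, -1.0, 0.0, 2.0, 0.0],
              [-2.0,  0.0, 0.0, 1.0, 1.0]])
Y, P, eigvals = pca_eig(X, k=1)
print(eigvals)   # approximately [2.0, 0.4]
print(Y.shape)   # (1, 5)
```

For this data the covariance matrix is [[1.2, 0.8], [0.8, 1.2]], so the first principal direction is (1, 1)/√2 up to sign. The sign of each eigenvector is arbitrary, so `Y` may come out negated from one environment to another.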

4. Explanation of the steps

1) Mean-centering: the average value of the observations is subtracted from every observation.
Significance: when there are many data samples, centering translates the data so that its mean lies at the origin, which simplifies the subsequent operations and calculations.
2) Each eigenvector of the covariance matrix gives a direction in the data, and its eigenvalue gives the variance of the data along that direction. The eigenvector with the largest eigenvalue is the direction in which the data varies the most. Both can be obtained with the MATLAB eig function.
3) The larger the eigenvalue, the more important the corresponding direction and the more of the data's features it represents. The original intention of PCA is to use a few dimensions whose pairwise covariance is 0 to represent all of the correlated variables. Taking the first k rows therefore selects exactly those dimensions.
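As a rough illustration of points 2) and 3), the sketch below computes each component's contribution rate (its eigenvalue divided by the sum of all eigenvalues), picks the smallest k that reaches a cumulative threshold, and checks that the projected dimensions have covariance 0 with each other. The data and the 80% threshold are made up, and `numpy.linalg.eigh` stands in for the MATLAB eig function.

```python
import numpy as np

# Made-up data: 2 variables (rows), 5 observations (columns).
X = np.array([[-1.0, -1.0, 0.0, 2.0, 0.0],
              [-2.0,  0.0, 0.0, 1.0, 1.0]])
Xc = X - X.mean(axis=1, keepdims=True)    # mean-centering
m = Xc.shape[1]
C = Xc @ Xc.T / m                         # covariance matrix

eigvals, eigvecs = np.linalg.eigh(C)
order = np.argsort(eigvals)[::-1]         # largest eigenvalue first
eigvals = eigvals[order]
P = eigvecs[:, order].T                   # all eigenvectors, one per row

# Contribution rate of each component and the running total.
contribution = eigvals / eigvals.sum()
cumulative = np.cumsum(contribution)

threshold = 0.80                          # arbitrary cut-off for this example
k = int(np.searchsorted(cumulative, threshold)) + 1

# Covariance of the projected data: off-diagonal entries should be ~0.
Y = P @ Xc
C_Y = Y @ Y.T / m
print(contribution, k)
print(np.round(C_Y, 6))
```

Here the first component alone accounts for about 83% of the total variance, so k = 1 already passes the threshold. `C_Y` is diagonal with the eigenvalues on its diagonal, which is the "covariance of 0" property described above.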

After a day of searching the literature, I finally came to understand PCA. It involves linear algebra, MATLAB operations, and an understanding of the dot product, so it draws on many skills.

5. References

Usage of the eig function
The mathematical principles of PCA

Origin blog.csdn.net/qq_17567367/article/details/114232405