Machine Learning Notes 09---PCA Principal Component Analysis

    In machine learning, the main purpose of dimensionality reduction for high-dimensional data is to find a suitable low-dimensional space in which learning performs better than in the original space.

    Principal Component Analysis (PCA for short) is the most commonly used dimensionality reduction method. Before introducing PCA, consider the following question: for the sample points in an orthogonal attribute space, how can a hyperplane (the high-dimensional extension of a straight line) be used to properly represent all of the samples?

    It is easy to imagine that if such a hyperplane exists, it probably has the following properties:

    Nearest reconstruction: all sample points are sufficiently close to the hyperplane

    Maximum separability: the projections of the sample points onto the hyperplane are separated as much as possible

    Interestingly, two equivalent derivations of principal component analysis can be obtained from nearest reconstruction and maximum separability, respectively.

1> Nearest reconstruction derivation:

    Assume that the data samples have been centered, i.e., Σ xi = 0. Let the new coordinate system obtained after the projection transformation be {w1, w2, ..., wd}, where each wi is an orthonormal basis vector: ||wi||^2 = 1 and wi^T wj = 0 for i ≠ j. If some coordinates in the new system are discarded, i.e., the dimension is reduced to d' < d, then the projection of the sample point xi in the low-dimensional coordinate system is zi = (zi1; zi2; ...; zid'), where zij = wj^T xi is the coordinate of xi in the j-th dimension of the low-dimensional system. If xi is reconstructed from zi, we obtain xi' = Σ zij wj.
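    As a quick illustration of this setup, here is a minimal numerical sketch (the toy data and variable names are my own, not from the note): it centers a small sample matrix, projects it onto an orthonormal basis W with d' = 2 columns, and reconstructs the samples exactly as described above.

```python
import numpy as np

# Toy sketch of the setup above: center the samples, project onto an
# orthonormal basis, and reconstruct. Data and names are illustrative only.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))            # 5 samples x_i in R^3, one per row
X = X - X.mean(axis=0)                 # centering, so that sum_i x_i = 0

# Any orthonormal basis works for the illustration; here W holds the first
# d' = 2 right singular vectors of X, so its columns are orthonormal.
_, _, Vt = np.linalg.svd(X, full_matrices=False)
W = Vt[:2].T                           # shape (3, 2), columns w_1, w_2

Z = X @ W                              # coordinates z_ij = w_j^T x_i
X_rec = Z @ W.T                        # reconstruction x_i' = sum_j z_ij w_j
print("total squared reconstruction error:", np.sum((X - X_rec) ** 2))
```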

    Considering the entire training set, the total distance between the original sample points xi and the points xi' reconstructed from their projections is:

$$\sum_{i=1}^{m}\Bigl\|\sum_{j=1}^{d'} z_{ij}w_j - x_i\Bigr\|_2^2 \;=\; \sum_{i=1}^{m} z_i^{T}z_i \;-\; 2\sum_{i=1}^{m} z_i^{T}W^{T}x_i \;+\; \mathrm{const} \;\propto\; -\operatorname{tr}\Bigl(W^{T}\Bigl(\sum_{i=1}^{m}x_ix_i^{T}\Bigr)W\Bigr)$$

 where W = (w1, w2, ..., wd). By the nearest reconstruction property this expression should be minimized; since the wj form an orthonormal basis and Σ xi xi^T is the covariance matrix, this is equivalent to:

$$\min_{W}\; -\operatorname{tr}(W^{T}XX^{T}W) \qquad \text{s.t.}\; W^{T}W = I$$

    This is the optimization objective of principal component analysis.
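    The step from the reconstruction distance to this trace objective can also be checked numerically. The sketch below (toy data, my own names) verifies that, for any W with orthonormal columns, the total squared reconstruction error equals tr(XX^T) - tr(W^T XX^T W), so minimizing the error is the same as maximizing tr(W^T XX^T W).

```python
import numpy as np

# Numerical check of the identity behind the objective (illustrative only):
# for orthonormal W, sum_i ||x_i - x_i'||^2 = tr(X X^T) - tr(W^T X X^T W).
rng = np.random.default_rng(1)
X = rng.normal(size=(4, 50))              # columns are samples: X = (x_1, ..., x_m)
X = X - X.mean(axis=1, keepdims=True)     # centering

_, _, Vt = np.linalg.svd(X.T, full_matrices=False)
W = Vt[:2].T                              # (4, 2), orthonormal columns

X_rec = W @ (W.T @ X)                     # reconstruct every column x_i
err = np.sum((X - X_rec) ** 2)
gap = np.trace(X @ X.T) - np.trace(W.T @ X @ X.T @ W)
print(np.isclose(err, gap))               # True up to floating-point error
```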

2> Maximum separability derivation:

    Starting from maximum separability, another interpretation of principal component analysis can be obtained. The projection of the sample point xi onto the hyperplane in the new space is W^T xi. If the projections of all sample points are to be separated as much as possible, the variance of the projected points should be maximized.

     The covariance matrix of the projected sample points is Σ W^T xi xi^T W, so the optimization objective can be written as:

$$\max_{W}\; \operatorname{tr}(W^{T}XX^{T}W) \qquad \text{s.t.}\; W^{T}W = I$$

 Clearly, the objectives derived from nearest reconstruction and maximum separability are equivalent.
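    A small numerical illustration of the maximization view (toy data, my own names): the projected variance tr(W^T XX^T W) obtained with the leading eigenvectors of XX^T is at least as large as the variance obtained with any other orthonormal basis, such as a random one.

```python
import numpy as np

# Sketch: the trace of the projected covariance, tr(W^T X X^T W), is maximized
# by the leading eigenvectors of X X^T (compared here with a random basis).
rng = np.random.default_rng(2)
X = rng.normal(size=(5, 200)) * np.array([[3.0], [2.0], [1.0], [0.5], [0.1]])
X = X - X.mean(axis=1, keepdims=True)     # centered samples as columns

C = X @ X.T                               # (scaled) covariance matrix
eigvals, eigvecs = np.linalg.eigh(C)      # eigenvalues in ascending order
W_pca = eigvecs[:, ::-1][:, :2]           # top-2 eigenvectors as columns

Q, _ = np.linalg.qr(rng.normal(size=(5, 2)))  # a random orthonormal comparison
print(np.trace(W_pca.T @ C @ W_pca) >= np.trace(Q.T @ C @ Q))  # True
```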

    Applying the Lagrange multiplier method to the constrained objective above and setting the derivative with respect to wi to zero yields:

$$XX^{T}w_i = \lambda_i w_i$$

 Therefore, it suffices to perform an eigenvalue decomposition of the covariance matrix XX^T, sort the resulting eigenvalues as λ1 ≥ λ2 ≥ ... ≥ λd, and take the eigenvectors corresponding to the d' largest eigenvalues to form W* = (w1, w2, ..., wd'). This is the solution of principal component analysis. The PCA algorithm can be described as follows:

Input: sample set D = {x1, x2, ..., xm};
       dimension d' of the low-dimensional space

Procedure:
1. Center all samples: xi <- xi - (Σ xi)/m
2. Compute the covariance matrix XX^T of the samples
3. Perform eigenvalue decomposition on the covariance matrix XX^T
4. Take the eigenvectors w1, w2, ..., wd' corresponding to the d' largest eigenvalues

Output: the projection matrix W* = (w1, w2, ..., wd')
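    Below is a minimal sketch of the four steps above in code (the function name pca and the toy data are assumptions of mine, not part of the note); samples are stored as the columns of X so that the covariance matrix is XX^T as in the text.

```python
import numpy as np

def pca(X, d_prime):
    """Minimal sketch of the algorithm above; the columns of X are the samples."""
    # 1. center all samples: xi <- xi - (sum_i xi)/m
    mean = X.mean(axis=1, keepdims=True)
    Xc = X - mean
    # 2. compute the covariance matrix X X^T of the samples
    C = Xc @ Xc.T
    # 3. eigenvalue decomposition of the covariance matrix
    eigvals, eigvecs = np.linalg.eigh(C)        # ascending order
    order = np.argsort(eigvals)[::-1]           # sort eigenvalues descending
    # 4. take the eigenvectors of the d' largest eigenvalues
    W = eigvecs[:, order[:d_prime]]
    return W, mean                              # projection matrix W* and the mean

# usage on random data: 3 features, 100 samples, reduced to d' = 2
X = np.random.default_rng(3).normal(size=(3, 100))
W, mean = pca(X, d_prime=2)
Z = W.T @ (X - mean)                            # low-dimensional coordinates
print(W.shape, Z.shape)                         # (3, 2) (2, 100)
```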

    The dimension d' of the low-dimensional space is usually specified by the user in advance, or selected by cross-validation: train a k-nearest-neighbor classifier (or another inexpensive learner) in low-dimensional spaces obtained with different values of d' and keep the value that performs best. For PCA, d' can also be set from the reconstruction perspective by choosing a reconstruction threshold, for example t = 95%, and then taking the smallest d' for which:

$$\frac{\sum_{i=1}^{d'}\lambda_i}{\sum_{i=1}^{d}\lambda_i} \;\ge\; t$$
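    A sketch of that rule in code (the helper name choose_d_prime and the example eigenvalues are mine): given the eigenvalues sorted in descending order, return the smallest d' whose cumulative share reaches the threshold t.

```python
import numpy as np

# Illustrative helper: smallest d' such that the first d' eigenvalues account
# for at least a fraction t of the total, as in the inequality above.
def choose_d_prime(eigvals_desc, t=0.95):
    ratios = np.cumsum(eigvals_desc) / np.sum(eigvals_desc)
    return int(np.searchsorted(ratios, t) + 1)

print(choose_d_prime(np.array([5.0, 3.0, 1.0, 0.5, 0.5]), t=0.95))  # -> 4
```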

     PCA only needs to retain W* and the sample mean vector in order to project new samples into the low-dimensional space through a simple vector subtraction and a matrix-vector multiplication. Obviously, the low-dimensional space differs from the original high-dimensional space, because the eigenvectors corresponding to the d - d' smallest eigenvalues are discarded; this is exactly the effect of dimensionality reduction. Discarding this information is often necessary, however: on the one hand, it increases the sampling density of the samples, which is an important motivation for dimensionality reduction; on the other hand, when the data are affected by noise, the eigenvectors corresponding to the smallest eigenvalues are often related to the noise, so discarding them achieves a denoising effect to some extent.
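    A short sketch of projecting a new sample with the retained W* and mean vector (toy data and names are mine): centering is a vector subtraction and the projection is a matrix-vector multiplication, as stated above.

```python
import numpy as np

# Sketch: keep only W* and the mean; a new sample is mapped to the
# low-dimensional space by subtracting the mean and multiplying by W*^T.
rng = np.random.default_rng(4)
X = rng.normal(size=(3, 100))                   # training samples as columns
mean = X.mean(axis=1, keepdims=True)
eigvals, eigvecs = np.linalg.eigh((X - mean) @ (X - mean).T)
W = eigvecs[:, np.argsort(eigvals)[::-1][:2]]   # W*: top-2 eigenvectors

x_new = rng.normal(size=(3, 1))                 # a previously unseen sample
z_new = W.T @ (x_new - mean)                    # its d' = 2 dimensional coordinates
print(z_new.ravel())
```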

Refer to Zhou Zhihua's "Machine Learning"

Origin blog.csdn.net/m0_64007201/article/details/127599171