PCA for machine learning dimensionality reduction (Python code + data)

PCA for Machine Learning

This article is divided into five parts:

  • Application background
  • Design ideas
  • Worked example
  • Summary
  • Appendix

1. Application background

Principal Component Analysis (PCA for short) is a data dimensionality reduction technique used for data preprocessing. The raw data we obtain is usually very high dimensional, say 1000 features, and those 1000 features may contain a lot of useless information or noise, with perhaps only 100 of them being genuinely useful. PCA can then convert the 1000 features into 100 features, which not only removes the useless noise but also saves a great deal of computation.
In PCA, the data is transformed from the original coordinate system into a new one. The new coordinate system is not chosen arbitrarily; it is determined by the data itself. Usually the first new axis is the direction with the largest variance in the original data, and the second axis is the direction orthogonal to the first that has the largest remaining variance. In other words, the second direction should be only weakly correlated with the first.
In short:

  • Dimensionality reduction alleviates the curse of dimensionality
  • Dimensionality reduction compresses the data while minimizing the loss of information
  • Data with hundreds of dimensions is hard to make sense of, while two- or three-dimensional data is easy to understand through visualization
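
As a quick illustration of this idea (a minimal sketch, not from the original post: it assumes scikit-learn is installed and uses randomly generated data purely to show the shapes), a 1000-feature dataset can be compressed to 100 components in a couple of lines:

import numpy as np
from sklearn.decomposition import PCA

# Hypothetical data: 500 samples with 1000 features, most of which are noise
rng = np.random.RandomState(0)
X = rng.normal(size=(500, 1000))

pca = PCA(n_components=100)       # keep 100 principal components
X_reduced = pca.fit_transform(X)  # shape (500, 100)
print(X_reduced.shape)

The hand-written version of the same procedure, built from the covariance matrix step by step, is given in Appendix 2.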

2. Design ideas

PCA (principal component analysis) is a linear dimensionality reduction method that maps a high-dimensional coordinate system onto a low-dimensional one.
How do we choose the low-dimensional coordinate system? We compute the eigenvalues and eigenvectors of the covariance matrix: the eigenvectors define the new coordinate axes, and the eigenvalues measure how far the data spreads along each new axis.
The data is then transformed into the new coordinates.
But isn't it surprising? Why are the eigenvectors of the covariance matrix with the largest eigenvalues the ideal k-dimensional basis?
Let's look at a worked example!

3. Worked example

Background knowledge

  • Variance: describes how much a single variable fluctuates:
    Var(x) = (1 / (n - 1)) * Σ (x_i - x̄)²

  • Covariance: describes how x and y vary together:
    Cov(x, y) = (1 / (n - 1)) * Σ (x_i - x̄)(y_i - ȳ)

  • Covariance matrix: note that the version shown here is three-dimensional; the project below uses only two dimensions, so there is no z term:
    C = [ Cov(x,x)  Cov(x,y)  Cov(x,z)
          Cov(y,x)  Cov(y,y)  Cov(y,z)
          Cov(z,x)  Cov(z,y)  Cov(z,z) ]
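
A small NumPy sketch (not part of the original post; the values are made up) showing these three quantities:

import numpy as np

x = np.array([2.0, 0.5, 1.5, 3.0, 2.5])  # made-up values for illustration
y = np.array([1.8, 0.6, 1.2, 2.9, 2.1])

print(np.var(x, ddof=1))          # sample variance of x: how much x fluctuates
print(np.cov(x, y)[0, 1])         # covariance of x and y: how they vary together
print(np.cov(np.vstack([x, y])))  # the full 2x2 covariance matrix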

1. Raw data preprocessing: here we reduce two-dimensional data to one dimension for display, where x and y are the two features.
[Figure: the raw two-dimensional data table]
Each row is a sample and each column is a feature (TF-IDF values in the original example). Compute the mean of x and the mean of y, then subtract the corresponding mean from every sample. Here the mean of x is 1.81 and the mean of y is 1.91, so the first sample becomes (0.69, 0.49) after centering.
[Figure: the data after subtracting the means]
Question 1: why do we subtract the mean, and what effect does it have?
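
A minimal sketch of this centering step (made-up samples; the actual table from the figure is not reproduced here):

import numpy as np

# Made-up 2-D samples: rows are samples, columns are the features x and y
X = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [2.0, 3.0]])
mean = X.mean(axis=0)    # per-feature means (1.81 and 1.91 for the article's data)
data_adjust = X - mean   # every sample now has the mean subtracted
print(data_adjust)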

2. Compute the covariance matrix

The data here is two-dimensional, so its covariance matrix has two rows and two columns.
[Figure: the 2×2 covariance matrix of the centered data]
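
Continuing the sketch (made-up centered samples), the 2×2 covariance matrix can be obtained with np.cov; note the transpose, because np.cov expects each variable in a row:

import numpy as np

data_adjust = np.array([[ 0.5,  0.4],
                        [-1.5, -1.6],
                        [ 1.0,  1.2]])   # made-up mean-centered samples
covX = np.cov(data_adjust.T)  # transpose so each row passed to np.cov is one feature
print(covX)                   # a 2x2 covariance matrix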
3. Compute the eigenvalues and eigenvectors of the covariance matrix

This step uses linear algebra; readers who are unsure of it can study it on their own. In the figure below, the first result is the eigenvalues and the second is the eigenvectors (trailing zeros are omitted from the numbers).
[Figure: the eigenvalues and eigenvectors of the covariance matrix]
The eigenvalue 0.0490833989 corresponds to the first eigenvector; all eigenvectors are normalized to unit vectors.
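
A sketch of this decomposition with a made-up 2×2 covariance matrix (not the article's values):

import numpy as np

covX = np.array([[2.0, 0.8],
                 [0.8, 0.6]])             # a made-up 2x2 covariance matrix
featValue, featVec = np.linalg.eig(covX)  # eigenvalues and eigenvectors
print(featValue)  # one eigenvalue per new axis
print(featVec)    # the eigenvectors are the *columns*, each normalized to unit length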

4. Select the eigenvectors corresponding to the largest k eigenvalues (where k is smaller than the total number of dimensions):
sort the eigenvalues from largest to smallest, pick the largest k, and use the corresponding k eigenvectors as column vectors to form the eigenvector matrix. There are only two eigenvalues here, and we choose the larger one, 1.28402771, whose eigenvector is (-0.677873399, -0.735178656)^T. Since we are reducing from two dimensions to one, k is 1 and we keep only the largest eigenvalue.
Two more questions:

  • Question 2: why pick the k largest eigenvalues? What is the justification?
  • Question 3: in practical applications, how large should k be?
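
A short sketch of this selection step (values rounded from the worked example above; the eigenvector paired with the smaller eigenvalue is not stated in the article and is filled in here by orthogonality):

import numpy as np

featValue = np.array([0.0490833989, 1.28402771])
featVec = np.array([[-0.735178656, -0.677873399],
                    [ 0.677873399, -0.735178656]])  # columns are the unit eigenvectors
index = np.argsort(-featValue)     # eigenvalue indices from largest to smallest
k = 1
selectVec = featVec[:, index[:k]]  # keep the eigenvectors of the top-k eigenvalues as columns
print(selectVec)                   # (-0.677873399, -0.735178656)^T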

5. Obtain the new matrix
Project the sample points onto the selected eigenvectors. If the number of samples is m and the number of features is n, then the mean-subtracted sample matrix DataAdjust is m×n, the covariance matrix is n×n, and the selected k eigenvectors form the matrix EigenVectors of size n×k. The projected data FinalData is then
FinalData (10×1) = DataAdjust (10×2 matrix) × eigenvector (-0.677873399, -0.735178656)^T
The result obtained is:
[Figure: the one-dimensional data after projection]
This completes the dimensionality reduction shown in the figure while staying faithful to the original data. Below we give Python code to verify this and to see what happens if you do not use PCA.
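
A sketch of this projection step (the first row is the centered sample (0.69, 0.49) from the article; the second row is made up for illustration):

import numpy as np

data_adjust = np.array([[ 0.69,  0.49],
                        [-0.70, -0.50]])
selectVec = np.array([[-0.677873399],
                      [-0.735178656]])   # the chosen eigenvector as an (n, k) = (2, 1) matrix
finalData = data_adjust @ selectVec      # (m, 2) x (2, 1) -> (m, 1): one coordinate per sample
print(finalData)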


Answering question 1: why the mean is subtracted from the data

Simply put, centering the data reduces the possibility of overfitting. Readers who want to explore the deeper reason can refer to Appendix 1.


Answering question 2: why is it better to choose the k directions with the largest eigenvalues?

Maximum variance theory, a classical result: the best k-dimensional representation is the one for which, after the n-dimensional sample points are mapped to k dimensions, the sample variance along each retained dimension is as large as possible.
Why?
Because when the variance between the projected sample points is largest (equivalently, when the sum of the absolute values of the projections is largest), the least information is lost. Consider reducing two dimensions to one: if the variance along the chosen axis is small, the projected points pile up on that axis, a lot of information is lost, and the discarded direction may be important. Look at figures 4 and 5 below (in both, the axes pass through the origin and the data has been centered). The left half of figure 4 is the better choice: u is the principal direction, and the projection of a point onto u is determined by u and the direction orthogonal to u. Clearly the projections on the left of figure 4 are more spread out than on the right, i.e. the variance between the projected points is larger. Figure 5 is a detailed view of a single point from figure 4.
[Figure 4: sample points projected onto two candidate directions u]
[Figure 5: detailed view of the projection of a single point onto u]
Let’s summarize:

We have computed eigenvalues countless times in linear algebra. Decomposing an n×n symmetric matrix yields its eigenvalues and eigenvectors, which form n orthogonal n-dimensional basis vectors, each paired with one eigenvalue. When the data is projected onto these n basis vectors, the magnitude of each eigenvalue measures how far the data spreads along the corresponding basis direction.
The larger the eigenvalue, the greater the variance of the data along the corresponding eigenvector, the more spread out the sample points are, the easier they are to distinguish, and the more information that direction carries. Therefore the eigenvector with the largest eigenvalue points in the most informative direction. If some eigenvalues are small, there is very little information in those directions, so we can discard the data along the small-eigenvalue directions and keep only the data along the large-eigenvalue directions. After doing this the amount of data is reduced, but most of the useful information is retained. This is exactly the principle behind PCA.
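
A small numerical illustration of the maximum variance argument (a sketch with made-up, centered 2-D data; u1 and u2 are two candidate projection directions, not taken from the figures):

import numpy as np

rng = np.random.RandomState(0)
# Made-up centered 2-D samples, stretched along the diagonal direction
X = rng.multivariate_normal([0.0, 0.0], [[3.0, 2.0], [2.0, 2.0]], size=500)

u1 = np.array([1.0, 1.0]) / np.sqrt(2)   # direction close to the main spread of the data
u2 = np.array([1.0, -1.0]) / np.sqrt(2)  # the orthogonal direction

print(np.var(X @ u1, ddof=1))  # large variance: the projected points stay well separated
print(np.var(X @ u2, ddof=1))  # small variance: the projected points pile up, information is lost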


Answering question 3: out of n dimensions, how many (k) should we keep?
The more eigenvectors k we keep, the larger the entropy, the more of the sample's variability is preserved, and the closer the representation is to the real data. But if k is too large, the dimensionality-reduction effect we are after is lost. So the choice is largely empirical. Some work suggests keeping enough principal axes that their total length accounts for roughly 85% of the sum of all principal-axis lengths; this is only a rule of thumb, and the actual choice depends on the situation. The criterion is shown below. Note: n is the total number of dimensions (eigenvalues) and k is the number we keep.
(λ1 + λ2 + … + λk) / (λ1 + λ2 + … + λn) ≥ 85%
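
A sketch of this 85% rule of thumb with made-up eigenvalues (not the article's values):

import numpy as np

featValue = np.array([4.2, 2.4, 0.3, 0.1])  # made-up eigenvalues, already sorted from largest to smallest
ratio = np.cumsum(featValue) / np.sum(featValue)
k = int(np.searchsorted(ratio, 0.85)) + 1   # smallest k whose cumulative share reaches 85%
print(k, ratio)                             # here k = 2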

4. Summary

Review of the steps

  1. Subtract the mean
  2. Compute the covariance matrix
  3. Compute the eigenvalues and eigenvectors of the covariance matrix
  4. Sort the eigenvalues from largest to smallest
  5. Keep the eigenvectors corresponding to the top k eigenvalues
  6. Transform the data into the new space spanned by these k eigenvectors
  7. (m×n) data matrix × (n×k) eigenvector matrix = (m×k) reduced matrix

Dimensionality reduction is complete!

Appendix 1:
To explain what the principal components of a dataset are, let's start from data dimensionality reduction. What is data dimensionality reduction? Suppose there is a set of points in three-dimensional space that all lie on a plane passing through the origin. If we represent this data with the natural coordinate axes x, y, z, we need three dimensions; yet the points actually lie on a two-dimensional plane. What does that suggest? If we rotate the x, y, z coordinate system so that the plane containing the data coincides with the x-y plane, and call the rotated axes x', y', z', then the data can be expressed using only x' and y', that is, in two dimensions! Of course, to recover the original representation we must also store the transformation matrix between the two coordinate systems. In this way the dimensionality of the data has been reduced. The figure below illustrates this process.
[Figure: points in three-dimensional space lying on a plane through the origin]
However, we should look at the essence of this process. If these data points are arranged as the rows or columns of a matrix, the rank of that matrix is 2! The points are linearly dependent (the corresponding homogeneous system has non-zero solutions): a maximal linearly independent set of the vectors through the origin formed by these points contains only 2 vectors, and this requires the plane to pass through the origin. That is the reason for centering the data: moving the coordinate origin to the centre of the data makes points that were previously unrelated become linearly dependent in the new coordinate system. Interestingly, any three points are coplanar with their centroid, which means any three points in three-dimensional space are linearly dependent after centering. More generally, n points in n-dimensional space can always be analysed in an (n-1)-dimensional subspace after centering! A linear subspace of codimension one in n-dimensional Euclidean space is (n-1)-dimensional; this is the generalization of lines in the plane and planes in space.
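
A tiny check of this claim (made-up points, not taken from the appendix figure): after centering, three points in 3-D space always have matrix rank at most 2.

import numpy as np

# Three made-up points in three-dimensional space
P = np.array([[1.0, 2.0, 3.0],
              [4.0, 0.0, 1.0],
              [2.0, 5.0, 2.0]])
centered = P - P.mean(axis=0)            # move the origin to the centroid of the points
print(np.linalg.matrix_rank(P))          # 3: before centering the points are generally not coplanar with the origin
print(np.linalg.matrix_rank(centered))   # 2: after centering the points span at most a plane through the origin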

Appendix 2: Python code and data

Note: the Python code compares, in a two-dimensional plot, the four-dimensional data after PCA dimensionality reduction (red) with the data reconstructed back into the original space (blue).

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt  # 2-D plotting library

# Compute the mean; the input should be a NumPy matrix whose rows are samples and whose columns are features
def meanX(dataX):
    return np.mean(dataX, axis=0)  # axis=0 takes the mean of each column (each feature)
def pca(XMat, k):
    average = meanX(XMat)
    m, n = np.shape(XMat)
    data_adjust = []
    avgs = np.tile(average, (m, 1))
    data_adjust = XMat - avgs
    covX = np.cov(data_adjust.T)   # compute the covariance matrix
    featValue, featVec = np.linalg.eig(covX)  # eigenvalues and eigenvectors of the covariance matrix
    index = np.argsort(-featValue)  # indices that sort featValue from largest to smallest
    finalData = []
    if k > n:
        print ("k must lower than feature number")
        return
    else:
        # Note: the eigenvectors are column vectors, while for a 2-D NumPy array a, a[i] selects a row,
        selectVec = np.matrix(featVec.T[index[:k]])  # so we transpose before picking the top-k vectors
        finalData = data_adjust * selectVec.T
        reconData = (finalData * selectVec) + average
    return finalData, reconData
# Each line of the input file contains space-separated values
def loaddata(datafile):
    return np.array(pd.read_csv(datafile, sep=" ", header=None)).astype(float)
def plotBestFit(data1, data2):
    dataArr1 = np.array(data1)
    dataArr2 = np.array(data2)

    m = np.shape(dataArr1)[0]
    axis_x1 = []
    axis_y1 = []
    axis_x2 = []
    axis_y2 = []
    for i in range(m):
        axis_x1.append(dataArr1[i,0])
        axis_y1.append(dataArr1[i,1])
        axis_x2.append(dataArr2[i,0])
        axis_y2.append(dataArr2[i,1])
    fig = plt.figure()
    ax = fig.add_subplot(111)
    ax.scatter(axis_x1, axis_y1, s=50, c='red', marker='s')
    ax.scatter(axis_x2, axis_y2, s=50, c='blue')
    plt.xlabel('x1'); plt.ylabel('x2');
    plt.savefig("outfile.png")
    plt.show()
# Run PCA on the dataset data.txt
def main():
    datafile = "data.txt"
    XMat = loaddata(datafile)
    k = 2
    return pca(XMat, k)
if __name__ == "__main__":
    finalData, reconMat = main()
    plotBestFit(finalData, reconMat)

Data: the data.txt file. Note: use a relative path and put the data file in the same directory as your .py script.

5.1 3.5 1.4 0.2
4.9 3.0 1.4 0.2
4.7 3.2 1.3 0.2
4.6 3.1 1.5 0.2
5.0 3.6 1.4 0.2
5.4 3.9 1.7 0.4
4.6 3.4 1.4 0.3
5.0 3.4 1.5 0.2
4.4 2.9 1.4 0.2
4.9 3.1 1.5 0.1
5.4 3.7 1.5 0.2
4.8 3.4 1.6 0.2
4.8 3.0 1.4 0.1
4.3 3.0 1.1 0.1
5.8 4.0 1.2 0.2
5.7 4.4 1.5 0.4
5.4 3.9 1.3 0.4
5.1 3.5 1.4 0.3
5.7 3.8 1.7 0.3
5.1 3.8 1.5 0.3
5.4 3.4 1.7 0.2
5.1 3.7 1.5 0.4
4.6 3.6 1.0 0.2
5.1 3.3 1.7 0.5
4.8 3.4 1.9 0.2
5.0 3.0 1.6 0.2
5.0 3.4 1.6 0.4
5.2 3.5 1.5 0.2
5.2 3.4 1.4 0.2
4.7 3.2 1.6 0.2
4.8 3.1 1.6 0.2
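
As an optional cross-check (a sketch, assuming scikit-learn is available; it reuses loaddata and pca from the code above), the projection produced by pca() should match scikit-learn's PCA up to the sign of each component:

from sklearn.decomposition import PCA

X = loaddata("data.txt")
finalData, _ = pca(X, 2)
sk = PCA(n_components=2).fit_transform(X)
# Each principal component is determined only up to its sign, so compare absolute values
print(np.allclose(np.abs(np.asarray(finalData)), np.abs(sk), atol=1e-6))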

End!
Reference blog 1: https://www.jianshu.com/p/f9c6b36395f6
Reference blog code: https://blog.csdn.net/Dream_angel_Z/article/details/50760130
