Principal Component Analysis (PCA)

1. Introduction to PCA

Principal component analysis (PCA) is a statistical method: a set of possibly correlated variables is transformed, through an orthogonal transformation, into a set of linearly uncorrelated variables, and the transformed variables are called the principal components.

2. Background of PCA

In many fields of research and application, it is often necessary to observe a large number of variables describing the objects of study and to collect a large amount of data for analysis in order to find patterns. Large multivariate samples undoubtedly provide rich information, but they also increase the workload of data collection to some extent. More importantly, in most cases many of the variables are correlated with one another, which increases the complexity of the analysis and makes it inconvenient. Analyzing each indicator separately makes the analysis isolated rather than comprehensive, while blindly discarding indicators loses a lot of information and can easily lead to wrong conclusions.

Therefore, a reasonable method is needed that reduces the number of indicators to be analyzed while minimizing the loss of the information contained in the original indicators, so that a comprehensive analysis of the collected data is still possible. Since the variables are correlated to some degree, it is possible to summarize the information present in the original variables with a smaller number of composite indicators. Principal component analysis and factor analysis both belong to this class of dimensionality reduction methods.

3. Derivation of PCA

The principal components are the "main components" of the data. For a given data set, we generally care about the parts of the data that vary: the more variation, the more information we obtain, while the parts that hardly change carry very little information. The directions in which the data vary the most therefore constitute the principal components of the data.

To explain what the principal components of data are, let us start from data dimensionality reduction. What does dimensionality reduction mean? Suppose there is a set of points in three-dimensional space, and these points lie on an inclined plane passing through the origin. If we represent the data in the natural coordinate system with axes x, y, z, three dimensions are needed, yet in fact the points are distributed only on a two-dimensional plane. So where does the redundancy come from? If we rotate the x, y, z coordinate system so that the plane containing the data coincides with the x-y plane, and denote the rotated axes by x', y', z', then the data can be represented using only the two coordinates x' and y'. Of course, to recover the original representation we must also store the transformation matrix between the two coordinate systems. This is how the dimension of the data is reduced. The essence of this process is worth noting: if the data points are arranged as the rows (or columns) of a matrix, that matrix has rank 2. The data are linearly dependent, and the largest linearly independent set of vectors (from the origin) formed by these points contains only two vectors, which is exactly why we assumed at the beginning that the plane passes through the origin. What if the plane does not pass through the origin? That is what data centering is for: translate the origin of the coordinate system to the centroid of the data, so that data which originally did not lie in a subspace through the origin now does. As a side note, any three points are coplanar, so any three points in three-dimensional space are linearly dependent after centering; more generally, after centering, n points in n-dimensional space always lie in a subspace of dimension at most n-1.
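
The following is a minimal numpy sketch of this rank argument (the plane z = x + 2y, the shift vector and the random points are arbitrary choices for illustration, not part of the original text):

import numpy as np

# Hypothetical example: points on the plane z = x + 2*y, which passes through the origin.
rng = np.random.default_rng(0)
xy = rng.normal(size=(10, 2))
points = np.column_stack([xy, xy[:, 0] + 2 * xy[:, 1]])       # each row is a 3-D point
print(np.linalg.matrix_rank(points))                          # 2: the data matrix has rank 2

# If the plane does not pass through the origin, the rank argument only works after centering.
shifted = points + np.array([5.0, 5.0, 5.0])                  # plane no longer through the origin
print(np.linalg.matrix_rank(shifted))                         # 3 before centering
print(np.linalg.matrix_rank(shifted - shifted.mean(axis=0)))  # 2 again after centering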

In the previous paragraph, nothing was discarded when reducing the dimension, because the component of every data point along the third axis, perpendicular to the plane, is exactly 0. Now suppose the data have a small jitter along the z' axis. We can still use the two-dimensional representation above, on the grounds that the information along the x' and y' axes constitutes the principal part of the data and is sufficient for our analysis. The jitter along z' is most likely noise: the data were originally perfectly correlated, and the noise makes the correlation imperfect, but the spread of the data along the z' axis is very small compared with the spread within the x'-y' plane, so the data remain strongly correlated. Taking all of this into account, the projection of the data onto the x' and y' axes can be regarded as the principal components of the data.
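
A short numpy sketch of this situation (again a hypothetical example, with the noise level chosen arbitrarily): the eigenvalues of the covariance matrix show two directions carrying almost all of the variance and one near-zero direction that can be dropped.

import numpy as np

rng = np.random.default_rng(1)
xy = rng.normal(size=(200, 2))
z = xy[:, 0] + 2 * xy[:, 1] + rng.normal(scale=0.01, size=200)   # small jitter along z'
data = np.column_stack([xy, z])

# Two eigenvalues of order 1 and one tiny eigenvalue: the tiny one is the noise direction.
evals = np.linalg.eigvalsh(np.cov(data, rowvar=False))
print(np.sort(evals)[::-1])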

The idea of PCA is to map n-dimensional features to k dimensions (k < n), where the k new features are mutually orthogonal. These k-dimensional features are called principal components; they are reconstructed features, not simply the result of deleting n-k of the original n features.

For the main mathematical derivation of PCA, refer to the following article:

Click to open the link

To sum up, the main steps of PCA are as follows (a short Python sketch of these steps is given after the dimension notes below):

Suppose there are m samples of n-dimensional data.

1) Arrange the original data into a matrix X with m rows and n columns, as is conventional in MATLAB and Python (each column represents a feature)

2) Zero-mean each column of X (each column represents one attribute/feature), that is, subtract the column's mean from it

3) Compute the covariance matrix of the zero-mean data

4) Compute the eigenvalues of the covariance matrix and the corresponding eigenvectors

5) Arrange the eigenvectors obtained in the previous step as the columns of a matrix V, ordered by decreasing eigenvalue

6) Compute Y=XV, then take the first k columns of Y to obtain Y', which is the required dimension-reduced matrix


Here is a note on the dimensions of the matrices above, to aid understanding and to avoid dimension errors when multiplying matrices in code:

Matrix X: m×n (m samples of n-dimensional data)

Matrix V: n×n

Matrix Y: m×n (m samples of n-dimensional data)

Matrix Y': m×k (m samples of k-dimensional data)
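
The following is a minimal numpy sketch of steps 1) to 6), using the same 8×4 sample matrix as the simulations in the next section (the signs of individual columns may differ between implementations, since an eigenvector and its negative are equally valid):

import numpy as np

def pca_steps(X, k):
    # X is m x n with one sample per row; keep the first k principal components.
    Xc = X - X.mean(axis=0)                # 2) zero-mean each column
    C = np.cov(Xc, rowvar=False)           # 3) covariance matrix, n x n
    eigvals, V = np.linalg.eigh(C)         # 4) eigenvalues and eigenvectors
    order = np.argsort(eigvals)[::-1]      # 5) order eigenvectors by decreasing eigenvalue
    V = V[:, order]
    Y = Xc @ V                             # 6) Y = XV, m x n
    return Y[:, :k]                        # first k columns of Y, i.e. Y', m x k

X = np.array([[1, 2, 1, 1], [3, 3, 1, 2], [3, 5, 4, 3], [5, 4, 5, 4],
              [5, 6, 1, 5], [6, 5, 2, 6], [8, 7, 1, 2], [9, 8, 3, 7]], dtype=float)
print(pca_steps(X, k=2))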

4. Simulation of the PCA algorithm

1. Using MATLAB's built-in pca function
The code is as follows:
k=2; % parameter: reduce the samples to k dimensions  
X=[1 2 1 1; % sample matrix  
      3 3 1 2;   
      3 5 4 3;   
      5 4 5 4;  
      5 6 1 5;   
      6 5 2 6;  
      8 7 1 2;  
      9 8 3 7];  
[COEFF,SCORE,latent]=pca(X)
pcaData1=SCORE(:,1:k)

result:

COEFF =

    0.7084   -0.2826   -0.2766   -0.5846
    0.5157   -0.2114   -0.1776    0.8111
    0.0894    0.7882   -0.6086    0.0153
    0.4735    0.5041    0.7222   -0.0116


SCORE =

   -5.7947   -0.6071    0.4140   -0.0823
   -3.3886   -0.8795    0.4054   -0.4519
   -1.6155    1.5665   -1.0535    1.2047
   -0.1513    2.5051   -1.3157   -0.7718
    0.9958   -0.5665    1.4859    0.7775
    1.7515    0.6546    1.5004   -0.6144
    2.2162   -3.1381   -1.6879   -0.1305
    5.9867    0.4650    0.2514    0.0689


latent =

   13.2151
    2.9550
    1.5069
    0.4660


pcaData1 =

   -5.7947   -0.6071
   -3.3886   -0.8795
   -1.6155    1.5665
   -0.1513    2.5051
    0.9958   -0.5665
    1.7515    0.6546
    2.2162   -3.1381
    5.9867    0.4650
Here SCORE contains the principal component scores (the data projected onto the principal components, one component per column, ordered by decreasing variance), latent contains the variance of each principal component (the eigenvalues of the covariance matrix), COEFF contains the corresponding coefficients (loadings), and pcaData1 is the first two principal components that were extracted.
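
For comparison, here is a small Python sketch (an illustrative mapping, not MATLAB code) of how analogous quantities can be obtained with scikit-learn on the same sample matrix:

import numpy as np
from sklearn.decomposition import PCA

X = np.array([[1, 2, 1, 1], [3, 3, 1, 2], [3, 5, 4, 3], [5, 4, 5, 4],
              [5, 6, 1, 5], [6, 5, 2, 6], [8, 7, 1, 2], [9, 8, 3, 7]], dtype=float)

p = PCA()                        # keep all components
score = p.fit_transform(X)       # analogue of SCORE (column signs may be flipped)
print(p.components_.T)           # analogue of COEFF: one principal direction per column
print(p.explained_variance_)     # analogue of latent: variance of each principal component
print(score[:, :2])              # analogue of pcaData1: the first two principal components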


2. Writing the PCA function yourself in MATLAB

The method has already been explained above; the code is as follows:

k=2;                            % parameter: reduce the samples to k dimensions  
X=[1 2 1 1;                     % sample matrix  
      3 3 1 2;   
      3 5 4 3;   
      5 4 5 4;  
      5 6 1 5;   
      6 5 2 6;  
      8 7 1 2;  
      9 8 3 7];  
[Row, Col]=size(X);  
covX=cov(X);                                  % covariance matrix of the samples (the scatter matrix divided by (n-1) gives the covariance matrix)  
[V, D]=eigs(covX);                            % eigenvalues D and eigenvectors V of the covariance matrix  
meanX=mean(X);                                % sample mean m  
tempX= repmat(meanX,Row,1);                   % subtract the sample mean m from every sample in X, then multiply by the eigenvectors V of the covariance (scatter) matrix to obtain the principal component scores  
SCORE2=(X-tempX)*fliplr(V)                    % principal components SCORE2; here V is ordered by eigenvalue from smallest to largest, so its columns need to be flipped 
pcaData2=SCORE2(:,1:k)
The results are as follows:
SCORE2 =

   -5.7947    0.6071   -0.4140    0.0823
   -3.3886    0.8795   -0.4054    0.4519
   -1.6155   -1.5665    1.0535   -1.2047
   -0.1513   -2.5051    1.3157    0.7718
    0.9958    0.5665   -1.4859   -0.7775
    1.7515   -0.6546   -1.5004    0.6144
    2.2162    3.1381    1.6879    0.1305
    5.9867   -0.4650   -0.2514   -0.0689


pcaData2 =

   -5.7947    0.6071
   -3.3886    0.8795
   -1.6155   -1.5665
   -0.1513   -2.5051
    0.9958    0.5665
    1.7515   -0.6546
    2.2162    3.1381
    5.9867   -0.4650
It can be seen that the results agree with those of MATLAB's built-in function (the signs of some columns differ, but an eigenvector and its negative describe the same principal direction).


3. Simulating PCA in Python
You can refer to the documentation of the library:
Click to open the link

It describes the usage of the PCA method in sklearn.decomposition, together with examples.

Note: many PCA implementations use the SVD instead of an eigendecomposition to obtain the diagonalization. So what is the relationship between SVD and PCA? Consider the SVD of the centered data matrix, $\textbf{X} = \textbf{U}\Sigma\textbf{V}^T$, and write the covariance matrix as $\textbf{S}_X = \frac{1}{n-1}\textbf{X}^T\textbf{X} = \frac{1}{n-1}\textbf{V}\Sigma\textbf{U}^T\textbf{U}\Sigma\textbf{V}^T = \frac{1}{n-1}\textbf{V}\Sigma^2\textbf{V}^T$. The answer is now clear: we only need to take the projection matrix $\textbf{P} = \textbf{V}^T$ to diagonalize $\textbf{S}_Y$, i.e. the columns of $\textbf{V}$ are the principal components. As a by-product we also obtain the relationship between singular values and eigenvalues: $\lambda_i = \frac{1}{n-1}s_i^2$, where $\lambda_i$ and $s_i$ are the corresponding eigenvalues of $\textbf{S}_X$ and singular values in $\Sigma$. So SVD is another algebraic formulation of PCA, and it also provides another algorithm to compute PCA. In practice, I usually use this SVD-based formulation to do PCA, because it is convenient: a single computation is enough.

The above paragraph is quoted from: https://www.zhihu.com/question/38319536/answer/131029607
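
As a quick numerical sanity check of the quoted relation (a sketch with randomly generated data, not taken from the original post): the eigenvalues of the covariance matrix of centered data equal the squared singular values of the centered data matrix divided by (n-1). Note that Method 2 in the code below applies SVD to the covariance matrix itself rather than to the data matrix; this also works because the covariance matrix is symmetric positive semi-definite, so its singular values coincide with its eigenvalues.

import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(50, 4))          # 50 samples, 4 features (arbitrary)
Xc = X - X.mean(axis=0)               # center the data
n = Xc.shape[0]

eigvals = np.sort(np.linalg.eigvalsh(np.cov(Xc, rowvar=False)))[::-1]
s = np.linalg.svd(Xc, compute_uv=False)

print(np.allclose(eigvals, s**2 / (n - 1)))   # True: lambda_i = s_i^2 / (n-1)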

The Python code is as follows:

import numpy as np
from sklearn.decomposition import PCA
import sys
# returns how many main factors (principal components) to keep
def index_lst(lst, component=0, rate=0):
  # component: number of main factors to keep
  # rate: ratio of sum(selected eigenvalues) to sum(all eigenvalues)
  # suggested range for rate: (0.8, 1)
  # if the rate parameter is used, the return value may be 0 or less than len(lst)
  if component and rate:
    print('Component and rate must choose only one!')
    sys.exit(0)
  if not component and not rate:
    print('Invalid parameter for numbers of components!')
    sys.exit(0)
  elif component:
    print('Choosing by component, components are %s......'%component)
    return component
  else:
    print('Choosing by rate, rate is %s ......'%rate)
    for i in range(1, len(lst)):
      if sum(lst[:i])/sum(lst) >= rate:
        return i
    return 0
 
def main():
  # test data
  mat = [[1,2,1,1],[3,3,1,2],[3,5,4,3],[5,4,5,4],[5,6,1,5],[6,5,2,6],[8,7,1,2],[9,8,3,7]]
   
  # simple transform of test data
  Mat = np.array(mat, dtype='float64')
  print('Before PCA transformation, data is:\n', Mat)
  print('\nMethod 1: PCA by original algorithm:')
  p,n = np.shape(Mat) # shape of Mat 
  t = np.mean(Mat, 0) # mean of each column
   
  # subtract the mean of each column
  for i in range(p):
    for j in range(n):
      Mat[i,j] = float(Mat[i,j]-t[j])
       
  # covariance Matrix
  cov_Mat = np.dot(Mat.T, Mat)/(p-1)
   
  # PCA by original algorithm
  # eigenvalues and eigenvectors of the covariance matrix (note: eigh returns eigenvalues in ascending order)
  U,V = np.linalg.eigh(cov_Mat) 
  # rearrange the eigenvalues and eigenvectors into descending order
  U = U[::-1]
  for i in range(n):
    V[i,:] = V[i,:][::-1]
  # choose eigenvalues by component or by rate (exactly one of them must be nonzero)
  Index = index_lst(U, component=2) # choose how many main factors
  if Index:
    v = V[:,:Index] # subset of Unitary matrix
  else: # improper rate choice may return Index=0
    print('Invalid rate choice.\nPlease adjust the rate.')
    print('Rate distribute follows:')
    print([sum(U[:i])/sum(U) for i in range(1, len(U)+1)])
    sys.exit(0)
  # data transformation
  T1 = np.dot(Mat, v)
  # print the transformed data
  print('We choose %d main factors.'%Index)
  print('After PCA transformation, data becomes:\n',T1)
   
  # PCA by original algorithm using SVD
  print('\nMethod 2: PCA by original algorithm using SVD:')
  # u: Unitary matrix, eigenvectors in columns 
  # d: list of the singular values, sorted in descending order
  u,d,v = np.linalg.svd(cov_Mat)
  Index = index_lst(d, rate=0.95) # choose how many main factors
  T2 = np.dot(Mat, u[:,:Index]) # transformed data
  print('We choose %d main factors.'%Index)
  print('After PCA transformation, data becomes:\n',T2)
   
  # PCA by Scikit-learn
  pca = PCA(n_components=2) # n_components can be integer or float in (0,1)
  pca.fit(mat) # fit the model
  print('\nMethod 3: PCA by Scikit-learn:')
  print('After PCA transformation, data becomes:')
  print(pca.fit_transform(mat)) # transformed data   
main()
The output is as follows:
Before PCA transformation, data is:
 [[ 1.  2.  1.  1.]
 [ 3.  3.  1.  2.]
 [ 3.  5.  4.  3.]
 [ 5.  4.  5.  4.]
 [ 5.  6.  1.  5.]
 [ 6.  5.  2.  6.]
 [ 8.  7.  1.  2.]
 [ 9.  8.  3.  7.]]

Method 1: PCA by original algorithm:
Choosing by component, components are 2......
We choose 2 main factors.
After PCA transformation, data becomes:
 [[ 5.79467821  0.60705487]
 [ 3.38863423  0.87952394]
 [ 1.61549833 -1.56652328]
 [ 0.15133075 -2.50507639]
 [-0.99576675  0.56654487]
 [-1.7515016  -0.65460481]
 [-2.21615035  3.13807448]
 [-5.98672282 -0.46499368]]

Method 2: PCA by original algorithm using SVD:
Choosing by rate, rate is 0.95 ......
We choose 3 main factors.
After PCA transformation, data becomes:
 [[ 5.79467821  0.60705487  0.41402357]
 [ 3.38863423  0.87952394  0.40538881]
 [ 1.61549833 -1.56652328 -1.05351894]
 [ 0.15133075 -2.50507639 -1.3156637 ]
 [-0.99576675  0.56654487  1.48593239]
 [-1.7515016  -0.65460481  1.50039959]
 [-2.21615035  3.13807448 -1.68793779]
 [-5.98672282 -0.46499368  0.25137607]]

Method 3: PCA by Scikit-learn:
After PCA transformation, data becomes:
[[-5.79467821  0.60705487]
 [-3.38863423  0.87952394]
 [-1.61549833 -1.56652328]
 [-0.15133075 -2.50507639]
 [ 0.99576675  0.56654487]
 [ 1.7515016  -0.65460481]
 [ 2.21615035  3.13807448]
 [ 5.98672282 -0.46499368]]

It can be seen that these results are consistent with the MATLAB simulation (again, the signs of individual components may differ between implementations).
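
If one wants to verify this agreement programmatically, a small helper like the following (a hypothetical utility, not part of either library) compares two score matrices up to a per-column sign flip:

import numpy as np

def same_up_to_sign(A, B, tol=1e-6):
    # An eigenvector and its negative span the same direction, so different implementations
    # may return principal components with opposite signs; align column signs before comparing.
    A, B = np.asarray(A), np.asarray(B)
    signs = np.sign(np.sum(A * B, axis=0))
    return np.allclose(A, B * signs, atol=tol)

A = np.array([[1.0, 2.0], [3.0, 4.0]])
print(same_up_to_sign(A, A * np.array([1.0, -1.0])))   # True: second column is only sign-flipped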

5. Main advantages and disadvantages of principal component analysis

Advantages

① It removes the effects of correlation among the evaluation indicators, because PCA transforms the original indicator variables into principal components that are independent of each other; in practice, the stronger the correlation among the indicators, the better PCA works.

② It reduces the workload of indicator selection. For other evaluation methods it is hard to remove the correlation among indicators, so selecting indicators takes considerable effort; since PCA removes this correlation, indicator selection becomes relatively easy.

③ The principal components are ordered by decreasing variance. When analyzing a problem, some of them can be discarded and only the leading components with larger variance kept to represent the original variables, which reduces the amount of computation. When PCA is used for comprehensive evaluation, the usual selection rule of a cumulative contribution rate of at least 85% ensures that the saving in workload does not come at the cost of dropping key indicators and distorting the evaluation result.

Disadvantages

① In PCA we must first ensure that the cumulative contribution rate of the extracted leading principal components is sufficiently high (that is, the amount of information retained after dimensionality reduction must remain high), and second, each extracted principal component must admit an interpretation consistent with the practical background and meaning of the problem (otherwise a principal component carries information but no real meaning).

② The interpretation of a principal component is usually somewhat ambiguous; it is not as clear and definite as the meaning of the original variables. This is a price that has to be paid for dimensionality reduction. Therefore the number of extracted principal components m should normally be clearly smaller than the number of original variables p (unless p itself is small); otherwise the benefit of the lower dimension may not outweigh the drawback that the principal components are less interpretable than the original variables.

③ When the factor loadings of a principal component contain both positive and negative signs, the meaning of the comprehensive evaluation function becomes unclear.
