Principal component analysis (PCA) steps and code

PCA


Foreword

  Principal component analysis (PCA) is a statistical method that uses an orthogonal transformation to convert a set of possibly correlated variables into a set of linearly uncorrelated variables, called the principal components. It is the most common linear dimensionality reduction method in mathematical modeling. In competitions it is often used when there are too many data indicators, reducing high-dimensional data to low-dimensional data to simplify subsequent modeling. In plain terms, it condenses many data indicators into a few.


1. The steps of principal component analysis

Given n samples and p indicators, form the n × p sample matrix X, with one row per sample and one column per indicator.

1. Centralization of indicators

Centering shifts each indicator (column) so that its mean becomes zero:
$$x'_{ij}=x_{ij}-\frac{1}{n}\sum_{k=1}^{n}x_{kj} \tag{1}$$
If the data are approximately normally distributed, they can instead be standardized:
$$x'_{ij}=\frac{x_{ij}-\overline{x}_{j}}{\sigma_{j}} \tag{2}$$
where $\overline{x}_{j}$ and $\sigma_{j}$ are the mean and standard deviation of the j-th indicator.
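As a minimal sketch of equations (1) and (2), assuming NumPy and a small made-up matrix (the values below are hypothetical and only serve as an illustration):

```python
import numpy as np

# Hypothetical data: 4 samples (rows), 2 indicators (columns)
X = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 30.0],
              [4.0, 40.0]])

# Equation (1): subtract the mean of each indicator (column)
X_centered = X - X.mean(axis=0)

# Equation (2): additionally divide by the standard deviation of each column.
# ddof=0 is the population standard deviation; MATLAB's zscore divides by
# the sample standard deviation (ddof=1) by default.
X_standardized = X_centered / X.std(axis=0, ddof=0)

print(X_centered.mean(axis=0))     # column means are (numerically) zero
print(X_standardized.std(axis=0))  # column standard deviations are one
```

Centering alone keeps the original units of each indicator, while standardization puts all indicators on a common scale, so the two choices can lead to different principal components.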

2. Calculate the covariance matrix C

$$C=\frac{1}{n}X'^{T}X' \tag{3}$$
$$C_{ij}=\operatorname{cov}(x_{i},x_{j})=E\big((x_{i}-\mu_{i})(x_{j}-\mu_{j})\big) \tag{4}$$
where X' is the centered (or standardized) sample matrix, $x_{i}$ is the i-th indicator column, and $\mu_{i}$ is the mean of that indicator.
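The covariance formula can be checked numerically. The sketch below assumes NumPy and a hypothetical 3 × 2 matrix; it computes C from the centered data with the 1/n factor and compares the result with `np.cov` using `ddof=0` (the default `ddof=1` would divide by n − 1 instead).

```python
import numpy as np

# Hypothetical data: 3 samples (rows), 2 indicators (columns)
X = np.array([[2.0, 1.0],
              [4.0, 3.0],
              [6.0, 8.0]])
n = X.shape[0]

# Center each indicator, then apply C = (1/n) X'^T X'
X_centered = X - X.mean(axis=0)
C = X_centered.T @ X_centered / n

# Same matrix from NumPy: rowvar=False treats columns as variables
C_np = np.cov(X, rowvar=False, ddof=0)

print(C)
print(np.allclose(C, C_np))
```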

3. Calculate the eigenvalues and eigenvectors of the covariance matrix

$$Ca=\lambda a \tag{5}$$
Here λ is an eigenvalue of C, and a is the eigenvector of C corresponding to λ. For the derivation, refer to a linear algebra text.
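A short sketch of this step, assuming NumPy and a hypothetical 2 × 2 covariance matrix. It uses `np.linalg.eigh`, which is intended for symmetric matrices and returns real eigenvalues in ascending order, and then verifies that each pair satisfies Ca = λa.

```python
import numpy as np

# Hypothetical symmetric covariance matrix
C = np.array([[2.0, 1.0],
              [1.0, 2.0]])

# eigh returns eigenvalues in ascending order;
# the eigenvectors are the COLUMNS of the second result
eigenvalues, eigenvectors = np.linalg.eigh(C)

for k in range(len(eigenvalues)):
    lam = eigenvalues[k]
    a = eigenvectors[:, k]
    # Check the defining relation C a = lambda a
    print(lam, np.allclose(C @ a, lam * a))
```

Because a covariance matrix is symmetric, `eigh` is usually a safer choice than the general `eig`, which does not guarantee any ordering of the eigenvalues.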

4. Calculate the principal component contribution rate and cumulative contribution rate

Sort the eigenvalues in descending order and arrange the corresponding eigenvectors, as columns in the same order, into a matrix a.
$$P_{i}=\frac{\lambda_{i}}{\sum_{k=1}^{p}\lambda_{k}} \tag{6}$$
$$P'_{i}=\sum_{k=1}^{i}P_{k} \tag{7}$$
$P_{i}$ is the contribution rate of the i-th principal component and $P'_{i}$ is the cumulative contribution rate of the first i components. $P_{i}$ can be read as the proportion of the information (variance) retained by that component.
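The contribution rates are simple to compute once the eigenvalues are sorted. A minimal sketch, assuming NumPy and three hypothetical eigenvalues given in arbitrary order:

```python
import numpy as np

# Hypothetical eigenvalues, deliberately unsorted
eigenvalues = np.array([0.5, 3.0, 1.5])

# Sort in descending order before computing the rates
order = np.argsort(eigenvalues)[::-1]
lam = eigenvalues[order]

P = lam / lam.sum()   # contribution rate of each component
P_cum = np.cumsum(P)  # cumulative contribution rate

print(P)      # approximately 0.6, 0.3, 0.1
print(P_cum)  # approximately 0.6, 0.9, 1.0
```

Computing the cumulative sum on unsorted eigenvalues would give a misleading count of components, so the sort has to come first.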

5. Write down the principal components

Generally, take the first m principal components, where m is the smallest number for which the cumulative contribution rate $P'_{m}$ exceeds 80%. With the m corresponding eigenvectors as the columns of a, and samples stored as the rows of X':
$$Y=X'a \tag{8}$$
The i-th principal component is:
$$F_{i}=a_{1i}X_{1}+a_{2i}X_{2}+\dots+a_{pi}X_{p} \tag{9}$$
where $X_{j}$ is the j-th (centered) indicator column.
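Putting the selection and projection together, here is a hedged sketch assuming NumPy, samples stored as rows, and randomly generated data (the data, the seed, and the variable names are all illustrative). It picks the smallest m whose cumulative contribution rate exceeds 80% and multiplies the centered data by the m selected eigenvector columns.

```python
import numpy as np

# Hypothetical data: 50 samples, 4 indicators, two of them strongly correlated
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))
X[:, 1] = X[:, 0] + 0.1 * rng.normal(size=50)

n = X.shape[0]
Xc = X - X.mean(axis=0)  # center each indicator
C = Xc.T @ Xc / n        # covariance matrix with the 1/n factor

# Eigen-decomposition, then sort in descending order of eigenvalue
lam, vecs = np.linalg.eigh(C)
order = np.argsort(lam)[::-1]
lam, vecs = lam[order], vecs[:, order]

# Smallest m whose cumulative contribution rate exceeds 80%
P_cum = np.cumsum(lam / lam.sum())
m = int(np.argmax(P_cum > 0.8)) + 1

a = vecs[:, :m]  # selected eigenvectors as columns (p x m)
Y = Xc @ a       # principal component scores (n x m)

print(m, Y.shape)
# The components are uncorrelated: their covariance is diagonal,
# with the selected eigenvalues on the diagonal
print(np.allclose(Y.T @ Y / n, np.diag(lam[:m])))
```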

6. Explain the principal components

For a given principal component, the larger the absolute value of an indicator's coefficient, the greater that indicator's influence on the component, and the more weight the indicator should receive when interpreting the component.

2. Code program

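The MATLAB code is as follows:

clear; clc
x = xlsread('file_path\xxx.xlsx');  % import the Excel data (replace with your own file path)
[n, p] = size(x);  % n is the number of samples, p is the number of indicators
X = zscore(x);  % MATLAB's built-in standardization function
C = cov(X);  % covariance matrix
[V, D] = eig(C);  % V: matrix of eigenvectors; D: diagonal matrix of eigenvalues
[lambda, ind] = sort(diag(D), 'descend');  % sort the eigenvalues in descending order
contrib = lambda ./ sum(lambda);  % contribution rates
cum_contrib = cumsum(contrib);  % cumulative contribution rates
k = find(cum_contrib > 0.9, 1);  % first index where the cumulative contribution rate exceeds 0.9
y = X * V(:, ind(1:k));  % project the standardized data; y is the dimension-reduced result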

The python code is as follows:

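## PCA dimensionality reduction
# Import the required modules
import numpy as np
from numpy.linalg import eigh
from sklearn.datasets import load_iris

iris = load_iris()  # load the data matrix: rows are samples, columns are indicators
# X = np.array([[5.1, 3.5, 1.4, 0.2],
#               [4.9, 3, 1.4, 0.2]])
# To use your own data, uncomment the lines above and delete X = iris.data
X = iris.data
# Center the data by removing the mean of each column
X = X - X.mean(axis=0)

# Calculate the covariance matrix
X_cov = np.cov(X.T, ddof=0)

# Calculate the eigenvalues and eigenvectors of the covariance matrix
# (eigh is meant for symmetric matrices and returns real eigenvalues in ascending order)
eigenvalues, eigenvectors = eigh(X_cov)
order = eigenvalues.argsort()[::-1]  # indices that sort the eigenvalues in descending order
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]  # eigenvectors are stored as columns
contrib = eigenvalues / np.sum(eigenvalues)  # contribution rates
cum_contrib = np.cumsum(contrib)  # cumulative contribution rates

# Number of components needed for the cumulative contribution rate to exceed the threshold
k = np.min(np.argwhere(cum_contrib > 0.95)) + 1

# Select the top k eigenvectors (columns)
k_eigenvectors = eigenvectors[:, :k]

# Project X onto the k eigenvectors
X_pca = np.dot(X, k_eigenvectors)
print(X_pca)  # print the principal component scores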

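Both programs follow the same steps, so use whichever language you are more proficient in. Note that the MATLAB version standardizes the data with zscore and uses a 0.9 threshold, while the Python version only centers the data and uses a 0.95 threshold.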


Summary

Can principal component analysis be done without writing code?
  Yes. SPSSPRO, the MPai data science platform, and similar tools have built-in principal component analysis.

How to interpret principal components?
  In practice we know the composition of the i-th principal component, so we can interpret it from the corresponding coefficients. For example, let X1 be food expenditure, X2 housing expenditure, X3 entertainment expenditure, and X4 medical expenditure, and let F1 be the first principal component, composed as follows:
$$F_{1}=0.91X_{1}+0.83X_{2}+0.04X_{3}+0.76X_{4} \tag{10}$$
  The coefficients of X1, X2, and X4 are clearly large and have a strong influence on the principal component, while the coefficient of X3 is small and has little influence. We can therefore interpret the first principal component as: necessary family expenditure.
Note: If the principal components cannot be interpreted, this principal component analysis has failed, and factor analysis can be considered instead.
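When there are many indicators, ranking the coefficients by absolute value makes this kind of reading easier. The sketch below is only an illustration: it takes 0.91, 0.83, 0.04, and 0.76 as the coefficients of X1 to X4 in the expenditure example, and the dictionary labels are made up for readability.

```python
# Coefficients of the first principal component in the expenditure example
coefficients = {
    "X1 food": 0.91,
    "X2 housing": 0.83,
    "X3 entertainment": 0.04,
    "X4 medical": 0.76,
}

# Rank the indicators by the absolute value of their coefficient
ranked = sorted(coefficients.items(), key=lambda kv: abs(kv[1]), reverse=True)

for name, coef in ranked:
    print(f"{name}: {coef:+.2f}")
```

The indicators at the top of the list dominate the component and should drive its interpretation; the absolute value is used because a large negative coefficient is as influential as a large positive one.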

Can principal component analysis be used for comprehensive evaluation?
  The conclusion first: although principal component analysis produces principal component scores, it is generally not used for comprehensive evaluation. The principal components lose part of the information in the original data, and if some indicators are of the smaller-is-better type and have not been converted to a larger-is-better direction beforehand, the scores will be inaccurate.
  The principle of principal component scoring is to extract a few principal components from dozens of indicators. Each principal component acts as a brand-new indicator and has a contribution rate, which measures how much of the overall information that component reflects (the variance of the data projected onto the component; the larger the variance, the more information it reflects). The contribution rate is a relative value, so after normalization it can be used as a weight. Multiplying the weights by the principal component scores and summing gives a relative score for each sample.
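For readers who still want to see the mechanics, the weighting described above can be sketched as follows. This assumes NumPy, and the component scores and contribution rates are hypothetical numbers chosen for illustration. It shows only how the score is formed and is not a recommendation to use it for evaluation.

```python
import numpy as np

# Hypothetical principal component scores: 3 samples, 2 retained components
Y = np.array([[ 2.0,  0.5],
              [-1.0,  1.0],
              [-1.0, -1.5]])

# Hypothetical contribution rates of the 2 retained components
contribution = np.array([0.6, 0.3])

# Normalize the contribution rates so the weights sum to 1
weights = contribution / contribution.sum()

# Weighted sum of the component scores gives one relative score per sample
scores = Y @ weights

# Rank the samples from highest to lowest score
ranking = np.argsort(scores)[::-1]

print(weights)
print(scores)
print(ranking)
```

The resulting scores are only meaningful relative to one another, and they inherit the caveats above: information discarded with the dropped components, and the direction of each original indicator.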

Reposted from: blog.csdn.net/weixin_52952969/article/details/124954713