[Credit Score Prediction Model (3)] PCA: Principal Component Analysis


Foreword

Principal Component Analysis (PCA for short) is a data dimensionality reduction technique used for data preprocessing.
The general steps of PCA are: first zero-mean the original data, then compute the covariance matrix, and then find the eigenvectors and eigenvalues of the covariance matrix; these eigenvectors form a new feature space. The principal components are the eigenvectors of the covariance matrix, sorted by the size of their corresponding eigenvalues: the eigenvector with the largest eigenvalue is the first principal component, the next one is the second principal component, and so on.
In summary, what PCA does is reduce the dimensionality of a dataset while retaining as much information as possible.
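
As a quick illustration of these steps (a minimal sketch, not code from the original article; the toy data and the choice of two retained components are assumptions):

import numpy as np

X = np.random.rand(100, 5)                 # toy data: 100 samples, 5 features
X_centered = X - X.mean(axis=0)            # 1) zero-mean each feature
cov = np.cov(X_centered, rowvar=False)     # 2) covariance matrix (5 x 5)
eigvals, eigvecs = np.linalg.eigh(cov)     # 3) eigenvalues/eigenvectors of the symmetric matrix
order = np.argsort(eigvals)[::-1]          # sort components by eigenvalue, largest first
components = eigvecs[:, order[:2]]         # keep the top 2 eigenvectors (principal components)
X_reduced = X_centered @ components        # project the data onto the new feature space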


1. The pros and cons of PCA

1. Advantages:

  1. Reduces data dimensionality, simplifying the data and lowering computational complexity;
  2. Extracts the key information from the data;
  3. Can reveal structure hidden in the data;
  4. Simple to apply; SPSS can also be used to perform PCA dimensionality reduction;

2. Disadvantages:

  1. Some important information may be lost;
  2. Sensitive to noisy data;
  3. Multicollinearity may be present, leading to inaccurate results;

To determine whether there is multicollinearity, the VIF test can be used.
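
For reference, the VIF of the i-th variable is defined as VIF_i = 1 / (1 - R_i^2), where R_i^2 is the coefficient of determination obtained by regressing variable i on all of the other explanatory variables; a VIF of 1 means the variable is not correlated with the others at all.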

2. PCA steps

1. Standardization

This step has been shown in the article linked below:

https://blog.csdn.net/m0_65157892/article/details/129523883?spm=1001.2014.3001.5502
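
As a minimal sketch of that step (assuming z-score scaling and the column names x1..x9 used in the code below; the linked article may standardize differently):

from sklearn.preprocessing import StandardScaler
import pandas as pd

cols = ['x1', 'x2', 'x3', 'x4', 'x5', 'x6', 'x7', 'x8', 'x9']
scaler = StandardScaler()                  # z-score: (value - mean) / standard deviation
df[cols] = scaler.fit_transform(df[cols])  # df is assumed to be the loaded credit DataFrame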

2. VIF multicollinearity test

(1) Missing value handling and the VIF test

Since the multicollinearity test cannot run on data containing missing values, those values must be filled first. Here the fillna(0) method is used because only a small amount of data is missing.

The code is as follows:

import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

x1 = df['x1'].fillna(0)  # fill the few missing values in x1 with 0
# print(x1.isnull().sum())
# VIF test
X = sm.add_constant(df.loc[:, ['x2', 'x3', 'x4', 'x5', 'x6', 'x7', 'x8', 'x9']])
X1 = pd.concat([x1, X], axis=1)
# print(X1)
vif = pd.DataFrame()
vif["Feature"] = X1.columns
vif["VIF"] = [variance_inflation_factor(X1.values, i) for i in range(X1.shape[1])]
print('Multicollinearity test via VIF')
print(vif)

When a VIF value is greater than 10, multicollinearity is present and the corresponding variable should be removed.
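
A minimal sketch of how that removal could be automated (not from the original article; dropping the worst variable one at a time and re-checking is an assumption, and the threshold of 10 follows the rule above):

def drop_high_vif(data, threshold=10.0):
    # Iteratively drop the column with the highest VIF until every VIF is <= threshold.
    data = data.copy()
    while True:
        vifs = pd.Series(
            [variance_inflation_factor(data.values, i) for i in range(data.shape[1])],
            index=data.columns,
        )
        worst = vifs.drop('const', errors='ignore').idxmax()  # never drop the intercept column
        if vifs[worst] <= threshold:
            return data
        data = data.drop(columns=worst)

X1_clean = drop_high_vif(X1)  # reuses pd and variance_inflation_factor imported above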

3. Model training

from sklearn.decomposition import PCA

# Train the model.
# pca = PCA(n_components=5)  # n_components is the number of components to keep
pca = PCA(random_state=5, n_components=9)
pca.fit(x)                 # PCA is unsupervised, so no y is needed; x is the standardized feature matrix
result = pca.transform(x)  # transform the data into the reduced representation

After this unsupervised fit, the original data is transformed into the new feature space.

4. Variance percentage and cumulative variance contribution

# explained_variance_ratio_ returns the percentage of variance explained by each retained component
res = pca.explained_variance_ratio_

# accumulate the ratios to obtain the cumulative variance contribution
res2 = [0] * len(res)
for i in range(len(res)):
    if i == 0:
        res2[i] = res[i]
    else:
        res2[i] = res2[i-1] + res[i]

print("Variance percentage:\n", res)
print("Cumulative variance contribution:\n", res2)
print(pca.explained_variance_)  # the absolute variance explained by each component
res3 = pca.explained_variance_
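
As a side note (not in the original code), the same cumulative contribution can be computed in one line with NumPy:

import numpy as np
res2 = np.cumsum(res)  # equivalent to the accumulation loop above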

The resulting percentages can be exported to Excel for inspection.
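
A minimal sketch of that export (the file name is hypothetical, and DataFrame.to_excel requires the openpyxl package to be installed):

pd.DataFrame({
    "variance ratio": res,
    "cumulative contribution": res2,
}).to_excel("pca_variance.xlsx", index=False)  # "pca_variance.xlsx" is an assumed file name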


Summary

PCA dimensionality reduction mainly serves to lower the dimensionality of the original data and cut down the number of variables, which simplifies the subsequent analysis.
