Article Directory

- 1. Import data
- 2. Show data characteristics
- 3. Data standardization
- 4. Calculate the covariance matrix
- 5. Find eigenvalues and eigenvectors
- 6. Sort according to the size of eigenvalues
- 7. Calculate the cumulative result
- 8. Complete PCA dimensionality reduction
- 9. Visually compare the distribution of data before and after dimensionality reduction
The following is an example of applying the PCA algorithm to a real problem, again using the iris data set; the goal is still dimensionality reduction.
The basic process is as follows (a compact end-to-end sketch follows the list):

1. Data preprocessing: only numerical data can undergo PCA dimensionality reduction
2. Calculate the covariance matrix of the sample data
3. Solve for the eigenvalues and eigenvectors of the covariance matrix
4. Sort the eigenvalues in descending order, select the top K, and assemble the corresponding K eigenvectors into a projection matrix
5. Project the sample points with this matrix to complete the PCA dimensionality reduction task
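Before walking through each step, here is a minimal end-to-end sketch of the whole pipeline, assuming only a numeric feature array X of shape (n_samples, n_features) and a target dimension k (pca_sketch is a hypothetical helper name, not part of the tutorial code):

import numpy as np

def pca_sketch(X, k=2):
    # 1. standardize: zero mean, unit variance per feature
    X_std = (X - X.mean(axis=0)) / X.std(axis=0)
    # 2. covariance matrix of the standardized samples
    cov = np.cov(X_std.T)
    # 3. eigen-decomposition of the covariance matrix
    vals, vecs = np.linalg.eig(cov)
    # 4. keep the eigenvectors of the k largest eigenvalues
    order = np.argsort(vals)[::-1][:k]
    W = vecs[:, order]
    # 5. project the samples onto the k principal directions
    return X_std.dot(W)

The sections below carry out these same steps one at a time on the iris data.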
1. Import data
import numpy as np
import pandas as pd
# read the data set; note that without header=None the first data row is
# consumed as the column header, which is why only 149 of the 150 rows remain
df = pd.read_csv('iris.data')
# the raw file has no column names, so we add them ourselves
df.columns = ['sepal_len', 'sepal_wid', 'petal_len', 'petal_wid', 'class']
df.head()
| | sepal_len | sepal_wid | petal_len | petal_wid | class |
|---|---|---|---|---|---|
| 0 | 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa |
| 1 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa |
| 2 | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa |
| 3 | 5.0 | 3.6 | 1.4 | 0.2 | Iris-setosa |
| 4 | 5.4 | 3.9 | 1.7 | 0.4 | Iris-setosa |
2. Show data characteristics
# split the data into features and labels
X = df.iloc[:, 0:4].values
y = df.iloc[:, 4].values
from matplotlib import pyplot as plt
# label names, for reference (the plotting loop below uses the strings directly)
label_dict = {
    1: 'Iris-Setosa',
    2: 'Iris-Versicolor',
    3: 'Iris-Virginica'}
# feature names, used for the axis labels
feature_dict = {
    0: 'sepal length [cm]',
    1: 'sepal width [cm]',
    2: 'petal length [cm]',
    3: 'petal width [cm]'}
# set the figure size
plt.figure(figsize=(8, 6))
for cnt in range(4):
    # use one subplot per feature (4 in total)
    plt.subplot(2, 2, cnt+1)
    for lab in ('Iris-setosa', 'Iris-versicolor', 'Iris-virginica'):
        plt.hist(X[y==lab, cnt],
                 label=lab,
                 bins=10,
                 alpha=0.3)
    plt.xlabel(feature_dict[cnt])
    plt.legend(loc='upper right', fancybox=True, fontsize=8)
plt.tight_layout()
plt.show()
It can be seen that some features have strong discriminating power and separate the three kinds of flowers clearly, while others are weaker, leaving samples of different classes mixed together.
3. Data standardization
In general, data needs to be standardized before PCA (zero mean and unit variance per feature), so that features measured on larger scales do not dominate the covariance.
from sklearn.preprocessing import StandardScaler
X_std = StandardScaler().fit_transform(X)
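For reference, StandardScaler here just subtracts each feature's mean and divides by its standard deviation; a minimal hand-rolled equivalent (assuming the X array from above):

# manual standardization: zero mean, unit variance per feature
# (np.std defaults to ddof=0, matching StandardScaler)
X_manual = (X - X.mean(axis=0)) / X.std(axis=0)
print(np.allclose(X_std, X_manual))  # True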
4. Calculate the covariance matrix
mean_vec = np.mean(X_std, axis=0)
cov_mat = (X_std - mean_vec).T.dot((X_std - mean_vec)) / (X_std.shape[0]-1)
print('Covariance matrix \n%s' % cov_mat)
# the same result can be obtained directly with NumPy:
# print('NumPy covariance matrix: \n%s' % np.cov(X_std.T))
Covariance matrix
[[ 1.00675676 -0.10448539 0.87716999 0.82249094]
[-0.10448539 1.00675676 -0.41802325 -0.35310295]
[ 0.87716999 -0.41802325 1.00675676 0.96881642]
[ 0.82249094 -0.35310295 0.96881642 1.00675676]]
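As a quick sanity check, the hand-computed matrix should match NumPy's built-in, which also divides by n-1 by default:

# the manual formula and np.cov agree
print(np.allclose(cov_mat, np.cov(X_std.T)))  # True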
5. Find eigenvalues and eigenvectors
cov_mat = np.cov(X_std.T)
eig_vals, eig_vecs = np.linalg.eig(cov_mat)
print('Eigenvectors \n%s' % eig_vecs)
print('\nEigenvalues \n%s' % eig_vals)
Eigenvectors
[[ 0.52308496 -0.36956962 -0.72154279 0.26301409]
[-0.25956935 -0.92681168 0.2411952 -0.12437342]
[ 0.58184289 -0.01912775 0.13962963 -0.80099722]
[ 0.56609604 -0.06381646 0.63380158 0.52321917]]
Eigenvalues
[2.92442837 0.93215233 0.14946373 0.02098259]
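Each pair returned by np.linalg.eig satisfies the defining relation cov_mat · v = λ · v, which is easy to verify directly (for a symmetric matrix such as a covariance matrix, np.linalg.eigh would also be an appropriate choice):

# verify the eigen-equation for every eigenpair
for i in range(len(eig_vals)):
    v = eig_vecs[:, i]
    print(np.allclose(cov_mat.dot(v), eig_vals[i] * v))  # True for each pair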
6. Sort according to the size of eigenvalues
# pair each eigenvalue with its corresponding eigenvector
eig_pairs = [(np.abs(eig_vals[i]), eig_vecs[:, i]) for i in range(len(eig_vals))]
print(eig_pairs)
print('----------')
# sort the pairs by eigenvalue, largest first
eig_pairs.sort(key=lambda x: x[0], reverse=True)
# print the sorted result
print('Eigenvalues sorted from largest to smallest:')
for i in eig_pairs:
    print(i[0])
[(2.9244283691111126, array([ 0.52308496, -0.25956935, 0.58184289, 0.56609604])), (0.9321523302535072, array([-0.36956962, -0.92681168, -0.01912775, -0.06381646])), (0.14946373489813383, array([-0.72154279, 0.2411952 , 0.13962963, 0.63380158])), (0.020982592764270565, array([ 0.26301409, -0.12437342, -0.80099722, 0.52321917]))]
----------
Eigenvalues sorted from largest to smallest:
2.9244283691111126
0.9321523302535072
0.14946373489813383
0.020982592764270565
7. Calculate the cumulative result
Accumulate the sorted eigenvalues; once their cumulative share of the total variance exceeds a chosen percentage, that number of components can be selected as the dimensionality after reduction.
# compute the cumulative explained-variance percentages
tot = sum(eig_vals)
var_exp = [(i / tot) * 100 for i in sorted(eig_vals, reverse=True)]
print(var_exp)
cum_var_exp = np.cumsum(var_exp)
cum_var_exp
[72.62003332692029, 23.147406858644153, 3.711515564584534, 0.5210442498510144]
array([ 72.62003333, 95.76744019, 99.47895575, 100. ])
It can be seen that with just the first two eigenvalues, the cumulative contribution rate already exceeds 95%, so we choose to reduce the data to two dimensions.
# a small example of how cumsum works
a = np.array([1, 2, 3, 4])
print(a)
print('-----------')
print(np.cumsum(a))
[1 2 3 4]
-----------
[ 1 3 6 10]
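Given the cumulative percentages, the number of components for a target threshold can also be picked programmatically; a minimal sketch using the 95% cutoff mentioned above:

# smallest k whose cumulative explained variance reaches the cutoff
threshold = 95  # target percentage
k = int(np.searchsorted(cum_var_exp, threshold)) + 1
print(k)  # 2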
A plot shows this more directly:
plt.figure(figsize=(6, 4))
plt.bar(range(4), var_exp, alpha=0.5, align='center',
label='individual explained variance')
plt.step(range(4), cum_var_exp, where='mid',
label='cumulative explained variance')
plt.ylabel('Explained variance ratio')
plt.xlabel('Principal components')
plt.legend(loc='best')
plt.tight_layout()
plt.show()
8. Complete PCA dimensionality reduction
Combine the first two eigenvectors into a projection matrix to complete the dimensionality reduction operation.
matrix_w = np.hstack((eig_pairs[0][1].reshape(4,1),
eig_pairs[1][1].reshape(4,1)))
print('Matrix W:\n', matrix_w)
Matrix W:
[[ 0.52308496 -0.36956962]
[-0.25956935 -0.92681168]
[ 0.58184289 -0.01912775]
[ 0.56609604 -0.06381646]]
Y = X_std.dot(matrix_w)
print("X.shape : ",X.shape)
print("Y.shape : ",Y.shape)
X.shape : (149, 4)
Y.shape : (149, 2)
It can be seen that the original data has been reduced from 4 dimensions to 2.
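For comparison, scikit-learn's built-in PCA produces the same projection; note that each principal component is only determined up to sign, so individual columns may come out flipped relative to the manual result. A minimal sketch:

from sklearn.decomposition import PCA

pca = PCA(n_components=2)
Y_sklearn = pca.fit_transform(X_std)
# identical to the manual projection up to a per-column sign flip
print(np.allclose(np.abs(Y_sklearn), np.abs(Y)))  # True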
9. Visually compare the distribution of data before and after dimensionality reduction
Since the data has 4 features, it cannot be shown directly in a single flat plot, so the "before" picture uses only the first two features.
plt.figure(figsize=(6, 4))
for lab, col in zip(('Iris-setosa', 'Iris-versicolor', 'Iris-virginica'),
                    ('blue', 'red', 'green')):
    plt.scatter(X[y==lab, 0],
                X[y==lab, 1],
                label=lab,
                c=col)
plt.xlabel('sepal_len')
plt.ylabel('sepal_wid')
plt.legend(loc='best')
plt.tight_layout()
plt.show()
Result after dimensionality reduction
plt.figure(figsize=(6, 4))
for lab, col in zip(('Iris-setosa', 'Iris-versicolor', 'Iris-virginica'),
                    ('blue', 'red', 'green')):
    plt.scatter(Y[y==lab, 0],
                Y[y==lab, 1],
                label=lab,
                c=col)
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.legend(loc='lower center')
plt.tight_layout()
plt.show()