Article Directory

- 1. Import data
- 2. Show data characteristics
- 3. Data standardization
- 4. Calculate the covariance matrix
- 5. Find eigenvalues and eigenvectors
- 6. Sort according to the size of eigenvalues
- 7. Calculate the cumulative result
- 8. Complete PCA dimensionality reduction
- 9. Visually compare the distribution of data before and after dimensionality reduction
The following is an example of applying the PCA algorithm to a real problem, again using the iris data set; the goal is still dimensionality reduction.
The basic process is as follows (a compact end-to-end sketch follows the list):

1. Data preprocessing: only numerical data can undergo PCA dimensionality reduction
2. Calculate the covariance matrix of the sample data
3. Solve for the eigenvalues and eigenvectors of the covariance matrix
4. Sort the eigenvalues in descending order, select the top K, and assemble the corresponding K eigenvectors into a projection matrix
5. Project the sample points with this matrix to complete the PCA dimensionality reduction task
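Before walking through each step, here is a minimal end-to-end sketch of the whole pipeline, assuming only a numeric feature array X of shape (n_samples, n_features) and a target dimension k (pca_sketch is a hypothetical helper name, not part of the tutorial code):

import numpy as np

def pca_sketch(X, k=2):
    # 1. standardize: zero mean, unit variance per feature
    X_std = (X - X.mean(axis=0)) / X.std(axis=0)
    # 2. covariance matrix of the standardized samples
    cov = np.cov(X_std.T)
    # 3. eigen-decomposition of the covariance matrix
    vals, vecs = np.linalg.eig(cov)
    # 4. keep the eigenvectors of the k largest eigenvalues
    order = np.argsort(vals)[::-1][:k]
    W = vecs[:, order]
    # 5. project the samples onto the k principal directions
    return X_std.dot(W)

The sections below carry out these same steps one at a time on the iris data.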
1. Import data
import numpy as np
import pandas as pd
# read the data set; note that without header=None the first data row is
# consumed as the column header, which is why only 149 of the 150 rows remain
df = pd.read_csv('iris.data')
# the raw file has no column names, so we add them ourselves
df.columns = ['sepal_len', 'sepal_wid', 'petal_len', 'petal_wid', 'class']
df.head()
| | sepal_len | sepal_wid | petal_len | petal_wid | class |
|---|---|---|---|---|---|
| 0 | 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa |
| 1 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa |
| 2 | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa |
| 3 | 5.0 | 3.6 | 1.4 | 0.2 | Iris-setosa |
| 4 | 5.4 | 3.9 | 1.7 | 0.4 | Iris-setosa |
2. Show data characteristics
# split the data into features and labels
X = df.iloc[:, 0:4].values
y = df.iloc[:, 4].values
from matplotlib import pyplot as plt
# label names, for reference (the plotting loop below uses the strings directly)
label_dict = {
    1: 'Iris-Setosa',
    2: 'Iris-Versicolor',
    3: 'Iris-Virginica'}
# feature names, used for the axis labels
feature_dict = {
    0: 'sepal length [cm]',
    1: 'sepal width [cm]',
    2: 'petal length [cm]',
    3: 'petal width [cm]'}
# set the figure size
plt.figure(figsize=(8, 6))
for cnt in range(4):
    # use one subplot per feature (4 in total)
    plt.subplot(2, 2, cnt+1)
    for lab in ('Iris-setosa', 'Iris-versicolor', 'Iris-virginica'):
        plt.hist(X[y==lab, cnt],
                 label=lab,
                 bins=10,
                 alpha=0.3)
    plt.xlabel(feature_dict[cnt])
    plt.legend(loc='upper right', fancybox=True, fontsize=8)
plt.tight_layout()
plt.show()
It can be seen that some features have strong discriminating power and separate the three kinds of flowers clearly, while others are weaker, leaving samples of different classes mixed together.
3. Data standardization
In general, data needs to be standardized before PCA (zero mean and unit variance per feature), so that features measured on larger scales do not dominate the covariance.
from sklearn.preprocessing import StandardScaler
X_std = StandardScaler().fit_transform(X)
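For reference, StandardScaler here just subtracts each feature's mean and divides by its standard deviation; a minimal hand-rolled equivalent (assuming the X array from above):

# manual standardization: zero mean, unit variance per feature
# (np.std defaults to ddof=0, matching StandardScaler)
X_manual = (X - X.mean(axis=0)) / X.std(axis=0)
print(np.allclose(X_std, X_manual))  # True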
4. Calculate the covariance matrix
mean_vec = np.mean(X_std, axis=0)
cov_mat = (X_std - mean_vec).T.dot((X_std - mean_vec)) / (X_std.shape[0]-1)
print('Covariance matrix \n%s' % cov_mat)
# the same result can be obtained directly with NumPy:
# print('NumPy covariance matrix: \n%s' % np.cov(X_std.T))
Covariance matrix
[[ 1.00675676 -0.10448539 0.87716999 0.82249094]
[-0.10448539 1.00675676 -0.41802325 -0.35310295]
[ 0.87716999 -0.41802325 1.00675676 0.96881642]
[ 0.82249094 -0.35310295 0.96881642 1.00675676]]
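As a quick sanity check, the hand-computed matrix should match NumPy's built-in, which also divides by n-1 by default:

# the manual formula and np.cov agree
print(np.allclose(cov_mat, np.cov(X_std.T)))  # True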
5. Find eigenvalues and eigenvectors
cov_mat = np.cov(X_std.T)
eig_vals, eig_vecs = np.linalg.eig(cov_mat)
print('Eigenvectors \n%s' % eig_vecs)
print('\nEigenvalues \n%s' % eig_vals)
Eigenvectors
[[ 0.52308496 -0.36956962 -0.72154279 0.26301409]
[-0.25956935 -0.92681168 0.2411952 -0.12437342]
[ 0.58184289 -0.01912775 0.13962963 -0.80099722]
[ 0.56609604 -0.06381646 0.63380158 0.52321917]]
Eigenvalues
[2.92442837 0.93215233 0.14946373 0.02098259]
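Each pair returned by np.linalg.eig satisfies the defining relation cov_mat · v = λ · v, which is easy to verify directly (for a symmetric matrix such as a covariance matrix, np.linalg.eigh would also be an appropriate choice):

# verify the eigen-equation for every eigenpair
for i in range(len(eig_vals)):
    v = eig_vecs[:, i]
    print(np.allclose(cov_mat.dot(v), eig_vals[i] * v))  # True for each pair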
6. Sort according to the size of eigenvalues
# pair each eigenvalue with its corresponding eigenvector
eig_pairs = [(np.abs(eig_vals[i]), eig_vecs[:, i]) for i in range(len(eig_vals))]
print(eig_pairs)
print('----------')
# sort the pairs by eigenvalue, largest first
eig_pairs.sort(key=lambda x: x[0], reverse=True)
# print the sorted result
print('Eigenvalues sorted from largest to smallest:')
for i in eig_pairs:
    print(i[0])
[(2.9244283691111126, array([ 0.52308496, -0.25956935, 0.58184289, 0.56609604])), (0.9321523302535072, array([-0.36956962, -0.92681168, -0.01912775, -0.06381646])), (0.14946373489813383, array([-0.72154279, 0.2411952 , 0.13962963, 0.63380158])), (0.020982592764270565, array([ 0.26301409, -0.12437342, -0.80099722, 0.52321917]))]
----------
Eigenvalues sorted from largest to smallest:
2.9244283691111126
0.9321523302535072
0.14946373489813383
0.020982592764270565
7. Calculate the cumulative result
Accumulate the sorted eigenvalues; once their cumulative share of the total variance exceeds a chosen percentage, that number of components can be selected as the dimensionality after reduction.
# compute the cumulative explained-variance percentages
tot = sum(eig_vals)
var_exp = [(i / tot) * 100 for i in sorted(eig_vals, reverse=True)]
print(var_exp)
cum_var_exp = np.cumsum(var_exp)
cum_var_exp
[72.62003332692029, 23.147406858644153, 3.711515564584534, 0.5210442498510144]
array([ 72.62003333, 95.76744019, 99.47895575, 100. ])
It can be seen that with just the first two eigenvalues, the cumulative contribution rate already exceeds 95%, so we choose to reduce the data to two dimensions.
# a small example of how cumsum works
a = np.array([1, 2, 3, 4])
print(a)
print('-----------')
print(np.cumsum(a))
[1 2 3 4]
-----------
[ 1 3 6 10]
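Given the cumulative percentages, the number of components for a target threshold can also be picked programmatically; a minimal sketch using the 95% cutoff mentioned above:

# smallest k whose cumulative explained variance reaches the cutoff
threshold = 95  # target percentage
k = int(np.searchsorted(cum_var_exp, threshold)) + 1
print(k)  # 2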
A plot shows this more directly:
plt.figure(figsize=(6, 4))
plt.bar(range(4), var_exp, alpha=0.5, align='center',
label='individual explained variance')
plt.step(range(4), cum_var_exp, where='mid',
label='cumulative explained variance')
plt.ylabel('Explained variance ratio')
plt.xlabel('Principal components')
plt.legend(loc='best')
plt.tight_layout()
plt.show()
8. Complete PCA dimensionality reduction
Combine the first two eigenvectors into a projection matrix to complete the dimensionality reduction operation.
matrix_w = np.hstack((eig_pairs[0][1].reshape(4,1),
eig_pairs[1][1].reshape(4,1)))
print('Matrix W:\n', matrix_w)
Matrix W:
[[ 0.52308496 -0.36956962]
[-0.25956935 -0.92681168]
[ 0.58184289 -0.01912775]
[ 0.56609604 -0.06381646]]
Y = X_std.dot(matrix_w)
print("X.shape : ",X.shape)
print("Y.shape : ",Y.shape)
X.shape : (149, 4)
Y.shape : (149, 2)
It can be seen that the original data has been reduced from 4 dimensions to 2.
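For comparison, scikit-learn's built-in PCA produces the same projection; note that each principal component is only determined up to sign, so individual columns may come out flipped relative to the manual result. A minimal sketch:

from sklearn.decomposition import PCA

pca = PCA(n_components=2)
Y_sklearn = pca.fit_transform(X_std)
# identical to the manual projection up to a per-column sign flip
print(np.allclose(np.abs(Y_sklearn), np.abs(Y)))  # True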
9. Visually compare the distribution of data before and after dimensionality reduction
Since the data has 4 features, it cannot be shown directly in a single flat plot, so the "before" picture uses only the first two features.
plt.figure(figsize=(6, 4))
for lab, col in zip(('Iris-setosa', 'Iris-versicolor', 'Iris-virginica'),
                    ('blue', 'red', 'green')):
    plt.scatter(X[y==lab, 0],
                X[y==lab, 1],
                label=lab,
                c=col)
plt.xlabel('sepal_len')
plt.ylabel('sepal_wid')
plt.legend(loc='best')
plt.tight_layout()
plt.show()
Result after dimensionality reduction
plt.figure(figsize=(6, 4))
for lab, col in zip(('Iris-setosa', 'Iris-versicolor', 'Iris-virginica'),
                    ('blue', 'red', 'green')):
    plt.scatter(Y[y==lab, 0],
                Y[y==lab, 1],
                label=lab,
                c=col)
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.legend(loc='lower center')
plt.tight_layout()
plt.show()