[Mathematical Modeling] Python Implementation of Common Algorithms: Principal Component Analysis (PCA)

1 Introduction

This article explains the Python implementation of principal component analysis (PCA) and follows it with an example analysis.

2 Principle and Code Implementation

2.1 Implementation steps

Principal component analysis (PCA) is a widely used dimensionality reduction method. Its implementation can be summarized in the following steps: standardize the data, compute the covariance matrix, compute the eigenvalues and eigenvectors of that matrix, keep the eigenvectors belonging to the largest eigenvalues as the columns of a matrix P, and finally project the data via Y = XP.

2.2 Code implementation

Import the required package:

import numpy as np
  • Define the covariance matrix function
    X is the input data and m is the number of samples, i.e. the number of rows of X.
    X is first standardized by subtracting the column mean and dividing by the column standard deviation (look up z-score standardization if this step is unfamiliar).
    After standardization each column has mean 0 and variance 1.
# Compute the covariance matrix
def calc_cov(X):
    m = X.shape[0] # number of samples, i.e. number of rows
    # Standardize the data: zero mean and unit variance per column
    X = (X - np.mean(X, axis=0)) / np.std(X, axis=0)
    return 1 / m * np.matmul(X.T, X) # matmul is the matrix product
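
As a quick sanity check (a sketch on made-up toy data, not from the original post), each column should have mean 0 and standard deviation 1 after standardization:

# Toy data for illustration only: 4 samples, 2 perfectly correlated features
X_toy = np.array([[1.0, 10.0],
                  [2.0, 20.0],
                  [3.0, 30.0],
                  [4.0, 40.0]])
X_std = (X_toy - np.mean(X_toy, axis=0)) / np.std(X_toy, axis=0)
print(np.mean(X_std, axis=0))  # approximately [0. 0.]
print(np.std(X_std, axis=0))   # [1. 1.]
print(calc_cov(X_toy))         # [[1. 1.] [1. 1.]] since the two features are fully correlated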
  • Define the PCA procedure
    First compute the covariance matrix of the input data X, then compute its eigenvalues (eigenvalues) and eigenvectors (eigenvectors).
    The np.linalg.eig() function computes both at once and is very convenient to use.
    The next step is to build the matrix P from the top eigenvectors and compute the dimensionality-reduced data as Y = XP (a small variance-explained helper is sketched after the code).
def pca(X, n_components):
    # Compute the covariance matrix
    cov_matrix = calc_cov(X)
    # Eigenvalues and eigenvectors of the covariance matrix
    eigenvalues, eigenvectors = np.linalg.eig(cov_matrix)
    # Sort the eigenvalues in descending order
    idx = eigenvalues.argsort()[::-1]
    # Keep the eigenvectors of the n_components largest eigenvalues
    eigenvectors = eigenvectors[:, idx]
    eigenvectors = eigenvectors[:, :n_components]
    # Transform: Y = XP (as in the original post, the raw X is projected here)
    return np.matmul(X, eigenvectors)
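
The sorted eigenvalues also tell us how much variance each principal component captures. As an optional helper (a sketch reusing calc_cov from above, not part of the original post):

# Fraction of the total variance explained by each principal component (sketch)
def explained_variance_ratio(X):
    eigenvalues, _ = np.linalg.eig(calc_cov(X))
    eigenvalues = np.sort(eigenvalues.real)[::-1]  # sort in descending order
    return eigenvalues / eigenvalues.sum()

A large ratio for the first few components indicates that little information is lost by dropping the rest.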

2.3 Iris data set example

Import Data

from sklearn import datasets
import matplotlib.pyplot as plt
# Load the iris dataset
iris = datasets.load_iris()
X = iris.data
y = iris.target

Check the shape of the data; the result is (150, 4), i.e. 150 samples with 4 features.

X.shape
# (150, 4)

Compute the covariance matrix

cov_matrix = calc_cov(X) # compute the covariance matrix
cov_matrix

The covariance matrix is a 4×4 matrix. Next we compute the eigenvalues and eigenvectors of this matrix:

eigenvalues, eigenvectors = np.linalg.eig(cov_matrix) # eigenvalues and eigenvectors
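
Since a covariance matrix is always symmetric, np.linalg.eigh is an alternative worth noting (an aside, not used in the original post): it is specialized for symmetric matrices and guarantees real eigenvalues, returned in ascending order:

# eigh is specialized for symmetric matrices: real eigenvalues, ascending order
eigenvalues_h, eigenvectors_h = np.linalg.eigh(cov_matrix)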

Then compute the matrix P we need; here we keep 3 principal components:

idx = eigenvalues.argsort()[::-1] # indices that sort the eigenvalues in descending order
# keep the eigenvectors of the 3 largest eigenvalues
eigenvectors = eigenvectors[:, idx]
eigenvectors = eigenvectors[:, :3]
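
As an optional sanity check (a sketch, not in the original post): for a symmetric covariance matrix, the eigenvectors returned by np.linalg.eig are numerically orthonormal, so the product of P transposed with P should be close to the 3×3 identity:

# The columns of P are (numerically) orthonormal, so P.T @ P ≈ I
print(np.round(eigenvectors.T @ eigenvectors, 6))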

The result is a matrix P with 4 rows and 3 columns; P is then used to obtain the dimensionality-reduced data:

# Transform: Y = XP
np.matmul(X, eigenvectors)

The shape of the dimensionality-reduced data is (150, 4) × (4, 3) = (150, 3): 150 rows in 3 columns, so the data has been reduced from the original 4 dimensions to 3.
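
These manual steps are exactly what the pca() function from Section 2.2 wraps up, so the same result can be obtained in a single call:

# One-call equivalent of the manual steps above, using pca() from Section 2.2
X_pca = pca(X, n_components=3)
print(X_pca.shape)  # (150, 3)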

3 Implementation based on Sklearn

# Import sklearn's decomposition module
from sklearn import decomposition
# Create a PCA model instance with 3 principal components
pca = decomposition.PCA(n_components=3) # specify how many components we need
# Fit the model
pca.fit(X)
# Apply the fitted model to the data X
X_trans = pca.transform(X)
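
The fitted sklearn model also reports how much variance each component captures through its explained_variance_ratio_ attribute, and fit and transform can be combined into a single fit_transform call:

# Fraction of the total variance explained by each of the 3 components
print(pca.explained_variance_ratio_)
# fit_transform combines pca.fit(X) and pca.transform(X)
X_trans = pca.fit_transform(X)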

# List of colors
colors = ['navy', 'turquoise', 'darkorange']
# Plot each class separately
for c, i, target_name in zip(colors, [0, 1, 2], iris.target_names):
    plt.scatter(X_trans[y == i, 0], X_trans[y == i, 1], 
            color=c, lw=2, label=target_name)
# Add the legend
plt.legend()
plt.show()

(Figure: the three iris species plotted against the first two principal components.)
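
The scatter plot above uses only the first two of the three retained components. Since three were kept, the result can also be viewed in 3-D; a minimal sketch using matplotlib's built-in 3-D projection (not part of the original post):

# 3-D scatter of all three principal components (sketch)
fig = plt.figure()
ax = fig.add_subplot(projection='3d')
for c, i, target_name in zip(colors, [0, 1, 2], iris.target_names):
    ax.scatter(X_trans[y == i, 0], X_trans[y == i, 1], X_trans[y == i, 2],
               color=c, label=target_name)
ax.legend()
plt.show()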


Source: blog.csdn.net/qq_44319167/article/details/128839122