[Mathematical Modeling] Python Implementation of Common Algorithms: Principal Component Analysis (PCA)

1 Introduction

This article explains the Python implementation of principal component analysis (PCA) and follows it with an example analysis.

2 Principle and Code Implementation

2.1 Implementation steps

Principal component analysis (PCA) is a widely used dimensionality reduction method. Its implementation can be summarized in the following steps: standardize the data, compute the covariance matrix, compute the eigenvalues and eigenvectors of that matrix, keep the eigenvectors belonging to the largest eigenvalues as the columns of a matrix P, and finally project the data via Y = XP.

2.2 Code implementation

Import the required package:

import numpy as np
  • Define the covariance matrix function
    X is the input data and m is the number of samples, i.e. the number of rows of X.
    X is first standardized by subtracting the column mean and dividing by the column standard deviation (look up z-score standardization if this step is unfamiliar).
    After standardization each column has mean 0 and variance 1.
# Compute the covariance matrix
def calc_cov(X):
    m = X.shape[0] # number of samples, i.e. number of rows
    # Standardize the data: zero mean and unit variance per column
    X = (X - np.mean(X, axis=0)) / np.std(X, axis=0)
    return 1 / m * np.matmul(X.T, X) # matmul is the matrix product
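
As a quick sanity check (a sketch on made-up toy data, not from the original post), each column should have mean 0 and standard deviation 1 after standardization:

# Toy data for illustration only: 4 samples, 2 perfectly correlated features
X_toy = np.array([[1.0, 10.0],
                  [2.0, 20.0],
                  [3.0, 30.0],
                  [4.0, 40.0]])
X_std = (X_toy - np.mean(X_toy, axis=0)) / np.std(X_toy, axis=0)
print(np.mean(X_std, axis=0))  # approximately [0. 0.]
print(np.std(X_std, axis=0))   # [1. 1.]
print(calc_cov(X_toy))         # [[1. 1.] [1. 1.]] since the two features are fully correlated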
  • Define the PCA procedure
    First compute the covariance matrix of the input data X, then compute its eigenvalues (eigenvalues) and eigenvectors (eigenvectors).
    The np.linalg.eig() function computes both at once and is very convenient to use.
    The next step is to build the matrix P from the top eigenvectors and compute the dimensionality-reduced data as Y = XP (a small variance-explained helper is sketched after the code).
def pca(X, n_components):
    # Compute the covariance matrix
    cov_matrix = calc_cov(X)
    # Eigenvalues and eigenvectors of the covariance matrix
    eigenvalues, eigenvectors = np.linalg.eig(cov_matrix)
    # Sort the eigenvalues in descending order
    idx = eigenvalues.argsort()[::-1]
    # Keep the eigenvectors of the n_components largest eigenvalues
    eigenvectors = eigenvectors[:, idx]
    eigenvectors = eigenvectors[:, :n_components]
    # Transform: Y = XP (as in the original post, the raw X is projected here)
    return np.matmul(X, eigenvectors)
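
The sorted eigenvalues also tell us how much variance each principal component captures. As an optional helper (a sketch reusing calc_cov from above, not part of the original post):

# Fraction of the total variance explained by each principal component (sketch)
def explained_variance_ratio(X):
    eigenvalues, _ = np.linalg.eig(calc_cov(X))
    eigenvalues = np.sort(eigenvalues.real)[::-1]  # sort in descending order
    return eigenvalues / eigenvalues.sum()

A large ratio for the first few components indicates that little information is lost by dropping the rest.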

2.3 Iris data set example

Import Data

from sklearn import datasets
import matplotlib.pyplot as plt
# Load the iris dataset
iris = datasets.load_iris()
X = iris.data
y = iris.target

Check the shape of the data; the result is (150, 4), i.e. 150 samples with 4 features.

X.shape
# (150, 4)

Compute the covariance matrix

cov_matrix = calc_cov(X) # compute the covariance matrix
cov_matrix

The covariance matrix is a 4×4 matrix. Next we compute the eigenvalues and eigenvectors of this matrix:

eigenvalues, eigenvectors = np.linalg.eig(cov_matrix) # eigenvalues and eigenvectors
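
Since a covariance matrix is always symmetric, np.linalg.eigh is an alternative worth noting (an aside, not used in the original post): it is specialized for symmetric matrices and guarantees real eigenvalues, returned in ascending order:

# eigh is specialized for symmetric matrices: real eigenvalues, ascending order
eigenvalues_h, eigenvectors_h = np.linalg.eigh(cov_matrix)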

Then compute the matrix P we need; here we keep 3 principal components:

idx = eigenvalues.argsort()[::-1] # indices that sort the eigenvalues in descending order
# keep the eigenvectors of the 3 largest eigenvalues
eigenvectors = eigenvectors[:, idx]
eigenvectors = eigenvectors[:, :3]
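
As an optional sanity check (a sketch, not in the original post): for a symmetric covariance matrix, the eigenvectors returned by np.linalg.eig are numerically orthonormal, so the product of P transposed with P should be close to the 3×3 identity:

# The columns of P are (numerically) orthonormal, so P.T @ P ≈ I
print(np.round(eigenvectors.T @ eigenvectors, 6))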

The result is a matrix P with 4 rows and 3 columns; P is then used to obtain the dimensionality-reduced data:

# Transform: Y = XP
np.matmul(X, eigenvectors)

The shape of the dimensionality-reduced data is (150, 4) × (4, 3) = (150, 3): 150 rows in 3 columns, so the data has been reduced from the original 4 dimensions to 3.
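
These manual steps are exactly what the pca() function from Section 2.2 wraps up, so the same result can be obtained in a single call:

# One-call equivalent of the manual steps above, using pca() from Section 2.2
X_pca = pca(X, n_components=3)
print(X_pca.shape)  # (150, 3)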

3 Implementation based on Sklearn

# Import sklearn's decomposition module
from sklearn import decomposition
# Create a PCA model instance with 3 principal components
pca = decomposition.PCA(n_components=3) # specify how many components we need
# Fit the model
pca.fit(X)
# Apply the fitted model to the data X
X_trans = pca.transform(X)
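
The fitted sklearn model also reports how much variance each component captures through its explained_variance_ratio_ attribute, and fit and transform can be combined into a single fit_transform call:

# Fraction of the total variance explained by each of the 3 components
print(pca.explained_variance_ratio_)
# fit_transform combines pca.fit(X) and pca.transform(X)
X_trans = pca.fit_transform(X)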

# List of colors
colors = ['navy', 'turquoise', 'darkorange']
# Plot each class separately
for c, i, target_name in zip(colors, [0, 1, 2], iris.target_names):
    plt.scatter(X_trans[y == i, 0], X_trans[y == i, 1], 
            color=c, lw=2, label=target_name)
# Add the legend
plt.legend()
plt.show()

(Figure: the three iris species plotted against the first two principal components.)
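
The scatter plot above uses only the first two of the three retained components. Since three were kept, the result can also be viewed in 3-D; a minimal sketch using matplotlib's built-in 3-D projection (not part of the original post):

# 3-D scatter of all three principal components (sketch)
fig = plt.figure()
ax = fig.add_subplot(projection='3d')
for c, i, target_name in zip(colors, [0, 1, 2], iris.target_names):
    ax.scatter(X_trans[y == i, 0], X_trans[y == i, 1], X_trans[y == i, 2],
               color=c, label=target_name)
ax.legend()
plt.show()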


Source: blog.csdn.net/qq_44319167/article/details/128839122