PCA (Principal Component Analysis) dimensionality reduction, explained with code

Introduction

With the advent of the big data era, we often need to process high-dimensional data. High-dimensional data not only increases computational complexity, but can also lead to the "curse of dimensionality". To address this, we reduce the dimensionality of the data, that is, we map the data from a high-dimensional space to a low-dimensional space while losing as little information as possible. Principal Component Analysis (PCA) is one of the most commonly used dimensionality reduction methods.

In short, PCA simplifies complex high-dimensional data into lower-dimensional data that is easier to understand, while retaining the most important information, so that we can analyze and process the data more conveniently.

As an example, suppose all the data points are distributed in three-dimensional space. PCA maps the three-dimensional data onto a two-dimensional plane u, spanned by the vectors u1 and u2, where u1 is perpendicular to u2.
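As a minimal illustration of this 3D-to-2D idea (the data below is randomly generated purely for demonstration):

import numpy as np
from sklearn.decomposition import PCA

# Generate illustrative 3D points that lie close to a 2D plane
rng = np.random.default_rng(0)
points_3d = rng.normal(size=(100, 2)) @ np.array([[1.0, 0.5, 0.2],
                                                  [0.3, 1.0, 0.1]])
points_3d += 0.05 * rng.normal(size=(100, 3))  # small noise off the plane

# Project the 3D data onto the 2D plane spanned by u1 and u2
pca_3d_to_2d = PCA(n_components=2)
points_2d = pca_3d_to_2d.fit_transform(points_3d)

print(points_2d.shape)                 # (100, 2): two coordinates per point
print(pca_3d_to_2d.components_.shape)  # (2, 3): the two directions u1 and u2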

Code demo: 

import numpy as np
from sklearn.decomposition import PCA

# Create a 2D NumPy array with five data points and two features
data = np.array([[1, 1], [1, 3], [2, 3], [4, 4], [2, 4]])

# Create a PCA object; n_components=0.9 means: keep enough principal
# components to explain at least 90% of the variance of the original data
pca = PCA(n_components=0.9)

# Fit the PCA model to the input data and compute the principal components
pca.fit(data)

# Transform the original data with the fitted model, projecting it into the
# new feature space; the compressed result is stored in the variable new
new = pca.transform(data)  # compressed matrix

# Print the compressed data
print("Compressed Data:")
print(new)

# Print the proportion of variance explained by each retained principal component
# (components are kept until the cumulative explained variance reaches 90%)
print("Explained Variance Ratios:")
print(pca.explained_variance_ratio_)
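After running the demo, it is worth checking how many components scikit-learn actually kept for the 90% threshold (these are standard attributes of a fitted PCA object):

# Number of principal components retained to reach the 90% threshold
print("Components kept:", pca.n_components_)

# Cumulative share of variance explained by the retained components
print("Cumulative ratio:", np.cumsum(pca.explained_variance_ratio_))

# The principal directions (one row per component) and the mean used for centering
print("Components:\n", pca.components_)
print("Mean:", pca.mean_)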

Compressed matrix:

Data after dimensionality reduction by PCA. This matrix contains the representation of the reduced dimensionality data points in the new feature space.

Simply put, each row corresponds to a data point in the original data, and each column corresponds to a principal component (a new feature). Because n_components=0.9 is set, scikit-learn keeps the smallest number of components whose cumulative explained variance ratio reaches 90%. For this dataset the first component alone explains only about 83% of the variance, so both components are kept and the compressed matrix still has two columns: the data has been rotated into the principal-component coordinate system, and reducing it further to one dimension would require a different setting (see below).
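To make this concrete: scikit-learn's transform simply centers the data and projects it onto the principal directions. A quick check, reusing the pca and data objects fitted above:

# PCA's transform: subtract the mean, then project onto the component directions
manual = (data - pca.mean_) @ pca.components_.T

# Should match pca.transform(data) up to floating-point error
print(np.allclose(manual, pca.transform(data)))  # True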

Proportion of variance explained by principal components:

In the provided dataset data, each data point has two features. When applying PCA for dimensionality reduction, PCA looks for a new feature space in which the first principal component (the first new feature) has the largest variance and the second principal component (the second new feature) has the second-largest variance. For the detailed derivation, you can read my blog: Derivation of PCA dimensionality reduction (super detailed)_AI_dataloads' blog-CSDN blog.
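The ratios discussed below can be reproduced by hand for this small dataset: form the covariance matrix of the centered data, take its eigenvalues, and divide each eigenvalue by their sum.

# Center the data and compute the sample covariance matrix
centered = data - data.mean(axis=0)
cov = np.cov(centered, rowvar=False)      # [[1.5, 1.0], [1.0, 1.5]]

# Eigenvalues of the covariance matrix = variances along the principal components
eigvals = np.linalg.eigvalsh(cov)[::-1]   # [2.5, 0.5], largest first

# Normalizing gives the explained variance ratios
print(eigvals / eigvals.sum())            # [0.8333..., 0.1666...]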

For this data, the first principal component computed by PCA has an explained variance ratio of about 0.83, while the second has a ratio of about 0.17. The first principal component therefore captures most of the variation and information in the data, while the second contains relatively little; dropping the second component would discard only about 17% of the information.

This is exactly what the printed array [0.83333333, 0.16666667] shows. Note, however, that because the first component on its own explains only about 83% of the variance, below the 90% threshold we requested, scikit-learn keeps both components in this example rather than removing the second one. This is how PCA works in general: it captures the most important variation in the data first, so that the later, low-variance components can be dropped to reduce redundancy.
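If the goal is really to keep only the first principal component for this dataset, a simple option (a sketch using the same data as above) is to request a fixed number of components instead of a variance threshold:

# Request exactly one component instead of a 90% variance threshold
pca_1d = PCA(n_components=1)
reduced = pca_1d.fit_transform(data)

print(reduced.shape)                     # (5, 1): truly one-dimensional
print(pca_1d.explained_variance_ratio_)  # [0.8333...]: ~83% of the variance retained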
