[Python] Data mining, analysis and cleaning: a summary of feature selection (dimensionality reduction) methods



Foreword

Feature dimensionality reduction is the process of converting high-dimensional data into low-dimensional data while retaining as much of the original information as possible. In practice we often face large amounts of high-dimensional data, and how to process it effectively and discover the underlying patterns and relationships is an important problem. Dimensionality reduction helps us reduce data volume and computational complexity, and avoid problems such as overfitting.

[Figure: a preview of the dataset used in the examples below]

1. PCA Dimensionality Reduction

PCA (Principal Component Analysis) is a common dimensionality reduction technique that maps a high-dimensional dataset into a lower-dimensional space. In data analysis and machine learning, PCA is often used to reduce the number of features in a dataset, thereby simplifying the model and avoiding overfitting.
The basic idea of PCA is to find the most important directions in the data (the principal components) and project the data onto them. These principal components are the most informative directions in the data, and usually correspond to the directions in which the variance of the data is greatest.
Through PCA we can discard the less relevant dimensions of the original data and keep only the features that explain most of its variance. This effectively reduces the complexity of the data, improves processing efficiency, and avoids the overfitting problems caused by high-dimensional data.

from sklearn.decomposition import PCA
pca = PCA(n_components=5)   # an integer keeps that many components; a float in (0, 1) keeps that fraction of the information
X_new = pca.fit_transform(data_analy_x)

"""Inspect some PCA attributes"""
print(X_new.shape)   # (200000, 5)
print(pca.explained_variance_)    # the amount of information (explained variance) carried by each component after reduction
print(pca.explained_variance_ratio_)  # each new feature's share of the original data's total information
print(pca.explained_variance_ratio_.sum())    # total information retained after reduction

[Figure: the PCA output. The dimension is reduced to 5, and each new feature's share of the total information can be read from explained_variance_ratio_.]
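As noted in the comment above, n_components also accepts a float. A minimal sketch (assuming the same data_analy_x), letting PCA pick however many components are needed to retain 95% of the variance:

pca_95 = PCA(n_components=0.95)   # a float in (0, 1): keep this fraction of the total variance
X_95 = pca_95.fit_transform(data_analy_x)
print(pca_95.n_components_)                      # number of components actually kept
print(pca_95.explained_variance_ratio_.sum())    # retained share, >= 0.95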

2. SVD

SVD dimensionality reduction is another common technique. By performing Singular Value Decomposition (SVD) on the data matrix, the original high-dimensional data can be converted into a representation in a lower-dimensional space. This preserves the main characteristics of the data, reducing storage and computation costs while avoiding the "curse of dimensionality".

In SVD dimensionality reduction, we choose which components to keep according to the size of the singular values. Usually we only need to keep the singular vectors corresponding to the few largest singular values to obtain a low-dimensional representation that approximates the original data. This representation can then be used for further processing, visualization, or model training.
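As a sketch of this truncation (a toy NumPy illustration on random data, not the dataset above):

import numpy as np

X = np.random.rand(100, 20)                # toy matrix: 100 samples, 20 features
U, S, Vt = np.linalg.svd(X, full_matrices=False)
k = 5                                      # keep the 5 largest singular values
X_reduced = U[:, :k] * S[:k]               # low-dimensional representation, shape (100, 5)
X_approx = X_reduced @ Vt[:k, :]           # rank-5 approximation of the original X

In practice, sklearn's TruncatedSVD performs this truncation for us: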

import pandas as pd
from sklearn.decomposition import TruncatedSVD

svd = TruncatedSVD(n_components=5)  # use sklearn's TruncatedSVD to reduce the matrix to 5 dimensions
X_new = pd.DataFrame(svd.fit_transform(data_analy_x))
X_new

[Figure: the SVD output, reduced to 5 dimensions]
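As with PCA, we can check how much information the 5 components retain (a small follow-up on the svd object fitted above):

print(svd.singular_values_)                  # the 5 largest singular values
print(svd.explained_variance_ratio_)         # each component's share of the variance
print(svd.explained_variance_ratio_.sum())   # total information retained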

3. Factor Analysis

Factor analysis is a statistical method for identifying latent factors in a dataset. It aims to determine what common factors underlie a set of variables, and groups those variables into a small number of correlated factors.

In factor analysis, we first choose an appropriate number of factors, then find the factors that best explain the variance of the data. These factors are typically interpreted as a particular construct or concept, such as health, education level, or happiness.

Factor analysis simplifies the data and improves its readability, and it helps us better understand the structure and relationships behind it. It can also help us anticipate future trends and behaviors and inform decisions and strategies.

from sklearn.decomposition import FactorAnalysis

X_new = pd.DataFrame(FactorAnalysis(n_components=5).fit_transform(data_analy_x.values))
X_new

[Figure: the factor analysis output, reduced to 5 dimensions]
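To see what each factor means, we can inspect the loading matrix. A minimal sketch, assuming data_analy_x is a DataFrame as above (the variable names here are illustrative):

fa = FactorAnalysis(n_components=5).fit(data_analy_x.values)
loadings = pd.DataFrame(fa.components_.T,   # shape (n_features, 5)
                        index=data_analy_x.columns,
                        columns=[f"factor_{i}" for i in range(5)])
print(loadings)   # large absolute loadings show which variables drive each factor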

Summary

PCA, SVD and factor analysis are commonly used dimensionality reduction methods. They help us extract the most important information from high-dimensional data, thereby simplifying the data, speeding up computation and reducing noise.

PCA (Principal Component Analysis) works from the covariance matrix of the data: it finds the principal components that explain the greatest share of the original variance and uses them as a new coordinate system onto which the data is projected. PCA emphasizes retaining maximum variance, so it works best when the data structure is relatively simple.
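This covariance-based view can be written out directly (a sketch for intuition, on a toy matrix rather than the dataset above):

import numpy as np

X = np.random.rand(100, 20)
Xc = X - X.mean(axis=0)                 # center each feature
cov = np.cov(Xc, rowvar=False)          # 20 x 20 covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)  # eigh returns eigenvalues in ascending order
order = np.argsort(eigvals)[::-1]       # reorder by variance, largest first
W = eigvecs[:, order[:5]]               # top-5 principal directions
X_pca = Xc @ W                          # project onto the new coordinate system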

SVD (Singular Value Decomposition) factorizes the data matrix by its singular values and projects the data into a smaller space, removing unnecessary detail and reducing storage. Compared with PCA, SVD focuses more on compressing the data and suits situations where the computation is relatively heavy.
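One way to see the relationship between the two (a small check, reusing the toy X from the sketch above): SVD applied to centered data recovers the PCA directions up to sign.

from sklearn.decomposition import PCA, TruncatedSVD

pca5 = PCA(n_components=5).fit(X)
svd5 = TruncatedSVD(n_components=5, algorithm="arpack").fit(X - X.mean(axis=0))
print(np.abs(pca5.components_[0, :5]))   # first principal direction...
print(np.abs(svd5.components_[0, :5]))   # ...matches up to a sign flip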

Factor analysis is a modeling approach that seeks latent factors hidden in the data. With it we can identify a common set of factors and map the variables onto them to understand the structure and relationships behind the data. Compared with PCA and SVD, factor analysis focuses more on finding correlations and underlying structure.

In summary, PCA, SVD and factor analysis are three different dimensionality reduction methods that differ in theoretical basis, application scenarios and how they process data. In a specific application, choose the method that fits your needs.


Original article: blog.csdn.net/weixin_47058355/article/details/130420263