[Machine Learning] Feature Dimensionality Reduction - Principal Component Analysis (PCA)

"Author's Homepage": Shibie Sanri wyx "Author's Profile
": CSDN top100, Alibaba Cloud Blog Expert, Huawei Cloud Sharing Expert, and High- quality Creator in the Network Security Field

Among the extracted features, some are correlated (similar) and therefore redundant; counting them separately adds nothing. We want to remove the correlated features and keep the uncorrelated ones. This is called "feature dimensionality reduction".

There are many ways to reduce the dimensionality of features; the one used here is principal component analysis.

1. Principal Component Analysis

Principal Component Analysis (PCA) is a statistical method. Through an orthogonal transformation, it converts a set of possibly correlated variables into a set of linearly uncorrelated variables; the converted variables are called "principal components".

When there are many variables and they are strongly correlated, that is, when many variables are "similar", the workload and complexity of the analysis increase.

Based on the correlations between variables, principal component analysis creates new variables to replace the repetitive and unimportant ones; that is, it replaces many variables with fewer variables that still reflect most of the information in the originals, thereby improving the speed of data processing.

For example, when selecting "Three-Good Students" (a merit award), each student has many features such as height, weight, family background, and grades, but height and weight are useless for the selection, so we drop these useless features and rely on grades instead.
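To make "converting correlated variables into linearly uncorrelated ones" concrete, here is a minimal sketch with made-up data: two strongly correlated columns go in, and the covariance matrix of the transformed output is essentially diagonal, i.e. the new variables are linearly uncorrelated.

import numpy as np
from sklearn import decomposition

# Made-up data: two strongly correlated features (y ≈ 2x + small noise)
rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 2 * x + rng.normal(scale=0.1, size=100)
data = np.column_stack([x, y])

# Transform into principal components
pca = decomposition.PCA(n_components=2)
transformed = pca.fit_transform(data)

# Off-diagonal entries are ~0: the new variables are linearly uncorrelated
print(np.cov(transformed, rowvar=False).round(6))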

sklearn.decomposition.PCA( n_components=None )

  • PCA.fit_transform( data ): fit to the data and return the dimensionally reduced data
  • PCA.inverse_transform( data ): map dimensionally reduced data back to the original feature space
  • PCA.get_covariance(): get the estimated covariance of the original features
  • PCA.get_params(): get the model's parameters (see the sketch after this list)
  • n_components: the target dimensionality (a float specifies what fraction of the variance to retain; an integer specifies how many dimensions to reduce to)
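Of the methods above, get_params is the only one not demonstrated in the sections below; a minimal sketch of what it returns (the exact dict contents depend on the sklearn version):

from sklearn import decomposition

pca = decomposition.PCA(n_components=2)

# Returns the constructor parameters as a dict,
# including e.g. 'n_components': 2
print(pca.get_params())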

2. Specify the dimension

The n_components parameter is an "integer" , which means to reduce to the "specified dimension" .

from sklearn import decomposition

# Test data: 3 samples, 4 features
data = [[2,8,4,5], [6,3,0,8], [5,4,9,1]]

# Initialize PCA to reduce to 2 dimensions
pca = decomposition.PCA(n_components=2)

# Reduce the dimensionality
result = pca.fit_transform(data)
print(result)

output:

[[ 1.28620952e-15  3.82970843e+00]
 [ 5.74456265e+00 -1.91485422e+00]
 [-5.74456265e+00 -1.91485422e+00]]

As can be seen from the results, the features are reduced from the original 4 dimensions to 2 dimensions.

PS: There are originally 4 columns, i.e. 4 dimensions; after dimensionality reduction there are 2 columns, i.e. 2 dimensions.
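To check how much variance the two retained components actually carry, the fitted model exposes sklearn's standard explained_variance_ratio_ attribute (not part of the list above); a quick sketch:

from sklearn import decomposition

data = [[2,8,4,5], [6,3,0,8], [5,4,9,1]]

pca = decomposition.PCA(n_components=2)
pca.fit(data)

# Fraction of the total variance carried by each retained component;
# three samples always lie in a 2-D plane, so the two fractions sum to 1.0
print(pca.explained_variance_ratio_)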


3. Retention ratio

The n_components parameter is "decimal" , which means how much information is retained after dimensionality reduction.

from sklearn import decomposition

# Test data: 3 samples, 4 features
data = [[2,8,4,5], [6,3,0,8], [5,4,9,1]]

# Initialize PCA to retain at least 30% of the variance
pca = decomposition.PCA(n_components=0.30)

# Reduce the dimensionality
result = pca.fit_transform(data)
print(result)

output:

[[ 1.28620952e-15]
 [ 5.74456265e+00]
 [-5.74456265e+00]]

As can be seen from the results, the features are reduced from the original 4 dimensions to 1 dimension: a single component is already enough to retain at least 30% of the variance. Note that the threshold is a lower bound; here that one component actually carries 75% of the total variance.
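This can be verified with the fitted model's n_components_ and explained_variance_ratio_ attributes (standard sklearn attributes, not shown in the list above); a quick sketch:

from sklearn import decomposition

data = [[2,8,4,5], [6,3,0,8], [5,4,9,1]]

pca = decomposition.PCA(n_components=0.30)
pca.fit(data)

# Number of components kept to reach the 30% threshold (1 here)
print(pca.n_components_)
# Variance fraction carried by that component (0.75, well above 0.30)
print(pca.explained_variance_ratio_)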


4. Get the covariance

from sklearn import decomposition

# Test data: 3 samples, 4 features
data = [[2,8,4,5], [6,3,0,8], [5,4,9,1]]

# Initialize PCA to reduce to 2 dimensions
pca = decomposition.PCA(n_components=2)

# Fit and reduce, then inspect the estimated covariance
result = pca.fit_transform(data)
print(pca.get_covariance())

output:

[[  4.33333333  -5.5         -1.66666667   1.16666667]
 [ -5.5          7.           1.5         -1.        ]
 [ -1.66666667   1.5         20.33333333 -15.83333333]
 [  1.16666667  -1.         -15.83333333  12.33333333]]
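get_covariance() returns the model's estimate of the covariance matrix of the original four features. Because the two retained components capture all of the variance in this tiny dataset, the estimate coincides with the plain sample covariance; a sketch to confirm:

import numpy as np

data = [[2,8,4,5], [6,3,0,8], [5,4,9,1]]

# Sample covariance of the four original columns; matches the output above
print(np.cov(np.array(data), rowvar=False))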

5. Recover the original data

inverse_transform maps dimensionally reduced data back to the original feature space. Here the reconstruction is exact, because the two retained components lost no information; in general it is only an approximation, as shown after the output below.

from sklearn import decomposition

# Test data: 3 samples, 4 features
data = [[2,8,4,5], [6,3,0,8], [5,4,9,1]]

# Initialize PCA to reduce to 2 dimensions
pca = decomposition.PCA(n_components=2)

# Reduce, then map the result back to the original feature space
result = pca.fit_transform(data)
print(pca.inverse_transform(result))

output:


[[2. 8. 4. 5.]
 [6. 3. 0. 8.]
 [5. 4. 9. 1.]]
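The round trip is exact here only because the two components discarded no variance. With n_components=1 (dropping the 25% of variance carried by the second component), inverse_transform returns only an approximation of the original data; a sketch:

from sklearn import decomposition

data = [[2,8,4,5], [6,3,0,8], [5,4,9,1]]

# Keep only the first principal component
pca = decomposition.PCA(n_components=1)
result = pca.fit_transform(data)

# Reconstruction is now approximate, not an exact copy of data
print(pca.inverse_transform(result))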

Origin blog.csdn.net/wangyuxiang946/article/details/131758573