Data reduction for data preprocessing

Table of contents

I. Introduction

II. The main parameters of PCA

III. Data reduction task 1

IV. Data reduction task 2


I. Introduction

PCA (Principal Component Analysis) is one of the most widely used data dimensionality reduction algorithms. Its main idea is to map n-dimensional features onto k dimensions. These k dimensions are brand-new orthogonal features, also called principal components; they are k-dimensional features reconstructed on the basis of the original n-dimensional features.

Essentially, PCA computes the covariance matrix of the data matrix and then obtains that covariance matrix's eigenvalues and eigenvectors; the matrix formed by the eigenvectors corresponding to the k largest eigenvalues (that is, the k directions of largest variance) is selected. Through this matrix, the data can be transformed into a new space, achieving dimensionality reduction of the data features. (Source: Zhihu)
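As a minimal sketch of the procedure just described (covariance matrix, eigendecomposition, projection onto the top-k eigenvectors), using random placeholder data and illustrative names:

import numpy as np

def pca_reduce(X, k):
    # Project the samples-by-features matrix X onto its top-k principal components.
    X_centered = X - X.mean(axis=0)          # center each feature
    cov = np.cov(X_centered, rowvar=False)   # covariance matrix of the features
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigh: symmetric matrix, eigenvalues in ascending order
    top_k = eigvecs[:, np.argsort(eigvals)[::-1][:k]]  # eigenvectors of the k largest eigenvalues
    return X_centered @ top_k                # coordinates in the new k-dimensional space

X = np.random.rand(16, 35)                   # placeholder: 16 samples, 35 features
print(pca_reduce(X, 5).shape)                # (16, 5)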

II. The main parameters of PCA

1) n_components: this parameter specifies the number of feature dimensions you want to keep after PCA dimensionality reduction.

The most common usage is to specify the target number of dimensions directly; in that case n_components is an integer greater than or equal to 1.

Alternatively, we can specify a minimum proportion of the total variance that the principal components must explain, and let the PCA class determine the number of dimensions from the variance of the sample features; in that case n_components is a float in (0, 1].

2) explained_variance_ratio_: the ratio of the variance of each principal component after dimensionality reduction to the total variance. The larger the ratio, the more important the principal component. Both settings are illustrated in the sketch below.
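A short sketch of both ways of setting n_components, and of reading explained_variance_ratio_, on placeholder data:

import numpy as np
from sklearn.decomposition import PCA

X = np.random.rand(16, 35)          # placeholder data

pca_int = PCA(n_components=5)       # keep exactly 5 dimensions
pca_int.fit(X)

pca_ratio = PCA(n_components=0.95)  # keep enough dimensions to explain at least 95% of the variance
pca_ratio.fit(X)
print(pca_ratio.n_components_)                    # the number of dimensions PCA chose itself
print(pca_ratio.explained_variance_ratio_.sum())  # >= 0.95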

III. Data reduction task 1

Import the file "Euro2012_stats.csv" and, after inspecting the data, remove the attribute columns (team name, percentages, etc.) that cannot directly participate in the PCA computation.

Note: the corresponding csv file can be obtained from this blog's resources.

from sklearn.decomposition import PCA
import pandas as pd
eu = pd.read_csv("D:\\dataspace\\Euro2012_stats.csv", encoding="utf-8-sig")
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
print(eu)         # data preview
print(eu.dtypes)  # view the dtype of each feature column
eu1 = eu.drop(columns=['Team', 'Shooting Accuracy', '% Goals-to-shots', 'Passing Accuracy', 'Saves-to-shots ratio'])  # drop the non-numeric feature columns
print(eu1)        # preview of the data after dropping
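Instead of listing the non-numeric columns by hand, the numeric columns could also be kept automatically; a small sketch, assuming eu has been loaded as above:

eu_numeric = eu.select_dtypes(include='number')  # percentage strings such as 'Shooting Accuracy' are object dtype and fall away
print(eu_numeric.columns)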

Data preview:

The dtype of each feature column:

Data preview after deletion:

IV. Data reduction task 2

Perform PCA on the data. Import PCA (from sklearn.decomposition import PCA), reduce all numeric attribute columns to 5 dimensions, and view the dimensionality reduction results. By analyzing the results, select an appropriate compression dimension. (Self-study explained_variance_ratio_, explain what the value means in the experiment report, and show the result.)

from sklearn.decomposition import PCA
pca = PCA(n_components=5)
pca.fit(eu1)
newdata = pca.transform(eu1)   # coordinates of each sample in the 5-dimensional space
print('Dimensionality reduction result\n', newdata)
newdata1 = pd.DataFrame(newdata)   # keep a DataFrame copy for plotting later
print('Explained variance ratio\n', pca.explained_variance_ratio_)

The dimensionality reduction results are as follows:

By examining explained_variance_ratio_, we obtain the ratio of the variance of each principal component to the total variance, that is, the variance contribution rate.
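One common way to choose the compression dimension from these ratios is their cumulative sum; a sketch reusing eu1 from above (the 90% threshold is only an example):

import numpy as np
pca_full = PCA()                                     # keep all components
pca_full.fit(eu1)
cum_ratio = np.cumsum(pca_full.explained_variance_ratio_)
print(cum_ratio)
k = int(np.argmax(cum_ratio >= 0.9)) + 1             # smallest k whose components explain at least 90% of the variance
print('suggested k:', k)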

Use the first column of the dimensionality-reduced feature data as the y-axis and the team column of the original data as the x-axis to draw a line graph:
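A sketch of that plot, assuming the eu and newdata1 frames from the steps above:

import matplotlib.pyplot as plt
plt.figure(figsize=(20, 10))
plt.plot(eu['Team'], newdata1[0], label='first principal component')  # x: team names, y: first PC
plt.xticks(rotation=45)  # keep the team names readable
plt.legend()
plt.show()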

Re-adjust the parameters to determine the appropriate compression dimension:

scorelist = []
for i in range(1, 16):   # n_components=0 is meaningless, and with only 16 samples at most 15 informative components exist
    pca1 = PCA(n_components=i)
    pca2 = pca1.fit(eu1)
    score = pca2.score(eu1)   # average log-likelihood of the data under the fitted probabilistic PCA model
    scorelist.append(score)
print(scorelist)
import matplotlib.pyplot as plt
plt.figure(figsize=(20, 10))
plt.plot(range(1, 16), scorelist, label='score')
plt.legend()
plt.show()

As the figure shows, the score keeps rising as the number of components grows, but the dataset contains only 16 samples (one per team), so retaining too many dimensions defeats the purpose of compression and makes the result unreliable. A suitable compression dimension is therefore 8.