Learning Principal Component Analysis (PCA) with scikit-learn

1. Introduction to scikit-learn PCA class

    In scikit-learn, the PCA-related classes are all in the sklearn.decomposition package. The most commonly used one is sklearn.decomposition.PCA, and below we will mainly explain how to perform PCA based on this class.

    In addition to the PCA class, the most commonly used PCA-related class is the KernelPCA class, which we also mentioned in the principles article. It is mainly used for dimensionality reduction of nonlinear data and relies on the kernel trick, so an appropriate kernel function must be chosen and its parameters tuned when using it.
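    As a quick illustration, here is a minimal sketch of KernelPCA on nonlinear data; the two-circles toy data, the RBF kernel, and the gamma value are assumptions made purely for demonstration.

from sklearn.datasets import make_circles
from sklearn.decomposition import KernelPCA

# A toy nonlinear dataset: two concentric circles (an assumed example)
X_circ, y_circ = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=0)

# The RBF kernel and gamma=10 are hypothetical choices; in practice both the kernel
# and its parameters need to be tuned for the data at hand
kpca = KernelPCA(n_components=2, kernel='rbf', gamma=10)
X_kpca = kpca.fit_transform(X_circ)
print(X_kpca.shape)  # (400, 2)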

    Another commonly used PCA-related class is the IncrementalPCA class, which mainly addresses the memory limitations of a single machine. Sometimes the sample size may be in the millions and the dimensionality in the thousands, and fitting the data directly may exhaust memory. In that case we can use the IncrementalPCA class: it splits the data into multiple batches and then incrementally calls partial_fit on each batch, so as to arrive at the final dimensionality reduction of the whole sample step by step.
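    A minimal sketch of this incremental workflow might look like the following; the batch count, the target dimension, and the random stand-in data are assumptions for illustration.

import numpy as np
from sklearn.decomposition import IncrementalPCA

rng = np.random.RandomState(0)
X_big = rng.rand(10000, 50)          # stand-in for data too large to fit in memory at once

ipca = IncrementalPCA(n_components=10)
# Feed the data batch by batch; only one batch needs to be in memory at a time
for X_batch in np.array_split(X_big, 20):
    ipca.partial_fit(X_batch)

X_reduced = ipca.transform(X_big)    # transform can also be applied batch by batch
print(X_reduced.shape)               # (10000, 10)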

    There are also the SparsePCA and MiniBatchSparsePCA classes. Their main difference from the PCA class above is the use of L1 regularization, which drives the loadings of many non-principal components to 0, so that the dimensionality reduction effectively uses only the relatively important components and avoids the influence of noise and other factors. The difference between SparsePCA and MiniBatchSparsePCA is that MiniBatchSparsePCA performs the reduction using a subset of sample features and a given number of iterations, in order to cope with feature decomposition being too slow on large samples; the cost, of course, is that the accuracy of the reduction may decrease. Using SparsePCA and MiniBatchSparsePCA requires tuning the L1 regularization parameter.
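    For reference, below is a minimal sketch of SparsePCA; the alpha value (the L1 regularization strength) and the random data are assumptions, and MiniBatchSparsePCA is used in much the same way with an additional batch_size argument.

import numpy as np
from sklearn.decomposition import SparsePCA

rng = np.random.RandomState(0)
X_rand = rng.rand(200, 30)

# alpha controls the strength of the L1 penalty: larger values push more
# component loadings to exactly 0
spca = SparsePCA(n_components=5, alpha=1.0, random_state=0)
X_spca = spca.fit_transform(X_rand)

# Many entries of the components are exactly zero because of the L1 penalty
print(np.mean(spca.components_ == 0))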

2. sklearn.decomposition.PCA parameter introduction

    Below we explain how to use scikit-learn for PCA dimensionality reduction based on sklearn.decomposition.PCA. The PCA class requires essentially no parameter tuning: generally we only need to specify the dimension to reduce to, or a threshold for the proportion of variance that the principal components after the reduction must retain out of the total variance of all the original features.

    Now let's introduce the main parameters of sklearn.decomposition.PCA:

    1) n_components : this parameter specifies the number of feature dimensions that we want PCA to reduce the data to. The most common practice is to specify the target number of dimensions directly, in which case n_components is an integer greater than or equal to 1. Alternatively, we can specify a minimum threshold for the proportion of variance explained by the principal components and let the PCA class decide the number of dimensions based on the sample feature variances; in this case n_components is a number in (0, 1]. We can also set the parameter to "mle", in which case the PCA class uses the MLE algorithm to choose a certain number of principal component features according to the variance distribution of the features. Finally, we can use the default, i.e., not pass n_components at all, in which case n_components = min(number of samples, number of features).

    2) whiten : whether to whiten. Whitening normalizes each feature of the dimensionally reduced data so that its variance is 1. For PCA dimensionality reduction itself, whitening is generally not required; if there are further data processing steps after the reduction, whitening can be considered. The default is False, meaning no whitening is performed.

    3) svd_solver : specifies the method used for singular value decomposition (SVD). Since eigendecomposition is a special case of SVD, PCA libraries are generally implemented on top of SVD. There are 4 values to choose from: {'auto', 'full', 'arpack', 'randomized'}. 'randomized' is generally suitable for PCA on data with many samples, many dimensions, and a low proportion of principal components; it uses randomized algorithms to speed up the SVD. 'full' is SVD in the traditional sense, using the corresponding scipy implementation. 'arpack' has similar applicable scenarios to 'randomized'; the difference is that 'randomized' uses scikit-learn's own SVD implementation, while 'arpack' directly uses scipy's sparse SVD implementation. The default is 'auto', in which case the PCA class weighs the three algorithms above and chooses a suitable SVD algorithm for the reduction. In general, the default is sufficient.

    In addition to these input parameters, two attributes of the PCA class deserve attention. The first is explained_variance_ , which gives the variance of each principal component after the reduction; the larger the variance, the more important the principal component. The second is explained_variance_ratio_ , which gives the ratio of each principal component's variance to the total variance; again, the larger the ratio, the more important the principal component. A short sketch putting these parameters and attributes together is given below.
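    This is a minimal sketch under assumed settings: the randomized solver, whitening, and the random data are illustrative choices rather than recommendations.

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
X_demo = rng.rand(500, 10)

# whiten=True rescales each projected component to unit variance;
# svd_solver='randomized' is an assumed choice here, the default 'auto' is usually fine
pca = PCA(n_components=3, whiten=True, svd_solver='randomized', random_state=0)
X_reduced = pca.fit_transform(X_demo)

print(pca.explained_variance_)        # variance of each retained component
print(pca.explained_variance_ratio_)  # fraction of total variance explained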

3. PCA example

    Let's use an example to learn how to use the PCA class in scikit-learn. To make visualization easy and give everyone an intuitive picture, we use three-dimensional data here and reduce its dimensionality.

    First we generate random data and visualize it, the code is as follows:

import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D  # registers the 3D projection
%matplotlib inline
from sklearn.datasets import make_blobs
# X is the sample feature matrix, y is the sample cluster label; 10,000 samples in total,
# each sample has 3 features, and there are 4 clusters
X, y = make_blobs(n_samples=10000, n_features=3,
                  centers=[[3, 3, 3], [0, 0, 0], [1, 1, 1], [2, 2, 2]],
                  cluster_std=[0.2, 0.1, 0.2, 0.2], random_state=9)
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.view_init(elev=30, azim=20)
ax.scatter(X[:, 0], X[:, 1], X[:, 2], marker='o')

    The distribution diagram of the 3D data is as follows:

    Let's not reduce the dimensionality yet; we simply project the data to see the variance distribution of the three projected dimensions. The code is as follows:

from sklearn.decomposition import PCA
pca = PCA(n_components=3)
pca.fit(X)
print(pca.explained_variance_ratio_)
print(pca.explained_variance_)

    The output is as follows:

[ 0.98318212  0.00850037  0.00831751]
[ 3.78483785  0.03272285  0.03201892]

    As you can see, the variance ratios of the three projected feature dimensions are roughly 98.3% : 0.8% : 0.8%; the first projected feature accounts for the overwhelming majority of the principal component variance.

    Now let's perform dimensionality reduction, from three dimensions down to two; the code is as follows:

pca = PCA(n_components=2)
pca.fit(X)
print(pca.explained_variance_ratio_)
print(pca.explained_variance_)

    The output is as follows:

[ 0.98318212  0.00850037]
[ 3.78483785  0.03272285]

    This result is to be expected: the variances of the three projected feature dimensions above were [ 3.78483785  0.03272285  0.03201892], so projecting to two dimensions naturally keeps the first two features and discards the third.

    To get an intuitive picture, let's look at the distribution of the transformed data; the code is as follows:

X_new = pca.transform(X)
plt.scatter(X_new[:, 0], X_new[:, 1],marker='o')
plt.show()

    The output plot is as follows:

    As you can see, even after the dimensionality reduction, the 4 clusters from the earlier three-dimensional plot are still clearly visible.

    Now, instead of specifying the reduced dimension directly, let's specify a threshold for the proportion of variance the principal components should explain.

pca = PCA(n_components=0.95)
pca.fit(X)
print(pca.explained_variance_ratio_)
print(pca.explained_variance_)
print(pca.n_components_)

    We required the principal components to account for at least 95% of the variance; the output is as follows:

[ 0.98318212]
[ 3.78483785]
1

    As you can see, only the first projected feature is retained. This is easy to understand: our first principal component accounts for as much as 98% of the projected variance, so this single feature dimension alone satisfies the 95% threshold. Now let's try a threshold of 99%; the code is as follows:

pca = PCA(n_components=0.99)
pca.fit(X)
print(pca.explained_variance_ratio_)
print(pca.explained_variance_)
print(pca.n_components_)

    The output this time is as follows:

[ 0.98318212  0.00850037]
[ 3.78483785  0.03272285]
2

    This result is also easy to understand: the first principal component accounts for 98.3% of the variance and the second for 0.8%, so together they satisfy our threshold.

    Finally, let's see what happens when we let the MLE algorithm choose the reduced dimension by itself; the code is as follows:

pca = PCA(n_components='mle')
pca.fit(X)
print(pca.explained_variance_ratio_)
print(pca.explained_variance_)
print(pca.n_components_)

    The output is as follows:

[ 0.98318212]
[ 3.78483785]
1

    As you can see, since the first projected feature of our data accounts for as much as 98.3% of the variance, the MLE algorithm kept only that first feature.


Reposted from https://www.cnblogs.com/pinard/p/6243025.html
