Hands-on dimensionality reduction with PCA (principal component analysis) in the scikit-learn machine learning library

1, Introduction to the PCA classes
In scikit-learn, the classes related to PCA are all in the sklearn.decomposition package. The most commonly used one is sklearn.decomposition.PCA.

Principle: PCA is a linear mapping (linear transformation). Simply put, it projects data from a high-dimensional space onto a lower-dimensional space and analyzes the data there: the principal components of the data (the directions carrying the most information, i.e. the most variance) are kept, while the unimportant components are ignored. In other words, the principal component vectors span the low-dimensional space, and the high-dimensional data are projected onto that space to complete the dimensionality reduction.
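To make this idea concrete, here is a minimal NumPy sketch of the projection step (center the data, eigendecompose the covariance matrix, project onto the top directions). The function name pca_project is invented for illustration; scikit-learn's own PCA is based on SVD rather than this exact procedure.

import numpy as np

def pca_project(X, k):
    # center each feature, then eigendecompose the covariance matrix
    Xc = X - X.mean(axis=0)
    cov = np.cov(Xc, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigh: symmetric matrix, eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1]        # sort directions by decreasing variance
    W = eigvecs[:, order[:k]]                # the k principal directions
    return Xc @ W                            # project onto the k-dimensional subspace

X = np.random.randn(100, 5)
print(pca_project(X, 2).shape)               # (100, 2)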

In addition to the PCA class, the most commonly used related class is KernelPCA. As discussed in the theory article, it is mainly used for dimensionality reduction of nonlinear data and relies on the kernel trick. When using it, you therefore need to choose an appropriate kernel function and tune its parameters.
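As a rough illustration, a KernelPCA sketch on the classic concentric-circles data might look like the following; the gamma value is only an illustrative guess and would normally be tuned:

from sklearn.datasets import make_circles
from sklearn.decomposition import KernelPCA

# two concentric circles: not linearly separable, so plain PCA cannot unfold them
X, y = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=0)
kpca = KernelPCA(n_components=2, kernel='rbf', gamma=10)   # gamma chosen arbitrarily here
X_kpca = kpca.fit_transform(X)
print(X_kpca.shape)   # (400, 2)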

Another commonly used PCA-related class is IncrementalPCA, which mainly addresses the memory limits of a single machine. Sometimes the sample size can reach millions and the dimensionality thousands, so fitting all the data in memory at once is infeasible. In that case we can use IncrementalPCA: it first splits the data into several batches and then calls partial_fit on each batch in turn, obtaining the final optimal dimensionality reduction step by step.
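A minimal sketch of that batch-wise workflow; the data here are random and the batch split is arbitrary, purely to show the partial_fit pattern:

import numpy as np
from sklearn.decomposition import IncrementalPCA

X = np.random.randn(10000, 50)            # stand-in for data that would arrive in chunks
ipca = IncrementalPCA(n_components=10)
for batch in np.array_split(X, 20):       # feed 20 batches of 500 samples each
    ipca.partial_fit(batch)
X_reduced = ipca.transform(X)             # transform can likewise be applied batch by batch
print(X_reduced.shape)                    # (10000, 10)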

There are also the SparsePCA and MiniBatchSparsePCA classes. Their difference from the PCA classes mentioned above is mainly the use of L1 regularization, which shrinks the loadings of many unimportant features to 0, so the dimensionality reduction only involves the genuinely important features and the result is less affected by noise-like factors. The difference between SparsePCA and MiniBatchSparsePCA is that MiniBatchSparsePCA uses a portion of the samples and a limited number of iterations, to address the problem of the decomposition being too slow when the data are large; the price, of course, is that the accuracy of the reduction may drop. Both SparsePCA and MiniBatchSparsePCA require tuning the L1 regularization parameter.
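A small sketch of both classes; alpha is the L1 regularization strength, and the values below are illustrative rather than recommendations:

import numpy as np
from sklearn.decomposition import SparsePCA, MiniBatchSparsePCA

X = np.random.randn(300, 20)

spca = SparsePCA(n_components=5, alpha=1.0, random_state=0)
X_spca = spca.fit_transform(X)
print((spca.components_ == 0).mean())     # fraction of loadings driven exactly to zero by L1

mbspca = MiniBatchSparsePCA(n_components=5, alpha=1.0, batch_size=50, random_state=0)
X_mb = mbspca.fit_transform(X)
print(X_mb.shape)                         # (300, 5)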

2, Introduction to the sklearn.decomposition.PCA parameters
The PCA class basically needs no parameter tuning. In general, we only need to specify the target dimension to reduce to, or, alternatively, a threshold for the proportion of the total variance of the original features that the retained principal components should explain.
One, parameter description
n_components:
Meaning: the number of principal components n that the PCA algorithm keeps, i.e. the number of features retained after the reduction.
Type: int or string, default None. With the default, all components are kept.
Assigned an int, e.g. n_components = 1, the original data are reduced to one dimension.
We can also specify a minimum threshold for the proportion of explained variance and let the PCA class decide, from the sample variances, how many dimensions to keep; in that case n_components is a number in (0, 1].
Assigned a string, e.g. n_components = 'mle', the number of components n is selected automatically (by the MLE algorithm). All three forms are shown in the sketch after the parameter list.
copy:
Type: bool, True or False, default True.
Meaning: whether to copy the original training data when running the algorithm. If True, the values of the original training data do not change after running PCA, because the computation is done on a copy of the original data; if False, the values of the original training data will change, because the dimensionality reduction is computed on the original data in place.
whiten:
Type: bool, default False.
Meaning: whitening, i.e. whether to additionally scale each principal component to unit variance.
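A short sketch of these parameter settings on random data (the 95% threshold is only an example; note that n_components='mle' requires at least as many samples as features):

import numpy as np
from sklearn.decomposition import PCA

X = np.random.randn(200, 6)
PCA(n_components=2).fit(X)                  # keep exactly 2 components
PCA(n_components=0.95).fit(X)               # keep enough components to explain 95% of the variance
PCA(n_components='mle').fit(X)              # choose the number of components automatically (MLE)
PCA(n_components=2, whiten=True).fit(X)     # additionally scale each component to unit variance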

Two, PCA object attributes
components_: returns the components with the largest variance.
explained_variance_ratio_: returns the percentage of variance explained by each of the n retained components.
n_components_: returns the number of retained components n.
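For example, after fitting on some random data these attributes can be read back as follows (a small sketch, separate from the example in section 3):

import numpy as np
from sklearn.decomposition import PCA

X = np.random.randn(100, 5)
pca = PCA(n_components=3).fit(X)
print(pca.components_.shape)              # (3, 5): one row per principal direction
print(pca.explained_variance_ratio_)      # fraction of the total variance per component
print(pca.n_components_)                  # 3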

Three, PCA object methods
fit(X, y=None)
fit() can be called the generic training method of scikit-learn; every algorithm that needs training has a fit() method, and it is the actual "training" step of the algorithm. Because PCA is an unsupervised learning algorithm, y is naturally None here.
fit(X): X is the data used to train the PCA model. Return value: the call returns the fitted object itself; for example, pca.fit(X) trains the pca object with X.
fit_transform(X)
Trains the PCA model with X and returns the dimensionality-reduced data.
newX = pca.fit_transform(X): newX is the reduced data.
inverse_transform()
Maps the reduced data back to the original space: X = pca.inverse_transform(newX).
transform(X)
Reduces the dimensionality of X. Once the model has been trained, this method can be used to reduce newly arriving data.
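A short round-trip sketch of these methods on random data; note that once components are discarded, inverse_transform only gives an approximate reconstruction:

import numpy as np
from sklearn.decomposition import PCA

X = np.random.randn(100, 4)
pca = PCA(n_components=2)
newX = pca.fit_transform(X)               # fit on X and reduce it to 2 dimensions
X_back = pca.inverse_transform(newX)      # map back to the original 4-dimensional space
print(newX.shape, X_back.shape)           # (100, 2) (100, 4)
print(pca.transform(np.random.randn(5, 4)).shape)   # reduce previously unseen data: (5, 2)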

3, PCA example
Below we use a simple example to learn how to use the PCA class in scikit-learn. To make visualization easy and give an intuitive feel, we reduce three-dimensional data here. First we generate random data and visualize it, as follows:

import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D  # only needed to register the 3d projection on older matplotlib
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA

# X is the sample feature matrix, y the cluster label; 10000 samples,
# 3 features each, spread over 4 clusters
X, y = make_blobs(n_samples=10000, n_features=3,
                  centers=[[3, 3, 3], [0, 0, 0], [1, 1, 1], [2, 2, 2]],
                  cluster_std=[0.2, 0.1, 0.2, 0.2],
                  random_state=9)
fig = plt.figure()
ax = fig.add_subplot(projection='3d')
ax.view_init(elev=30, azim=20)
ax.scatter(X[:, 0], X[:, 1], X[:, 2], marker='o')

# First project without reducing the dimension, to see the variance
# distribution of the three projected dimensions
pca = PCA(n_components=3)
pca.fit(X)
# percentage of variance explained by each retained component
print(pca.explained_variance_ratio_)
print(pca.explained_variance_)

# Now reduce from 3 dimensions to 2
pca1 = PCA(n_components=2)
pca1.fit(X)
print(pca1.explained_variance_ratio_)
print(pca1.explained_variance_)

# Comparing with the 3-component run above, the variances of the three
# projected dimensions were about [3.78483785 0.03272285 0.03201892],
# so the 2-dimensional projection keeps the first two components and
# discards the third.

# Visualize the data after the reduction to 2 dimensions
X_new = pca1.transform(X)
plt.figure()
plt.scatter(X_new[:, 0], X_new[:, 1], marker='o')
plt.show()


A small note: slightly different from the original post, I used the pca.fit_transform(X) method here instead of calling fit and transform separately.


4, PCA algorithm summary

As an unsupervised method for dimensionality reduction, PCA only needs an eigenvalue decomposition to compress and denoise the data, so it is used very widely in practice. To overcome some of its shortcomings, many variants of PCA have appeared, such as KernelPCA (KPCA) for nonlinear dimensionality reduction, Incremental PCA for the memory limitations discussed above, and Sparse PCA for sparse dimensionality reduction, and so on.

    The main advantages of the PCA algorithm are:

    1) Information is measured only by variance, so the result is not affected by factors outside the data set.

    2) The principal components are orthogonal to one another, which removes the mutual influence between the components of the original data.

    3) The computation is simple: the main operation is an eigenvalue decomposition, which is easy to implement.

    The main disadvantages of the PCA algorithm are:

    1) The meaning of each principal component dimension is somewhat ambiguous, and less interpretable than the original sample features.

    2) Non-principal components with small variance may also carry important information about differences between samples; discarding them during dimensionality reduction may affect subsequent data processing.
----------------

Original link: https://blog.csdn.net/brucewong0516/article/details/78666763
