Feature selection and extraction using PCA and LDA

This article builds a random synthetic dataset to demonstrate feature selection and extraction with PCA and LDA.

Create the dataset

The first function to use is make_blobs. Its docstring (viewable in PyCharm) describes it as "Generate isotropic Gaussian blobs for clustering", and it is used here to generate the dataset. Its main parameters are:

- n_samples: an integer or an array-like (list-like) value. Passing an integer generates that many samples; the default is 100. Here it is set to 10000, matching the code below, so 10000 samples are generated.
- n_features: the feature dimension of each sample. The default is 2; this article uses 3, so each sample has 3-dimensional features.
- centers: an integer or a list, described in the docstring as "The number of centers to generate, or the fixed center locations". It specifies the centers of the sample data. This article passes a list of 4 rows and 3 columns, meaning the dataset is divided into 4 classes, and each row gives the feature values of one class center.
- cluster_std: the standard deviation of each cluster. One standard deviation must be given per center, otherwise an error is raised (I verified this experimentally). Here it is given as a list, one value per class.
- random_state: a random seed that makes the experimental results reproducible; any integer works. This article sets it to 9.
When make_blobs is called, the first return value is the sample data and the second is the label value of each sample, the same pattern as the dataset-loading functions in the sklearn package used previously. Here the specified data is generated automatically: X is the data and y is the label, i.e. the class value.

from sklearn import datasets

# X is the sample feature matrix, y the cluster label of each sample:
# 10000 samples in total, each with 3 features, in 4 clusters
X, y = datasets.make_blobs(n_samples=10000,
                           n_features=3,
                           centers=[[3, 3, 3], [0, 0, 0],
                                    [1, 1, 1], [2, 2, 2]],
                           cluster_std=[0.2, 0.1, 0.2, 0.2],
                           random_state=9)
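
As a quick sanity check (not shown in the original post), the shapes of the returned arrays and the label values can be inspected:

print(X.shape)   # (10000, 3): 10000 samples, 3 features each
print(y.shape)   # (10000,): one class label per sample
print(set(y))    # {0, 1, 2, 3}: four clusters, one per center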

PCA dimensionality reduction

Next, the data are reduced in dimension using PCA. This transformation maps the data into a new coordinate system such that the greatest variance of any projection of the data lies on the first coordinate (called the first principal component), the second greatest variance on the second coordinate (the second principal component), and so on. The first parameter of PCA is n_components, described in PyCharm as "Number of components to keep", i.e. how many feature dimensions are retained after the reduction. Here it is set to 2, so the 3-dimensional features are reduced to 2. The method behind PCA is to compute the covariance matrix of the data matrix, obtain the eigenvalues and eigenvectors of that covariance matrix, and select the matrix formed by the eigenvectors corresponding to the k largest eigenvalues (i.e. the directions of largest variance). Projecting the data matrix onto this matrix transforms it into the new space and achieves the dimensionality reduction of the features; a minimal numpy sketch of this procedure is shown below.
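
As an illustration of the covariance/eigenvector procedure described above (this sketch is not part of the original post and is only for reference; the actual reduction uses sklearn's PCA below):

import numpy as np

def pca_manual(X, k):
    # center the data, then compute the covariance matrix of the features
    X_centered = X - X.mean(axis=0)
    cov = np.cov(X_centered, rowvar=False)
    # eigendecomposition of the (symmetric) covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)
    # take the eigenvectors belonging to the k largest eigenvalues
    order = np.argsort(eigvals)[::-1][:k]
    components = eigvecs[:, order]
    # project the centered data onto the selected directions
    return X_centered @ components

X_manual = pca_manual(X, 2)   # same idea as PCA(n_components=2), up to the sign of each axis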

from sklearn.decomposition import PCA

# Dimensionality reduction: keep the 2 principal components with the largest variance
pca = PCA(n_components=2)
pca.fit(X)
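
After fitting, the proportion of variance captured by each retained component can be checked (an optional step, not shown in the original post):

# fraction of the total variance explained by each of the 2 retained components
print(pca.explained_variance_ratio_)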

The data can then be displayed with matplotlib; after dimensionality reduction, only the information of the two selected dimensions is kept. A plotting sketch is given below.
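
A minimal plotting sketch, assuming matplotlib is available (the original post only shows the resulting figures):

import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D  # noqa: F401, enables the 3D projection

# original 3-feature data, colored by class label
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.scatter(X[:, 0], X[:, 1], X[:, 2], c=y, marker='o')
plt.show()

# data after PCA reduction to 2 dimensions
X_pca = pca.transform(X)
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=y, marker='o')
plt.show()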

[Figure: distribution of the original 3-dimensional dataset]

The figure above shows the distribution of the dataset before dimensionality reduction, where the classes are not very distinguishable; after dimensionality reduction (below), the separation between classes is much more obvious.

[Figure: distribution of the data after PCA reduction to 2 dimensions]


LDA dimensionality reduction

PCA does not use the class information: during dimensionality reduction it ignores the relationship between the sample features and the class labels. LDA, by contrast, is a supervised dimensionality reduction method and takes the class information into account, so the features obtained after LDA preserve the correlation between feature values and class labels. For the labeled data generated above, the samples in the feature space produced by LDA are therefore distributed better than with PCA, and the degree of class separation is higher.

from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# LDA dimensionality reduction: supervised, so the labels y are passed to fit
lda = LinearDiscriminantAnalysis(n_components=2)
lda.fit(X,y)
X_lda = lda.transform(X)
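
The reduced data can be plotted the same way as for PCA (a sketch, assuming matplotlib; the original post only shows the resulting figure):

import matplotlib.pyplot as plt

# data after LDA reduction to 2 dimensions, colored by class label
plt.scatter(X_lda[:, 0], X_lda[:, 1], c=y, marker='o')
plt.show()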

The result of the operation is:

[Figure: distribution of the data after LDA reduction to 2 dimensions]


Source: blog.csdn.net/qq_48068259/article/details/127893396