Machine Learning Notes - Dimensionality Reduction (Feature Extraction)

[Why dimensionality reduction]
*The curse of dimensionality: at a given precision, the sample size required to accurately estimate a function of several variables grows exponentially with the dimension of the samples.
*What dimensionality reduction buys us: it overcomes the curse of dimensionality, extracts the essential features, saves storage space, removes useless noise, and makes data visualization possible.
--------------------------
Data dimensionality reduction divides into two approaches: feature selection and feature extraction. These notes cover feature extraction, i.e. obtaining the reduced features by applying some transformation to the existing features.
--------------------------
[Part 1, Linear Dimensionality Reduction Methods]
Assumption: the dataset is sampled from a globally linear subspace of the high-dimensional space, i.e. the variables that make up the data are independent of each other.
... dimensionality reduction is achieved through linear combinations of the original features.
----------[PCA] Principal Component Analysis
Basic idea: construct a series of linear combinations of the original variables to form a small number of comprehensive indices (the principal components), removing the correlation in the data while keeping as much of the variance information of the original high-dimensional data as possible in the low-dimensional data.
Determining the number of principal components:
    Contribution rate: the proportion of the total variance accounted for by the i-th principal component, reflecting how large a share of the total information the i-th principal component extracts.
    Cumulative contribution rate: the proportion of the total variance accounted for by the first k principal components.
    Rule of thumb: keep enough components that the cumulative contribution rate exceeds 0.85; a small sketch of this calculation follows below.
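A minimal sketch of this rule in R (the iris data and variable names are illustrative assumptions, not part of the original notes):
lambda <- eigen(cor(iris[, 1:4]))$values   # variances (eigenvalues) of the principal components
contrib <- lambda / sum(lambda)            # contribution rate of each component
cum_contrib <- cumsum(contrib)             # cumulative contribution rate
which(cum_contrib > 0.85)[1]               # smallest number of components whose cumulative rate exceeds 0.85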
Correlation coefficient matrix or covariance matrix?
When the variables have different units or very different value ranges, principal component analysis should start from the correlation coefficient matrix; for data measured on the same scale or with comparable value ranges, the covariance matrix should be used as the starting point.
--------------R implementation: the princomp function
princomp(x, cor = FALSE, scores = TRUE, ...)
cor: logical, whether the calculation should use the correlation matrix (TRUE) or the covariance matrix (FALSE)
scores: logical, whether to compute the principal component scores
Return value:
loadings: a matrix whose columns are the eigenvectors, i.e. the rotation coefficients of the original features
scores: the scores of each sample on the principal components
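A minimal princomp usage sketch (the iris data is an assumed example; cor = TRUE is chosen because its variables have different value ranges):
pc <- princomp(iris[, 1:4], cor = TRUE, scores = TRUE)
summary(pc)                    # standard deviations and (cumulative) contribution rates
pc$loadings                    # eigenvectors / rotation coefficients of the original features
head(pc$scores)                # scores of the first samples on each principal component
screeplot(pc, type = "lines")  # scree plot to help choose the number of components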
-----------[LDA] Linear Discriminant Analysis
LDA can reduce data with C classes to at most a (C-1)-dimensional subspace.
---------R implementation
library(MASS)
> params <- lda(y ~ x1 + x2 + x3, data = d)
## The first argument is the discriminant formula, the second is the sample data used for training.
## After the lda command runs, the coefficients of the linear discriminants are printed.
> predict(params, newdata)
## The predict command classifies unlabelled samples: the first argument is the fitted lda object from
## the previous step, the second is the data to classify. This completes the Fisher discriminant workflow.
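A concrete sketch of the same workflow on the iris data (an assumed example, not from the original notes); with C = 3 classes, LDA gives at most 2 discriminant directions:
library(MASS)
fit <- lda(Species ~ ., data = iris)                 # fit the discriminant on 3 classes
proj <- predict(fit, iris)                           # proj$class: predicted classes, proj$x: discriminant scores
plot(proj$x[, 1], proj$x[, 2], col = iris$Species)   # 4-dimensional data shown in the 2-D discriminant space
table(iris$Species, proj$class)                      # confusion table of the training classification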
      -----------[MDS] Multidimensional Scaling Analysis
Given the similarities (or distances) between n objects of study, find a representation of these objects in a low-dimensional space whose inter-point distances "roughly match" the original similarities (or distances), so that the distortion caused by dimensionality reduction is minimized.
The objects can then be displayed in a low-dimensional (two- or three-dimensional) perceptual map, making the relative relationships between them simple and clear.
-------------R example
city<-read.csv('airline.csv', header=TRUE)
city1<-city[,-1]                       # the first column of this dataset holds the city names, drop it first
for (i in 1:9)
  for (j in (i+1):10)
    city1[i,j]=city1[j,i]              # mirror the lower triangle into the upper triangle
rownames(city1)<-colnames(city1)       # put the row names back
city2<-as.dist(city1, diag = TRUE, upper = TRUE)   # convert to a dist object
city3<-as.matrix(city2)                # convert back to a matrix
citys<-cmdscale(city3, k=2)            # classical MDS; keep the first two principal coordinates for plotting
plot(citys[,1], citys[,2], type='n')
text(citys[,1], citys[,2], labels(city2), cex=.7)    # label each point with its city name
# The orientation is mirrored relative to the real map, so flip both axes:
plot(-citys[,1], -citys[,2], type='n')
text(-citys[,1], -citys[,2], labels(city2), cex=.7)

[Part 2, Nonlinear Dimensionality Reduction Methods]
Setting: the attributes of the data are strongly correlated (the linear-independence assumption above no longer holds).
[Manifold learning]
A manifold is a nonlinear generalization of a linear subspace, and manifold learning is a family of nonlinear dimensionality reduction methods.
Assumption: the high-dimensional data lie on, or approximately on, a latent low-dimensional manifold.
Idea: find the low-dimensional feature representation by keeping some "invariant feature quantity" the same between the high-dimensional data and the low-dimensional data.
The invariant feature quantity differs by method:
...Isomap: geodesic distance
...LLE: local reconstruction coefficients
...LE (Laplacian Eigenmaps): neighborhood relations of the data
[ISOMAP] Isometric Feature Mapping
Basic idea: find the low-dimensional feature representation by keeping the geodesic distances between data points invariant from the high-dimensional to the low-dimensional space.
Geodesic distance: between nearby points it is approximated by the Euclidean distance; between faraway points it is approximated by the shortest path through neighboring points.
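A minimal Isomap sketch on a synthetic spiral, assuming the vegan package (not referenced in the original notes) is installed for its isomap() implementation:
library(vegan)                                             # assumed package providing isomap()
set.seed(1)
t <- sort(runif(200, 0, 3 * pi))
X <- cbind(t * cos(t), t * sin(t), rnorm(200, sd = 0.1))   # a noisy spiral: a 1-D manifold embedded in 3-D
d <- dist(X)                                               # Euclidean distances in the original space
iso <- isomap(d, ndim = 2, k = 8)                          # k-nearest-neighbor graph, then MDS on geodesic distances
plot(iso$points[, 1], iso$points[, 2])                     # the spiral unrolled in the 2-D embedding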
[LLE] Locally Linear Embedding
Assumption: the low-dimensional manifold on which the sampled data lie is locally linear, i.e. each sample point can be linearly reconstructed from its neighbors.

Basic idea: achieve dimensionality reduction by keeping the local neighborhood geometry, i.e. the local reconstruction coefficients, the same between the high-dimensional data and the low-dimensional data.
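A base-R sketch (synthetic data, all names are assumptions) of the first LLE step: computing the local reconstruction coefficients of one sample from its k nearest neighbors; the full algorithm then looks for low-dimensional coordinates that preserve these coefficients.
set.seed(1)
X <- matrix(rnorm(100 * 3), ncol = 3)      # 100 sample points in 3 dimensions
i <- 1; k <- 5                             # reconstruct point i from its k nearest neighbors
d2 <- colSums((t(X) - X[i, ])^2)           # squared Euclidean distances to point i
nb <- order(d2)[2:(k + 1)]                 # indices of the k nearest neighbors (excluding i itself)
Z <- sweep(X[nb, ], 2, X[i, ])             # neighbors centered on point i
C <- Z %*% t(Z) + 1e-3 * diag(k)           # local Gram matrix, regularized for stability
w <- solve(C, rep(1, k)); w <- w / sum(w)  # reconstruction coefficients, constrained to sum to 1
rbind(X[i, ], colSums(w * X[nb, ]))        # point i next to its local linear reconstruction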


This is a reprinted article; original address: https://www.douban.com/note/469279998/
