The curse of dimensionality and PCA (principal component analysis)

 

Background

  The curse of dimensionality is a common phenomenon in machine learning. It refers to the fact that, as the number of feature dimensions grows, the training data becomes increasingly sparse in the feature space it spans. A model fitted on such limited data may perform well on the training set, but unseen test points are very likely to lie far from the region of the space the model has learned, so the trained model cannot handle them. This is the familiar "overfitting" phenomenon.
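
  To get a feel for this sparsity, here is a small illustrative sketch (the choice of 100 points and the particular dimensions is arbitrary): scatter a fixed number of random points in the unit hypercube and watch the average distance to the nearest neighbour grow as the dimension increases.

import numpy as np

# Illustrative sketch: 100 random points in the unit hypercube; as the dimension
# grows, the same number of points covers the space more and more sparsely.
rng = np.random.default_rng(0)
for d in (1, 2, 10, 100):
    points = rng.random((100, d))
    dists = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    np.fill_diagonal(dists, np.inf)          # ignore each point's distance to itself
    print(d, dists.min(axis=1).mean())       # average nearest-neighbour distance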

Solutions

  Since the curse of dimensionality seriously hurts a model's ability to generalize, how can it be addressed? An obvious solution is to increase the amount of data, but when there are many feature dimensions, the amount of data required to "fill up" the feature space is prohibitively expensive to collect. A solution that is easier to implement and still works well is feature dimensionality reduction. Its main ideas are: filter out unimportant features, or merge correlated features, so that the number of feature dimensions is reduced while as much of the information in the original data as possible is preserved.
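
  As a minimal sketch of these two ideas (the toy matrix and the variance threshold below are made up purely for illustration), sklearn provides VarianceThreshold for filtering out near-constant features and PCA for merging correlated ones:

import numpy as np
from sklearn.feature_selection import VarianceThreshold
from sklearn.decomposition import PCA

X = np.array([[1.0, 0.0, 2.0],
              [2.0, 0.0, 4.1],
              [3.0, 0.1, 5.9],
              [4.0, 0.0, 8.0]])   # column 2 is almost constant, column 3 is roughly 2 * column 1

# filtering: drop features whose variance is below a threshold
print(VarianceThreshold(threshold=0.01).fit_transform(X))

# merging: project the correlated features onto a smaller number of new axes
print(PCA(n_components=2).fit_transform(X))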

  PCA mainly involves the following steps (a pure-NumPy sketch of the whole procedure follows this list):

  1. Standardize the raw sample data matrix (typically each column is one dimension and is standardized column by column; the matrix is not standardized as a whole, and the columns are treated independently of each other).

  2. Compute the covariance matrix of the standardized matrix.

  3. Compute the eigenvalues and eigenvectors of the covariance matrix.

  4. Select the principal eigenvectors according to the magnitude of their eigenvalues.

  5. Generate the new features.
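
  A minimal pure-NumPy sketch of these five steps, using the same small data matrix as the sklearn example below (keeping the top two eigenvectors, as that example does):

import numpy as np

X = np.array([[1, 5, 9], [2, 6, 10], [3, 17, 81], [40, 8, 12]], dtype=float)

# 1. standardize each column (zero mean, unit variance)
X_s = (X - X.mean(axis=0)) / X.std(axis=0)
# 2. covariance matrix of the standardized data (rows are samples)
C = np.cov(X_s, rowvar=False)
# 3. eigenvalues and eigenvectors of the covariance matrix
eigvals, eigvecs = np.linalg.eig(C)
# 4. keep the eigenvectors with the largest eigenvalues (the top two here)
order = np.argsort(eigvals)[::-1][:2]
W = eigvecs[:, order]
# 5. generate the new features by projecting the standardized data onto them
X_new = X_s @ W
print(X_new)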

Tools

  Tools used in this article: Anaconda, PyCharm, the Python language, and sklearn.

Python code implementation

from numpy import *
from numpy import linalg as LA
from sklearn.preprocessing import scale
from sklearn.decomposition import PCA
import pandas as pd

data = {'a': [1, 2, 3, 40], 'b': [5, 6, 17, 8], 'c': [9, 10, 81, 12]}
x = pd.DataFrame(data)
# x_s = (x - x.mean()) / x.std()
# standardize the matrix column by column; each column is one dimension
x_s = scale(x, with_mean=True, with_std=True, axis=0)
print("matrix after standardization is: {}".format(x_s))

# covariance matrix of the standardized matrix
x_cov = cov(x_s.transpose())

# eigenvalues and eigenvectors of the covariance matrix
e, v = LA.eig(x_cov)
print("eigenvalues: {}".format(e))
print("eigenvectors:\n{}".format(v))

# select the principal eigenvectors according to the size of the eigenvalues
pca = PCA(2)
pca.fit(x_s)

# output the results of the transformation
print("variance (eigenvalues):", pca.explained_variance_)
print("principal components (eigenvectors):", pca.components_)
print("transformed sample matrix:", pca.transform(x_s))
print("explained variance ratio:", pca.explained_variance_ratio_)

 

Results of running the code

matrix after standardization is: [[-0.63753558 -0.84327404 -0.6205374 ]
 [-0.5768179  -0.63245553 -0.58787754]
 [-0.51610023  1.68654809  1.73097275]
 [ 1.73045371 -0.21081851 -0.52255781]]
eigenvalues: [2.72019876e+00 1.27762876e+00 2.17247549e-03]
eigenvectors:
[[ 0.2325202   0.13219394 -0.96356584]
 [-0.67714769 -0.25795064 -0.68915344]
 [-0.69814423  0.71245512 -0.07072727]]
variance (eigenvalues): [2.72019876 1.27762876]
principal components (eigenvectors): [[-0.2325202   0.67714769  0.69814423]
 [ 0.96356584  0.25795064  0.07072727]]
transformed sample matrix: [[-0.85600578 -0.8757195 ]
 [-0.7045673  -0.76052331]
 [ 2.4705145   0.06017659]
 [-0.90994142  1.57606622]]
explained variance ratio: [0.68004969 0.31940719]
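
  A quick sanity check, continuing from the code above: the explained variance ratio is simply each retained eigenvalue divided by the sum of all eigenvalues, and the transformed sample matrix is the standardized data projected onto the principal components (x_s is already centred, so no extra mean subtraction is needed).

# sanity check, continuing from the code above
top2 = sort(e)[::-1][:2]          # the two largest eigenvalues
print(top2 / e.sum())             # ~ [0.68004969 0.31940719], the explained variance ratio
print(x_s @ pca.components_.T)    # reproduces pca.transform(x_s)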

 

  

  

 
