Andrew Ng Machine Learning (XIV) - Dimensionality Reduction

14.1 Motivation I: Data Compression

Dimensionality reduction is an unsupervised learning method: it does not require labeled data.
One purpose of dimensionality reduction is data compression. Compressing the data not only lets us store it with less memory or disk space, it also speeds up our learning algorithms.
Dimensionality reduction also deals well with redundant features. For example, on a large project there may be several engineering teams: the first team hands you two hundred features, the second another three hundred, the third another five hundred, and altogether you end up with more than a thousand features. These features usually contain a huge amount of redundancy, and keeping track of all of them becomes extremely difficult.
Reducing 2-dimensional features to 1 dimension
For example, suppose we have a length-measurement task in which the horizontal axis is the measurement expressed in centimetres and the vertical axis is the same measurement expressed in feet. The two features are highly redundant, although rounding during measurement means the two results may not correspond exactly, so we want to remove this redundancy by reducing the dimensionality.

In this case we want to find a line such that all the data points fall approximately on it and can be projected onto it. By doing this we can measure the position of each example along the line and create a single new feature z_1: where the original data needed the two features x_1 and x_2, a single value of the new feature z now expresses what the two original features expressed.
By projecting the samples onto a line, each original sample can be approximated by a single real number. We use x^{(1)}, x^{(2)}, x^{(3)}, ..., x^{(m)} to denote the samples in the data set, x_1, x_2 to denote the original features, and z^{(i)} to denote the new feature obtained for the i-th sample after dimensionality reduction.
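As a rough sketch of this idea in Octave/MATLAB (the measurements and the direction vector below are made up purely for illustration), the two redundant length features can be compressed to one number per example by projecting each row of X onto a unit direction vector:

    % Hypothetical measurements: each row is one example, columns are [cm, feet]
    X = [180 5.91; 165 5.41; 172 5.64; 190 6.23];
    u = [30.48; 1];          % assumed direction of the line (1 ft = 30.48 cm)
    u = u / norm(u);         % make it a unit vector
    z = X * u;               % new 1-dimensional feature for every example
    X_approx = z * u';       % approximate recovery of the original two features

In real PCA the direction is learned from the data (and the features are normalized first); this snippet only illustrates the projection itself.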

Reducing 3-dimensional features to 2 dimensions
Project the three-dimensional vectors onto a two-dimensional plane, forcing all the data to lie on the same plane, so that the feature vectors drop to two dimensions. The original three-dimensional data points become points in a two-dimensional plane, and the new two-dimensional features give the position of each point on that plane. The original features are denoted x_1, x_2, x_3; the new features are denoted z_1, z_2 (the two axes of the projection plane); and z^{(i)} denotes the new features obtained for the i-th sample after dimensionality reduction.

14.2 Motivation II: Data Visualization

At present we can only visualize data in 2 or 3 dimensions; once the dimensionality grows beyond that, we can no longer inspect the data intuitively, so dimensionality reduction becomes a very intuitive and important step.
For example, consider a report on national development levels in which each country is described by 50 indicators. We would like to look at the data with intuitive visualization methods, but 50-dimensional data cannot be drawn, so we use dimensionality reduction to bring it down to 2 dimensions for viewing.
Dimensionality reduction condenses the 50 dimensions into two new features z_1 and z_2, but it does not tell us what the new features mean: the algorithm only reduces the dimensionality, and we have to rediscover the meaning and definition of the new features ourselves.
Plotting the new features obtained by dimensionality reduction:
the horizontal axis roughly represents the overall economic strength of a country (its total GDP),
the vertical axis roughly represents the happiness index (GDP per capita).

14.3 Principal Component Analysis Principle

Principal Component Analysis Problem Formulation
Principal component analysis (PCA) is the most commonly used dimensionality reduction algorithm.
When the number of principal components is K = 2, the goal is to find a low-dimensional projection plane such that, when all the data are projected onto it, the average projection error over all samples is as small as possible. The projection plane is spanned by two direction vectors passing through the origin, and the projection error is the perpendicular distance from each feature vector to the plane.
When the number of principal components is K = 1, the goal is to find a direction vector (vector direction) such that, when all the data are projected onto it, the average projection error over all samples is as small as possible. The direction vector is a vector through the origin, and the projection error (projection error) is the length of the perpendicular from each feature vector to that direction vector.
The figure illustrates projecting data in a two-dimensional space: the black × marks are the original sample points, the red line is the direction vector, the blue segments are the projection errors, and the green points are the projections of the data onto the direction vector. The objective of PCA is precisely to find a direction vector that makes the projection error of all the data as small as possible.
Note: before applying PCA, the features need to be mean-normalized and scaled.

The principle of principal component analysis
To reduce two-dimensional data to one dimension, find a direction vector (u^{(1)} ∈ R^n) such that projecting the data onto it gives the minimum projection error.
To reduce n-dimensional data to k dimensions, find k vectors u^{(1)}, u^{(2)}, u^{(3)}, ..., u^{(k)} such that projecting the original data onto the linear subspace spanned by these vectors gives the minimum projection error.
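Written out (this is the standard formulation of the objective, stated here for reference), PCA looks for the directions that minimize the average squared projection error:

\min_{u^{(1)}, \dots, u^{(k)}} \; \frac{1}{m} \sum_{i=1}^{m} \left\| x^{(i)} - x_{approx}^{(i)} \right\|^2

where x_{approx}^{(i)} is the projection of x^{(i)} onto the linear subspace spanned by u^{(1)}, ..., u^{(k)}.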

How PCA differs from linear regression
In terms of mechanics and results, PCA looks a lot like linear regression: both seem to find a line or plane that approximately fits the original data. Despite the resemblance, they are in fact completely different.
Principal component analysis minimizes the projection error (projected error), whereas linear regression attempts to minimize the prediction error. Principal component analysis is an unsupervised learning method, while linear regression is a supervised learning method; the purpose of linear regression is to predict an outcome, whereas principal component analysis makes no prediction and treats all features of the original data identically. In the figure below, the left panel shows the linear regression error (vertical distances, perpendicular to the horizontal axis) and the right panel shows the principal component analysis error (distances perpendicular to the direction vector).
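To make the contrast concrete: for a single input feature, linear regression minimizes the vertical prediction error \frac{1}{2m} \sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)})^2, which involves the label y, while PCA minimizes the perpendicular projection error \frac{1}{m} \sum_{i=1}^{m} \| x^{(i)} - x_{approx}^{(i)} \|^2, in which no label appears at all.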

Advantages and disadvantages of PCA
Advantages
A big advantage of PCA is that it reduces the dimensionality of the data. We can sort the newly found "principal" vectors by importance, keep the most important ones, and omit the later dimensions, thereby achieving dimensionality reduction (simplifying the model or compressing the data) while preserving as much of the information in the original data as possible.
Another great advantage of PCA is that it is completely parameter-free. The computation involves no manually set parameters and no empirical model; the final result depends only on the data and is independent of the user.
Disadvantages
The same property can also be seen as a drawback. If you have prior knowledge about the objects being observed and already understand some characteristics of the data, there is no way to inject that knowledge into the process through parameters, so the result may not be what you hoped for and the method may not be efficient.

14.4 Principal Component Analysis Algorithm

Suppose we use PCA to reduce an N-dimensional data set to K dimensions.
Mean normalization: compute the mean μ_j of every feature, then subtract that mean from the corresponding dimension of every example, i.e. set x_j := x_j - μ_j. If the features are on different scales, also divide each dimension by its standard deviation σ_j.
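A minimal Octave/MATLAB sketch of this preprocessing step (the variable names are mine; X is assumed to be an m × n matrix with one example per row):

    mu = mean(X);                  % 1 x n row vector of feature means
    sigma = std(X);                % 1 x n row vector of feature standard deviations
    X_norm = (X - mu) ./ sigma;    % subtract the mean and scale each feature

The subtraction and division broadcast across rows in Octave and recent MATLAB versions; older MATLAB releases need bsxfun instead.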
Compute the covariance matrix (covariance matrix) of the sample set: for each sample, multiply the N-dimensional column vector (N × 1) by its own transpose (1 × N) to obtain an N × N symmetric matrix, then average these matrices over all samples to get the covariance matrix Σ of the sample set, i.e.:
\Sigma = \frac{1}{m} \sum_{i=1}^{m} (x^{(i)}) (x^{(i)})^T

Note: if each x^{(i)} is stored as a row vector and X is the matrix obtained by stacking the samples x^{(i)} row by row, then
\Sigma = \frac{1}{m} X^T X
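With the examples stored as the rows of X_norm, the vectorized form in Octave/MATLAB is simply (continuing the variable names from the sketch above):

    [m, n] = size(X_norm);
    Sigma = (1 / m) * (X_norm' * X_norm);   % n x n covariance matrix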
Compute the eigenvectors (eigenvectors) of the covariance matrix Σ. This can be done with singular value decomposition (singular value decomposition); in MATLAB the statement is [U, S, V] = svd(sigma), where sigma is the covariance matrix Σ of the sample set.
The matrix U produced above consists of the direction vectors that give the minimum projection error for the data. If we want to reduce the data from N dimensions to K dimensions, we simply take the first K vectors u^{(1)}, u^{(2)}, u^{(3)}, ..., u^{(K)}, i.e. the first K columns of U, obtaining an N × K matrix that we denote U_reduce, and compute the required new feature vectors z^{(i)} as
z^{(i)} = U_{reduce}^T x^{(i)}
where x^{(i)} is an N × 1 sample vector and U_{reduce}^T is the K × N matrix formed by the direction vectors, so the result z^{(i)} is a K × 1 vector, i.e. the new feature vector obtained by PCA.
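In code (an Octave/MATLAB sketch; K is the target number of dimensions, chosen as in section 14.6):

    [U, S, V] = svd(Sigma);       % columns of U are the direction vectors u(1)..u(n)
    Ureduce = U(:, 1:K);          % keep the first K columns: an n x K matrix
    Z = X_norm * Ureduce;         % m x K; row i is the transpose of z(i)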
To summarize: normalize the features, compute the covariance matrix Σ, run svd to obtain U, keep the first K columns as U_reduce, and project every sample with z^{(i)} = U_{reduce}^T x^{(i)}.
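Putting the steps together as a single Octave/MATLAB sketch (my own wrapper around the steps above, not code taken from the course):

    function [Z, Ureduce, mu, sigma] = run_pca(X, K)
      % Reduce X (m x n, one example per row) to K dimensions with PCA.
      mu = mean(X);
      sigma = std(X);
      X_norm = (X - mu) ./ sigma;                % mean normalization and scaling
      Sigma = (X_norm' * X_norm) / size(X, 1);   % n x n covariance matrix
      [U, S, V] = svd(Sigma);                    % singular value decomposition
      Ureduce = U(:, 1:K);                       % first K direction vectors
      Z = X_norm * Ureduce;                      % m x K compressed representation
    end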

14.5 Reconstruction from Compressed Representation

Reconstruction from Compressed Representation
Using PCA we can compress 1,000-dimensional features to 100 dimensions, or reduce three-dimensional data to a two-dimensional representation. So if PCA is treated as a compression algorithm, there should be a way to go back from the compressed form to an approximation of the original high-dimensional data. PCA maps a sample x^{(i)} to z^{(i)}.
The question is whether there is some way to go back from the point z to (an approximation of) the data in its original two-dimensional representation x_1, x_2.

The method
Let x_approx denote the reconstructed n-dimensional sample vector (n × 1), let U_reduce denote the feature matrix (n × k) formed by the K eigenvectors selected by the PCA algorithm, and let z denote the new features of a sample after PCA dimensionality reduction (k × 1). Then:
x_{approx} = U_{reduce} \, z
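With samples stored as rows, the same reconstruction in Octave/MATLAB is a one-liner (reusing the variables from the compression sketch; the second line is only needed if the features were normalized and scaled):

    X_rec = Z * Ureduce';              % m x n approximation in the normalized space
    X_approx = X_rec .* sigma + mu;    % undo the feature scaling and mean normalization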

14.6 Choosing the Number of Principal Components

Choosing the Number of Principal Components
Average squared projection error (Average Squared Projection Error) and total variation (Total Variation)
The objective of PCA is to minimize the average squared projection error, i.e. the mean of the squared distances between the original samples x^{(i)} and their low-dimensional reconstructions x_{approx}^{(i)}:
\frac{1}{m} \sum_{i=1}^{m} \left\| x^{(i)} - x_{approx}^{(i)} \right\|^2
The total variation (total variation) of the data is defined as the mean squared length of the original samples:
\frac{1}{m} \sum_{i=1}^{m} \left\| x^{(i)} \right\|^2
which measures, on average, how far the training examples are from the zero vector.

The rule of thumb for choosing K
Choose the smallest possible value of K for which the ratio of the average squared projection error to the total variation is small, typically at most 0.01, i.e. find the smallest K with
\frac{\frac{1}{m} \sum_{i=1}^{m} \left\| x^{(i)} - x_{approx}^{(i)} \right\|^2}{\frac{1}{m} \sum_{i=1}^{m} \left\| x^{(i)} \right\|^2} \le 0.01
In professional terms: 99% of the variance is retained (99% of variance is retained).

Selecting the parameter K so that 99% of the variance is retained
Other commonly used thresholds are 0.05 and 0.10, corresponding to retaining 95% and 90% of the variance.

Algorithms for selecting the number of principal components
A less efficient method
First let K = 1, run principal component analysis to obtain U_reduce and z^{(1)}, z^{(2)}, ..., z^{(m)}, compute the low-dimensional reconstructions x_{approx}^{(i)}, and check whether the ratio of the average squared projection error to the total variation is at most 1%. If not, set K = 2 and repeat, and so on, until the smallest K is found that makes the ratio at most 1%.
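A sketch of this brute-force search in Octave/MATLAB (deliberately naive, just to make the loop explicit; U, X_norm and n come from the earlier sketches):

    for K = 1:n
      Ureduce = U(:, 1:K);
      Z = X_norm * Ureduce;
      X_rec = Z * Ureduce';
      ratio = sum(sum((X_norm - X_rec) .^ 2)) / sum(sum(X_norm .^ 2));
      if ratio <= 0.01
        break;                % smallest K that retains 99% of the variance
      end
    end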

A better method
There is a better way to choose K. When we compute the covariance matrix Sigma and call the svd function,
[U, S, V] = svd(Sigma)
we get three matrices: U contains the eigenvectors, and S is a diagonal matrix whose diagonal entries are S_{11}, S_{22}, S_{33}, ..., S_{nn}, with all other entries equal to 0.
It can be shown (the course states this result without proof) that the following two quantities are equal:
\frac{\frac{1}{m} \sum_{i=1}^{m} \left\| x^{(i)} - x_{approx}^{(i)} \right\|^2}{\frac{1}{m} \sum_{i=1}^{m} \left\| x^{(i)} \right\|^2} = 1 - \frac{\sum_{i=1}^{K} S_{ii}}{\sum_{i=1}^{n} S_{ii}}
The original condition can therefore be rewritten as
\frac{\sum_{i=1}^{K} S_{ii}}{\sum_{i=1}^{n} S_{ii}} \ge 0.99
and we simply find the smallest K that satisfies it.
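In Octave/MATLAB the check reduces to a cumulative sum over the diagonal of S (a sketch):

    s = diag(S);                        % singular values S11, S22, ..., Snn
    retained = cumsum(s) / sum(s);      % variance retained for K = 1, 2, ..., n
    K = find(retained >= 0.99, 1);      % smallest K that keeps at least 99%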

14.7 Advice for Applying Principal Component Analysis

The feature matrix U_reduce should be learned from the training set and then reused for the cross-validation and test sets.
Suppose we are building a computer vision system that learns from 100 × 100 pixel images, i.e. 10,000 features per example.
The first step is to use principal component analysis to compress the 10,000 features down to 1,000.
Then run the learning algorithm on the compressed training set.
When making predictions, use the U_reduce learned from the training set to convert the input x into the feature vector z, and then predict.
Note: if we have a cross-validation set or a test set, they must also be mapped with the U_reduce learned from the training set, as in the sketch below.
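A sketch of that workflow in Octave/MATLAB (train_model and predict stand in for whatever supervised learner is being used; they are hypothetical placeholders, as is run_pca from section 14.4): mu, sigma and Ureduce are learned on the training set only and then applied unchanged to the other sets.

    % Training set: learn the PCA mapping and the classifier
    [Ztrain, Ureduce, mu, sigma] = run_pca(Xtrain, K);
    model = train_model(Ztrain, ytrain);        % hypothetical supervised learner

    % Cross-validation / test set: only apply the mapping learned above
    Xtest_norm = (Xtest - mu) ./ sigma;
    Ztest = Xtest_norm * Ureduce;
    predictions = predict(model, Ztest);        % hypothetical prediction step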

PCA is not a way to address overfitting
A common misuse of principal component analysis is to use it to reduce overfitting (by reducing the number of features). This is a bad idea; regularization should be used instead. The reason is that PCA only discards some of the information approximately and never takes the outcome variable y (the labels) into account, so it may throw away features that are very important. PCA is, after all, not a supervised learning method: it treats every attribute the same, never considers how discarding part of the input information affects the label y, and does nothing to compensate for that loss. Regularization, in contrast, is applied inside logistic regression, neural networks, or SVMs, where the effect of each input attribute on the outcome variable (the predicted label) feeds back through the training objective, so regularization does not throw away important features.

PCA is not a mandatory step
PCA is appropriate when the data set is large and we need to compress its dimensionality to reduce memory usage and speed up training, or when we need visualization to understand the data; it is not a required part of every system. Adding PCA to a machine learning system by default, without regard to how the system performs without it, is a mistake. Because PCA discards part of the data, and that part may turn out to be critical, a machine learning system should first be built without PCA; only consider principal component analysis when training on the original data is not feasible (the algorithm runs too slowly or uses too much memory).

