Machine Learning Algorithms (10): Principal Component Analysis (PCA)
Dimensionality reduction
The universe is the sum of time and space. Time is one-dimensional; the number of spatial dimensions is, so far, inconclusive. String theory says nine; M-theory, which Hawking endorsed, is considered to have ten. These theories explain that the dimensions beyond the three humans perceive are curled up at extremely small spatial scales. Of course, none of this is meant to promote the "Three-Body" series of novels, much less to guide the reader toward the true meaning of the universe or the nature of life; it is to introduce today's machine learning topic: dimensionality reduction.
The dimensionality of data in machine learning is analogous to the spatial dimensions of the real world. In machine learning, data usually must be represented as vectors to serve as input for training a model. But as we all know, processing and analyzing high-dimensional vectors heavily consumes system resources and can even trigger the curse of dimensionality. For example, in the CV (Computer Vision) field, extracting per-pixel features from a 100x100 RGB image already yields 30,000 dimensions; in the NLP (Natural Language Processing) field, a document-word feature matrix can produce feature vectors with tens or hundreds of thousands of dimensions. Therefore, dimensionality reduction, characterizing the original high-dimensional vector with a low-dimensional one, is particularly important. Just imagine: if the universe really were as M-theory describes, with the position of each celestial body given by ten-dimensional coordinates, no ordinary person could picture the structure of such a space. But once we project those bodies onto a two-dimensional plane, the whole universe becomes as intuitive as the Milky Way above our heads.
Common dimensionality-reduction methods include Principal Component Analysis (PCA), Linear Discriminant Analysis (LDA), Isometric Mapping (Isomap), Locally Linear Embedding (LLE), Laplacian Eigenmaps (LE), Locality Preserving Projections (LPP), and so on. These methods can be divided along different axes: linear vs. non-linear, supervised vs. unsupervised, global vs. local. Among them, PCA is the most classic, with more than a hundred years of history; it is a linear, unsupervised, global dimensionality-reduction algorithm. Today we revisit this enduring century-old classic.
Principal component analysis belongs to statistics: through an orthogonal transformation, it converts a set of possibly correlated variables into a set of linearly uncorrelated variables. The converted variables are called the principal components.
Practical applications of principal component analysis include data compression, simplifying the description of data, and data visualization. It is worth noting that domain knowledge is needed to judge whether principal component analysis is suitable for a given problem. If the noise in the data is too large (i.e., the variances of the individual components are all comparably large), principal component analysis is not appropriate.
PCA algorithm
PCA principle and objective function
PCA (Principal Components Analysis) aims to find the principal components of the data and use them to characterize the original data, thereby achieving dimensionality reduction. As a simple example, consider a series of data points in three-dimensional space that all lie on a plane through the origin. If we represent the data with the natural coordinate axes x, y, z, three dimensions are needed; yet in fact the points lie only on a two-dimensional plane. If we rotate the coordinate system so that the plane the data lies on coincides with the x', y' plane, then the original data can be expressed with the two dimensions x' and y' without any loss, completing the dimensionality reduction. The information carried by the two axes x' and y' is exactly the principal components we want to find.
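The plane example above can be checked numerically. The sketch below is a hypothetical setup (the spanning vectors `u` and `v` and the sample size are made up for illustration): it generates 3-D points lying exactly on a plane through the origin and shows that the top two PCA directions represent them without loss.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two orthonormal vectors spanning a plane through the origin in 3-D
# (u and v are made up for this illustration).
u = np.array([1.0, 0.0, 1.0]) / np.sqrt(2)
v = np.array([0.0, 1.0, 0.0])

# 200 sample points that lie exactly on that plane.
coeffs = rng.normal(size=(200, 2))      # 2-D coordinates within the plane
X = coeffs @ np.vstack([u, v])          # embed into 3-D, shape (200, 3)

# PCA: eigendecomposition of the covariance matrix of the centered data.
Xc = X - X.mean(axis=0)
cov = Xc.T @ Xc / len(Xc)
eigvals, eigvecs = np.linalg.eigh(cov)  # eigenvalues in ascending order

# The smallest eigenvalue is numerically zero: the data is
# intrinsically two-dimensional.
W = eigvecs[:, -2:]                     # top-2 principal directions
X2 = Xc @ W                             # the 2-D representation
X_rec = X2 @ W.T                        # reconstruction back in 3-D
print(eigvals[0])                       # ~0
print(np.allclose(X_rec, Xc))           # True: no information lost
```

Rotating the plane onto the x', y' axes is exactly what projecting onto the top two eigenvectors accomplishes here.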
But in high-dimensional space we usually cannot picture the distribution of the data so intuitively, and it is harder to pinpoint which axis the principal components correspond to. To see exactly how this works, let us start with the simplest case: PCA on two-dimensional data.
The figure above (left) shows a set of centered data points in two-dimensional space. We can easily see the rough direction of the principal-component axis (hereafter, the principal axis), namely the axis where the green line on the right lies. Along the green line the data distribution is more spread out, meaning the data has greater variance in that direction. In the field of signal processing it is believed that the signal has large variance and the noise has small variance; the ratio of signal variance to noise variance is called the signal-to-noise ratio, and the larger it is, the better the quality of the data. This leads to PCA's goal: maximize the projected variance, i.e., make the variance of the data projected onto the principal axis as large as possible.
PCA solution method
Readers familiar with linear algebra will quickly see that the maximum projected variance is exactly the largest eigenvalue of the covariance matrix, and the best projection direction is the eigenvector corresponding to that largest eigenvalue. The second-best projection direction lies in the orthogonal complement of the optimal direction: it is the eigenvector corresponding to the second-largest eigenvalue, and so on. Thus we have obtained the solution procedure of PCA: center the data, compute the covariance matrix, perform its eigendecomposition, take the eigenvectors of the d largest eigenvalues as projection directions, and project the data onto them.
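The solution procedure can be sketched in code. This is a minimal NumPy implementation, not a production library; the function name `pca` and the random test data are my own:

```python
import numpy as np

def pca(X, k):
    """Project X (n samples x d features) onto its k principal components.

    Follows the procedure above: center the data, form the covariance
    matrix, and take the eigenvectors of the k largest eigenvalues as
    the projection directions.
    """
    Xc = X - X.mean(axis=0)
    cov = Xc.T @ Xc / len(Xc)
    eigvals, eigvecs = np.linalg.eigh(cov)   # ascending eigenvalues
    order = np.argsort(eigvals)[::-1][:k]    # indices of the k largest
    W = eigvecs[:, order]                    # (d, k) projection matrix
    return Xc @ W, W, eigvals[order]

# Correlated 5-D test data (made up for this sketch).
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 5)) @ rng.normal(size=(5, 5))
Z, W, top = pca(X, 2)

# The variance of the data projected onto the first principal axis
# equals the largest eigenvalue of the covariance matrix.
print(np.isclose(Z[:, 0].var(), top[0]))     # True
```

The final check makes the claim above concrete: the largest eigenvalue is precisely the variance attained along the best projection direction.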
PCA from the minimum-squared-error perspective
Problem Description
In fact, observe that the best projection direction PCA solves for is a straight line, which coincides with the goal of linear regression, a classic mathematical problem. Can we, then, define PCA's objective from a regression perspective and solve the problem accordingly?
Analysis
We still consider sample points in two-dimensional space. The maximum-variance view solves for a straight line such that the variance of the sample points projected onto it is maximal. From this idea of solving for a line, it is easy to think of the linear regression problem, whose goal is to find a linear function whose line best fits the set of sample points. If we define PCA's objective from this perspective, the problem turns into a regression problem.
Extending this idea, in high-dimensional space we actually want to find a d-dimensional hyperplane such that the sum of squared distances from the data points to the hyperplane is minimal. In the one-dimensional case the hyperplane degenerates to a straight line: the objective is to project the sample points onto a line so as to minimize the sum of squared distances from all points to that line, as shown in the figure.
The first term $x_k^{\mathrm T} x_k$ is independent of our choice of $W$ and is a constant. Substituting the projection-vector representation $\tilde{x}_k = \sum_{i=1}^{d}(\omega_i^{\mathrm T} x_k)\,\omega_i$ obtained above into the second and third terms, we continue:

$$\tilde{x}_k^{\mathrm T} x_k = \sum_{i=1}^{d}(\omega_i^{\mathrm T} x_k)\,\omega_i^{\mathrm T} x_k = \sum_{i=1}^{d}(\omega_i^{\mathrm T} x_k)^2,$$

$$\tilde{x}_k^{\mathrm T} \tilde{x}_k = \sum_{i=1}^{d}\sum_{j=1}^{d}\big((\omega_i^{\mathrm T} x_k)\,\omega_i\big)^{\mathrm T}\big((\omega_j^{\mathrm T} x_k)\,\omega_j\big),$$
where $\omega_i^{\mathrm T} x_k$ and $\omega_j^{\mathrm T} x_k$ denote projection lengths and are scalars. Since the basis is orthonormal, $\omega_i^{\mathrm T}\omega_j = 0$ when $i \neq j$, so of all the cross terms only the $d$ terms with $i = j$ survive:

$$\tilde{x}_k^{\mathrm T} \tilde{x}_k = \sum_{i=1}^{d}(\omega_i^{\mathrm T} x_k)^2 .$$

Combining the three terms, the squared distance becomes

$$\lVert x_k - \tilde{x}_k\rVert_2^2 = x_k^{\mathrm T} x_k - \sum_{i=1}^{d}\omega_i^{\mathrm T} x_k x_k^{\mathrm T}\omega_i .$$
The quantity we want to minimize is this expression summed over all $k$, which can be written as

$$\arg\min_W \sum_{k=1}^{n}\lVert x_k - \tilde{x}_k\rVert_2^2
= \arg\max_W \sum_{k=1}^{n}\sum_{i=1}^{d}\omega_i^{\mathrm T} x_k x_k^{\mathrm T}\omega_i
= \arg\max_W \sum_{i=1}^{d}\omega_i^{\mathrm T}\Big(\sum_{k=1}^{n} x_k x_k^{\mathrm T}\Big)\omega_i,
\quad \text{s.t. } \omega_i^{\mathrm T}\omega_j = \delta_{ij}.$$
If we solve for the basis $\omega_1, \omega_2, \ldots, \omega_d$ of $W$ one at a time, we find the method is fully equivalent to the maximum-variance formulation. For example, when $d = 1$, the problem we actually solve is

$$\arg\max_{\omega}\ \omega^{\mathrm T}\Big(\sum_{k=1}^{n} x_k x_k^{\mathrm T}\Big)\omega,
\quad \text{s.t. } \omega^{\mathrm T}\omega = 1,$$
which is consistent with solving for the maximum-variance projection direction: the optimal line $\omega$ is the eigenvector corresponding to the largest eigenvalue of the covariance matrix. The only differences are a multiplicative factor of the covariance matrix $\Sigma$ and a constant offset, neither of which affects where the maximum is attained.
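This equivalence can be verified numerically. The sketch below uses an assumed setup (2-D Gaussian data with a made-up mixing matrix, and a brute-force angular grid standing in for the least-squares optimization): it compares the direction that minimizes the sum of squared point-to-line distances with the top eigenvector of the covariance matrix.

```python
import numpy as np

rng = np.random.default_rng(2)

# Anisotropic 2-D data (the mixing matrix is made up for this check).
X = rng.normal(size=(300, 2)) @ np.array([[2.0, 0.5], [0.5, 1.0]])
Xc = X - X.mean(axis=0)

# Maximum-variance view: top eigenvector of the covariance matrix.
cov = Xc.T @ Xc / len(Xc)
w_var = np.linalg.eigh(cov)[1][:, -1]

# Least-squares view: brute-force search over unit directions for the
# line minimising the sum of squared point-to-line distances.
angles = np.linspace(0.0, np.pi, 10000)
dirs = np.stack([np.cos(angles), np.sin(angles)], axis=1)   # (10000, 2)
proj = Xc @ dirs.T                                          # projection lengths
errors = (Xc ** 2).sum() - (proj ** 2).sum(axis=0)          # per-direction error
w_lsq = dirs[np.argmin(errors)]

# The two formulations pick the same direction (up to sign).
print(np.abs(w_var @ w_lsq))             # close to 1.0
```

The per-direction error line is exactly the derivation above: the squared distance to the line is $\lVert x_k\rVert^2$ minus the squared projection length, so minimizing it is the same as maximizing the projected variance.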