Task 9: Chapter 10 Dimensionality Reduction and Metric Learning

1. Chapter Main Content

The main content of this chapter is dimensionality reduction and metric learning, an important part of the machine learning field. Before getting into the specifics, readers who are unfamiliar with these topics can first take the terms literally and ask: what do dimensionality reduction and metric learning each do, and how do they relate to machine learning?

Anyone who reads science fiction should be familiar with the concept of dimensionality reduction. In some novels, aliens living in higher dimensions reduce the dimensionality of lower-dimensional life forms to attack them, and because the attack comes from a higher dimension, the lower-dimensional creatures can hardly resist (some stories argue that high-dimensional creatures cannot directly touch low-dimensional life at all, but we will not debate that here). With the same intuition, can we also gain certain advantages in machine learning by reducing dimensionality? With this question in mind, let's enter the content of this chapter.

1) Before introducing dimensionality reduction and metric learning, let's first look at a simple machine learning algorithm: k-nearest neighbor learning

k-Nearest Neighbor (kNN) learning is a commonly used supervised learning method, and its working mechanism is very simple: given a test sample, find the k training samples closest to it according to some distance metric, then make a prediction based on the information of these k neighbors. The prediction is usually made by the "voting method" for classification or the "averaging method" for regression (see Chapter 8, Ensemble Learning, for details).

One characteristic of the kNN algorithm is that it does not need to train a model in advance; the prediction is obtained by computing over the training samples only at prediction time. It is a typical "lazy learning" algorithm. The figure below is a schematic diagram of kNN learning:

[Figure: schematic diagram of the kNN learning algorithm]

Obviously, the choice of k is very important: different values of k can lead to completely different predictions. Likewise, different distance metrics select different neighbors and can therefore also lead to completely different results.
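To make the mechanism concrete, here is a minimal sketch of kNN classification in Python, assuming Euclidean distance and majority voting; the function name knn_predict and the toy data are purely illustrative.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_test, k=3):
    """Minimal kNN sketch: Euclidean distances, then a majority vote among the k nearest."""
    dists = np.linalg.norm(X_train - x_test, axis=1)  # distance to every training sample
    nearest = np.argsort(dists)[:k]                   # indices of the k closest samples
    votes = Counter(y_train[nearest])                 # "voting method" for classification
    return votes.most_common(1)[0][0]

# toy usage
X_train = np.array([[0.0, 0.0], [0.1, 0.2], [0.9, 1.0], [1.0, 0.8]])
y_train = np.array([0, 0, 1, 1])
print(knn_predict(X_train, y_train, np.array([0.95, 0.9]), k=3))  # -> 1
```

Note that nothing is "trained" here: all work happens at prediction time, which is exactly the lazy-learning behavior described above.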

The reason we introduce kNN learning first lies in its remarkable property: if for any test sample a training sample can be found within an arbitrarily small distance, then the generalization error rate of the nearest-neighbor classifier is at most twice the error rate of the Bayes optimal classifier.

For such a simple learning algorithm, the worst-case error rate is no more than twice that of the Bayes optimal classifier, and the algorithm does not even need to be trained in advance. How wonderful!

Unfortunately, this property rests on the assumption that every test sample can find a neighbor within an arbitrarily small distance, and this condition is not easy to satisfy. The difficulty is closely related to the "dimension" that is the main topic of this chapter!

2) The exponential growth of difficulty caused by a linear increase in the number of attributes: the curse of dimensionality

For the assumption in the previous section, suppose we want a neighbor within a distance of 0.001 after normalization. With a single attribute, this requires roughly 1000 samples spread evenly over the sample space. With 20 attributes, we would need about 1000 to the 20th power, i.e. 10 to the 60th power, evenly distributed samples, an astronomical number. Data of this order of magnitude is impossible to obtain, and in practical applications a sample may have thousands of attributes.

Problems such as sample sparsity and difficult distance computation in high-dimensional situations are serious obstacles faced by all machine learning methods; they are collectively called the "curse of dimensionality".

An important way to alleviate the curse of dimensionality is dimensionality reduction, also known as "dimension reduction" (another important way, feature selection, will be introduced in Chapter 11): through some mathematical transformation, the original high-dimensional attribute space is transformed into a low-dimensional subspace, in which the sample density is greatly increased and distance computation becomes much easier.

Why can the dimensionality be reduced at all? Because in many cases only a low-dimensional distribution in the attribute space is closely related to the learning task, that is, a low-dimensional embedding of the high-dimensional space. The figure below gives an intuitive example: the original high-dimensional sample points are easier to learn from in the low-dimensional space.

[Figure: a low-dimensional embedding of high-dimensional samples]

If we require that the distances between samples in the original space be preserved in the low-dimensional space, we obtain a classic dimensionality reduction method called "Multidimensional Scaling" (MDS).

The goal of multidimensional scaling is that, after the mapping, the Euclidean distances between samples in the new low-dimensional space equal their distances in the original space.

In practice, effective dimensionality reduction often only requires that the distances after reduction be as close as possible to the distances in the original space, not strictly equal.

The specific steps of the MDS algorithm are as follows (a code sketch follows the list):

[1] For a data set of m samples in the original space, the distance between any two samples xi and xj is easy to compute, forming an m × m distance matrix D, where dij is the distance from xi to xj.

[2] Since the distances between samples are assumed to be preserved after mapping into the low-dimensional space Z, the inner product of any two mapped samples zi and zj can be computed from the distance matrix D, yielding the m × m inner product matrix B of the low-dimensional representations.

[3] Eigendecomposing the inner product matrix B gives a set of eigenvalues and eigenvectors; take the diagonal matrix A formed by the d' largest eigenvalues and the corresponding eigenvector matrix V.

[4] From the diagonal matrix A and the eigenvector matrix V, the mapping of the original samples into the low-dimensional d'-dimensional space can be obtained.
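Below is a minimal sketch of classical MDS following these four steps, assuming the usual double-centering construction of B from the squared distances; the function name classical_mds and the toy data are illustrative.

```python
import numpy as np

def classical_mds(D, d_prime=2):
    """Classical MDS sketch: recover a d'-dimensional embedding from an m x m distance matrix D."""
    m = D.shape[0]
    J = np.eye(m) - np.ones((m, m)) / m          # centering matrix
    B = -0.5 * J @ (D ** 2) @ J                  # inner-product matrix from squared distances
    eigvals, eigvecs = np.linalg.eigh(B)         # eigendecomposition of the symmetric matrix B
    order = np.argsort(eigvals)[::-1][:d_prime]  # keep the d' largest eigenvalues
    L = np.clip(eigvals[order], 0, None)         # guard against tiny negative eigenvalues
    V = eigvecs[:, order]
    return V * np.sqrt(L)                        # rows are the low-dimensional samples, shape (m, d')

# toy usage: pairwise distances of a few 3-D points, embedded into 2-D
X = np.array([[0., 0., 0.], [1., 0., 0.], [0., 1., 0.], [1., 1., 1.]])
D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
Z = classical_mds(D, d_prime=2)
print(Z.shape)  # (4, 2)
```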

Personal thoughts: the MDS algorithm performs a large number of distance and inner product calculations before reducing the dimensionality, which consumes considerable time and resources when the number of samples and the attribute dimensionality are very large. Therefore, MDS alone does not go very far in mitigating the curse of dimensionality.

Generally speaking, the easiest way to obtain a low-dimensional subspace is to apply a linear transformation to the original high-dimensional space. Personal thoughts: this kind of linear transformation is very common in machine learning; linear regression, support vector machines and neural networks all use it. Comparing our dimensionality reduction with a neural network, reducing a d-dimensional attribute sample to a d'-dimensional attribute space is essentially equivalent to a shallow network with d input neurons and d' output neurons; borrowing the diagram from Chapter 5 to illustrate:

[Figure: a single-layer network with d inputs and d' outputs, borrowed from Chapter 5]

Each new attribute x' after dimensionality reduction is in fact a linear combination of the high-dimensional attributes x1, x2, ..., xd weighted by W. Dimensionality reduction methods based on such linear transformations are called linear dimensionality reduction methods; they differ in the requirements they place on the properties of the low-dimensional subspace, i.e. in the constraints they impose on W. The effect of dimensionality reduction is usually evaluated by comparing the performance of a learner before and after the reduction; if performance improves, the reduction is considered effective.
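As a tiny illustration of this view, projecting samples with a weight matrix W is just a matrix multiplication, exactly like a linear layer without an activation function; the dimensions and random data below are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 8))   # 100 samples with d = 8 attributes
W = rng.normal(size=(8, 3))     # weight matrix: d inputs -> d' = 3 outputs

Z = X @ W                       # each new attribute is a linear combination of the old ones
print(Z.shape)                  # (100, 3)
```

Different linear methods differ only in how W is chosen, i.e. in what constraint is imposed on it.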

3) A commonly used linear dimensionality reduction method: Principal Component Analysis (PCA)

Principal component analysis is the most commonly used dimensionality reduction method. Its core idea: for a set of samples, if there exists a hyperplane such that the samples are all close enough to it (minimum reconstruction error), or their projections onto it are as spread out as possible (maximum separability), then this hyperplane is a very appropriate representation of the samples.

This hyperplane itself can then be regarded as the target space of the dimensionality reduction. Write the hyperplane as being spanned by a set of orthonormal basis vectors W = {w1, w2, ...}; principal component analysis reduces the dimensionality by finding this set of orthonormal basis vectors.

So how do we find this set of basis vectors? Here we use a logic common in machine learning: since the hyperplane obtained by dimensionality reduction should represent the sample data well, a point mapped from the hyperplane back into the original space should be close to the point before the mapping. We can therefore find the best orthonormal basis W by minimizing this reconstruction error.

This logic has also appeared in earlier machine learning algorithms, for example in the neural network chapter: the (restricted) Boltzmann machine is trained on the same principle. Its training procedure (the Contrastive Divergence algorithm) works as follows: compute the hidden-layer distribution from the input layer, then recompute a new input-layer distribution from the hidden layer, and use the difference between the new and the old distribution to adjust the connection weights.

Fortunately, here we do not need to adjust weights iteratively to find the best W. From the derivation in the textbook, W can be obtained by eigendecomposition of the covariance matrix of the samples X.

The larger an eigenvalue, the greater the variance along the direction of the corresponding eigenvector, and projecting onto that direction better represents the sample data. So if we want to reduce the dimensionality to d', we only need to take the eigenvectors corresponding to the d' largest eigenvalues (w1, w2, ..., wd'). The value of d' is specified by the user in advance, and a good d' can be chosen by cross-validation or other methods.
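A minimal sketch of this procedure, assuming centered data and the sample covariance matrix; the function name pca and the random data are illustrative.

```python
import numpy as np

def pca(X, d_prime=2):
    """PCA sketch: eigendecompose the covariance matrix and keep the top d' eigenvectors."""
    Xc = X - X.mean(axis=0)                               # center the samples
    C = Xc.T @ Xc / (X.shape[0] - 1)                      # covariance matrix, shape (d, d)
    eigvals, eigvecs = np.linalg.eigh(C)                  # eigenvalues in ascending order
    W = eigvecs[:, np.argsort(eigvals)[::-1][:d_prime]]   # d' directions of largest variance
    return Xc @ W, W                                      # projected samples and the basis W

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
Z, W = pca(X, d_prime=3)
print(Z.shape, W.shape)  # (200, 3) (10, 3)
```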

The above is dimensionality reduction based on linear transformations; the next section introduces dimensionality reduction for the nonlinear case.

4) A commonly used nonlinear dimensionality reduction method: kernelized linear dimensionality reduction

In real tasks, directly applying linear dimensionality reduction can sometimes lose information. The example in the book is based on the following scenario: after a low-dimensional space is mapped into a high-dimensional space, reducing it back to a low-dimensional space with a linear method loses the original low-dimensional structure.

[Figure: the original low-dimensional structure is lost after linear dimensionality reduction]

The essence of kernelized linear dimensionality reduction is to "kernelize" a linear dimensionality reduction method via the kernel trick. Taking the principal component analysis of the previous section as an example, kernelized PCA (KPCA) replaces the inner products of the high-dimensional projections of the samples with a kernel function k(·,·), and eigendecomposes the corresponding kernel matrix to obtain the d'-dimensional projection.
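As a hedged illustration, scikit-learn's KernelPCA implements this idea; the concentric-circles data, the RBF kernel and gamma=10.0 below are illustrative choices, not prescribed by the book.

```python
from sklearn.datasets import make_circles
from sklearn.decomposition import KernelPCA, PCA

# Concentric circles: a structure that plain linear PCA cannot "unfold".
X, y = make_circles(n_samples=300, factor=0.3, noise=0.05, random_state=0)

Z_linear = PCA(n_components=2).fit_transform(X)   # linear PCA keeps the circles entangled
Z_kernel = KernelPCA(n_components=2, kernel="rbf",
                     gamma=10.0).fit_transform(X)  # KPCA with an illustrative RBF kernel

print(Z_linear.shape, Z_kernel.shape)  # (300, 2) (300, 2)
```

With a suitable kernel width, the kernelized projection tends to separate the two circles along its leading components, whereas the linear projection is just a rotation of the original data.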

5) A dimensionality reduction method that draws on the concept of topological manifolds: manifold learning

The core idea of manifold learning is: although the distribution of samples in the high-dimensional space looks very complicated, it still has the properties of a Euclidean space locally. We can therefore establish the dimensionality reduction mapping locally and then generalize from the local to the global, simplifying the cost of dimensionality reduction.

Personal thoughts: the essence of manifold learning is to keep the local Euclidean properties of the sample distribution unchanged before and after dimensionality reduction.

Depending on which Euclidean property is preserved, there are different manifold learning methods. This chapter introduces two well-known ones.

[1] Preserving Euclidean distances: Isometric Mapping (Isomap)

Isometric mapping tries to preserve, after dimensionality reduction, the distances between samples measured along the "manifold".

The basic starting point of isometric mapping is that, after a low-dimensional manifold is embedded in a high-dimensional space, computing straight-line distances directly in the high-dimensional space is misleading, because a straight line in the high-dimensional space may be unreachable on the low-dimensional manifold.

[Figure 10.7: geodesic distance on the S-shaped manifold vs. straight-line distance in the high-dimensional space]

As shown in Figure 10.7(a), suppose the two ends of the red curve are points A and B. For a creature living on the two-dimensional surface, the distance from A to B is the length of the red curve along the manifold; for a creature living in the three-dimensional space, the distance from A to B is the length of the black straight segment.

Clearly, on the S-shaped manifold the length of the red curve is the more appropriate distance, since that path better matches the distribution of the sample data.

Faced with this situation, the isometric mapping algorithm is designed as follows (see the sketch after these steps):

* For each sample xi, determine its k nearest neighbors and compute the distances to them; the distance to any non-neighbor is set to infinity

* Compute the distance between any two samples xi and xj as the shortest path in this neighborhood graph, e.g. with Dijkstra's algorithm

* Use the resulting distance matrix as input to the multidimensional scaling algorithm to reduce the sample dimensionality
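Below is a sketch that mirrors these three steps with a k-nearest-neighbor graph, Dijkstra shortest paths, and MDS on the precomputed geodesic distances; the swiss-roll data, n_neighbors=10 and the use of scikit-learn's (SMACOF-based) MDS are illustrative assumptions.

```python
from sklearn.datasets import make_swiss_roll
from sklearn.neighbors import kneighbors_graph
from scipy.sparse.csgraph import shortest_path
from sklearn.manifold import MDS

X, _ = make_swiss_roll(n_samples=500, random_state=0)

# Step 1: k-nearest-neighbor graph; non-neighbors have no edge (effectively infinite distance)
graph = kneighbors_graph(X, n_neighbors=10, mode="distance")

# Step 2: geodesic distances as shortest paths on the graph (Dijkstra)
D_geo = shortest_path(graph, method="D", directed=False)

# Step 3: feed the geodesic distance matrix into MDS
Z = MDS(n_components=2, dissimilarity="precomputed", random_state=0).fit_transform(D_geo)
print(Z.shape)  # (500, 2)
```

scikit-learn's manifold.Isomap bundles these same steps into a single estimator.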

[2] Preserving the local linear (vector) structure of Euclidean space: Locally Linear Embedding (LLE)

Unlike Isomap, LLE tries to preserve the linear relationships among samples within a neighborhood: a sample can be reconstructed as a linear combination of its neighbors, and this reconstruction relationship should still hold after dimensionality reduction.
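As an illustrative sketch, scikit-learn's LocallyLinearEmbedding implements this idea; the swiss-roll data and n_neighbors=10 below are arbitrary choices.

```python
from sklearn.datasets import make_swiss_roll
from sklearn.manifold import LocallyLinearEmbedding

X, _ = make_swiss_roll(n_samples=500, random_state=0)

# LLE keeps the local linear reconstruction relations when embedding into 2-D
lle = LocallyLinearEmbedding(n_neighbors=10, n_components=2, random_state=0)
Z = lle.fit_transform(X)
print(Z.shape)  # (500, 2)
```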

It is worth noting that for manifold learning to preserve neighborhoods effectively, the samples must be densely sampled, which is precisely the major obstacle in high-dimensional situations. Therefore the dimensionality reduction performance of manifold learning methods in practice is often not as good as expected. Nevertheless, the idea of neighborhood preservation is very meaningful and has had an important influence on other branches of machine learning, such as the famous manifold assumption in semi-supervised learning.

6) An alternative to dimensionality reduction, learning the distance metric directly: metric learning

In machine learning, the main purpose of reducing the dimensionality of high-dimensional data is to find a suitable low-dimensional space in which learning is easier than in the original space. In fact, every space corresponds to a distance metric defined on the sample attributes, so why not simply try to "learn" an appropriate distance metric directly? This is the basic motivation of metric learning.

To learn a distance metric, weights are introduced into the distance computation between samples, and these weights can be trained on the specific data. The matrix formed by these weights is called the "metric matrix" M; the corresponding squared distance is the Mahalanobis distance (xi - xj)^T M (xi - xj), where M must be positive semidefinite.

The goal of metric learning is to compute a suitable metric matrix. In practice, we can directly embed the metric matrix M into the performance measure of, say, a nearest-neighbor classifier, and obtain M by optimizing that measure.
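One well-known method along these lines is Neighborhood Components Analysis (NCA), which learns a linear transformation A (equivalently a metric matrix M = A^T A) by directly optimizing nearest-neighbor performance. The sketch below uses scikit-learn's implementation; the iris data and the pipeline setup are illustrative.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier, NeighborhoodComponentsAnalysis
from sklearn.pipeline import Pipeline

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# NCA learns the transformation whose induced metric makes kNN perform well,
# then the kNN classifier is applied in the transformed space.
model = Pipeline([
    ("nca", NeighborhoodComponentsAnalysis(random_state=0)),
    ("knn", KNeighborsClassifier(n_neighbors=3)),
])
model.fit(X_tr, y_tr)
print(model.score(X_te, y_te))  # accuracy under the learned metric
```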

2. Basic knowledge

1) Lazy learning

Such learning techniques merely store the samples during the training phase, so the training time overhead is zero; the samples are processed only after a test sample is received.

2) Eager learning

In contrast to lazy learning, such techniques learn from and process the samples during the training phase.

3) Curse of dimensionality

Problems such as sample sparsity and difficult distance computation in high-dimensional situations are serious obstacles faced by all machine learning methods; they are called the "curse of dimensionality".

4) Trace of the matrix

In linear algebra, the sum of the elements on the main diagonal of an n×n matrix A (the diagonal from the upper left to the lower right) is called the trace of A, generally denoted tr(A).

5) Eigenvalue decomposition of matrix

A method of factorizing a matrix in terms of its eigenvalues and eigenvectors, A = VΛV⁻¹, where Λ is a diagonal matrix of eigenvalues and the columns of V are the corresponding eigenvectors. Note that eigendecomposition only applies to diagonalizable matrices. An intuition: just as any vector in a two-dimensional plane can be expressed as a combination of the x-axis and y-axis directions, the action of the matrix can be described along a set of characteristic directions.
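A tiny numerical check of this definition with NumPy; the 2×2 symmetric matrix below is an arbitrary example.

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])                 # a symmetric (hence diagonalizable) matrix

eigvals, eigvecs = np.linalg.eig(A)        # A @ v = lambda * v for each eigenvector column v
print(eigvals)                             # eigenvalues 3 and 1 (order may vary)
print(np.allclose(A, eigvecs @ np.diag(eigvals) @ np.linalg.inv(eigvecs)))  # True: A = V Lambda V^-1
```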

3. Summary

1) k-nearest neighbor learning is a simple and commonly used classification algorithm. When the samples are sufficiently dense, its worst-case error does not exceed twice that of the Bayes optimal classifier

2) In practice, an overly large attribute dimensionality leads to the "curse of dimensionality", a common obstacle throughout machine learning

3) An effective way to alleviate the curse of dimensionality is dimensionality reduction, i.e. mapping high-dimensional samples into a low-dimensional space. This not only lowers the attribute dimensionality and the computational cost, but also increases the sample density

4) Some information of the original data is inevitably lost during dimensionality reduction, so different reduction goals give rise to different reduction methods

5) The goal of multidimensional scaling is to ensure that the distance between samples remains unchanged after dimensionality reduction

6) The goal of linear dimensionality reduction methods is to ensure that the low-dimensional hyperplane represents the original data well

7) The goal of kernelized linear dimensionality reduction is to use kernel functions and the kernel trick to avoid losing the low-dimensional structure when the sample space is projected into a high-dimensional space and then reduced

8) The goal of isometric mapping is to preserve, after dimensionality reduction, the distances between samples measured along the "manifold"

9) The goal of locally linear embedding is that each sample can be reconstructed from its neighbors and that this reconstruction relationship still holds after dimensionality reduction

10) Metric learning bypasses the dimensionality reduction process and turns the learning objective into learning the weight (metric) matrix used in distance computation

An interesting aside: the total number of elementary particles in the universe is about 10 to the 80th power, and a single grain of dust contains billions of elementary particles; yet the number of samples needed to satisfy the dense-sampling assumption of the k-nearest neighbor algorithm with only 20 attributes is already 10 to the 60th power.



Source: blog.csdn.net/yanyiting666/article/details/99071518