Chapter 8 - Dimensionality Reduction

Machine learning problems may involve hundreds or even thousands of features. Having too many features not only makes training slow, it also makes it much harder to find a good solution. This problem is known as the curse of dimensionality. To simplify the problem and speed up training, dimensionality reduction is often needed.

Dimensionality reduction loses some information (just as compressing an image to JPEG reduces its quality), so although it speeds up training, it may make the model perform slightly worse. Therefore, you should first try training on the original data; only if training is too slow should you consider dimensionality reduction.

8.1 The Curse of Dimensionality

We live in three-dimensional space; even four-dimensional space is hard to picture intuitively, let alone higher-dimensional spaces (see the Wikipedia introduction to four-dimensional space, and a YouTube video that projects four-dimensional space into three dimensions). High-dimensional space behaves very differently from low-dimensional space. For example, in the unit square, only about 0.4% of points lie within 0.001 of the boundary (the area of this border region is roughly $0.001 \times 1 \times 4 = 0.004$, or 0.4% of the total area). But in a 10,000-dimensional unit hypercube, this probability rises above 99.999999%: the vast majority of points are very close to the border along at least one dimension. An interesting analogy: people have many different attributes, so almost everyone you know is probably an extremist in at least one of them (such as the amount of sugar they take in their coffee).
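The border fraction can be checked numerically. Below is a minimal Monte Carlo sketch assuming NumPy; the function name `border_fraction`, the sample sizes, and the random seed are illustrative choices, not something from the original text.

```python
import numpy as np

rng = np.random.default_rng(42)

def border_fraction(dims, n_samples, margin=0.001):
    """Estimate the fraction of uniformly random points in a dims-dimensional
    unit hypercube that lie within `margin` of the border along at least one axis."""
    points = rng.random((n_samples, dims))
    near_border = np.any((points < margin) | (points > 1 - margin), axis=1)
    return near_border.mean()

# Unit square: roughly 0.4% of points are near the border.
print(border_fraction(2, n_samples=1_000_000))   # ~0.004
# 10,000-dimensional hypercube: almost every point is near the border.
print(border_fraction(10_000, n_samples=1_000))  # ~1.0
```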

There is a more troublesome difference: if you pick two points at random in the unit square, the distance between them is about 0.52 on average. In the unit 3D cube, the average distance is about 0.66. In a 1,000,000-dimensional unit hypercube, the average distance grows to about 408.25 (approximately $\sqrt{1{,}000{,}000/6}$). This means that high-dimensional data sets are likely to be very sparse: training instances are far apart from each other, a new instance to be predicted is likely to be far from every training instance, and predictions become much less reliable than in lower dimensions. In short, high-dimensional data sets are prone to overfitting.
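These average distances can also be estimated with a short simulation. This is a rough sketch assuming NumPy; `mean_pair_distance` and the sample counts are illustrative, and note that $\sqrt{d/6}$ is only the large-$d$ approximation (it is not accurate for $d = 2$ or $d = 3$).

```python
import numpy as np

rng = np.random.default_rng(0)

def mean_pair_distance(dims, n_pairs=1_000):
    """Average Euclidean distance between two uniformly random points
    in a unit hypercube of the given dimensionality."""
    a = rng.random((n_pairs, dims))
    b = rng.random((n_pairs, dims))
    return np.linalg.norm(a - b, axis=1).mean()

for d in (2, 3, 1_000, 10_000):
    # Empirical mean distance vs. the large-d approximation sqrt(d/6).
    print(d, round(mean_pair_distance(d), 3), round(float(np.sqrt(d / 6)), 3))
```

For $d = 2$ this prints roughly 0.52, for $d = 3$ roughly 0.66, and for large $d$ the empirical value closely matches $\sqrt{d/6}$.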

In theory, one solution to the curse of dimensionality is to increase the number of training samples until the training set reaches sufficient density. In practice this is infeasible, because the number of samples required grows exponentially with the number of dimensions.
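As a rough back-of-the-envelope estimate: keeping training instances within about $\varepsilon$ of each other in a $d$-dimensional unit hypercube requires on the order of $(1/\varepsilon)^d$ instances, so with $\varepsilon = 0.1$ and just $d = 100$ features you would already need roughly $10^{100}$ training instances.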

 
