Machine Learning Algorithms: A Deeper Look at UMAP


Dimensionality reduction is a common approach for machine learning practitioners who want to visualize and understand large, high-dimensional datasets. One of the most widely used visualization techniques is t-SNE, but its performance degrades as dataset size grows, and using it correctly involves a learning curve.

UMAP is a newer algorithm developed by McInnes et al. It has several advantages over t-SNE, most notably increased computational speed and better preservation of the data's global structure. In this article, we'll look at the theory behind UMAP to better understand how the algorithm works, how to use it correctly and effectively, and how its performance compares to t-SNE's.

[Figure: UMAP projection]

So, what does UMAP bring to the table? Most importantly, UMAP is fast and scales well with both dataset size and dimensionality. For example, UMAP can reduce the 784-dimensional, 70,000-point MNIST dataset in less than 3 minutes, compared to 45 minutes for scikit-learn's t-SNE implementation. UMAP also tends to better preserve the data's global structure. This can be attributed to UMAP's strong theoretical foundation, which lets the algorithm strike a better balance between emphasizing local and global structure.
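Below is a minimal sketch of that MNIST comparison, assuming the umap-learn and scikit-learn packages are installed; exact timings will vary with hardware, and the t-SNE run on the full dataset is very slow, which is exactly the point.

```python
# Hedged sketch: time UMAP vs scikit-learn's t-SNE on MNIST (784-dim, 70k points).
import time

import umap
from sklearn.datasets import fetch_openml
from sklearn.manifold import TSNE

mnist = fetch_openml("mnist_784", version=1, as_frame=False)
X = mnist.data  # shape (70000, 784)

start = time.perf_counter()
umap_emb = umap.UMAP(n_components=2).fit_transform(X)
print(f"UMAP:  {time.perf_counter() - start:.1f}s")

start = time.perf_counter()
tsne_emb = TSNE(n_components=2).fit_transform(X)
print(f"t-SNE: {time.perf_counter() - start:.1f}s")
```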

1. UMAP vs t-SNE

Before diving into the theory behind UMAP, let's take a look at how it performs on real-world, high-dimensional data. The image below shows both UMAP and t-SNE reducing a subset of the 784-dimensional Fashion MNIST dataset to 3 dimensions. Note the degree to which each distinct class clusters together (local structure), while similar classes (such as sandals, sneakers, and ankle boots) also tend to cluster near one another (global structure).

[Figure: Dimensionality reduction]

While both algorithms exhibit strong local clustering and bring similar categories together, UMAP more clearly separates these groups of similar categories from each other. It is also worth noting that UMAP took 4 minutes to compute, versus 27 minutes for multicore t-SNE.

2. Theory

At their core, UMAP and t-SNE are very similar: both use a graph layout algorithm to arrange data in a low-dimensional space. In simple terms, UMAP first constructs a high-dimensional graph representation of the data, then optimizes a low-dimensional graph to be as structurally similar as possible. Although the mathematics UMAP uses to construct the high-dimensional graph is advanced, the intuition behind it is quite simple.

To construct the initial high-dimensional graph, UMAP builds something called a fuzzy simplicial complex. This is really just a representation of a weighted graph, with edge weights indicating the likelihood that two points are connected. To determine connectivity, UMAP extends a radius outward from each point, connecting points whenever their radii overlap. The choice of this radius is critical: too small, and the result is many small, isolated clusters; too large, and everything ends up connected. UMAP overcomes this difficulty by choosing the radius locally, based on the distance to each point's nth nearest neighbor. UMAP then makes the graph "fuzzy" by decreasing the probability of connection as the radius grows. Finally, UMAP ensures that local structure stays balanced against global structure by stipulating that each point must be connected to at least its nearest neighbor.

[Figure: radius]
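The sketch below is a simplified illustration of this local-radius idea, not UMAP's exact algorithm: the real implementation solves for each point's bandwidth so that the effective number of neighbors matches n_neighbors, and then takes a fuzzy union of the directed edges. Here a crude stand-in bandwidth is used just to show the shape of the computation.

```python
# Simplified sketch of fuzzy edge weights: weight = 1 at each point's nearest
# neighbor, decaying as the connection radius grows beyond it.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def fuzzy_edge_weights(X, n_neighbors=15):
    knn = NearestNeighbors(n_neighbors=n_neighbors).fit(X)
    dists, idx = knn.kneighbors(X)     # column 0 is each point itself (distance 0)
    rho = dists[:, 1]                  # distance to the nearest real neighbor
    sigma = dists.mean(axis=1) + 1e-8  # crude stand-in for UMAP's solved bandwidth
    # Beyond rho, connection probability decays exponentially with distance:
    weights = np.exp(-np.maximum(dists - rho[:, None], 0.0) / sigma[:, None])
    return idx, weights

X = np.random.default_rng(0).normal(size=(200, 10))
idx, w = fuzzy_edge_weights(X)
```

Note how the guarantee from the paragraph above falls out directly: at the nearest neighbor the distance equals rho, so its edge weight is exactly 1.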

Once the high-dimensional graph is constructed, UMAP optimizes the layout of a low-dimensional analogue to be as similar to it as possible. This process is essentially the same as in t-SNE, but UMAP uses a few clever tricks to speed it up.
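The schematic below shows the attract/repel structure of that layout step, reusing the hypothetical idx and w arrays from the previous sketch. It is not UMAP's actual optimizer, which follows the gradient of a cross-entropy objective with a fitted low-dimensional curve and negative sampling; this just makes the shape of the loop concrete.

```python
# Schematic layout loop: pull connected points together (weighted by the fuzzy
# graph), push randomly sampled points apart. Assumes idx, w from the sketch above.
import numpy as np

def layout(idx, w, n_points, n_epochs=200, lr=0.1, seed=0):
    rng = np.random.default_rng(seed)
    emb = rng.normal(scale=1e-2, size=(n_points, 2))  # random 2-D initialization
    for _ in range(n_epochs):
        for i in range(n_points):
            for j, weight in zip(idx[i, 1:], w[i, 1:]):
                # Attraction: move i toward its graph neighbor j, scaled by edge weight.
                emb[i] += lr * weight * (emb[j] - emb[i])
                # Repulsion: push i away from one randomly sampled point.
                k = rng.integers(n_points)
                diff = emb[i] - emb[k]
                emb[i] += lr * 0.01 * diff / (np.dot(diff, diff) + 1e-4)
    return emb

emb = layout(idx, w, n_points=len(w))
```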

The key to using UMAP effectively lies in understanding how that initial high-dimensional graph is constructed. Although the idea behind the process is quite intuitive, the algorithm relies on some advanced mathematics to provide strong theoretical guarantees about how well the graph actually represents the data. Interested readers can consult: In-depth understanding of UMAP theory.

3. Parameters

Once you understand the theory behind UMAP, its parameters become much easier to reason about, especially compared to t-SNE's perplexity parameter. We will consider the two most commonly used parameters, n_neighbors and min_dist, which effectively control the balance between local and global structure in the final dimensionality reduction.

[Figure: parameters]

  • n_neighbors

The most important parameter is n_neighbors, the number of approximate nearest neighbors used to construct the initial high-dimensional graph. It effectively controls how UMAP balances local versus global structure: smaller values push UMAP to focus on local structure by limiting the number of neighboring points considered when analyzing the high-dimensional data, while larger values push UMAP toward representing big-picture structure at the cost of fine detail.

  • min_dist

The second parameter is min_dist, the minimum distance between points in the low-dimensional space. This parameter controls how tightly UMAP packs points together: lower values produce tighter embeddings, while larger min_dist values make UMAP pack points more loosely, focusing instead on preserving the broad topological structure (a minimal usage sketch follows below).
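Here is a minimal sketch of setting both parameters, assuming the umap-learn package; the values shown are the library's defaults (n_neighbors=15, min_dist=0.1).

```python
import umap

reducer = umap.UMAP(
    n_neighbors=15,  # smaller -> more local detail, larger -> more global structure
    min_dist=0.1,    # smaller -> tighter clumps, larger -> more even spread
)
# embedding = reducer.fit_transform(X)  # X: (n_samples, n_features) array
```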

The visualization below explores the effect of UMAP's parameters on 2D projections of 3D data. By changing the n_neighbors and min_dist parameters, you can explore their effect on the resulting projection.

[Figure: dimensions]

While most applications of UMAP involve projecting high-dimensional data, projecting from 3D serves as a useful analogy for understanding how UMAP prioritizes global versus local structure as its parameters change. As n_neighbors increases, UMAP connects more and more neighboring points when constructing its graph representation of the high-dimensional data, resulting in a projection that more accurately reflects the data's global structure. At very low values, any notion of global structure is almost completely lost. As the min_dist parameter increases, UMAP tends to "spread out" the projected points, resulting in less clustering of the data and less emphasis on global structure.
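A small parameter sweep like the sketch below is a practical way to explore these interactions on your own data. This is a hedged sketch, assuming umap-learn and matplotlib are installed; it uses scikit-learn's small digits dataset as a stand-in so it runs quickly.

```python
# Sweep n_neighbors x min_dist and plot the nine resulting 2-D embeddings.
import itertools

import matplotlib.pyplot as plt
import umap
from sklearn.datasets import load_digits

X = load_digits().data  # stand-in dataset: 1797 points, 64 dimensions

n_neighbors_values = [5, 15, 50]
min_dist_values = [0.0, 0.1, 0.5]

fig, axes = plt.subplots(3, 3, figsize=(9, 9))
for ax, (n, d) in zip(axes.flat, itertools.product(n_neighbors_values, min_dist_values)):
    emb = umap.UMAP(n_neighbors=n, min_dist=d).fit_transform(X)
    ax.scatter(emb[:, 0], emb[:, 1], s=2)
    ax.set_title(f"n_neighbors={n}, min_dist={d}", fontsize=8)
plt.tight_layout()
plt.show()
```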

4. UMAP vs t-SNE, revisited

The biggest difference between the output of UMAP and that of t-SNE is the balance between local and global structure: UMAP is generally better at preserving global structure in the final projection. This means the relationships between clusters may be more meaningful than in t-SNE. Importantly, though, any given axis or distance in the low-dimensional space is still not directly interpretable the way it is with techniques such as PCA, since both UMAP and t-SNE necessarily distort the high-dimensional shape of the data when projecting it down.

[Figure: comparison]

Going back to the 3D mammoth example, we can easily see some big differences between the outputs of the two algorithms. For low values of the perplexity parameter, t-SNE tends to "unroll" the projected data, preserving little global structure. In contrast, UMAP tends to group adjacent parts of the high-dimensional structure together in low dimensions, reflecting its global structure. Note that t-SNE requires extremely high perplexity values (~1000) before global structure starts to appear, and at such large values the computation time stretches significantly. It is also worth noting that t-SNE projections vary widely from run to run, with different parts of the high-dimensional data projected to different locations. Although UMAP is also a stochastic algorithm, its projections are surprisingly similar from run to run and across different parameters.
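You can check this run-to-run stability yourself with a sketch like the following, assuming umap-learn and scikit-learn; umap.UMAP's random_state parameter fixes the seed for each run.

```python
# Fit the same data with three different seeds and compare the embeddings.
import umap
from sklearn.datasets import load_digits

X = load_digits().data
embeddings = []
for seed in (0, 1, 2):
    emb = umap.UMAP(random_state=seed).fit_transform(X)
    embeddings.append(emb)
# Plotting each embedding side by side: the overall arrangement of clusters
# tends to look similar across seeds, unlike typical t-SNE runs.
```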

It is worth noting that t-SNE and UMAP perform very similarly on most of the toy examples in the earlier figure, with the notable exception of the example below: interestingly, UMAP cannot separate the two nested clusters, especially in higher dimensions.

[Figure: toy datasets]

UMAP's failure on this containment case is likely due to its use of local distances during the initial graph construction. Since distances between high-dimensional points tend to be very similar (the curse of dimensionality), UMAP appears to connect the outer points of the inner cluster to points of the outer cluster, which effectively blends the two clusters together.
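The sketch below is one hypothetical way to reproduce this effect, assuming umap-learn: it samples two concentric shells in 10 dimensions, a nested-cluster toy similar in spirit to the one in the figure.

```python
# Two concentric hyperspheres in 10-D: a nested-cluster toy example.
import numpy as np
import umap

rng = np.random.default_rng(0)
dim, n = 10, 500

def shell(radius):
    pts = rng.normal(size=(n, dim))
    return radius * pts / np.linalg.norm(pts, axis=1, keepdims=True)

X = np.vstack([shell(1.0), shell(2.0)])   # inner and outer clusters
labels = np.array([0] * n + [1] * n)
emb = umap.UMAP(n_neighbors=15).fit_transform(X)
# A scatter plot of `emb` colored by `labels` typically shows the two shells
# blended together rather than cleanly separated.
```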

5. Interpreting UMAP results

While UMAP offers many advantages over t-SNE, it is by no means a panacea, and reading and interpreting its results requires some caution.

  1. Hyperparameters really matter

Choosing good hyperparameters is not easy, and the right choice depends on the data and your goals. This is where UMAP's speed is a big advantage: by running UMAP multiple times with a variety of hyperparameters, you can get a better sense of how the projection is affected by its parameters.

  2. Cluster sizes are meaningless

As in t-SNE, the sizes of clusters relative to one another are essentially meaningless. This is due to UMAP's use of local notions of distance when constructing its high-dimensional graph representation.

  3. Distances between clusters may not mean anything

Likewise, the distances between clusters may be meaningless. While it is true that UMAP preserves the global positions of clusters better, the distances between them carry no meaning. Again, this is due to the use of local distances when constructing the graph.

  4. Random noise doesn't always look random

Especially at low n_neighbors values, spurious clustering can be observed (see the sketch after this list).

  5. You need to visualize the results more than once

Since the UMAP algorithm is stochastic, different runs with the same hyperparameters may produce different results. And since the choice of hyperparameters is so important, it is worth running the projection multiple times with a variety of settings.
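As promised under point 4, here is a hedged sketch of spurious clustering on pure noise, assuming umap-learn: with a very low n_neighbors, uniformly random data can come out looking clustered.

```python
# Pure uniform noise pushed through UMAP with a tiny neighborhood size.
import numpy as np
import umap

X = np.random.default_rng(0).uniform(size=(500, 10))  # no structure at all
emb = umap.UMAP(n_neighbors=2).fit_transform(X)
# A scatter plot of `emb` will often show apparent clusters even though
# the input contains none.
```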

Summary

UMAP is a very powerful tool in the data scientist's arsenal, with many advantages over t-SNE. While UMAP's output is somewhat similar to t-SNE's, its increased speed, better preservation of global structure, and easier-to-understand parameters make UMAP a more effective tool for visualizing high-dimensional data. Finally, it's important to remember that no dimensionality reduction technique is perfect, and UMAP is no exception. However, by building an intuitive understanding of how the algorithm works and how to tune its parameters, we can use this powerful tool more effectively to visualize and understand large, high-dimensional datasets.



Original article: blog.csdn.net/swindler_ice/article/details/127820903