Introduction
Dimensionality reduction is a common way for machine learning practitioners to visualize and understand large, high-dimensional datasets. One of the most widely used visualization techniques is t-SNE, but its performance degrades with dataset size, and using it correctly involves a learning curve.
UMAP is a newer algorithm developed by McInnes et al. It has many advantages over t-SNE, most notably increased computational speed and better preservation of the data's global structure. In this article, we'll look at UMAP: the theory behind it, to better understand how the algorithm works; how to use it correctly and effectively; and how it performs compared to t-SNE.
So, what does UMAP bring? Most importantly, UMAP is fast and scales well with both dataset size and dimensionality. For example, UMAP can reduce the 784-dimensional, 70,000-point MNIST dataset in less than 3 minutes, compared to 45 minutes for scikit-learn's t-SNE. UMAP also tends to better preserve the global structure of the data. This can be attributed to its strong theoretical foundation, which enables the algorithm to strike a better balance between emphasizing local and global structure.
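To make this concrete, here is a minimal sketch of running UMAP on MNIST with the umap-learn package (assuming `pip install umap-learn scikit-learn`); exact timings will of course depend on your hardware.

```python
# Minimal sketch: embed MNIST with umap-learn (pip install umap-learn scikit-learn).
from sklearn.datasets import fetch_openml
import umap

# Fetch the 70,000-point, 784-dimensional MNIST dataset from OpenML.
X, y = fetch_openml("mnist_784", as_frame=False, return_X_y=True)

# Default UMAP settings produce a 2-D embedding; this is the step that the
# "under 3 minutes" figure above refers to (actual time depends on hardware).
embedding = umap.UMAP(random_state=42).fit_transform(X)
print(embedding.shape)  # (70000, 2)
```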
1. UMAP vs t-SNE
Before diving into the theory behind UMAP, let's take a look at how it performs on real-world high-dimensional data. The image below shows UMAP and t-SNE each reducing a subset of the 784-dimensional Fashion MNIST dataset to 3 dimensions. Note the extent to which each distinct class forms its own cluster (local structure), while similar classes (such as sandals, sneakers, and ankle boots) sit near one another (global structure).
While both algorithms exhibit strong local clustering and bring similar categories together, UMAP more clearly separates these groups of similar categories from each other. The computation time is also worth noting: UMAP takes 4 minutes, while multicore t-SNE takes 27 minutes.
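A hedged sketch of how such a comparison might be run locally (the OpenML dataset name and the subset size are assumptions; timings vary widely with hardware and library versions):

```python
# Sketch: reduce a Fashion MNIST subset to 3 dimensions with both algorithms
# and compare wall-clock time. Timings vary widely with hardware and versions.
import time

import umap
from sklearn.datasets import fetch_openml
from sklearn.manifold import TSNE

X, _ = fetch_openml("Fashion-MNIST", as_frame=False, return_X_y=True)
X = X[:20000]  # a subset, as in the figure above (subset size is an assumption)

start = time.time()
umap_3d = umap.UMAP(n_components=3).fit_transform(X)
print(f"UMAP:  {time.time() - start:.0f}s")

start = time.time()
tsne_3d = TSNE(n_components=3).fit_transform(X)  # Barnes-Hut supports up to 3 components
print(f"t-SNE: {time.time() - start:.0f}s")
```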
2. Theory
At their core, UMAP and t-SNE are very similar: both use a graph layout algorithm to arrange data in low-dimensional space. In simple terms, UMAP first constructs a high-dimensional graph representation of the data, then optimizes a low-dimensional graph to be as structurally similar as possible. Although the mathematics UMAP uses to construct the high-dimensional graph is advanced, the intuition behind it is very simple.
To construct the initial high-dimensional graph, UMAP builds something called a fuzzy simplicial complex. This is really just a representation of a weighted graph, with edge weights indicating the likelihood that two points are connected. To determine connectivity, UMAP extends a radius outward from each point and connects points when those radii overlap. The choice of this radius is critical: too small, and the result is small, isolated clusters; too large, and everything is connected to everything. UMAP overcomes this difficulty by choosing the radius locally, based on the distance to each point's nth nearest neighbor. UMAP then makes the graph "fuzzy" by decreasing the probability of connection as the radius grows. Finally, UMAP ensures that local structure stays balanced against global structure by stipulating that each point must be connected to at least its nearest neighbor.
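The following is a simplified sketch of that construction, not the actual umap-learn implementation: it uses exact Euclidean nearest neighbors and dense matrices, takes each point's distance to its nearest neighbor as the "free" radius, and binary-searches a local scale so that every point ends up with roughly the same effective connectivity.

```python
# Simplified sketch of UMAP's high-dimensional fuzzy graph (not the real
# umap-learn implementation; dense matrices, exact neighbors, Euclidean only).
import numpy as np
from sklearn.neighbors import NearestNeighbors

def fuzzy_graph(X, k=15, n_iter=64):
    # Exact k-nearest neighbors; drop each point's zero distance to itself.
    dists, idx = NearestNeighbors(n_neighbors=k + 1).fit(X).kneighbors(X)
    dists, idx = dists[:, 1:], idx[:, 1:]
    rho = dists[:, 0]            # distance to nearest neighbor: always connected
    target = np.log2(k)          # desired total "fuzzy" connectivity per point
    n = X.shape[0]
    W = np.zeros((n, n))
    for i in range(n):
        lo, hi, sigma = 0.0, np.inf, 1.0
        for _ in range(n_iter):  # binary search for the local radius scale sigma_i
            w = np.exp(-np.maximum(dists[i] - rho[i], 0.0) / sigma)
            if w.sum() > target:
                hi = sigma
            else:
                lo = sigma
            sigma = sigma * 2.0 if hi == np.inf else (lo + hi) / 2.0
        W[i, idx[i]] = w
    # Symmetrize with a fuzzy union: P(i~j) = w_ij + w_ji - w_ij * w_ji.
    return W + W.T - W * W.T
```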
Once the high-dimensional graph is constructed, UMAP optimizes the layout of a low-dimensional analogue to be as similar to it as possible. This process is essentially the same as in t-SNE, but uses a few clever tricks to speed it up.
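A much-simplified sketch of that layout step is below. It fixes UMAP's low-dimensional similarity curve at 1/(1 + d²) (real UMAP fits curve parameters a and b from min_dist), pulls connected points together edge by edge, and pushes apart a few randomly sampled "negative" points; the actual implementation adds negative-sampling schedules and many other optimizations.

```python
# Much-simplified layout sketch: attract along graph edges, repel negatives.
# Real UMAP fits curve parameters (a, b) from min_dist; here a = b = 1, so the
# low-dimensional similarity is 1 / (1 + d^2).
import numpy as np

def optimize_layout(W, n_epochs=200, lr=1.0, n_neg=5, seed=0):
    rng = np.random.default_rng(seed)
    n = W.shape[0]
    Y = rng.normal(scale=0.1, size=(n, 2))  # random initial 2-D positions
    edges = np.argwhere(W > 0)
    for _ in range(n_epochs):
        for i, j in edges:
            diff = Y[i] - Y[j]
            d2 = diff @ diff
            # Attractive gradient of the cross-entropy for a connected pair.
            Y[i] -= lr * W[i, j] * (2.0 / (1.0 + d2)) * diff
            for k in rng.integers(0, n, size=n_neg):
                diff = Y[i] - Y[k]
                d2 = diff @ diff
                # Repulsive gradient for a (presumed unconnected) sampled pair.
                Y[i] += lr * (2.0 / ((0.001 + d2) * (1.0 + d2))) * diff
    return Y
```

Chaining the two sketches (fuzzy_graph, then optimize_layout on its output) gives a toy end-to-end UMAP, useful only for building intuition.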
The key to using UMAP effectively lies in understanding the construction of this initial high-dimensional graph. Although the intuition behind the process is straightforward, the algorithm relies on some advanced mathematics to provide strong theoretical guarantees about how well the graph actually represents the data. Interested readers can refer to: In-depth understanding of UMAP theory.
3. Parameters
Once you understand the theory behind UMAP, its parameters become much easier to understand, especially compared to the perplexity parameter in t-SNE. We will consider the two most commonly used parameters, n_neighbors and min_dist, which effectively control the balance between local and global structure in the final dimensionality reduction.
n_neighbors
The most important parameter is n_neighbors: the number of approximate nearest neighbors used to construct the initial high-dimensional graph. It effectively controls how UMAP balances local versus global structure: smaller values push UMAP to focus on local structure by limiting the number of neighboring points considered when analyzing the high-dimensional data, while larger values push UMAP to represent the big-picture global structure at the cost of fine detail.
min_dist
The second parameter is min_dist: the minimum distance between points in the low-dimensional space. This parameter controls how tightly UMAP packs points together, with lower values yielding tighter embeddings. Larger min_dist values cause UMAP to pack points more loosely, focusing instead on preserving the broad topological structure.
The visualization below explores the effect of UMAP's parameters on a 2D projection of 3D data. By changing the n_neighbors and min_dist parameters, you can explore their effect on the resulting projection; a sketch of such a sweep follows.
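If you want to explore the same effect on your own data, a small parameter sweep is easy to write (a sketch, assuming `X` is your data matrix and matplotlib is installed):

```python
# Sketch of a parameter sweep over n_neighbors and min_dist; assumes `X` is
# an (n_samples, n_features) array and matplotlib is installed.
import matplotlib.pyplot as plt
import umap

n_neighbors_vals = [5, 15, 50]
min_dist_vals = [0.0, 0.1, 0.5]

fig, axes = plt.subplots(len(n_neighbors_vals), len(min_dist_vals), figsize=(12, 12))
for i, n in enumerate(n_neighbors_vals):
    for j, d in enumerate(min_dist_vals):
        emb = umap.UMAP(n_neighbors=n, min_dist=d).fit_transform(X)
        axes[i, j].scatter(emb[:, 0], emb[:, 1], s=2)
        axes[i, j].set_title(f"n_neighbors={n}, min_dist={d}")
plt.tight_layout()
plt.show()
```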
While most applications of UMAP involve projecting high-dimensional data, projecting from 3D serves as a useful analogy for understanding how UMAP prioritizes global versus local structure according to its parameters. As n_neighbors increases, UMAP connects more and more neighboring points when constructing the graph representation of the high-dimensional data, resulting in a projection that more accurately reflects the global structure of the data. At very low values, any information about global structure is almost completely lost. As the min_dist parameter increases, UMAP tends to "spread out" the projected points, resulting in less clustering of the data and less emphasis on the global structure.
4. UMAP vs t-SNE, revisited
The biggest difference between the output of UMAP and that of t-SNE is the balance between local and global structure: UMAP is generally better at preserving global structure in the final projection. This means the relationships between clusters are potentially more meaningful than in t-SNE. Importantly, though, any given axis or distance in the low-dimensional output is still not directly interpretable the way it is with techniques such as PCA, since both UMAP and t-SNE necessarily distort the high-dimensional shape of the data when projecting it to lower dimensions.
Turning to the 3D mammoth example, we can easily see some big differences between the two algorithms' outputs. For lower values of the perplexity parameter, t-SNE tends to "unroll" the projected data, preserving little global structure. In contrast, UMAP tends to group adjacent parts of the high-dimensional structure together in low dimensions, reflecting the global structure. Note that with t-SNE, extremely high perplexity values (~1000) are required before global structure starts to appear, and at such large values the computation time stretches significantly. It is also worth noting that t-SNE projections vary widely from run to run, with different parts of the high-dimensional data projected to different locations. Although UMAP is also a stochastic algorithm, the resulting projections are surprisingly similar from run to run and across different parameters.
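A sketch of reproducing the perplexity effect with scikit-learn's t-SNE (a sketch only: it assumes `X` from the earlier snippets, perplexity must stay below the number of samples, and runtime grows sharply at high values):

```python
# Sketch: watch global structure emerge as t-SNE's perplexity grows.
# Perplexity must be smaller than n_samples; high values are very slow.
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

perplexities = [5, 30, 100, 1000]
fig, axes = plt.subplots(1, len(perplexities), figsize=(16, 4))
for ax, perp in zip(axes, perplexities):
    emb = TSNE(perplexity=perp).fit_transform(X)  # assumes `X` from earlier
    ax.scatter(emb[:, 0], emb[:, 1], s=2)
    ax.set_title(f"perplexity={perp}")
plt.show()
```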
It is worth noting that t-SNE and UMAP perform very similarly on the toy examples in the earlier figure, with the exception of the example below. Interestingly, UMAP is unable to separate two nested clusters, especially in high dimensions.
UMAP's failure to handle this case of containment may be due to its use of local distances in the initial graph construction. Since distances between high-dimensional points are often very similar (the curse of dimensionality), UMAP appears to connect the outer points of the inner cluster to points of the outer cluster, effectively blending the two clusters together.
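A toy way to reproduce this failure mode (this construction is an assumption approximating the figure's example, not the article's exact data):

```python
# Toy reproduction of the nested-cluster failure (an approximation of the
# figure's example, not its exact data).
import numpy as np
import umap

rng = np.random.default_rng(0)
dim = 10                                         # raise this to see the blending worsen
inner = rng.normal(scale=0.5, size=(500, dim))   # tight inner cluster at the origin
shell = rng.normal(size=(500, dim))
shell = 5.0 * shell / np.linalg.norm(shell, axis=1, keepdims=True)  # surrounding shell
X_nested = np.vstack([inner, shell])
labels = np.array([0] * 500 + [1] * 500)

emb = umap.UMAP().fit_transform(X_nested)
# Scatter-plotting `emb` colored by `labels` tends to show the two sets
# bleeding together as `dim` grows, consistent with the explanation above.
```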
5. Caveats
While UMAP
offering many t-SNE
advantages over others, it is by no means a panacea, and reading and interpreting its results requires caution.
- Hyperparameters really matter
Choosing good hyperparameters is not easy, and the right choice depends on the data and your goals. This is where UMAP's speed is a big advantage: by running UMAP multiple times with various hyperparameters, you can get a better sense of how the projection is affected by its parameters.
- Cluster sizes don't mean anything
As in t-SNE, the sizes of clusters relative to each other are essentially meaningless. This is due to UMAP's use of a local notion of distance when building its high-dimensional graph representation.
- Distances between clusters may not mean anything
Likewise, the distances between clusters may be meaningless. While it is true that the global positions of clusters are better preserved in UMAP, the distances between them carry no meaning. Again, this is due to the use of local distances when constructing the graph.
- Random noise doesn't always look random
Especially at low n_neighbors values, spurious clustering can appear (see the sketch after this list).
- You need to visualize the results multiple times
Since the UMAP algorithm is stochastic, different runs with the same hyperparameters may produce different results. Also, since the choice of hyperparameters is so important, it can be very useful to run the projection multiple times with various hyperparameters.
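The last two caveats are easy to see for yourself; here is a sketch (pure Gaussian noise, a deliberately small n_neighbors, several seeds):

```python
# Sketch: UMAP on pure Gaussian noise with a small n_neighbors, several runs.
# Any "clusters" in the resulting plots are artifacts, not structure.
import matplotlib.pyplot as plt
import numpy as np
import umap

noise = np.random.default_rng(1).normal(size=(1000, 50))
fig, axes = plt.subplots(1, 3, figsize=(12, 4))
for seed, ax in enumerate(axes):
    emb = umap.UMAP(n_neighbors=3, random_state=seed).fit_transform(noise)
    ax.scatter(emb[:, 0], emb[:, 1], s=2)
    ax.set_title(f"random_state={seed}")
plt.show()
```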
Summary
UMAP is a very powerful tool in a data scientist's arsenal and has many advantages over t-SNE. While UMAP's output is somewhat similar to t-SNE's, its increased speed, better preservation of global structure, and easier-to-understand parameters make it a more effective tool for visualizing high-dimensional data. Finally, it's important to remember that no dimensionality reduction technique is perfect, and UMAP is no exception. However, by building an intuitive understanding of how the algorithm works and how to tune its parameters, we can use this powerful tool more effectively to visualize and make sense of large, high-dimensional datasets.