Machine Learning Notes - What is UMAP?

1. Overview of UMAP

  Uniform Manifold Approximation and Projection (UMAP) is a dimensionality reduction technique that can be used for visualization similarly to t-SNE, but also for general nonlinear dimensionality reduction. UMAP is built on ideas from manifold learning and topological data analysis. It provides a very general framework for approaching manifold learning and dimensionality reduction, while also providing a concrete implementation.

  The algorithm is based on three assumptions about the data:

  The data is uniformly distributed on a Riemannian manifold;
  the Riemannian metric is locally constant (or can be approximated as such);
  the manifold is locally connected.

  Based on these assumptions, manifolds can be modeled with fuzzy topologies. Embeddings are found by searching for low-dimensional projections of the data with the closest equivalent fuzzy topology.
  UMAP offers many advantages over t-SNE, most notably increased speed and better preservation of the global structure of the data.
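
  Concretely, "closest" is measured with the fuzzy set cross entropy from the UMAP paper: if w_h(e) is the weight of edge e in the high-dimensional fuzzy graph and w_l(e) is its weight in the low-dimensional one, the embedding is chosen to minimize

  C = \sum_{e \in E} \left[ w_h(e) \log \frac{w_h(e)}{w_l(e)} + (1 - w_h(e)) \log \frac{1 - w_h(e)}{1 - w_l(e)} \right]

  over the positions of the embedded points.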

  The core of UMAP is very similar to t-SNE - both use a graph layout algorithm to arrange data in a low-dimensional space. In the simplest sense, UMAP builds a high-dimensional graph representation of the data, and then optimizes a low-dimensional graph to be as structurally similar as possible. While the math UMAP uses to build the high-dimensional graph is advanced, the intuition behind it is quite simple.

  To build the initial high-dimensional graph, UMAP constructs something called a "fuzzy simplicial complex". This is really just a representation of a weighted graph, with edge weights representing the likelihood that two points are connected. To determine connectivity, UMAP extends a radius outward from each point, connecting points when those radii overlap. Choosing this radius is critical - too small and you get many small, isolated clusters; too large and everything is connected together. UMAP overcomes this challenge by choosing the radius locally, based on the distance to each point's nth nearest neighbor. UMAP then "blurs" the graph by making connections less likely as the radius grows. Finally, by requiring that each point be connected to at least its closest neighbor, UMAP ensures that local structure is balanced against global structure.
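
  As a concrete illustration of this construction, here is a minimal sketch assuming numpy and scikit-learn; the function name fuzzy_graph and the binary-search details are illustrative only, and the real library instead uses approximate nearest neighbors (via pynndescent) and a far more efficient implementation.

import numpy as np
from sklearn.neighbors import NearestNeighbors

def fuzzy_graph(X, k=15):
    # k nearest neighbors of each point (column 0 is the point itself).
    dists, idx = NearestNeighbors(n_neighbors=k + 1).fit(X).kneighbors(X)
    dists, idx = dists[:, 1:], idx[:, 1:]

    n = X.shape[0]
    W = np.zeros((n, n))
    for i in range(n):
        rho = dists[i, 0]  # local radius: distance to the nearest neighbor
        # Binary-search a per-point scale sigma so that the total fuzzy
        # membership of i's neighborhood equals log2(k); this is how each
        # point gets its own locally adapted radius.
        lo, hi, sigma = 0.0, np.inf, 1.0
        for _ in range(64):
            total = np.exp(-np.maximum(dists[i] - rho, 0.0) / sigma).sum()
            if abs(total - np.log2(k)) < 1e-5:
                break
            if total > np.log2(k):
                hi = sigma
            else:
                lo = sigma
            sigma = lo * 2.0 if hi == np.inf else (lo + hi) / 2.0
        # Edge weights decay as the radius grows; the nearest neighbor always
        # gets weight 1, so every point stays connected to the graph.
        W[i, idx[i]] = np.exp(-np.maximum(dists[i] - rho, 0.0) / sigma)

    # Symmetrize with a fuzzy union: P(a or b) = a + b - a*b.
    return W + W.T - W * W.T

  The returned matrix is the weighted graph whose structure the low-dimensional layout is then optimized to match.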

2. UMAP installation

  UMAP depends on scikit-learn, which in turn depends on numpy and scipy.

pip install umap-learn

(1) Plotting function

  UMAP includes a umap.plot subpackage for plotting the results of UMAP embeddings. This package needs to be imported separately because it has extra dependencies (matplotlib, datashader, and holoviews). It enables fast, simple plotting. Example of use:

import umap
import umap.plot
from sklearn.datasets import load_digits

digits = load_digits()

# Fit UMAP and plot the embedding, colored by digit label.
mapper = umap.UMAP().fit(digits.data)
umap.plot.points(mapper, labels=digits.target)

  If you want to use the plotting functionality, you can install it with the following command:

pip install umap-learn[plot]

(2) Parametric UMAP

  Parametric UMAP provides support for training neural networks to learn UMAP-based transformations of the data. This can be used for faster inference on new, unseen data, more powerful inverse transforms, autoencoder versions of UMAP, and semi-supervised classification (especially for data that UMAP separates well but for which only a very limited amount of labeled data is available).
  If you wish to use Parametric UMAP, you need to install TensorFlow:

pip install umap-learn[parametric_umap]
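
  A minimal usage sketch: the ParametricUMAP class lives in the umap.parametric_umap submodule and follows the same fit/transform API as umap.UMAP.

from umap.parametric_umap import ParametricUMAP
from sklearn.datasets import load_digits

digits = load_digits()

# Trains a neural network that maps data into the embedding space.
embedder = ParametricUMAP()
embedding = embedder.fit_transform(digits.data)

# The learned network can then embed new, unseen data quickly.
new_embedding = embedder.transform(digits.data[:10])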

3. Using UMAP

import umap
import umap.plot
from sklearn.datasets import load_digits

digits = load_digits()

# Fit with custom parameters; keep the fitted model so it can be plotted.
mapper = umap.UMAP(n_neighbors=5,
                   min_dist=0.3,
                   metric='correlation').fit(digits.data)
embedding = mapper.embedding_
umap.plot.points(mapper, labels=digits.target)

  The main parameters that UMAP can set are as follows:

  n_neighbors: This determines the number of neighboring points used in the local approximation of the manifold structure. Larger values preserve more of the global structure at the cost of detailed local structure. In general this parameter should be in the range 5 to 50, with 10 to 15 being a reasonable default.

  min_dist: This controls how tightly the embedding is allowed to pack points together. Larger values ensure a more even distribution of embedded points, while smaller values allow the algorithm to optimize the local structure more accurately. Reasonable values are in the range 0.001 to 0.5, with 0.1 being a reasonable default.

  metric: This determines the metric used to measure distance in the input space. A wide variety of metrics are already implemented, and a user-defined function can be passed as long as it has been JIT-compiled by numba. An example is sketched below.
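
  Here is a sketch of passing a custom numba-JIT-compiled metric (an L1 distance written by hand; the function must take two 1-D arrays and return a float):

import numba
import umap
from sklearn.datasets import load_digits

@numba.njit()
def manhattan(x, y):
    # Sum of absolute coordinate differences (L1 distance).
    result = 0.0
    for i in range(x.shape[0]):
        result += abs(x[i] - y[i])
    return result

digits = load_digits()
embedding = umap.UMAP(metric=manhattan).fit_transform(digits.data)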

(1) Example 1

  UMAP is very effective at embedding large, high-dimensional datasets. In particular, it scales well in both the input dimension and the embedding dimension. For best performance, we recommend installing pynndescent, a nearest-neighbor computation library. UMAP can run without it, but runs faster when it is installed, especially on multi-core machines.

  For problems such as the 784-dimensional MNIST digit dataset with 70,000 data samples, UMAP can embed in under a minute (compared to about 45 minutes for scikit-learn's t-SNE implementation). Despite this runtime efficiency, UMAP still produces high-quality embeddings.
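
  A sketch of that kind of benchmark (runtimes will vary with hardware and with whether pynndescent is installed; fetch_openml downloads the dataset on first use):

import time
import umap
from sklearn.datasets import fetch_openml

# 70,000 samples x 784 dimensions.
mnist = fetch_openml('mnist_784', version=1, as_frame=False)

start = time.time()
embedding = umap.UMAP(n_neighbors=10, min_dist=0.001).fit_transform(mnist.data)
print(f'UMAP embedded MNIST in {time.time() - start:.1f}s')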

  The full MNIST digit dataset, embedded in 42 seconds on a 3.1 GHz Intel Core i7 processor (n_neighbors=10, min_dist=0.001), with pynndescent installed and after numba JIT warm-up:
[Figure: UMAP embedding of the MNIST digit dataset]

(2) Example 2

  The "Fashion MNIST" dataset (also 70,000 data samples in 784 dimensions). UMAP generated this embedding in 49 seconds (n_neighbors=5, min_dist=0.1)
[Figure: UMAP embedding of the Fashion MNIST dataset]

(3) Example 3

  The UCI Shuttle dataset (43,500 samples in 8 dimensions) embeds well under the correlation distance in 44 seconds (note that computing correlation distances takes longer):
[Figure: UMAP embedding of the UCI Shuttle dataset]
