UMAP: Powerful Visualization & Anomaly Detection Tool


The central goal of dimensionality reduction is to reduce the number of dimensions while retaining as much of the original information as possible. The best-known methods are PCA and t-SNE, but both have drawbacks.

PCA is relatively fast, but it loses much of the data's underlying structure after reduction; t-SNE preserves that structure, but it is very slow.
UMAP (Uniform Manifold Approximation and Projection) is a dimensionality reduction and visualization algorithm proposed in 2018. It aims to offer speed closer to that of PCA while retaining as much of the data's structure as possible, and its visualizations are also attractive.

UMAP has some big wins in its current incarnation.

Introduction to UMAP
Uniform Manifold Approximation and Projection (UMAP) is a dimensionality reduction technique that can be used for visualization, much like t-SNE, but also for general nonlinear dimensionality reduction. The algorithm is founded on three assumptions about the data:

  • The data is uniformly distributed on a Riemannian manifold;
  • The Riemannian metric is locally constant (or can be approximated as such);
  • The manifold is locally connected.

Under these assumptions, the manifold can be modeled with a fuzzy topological structure. The embedding is found by searching for a low-dimensional projection of the data whose fuzzy topological structure is as close as possible to that of the original data.
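
To make the "fuzzy topology" idea more concrete, here is a minimal NumPy sketch of how fuzzy neighbor weights can be built from a k-nearest-neighbor graph. It is a simplified illustration and not the library's actual code: the function name fuzzy_graph, the brute-force distance computation, and the fixed bisection bounds are all assumptions made for this example. Each point gets its own scale sigma (reflecting the "locally constant metric" assumption), and the directed weights are merged with a fuzzy union.

```python
import numpy as np


def fuzzy_graph(X, n_neighbors=5, n_iter=64):
    """Simplified sketch of UMAP-style fuzzy neighbor weights (hypothetical helper)."""
    n = X.shape[0]
    # Brute-force pairwise Euclidean distances; fine for a small illustration.
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    # Indices of each point's k nearest neighbors, skipping the point itself.
    knn_idx = np.argsort(dists, axis=1)[:, 1:n_neighbors + 1]
    knn_dist = np.take_along_axis(dists, knn_idx, axis=1)

    target = np.log2(n_neighbors)
    weights = np.zeros((n, n))
    for i in range(n):
        rho = knn_dist[i, 0]  # distance to the nearest neighbor
        lo, hi = 1e-6, 1e3    # assumed search bounds for sigma
        # Bisection: find sigma so that the neighbor weights sum to log2(k).
        for _ in range(n_iter):
            sigma = 0.5 * (lo + hi)
            total = np.exp(-(knn_dist[i] - rho) / sigma).sum()
            if total > target:
                hi = sigma
            else:
                lo = sigma
        weights[i, knn_idx[i]] = np.exp(-(knn_dist[i] - rho) / sigma)

    # Fuzzy union (probabilistic t-conorm) makes the graph symmetric.
    return weights + weights.T - weights * weights.T


rng = np.random.default_rng(0)
X_demo = rng.normal(size=(30, 4))
G = fuzzy_graph(X_demo)
print(G.shape)
```

The embedding step then looks for low-dimensional coordinates whose own fuzzy graph matches G as closely as possible; that optimization is omitted here.
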
First, UMAP is fast. It can handle large datasets and high-dimensional data without much difficulty, scaling beyond what most t-SNE packages can manage. This includes very high-dimensional sparse datasets: UMAP has been used directly on data with more than one million dimensions.

Second, UMAP scales well in the embedding dimension, so it is not just for visualization. You can use UMAP as a general dimensionality reduction technique and as a preliminary step for other machine learning tasks. With a little care, it works well with the hdbscan clustering library (see "Clustering with UMAP" in the UMAP documentation for more details).

Third, UMAP often performs better than most t-SNE implementations at preserving some aspects of the global structure of the data. This means it can often provide a better "big picture" view of your data while still preserving local neighbor relationships.

Fourth, UMAP supports a variety of distance functions, including non-metric distance functions such as cosine distance and correlation distance. You can finally properly embed word vectors using cosine distance!

Fifth, UMAP supports adding new points to an existing embedding via the standard sklearn transform method. This means that UMAP can be used as a preprocessing transformer in sklearn pipelines.

Sixth, UMAP supports supervised and semi-supervised dimensionality reduction. If you want to use label information as extra input for the reduction (even if only some points are labeled), it is as simple as passing the labels as the y parameter of the fit method.

Seventh, UMAP supports a variety of additional experimental features, including: an "inverse transform" that approximates a high-dimensional sample that would map to a given location in the embedding space; the ability to embed into non-Euclidean spaces, including hyperbolic embeddings and embeddings with uncertainty; and very rudimentary support for embedding dataframes.

Finally, UMAP has a solid theoretical foundation in manifold learning (see the UMAP paper on arXiv). This both justifies the approach and allows for further extensions that will be added to the library soon.

The code is as follows:

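import umap
import umap.plot  # plotting helpers; need the optional extras: pip install "umap-learn[plot]"
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import QuantileTransformer
from sklearn.datasets import load_digits

# Load the 8x8 handwritten digits dataset (1797 samples, 64 features).
X, y = load_digits(return_X_y=True)

# Impute missing values (the digits data has none, so this step leaves it unchanged)
# and map each feature to a uniform distribution.
pipe = make_pipeline(SimpleImputer(), QuantileTransformer())
X_processed = pipe.fit_transform(X)

# Fit UMAP; passing y makes the embedding supervised by the digit labels.
manifold = umap.UMAP().fit(X_processed, y)

# Plot the 2-D embedding, colored by digit label.
umap.plot.points(manifold, labels=y, theme="fire")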


Origin blog.csdn.net/m0_47256162/article/details/122506244