There is not a large body of practical work on validating Uniform Manifold Approximation and Projection (UMAP). In this blog post, I will walk through a real example, in the hope of providing an additional method for validating the algorithm's results.
A common practice is to validate UMAP's convergence via a downstream task. In classification, for example, you use an objective metric such as F1 score as a proxy for how well the dimensionality reduction performed. However, a high F1 score does not guarantee that UMAP accurately captured the data's structure. High accuracy on the downstream task only tells you that the data is separable in the lower-dimensional space, i.e. that the model performs well given its inputs.
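To make the pattern concrete, here is a minimal sketch of downstream-task validation: reduce the data, train a classifier on the embedding, and score it with F1. PCA stands in for UMAP here to keep the snippet dependency-light (an assumption of this sketch; the post itself uses UMAP via DenseClus), and the synthetic dataset and parameter values are illustrative only.

```python
# Sketch of the downstream-task validation pattern: reduce -> train -> score.
# NOTE: PCA is a stand-in reducer for illustration; the post uses UMAP.
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Synthetic data standing in for the churn dataset used later in the post
X, y = make_classification(n_samples=500, n_features=20, random_state=42)

# Reduce to 2 dimensions
X_low = PCA(n_components=2, random_state=42).fit_transform(X)

# Train and score a classifier on the low-dimensional embedding
X_tr, X_te, y_tr, y_te = train_test_split(X_low, y, random_state=42)
clf = LogisticRegression().fit(X_tr, y_tr)
f1 = f1_score(y_te, clf.predict(X_te))
print(f"F1 on 2-D embedding: {f1:.2f}")

# A high F1 here only says the classes remain separable in 2-D --
# it says nothing about whether neighborhood structure was preserved.
```

This is exactly the gap the next section addresses: the F1 score measures separability, not structure retention.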
Simply put, use both a measure of how well the underlying data's structure is retained and a downstream-task measure. Trustworthiness and Continuity provide the former.
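Trustworthiness (are neighbors in the embedding also neighbors in the original space?) can be computed directly with scikit-learn. Continuity (are neighbors in the original space preserved in the embedding?) is obtained below by the common heuristic of swapping the argument order, which is an assumption of this sketch rather than an official scikit-learn API; PCA again stands in for UMAP.

```python
# Minimal sketch of the structure-retention check described above.
# NOTE: PCA is a stand-in reducer; the post applies this check to UMAP.
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.manifold import trustworthiness

X, _ = load_digits(return_X_y=True)
X_low = PCA(n_components=2, random_state=0).fit_transform(X)

# Trustworthiness: penalizes points that are close in the embedding
# but far apart in the original space ("false neighbors").
trust = trustworthiness(X, X_low, n_neighbors=5)

# Continuity: penalizes original-space neighbors that are torn apart
# in the embedding; computed here by swapping the arguments (heuristic).
cont = trustworthiness(X_low, X, n_neighbors=5)

print(f"trustworthiness={trust:.3f}, continuity={cont:.3f}")
```

Both scores lie in [0, 1]; values near 1 indicate the embedding preserved local neighborhood structure well.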
This blog post will walk you through running Trustworthiness and Continuity as an extra check, using the Amazon DenseClus package, to confirm that UMAP converged to a stable result.
Before that, what is UMAP?
UMAP is a non-linear dimensionality reduction technique for high-dimensional data. Its output is visually similar to that of the t-SNE algorithm (which it often eclipses in speed). UMAP assumes that the data is uniformly distributed on a locally connected Riemannian manifold and that the Riemannian metric is locally constant, or approximately so (see "UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction" in the umap-learn documentation).
The UMAP paper ("UMAP: Uniform Manifold Approximation and Projection," McInnes et al., 2018) contains analysis that requires a PhD in topology to fully comprehend.
For now, let’s define it as a neighbor-based dimensionality reduction method that can handle numeric and/or categorical data.
If you desire a deeper level of understanding, check out the UMAP documentation link above or one of the PyData talks by the authors.
Fitting a UMAP
At any rate, let’s grab some data to work with.
You'll grab the data directly from the Churn Pipeline repo to run the example.
The original churn dataset is publicly available and mentioned in the book Discovering Knowledge in Data by Daniel T. Larose. It is attributed by the author to the University of California Irvine Repository of Machine Learning Datasets.
import matplotlib.pyplot as plt