Using UMAP for Dimensionality Reduction

Dimensionality reduction is one of the most important techniques when working with large datasets, because it transforms the data into a lower-dimensional space where the important features and their properties are easier to identify. It is generally used to avoid the curse of dimensionality, which arises when analyzing high-dimensional data.

Dealing with high-dimensional data can be difficult when we are performing numerical analysis or building a Machine Learning model. A high-dimensional dataset can result in high variance, and the model will not generalize well. If we reduce the number of dimensions, we can make Machine Learning models more generalized and avoid over-fitting.

UMAP (Uniform Manifold Approximation and Projection) is a dimensionality reduction technique; its open-source Python implementation, umap-learn, makes it easy to reduce data to two dimensions and visualize the result.

In this article, we will explore some of the functionalities that UMAP provides.

Let’s get started…

We will start by installing the umap-learn library using pip. The command given below will do that.

!pip install umap-learn
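
If you want to confirm that the install worked, importing the package and printing its version is a quick sanity check (a minimal sketch, assuming the installed umap-learn release exposes the usual __version__ attribute; the value you see will depend on your environment):

import umap

# Print the installed umap-learn version to confirm the package is available
print(umap.__version__)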

In this step, we import the libraries required for loading the dataset, scaling the features, and visualizing the reduced dimensions. Note that seaborn is imported here because its color palette is used in the plotting step below.

import umap
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt
import seaborn as sns  # needed for the color palette used in the plot below
import pandas as pd
%matplotlib inline

For this article, we will use the famous Palmer Penguins dataset, which we will fetch from GitHub.

penguins = pd.read_csv("https://github.com/allisonhorst/palmerpenguins/raw/5b5891f01b52ae26ad8cb9755ec93672f49328a8/data/penguins_size.csv")
penguins.head()
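
Before going further, it can help to inspect what the CSV actually contains; this quick check (optional, and not part of the original walkthrough) lists the column names and counts missing values, so you can confirm the culmen_* measurement columns used below are present:

# Inspect the available columns and count missing values per column
print(penguins.columns.tolist())
print(penguins.isna().sum())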

After loading the dataset, we drop the null values and create a reducer object using UMAP. This reducer performs the dimensionality reduction, and we will then use the resulting embedding for visualization.

# Drop rows with missing values and check the class balance
penguins = penguins.dropna()
penguins.species_short.value_counts()

# Create the UMAP reducer with default parameters
reducer = umap.UMAP()

# Select the four numeric measurement columns
penguin_data = penguins[
    [
        "culmen_length_mm",
        "culmen_depth_mm",
        "flipper_length_mm",
        "body_mass_g",
    ]
].values

# Standardize the features, then reduce them to a 2D embedding
scaled_penguin_data = StandardScaler().fit_transform(penguin_data)
embedding = reducer.fit_transform(scaled_penguin_data)
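
The call above uses UMAP's defaults, but it is worth knowing the main parameters you can tune. The sketch below spells them out explicitly with their default values, plus a random_state for reproducibility; these are illustrative settings, not values tuned for this dataset:

# n_neighbors balances local vs. global structure, min_dist controls how
# tightly points are packed, and n_components sets the output dimensionality.
reducer = umap.UMAP(
    n_neighbors=15,   # default; larger values emphasize global structure
    min_dist=0.1,     # default; smaller values give tighter clusters
    n_components=2,   # 2D output so we can plot the embedding
    random_state=42,  # fix the seed for a reproducible layout
)
embedding = reducer.fit_transform(scaled_penguin_data)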

In this step, we plot the 2D embedding, coloring each point by penguin species.

# Color each point by its species using seaborn's default palette
species_to_color = penguins.species_short.map({"Adelie": 0, "Chinstrap": 1, "Gentoo": 2})
plt.scatter(embedding[:, 0], embedding[:, 1], c=[sns.color_palette()[x] for x in species_to_color])
plt.gca().set_aspect('equal', 'datalim')
plt.title('UMAP projection of the Penguin dataset', fontsize=24)
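
The snippet above does not produce a legend, because matplotlib only sees a list of colors. One way to add one (a small sketch reusing the same embedding and species_short column, not part of the original article) is to plot each species separately and label it:

# Plot each species separately so matplotlib can build a legend
for i, species in enumerate(["Adelie", "Chinstrap", "Gentoo"]):
    mask = (penguins.species_short == species).values
    plt.scatter(embedding[mask, 0], embedding[mask, 1],
                color=sns.color_palette()[i], label=species)
plt.gca().set_aspect('equal', 'datalim')
plt.legend(title="Species")
plt.title('UMAP projection of the Penguin dataset', fontsize=24)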

Here you can clearly visualize the dimensionality reduction for the Penguin dataset.
