In this article, you’ll learn what you need to know about SMOTE, a machine learning technique for dealing with the problems that arise when you train a model on an imbalanced data set. Imbalanced data sets are common in practice, so it is important to master the tools for working with them.

SMOTE stands for Synthetic Minority Oversampling Technique. The method was proposed in a 2002 paper in the Journal of Artificial Intelligence Research. Rather than simply duplicating existing minority-class observations, SMOTE generates new synthetic ones, which makes it an improvement over plain oversampling for imbalanced classification problems.

To get started, let’s review what exactly imbalanced data is and when it occurs.

Imbalanced data is data in which observed frequencies are very different across the different possible values of a categorical variable. Basically, there are many observations of some type and very few of another type.
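
A quick way to check for imbalance is to look at each class’s share of the observations. The snippet below is a minimal sketch using only the Python standard library; the labels are made up for illustration, and the helper name `class_shares` is an assumption rather than part of any library.

```python
from collections import Counter


def class_shares(labels):
    """Return each class's share of the observations, most frequent first."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.most_common()}


# Hypothetical labels: 95 observations of one class and 5 of the other.
labels = ["majority"] * 95 + ["minority"] * 5

shares = class_shares(labels)
print(shares)  # {'majority': 0.95, 'minority': 0.05}

# Ratio of the most frequent class to the least frequent one.
counts = Counter(labels)
imbalance_ratio = max(counts.values()) / min(counts.values())
print(imbalance_ratio)  # 19.0
```

There is no universal threshold at which data counts as imbalanced, but a ratio like the one above is a strong hint that the minority class needs special attention.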

SMOTE is a solution when you have imbalanced data.

As an example, imagine a data set about sales of a new mountain sports product on an e-commerce website. For simplicity, let’s say that the website sells to two types of clients: skiers and climbers.

For each visitor, we also record whether they buy the new product. Imagine that we want to build a classification model that uses customer data to predict whether a visitor will buy it.

Most e-commerce shoppers do not buy: many come only to look at products, and just a small percentage of visitors actually buy something. Our data set will therefore be imbalanced, with a huge number of non-buyers and a very small number of buyers.

The following schema represents our example situation:

In the data example, you see that we have 30 website visits: 20 by skiers and 10 by climbers. The goal is to build a machine learning model that can predict whether a visitor will buy.
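
To make the example concrete, here is a minimal Python sketch of this toy data set. It assumes exactly one buyer among the 20 skiers and one among the 10 climbers, which matches the 5% and 10% purchase rates used in this example. The names `visits` and `buy_rate` are illustrative only.

```python
# Toy data set: one (visitor_type, bought) pair per website visit.
# Assumption: exactly one buyer in each group.
visits = (
    [("skier", False)] * 19 + [("skier", True)]
    + [("climber", False)] * 9 + [("climber", True)]
)


def buy_rate(rows):
    """Share of rows in which the visitor bought the product."""
    return sum(bought for _, bought in rows) / len(rows)


overall = buy_rate(visits)
by_type = {
    kind: buy_rate([row for row in visits if row[0] == kind])
    for kind in ("skier", "climber")
}

print(f"visits: {len(visits)}")   # visits: 30
print(f"overall: {overall:.1%}")  # overall: 6.7%
for kind, rate in by_type.items():
    print(f"{kind}: {rate:.0%}")  # skier: 5%, then climber: 10%
```

Under this assumption only 2 of the 30 visits end in a purchase, so the target variable is heavily skewed towards non-buyers.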

This example has only one independent variable: whether the visitor is a skier or a climber. As a thought experiment, let’s consider two very simple models:

• a model that uses the variable “skier vs climber”
• a model that does not use the variable “skier vs climber”

I want to avoid going in-depth into different machine learning algorithms here, so let’s simply reason through whether the independent variable is useful for predicting buyers.

10% of climbers buy, whereas only 5% of skiers buy. Based on this data, we could say that climbers are more likely to buy than…