Essential guide to Pandas get_dummies() and Sklearn One-hot Encoder

Image by Mediamodifier from Pixabay

Machine learning algorithms require input data in the form of numerical vectors. Feature engineering, an important part of the data science model development life cycle, includes converting raw data into a numerical format suitable for training a robust model.

Data scientists are often said to spend about 80% of their time on data preparation and feature engineering, and the performance of the model depends heavily on the feature engineering strategy. A raw dataset typically contains features of several data types, including numerical, categorical, and date-time. Different feature engineering techniques convert each of these data types into numerical vectors.

Dummy encoding is a strategy for converting a categorical feature into a numerical vector format. Other techniques for encoding a categorical feature include Count Encoder, One Hot Encoder, Tf-Idf Encoder, etc.

pd.get_dummies() is a Pandas function that performs dummy encoding in a single line of code. Data scientists commonly use it for feature encoding, but it is not recommended for production or Kaggle competitions. In this article, we will discuss the reason behind this and the best alternative to the get_dummies() function.

The get_dummies() function from the Pandas library converts a categorical variable into dummy/indicator variables. It is a static encoding technique: each call encodes only the categories present in the data it is given and retains nothing between calls.

We will take a random dataset with 2 numerical features and 1 categorical feature (‘color’) for further demonstration. The ‘color’ categorical variable has 3 unique categories: green, red, blue.

(Image by Author), get_dummies() usage results for training data
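Since the original image is not reproduced here, the following is a minimal sketch of what such a call might look like. The column names `feature_1` and `feature_2` and all values are made up for illustration; only the ‘color’ feature and its three categories come from the article.

```python
import pandas as pd

# Hypothetical training data: two numerical features and one
# categorical 'color' feature (names and values are illustrative).
train = pd.DataFrame({
    "feature_1": [0.5, 1.2, 3.1, 2.2],
    "feature_2": [10, 20, 30, 40],
    "color": ["green", "red", "blue", "green"],
})

# Dummy-encode only the 'color' column; numerical columns pass through.
train_encoded = pd.get_dummies(train, columns=["color"])

# One indicator column is created per category seen in this DataFrame.
print(train_encoded.columns.tolist())
```

With this toy data, the encoded frame keeps the two numerical columns and gains one indicator column for each of the three colors.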

You can observe the encoded results of pd.get_dummies(). Using this function is not recommended in production or on Kaggle because it is static in nature: it does not learn the categories from the training data, so it cannot apply the same encoding to the test dataset.

The categorical feature color has 3 values: green, red, blue. In the training sample, this results in the color feature being encoded into 3 columns. But the test data may or may not contain all of these values, which can cause a column mismatch between training and test data during modeling.

(Image by Author), get_dummies() usage results for test data
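The mismatch can be sketched with a toy example. The data below is hypothetical and assumes a test set in which one of the training categories never appears; it is meant only to illustrate the behavior, not to reproduce the author's dataset.

```python
import pandas as pd

# Hypothetical data: the training set contains all three colors,
# while the test set happens to contain only two of them.
train = pd.DataFrame({"color": ["green", "red", "blue"]})
test = pd.DataFrame({"color": ["green", "red"]})

# Each call encodes only the categories present in its own input.
train_cols = pd.get_dummies(train, columns=["color"]).columns.tolist()
test_cols = pd.get_dummies(test, columns=["color"]).columns.tolist()

# Columns the model was trained on but that the test encoding lacks.
missing_in_test = sorted(set(train_cols) - set(test_cols))
print(missing_in_test)
```

A model trained on the three-column encoding would receive a two-column input at prediction time, which is the kind of data mismatch described above.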

The blue feature value is missing in the…
