Predict missing data using Random Forest and k-NN based Imputation

Image by Willi Heidelbach from Pixabay

A real-world dataset often has many missing records, caused by data corruption or a failure to record values. To train a robust machine learning model, handling missing values is an essential step in the feature engineering pipeline.

There are various imputation strategies that can be used to impute missing records for categorical, numerical, or time-series features. You can refer to one of my previous articles where I have discussed 7 strategies or techniques to handle missing records in the dataset.

In this article, we will discuss the implementation of missingpy, an open-source Python library that predicts missing values in a numerical feature using Random Forest and k-NN based models.

missingpy imputes missing data using prediction-based imputation strategies. Its API is similar to that of scikit-learn, so developers will find the interface familiar. As of now, missingpy supports only Random Forest and k-NN based imputation strategies.
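To make "prediction-based" concrete, here is a minimal NumPy sketch of the k-NN idea: a missing value is filled with the average of that feature over the most similar rows. This is not missingpy's implementation. The function name `knn_impute_column`, the toy array, and the assumption that only one column has missing values while all other columns are complete are illustrative choices.

```python
import numpy as np


def knn_impute_column(X, target_col, k=3):
    """Fill NaNs in one column with the mean of that column over the k nearest complete rows."""
    X = np.asarray(X, dtype=float).copy()
    other_cols = [c for c in range(X.shape[1]) if c != target_col]
    missing = np.isnan(X[:, target_col])
    donors = X[~missing]  # rows whose target value is known
    for i in np.where(missing)[0]:
        # Euclidean distance on the remaining features (assumed fully observed).
        dists = np.linalg.norm(donors[:, other_cols] - X[i, other_cols], axis=1)
        nearest = np.argsort(dists)[:k]
        X[i, target_col] = donors[nearest, target_col].mean()
    return X


# Toy data: the last row is missing its third feature.
X = np.array([
    [1.0, 1.0, 10.0],
    [1.1, 0.9, 12.0],
    [5.0, 5.0, 50.0],
    [5.2, 4.8, 54.0],
    [1.05, 1.0, np.nan],
])
filled = knn_impute_column(X, target_col=2, k=2)
print(filled[4, 2])
```

A Random Forest-based imputer follows the same pattern, except that the fill value comes from a forest trained on the rows where the feature is observed rather than from a neighbor average.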

We will use a few features from the Credit Card Fraud Detection dataset on Kaggle to impute missing records and evaluate how well missingpy performs.


missingpy can be installed from PyPI using:

pip install missingpy

KNNImputer and MissForest are the two imputer classes provided by the missingpy package.

We will use only 8 features and 25,000 instances from the credit card fraud detection dataset for the demonstration. As the dataset has no missing records, we will create a copy of the ‘Amount’ feature and replace some of its values with NaN.

After preparing the data, the copy of the ‘Amount’ feature, ‘Amount_with_NaN’, has 4,512 null records out of the 25,000 sampled records.
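The preparation step could look like the sketch below. It uses a small synthetic DataFrame as a stand-in for the Kaggle data and assumes the values to blank out are picked uniformly at random, which the article does not specify. The helper name `add_masked_copy`, the seeds, and the row counts are illustrative only.

```python
import numpy as np
import pandas as pd


def add_masked_copy(df, column, n_missing, seed=0):
    """Return a copy of df with a `<column>_with_NaN` column in which n_missing values are NaN."""
    rng = np.random.default_rng(seed)
    out = df.copy()
    new_col = f"{column}_with_NaN"
    out[new_col] = out[column].astype(float)
    # Assumption: rows to blank out are chosen uniformly at random, without replacement.
    missing_pos = rng.choice(len(out), size=n_missing, replace=False)
    out.iloc[missing_pos, out.columns.get_loc(new_col)] = np.nan
    return out


# Synthetic stand-in for the sampled credit card data (not the real dataset).
rng = np.random.default_rng(42)
df = pd.DataFrame({
    "V1": rng.normal(size=1000),
    "V2": rng.normal(size=1000),
    "Amount": rng.exponential(scale=80.0, size=1000).round(2),
})

df = add_masked_copy(df, "Amount", n_missing=180)
print(df["Amount_with_NaN"].isnull().sum())
```

Keeping the original ‘Amount’ column untouched is what makes it possible to score the imputation later against the true values.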

(Image by Author), Number of missing records

MissForest — Random Forest-based Imputation:

missingpy comes with a Random Forest-based imputation model that can be applied in a couple of lines of Python code using the MissForest class.

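from missingpy import MissForest
imputer = MissForest()  # Random Forest-based imputer with default settings
df_new = imputer.fit_transform(df)  # fit on df and return the data with NaNs filled in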

After instantiating the MissForest model, fit it on the dataset that has missing records. The fit_transform() method returns the dataset with the missing values imputed.

Now, let’s compare the imputed values with the real values of the ‘Amount’ feature and see how far the imputation deviates.
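One way to quantify that deviation is to score only the rows whose values were hidden. The sketch below is an assumption about how such a comparison could be done, not the article's own evaluation. The function name `imputation_errors`, the choice of MAE and RMSE as metrics, and the five-value example are made up for illustration.

```python
import numpy as np


def imputation_errors(actual, with_nan, imputed):
    """Compare imputed values to the held-back actual values on the masked rows only."""
    actual = np.asarray(actual, dtype=float)
    with_nan = np.asarray(with_nan, dtype=float)
    imputed = np.asarray(imputed, dtype=float)
    masked = np.isnan(with_nan)  # rows where the true value was hidden
    diff = imputed[masked] - actual[masked]
    return {
        "n_imputed": int(masked.sum()),
        "mae": float(np.mean(np.abs(diff))),
        "rmse": float(np.sqrt(np.mean(diff ** 2))),
    }


# Tiny made-up example: two of five values were hidden and then imputed.
actual = [10.0, 20.0, 30.0, 40.0, 50.0]
with_nan = [10.0, np.nan, 30.0, np.nan, 50.0]
imputed = [10.0, 23.0, 30.0, 36.0, 50.0]
errors = imputation_errors(actual, with_nan, imputed)
print(errors)
```

With the article's data, `actual` would be the ‘Amount’ column, `with_nan` the ‘Amount_with_NaN’ column, and `imputed` the corresponding column of the imputer's output.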
