This section assumes some knowledge of Python object-oriented programming (OOP). Specifically, the basics of creating classes and inheritance. If you are not already down with those, check out my Python OOP series, written for data scientists.
One of the most common scaling options for skewed data is a logarithmic transform. But here is a caveat: if a feature contains even a single 0, transforming it with np.log or Sklearn's PowerTransformer (in Box-Cox mode) will fail, since the logarithm of 0 is undefined and Box-Cox requires strictly positive inputs.
So, as a workaround, Kagglers add 1 to all samples and then apply the transformation. If the transformation is performed on the target array, you will also need an inverse transform. For that, after making predictions, you need to use the exponential function and subtract 1. Here is what it looks like in code:
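The workaround can be sketched in a few lines (a minimal example on toy data; NumPy's np.log1p and np.expm1 are built-in shortcuts for the same shift):

```python
import numpy as np

X = np.array([0.0, 1.0, 10.0, 100.0])  # contains a zero

# Shift by 1 so the zero becomes a valid input to the log
X_log = np.log(X + 1)          # equivalent to np.log1p(X)

# Inverse transform: exponentiate, then subtract the 1 back
X_restored = np.exp(X_log) - 1  # equivalent to np.expm1(X_log)
```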
This works, but we have the same old problem — we can't include it in a pipeline out of the box. Sure, we could use our newfound friend FunctionTransformer, but it is not well-suited for more complex preprocessing steps such as this.
Instead, we will write a custom transformer class and implement its fit and transform methods manually. In the end, we will again have a Sklearn-compatible estimator that we can pass into a pipeline. Let's start:
We first create a class that inherits from the BaseEstimator and TransformerMixin classes of sklearn.base. Inheriting from these classes allows Sklearn pipelines to recognize our class as a custom estimator.
Then, we will write the __init__ method, where we initialize an instance of PowerTransformer:
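A sketch of the skeleton (the class name CustomLogTransformer and the private attribute name _estimator are my own choices, not prescribed by Sklearn):

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import PowerTransformer


class CustomLogTransformer(BaseEstimator, TransformerMixin):
    def __init__(self):
        # Internal PowerTransformer that will do the actual work
        self._estimator = PowerTransformer()
```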
Next, we write the fit method, where we add 1 to all features in the data and fit the PowerTransformer:
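One way to write it (the class is restated so the snippet runs on its own):

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import PowerTransformer


class CustomLogTransformer(BaseEstimator, TransformerMixin):
    def __init__(self):
        self._estimator = PowerTransformer()

    def fit(self, X, y=None):
        # Shift by 1 so zero-valued samples are valid, then fit
        X_copy = np.copy(X) + 1
        self._estimator.fit(X_copy)
        return self
```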
The fit method should return the transformer itself, which is done by returning self. Let's test what we have done so far:
Next, we have the transform method, in which we use the transform method of PowerTransformer after adding 1 to the passed data:
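A sketch, mirroring the +1 shift used in fit (class restated from the previous steps so it runs standalone):

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import PowerTransformer


class CustomLogTransformer(BaseEstimator, TransformerMixin):
    def __init__(self):
        self._estimator = PowerTransformer()

    def fit(self, X, y=None):
        self._estimator.fit(np.copy(X) + 1)
        return self

    def transform(self, X):
        # Apply the same +1 shift before delegating to PowerTransformer
        return self._estimator.transform(np.copy(X) + 1)
```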
Let’s make another check:
Working as expected. Now, as I said earlier, we need a method for reverting the transform:
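One possible implementation delegates to the wrapped PowerTransformer's own inverse_transform and then subtracts the 1 back (the full class is restated so the snippet is self-contained):

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import PowerTransformer


class CustomLogTransformer(BaseEstimator, TransformerMixin):
    def __init__(self):
        self._estimator = PowerTransformer()

    def fit(self, X, y=None):
        self._estimator.fit(np.copy(X) + 1)
        return self

    def transform(self, X):
        return self._estimator.transform(np.copy(X) + 1)

    def inverse_transform(self, X):
        # Undo the power transform, then undo the +1 shift
        return self._estimator.inverse_transform(np.copy(X)) - 1
```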
We could also have used np.exp instead of inverse_transform (which works only when the transform is a plain logarithm). Now, let's make a final check:
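A final round trip on toy data (the class is restated so the snippet runs on its own; note the call to fit_transform, discussed next):

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import PowerTransformer


class CustomLogTransformer(BaseEstimator, TransformerMixin):
    def __init__(self):
        self._estimator = PowerTransformer()

    def fit(self, X, y=None):
        self._estimator.fit(np.copy(X) + 1)
        return self

    def transform(self, X):
        return self._estimator.transform(np.copy(X) + 1)

    def inverse_transform(self, X):
        return self._estimator.inverse_transform(np.copy(X)) - 1


rng = np.random.default_rng(42)
X = rng.exponential(size=(100, 3))

custom_log = CustomLogTransformer()
X_transformed = custom_log.fit_transform(X)
X_restored = custom_log.inverse_transform(X_transformed)
# X_restored should match X up to floating-point error
```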
But wait! We didn’t write
fit_transform– where did that come from? It is simple – when you inherit from
TransformerMixin, you get a
fit_transformmethod for free.
After the inverse transform, you can compare it with the original data:
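For example, with np.allclose (full class restated so the comparison runs standalone):

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import PowerTransformer


class CustomLogTransformer(BaseEstimator, TransformerMixin):
    def __init__(self):
        self._estimator = PowerTransformer()

    def fit(self, X, y=None):
        self._estimator.fit(np.copy(X) + 1)
        return self

    def transform(self, X):
        return self._estimator.transform(np.copy(X) + 1)

    def inverse_transform(self, X):
        return self._estimator.inverse_transform(np.copy(X)) - 1


rng = np.random.default_rng(42)
X = rng.exponential(size=(100, 3))

custom_log = CustomLogTransformer()
X_restored = custom_log.inverse_transform(custom_log.fit_transform(X))

print(np.allclose(X, X_restored))   # True, up to floating-point error
```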
Now, we have a custom transformer ready to be included in a pipeline. Let’s put everything together:
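A sketch of the assembled pipeline; the regressor, the synthetic data, and the train/test split below are illustrative stand-ins, not the article's originals:

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import PowerTransformer
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split


class CustomLogTransformer(BaseEstimator, TransformerMixin):
    def __init__(self):
        self._estimator = PowerTransformer()

    def fit(self, X, y=None):
        self._estimator.fit(np.copy(X) + 1)
        return self

    def transform(self, X):
        return self._estimator.transform(np.copy(X) + 1)

    def inverse_transform(self, X):
        return self._estimator.inverse_transform(np.copy(X)) - 1


# Synthetic skewed features and a linear target (stand-in data)
rng = np.random.default_rng(0)
X = rng.exponential(size=(300, 3))
y = X @ np.array([1.0, 2.0, 3.0]) + rng.normal(scale=0.1, size=300)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# The custom transformer slots into a pipeline like any built-in estimator
pipeline = make_pipeline(CustomLogTransformer(), LinearRegression())
pipeline.fit(X_train, y_train)
print(pipeline.score(X_test, y_test))   # R² on the held-out split
```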
Even though the log transform actually hurt the score, we got our custom pipeline…