Count data is everywhere. It sounds so easy to deal with: counts are just non-negative integers, nothing special. If you think so, then you are probably handling them *wrong*. How so? This blog offers some tips for working with count data in Machine Learning (ML), to help you avoid some common mistakes that you may never have noticed before.
Let’s start with a simple question.
Suppose that we are developing ML models about movie watching, and there is a field called “count of cartoon movies that the user watched in the last 6 months”. Since everyone has a different taste in movies, we see values like 0, 1, 2, 3…101 (yes, the user who watched 101 movies must be a huge fan).
Now here’s the question for you:
What statistical distribution might this count data follow?
If your answer is “normal distribution” or “I don’t know”, then congrats! I am sure this blog will help you.
If you don’t have much statistics knowledge, that’s fine. This blog focuses on hands-on ML techniques, though I also include some statistical details for readers who are curious.
First, let’s look at the plot and the data summary of this count data (a toy dataset created for demonstration). As we know, the fastest way to dive into a brand new dataset is to make some plots. So here we go:
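import numpy as np
import pandas as pd
import plotly.express as px
# Bulk of the toy data: 201 zeros, 50 ones, 40 twos
count = np.concatenate((np.zeros(201), np.repeat(1, 50), np.repeat(2, 40)))
# a and b are not defined in the original snippet; the values below are
# assumed placeholders for the long right tail (up to 101)
a = np.random.default_rng(0).integers(3, 30, size=100)
b = np.linspace(30, 101, 10).round()
df = pd.DataFrame(np.concatenate((count, a, b)), columns=['val'])
fig = px.histogram(df, x='val', nbins=200)
fig.show()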
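The data summary mentioned above can be produced in a few lines as well. Here is a minimal sketch, assuming a toy sample similar to the one plotted: the `bulk` and `tail` arrays are hypothetical stand-ins (the tail values in particular are made up), and pandas' `describe` does the rest.

```python
import numpy as np
import pandas as pd

# Assumed toy sample: many 0s, some 1s and 2s, plus a hypothetical long right tail.
rng = np.random.default_rng(0)
bulk = np.concatenate((np.zeros(201), np.repeat(1, 50), np.repeat(2, 40)))
tail = rng.integers(3, 102, size=110)  # made-up tail values between 3 and 101
val = pd.Series(np.concatenate((bulk, tail)), name="val")

summary = val.describe()  # count, mean, std, min, quartiles, max
zero_share = (val == 0).mean()

print(summary)
print("share of zeros:", round(zero_share, 2))
```

In a summary like this, a median (the `50%` row) sitting at 0 while the mean is well above it is a quick hint that the data is right-skewed.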
Now I am sure you would no longer assume it’s a normal distribution. And you probably notice that the data is…
…highly skewed with a long tail on the right, with a lot of 0s.
Actually, if we look at the median, it is 0, which indicates that at least half of the users did not watch any cartoon movies in the last 6 months. You may start to agree with me (if you didn’t before) that this kind of data needs to be treated properly in your ML model. Yes, this type of data has a specific name in statistics: zero-inflated, meaning the count data contains many more 0s than a standard count distribution such as the Poisson would produce.
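One rough way to check for zero inflation is to compare the observed share of 0s with the share a Poisson distribution with the same mean would imply, which is exp(-mean). The sketch below is only an illustration: the helper name `zero_inflation_check` and the tiny `counts` array are hypothetical, and a Poisson baseline is just one possible reference.

```python
import numpy as np

def zero_inflation_check(counts):
    """Compare the observed share of zeros with a Poisson baseline (hypothetical helper)."""
    counts = np.asarray(counts)
    mean = counts.mean()
    return {
        "mean": mean,
        "median": float(np.median(counts)),
        "observed_zero_share": float(np.mean(counts == 0)),
        # P(X = 0) for a Poisson distribution with the same mean
        "poisson_zero_share": float(np.exp(-mean)),
    }

# Made-up example: six users watched nothing, one user watched 20 movies.
counts = np.array([0, 0, 0, 0, 0, 0, 1, 2, 3, 20])
result = zero_inflation_check(counts)
for name, value in result.items():
    print(f"{name}: {value:.3f}")
```

If the observed share of 0s is far above the Poisson-implied share, as it is for this made-up sample, the data is a candidate for zero-inflated treatment.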
Next, in section 2 we will discuss the common mistakes that newbies have when dealing with skewed count…
Continue reading: https://towardsdatascience.com/avoid-mistakes-in-machine-learning-models-with-skewed-count-data-e3512b94d745?source=rss----7f60cf5620c9---4