How descriptive statistics alone can mislead you

If you are new to data science and have taken a course on preliminary data analysis, chances are that one of the first steps you were taught in exploratory data analysis (EDA) is to view the summary (descriptive) statistics. But what do we really intend to accomplish with this step?

Summary statistics are important because they tell you two things about your data that matter for modeling: location and scale. Location parameters describe where the data is centered, most commonly through the mean (the median plays a similar role). Comparing the mean with the median and the quartiles gives a first hint of whether the data is roughly symmetric, as normally distributed data would be, or potentially skewed, which helps in modeling decisions.

For example, if the data is approximately normal, modeling techniques like ordinary least squares (OLS) may be sufficient and powerful enough for prediction. In addition, the way we do our initial data cleaning, such as the handling of missing data, may depend on whether the data exhibits normal behavior.

The scale parameter refers to the amount of dispersion in the data (e.g. standard deviation and variance). The larger this parameter, the more spread out the distribution is.
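As a minimal sketch of what location and scale look like in practice, the snippet below computes them for a small made-up sample using Python's standard `statistics` module. The sample values are hypothetical and chosen only so that one large value pulls the mean away from the median.

```python
import statistics

# Hypothetical right-skewed sample: most values are small, one is very large.
sample = [2, 3, 3, 4, 4, 5, 5, 6, 40]

mean = statistics.mean(sample)          # location, sensitive to the large value
median = statistics.median(sample)      # location, robust to the large value
std = statistics.stdev(sample)          # scale: sample standard deviation
variance = statistics.variance(sample)  # scale: sample variance (std squared)

print(f"mean={mean:.2f}, median={median}, std={std:.2f}, variance={variance:.2f}")
```

A mean well above the median, as in this sample, is a common sign of right skew, though it is only a hint and not a test of normality.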

So, looking at descriptive statistics is essential, particularly for modeling and research design purposes. Thinking that descriptive statistics are enough for EDA, however, may be one of the most costly assumptions a data professional could make.
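A toy illustration of the risk, using made-up points rather than any real dataset: the sketch below builds an X-shaped cloud and a plus-shaped cloud and compares their means, standard deviations and correlation. The helper names `pearson` and `summarize` are hypothetical and exist only for this example.

```python
import math
import statistics

def pearson(xs, ys):
    """Pearson correlation computed directly from its definition."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def summarize(points):
    """Return rounded location, scale and correlation for a list of (x, y) points."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    return {
        "mean_x": round(statistics.mean(xs), 2),
        "mean_y": round(statistics.mean(ys), 2),
        "std_x": round(statistics.stdev(xs), 2),
        "std_y": round(statistics.stdev(ys), 2),
        "corr": round(pearson(xs, ys), 2),
    }

ts = [-2, -1, 0, 1, 2]

# Points along the two diagonals form an "X".
x_shape = [(t, t) for t in ts] + [(t, -t) for t in ts]

# Points along the two axes form a "+"; scaling by sqrt(2) matches the spread of the "X".
r = math.sqrt(2)
plus_shape = [(r * t, 0.0) for t in ts] + [(0.0, r * t) for t in ts]

print("X shape:   ", summarize(x_shape))
print("plus shape:", summarize(plus_shape))
```

Both summaries come out identical to two decimal places even though the two point clouds look nothing alike when plotted, which is the kind of situation summary statistics alone cannot reveal.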

To see why, let’s perform an exercise in visualization using a popular dataset known as the datasaurus dataset.

It is common to think that visualization takes center stage only when we are reporting or communicating data, insights, or results.

The importance of never skipping the visualization portion of EDA, however, becomes apparent when looking into the datasaurus dataset created by Alberto Cairo. The dataset can be found in this link.

We can divide the dataset into 12 sub-datasets where we can individually check their descriptive statistics.

To see this firsthand, let’s first get the descriptive statistics of each of the 12 datasets.

# Preliminaries
import pandas as pd
import numpy as np
These are the 12 sub-datasets within the datasaurus dataset.
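# `df` is assumed to already hold the datasaurus data (the loading step is not shown here)
unique = df.dataset.unique()
for i in unique:
    temp = df[df.dataset == i]
    print(f"---Summary Statistics for '{i}' dataset---")
    print(np.round(temp.describe().loc[['mean', 'std']], 2))
    # numeric_only=True skips the text `dataset` column, which newer pandas versions would otherwise reject
    print(f"\nCorrelation: {np.round(temp.corr(numeric_only=True).iloc[0, 1], 2)}\n")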
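The per-dataset loop can also be written with `groupby`, which returns the same figures as tables that are easier to compare side by side. This is only a sketch: the small DataFrame below is a hypothetical stand-in for the real data, and the column names `x` and `y` are an assumption, since the original loading step is not shown.

```python
import pandas as pd

# Hypothetical stand-in for the datasaurus data: the column names
# `dataset`, `x` and `y` and all values are assumptions for illustration.
df = pd.DataFrame({
    "dataset": ["a", "a", "a", "b", "b", "b"],
    "x": [1.0, 2.0, 3.0, 1.0, 2.0, 3.0],
    "y": [2.0, 4.0, 6.0, 6.0, 4.0, 2.0],
})

# Mean and standard deviation of x and y, one row per sub-dataset.
summary = df.groupby("dataset")[["x", "y"]].agg(["mean", "std"]).round(2)

# Pearson correlation between x and y within each sub-dataset.
corr = (
    df.groupby("dataset")[["x", "y"]]
    .apply(lambda g: g["x"].corr(g["y"]))
    .round(2)
)

print(summary)
print(corr)
```

With the real data in `df`, the same two statements would produce one summary row and one correlation value per sub-dataset.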
