How to find the best-matching statistical distributions for your data points in an automated and easy way, and then how to extend the utility further.

Image source: Prepared by the author with Pixabay image (Free to use)

You have some data points. Numeric, preferably.

And you want to find out which statistical distribution they might have come from. Classic statistical inference problem.

There are, of course, rigorous statistical methods to accomplish this goal. But maybe you are a busy data scientist. Or an even busier software engineer who has been handed this dataset and asked to quickly write an application endpoint that finds the distribution best matching the data, so that another machine learning app can use synthetic data generated from that distribution.

In short, you don’t have a lot of time on hand and want to find a quick method to discover the best-matching distribution that the data could have come from.

Basically, you want to run an automated batch of goodness-of-fit (GOF) tests on a number of distributions and summarize the result in a flash.

You can, of course, write code from scratch to run the data through standard GOF tests using, say, the SciPy library, one by one, for a number of distributions.
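To give a feel for what that from-scratch route looks like, here is a minimal sketch that fits a few SciPy distributions and runs a Kolmogorov-Smirnov test on each. The candidate list, the sample data, and the variable names are assumptions made for illustration, not something prescribed by any library.

```python
import numpy as np
from scipy import stats

# Illustrative data (assumption): 1000 draws from a Normal distribution
rng = np.random.default_rng(42)
data = rng.normal(loc=5.0, scale=2.0, size=1000)

# Hypothetical shortlist of candidate distributions (names from scipy.stats)
candidates = ["norm", "expon", "uniform"]

results = []
for name in candidates:
    dist = getattr(stats, name)
    params = dist.fit(data)  # maximum-likelihood parameter estimates
    # KS test of the data against the fitted distribution
    statistic, p_value = stats.kstest(data, name, args=params)
    results.append((name, statistic, p_value))

# A smaller KS statistic means a closer match
results.sort(key=lambda r: r[1])
for name, statistic, p_value in results:
    print(f"{name:8s} KS statistic = {statistic:.4f}, p-value = {p_value:.4g}")
```

Keep in mind that the p-values here are optimistic, because the parameters were estimated from the same data that is being tested. For a quick ranking of candidates, the KS statistic itself is usually enough.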

Or, you can use a small but useful Python library, distfit, to do the heavy lifting for you.

As per their website, distfit is a Python package for probability density fitting of univariate distributions. It determines the best fit across 89 theoretical distributions using the Residual Sum of Squares (RSS) and other measures of GOF.
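If the RSS idea is unfamiliar, the sketch below shows one plausible way to score candidates with it: compare the empirical density from a histogram against each fitted PDF and sum the squared differences. This illustrates the general technique only and is not distfit's internal code. The bin count, the candidate list, and the helper name rss_for are assumptions.

```python
import numpy as np
from scipy import stats

# Illustrative data (assumption): 2000 draws from a standard Normal
rng = np.random.default_rng(0)
data = rng.normal(loc=0.0, scale=1.0, size=2000)

# Empirical density from a histogram, evaluated at the bin centers
density, edges = np.histogram(data, bins=50, density=True)
centers = (edges[:-1] + edges[1:]) / 2


def rss_for(name):
    """Fit a scipy.stats distribution and return the RSS against the histogram."""
    dist = getattr(stats, name)
    params = dist.fit(data)
    pdf = dist.pdf(centers, *params)
    return float(np.sum((density - pdf) ** 2))


# Hypothetical shortlist of candidates; lower RSS means a closer match
scores = {name: rss_for(name) for name in ["norm", "expon", "uniform"]}
best = min(scores, key=scores.get)
print("Best match by RSS:", best)
print(scores)
```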

Let’s see how to use it. Here is the demo notebook.

Install as usual,

pip install -U distfit

Generate test data and fit it

Generate some Normally distributed test data and fit a distfit object to it.


How good is the fit?

So, was the fit any good?

Note that, in the code above, the model dist1 has no knowledge of the generative distribution or its parameters, i.e., the loc and scale parameters of the Normal distribution, or of the fact that we called np.random.normal to generate the data.

We can test the goodness of fit and the estimated parameters in one shot by a simple piece of code,

dist1.plot(verbose=1)

Here is the expected plot (note that the plot will definitely look somewhat different in your case because of the random nature of the generated data).

Note the loc and scale parameters as…

Continue reading: https://towardsdatascience.com/find-the-best-matching-distribution-for-your-data-effortlessly-bcc091aa08ab

Source: towardsdatascience.com