The first time I looked at a data set with over a billion rows, I thought my COUNT(*) was wrong.

It wasn’t.

That didn’t remove the need for me to analyze the data. The business had data (lots, apparently), the business needed insight from that data, and the business had no idea how to get it. So I swallowed the lump in my throat (it was my first real data science job) and did what any junior data scientist would have done at the time: I ported the data set into the enterprise analytics behemoth…SAS Enterprise Guide (EG; now I’m dating myself).

My anxiety didn’t stop there. Never mind the massive, vertically scaled memory on the server that supported our point-and-click SAS EG software; it still wasn’t enough to handle the overly complex cluster analysis I was attempting to perform. Memory errors abounded. So I took the next step any junior data scientist would take at this stage in their project: I started Googling.

And this was my first introduction to distributed data science.

Since that time, I have spent a considerable part of my career supporting Big Data analytics on different distributed frameworks. My anxiety has eased when I am confronted with data sets that keep getting larger (thanks, IoT: https://www.iotacommunications.com/blog/iot-big-data/ psht), but the world of distributed data science continues to change rapidly.

So today I want to share how I think about where those evolving frameworks fit in distributed data science workflows. Let’s get to it!

The biggest distinction to make when considering distributed data science frameworks is between exploratory data analysis (EDA) and data science deployments or, put another way, between distributed data in and distributed data out. Let’s start with EDA.

To be clear, I am taking a few liberties with the acronym “EDA” by lumping in all the things data scientists need to do before finalizing a trained model for deployment on new data. The typical pipeline includes accessing data sources, feature engineering and discovery, and model training.
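To make those three stages concrete, here is a minimal sketch in plain Python. It is not tied to any particular framework, and every name and number in it is hypothetical: the data is made up, the "partitions" stand in for chunks of data held by different workers, and the model is a simple least-squares line chosen only because it can be trained by summing small per-partition statistics.

```python
# Illustrative sketch only: all function names and data below are hypothetical.

def load_partitions():
    """Access data sources: return raw (reading, target) rows, already split into partitions."""
    return [
        [(10.0, 3.0), (20.0, 5.0)],  # rows held by a first worker
        [(30.0, 7.0), (40.0, 9.0)],  # rows held by a second worker
    ]

def engineer_features(partition):
    """Feature engineering: rescale the raw reading into the feature x, one partition at a time."""
    return [(raw / 10.0, y) for raw, y in partition]

def partial_stats(partition):
    """Per-partition sums needed to fit y = slope * x + intercept by least squares."""
    n = len(partition)
    sx = sum(x for x, _ in partition)
    sy = sum(y for _, y in partition)
    sxy = sum(x * y for x, y in partition)
    sxx = sum(x * x for x, _ in partition)
    return (n, sx, sy, sxy, sxx)

def combine(a, b):
    """Merge the statistics of two partitions by element-wise addition."""
    return tuple(u + v for u, v in zip(a, b))

def train(partitions):
    """Model training: combine the small per-partition statistics, then solve for the line."""
    stats = (0, 0.0, 0.0, 0.0, 0.0)
    for partition in partitions:
        stats = combine(stats, partial_stats(engineer_features(partition)))
    n, sx, sy, sxy, sxx = stats
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    intercept = (sy - slope * sx) / n
    return slope, intercept

slope, intercept = train(load_partitions())
print(slope, intercept)
```

The made-up targets follow y = 2x + 1, so the fit should recover a slope of 2 and an intercept of 1. The point of the design is that only five numbers per partition ever need to be brought together, which is the general shape a distributed framework exploits, whatever the actual model.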

In 2001, Banko and Brill, two researchers at Microsoft, demonstrated on a natural language disambiguation task that more training data can have a bigger effect on model performance than the specific type of model chosen. This idea has stuck with business leaders, even though there are several situations where it does not necessarily hold true.

Consequently, businesses are tasking their data scientists to leverage ever larger data sets when…

Continue reading: https://towardsdatascience.com/on-distributed-data-science-65b5f2a3d37f?source=rss----7f60cf5620c9---4

Source: towardsdatascience.com