Why analytics work should often prioritize discoverability and reproducibility, not version control and code review.


A critical aspect of scaling organizations is process. Process lets you normalize and embed best practices so that things work smoothly and scalably even when no one is minding the controls. Yet process in analytics organizations is frequently overlooked, and too often we default to the same processes that engineering abides by: namely, using git and related patterns to share and store analytics work.

For some parts of data science and analytics, this transference of engineering behaviors is suitable: analytics infrastructure, analytics engineering, deployment of machine learning models, libraries — all of these workflows are inherently code-based and benefit from the rigorous test-and-PR culture familiar in engineering organizations. But for the remainder of analytics work — the kind that occurs daily in SQL IDEs and Jupyter notebooks — the fit is poor. By the nature of our work as analysts and data scientists, 90% of our job is exploratory. And here, unfortunately, engineering practices not only fall short, but can be detrimental to the org. Why?

Blindly enforcing version control and code review as a gatekeeper for sharing exploratory work leads to unshared exploratory work.

So I’d argue we need a different process. And to understand what that process should be, we first need to establish the objectives of analytics organizations. In engineering, maintainability, reliability, and scalability are the objectives that underpin practices like version control, code review, code coverage, and validation testing. In analytics work, the underlying objectives are necessarily different: reliability, maintainability, and scalability still matter, but they manifest differently. Let’s ditch the emperor’s clothes and replace these concepts with what we really want: discoverability and reproducibility. In other words, we need to put the “science” back into data science (and analytics).

With these concepts in mind, I’ll discuss the following in this article:

  • Why discoverability and reproducibility are of the utmost importance in analytics and data science organizations.
  • How to orient processes towards those ends.
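As a small taste of what a reproducibility-oriented habit looks like in practice, here is a minimal sketch (the `run_analysis` function and the manifest format are illustrative, not from the article): save every exploratory result together with the exact parameters and random seed that produced it, so a colleague can regenerate the number instead of trusting a screenshot.

```python
import json
import random
from datetime import datetime, timezone

def run_analysis(seed: int, sample_size: int) -> float:
    """Stand-in for an exploratory computation (hypothetical example)."""
    rng = random.Random(seed)  # seeded RNG makes the run deterministic
    return sum(rng.random() for _ in range(sample_size)) / sample_size

# Record every input needed to regenerate the result, not just the result.
params = {"seed": 42, "sample_size": 1000, "as_of": "2024-01-01"}
result = run_analysis(params["seed"], params["sample_size"])

manifest = {
    "params": params,
    "result": result,
    "ran_at": datetime.now(timezone.utc).isoformat(),
}

# Anyone re-running with the same manifest reproduces the same number.
assert run_analysis(params["seed"], params["sample_size"]) == result
print(json.dumps(manifest, indent=2))
```

The point is not the specific format but the discipline: an analysis shared without its inputs is an anecdote; an analysis shared with a manifest like this is reproducible science.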

[Figure: an oversimplified engineering code base, where arrows indicate imports. Image by author.]


Source: towardsdatascience.com