By Vidhi Chugh, Data Scientist


Andrew Ng is probably the reason most aspiring data scientists find it easy to break into this field. The ease with which he explains the most technical concepts is unparalleled.

I look up to him for more reasons than one, but my technical writing draws its main motivation from him, i.e., making difficult jargon easy for everyone to understand. He makes learning data science an art rather than a tedious march through a vast curriculum that often becomes overwhelming.

I recently watched the recording where he introduced the difference between model-centric and data-centric AI. This was something I had observed in the ML projects I have delivered; however, I could not articulate it in principle with such detail, thanks to imposter syndrome. But when Andrew explained the importance of working more on the data rather than frantically trying cutting-edge and advanced algorithms, it totally resonated with me.

I am writing this article to keep my notes and a summary of his talk; maybe you will find them useful too.

The current model-centric state of data science projects tends to hit a wall beyond a certain point. There is only so much you can gain from trying multiple sophisticated models when your data does not allow you to go further.

The whole spectrum of experimenting with different models and checking what works best for the given data and business case does not keep the ball rolling for long. If your best model does not meet the metric the business requires to give the project a go-ahead, understand that it is time to get closer to the data and dig into which part of it is not qualified enough to make it into the training set. This is where you analyze whether there are specific attributes of the test data for which your predictions are far from reality.
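A simple way to start this analysis is to slice the test-set error rate by an attribute and look for slices where the model performs much worse. Below is a minimal sketch in pure Python; the record format (dicts with an attribute, a `label`, and a `prediction` key) and the `image_quality` attribute are hypothetical examples, not something prescribed in the talk.

```python
from collections import defaultdict

def error_rate_by_slice(records, attribute):
    """Compute the error rate per value of a categorical attribute,
    to surface data slices where predictions are far from reality."""
    totals, errors = defaultdict(int), defaultdict(int)
    for r in records:
        key = r[attribute]
        totals[key] += 1
        if r["prediction"] != r["label"]:
            errors[key] += 1
    return {k: errors[k] / totals[k] for k in totals}

# Toy test set: the model struggles with blurry images.
test_records = [
    {"image_quality": "sharp", "label": 1, "prediction": 1},
    {"image_quality": "sharp", "label": 0, "prediction": 0},
    {"image_quality": "sharp", "label": 1, "prediction": 1},
    {"image_quality": "blurry", "label": 1, "prediction": 0},
    {"image_quality": "blurry", "label": 0, "prediction": 1},
    {"image_quality": "blurry", "label": 0, "prediction": 0},
]

print(error_rate_by_slice(test_records, "image_quality"))
```

A slice with a disproportionately high error rate tells you where to collect or clean more data, instead of swapping in yet another model.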

Data-centric Approach


Source: created by the author using PowerPoint

Let us see what we can do with data under the 'data-centric' approach:

Data label quality: It is entirely possible that different labelers gave different labels to the same sections of data. If there is inconsistency in how the human experts see a particular problem, there is only a slim chance that machines will pick it up either.
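One quick check for this inconsistency is to measure how often two labelers agree on the same examples. Here is a minimal sketch (the labelers and their labels are made-up illustrations); in practice you might also use a chance-corrected statistic such as Cohen's kappa.

```python
def labeler_agreement(labels_a, labels_b):
    """Fraction of examples on which two labelers assign the same label."""
    assert len(labels_a) == len(labels_b), "labelers must rate the same examples"
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return matches / len(labels_a)

# Hypothetical labels from two human experts on five examples.
labeler_1 = ["defect", "ok", "defect", "ok", "defect"]
labeler_2 = ["defect", "ok", "ok", "ok", "defect"]

print(labeler_agreement(labeler_1, labeler_2))  # 0.8
```

Low agreement on some subset of the data is a signal to clarify the labeling instructions before touching the model.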

Data Augmentation: Generate data that your model has not seen during training…
