Since I was in high school, I’ve had this weird obsession with squeezing the key concepts of everything I learn onto one page. Looking back, that was probably my lazy mind’s way of getting away with the least amount of work required to pass an exam…but interestingly, that abstraction effort also helped me learn those concepts at a deeper level and remember them longer. Nowadays, when I teach Machine Learning, I try to teach it in two parallel tracks: a) main concepts and b) methods and theoretical details, and I make sure my students can look at each new method through the lens of the same concepts.

Recently I got a chance to read “Machine Learning Yearning” by Andrew Ng, which seemed to be his version of abstracting some of the practical ML concepts without getting into any formulas or implementation details. While these tips may seem simple and obvious, as an ML engineer I can attest that losing sight of them is among the most common reasons ML research fails in production, and being mindful of them is what distinguishes good data science work from mediocre work. Here I wanted to summarize my takeaways from Andrew’s book on one page, in 5 important tips, so without further ado, let’s get to it:
1- How you split your available data matters…a lot!
Even if you are not a data scientist, you probably already know that to measure the generalization power of your algorithm, you should split your available data into train, dev (validation), and test sets, and that you are not supposed to use the test set for any model optimization/tuning. As obvious as it sounds, I have seen a lot of practitioners use their test set to manually optimize hyper-parameters and then report the best results on that same set as the generalization performance…you guessed it…that’s cheating! And they still get surprised when the model performance drops after deployment. Make sure not to use the test set for any kind of optimization, including manual hyper-parameter tuning; that’s what your dev/validation set is for.
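As a minimal sketch of this workflow (the dataset, model, and hyper-parameter grid below are toy stand-ins, not anything from the book): tune only against the dev set, and touch the test set exactly once at the end.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Toy data; in practice this is your compiled dataset.
X, y = make_classification(n_samples=1000, random_state=0)

# 60/20/20 split: carve out the test set first, then split the rest into train/dev.
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_dev, y_train, y_dev = train_test_split(X_rest, y_rest, test_size=0.25, random_state=0)

# Tune hyper-parameters against the dev set ONLY.
best_C, best_dev_acc = None, -1.0
for C in [0.01, 0.1, 1.0, 10.0]:
    model = LogisticRegression(C=C, max_iter=1000).fit(X_train, y_train)
    dev_acc = model.score(X_dev, y_dev)
    if dev_acc > best_dev_acc:
        best_C, best_dev_acc = C, dev_acc

# Only now, with hyper-parameters frozen, estimate generalization on the test set.
final_model = LogisticRegression(C=best_C, max_iter=1000).fit(X_train, y_train)
test_acc = final_model.score(X_test, y_test)
print(f"dev accuracy: {best_dev_acc:.3f}, test accuracy: {test_acc:.3f}")
```

The point of the structure is that `X_test` appears in exactly one scoring call, after all tuning decisions have been made.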
Also be aware of any data leakage from your train set into your dev or test set. A silly but common cause of leakage is the existence of duplicate instances in your data, which can end up in both the train and test sets. If you are doing joins to compile your final dataset, be particularly wary of this.
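A quick sanity check for this kind of duplicate leakage might look like the following (a sketch using pandas; the column names and toy frames are illustrative, not from the article):

```python
import pandas as pd

# Toy train/test frames; imagine these came out of a join that silently
# duplicated a row across splits.
train = pd.DataFrame({"user_id": [1, 2, 3, 4], "feature": [0.1, 0.2, 0.3, 0.4]})
test = pd.DataFrame({"user_id": [3, 5], "feature": [0.3, 0.5]})

# Exact-duplicate rows shared by both splits (merge on all common columns).
overlap = pd.merge(train, test, how="inner")
print(f"{len(overlap)} duplicated row(s) leak from train into test")

# One possible remedy: drop the leaked rows from the test set.
clean_test = pd.merge(test, train, how="left", indicator=True)
clean_test = clean_test[clean_test["_merge"] == "left_only"].drop(columns="_merge")
```

Running this check before any training is cheap insurance; a nonzero `overlap` means your test metric will be optimistically biased.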
In classical ML, when datasets were small, we…
Continue reading: https://towardsdatascience.com/5-simple-tips-to-supercharge-your-machine-learning-practice-fb40e850e491?source=rss—-7f60cf5620c9—4