The Transform Technology Summits start October 13th with Low-Code/No Code: Enabling Enterprise Agility. Register now!

Machine learning is becoming an important tool in many industries and fields of science. But ML research and product development present several challenges that, if not addressed, can steer your project in the wrong direction.

In a paper recently published on the arXiv preprint server, Michael Lones, Associate Professor in the School of Mathematical and Computer Sciences, Heriot-Watt University, Edinburgh, provides a list of dos and don’ts for machine learning research.

The paper, which Lones describes as “lessons that were learnt whilst doing ML research in academia, and whilst supervising students doing ML research,” covers the challenges of different stages of the machine learning research lifecycle. Although aimed at academic researchers, the paper’s guidelines are also useful for developers who are creating machine learning models for real-world applications.

Here are my takeaways from the paper, though I recommend anyone involved in machine learning research and development to read it in full.

Pay extra attention to data

Machine learning models live and thrive on data. Accordingly, across the paper, Lones reiterates the importance of paying extra attention to data across all stages of the machine learning lifecycle. You must be careful of how you gather and prepare your data and how you use it to train and test your machine learning models.

No amount of computation power and advanced technology can help you if your data doesn’t come from a reliable source and hasn’t been gathered in a reliable manner. And you should also use your own due diligence to check the provenance and quality of your data. “Do not assume that, because a data set has been used by a number of papers, it is of good quality,” Lones writes.

Your dataset might have various problems that can lead to your model learning the wrong thing.

For example, if you’re working on a classification problem and your dataset contains too many examples of one class and too few of another, then the trained machine learning model might end up learning to predict every input as belonging to the stronger class. In this case, your dataset suffers from “class imbalance.”

While class imbalance can be spotted quickly with data exploration practices, finding other problems needs extra care and experience. For example, if all the pictures in your…

Continue reading: