Many companies are learning that bringing a model that works in the research lab into production is much easier said than done.
Written by Bob Nugman, ML Engineer at DoorDash, and Aparna Dhinakaran, CPO of Arize AI. In this piece, Bob and Aparna discuss the importance of reliability engineering for ML initiatives.
Machine learning is quickly becoming a key ingredient in emerging products and technologies. This has caused the field to rapidly mature as it attempts to transform the process of building ML models from an art to an engineering practice. In other words, many companies are learning that bringing a model that works in the research lab into production is much easier said than done.
One particular challenge that ML practitioners face when deploying models into production environments is ensuring a reliable experience for their users. Just imagine: it’s 3 a.m. and you wake to a frantic phone call. You hop into a meeting and the CTO is on the line, asking questions. The number of purchases has suddenly plummeted in the newly launched market, and revenue is bleeding away by the minute. Social media is filling with an explosion of unsavory user reports. The clock is ticking. Your team is scrambling, but it’s unclear where to even start. Did a model fail in production? As the industry attempts to turn machine learning into an engineering practice, we need to start talking about solving this ML reliability problem.
An important part of engineering is ensuring reliability in our products, and those that incorporate machine learning should be no exception. At the end of the day, your users aren’t going to give you a pass because you are using the latest and greatest machine learning models in your product. They are simply going to expect things to work.
To frame our discussion about reliability in ML, let’s first take a look at what the field of software engineering has learned about shipping reliable software.
Virtually any modern technological enterprise needs a robust reliability engineering program. The scope and shape of such a program will depend on the nature of the business, and the choices will involve trade-offs among complexity, velocity, cost, and other factors.
A particularly important trade-off is between velocity (“moving fast”) and reliability (“not breaking things”). Some domains, such as fraud detection, require both.
Adding ML into the mix makes things even more interesting.
Continue reading: https://towardsdatascience.com/move-fast-without-breaking-things-in-ml-c070bfca2705