By Jennifer Prendki, Founder and CEO @ Alectio, Machine Learning Entrepreneur.

The concept of agility is certainly a popular one in technology, but not one that you would naturally associate with data labeling. And it’s fairly easy to understand why: “Agile” typically evokes efficiency. Labeling, however, is hardly ever discussed in ML circles without triggering a flurry of frustrated sighs.

Figure 1: The Agile Manifesto describes a set of ‘rules’ that software developers believe would make them more productive.

To understand how Agile became so widely adopted, you need to go back to its origins. In 2001, a group of 17 software engineers met at a resort in Utah to brainstorm how to make their industry better. They thought the way projects were managed was inappropriate, inefficient, and overly regulated. So, they came up with the Agile Manifesto, a set of guidelines they thought could improve the throughput (and the level of sanity!) of software engineering teams. The Agile Manifesto was an outcry against rigid processes that were impeding progress. And in many ways, this is exactly what is needed for data labeling.

Figure 2: A deep dive into the Agile Manifesto and its core principles.

Back to Machine Learning. No question about it: the progress we have made in the field over the past decades is simply mind-boggling. So much so, in fact, that most experts agree that the technology has evolved too fast for our laws and institutions to keep up with. (Not convinced? Just think of the dire consequences that DeepFakes could have on world peace.) Still, despite the explosion of new AI products, the success of ML projects boils down to one thing: data. If you don’t have the means to collect, store, validate, clean, or process the data, then your ML model will remain a distant dream forever. Even OpenAI, one of the most prestigious ML companies in the world, decided to shut down one of their departments after coming to terms with the fact that they didn’t have the means to acquire the data necessary for their researchers.

And if you think all it takes is to find an open-source dataset to work with, think again: not only are the use cases for which relevant open-source data exists few and far between, but most of these datasets are also surprisingly error-ridden, and using them in production would be nothing short of irresponsible.

Naturally, with ever better and more affordable hardware, collecting your own dataset shouldn’t be much of a…
