Some neural networks are too big to use. There is a way to make them smaller but keep their accuracy. Read on to find out how.


Practical machine learning is all about tradeoffs. We can get better accuracy from neural networks by making them bigger, but in real life, large neural nets are hard to use. The problem arises not in training but in deployment: large neural nets can be trained successfully on giant supercomputer clusters, but the average person’s computer or phone cannot handle running them. If we want to use these networks in practice, we need to decrease their size while still maintaining accuracy. Is this possible?

This is an important question, so it’s not surprising that lots of research has been done. There are two approaches I want to highlight. One approach does the size reduction during training, systematically deleting the least important weights in the network. This is called pruning, and if you’re interested, you can read more about it here.

The second approach, and the one we’ll discuss here, is called knowledge distillation. Instead of decreasing the network size during training, we first train the network at full size. Then, we train another, smaller network using the fully trained big network as the source of truth. Using one network to train another is a type of transfer learning. At a high level, then, knowledge distillation has two steps: first, train the big network; then, use the big network to train the final, small network. Now let’s take a look at the details.

Because knowledge distillation has two training steps, we naturally split our data into two training sets. The second training set (used for training the small neural net) is called the transfer set. Because in the second training step we are using the big neural net as the source of truth (as opposed to data labels), we don’t need the transfer set to be labeled. This is a big advantage, and one of the reasons transfer learning methods like distillation are used. Let’s clarify what this means. The typical machine learning paradigm is to use some data x and labels y to learn a function f(x) that approximates y. What we do with the transfer set is replace y with g(x), where g is the trained, big network. Therefore, y is not needed.
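To make the "replace y with g(x)" idea concrete, here is a minimal sketch in Python with NumPy. Everything in it is an assumption for illustration: `teacher` is a hypothetical fixed function standing in for the trained big network g, the student f is a tiny linear model, and the squared-error loss and the name `distill` are my own choices rather than anything prescribed by the method. The point is only that the training loop never sees a label y.

```python
import numpy as np


def teacher(x):
    # Hypothetical stand-in for g, the already-trained big network.
    # In practice this would be a large model; a fixed function keeps the sketch short.
    return np.tanh(2.0 * x[:, 0] - x[:, 1])


def distill(x_transfer, g, steps=200, lr=0.1):
    """Fit a small linear student f(x) = x @ w + b to the teacher's outputs g(x)."""
    targets = g(x_transfer)  # g(x) takes the place of the labels y
    w = np.zeros(x_transfer.shape[1])
    b = 0.0
    losses = []
    for _ in range(steps):
        err = x_transfer @ w + b - targets  # f(x) - g(x)
        losses.append(float(np.mean(err ** 2)))
        # Gradient descent on the mean squared error between student and teacher.
        w -= lr * 2.0 * (x_transfer.T @ err) / len(err)
        b -= lr * 2.0 * float(np.mean(err))
    return w, b, losses


# Unlabeled transfer set: inputs only, no y anywhere.
rng = np.random.default_rng(0)
x_transfer = rng.normal(size=(500, 2))

w, b, losses = distill(x_transfer, teacher)
print(f"student loss: {losses[0]:.3f} -> {losses[-1]:.3f}")
```

A linear student cannot match this teacher exactly, so the loss levels off above zero. That gap is a toy version of the size-versus-accuracy tradeoff discussed at the start.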

Why is this good? In real…
