What makes a particular function, or a group of functions “learnable”? On the surface, this seems like an easy question. A simple answer would be to say: well, a function is learnable if there is some training algorithm that can be trained on the training set and achieve low error on the test set. After all, that is how the overwhelming majority of machine learning algorithms work.

Is that definition good enough? There’s one main problem with it, specifically with the training set. What training set are we talking about? Imagine a very unlucky training set that consists of one example duplicated many times. Any machine learning algorithm that sees that training set will learn that particular example very well, but it won’t learn anything else. As a consequence, the test set error (or, in more formal terms, the generalization error) will be high. We see that the performance of any algorithm depends on the quality of the sample it is trained on.

If a function can only be trained well on a few specific training sets, we can’t say it is learnable, even if the algorithm achieves great generalization error on those few training sets. Therefore, we need to add to our definition of learnable the caveat that the algorithm must work over many possible training sets. A more formal way of saying this is: a function is learnable if there exists an algorithm that with high probability, when that algorithm trains on a randomly selected training set, we get good generalization error.

The definition we just came up with is an informal way of stating the most popular theoretical definition of learnability — PAC learnability. PAC stands for “probably approximately correct”. “Probably” corresponds to the first part of our informal definition (with high probability, when that algorithm trains on a randomly selected training set), and “approximately correct” corresponds to the second part (we get good generalization error).

Mathematically, the setup of PAC learnability goes like this. We have a function f we’re trying to learn. Samples to train f are chosen from a distribution D. m is the sample size. E is the error function. δ and ε are both real numbers between 0 and 1. Finally, A is an arbitrary learning algorithm. We say f is PAC-learnable if for any δ and ε, there exists a learning algorithm A, and a sample size m that is polynomial in 1/δ and 1/ε, such that P_D(E(A(S)) < ε) > 1-δ. Let’s go through what this means step by step.