As a subdomain of Machine Learning, Reinforcement Learning (RL) is often likened to a black box. You try a couple of actions, feed the resulting observations into a neural network, and out roll some values: an esoteric policy telling you what to do in any given circumstance.

When traversing a frozen lake or playing a video game, you will see soon enough whether that policy is of any use. However, there are many problems out there without a clear notion of solution quality, without lower and upper bounds, without visual aids. Think of controlling a large fleet of trucks, of rebalancing a stock portfolio over time, of determining order policies for a supermarket. For problems like these, determining whether your RL algorithm is any good becomes surprisingly hard.

For such problems, having some quick-and-dirty baseline policies at hand is essential during algorithmic development. The three policies outlined in this article are very easy to implement, serve as a sanity check, and immediately tell you when something is off.

Most RL algorithms have some exploration parameter, e.g., an ϵ that translates to taking random actions 5% of the time. Set it to 100% and you are exploring all the time; easy to implement indeed.
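As a minimal sketch of this baseline (the action names and the fleet-control setting below are illustrative, not from the article), a fully random policy amounts to ϵ-greedy with ϵ = 1:

```python
import random

def random_policy(actions, state=None):
    # Epsilon-greedy with epsilon = 1.0: the state is ignored entirely
    # and an action is sampled uniformly at random.
    return random.choice(actions)

# Hypothetical fleet-control example: three possible dispatch actions
action = random_policy(["dispatch", "hold", "reroute"], state={"trucks_idle": 4})
```

Averaging the rewards this policy collects over a few hundred episodes gives the floor your learned policy has to clear.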

It is obvious that a blindfolded monkey throwing darts is not a brilliant policy, and that is precisely why your RL algorithm should always — consistently and substantially — outperform it.

There is a bit more to it though, especially if you are not sure to what extent your environment is predictable. If you fail to comprehensively outperform the random baseline, that could be an indication that there are simply no predictable patterns to be learned. After all, even the most sophisticated neural network cannot learn anything from pure noise. (Unfortunately, it could also be that your algorithm just sucks.)

The great appeal of RL is that it allows solving complicated sequential decision problems. Determining the best action right now might be a straightforward problem, but anticipating how that action keeps affecting our rewards and environment long afterwards is another matter.

Naturally, if we put all that effort into modeling and learning from the future, we want to see superior results. If we could make decisions of similar quality without considering their downstream impact, why bother doing more? For most problems, the myopic policy simply maximizes (minimizes) the direct rewards (costs) and is…
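To make the myopic baseline concrete, here is one possible sketch; the supermarket setting and the cost figures are invented for illustration, not taken from the article:

```python
def myopic_policy(state, actions, direct_reward):
    # Greedy one-step baseline: pick the action maximizing the immediate
    # reward and ignore all downstream effects (a discount factor of zero).
    return max(actions, key=lambda a: direct_reward(state, a))

# Hypothetical supermarket example: choose today's order quantity by
# today's cost alone (holding cost vs. stockout penalty are made up)
def direct_reward(stock, order):
    holding_cost = 0.1 * (stock + order)                 # cost of carrying inventory
    stockout_cost = 5.0 if stock + order < 10 else 0.0   # penalty if shelves run empty
    return -(holding_cost + stockout_cost)               # reward = negative cost

best = myopic_policy(state=8, actions=[0, 2, 5, 10], direct_reward=direct_reward)
# With stock 8, ordering 2 just clears the stockout threshold at minimal holding cost
```

If your full RL algorithm cannot beat this one-step greedy rule, the lookahead machinery is not earning its keep.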