In recent years, there has been growing interest in reinforcement learning (RL) algorithms that can learn entirely from fixed datasets, without further interaction with the environment (offline RL). A number of relatively unexplored challenges remain in this research field, such as how to get the most out of the collected data, how to work with growing datasets, and how to compose the most effective datasets.

In a new paper, a DeepMind research team proposes a clear conceptual separation of the RL process into data collection and knowledge inference, with the aim of improving RL data efficiency. The team introduces a “Collect and Infer” (C&I) paradigm, provides insights on how to interpret RL algorithms from the C&I perspective, and shows how the paradigm could guide future research into more data-efficient RL.

The key idea informing the C&I paradigm is the separation of RL into two distinct but interconnected processes: collecting data into a transition memory by interacting with the environment, and inferring knowledge about the environment by learning from the data stored in that memory.
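The two processes can be illustrated with a minimal Python sketch. It is not the paper's implementation: the five-state chain environment, the uniformly random behaviour policy, the tabular batch Q-iteration, and every name used (`step`, `collect`, `infer`, `Transition`) are assumptions chosen only to show the separation between filling a transition memory and learning from it.

```python
import random
from collections import namedtuple

# One entry of the transition memory.
Transition = namedtuple("Transition", "state action reward next_state done")

N_STATES = 5        # hypothetical chain with states 0..4; the goal is state 4
ACTIONS = (0, 1)    # 0 = move left, 1 = move right


def step(state, action):
    """Toy deterministic chain environment (an illustrative assumption)."""
    if action == 0:
        next_state = max(0, state - 1)
    else:
        next_state = min(N_STATES - 1, state + 1)
    done = next_state == N_STATES - 1
    reward = 1.0 if done else 0.0
    return next_state, reward, done


def collect(memory, behaviour_policy, episodes, max_steps=20):
    """Collect: interact with the environment and append to the memory."""
    for _ in range(episodes):
        state = 0
        for _ in range(max_steps):
            action = behaviour_policy(state)
            next_state, reward, done = step(state, action)
            memory.append(Transition(state, action, reward, next_state, done))
            if done:
                break
            state = next_state
    return memory


def infer(memory, gamma=0.9, sweeps=50):
    """Infer: batch Q-iteration over the fixed memory, with no environment access."""
    q = [[0.0 for _ in ACTIONS] for _ in range(N_STATES)]
    for _ in range(sweeps):
        for t in memory:
            bootstrap = 0.0 if t.done else gamma * max(q[t.next_state])
            # The environment is deterministic, so a full overwrite is sufficient.
            q[t.state][t.action] = t.reward + bootstrap
    return q


rng = random.Random(0)
# The behaviour policy (uniformly random) differs from the policy being learned.
memory = collect([], lambda state: rng.choice(ACTIONS), episodes=200)
q_values = infer(memory)
greedy_policy = [max(ACTIONS, key=lambda a: q_values[s][a])
                 for s in range(N_STATES - 1)]
print(greedy_policy)
```

In this sketch `infer` never calls `step`: it sees only the stored transitions, so the same memory could be reused to learn policies for other reward definitions, and `collect` could be replaced by a different exploration scheme without changing the inference code.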

To optimize each process, the team poses two questions: (1) optimal inference: given a fixed batch of data, what is the right learning setup to reach the maximally performing policy? and (2) optimal collection: given an inference process, what is the minimal set of data required to reach a maximally performing policy?

The team describes their algorithm development desiderata as:

  1. Learning is done offline in a ‘batch’ setting assuming fixed data as suggested by (1). Data may have been collected by a behaviour policy different from the one that is the learning target. This enables utilization of the same data to optimize for multiple objectives simultaneously, and coincides with interest in offline RL.
  2. Data-collection is a process that should be optimized in its own right. Naive exploration schemes that employ simple random perturbations of a task policy, such as epsilon greedy, are likely to be inadequate. The behaviour that is optimal for data collection in the sense of (2) may be quite different from the optimal behaviour for a task of interest.
  3. Treating data collection as a separate process offers novel ways to integrate known methods like skills, model-based approaches, or innovative exploration schemes into the learning process without biasing the final task solution.
  4. Data collection may happen concurrently with inference (in which case the two processes actively influence each…
