I recently started a new newsletter focused on AI education that already has over 50,000 subscribers. TheSequence is a no-BS (meaning no hype, no news, etc.) AI-focused newsletter that takes 5 minutes to read. The goal is to keep you up to date with machine learning projects, research papers, and concepts. Please give it a try by subscribing below:


Most deep learning models we build these days are highly optimized for a specific type of dataset. Architectures that are good at processing textual data can't be applied to computer vision or audio analysis. That level of specialization naturally leads to models that excel at a single task but are unable to adapt to other tasks. This constraint contrasts sharply with human cognition, in which many tasks draw on diverse inputs such as vision and audio. Recently, DeepMind published two papers unveiling general-purpose architectures that can process different types of input datasets.

The first paper, titled “Perceiver: General Perception with Iterative Attention”, introduces Perceiver, a transformer architecture that can process data including images, point clouds, audio, video, and their combinations, but it is limited to simple tasks such as classification. In “Perceiver IO: A General Architecture for Structured Inputs & Outputs”, DeepMind presents Perceiver IO, a more general version of the Perceiver model that can be applied to complex multi-modal tasks such as computer games.

Both Perceiver models are based on transformer architectures. Despite the success of models like Google's BERT or OpenAI's GPT-3, most transformers have been effective mainly in scenarios with inputs of at most a few thousand elements. Data types such as images, videos, or books can contain millions of elements, which makes the use of transformers challenging. To address this, Perceiver relies on a general attention layer that makes no domain-specific assumptions about the input. Specifically, the Perceiver attention module first encodes the input into a smaller latent array whose processing cost is independent of the size of the input. This allows the Perceiver model to scale gracefully with its inputs.
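The core idea can be sketched in a few lines: a small learned latent array cross-attends to a large input array, so everything downstream operates on the latent array and its cost no longer depends on the input length. Here is a minimal NumPy sketch of that cross-attention step; all dimensions, weight matrices, and names are illustrative assumptions, not values from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attend(latents, inputs, Wq, Wk, Wv):
    """Latent array attends to the raw input array.

    latents: (N, D) small learned array; inputs: (M, C) large input array.
    Returns an (N, d) array whose size is independent of M.
    """
    q = latents @ Wq                                  # (N, d) queries from latents
    k = inputs @ Wk                                   # (M, d) keys from inputs
    v = inputs @ Wv                                   # (M, d) values from inputs
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))    # (N, M) attention weights
    return attn @ v                                   # (N, d) updated latents

# Illustrative sizes: 128 latents summarize 50,000 input elements.
N, D, M, C, d = 128, 256, 50_000, 3, 64
latents = rng.normal(size=(N, D))
inputs = rng.normal(size=(M, C))
Wq = rng.normal(size=(D, d))
Wk = rng.normal(size=(C, d))
Wv = rng.normal(size=(C, d))

out = cross_attend(latents, inputs, Wq, Wk, Wv)
print(out.shape)  # (128, 64): latent-sized, regardless of the 50k inputs
```

The cross-attention itself costs O(N·M), but because the output lives in the latent space, the repeated self-attention layers that follow cost only O(N²), which is what lets Perceiver handle inputs with millions of elements.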

Image Credit: DeepMind

Beyond the scalability benefits, the previous architecture allows the Perceiver model to achieve…

Continue reading: