Introducing a new VAE-based architecture to generate novel musical samples


Deep learning has radically transformed the fields of computer vision and natural language processing, not just in classification but also in generative tasks, enabling the creation of strikingly realistic images as well as artificially generated news articles. But what about the field of audio, or more specifically, music? In this project, we aim to create novel neural network architectures to generate new music, using 20,000 MIDI samples of different genres from the Lakh Piano Dataset, a popular benchmark dataset for recent music generation tasks.

This project was a group effort by Isaac Tham and Matthew Kim, senior-year undergraduates at the University of Pennsylvania.


Music generation using deep learning techniques has been a topic of interest for the past two decades. Music poses a different challenge from images along three main dimensions: firstly, music is temporal, with a hierarchical structure and dependencies across time. Secondly, music consists of multiple instruments that are interdependent and unfold across time. Thirdly, notes are grouped into chords, arpeggios and melodies — hence each time-step may have multiple outputs.

However, audio data has several properties that make it similar in some ways to what is conventionally studied in deep learning (computer vision and natural language processing, or NLP). The sequential nature of music is reminiscent of NLP, where Recurrent Neural Networks are widely used. Music also has multiple ‘channels’ (in terms of tones and instruments), reminiscent of the channels in images that Convolutional Neural Networks operate on. Additionally, deep generative models are an exciting new area of research, with the potential to create realistic synthetic data. Some examples are Variational Autoencoders (VAEs) and Generative Adversarial Networks (GANs), as well as language models in NLP.

Most early music generation techniques used Recurrent Neural Networks (RNNs), which naturally incorporate dependencies across time. Skuli (2017) used LSTMs to generate single-instrument music in the same fashion as language models. Nelson (2020) adapted this same method to generate lo-fi music.
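To make the language-model analogy concrete, here is a minimal sketch (not the model from either cited work) of LSTM-based next-note prediction, assuming a melody is encoded as a sequence of integer pitch tokens — a hypothetical vocabulary of 128 tokens, one per MIDI pitch. The model predicts a distribution over the next pitch, and generation samples from it autoregressively:

```python
import torch
import torch.nn as nn

class NoteLSTM(nn.Module):
    """Language-model-style next-note predictor over MIDI pitch tokens."""
    def __init__(self, vocab_size=128, embed_dim=64, hidden_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, vocab_size)

    def forward(self, tokens, state=None):
        x = self.embed(tokens)            # (batch, seq_len, embed_dim)
        out, state = self.lstm(x, state)  # (batch, seq_len, hidden_dim)
        return self.head(out), state      # logits over the next pitch

def generate(model, seed, length=32, temperature=1.0):
    """Autoregressively sample `length` new pitch tokens after a seed melody."""
    model.eval()
    out = list(seed)
    with torch.no_grad():
        # Feed the whole seed once, then sample one token at a time,
        # carrying the LSTM hidden state forward between steps.
        logits, state = model(torch.tensor([seed]), None)
        for _ in range(length):
            probs = torch.softmax(logits[0, -1] / temperature, dim=-1)
            nxt = torch.multinomial(probs, 1)
            out.append(nxt.item())
            logits, state = model(nxt.view(1, 1), state)
    return out

# Untrained weights produce random-sounding output; after training on MIDI
# sequences with cross-entropy on the shifted targets, sampling yields melodies.
melody = generate(NoteLSTM(), seed=[60, 62, 64], length=16)  # C-D-E seed
```

Training such a model is identical in spirit to training a character-level language model: each position's target is simply the next note in the sequence.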

Recently, Convolutional Neural Networks (CNNs) have been used to generate music with great success, with DeepMind in 2016 showing the effectiveness of WaveNet, which…
