A music generation model launched by OpenAI: Jukebox

(Audio samples are available on SoundCloud)

OpenAI released its music generation model, Jukebox, in April 2020.

Jukebox is able to generate complete music and vocal songs in multiple genres and artist styles based on the lyrics, artist and genre information provided.

The most remarkable part is that the quality was already this good three years ago…

And Jukebox 2 is rumored to be released soon…

Large-scale music dataset training

Training was based on a large dataset of 1.2 million songs paired with their lyrics and metadata.
With these rich data resources, Jukebox is able to learn and imitate complex musical structures and styles.

Key features:

1. Diverse musical styles: Jukebox can generate music in a wide range of genres and artist styles, including rudimentary singing. This means Jukebox can not only create instrumental music, but also generate songs that include human voices.

2. Raw audio output: Unlike models that generate only symbolic music data (such as MIDI), Jukebox generates raw audio, including melody, harmony, and singing voice. This makes the generated music sound more natural and closer to a real performance.

3. Generating music from lyrics: Jukebox can generate new music samples based on the lyrics, artist, and musical style provided. Given creative guidance, it can create new music from scratch, even for lyrics it never saw during training.

4. Synchronization of lyrics and melody: Jukebox does not just generate music; it sings the provided lyrics in time with the generated melody, aligning the words with the vocal line.

5. Style and artist imitation: It can generate music based on specified artists and musical styles, allowing users to guide the generation process to produce music that matches a specific style or theme.

Technical principles:

1. VQ-VAE: Jukebox uses a technique called VQ-VAE (Vector Quantized Variational AutoEncoder) to compress audio data into a lower-dimensional discrete representation while retaining important characteristics of the music such as pitch, timbre, and volume.
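The core of vector quantization is mapping each continuous latent vector to its nearest entry in a learned codebook, so the audio becomes a sequence of discrete tokens. The following is a minimal NumPy sketch of that lookup step only, with a toy codebook; it is not Jukebox's actual implementation, which trains much larger codebooks end to end:

```python
import numpy as np

def quantize(latents, codebook):
    """Map each latent vector to the index of its nearest codebook entry
    (the core vector-quantization step of a VQ-VAE)."""
    # Squared Euclidean distance between every latent and every code.
    dists = ((latents[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    codes = dists.argmin(axis=1)      # one discrete token per latent frame
    reconstructed = codebook[codes]   # what the decoder would receive
    return codes, reconstructed

rng = np.random.default_rng(0)
codebook = rng.normal(size=(8, 4))   # toy codebook: 8 codes of dimension 4
latents = rng.normal(size=(16, 4))   # toy encoder output: 16 latent frames
codes, recon = quantize(latents, codebook)
print(codes.shape, recon.shape)      # (16,) (16, 4)
```

The discrete `codes` sequence is what the next stage (the Transformer prior) models autoregressively.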

2. Transformer model: On top of the VQ-VAE, Jukebox uses a Transformer model to generate new sequences of music codes. These codes are then decoded back into raw audio to produce new music fragments. Transformers handle long-range dependencies well, which suits music, a domain that requires long-term coherence.
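Generation at this stage is ordinary autoregressive sampling over the discrete codes: the prior predicts a distribution for the next token given everything sampled so far. This toy sketch replaces the Transformer with a dummy logits function, just to show the sampling loop shape:

```python
import numpy as np

def sample_codes(n_steps, vocab_size, logits_fn, rng):
    """Autoregressively sample a sequence of discrete audio codes.
    `logits_fn` stands in for the Transformer prior: it maps the codes
    generated so far to next-token logits."""
    codes = []
    for _ in range(n_steps):
        logits = logits_fn(codes)
        probs = np.exp(logits - logits.max())   # softmax, numerically stable
        probs /= probs.sum()
        codes.append(int(rng.choice(vocab_size, p=probs)))
    return codes

# Dummy "prior" that slightly favors repeating the previous token.
def dummy_logits(codes, vocab_size=16):
    logits = np.zeros(vocab_size)
    if codes:
        logits[codes[-1]] += 1.0
    return logits

rng = np.random.default_rng(0)
seq = sample_codes(32, 16, dummy_logits, rng)
print(len(seq))  # 32
```

In the real system, the sampled code sequence is handed to the VQ-VAE decoder to render audio.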

3. Hierarchical structure: Jukebox adopts a three-level VQ-VAE, with each level using a different compression rate and capturing a different level of audio detail, allowing the model to learn the structure of music at multiple timescales.
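To get a feel for what the hierarchy buys, consider the token rates at each level. Assuming 44.1 kHz audio and the roughly 8x, 32x, and 128x compression factors reported in the Jukebox paper, the tokens-per-second at each level work out to:

```python
# Tokens per second of audio at each VQ-VAE level, assuming a 44.1 kHz
# sample rate and the ~8x/32x/128x compression factors from the paper.
sample_rate = 44_100
rates = {factor: sample_rate / factor for factor in (8, 32, 128)}
for factor, rate in rates.items():
    print(f"{factor:>3}x compression -> {rate:.1f} tokens/sec")
```

The most compressed (top) level models long-range song structure cheaply, while the least compressed (bottom) level recovers fine audio detail.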

4. Conditional generation: The Jukebox model can generate music conditioned on information such as artist, genre, and lyrics. This is achieved by feeding this information as additional input during training, so the generated music reflects the specified characteristics.

5. Automatic lyrics alignment: Because precisely aligned lyrics data is scarce, Jukebox uses a heuristic to estimate the correspondence between lyrics and audio, and then improves on that estimate with a learned lyrics-alignment mechanism.
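The simplest such heuristic is a linear one: spread the lyric characters evenly over the song's duration as a first-pass alignment. This sketch illustrates that idea only; it is an assumption-level simplification, not the paper's exact procedure:

```python
def linear_alignment(lyrics, duration_s):
    """Heuristically spread lyric characters evenly across the song
    duration, giving each character an estimated start time in seconds."""
    step = duration_s / max(len(lyrics), 1)
    return [(ch, round(i * step, 3)) for i, ch in enumerate(lyrics)]

timed = linear_alignment("la la la", duration_s=4.0)
print(timed[0], timed[-1])  # ('l', 0.0) ('a', 3.5)
```

A coarse estimate like this gives the model a starting point that attention over the lyrics can then refine.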

Project address: https://openai.com/research/jukebox
Paper: https://cdn.openai.com/papers/jukebox.pdf
GitHub: https://github.com/openai/jukebox
