Efficient long-form audio generation: Quickly generates up to 95 seconds of 44.1kHz stereo music and sound effects from text prompts.
Variable-length audio output: Gives fine control over both the content and the duration of the generated audio.
Stereo audio rendering: Renders stereo signals for a richer, more immersive listening experience.
Fast inference: Generating 95 seconds of stereo audio takes only 8 seconds on an A100 GPU.
Structured music generation: Unlike other tools, it can create music with a clear structure from your text prompts, such as an intro, a development section, and an outro, which makes the music sound more interesting.
Performance: Outperforms AudioLDM2 and MusicGen; see the metrics in the paper.
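
To put the speed figure in perspective, here is a minimal sketch that works out the real-time factor and the raw sample count from the numbers quoted above (95 seconds of 44.1kHz stereo audio in 8 seconds). The function names are hypothetical and used only for this illustration; they are not part of any Stable Audio code.

```python
# Figures taken from the feature list above.
SAMPLE_RATE_HZ = 44_100
CHANNELS = 2  # stereo
AUDIO_SECONDS = 95
GENERATION_SECONDS = 8


def real_time_factor(audio_seconds: float, generation_seconds: float) -> float:
    """Seconds of audio produced per second of compute (hypothetical helper)."""
    return audio_seconds / generation_seconds


def total_samples(audio_seconds: int, sample_rate_hz: int, channels: int) -> int:
    """Number of individual sample values in the rendered clip (hypothetical helper)."""
    return audio_seconds * sample_rate_hz * channels


if __name__ == "__main__":
    rtf = real_time_factor(AUDIO_SECONDS, GENERATION_SECONDS)
    samples = total_samples(AUDIO_SECONDS, SAMPLE_RATE_HZ, CHANNELS)
    print(f"Real-time factor: {rtf:.3f}x")
    print(f"Sample values in one clip: {samples:,}")
```

In other words, the quoted figures correspond to generating audio roughly 12 times faster than it takes to play back.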

Problems solved:

It makes long-form audio generation more efficient and removes the fixed-size output limitation, so audio of variable length can be generated.
By combining a latent diffusion model with timing conditioning, it gives fine control over the length of the generated audio while remaining computationally efficient.
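
One way to picture how timing conditioning can yield variable-length output from a fixed-size generation window is sketched below. This is a rough illustration under stated assumptions, not the model's actual code: it assumes the model always produces a 95-second window, that the requested duration is passed in as a conditioning signal, and that the result is trimmed to that duration afterwards. The class, the field names `seconds_start` and `seconds_total`, and both helper functions are assumptions made for this example.

```python
from dataclasses import dataclass
from typing import List

SAMPLE_RATE_HZ = 44_100
WINDOW_SECONDS = 95  # assumed fixed generation window, from the figure quoted above


@dataclass
class TimingCondition:
    """Hypothetical timing signal handed to the model alongside the text prompt."""
    seconds_start: float  # where the clip starts within the full piece
    seconds_total: float  # requested length of the full piece


def make_timing_condition(requested_seconds: float,
                          window_seconds: float = WINDOW_SECONDS) -> TimingCondition:
    """Build the conditioning values for a clip of the requested length."""
    if not 0 < requested_seconds <= window_seconds:
        raise ValueError(f"requested length must be in (0, {window_seconds}] seconds")
    return TimingCondition(seconds_start=0.0, seconds_total=float(requested_seconds))


def trim_to_requested_length(window: List[float],
                             cond: TimingCondition,
                             sample_rate_hz: int = SAMPLE_RATE_HZ) -> List[float]:
    """Keep only the samples that fall inside the requested duration."""
    keep = round(cond.seconds_total * sample_rate_hz)
    return window[:keep]


if __name__ == "__main__":
    cond = make_timing_condition(30)
    fake_window = [0.0] * (WINDOW_SECONDS * SAMPLE_RATE_HZ)  # stand-in for model output
    clip = trim_to_requested_length(fake_window, cond)
    print(len(clip) / SAMPLE_RATE_HZ, "seconds kept")
```

The point of the sketch is only that the cost of generation stays tied to the fixed window while the user-facing length is controlled by the conditioning values.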

Paper: https://arxiv.org/abs/2402.04825
Code: https://github.com/Stability-AI/stable-audio-tools
Metrics: https://github.com/Stability-AI/stable-audio-metrics
Demo: https://stability-ai.github.io/stable-audio-demo/
