- Efficient long-form audio generation: quickly generates up to 95 seconds of 44.1kHz stereo music and sound effects from text prompts.
- Variable-length output: fine-grained control over both the content and the duration of the generated audio.
- Stereo audio rendering: renders stereo signals, giving a richer audio experience with more depth.
- Fast inference: generating 95 seconds of stereo audio takes only 8 seconds on an A100 GPU, which shows very high computational efficiency.
- Structured music generation: unlike other tools, it can create music with a clear structure (a beginning, a middle development section, and an ending) from your text prompts, which makes the music sound more interesting.
- Outperforms AudioLDM2 and MusicGen; see the metrics reported in the paper.
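To put the speed figure in the list above in perspective, here is a small back-of-the-envelope calculation. It uses only the numbers quoted in this post (95 seconds of 44.1kHz stereo audio, 8 seconds of generation time); the function names are made up for illustration and are not part of any Stable Audio API.

```python
# Figures quoted in the feature list: 95 s of 44.1 kHz stereo audio in 8 s on an A100.
AUDIO_SECONDS = 95
GENERATION_SECONDS = 8
SAMPLE_RATE_HZ = 44_100
CHANNELS = 2  # stereo


def real_time_factor(audio_seconds: float, generation_seconds: float) -> float:
    """Seconds of audio produced per second of compute."""
    return audio_seconds / generation_seconds


def total_samples(audio_seconds: float, sample_rate_hz: int, channels: int) -> int:
    """Number of individual sample values in the clip, across all channels."""
    return int(audio_seconds * sample_rate_hz) * channels


if __name__ == "__main__":
    rtf = real_time_factor(AUDIO_SECONDS, GENERATION_SECONDS)
    print(f"{rtf:.1f}x faster than real time")  # 11.9x faster than real time
    print(total_samples(AUDIO_SECONDS, SAMPLE_RATE_HZ, CHANNELS))  # 8379000
```

In other words, the quoted numbers correspond to roughly 12 seconds of audio per second of GPU time.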
Problems solved:
It makes long-form audio generation more efficient and removes the fixed-size output limitation, so audio of variable length can be generated.
Combining a latent diffusion model with timing conditioning gives fine-grained control over the length of the generated audio while keeping computation efficient.
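The idea of controlling length through timing conditioning can be illustrated with a minimal sketch. It assumes the model always produces a fixed-size window (here 95 seconds, the maximum quoted above), receives the desired timing as extra conditioning next to the text prompt, and that the output is trimmed to the requested duration afterwards. The field names `seconds_start` and `seconds_total` follow the terminology of the paper as best I recall it; the class and helper functions are hypothetical and are not the stable-audio-tools API.

```python
from dataclasses import dataclass

MAX_WINDOW_SECONDS = 95.0  # longest output quoted in this post (assumed window size)
SAMPLE_RATE_HZ = 44_100


@dataclass(frozen=True)
class TimingConditioning:
    """Hypothetical container for the signals passed to the model with the prompt."""
    prompt: str
    seconds_start: float
    seconds_total: float


def make_conditioning(prompt: str, seconds_total: float,
                      seconds_start: float = 0.0) -> TimingConditioning:
    """Validate the requested timing and bundle it with the text prompt."""
    if not 0 < seconds_total <= MAX_WINDOW_SECONDS:
        raise ValueError(f"seconds_total must be in (0, {MAX_WINDOW_SECONDS}]")
    if not 0 <= seconds_start < seconds_total:
        raise ValueError("seconds_start must be in [0, seconds_total)")
    return TimingConditioning(prompt, seconds_start, seconds_total)


def trim_to_length(waveform: list[list[float]], cond: TimingConditioning,
                   sample_rate: int = SAMPLE_RATE_HZ) -> list[list[float]]:
    """Keep only the requested duration from each channel of a fixed-size window."""
    keep = round((cond.seconds_total - cond.seconds_start) * sample_rate)
    return [channel[:keep] for channel in waveform]


if __name__ == "__main__":
    cond = make_conditioning("calm piano with soft strings", seconds_total=30.0)
    # Stand-in for model output: a silent stereo window at a tiny sample rate.
    window = [[0.0] * 950, [0.0] * 950]
    clip = trim_to_length(window, cond, sample_rate=10)
    print(len(clip), len(clip[0]))
```

The point of the design is that the network itself never changes shape: a request for a shorter clip changes only the conditioning values and the final trim, which is how variable-length output can coexist with a fixed-size generation window.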
Paper: https://arxiv.org/abs/2402.04825
Code: https://github.com/Stability-AI/stable-audio-tools
Metrics: https://github.com/Stability-AI/stable-audio-metrics
Demonstration: https://stability-ai.github.io/stable-audio-demo/