A multimodal joint training technique for high-quality video-to-audio synthesis.
Given video and/or text input, MMAudio generates audio that synchronizes with it.
MMAudio is a cutting-edge AI project jointly developed by the University of Illinois Urbana-Champaign and Sony AI that achieves high-quality video-to-audio synthesis through multimodal joint training. The work was published at CVPR 2025, and online demos and open-source code are available.
Project overview
The core goal of MMAudio is to automatically generate highly synchronized and semantically consistent audio, such as background music and environmental sound effects, from input video or text. Its main innovation is a multimodal joint training framework that lets the model be trained jointly on large-scale audio-video and audio-text datasets, improving both the quality and the synchronization of the generated audio.
Core functions and technical characteristics
Video-to-audio synthesis: Automatically generate matching audio based on video content to achieve sound and picture synchronization.
Text-to-audio synthesis: Generate corresponding audio based on text descriptions, suitable for scenes that do not require video material.
Multimodal joint training: Models are trained on datasets containing audio, video, and text to improve understanding and generation of different modal data.
Synchronization module: A dedicated synchronization module ensures that the generated audio is accurately aligned with video frames or text descriptions, achieving a high degree of audio-visual synchronization.
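As a rough sketch of how local inference might look, the steps below clone the repository and invoke its demo script. The script name `demo.py` and the `--video`, `--prompt`, and `--duration` flags are assumptions based on typical usage of such repositories, not confirmed details; consult the GitHub README for the authoritative interface.

```shell
# Hypothetical local-inference sketch; see the official README for exact steps.
git clone https://github.com/hkchengrex/MMAudio.git
cd MMAudio
pip install -e .

# Video-to-audio: generate audio synchronized with an input video clip
python demo.py --video=input.mp4 --prompt "rain falling on a tin roof"

# Text-to-audio: omit the video and generate from the text prompt alone
python demo.py --prompt "waves crashing on a beach" --duration 8
```

The Hugging Face and Colab demos linked below offer the same functionality without a local GPU setup.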
Application scenarios
Film and television production: In the production of movies, TV series and short films, generate or enhance background sound effects, dialogue and environmental sounds to improve production efficiency and work quality.
Game development: In video games, sound effects that match the game screen are generated in real time to enhance players' immersion and interactive experience.
Virtual reality (VR) and augmented reality (AR): In VR and AR applications, audio is generated that is synchronized with the virtual environment to enhance the user’s immersive experience.
Animation production: Generate sound effects and background music for animated movies or videos that match the animated pictures, simplifying the audio production process.
News and documentaries: In news reports or documentaries, generate or enhance narration and commentary for video content to improve the efficiency of information transmission.
🚀Quick experience and resource links
Project homepage: https://hkchengrex.com/MMAudio
GitHub repository: https://github.com/hkchengrex/MMAudio
Online demo: Hugging Face Demo
Colab demo: Google Colab Demo
Replicate demo: Replicate Demo
📚Technical papers
The paper, "MMAudio: Taming Multimodal Joint Training for High-Quality Video-to-Audio Synthesis," was first submitted on December 19, 2024 and revised (v2) on April 7, 2025.
Detailed information and a PDF download of the paper are available at the following links:
arXiv page: https://arxiv.org/abs/2412.15322
Official website: https://hkchengrex.com/MMAudio/
YouTube: