Developed by Tencent and the National University of Singapore, M2UGen can understand a variety of music, including its style, the instruments played, and the emotions it expresses, and can answer music-related questions.
It can also generate music from text, images, videos, and audio, and can understand the generated music and edit it according to a text description.
Key features of M2UGen:
- Music Q&A: M2UGen understands different types of music, including their style, the instruments used, and the moods and emotions they express, and can answer music-related questions based on this understanding.
- Text-to-Music Generation: Users can input text, and the model generates corresponding music based on this text.
- Image-to-Music Generation: The model is capable of generating matching music based on the provided image content.
- Video-to-Music Generation: The model understands the main content of a video and generates corresponding music.
- Music Editing: Users can edit the generated music, such as changing instruments or adjusting the tempo, simply by describing the changes in text.
M2UGen uses a set of modality-specific encoders: MERT for music understanding, ViT for image understanding, and ViViT for video understanding. For music generation, it uses MusicGen or AudioLDM 2 as the music decoder.
These components are connected to a LLaMA 2 model through adapter modules.
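The adapter idea above can be sketched as a small projection network that maps a frozen encoder's output embeddings into the LLM's hidden size, so they can be fed to LLaMA 2 alongside text tokens. This is a minimal illustrative sketch, not the actual M2UGen implementation; the class name and dimensions are assumptions (768 resembles MERT-style features, 4096 resembles LLaMA 2's hidden size).

```python
import torch
import torch.nn as nn

class ModalityAdapter(nn.Module):
    """Hypothetical adapter: encoder feature space -> LLM embedding space."""

    def __init__(self, encoder_dim: int, llm_dim: int, hidden_dim: int = 512):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(encoder_dim, hidden_dim),
            nn.SiLU(),
            nn.Linear(hidden_dim, llm_dim),
        )

    def forward(self, encoder_feats: torch.Tensor) -> torch.Tensor:
        # encoder_feats: (batch, seq_len, encoder_dim) -> (batch, seq_len, llm_dim)
        return self.proj(encoder_feats)

# Example: project 768-dim music features into a 4096-dim LLM space.
adapter = ModalityAdapter(encoder_dim=768, llm_dim=4096)
music_feats = torch.randn(1, 250, 768)   # stand-in for MERT output
llm_tokens = adapter(music_feats)
print(llm_tokens.shape)  # torch.Size([1, 250, 4096])
```

The projected sequence can then be prepended to the text embeddings, which is a common way to condition an LLM on non-text modalities.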
Working principle:
1. Multimodal input processing: M2UGen is capable of processing various types of inputs, including text, images, video, and audio.
It uses a dedicated encoder for each input modality: the MERT model processes music input, the ViT model processes image input, and the ViViT model processes video input.
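The modality-to-encoder routing can be sketched as a simple dispatch table. The encoder names match the article; the function itself is an illustrative placeholder, not the real model API.

```python
# Hypothetical dispatch sketch: route each input modality to its encoder.
ENCODERS = {
    "music": "MERT",
    "image": "ViT",
    "video": "ViViT",
}

def pick_encoder(modality: str) -> str:
    """Return the encoder name for a given input modality."""
    try:
        return ENCODERS[modality]
    except KeyError:
        raise ValueError(f"Unsupported modality: {modality!r}")

print(pick_encoder("video"))  # ViViT
```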
2. Music Understanding: Utilizing the LLaMA 2 model, M2UGen is able to understand various aspects of music, such as style, instrument usage, and emotional expression. It is capable of answering music-related questions, which involves a deep understanding of the content of the music.
3. Music generation: M2UGen can not only understand music but also generate it from different inputs, exploring models such as AudioLDM 2 and MusicGen to produce music conditioned on text, images, or video.
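The generation path can be sketched as: the LLM's output state is projected into a conditioning vector that a music decoder consumes (MusicGen or AudioLDM 2 in the real system). The GRU decoder below is a toy stand-in purely for illustrating the conditioning flow; none of these names or shapes come from the actual M2UGen code.

```python
import torch
import torch.nn as nn

class ToyMusicDecoder(nn.Module):
    """Toy stand-in for a conditioned music decoder (not the real model)."""

    def __init__(self, cond_dim: int, codebook_size: int = 1024, hidden: int = 256):
        super().__init__()
        self.cond_proj = nn.Linear(cond_dim, hidden)
        self.rnn = nn.GRU(hidden, hidden, batch_first=True)
        self.head = nn.Linear(hidden, codebook_size)

    def forward(self, cond: torch.Tensor, steps: int = 50) -> torch.Tensor:
        # cond: (batch, cond_dim) -> logits over audio tokens: (batch, steps, codebook_size)
        h = self.cond_proj(cond).unsqueeze(1).repeat(1, steps, 1)
        out, _ = self.rnn(h)
        return self.head(out)

decoder = ToyMusicDecoder(cond_dim=4096)
llm_state = torch.randn(1, 4096)   # stand-in for a LLaMA 2 output embedding
logits = decoder(llm_state)
print(logits.shape)  # torch.Size([1, 50, 1024])
```

In the real system the decoder is a pretrained generative model, and the conditioning projection is what gets trained; the toy above only shows where the conditioning signal enters.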
4. Dataset generation and training: To train M2UGen, the developers used the MU-LLaMA and MPT-7B models to generate large datasets pairing multimodal inputs with music. These datasets teach M2UGen to extract information from different inputs and generate corresponding music.
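The kind of paired record such a pipeline might emit can be sketched as follows: a caption produced by a captioning LLM (MU-LLaMA or MPT-7B in the article) linked to the source media and the target music clip. All field names and file names here are illustrative assumptions, not the actual dataset schema.

```python
import json

def make_record(modality: str, source_path: str, caption: str, music_path: str) -> dict:
    """Build one hypothetical training record pairing an input with music."""
    return {
        "modality": modality,
        "source": source_path,
        "instruction": f"Generate music that matches this {modality}.",
        "caption": caption,
        "music": music_path,
    }

records = [
    make_record("image", "img_0001.jpg", "a calm sunset over the sea", "clip_0001.wav"),
    make_record("video", "vid_0002.mp4", "fast-paced city traffic at night", "clip_0002.wav"),
]
print(json.dumps(records[0], indent=2))
```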
Project and demos: https://crypto-code.github.io/M2UGen-Demo/
Paper: https://arxiv.org/abs/2311.11255
GitHub: https://github.com/shansongliu/M2UGen