It is massively multimodal: it accepts text, images, depth and optical flow, or masked video as input, and it is one of the first models to generate video + audio!
More information is below ⬇️ ⬇️
Given a video as input, it generates believable audio for it without any text prompt!
That’s all. Original author: @alexcarliera