Google has just released VideoPoet: a multimodal video generation model!

It is massively multimodal: it accepts text, images, depth and optical flow, or masked video as inputs, and it is one of the first models to generate video + audio!

More information is below ⬇️ ⬇️

Given just a video as input, it generates believable audio for it without any text prompt!

That’s all! Original author: @alexcarliera