Google has also released a project that generates a talking and singing video from a single photo plus audio

VLOGGER: Text and audio-driven generation of talking human videos from a single photo

What makes VLOGGER unique is that:

It does not require training for each individual person.
It does not rely on face detection and cropping.
It generates the complete image (not just the face or lips).
It covers a wide range of scenarios (e.g., visible torsos or diverse identities), which are crucial for correctly synthesizing humans who communicate.

But judging from the demo videos, the results do not seem as good as Alibaba’s EMO…

Project page: https://enriccorona.github.io/vlogger/
Paper: https://arxiv.org/abs/2403.08764

For video translation, VLOGGER can take an existing video in one language and edit the lip and face regions to match new audio in another language, such as Spanish.

Video: