Google has also released a project that generates a talking and singing video from a single photo plus audio:
VLOGGER: Text and audio-driven generation of talking human videos from a single photo
What sets VLOGGER apart:
- It does not require training for each person.
- It does not rely on face detection and cropping.
- It generates the complete image (not just the face or the lips).
- It considers a wide range of scenarios (e.g., visible torsos or diverse subject identities) that are critical for correctly synthesizing humans who communicate.
Judging by the demo videos, though, the results do not seem as good as Alibaba’s EMO…
Project page: https://enriccorona.github.io/vlogger/
Paper: https://arxiv.org/abs/2403.08764
For video translation, VLOGGER can take an existing video in one language and edit the lip and face regions to match a new audio track, for example in Spanish.