All you need to do is provide a still photo of a person and a voice recording, and VividTalk will combine them into a video in which the person appears to be speaking.
The generated facial expressions and head movements look natural, the mouth shape stays synchronized with the audio, and the method supports multiple languages and different styles, such as a realistic style or a cartoon style.
The project was jointly developed by Nanjing University, Alibaba, ByteDance and Nankai University.
VividTalk enables high-quality, realistic audio-driven talking-head video generation through a two-stage pipeline: audio-to-3D-mesh mapping followed by mesh-to-video conversion.
Detailed explanation of how it works:
1. Audio-to-mesh mapping (first stage):
In this stage, VividTalk maps the input audio to a 3D mesh. This involves learning two types of motion: non-rigid expression motion and rigid head motion.
For expression motion, VividTalk uses both blendshapes and vertices as intermediate representations to maximize the model's representational capacity. Blendshapes capture coarse, global facial motion, while vertex offsets describe finer lip movements.
For natural head movements, VividTalk proposes a novel learnable head pose codebook trained with a two-stage mechanism.
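The combination of the two expression representations described above can be sketched in a few lines. This is a toy illustration, not VividTalk's actual code: the array shapes, the names `drive_mesh`, `neutral`, and `basis`, and the random data are all assumptions, but the core idea is the standard linear blendshape formula with an added per-vertex correction, i.e. mesh = neutral + blendshape basis weighted by audio-predicted coefficients + audio-predicted vertex offsets.

```python
import numpy as np

# Toy sizes; a real face mesh has thousands of vertices and dozens of blendshapes.
N_VERTS = 5        # vertices in the face mesh (hypothetical)
N_BLEND = 3        # number of blendshapes (hypothetical)

rng = np.random.default_rng(0)
neutral = rng.standard_normal((N_VERTS, 3))          # neutral face mesh
basis = rng.standard_normal((N_BLEND, N_VERTS, 3))   # per-blendshape vertex deltas

def drive_mesh(blend_weights, vertex_offsets):
    """Combine coarse blendshape motion with fine per-vertex offsets.

    blend_weights:  (N_BLEND,)   coefficients predicted from audio (coarse, global)
    vertex_offsets: (N_VERTS, 3) offsets predicted from audio (fine lip detail)
    """
    # Weighted sum over the blendshape axis -> one (N_VERTS, 3) displacement.
    coarse = np.tensordot(blend_weights, basis, axes=1)
    return neutral + coarse + vertex_offsets

# With all-zero predictions the driven mesh is exactly the neutral face.
mesh = drive_mesh(np.zeros(N_BLEND), np.zeros((N_VERTS, 3)))
assert np.allclose(mesh, neutral)
```

In the paper's setting, both `blend_weights` and `vertex_offsets` would come from learned networks conditioned on audio features; the point of using both is that each representation compensates for what the other misses.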
2. Mesh-to-video conversion (second stage):
In the second stage, VividTalk uses a dual-branch motion-VAE (variational autoencoder) and a generator to convert the learned meshes into dense motion, then synthesizes high-quality video frame by frame from that motion.
This process involves projecting the motion of the 3D mesh into dense 2D motion, which is then fed into the generator to synthesize the final video frames.
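The projection step above can be sketched minimally. This is not VividTalk's implementation: a real pipeline uses a proper camera model and rasterizes the projected motion into a dense flow map for the generator, whereas this toy keeps only the per-vertex step with an assumed orthographic camera (simply dropping the z axis); the function name and shapes are hypothetical.

```python
import numpy as np

def mesh_motion_to_2d(verts_prev, verts_next):
    """Project per-vertex 3D motion between two frames to 2D motion vectors.

    verts_prev, verts_next: (N, 3) mesh vertices at consecutive frames.
    Returns an (N, 2) array of screen-space motion, assuming an
    orthographic camera looking down the z axis.
    """
    # Orthographic projection: keep (x, y), discard z, then difference.
    return verts_next[:, :2] - verts_prev[:, :2]

# Example: every vertex moves by (1, 2) in screen space; z motion is dropped.
prev = np.zeros((4, 3))
nxt = np.array([[1.0, 2.0, 9.0]] * 4)
flow = mesh_motion_to_2d(prev, nxt)
```

Rendering these sparse per-vertex vectors into a dense per-pixel field is what lets a 2D generator synthesize photorealistic frames without ever seeing the 3D mesh directly.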
3. High visual quality and realism:
The videos generated by VividTalk are of high visual quality, including realistic facial expressions, diverse head poses, and significant improvements in lip synchronization.
With this approach, VividTalk is able to generate realistic talking avatar videos that are highly synchronized with the input audio, enhancing the realism and dynamics of the video.
Project page and demos: https://humanaigc.github.io/vivid-talk/
Paper: https://arxiv.org/pdf/2312.01841.pdf
GitHub: https://github.com/HumanAIGC/VividTalk