Media2Face can generate speech-synchronized, expressive 3D facial animations from audio.
It also lets users make fine-grained, personalized adjustments to the generated animations, such as shifting the emotion toward “happy” or “sad”.
It can also interpret several types of input (audio, text, and images) and use them as guidance when generating facial animations.
Practical applications:
- Create dialogue scenes: given a script you write, it can generate animated scenes of characters talking to each other.
- Make stylized facial animations: give it an emoji, and it can create an animation in the style of that symbol.
- Emotional singing: it can also animate singing in different languages, expressing the corresponding emotions.
- Personalized animation: most impressively, the project can create personalized facial animations matching different ethnicities, ages, and genders.
Working principle:
The Media2Face pipeline combines several key techniques to generate 3D facial animations with rich expressions and emotions from speech. The main workflow is:
- Generalized Neural Parametric Facial Assets (GNPFA):
Face mapping: the research team first built a special tool (GNPFA) that acts like a large library of facial expressions. It lets you find whatever expression you want while keeping each person’s facial animation distinct from everyone else’s.
This step decouples expression from identity, meaning the same expression can be transferred across different identities.
- High-quality expression and head-pose extraction:
Using this tool, they processed a large number of videos to extract high-quality expressions and head movements, producing a large dataset of diverse facial animations annotated with emotion and style labels.
- Multimodal guided animation generation:
Diffusion model: Media2Face trains a diffusion model in the latent space of GNPFA, which can accept multimodal guidance from audio, text, and images.
Conditional fusion: the model treats audio features and CLIP latent codes as conditions, denoising them together with a noised sequence of expression latent codes and head motion codes (i.e., head pose).
Cross-attention: the conditions are randomly masked and fused with the noisy head motion codes through cross-attention.
- High-fidelity, stylistically diverse animation:
Expression and head-pose generation: at inference time, head motion codes are sampled with DDIM, the expression latent codes are fed to the GNPFA decoder to recover the expression geometry, and the result is combined with the model template and head-pose parameters to produce the final facial animation.
- Fine-tuning and personalization:
Expression and style fine-tuning: keyframe expression latent codes are extracted with the expression encoder, and per-frame style prompts such as “happy” or “sad” are supplied through CLIP, letting users adjust the intensity and scope of the animation.
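As a concrete illustration of the random condition masking mentioned above: dropping conditions at random during training is the standard trick (classifier-free guidance) that later lets a sampler blend conditional and unconditional predictions, which is also how a style prompt’s intensity can be scaled. This is a minimal NumPy sketch; the function names, shapes, and drop probability are illustrative assumptions, not the paper’s actual code.

```python
import numpy as np

rng = np.random.default_rng(0)

def mask_conditions(audio_feat, clip_feat, p_drop=0.1):
    """Randomly replace each condition with a null (zero) embedding.

    Training with occasionally-dropped conditions teaches the denoiser
    an unconditional mode alongside the conditional one.
    """
    if rng.random() < p_drop:
        audio_feat = np.zeros_like(audio_feat)
    if rng.random() < p_drop:
        clip_feat = np.zeros_like(clip_feat)
    return audio_feat, clip_feat

def guided_prediction(eps_cond, eps_uncond, scale=2.0):
    """Classifier-free guidance: extrapolate from the unconditional
    noise prediction toward the conditional one by `scale`.
    scale > 1 strengthens the condition's influence; scale = 1
    recovers the plain conditional prediction."""
    return eps_uncond + scale * (eps_cond - eps_uncond)
```

At inference, running the denoiser twice (with and without conditions) and mixing the two predictions via `guided_prediction` is what allows a user-controllable trade-off between fidelity to the prompt and diversity.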
Through these steps, Media2Face generates speech-synchronized, expressive 3D facial animations that support complex emotions and style changes, providing a powerful tool for creating virtual characters and enriching interaction with digital humans.
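The DDIM sampling used at inference can be sketched as follows. This is a generic, deterministic DDIM loop (eta = 0) in NumPy; `denoise_fn`, the noise schedule, and all shapes are illustrative stand-ins for the paper’s actual motion/expression denoiser, not its real API.

```python
import numpy as np

def ddim_sample(denoise_fn, shape, alphas_cumprod, steps=10, seed=0):
    """Deterministic DDIM sampling over a latent of `shape`.

    denoise_fn(x_t, t) -> predicted noise at timestep t (a stand-in
    for the conditional diffusion model described above).
    alphas_cumprod   -> cumulative noise schedule, ~1 at t=0, ~0 at t=T.
    """
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(shape)                      # start from pure noise
    ts = np.linspace(len(alphas_cumprod) - 1, 0, steps).astype(int)
    for i, t in enumerate(ts):
        a_t = alphas_cumprod[t]
        a_prev = alphas_cumprod[ts[i + 1]] if i + 1 < len(ts) else 1.0
        eps = denoise_fn(x, t)
        x0 = (x - np.sqrt(1.0 - a_t) * eps) / np.sqrt(a_t)   # predicted clean latent
        x = np.sqrt(a_prev) * x0 + np.sqrt(1.0 - a_prev) * eps  # deterministic step
    return x

# Toy usage: a linear schedule and a denoiser that always predicts zero noise.
alphas_cumprod = np.linspace(0.99, 0.02, 100)
zero_eps = lambda x, t: np.zeros_like(x)
motion_codes = ddim_sample(zero_eps, (16, 32), alphas_cumprod, steps=8)
```

In the real pipeline, the sampled codes would then be decoded by GNPFA into expression geometry; here the toy denoiser only demonstrates the shape of the sampling loop.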
Project and demos: https://sites.google.com/view/media2face
Paper: https://arxiv.org/abs/2401.15687
GitHub: coming soon.