Tencent has also launched a project to make photos sing and speak

Released as open source ahead of Alibaba's EMO

AniPortrait: generating talking and singing videos from audio and image input

Given an audio clip (such as recorded speech) and a static face photo, it automatically generates a realistic facial animation whose lip movements stay synchronized with the audio.

It supports multiple languages, as well as face reenactment and head pose control.

Main functions:

1. Audio-driven animation: Given an audio file and a reference portrait, AniPortrait generates a realistic portrait animation whose mouth movements and expression changes follow the rhythm and content of the speech.
2. Face reenactment: Beyond audio-driven animation, AniPortrait can transfer expressions and motions from a source video onto a reference portrait. For example, a user can supply a video of one person and reproduce that person's facial expressions and head movements on a new portrait, which makes it possible to create virtual characters that mirror the performance of real people.
3. Head pose control: Users can specify a head pose or choose a preset pose configuration to steer the head movements in the generated animation, making the results more natural and varied.
4. Self-driven and audio-driven video generation: Besides audio-driven animation, the project supports self-driven generation, in which animation is produced from preset or randomly generated motion sequences without any external audio input.
5. High-quality animation generation: AniPortrait aims for highly realistic portrait animation, approaching real people in visual quality and in the naturalness of the motion.
6. Flexible model and weight configuration: The project provides a set of pre-trained models and weights that users download and configure as needed, covering denoising, reference generation, pose guidance, motion, and audio-to-mesh conversion: StableDiffusion V1.5, denoising_unet, reference_unet, pose_guide, motion_module, and audio2mesh. A minimal sketch of how these pieces fit together follows this list.
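
To make the data flow concrete, here is a minimal Python sketch of the two-stage pipeline implied by those modules: audio is first converted to a sequence of 3D facial meshes (audio2mesh), the meshes are combined with the requested head poses and projected to 2D pose maps, and a StableDiffusion-based denoiser (denoising_unet, conditioned on the reference portrait via reference_unet and on the pose maps via pose_guide) renders the final frames. Every function below is an illustrative stub, not the repository's actual API; the frame rate, image size, and mesh size are assumptions, and the real entry points live in the GitHub repo linked below.

```python
# A minimal sketch of the two-stage pipeline described above, assuming the
# module names from the weight list (audio2mesh, pose_guide, reference_unet,
# denoising_unet). All functions are illustrative stand-ins, not AniPortrait's
# actual API.
import numpy as np

def audio2mesh(audio: np.ndarray, sample_rate: int) -> np.ndarray:
    """Stage 1 stand-in: map audio features to a sequence of 3D facial meshes.

    Returns an array of shape (num_frames, num_vertices, 3). Here we just
    fabricate zero-valued meshes at an assumed 30 fps output rate."""
    num_frames = int(len(audio) / sample_rate * 30)
    return np.zeros((num_frames, 468, 3))  # 468 vertices: an assumed mesh size

def project_to_pose_maps(meshes: np.ndarray, head_poses: np.ndarray) -> np.ndarray:
    """Apply the requested per-frame head poses and project each 3D mesh to a
    2D pose image, the conditioning signal a pose-guidance module would consume."""
    num_frames = meshes.shape[0]
    return np.zeros((num_frames, 512, 512, 3), dtype=np.uint8)

def diffusion_render(reference_image: np.ndarray, pose_maps: np.ndarray) -> np.ndarray:
    """Stage 2 stand-in: an SD-1.5-style denoiser, conditioned on the reference
    portrait (identity/appearance) and the pose maps (motion), renders one
    photorealistic frame per pose map. Here we simply repeat the reference."""
    return np.repeat(reference_image[None], pose_maps.shape[0], axis=0)

if __name__ == "__main__":
    sr = 16_000
    audio = np.zeros(sr * 2)                       # 2 s of silent dummy audio
    reference = np.zeros((512, 512, 3), np.uint8)  # the static face photo
    meshes = audio2mesh(audio, sr)
    poses = project_to_pose_maps(meshes, head_poses=np.zeros((len(meshes), 3)))
    frames = diffusion_render(reference, poses)
    print(frames.shape)  # (60, 512, 512, 3): one rendered frame per pose map
```

The notable design choice, as far as the released modules suggest, is the intermediate 3D representation: separating audio-to-mesh prediction from photorealistic rendering lets the same rendering stage serve both audio-driven animation and video-driven face reenactment.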

GitHub: https://github.com/Zejun-Yang/AniPortrait
Paper: https://arxiv.org/abs/2403.17694
