Alibaba’s EMO: Emotional portraits come alive

Generating emotion-filled portrait videos through audio-to-video diffusion models under simple conditions

Abstract

The authors propose EMO, a framework that generates expressive portrait videos from a single reference image and vocal audio, such as speech or singing. The method not only captures rich facial expressions and diverse head poses, but also allows the video duration to be set freely according to the length of the input audio.

Method

The framework consists of two main stages. In the frames-encoding stage, ReferenceNet extracts features from the reference image and from preceding motion frames. In the diffusion-process stage, a pretrained audio encoder processes the vocal audio, and generation of the facial region is controlled by combining a facial-area mask with multi-frame noise. A Backbone Network then performs denoising using two attention mechanisms: reference attention, which keeps the character's identity consistent across frames, and audio attention, which modulates the character's movements so they follow the audio naturally. Temporal modules are added on top, allowing the pace of the motion to be adjusted flexibly.
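The two attention mechanisms can be illustrated with a minimal cross-attention sketch: noisy frame tokens first attend to ReferenceNet features (identity), then to audio-encoder features (motion). This is an assumed simplification for intuition, not the paper's actual implementation; all names, shapes, and the single-head formulation are illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, context):
    # Single-head, unprojected cross-attention sketch:
    # each query token forms a weighted mix of context tokens.
    scores = queries @ context.T / np.sqrt(queries.shape[-1])
    return softmax(scores, axis=-1) @ context

rng = np.random.default_rng(0)
d = 64
frame_latents  = rng.standard_normal((16, d))  # noisy video-frame tokens (assumed)
ref_features   = rng.standard_normal((16, d))  # ReferenceNet features: identity cue
audio_features = rng.standard_normal((4, d))   # audio-encoder features: motion cue

# Reference attention: frame tokens attend to the reference features,
# pulling each frame toward the reference identity.
x = cross_attention(frame_latents, ref_features)
# Audio attention: the result attends to audio features,
# so expressions and head motion track the sound.
x = cross_attention(x, audio_features)
print(x.shape)  # (16, 64)
```

In the real model these blocks sit inside the denoising Backbone Network with learned query/key/value projections and multiple heads; the sketch only shows the direction of information flow.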

Project address: https://humanaigc.github.io/emote-portrait-alive/
