SyncAnimation: A Real-Time End-to-End Framework for Audio-Driven Human Pose and Talking Head Animation

Abstract

Generating talking avatars driven by audio remains a significant challenge. Existing methods typically incur high computational costs and often lack sufficient facial detail and realism, making them unsuitable for applications that demand real-time performance and high visual quality. Moreover, while some methods synchronize lip movement, they still struggle to keep facial expressions and upper-body movement consistent with the audio, particularly during silent periods. In this paper, we introduce SyncAnimation, the first NeRF-based method that achieves audio-driven, stable, and real-time generation of speaking avatars by combining generalized audio-to-pose matching with audio-to-expression synchronization. By integrating the AudioPose Syncer and the AudioEmotion Syncer, SyncAnimation achieves high-precision pose and expression generation, progressively producing audio-synchronized upper body, head, and lip shapes. Furthermore, the High-Synchronization Human Renderer ensures seamless integration of the head and upper body and delivers audio-synchronized lip movements.


SyncAnimation is the first fully generative NeRF-based approach that drives both facial expressions and an adjustable torso from audio (left). It requires only audio and a monocular reference, or even noise, to render highly detailed identity information together with realistic, dynamic facial and torso motion that stays consistent with the audio (right).

Overall Pipeline


SyncAnimation framework: given an image and audio, preprocessing extracts 3DMM parameters that serve as references (or noise) for Audio2Pose and Audio2Emotion. The framework then generates the upper body, the head, and finally refined lips, keeping the pose consistent and the facial expression aligned with the audio.
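The following is a minimal, hypothetical sketch of this dataflow. None of the function names or shapes below come from the released code; they only illustrate the generation order described above (audio features to pose and expression parameters, then upper body, head, and lip refinement).

```python
# Hypothetical sketch of the SyncAnimation dataflow; names are illustrative
# placeholders, not the released API.
import numpy as np

def extract_audio_features(wav: np.ndarray, num_frames: int) -> np.ndarray:
    """Stand-in for the audio encoder: one feature vector per video frame."""
    rng = np.random.default_rng(0)
    return rng.standard_normal((num_frames, 64))

def audio2pose(audio_feat: np.ndarray, ref_pose: np.ndarray) -> np.ndarray:
    """Assumed interface of Audio2Pose: map audio features to 3DMM pose parameters."""
    return ref_pose + 0.01 * audio_feat[:, :ref_pose.shape[1]]

def audio2emotion(audio_feat: np.ndarray, ref_exp: np.ndarray) -> np.ndarray:
    """Assumed interface of Audio2Emotion: map audio features to expression coefficients."""
    return ref_exp + 0.01 * audio_feat[:, :ref_exp.shape[1]]

def render_frame(pose: np.ndarray, expression: np.ndarray) -> np.ndarray:
    """Placeholder for the renderer: upper body, then head, then lip refinement."""
    upper_body = np.zeros((512, 512, 3))    # torso conditioned on the pose parameters
    head = upper_body                       # head rendered consistently with the torso
    return head                             # final frame after lip refinement

num_frames = 4
ref_pose = np.zeros((num_frames, 6))        # 3DMM pose references from one image (or noise)
ref_exp = np.zeros((num_frames, 64))        # 3DMM expression references from one image (or noise)

audio_feat = extract_audio_features(np.zeros(16000), num_frames)
poses = audio2pose(audio_feat, ref_pose)
expressions = audio2emotion(audio_feat, ref_exp)
video = [render_frame(p, e) for p, e in zip(poses, expressions)]
```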

Comparison with SOTA

Comparison with GAN-based Methods

We compare with four one-shot methods: Wav2Lip (ACM MM 2020), DINet (AAAI 2023), IP-LAP (CVPR 2023), and EDTalk (ECCV 2024). SyncAnimation produces clearer, more dynamic lip shapes and stronger lip-audio consistency.

Comparison with NeRF-based Methods

We compare with three NeRF-based methods: ER-NeRF (ICCV 2023), SyncTalk (CVPR 2024), and GeneFace++ (arXiv 2023). SyncAnimation is the only method that produces avatar poses and expressions strongly correlated with the audio.

Comparison with Stable Diffusion-based Methods

We compare with three one-shot diffusion-based methods: Hallo (arXiv 2024), V-Express (arXiv 2024), and EchoMimic (arXiv 2024). SyncAnimation generates realistic avatars with finer facial detail and richer poses and expressions.

Torso Scaling Expansion

We deviate from the conventional approach of restricting the upper-body scale and avoid pasting the rendered result back onto the original frame. Instead, SyncAnimation directly renders four upper-body avatars driven by four audio segments, yielding strong audio correlation and natural, unrestricted poses and facial expressions.
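To make the distinction concrete, here is a hedged sketch, with entirely hypothetical function names, contrasting conventional paste-back compositing with the direct upper-body rendering described above.

```python
# Hypothetical contrast between paste-back compositing and direct rendering;
# function names are placeholders, not the SyncAnimation API.
import numpy as np

H, W = 512, 512

def render_head_crop(audio_feat: np.ndarray) -> np.ndarray:
    """Placeholder head-only renderer used by paste-back pipelines."""
    return np.zeros((256, 256, 3))

def render_upper_body(audio_feat: np.ndarray) -> np.ndarray:
    """Placeholder full upper-body renderer (the approach described above)."""
    return np.zeros((H, W, 3))

def paste_back(source_frame: np.ndarray, head_crop: np.ndarray, box: tuple) -> np.ndarray:
    """Conventional compositing: the rendered head is blended into the original
    frame, so the torso stays frozen and its scale is restricted."""
    y, x = box
    out = source_frame.copy()
    out[y:y + head_crop.shape[0], x:x + head_crop.shape[1]] = head_crop
    return out

audio_feat = np.zeros(64)
source_frame = np.zeros((H, W, 3))

composited = paste_back(source_frame, render_head_crop(audio_feat), box=(128, 128))
direct = render_upper_body(audio_feat)  # no paste-back: torso pose and scale are unrestricted
```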

Live News

In news reporting or television broadcasts, two journalists or interviewees engage in dialogue, where one remains momentarily "still" while the other speaks. Achieving this requires not only a larger upper body, since individuals are rarely represented by their heads alone, but also a strong correlation between the audio and the avatar's poses and expressions, so that movement and expression stay constrained during the "still" periods. We extracted male and female audio from two news dialogue segments and used SyncAnimation to render four upper-body avatars with strong audio correlation (shown in the first row). The videos show how the rendered avatars successfully perform the dialogue task, highlighting SyncAnimation's superior audio-driven pose and facial expression capabilities, along with its ability to render the upper body.
