DreamActor-M1: Holistic, Expressive and Robust Human Image Animation with Hybrid Guidance

Bytedance Intelligent Creation
*Equal Contribution · Corresponding Author

We present DreamActor-M1, a DiT-based human animation framework with hybrid guidance, designed to achieve fine-grained holistic controllability, multi-scale adaptability, and long-term temporal coherence. Given a reference image, DreamActor-M1 can imitate human behaviors captured from driving videos, producing highly expressive and realistic human videos across multiple scales, from portrait to full-body animation. The resulting videos are temporally consistent, identity-preserving, and of high fidelity.

Abstract

While recent image-based human animation methods achieve realistic body and facial motion synthesis, critical gaps remain in fine-grained holistic controllability, multi-scale adaptability, and long-term temporal coherence, which limits their expressiveness and robustness. We propose a diffusion transformer (DiT) based framework, DreamActor-M1, with hybrid guidance to overcome these limitations. For motion guidance, our hybrid control signals, which integrate implicit facial representations, 3D head spheres, and 3D body skeletons, achieve robust control of facial expressions and body movements while producing expressive and identity-preserving animations. For scale adaptation, to handle the wide range of body poses and image scales from portraits to full-body views, we employ a progressive training strategy using data of varying resolutions and scales. For appearance guidance, we integrate motion patterns from sequential frames with complementary visual references, ensuring long-term temporal coherence for regions that become unseen during complex movements. Experiments demonstrate that our method outperforms state-of-the-art approaches, delivering expressive results for portrait, upper-body, and full-body generation with robust long-term consistency.

Method Overview


Overview of DreamActor-M1. During the training stage, we first extract body skeletons and head spheres from the driving frames and encode them into a pose latent using the pose encoder. The resulting pose latent is concatenated with the noised video latent along the channel dimension; the video latent itself is obtained by encoding a clip from the input video with a 3D VAE. Facial expression is additionally encoded by the face motion encoder to produce implicit facial representations. Note that the reference image can be one or multiple frames sampled from the input video to provide additional appearance details during training, and the reference token branch shares the weights of our DiT model with the noise token branch. Finally, the denoised video latent is supervised by the encoded video latent. Within each DiT block, the face motion token is integrated into the noise token branch via cross-attention (Face Attn), while appearance information from the reference tokens is injected into the noise tokens through concatenated self-attention (Self Attn) and a subsequent cross-attention (Ref Attn).
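The token flow inside a DiT block described above can be illustrated with a toy sketch. This is a minimal, hedged NumPy mock-up, not the actual implementation: the function names (`dit_block`, `attention`), the single-head unscaled-projection attention, and the ordering of the three attention steps are illustrative assumptions; the real model uses learned projections, multi-head attention, normalization, and MLPs omitted here.

```python
import numpy as np

def attention(q, k, v):
    # Scaled dot-product attention over (tokens, dim) arrays.
    scores = q @ k.T / np.sqrt(q.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v

def dit_block(noise_tok, ref_tok, face_tok):
    """Illustrative DiT block: Self Attn over concatenated ref+noise
    tokens (shared weights across both branches), then Ref Attn, then
    Face Attn injecting the implicit facial representation."""
    n_ref = ref_tok.shape[0]
    # Self Attn: ref and noise tokens attend jointly, so appearance
    # information can flow from the reference branch into the noise branch.
    joint = np.concatenate([ref_tok, noise_tok], axis=0)
    joint = joint + attention(joint, joint, joint)
    ref_out, noise_out = joint[:n_ref], joint[n_ref:]
    # Ref Attn: noise tokens cross-attend to the reference tokens.
    noise_out = noise_out + attention(noise_out, ref_out, ref_out)
    # Face Attn: face motion tokens are injected via cross-attention.
    noise_out = noise_out + attention(noise_out, face_tok, face_tok)
    return noise_out, ref_out

# Toy shapes: 8 noise tokens, 4 reference tokens, 2 face motion tokens, dim 16.
rng = np.random.default_rng(0)
noise, ref_ = dit_block(rng.normal(size=(8, 16)),
                        rng.normal(size=(4, 16)),
                        rng.normal(size=(2, 16)))
print(noise.shape, ref_.shape)  # token counts are preserved per branch
```

The pose latent is not shown here because, per the overview, it enters before the DiT blocks by channel-wise concatenation with the noised video latent rather than through attention.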

Diversity

Our method is robust to various character and motion styles.

Controllability and Robustness

Comparison with SOTA Methods

Our method generates results with fine-grained motion, identity preservation, temporal consistency, and high fidelity.

Pose Transfer

Portrait Animation

BibTeX

@misc{luo2025dreamactorm1holisticexpressiverobust,
      title={DreamActor-M1: Holistic, Expressive and Robust Human Image Animation with Hybrid Guidance}, 
      author={Yuxuan Luo and Zhengkun Rong and Lizhen Wang and Longhao Zhang and Tianshu Hu and Yongming Zhu},
      year={2025},
      eprint={2504.01724},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2504.01724}, 
}