DreamActor-M2: Universal Character Image Animation via Spatiotemporal In-Context Learning

1Bytedance Intelligent Creation, 2Key Lab of Intell. Info. Process., ICT, CAS,
3University of Chinese Academy of Sciences, 4Southeast University
*Equal Contribution, Project Lead, §Corresponding Authors

We introduce DreamActor-M2, a universal character image animation framework that reformulates motion conditioning as a spatiotemporal in-context learning task. Our design harnesses the inherent generative priors of video foundation models and enables pose-free, end-to-end motion transfer directly from raw videos. This paradigm eliminates the need for explicit pose estimation, allowing DreamActor-M2 to achieve strong generalization and high-fidelity results across diverse and complex scenarios.
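The sketch below is a minimal, hypothetical illustration (not the released implementation) of what "spatiotemporal in-context learning" can mean in practice: reference-image latents and raw driving-video latents are packed into the same token sequence as the noisy target latents, so a video diffusion transformer attends over appearance and motion jointly instead of consuming an explicit pose signal. All module and tensor names here are assumptions.

# Minimal sketch of in-context motion conditioning (hypothetical names).
import torch
import torch.nn as nn

class InContextConditioner(nn.Module):
    def __init__(self, latent_dim: int, model_dim: int):
        super().__init__()
        self.proj = nn.Linear(latent_dim, model_dim)   # shared latent projection
        self.role_embed = nn.Embedding(3, model_dim)   # 0: target, 1: reference, 2: driving

    def forward(self, target_lat, ref_lat, drive_lat):
        # target_lat: (B, T_tgt, D), ref_lat: (B, T_ref, D), drive_lat: (B, T_drv, D)
        tokens, roles = [], []
        for role, lat in enumerate((target_lat, ref_lat, drive_lat)):
            tokens.append(self.proj(lat))
            roles.append(torch.full(lat.shape[:2], role, dtype=torch.long, device=lat.device))
        # One unified sequence: the video DiT's self-attention sees identity and
        # motion context alongside the tokens it is denoising.
        return torch.cat(tokens, dim=1) + self.role_embed(torch.cat(roles, dim=1))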

Abstract

Character image animation aims to synthesize high-fidelity videos by transferring motion from a driving sequence to a static reference image. Despite recent advances, existing methods suffer from two fundamental challenges: (1) suboptimal motion injection strategies that force a trade-off between identity preservation and motion consistency, manifesting as a "see-saw" effect, and (2) an over-reliance on explicit pose priors (e.g., skeletons), which inadequately capture intricate dynamics and hinder generalization to arbitrary, non-humanoid characters. To address these challenges, we present DreamActor-M2, a universal animation framework that reformulates motion conditioning as an in-context learning problem. Our approach follows a two-stage paradigm. First, we bridge the input modality gap by fusing reference appearance and motion cues into a unified latent space, enabling the model to jointly reason about spatial identity and temporal dynamics by leveraging the generative prior of foundation models. Second, we introduce a self-bootstrapped data synthesis pipeline that curates pseudo cross-identity training pairs, facilitating a seamless transition from pose-dependent control to direct, end-to-end RGB-driven animation. This strategy significantly enhances generalization across diverse characters and motion scenarios. To facilitate comprehensive evaluation, we further introduce AWBench, a versatile benchmark encompassing a wide spectrum of character categories and motion types. Extensive experiments demonstrate that DreamActor-M2 achieves state-of-the-art performance, delivering superior visual fidelity and robust cross-domain generalization.
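To make the second stage concrete, here is a hedged sketch of how a self-bootstrapped pseudo cross-identity pair could be curated: a pose-conditioned stage-1 animator re-renders the motion of one subject onto a different reference identity, yielding triplets (reference image, raw driving video, pseudo target) so the stage-2 model learns direct RGB-driven transfer without pose priors at inference. This is an assumed pipeline; stage1_animator, extract_pose, and sample_reference are hypothetical helpers, not functions from the released code.

# Hypothetical sketch of pseudo cross-identity pair synthesis.
def build_pseudo_pairs(videos, identities, stage1_animator, extract_pose, sample_reference):
    pairs = []
    for driving_video in videos:
        pose_seq = extract_pose(driving_video)               # pose used only during data curation
        ref_image = sample_reference(identities)             # identity different from the driving subject
        target_video = stage1_animator(ref_image, pose_seq)  # pseudo ground truth for the new identity
        pairs.append({
            "reference": ref_image,     # appearance condition
            "driving": driving_video,   # raw RGB motion condition (no pose at inference)
            "target": target_video,     # supervision signal
        })
    return pairs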

Method Overview


Schematic overview of the proposed DreamActor-M2.

Demo Gallery

Character Images: DreamActor-M2 achieves exceptional versatility by animating a broad spectrum of subjects, ranging from realistic humans to stylized cartoons, and even diverse animal species.

Reference Videos: DreamActor-M2 demonstrates robust end-to-end generalization to non-human motion signals, enabling seamless motion transfers without relying on any intermediate pose priors.

Multiple Characters: DreamActor-M2 effectively synchronizes motion across multiple distinct characters or maps complex multi-person dynamics to new groups while maintaining structural integrity.

Human-Object Interactions: Our end-to-end paradigm enables DreamActor-M2 to master intricate human-object interactions, preserving critical spatial and semantic details that are typically lost in estimated pose representations.

Cross-Shot Animations: By fully harnessing the intrinsic generative priors of foundation models, DreamActor-M2 excels in cross-morphology mapping, adaptively synthesizing missing motions or pruning redundant signals to align with the target shot requirements.

Complex Motions: By extracting motion cues directly from raw RGB videos, DreamActor-M2 accurately captures fine-grained and highly dynamic motion patterns.

Video Comparison

Our method generates results with fine-grained motion, identity preservation, temporal consistency, and high fidelity.

BibTeX

@article{luo2025dreamactor,
  title={DreamActor-M1: Holistic, Expressive and Robust Human Image Animation with Hybrid Guidance},
  author={Luo, Yuxuan and Rong, Zhengkun and Wang, Lizhen and Zhang, Longhao and Hu, Tianshu and Zhu, Yongming},
  journal={arXiv preprint arXiv:2504.01724},
  year={2025}
}