FlowAct-R1: Towards Interactive Humanoid Video Generation
Overview of FlowAct-R1. It consists of a training stage and an inference stage. Training integrates three steps: converting the base full-attention DiT into a streaming autoregressive model via autoregressive adaptation, joint audio-motion finetuning for better lip-sync and body motion, and multi-stage diffusion distillation. Inference adopts a structured memory bank (Reference/Long-term/Short-term Memory plus a Denoising Stream) with chunkwise autoregressive generation and memory refinement. Complemented by system-level optimizations, FlowAct-R1 achieves 25 fps real-time 480p video generation (TTFF ≈1.5 s) with vivid behavioral transitions.
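To make the inference design concrete, the following Python is a minimal hypothetical sketch of chunkwise autoregressive generation over a structured memory bank. All names (MemoryBank, denoise_chunk, refine_memory, CHUNK_LEN, SHORT_TERM_CAP) and the compression scheme are illustrative assumptions, not the released FlowAct-R1 implementation.

```python
# Hypothetical sketch: chunkwise autoregressive generation with a structured
# memory bank. All names and sizes here are illustrative assumptions.
from dataclasses import dataclass, field
from typing import Iterable, Iterator

import torch

CHUNK_LEN = 16       # frames per autoregressive chunk (assumed)
SHORT_TERM_CAP = 4   # chunks retained in short-term memory (assumed)

@dataclass
class MemoryBank:
    reference: torch.Tensor                                       # fixed identity/appearance anchor
    long_term: list[torch.Tensor] = field(default_factory=list)   # compressed distant context
    short_term: list[torch.Tensor] = field(default_factory=list)  # most recent chunks

def denoise_chunk(memory: MemoryBank, audio: torch.Tensor) -> torch.Tensor:
    """Stand-in for the distilled few-step streaming DiT: the denoising
    stream conditions on the memory bank and the current audio window."""
    return torch.randn(CHUNK_LEN, 512)  # placeholder latent frames

def refine_memory(memory: MemoryBank, chunk: torch.Tensor) -> None:
    """Memory refinement: the newest chunk enters short-term memory; chunks
    evicted from it are compressed (here, mean-pooled) into long-term memory."""
    memory.short_term.append(chunk)
    if len(memory.short_term) > SHORT_TERM_CAP:
        evicted = memory.short_term.pop(0)
        memory.long_term.append(evicted.mean(dim=0, keepdim=True))

def stream(reference_feat: torch.Tensor,
           audio_stream: Iterable[torch.Tensor]) -> Iterator[torch.Tensor]:
    """Runs indefinitely: one chunk per incoming audio window."""
    memory = MemoryBank(reference=reference_feat)
    for audio_window in audio_stream:
        chunk = denoise_chunk(memory, audio_window)
        refine_memory(memory, chunk)
        yield chunk  # decode and display downstream at 25 fps
```

Under these assumptions, capping short-term memory and compressing evicted chunks keeps per-chunk attention cost bounded, which is what makes infinite-duration streaming feasible.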
FlowAct-R1 synthesizes lifelike humanoid videos in a streaming fashion with naturally expressive behaviors, enabling infinite-duration generation for truly seamless interaction.
FlowAct-R1 exhibits highly responsive interaction capabilities, showing strong potential for real-time, low-latency instant-communication scenarios.
Our method is robust to various character and motion styles.
FlowAct-R1 outperforms SOTA methods in human preference evaluation by simultaneously achieving real-time streaming, infinite-duration generation, and superior behavioral naturalness.
The orange segments indicate the percentage of user votes favoring FlowAct-R1 over other methods.
@article{flowact-r1,
  title={FlowAct-R1: Towards Interactive Humanoid Video Generation},
  author={},
  journal={},
  year={2026}
}