FlowAct-R1: Towards Interactive Humanoid Video Generation

Bytedance Intelligent Creation
*Core Contributors Corresponding Author
FlowAct-R1 Overview

We present FlowAct-R1, a novel framework that enables lifelike, responsive, and high-fidelity humanoid video generation for seamless real-time interaction.

  • Streaming & Infinite-Length Generation: By integrating an MMDiT backbone with a chunkwise diffusion forcing strategy, our framework enables continuous, arbitrary-duration video generation while maintaining temporal consistency.
  • Real-Time Performance with Low Latency: Through combined model-level distillation and system-level optimizations, our method achieves stable 25 FPS video generation at 480p resolution with a time-to-first-frame (TTFF) of around 1.5s.
  • Vividness & Generalization: The model delivers exceptional behavioral vividness and perceptual realism, capturing subtle human nuances for natural transitions across complex interactive states, while maintaining high-fidelity synthesis across diverse character styles from a single reference image.
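To make the real-time claim concrete, the sketch below works out the time budget implied by 25 FPS playback: each frame covers 40 ms, so a chunk must be generated in no more than its own playback duration to avoid stalling the stream. The chunk size of 8 frames and the per-chunk generation times are hypothetical values chosen for illustration; they are not figures reported for FlowAct-R1.

```python
def chunk_budget_ms(fps: float, frames_per_chunk: int) -> float:
    """Playback duration of one chunk, i.e. the time available to generate the next one."""
    return 1000.0 * frames_per_chunk / fps


def is_real_time(gen_ms_per_chunk: float, fps: float, frames_per_chunk: int) -> bool:
    """True if generation keeps up with playback at the given frame rate."""
    return gen_ms_per_chunk <= chunk_budget_ms(fps, frames_per_chunk)


if __name__ == "__main__":
    FPS = 25                # frame rate stated on this page
    FRAMES_PER_CHUNK = 8    # hypothetical chunk size, for illustration only

    print(chunk_budget_ms(FPS, 1))                 # per-frame budget in ms
    print(chunk_budget_ms(FPS, FRAMES_PER_CHUNK))  # per-chunk budget in ms
    print(is_real_time(300.0, FPS, FRAMES_PER_CHUNK))  # hypothetical generation time
```

The TTFF of around 1.5s is a separate quantity: it covers the startup cost before the first frame is shown, whereas the budget above applies to every subsequent chunk.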

Method Overview


Overview of FlowAct-R1. The framework has a training stage and an inference stage. Training combines three steps: autoregressive adaptation, which converts the base full-attention DiT into a streaming AR model; joint audio-motion finetuning for better lip-sync and body motion; and multi-stage diffusion distillation. Inference uses a structured memory bank (Reference/Long/Short-term Memory, Denoising Stream) with chunkwise autoregressive generation and memory refinement. Together with system-level optimizations, this achieves real-time 480p video generation at 25 FPS (TTFF ~1.5s) with vivid behavioral transitions.
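The inference loop described above can be illustrated with a toy sketch of chunkwise autoregressive generation over a bounded memory bank. Everything here is an assumption made for illustration: the class and function names, the memory sizes, the rule that evicts the oldest short-term chunk into long-term memory, and the placeholder denoiser are all hypothetical, and the memory refinement step is not modeled. It is not the FlowAct-R1 implementation.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Callable, Deque, Iterable, Iterator, List

Chunk = List[float]  # stand-in for one chunk of video latents


@dataclass
class MemoryBank:
    reference: Chunk        # derived from the single reference image
    short_len: int = 2      # hypothetical capacity, in chunks
    long_len: int = 4       # hypothetical capacity, in chunks
    short_term: Deque[Chunk] = field(default_factory=deque)
    long_term: Deque[Chunk] = field(default_factory=deque)

    def context(self) -> List[Chunk]:
        """Conditioning for the next chunk; its size is bounded by the capacities."""
        return [self.reference, *self.long_term, *self.short_term]

    def push(self, chunk: Chunk) -> None:
        """Store a new chunk; move the oldest short-term chunk to long-term memory (assumed policy)."""
        self.short_term.append(chunk)
        if len(self.short_term) > self.short_len:
            self.long_term.append(self.short_term.popleft())
            if len(self.long_term) > self.long_len:
                self.long_term.popleft()


def stream(
    denoise: Callable[[List[Chunk], float], Chunk],
    bank: MemoryBank,
    audio_chunks: Iterable[float],
) -> Iterator[Chunk]:
    """Generate one chunk per audio segment, conditioning each on the memory bank."""
    for audio in audio_chunks:
        chunk = denoise(bank.context(), audio)  # placeholder for the denoising stream
        bank.push(chunk)
        yield chunk


if __name__ == "__main__":
    # Toy denoiser: reports how many context chunks it was conditioned on.
    toy_denoise = lambda ctx, audio: [float(len(ctx))]
    bank = MemoryBank(reference=[0.0], short_len=2, long_len=1)
    for out in stream(toy_denoise, bank, range(5)):
        print(out)
```

Because the context never grows beyond the reference plus the two bounded memories, the per-chunk cost stays constant however long the stream runs, which is the property that makes arbitrary-duration generation feasible in a scheme of this kind.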

Livestreaming

FlowAct-R1 synthesizes lifelike humanoid videos in a streaming fashion, with naturally expressive behaviors and no limit on duration, enabling seamless interaction.

Video Conferencing

FlowAct-R1 responds quickly during interaction, showing strong potential for real-time, low-latency communication scenarios.

Generalization

Our method is robust to various character and motion styles.

Comparison with SOTA Methods

Experimental Results

FlowAct-R1 outperforms SOTA methods in human preference evaluation by simultaneously achieving real-time streaming, infinite-duration generation, and superior behavioral naturalness.

Experimental results compared with SOTA methods.

The orange segments indicate the percentage of user votes favoring FlowAct-R1 over other methods.

Video Comparison

BibTeX

@article{flowact-r1,
  title={FlowAct-R1: Towards Interactive Humanoid Video Generation},
  author={},
  journal={},
  year={2026}
}