MTVCrafter: 4D Motion Tokenization for Open-World Human Image Animation

🔗 Project Page | 📄 arXiv | 💻 Code | 🤗 Hugging Face Model

Abstract

Human image animation has gained increasing attention and developed rapidly due to its broad applications in digital humans. However, existing methods rely entirely on 2D-rendered pose images for motion guidance, which limits generalization and discards essential 3D information for open-world animation. To tackle this problem, we propose MTVCrafter (Motion Tokenization Video Crafter), the first framework that directly models raw 3D motion sequences for human image animation. Specifically, we introduce 4DMoT (4D motion tokenizer) to quantize 3D motion sequences into 4D motion tokens. Compared to 2D-rendered pose images, 4D motion tokens offer more faithful spatio-temporal cues and avoid strict pixel-level alignment between the pose image and the character, enabling more flexible and disentangled control. We then introduce MV-DiT (Motion-aware Video DiT). With a dedicated motion attention module and 4D positional encodings, MV-DiT can effectively leverage motion tokens as compact yet expressive 4D context for human image animation in the complex 3D world. Hence, MTVCrafter marks a significant step forward in this field and opens a new direction for pose-guided human video generation. Experiments show that MTVCrafter achieves state-of-the-art results with an FID-VID of 6.98, surpassing the second-best by 65%. Powered by robust motion tokens, MTVCrafter also generalizes well to diverse open-world characters (single or multiple, full- or half-body) across various styles and scenarios.

Motivation


Our motivation is that directly tokenizing 4D motion captures more faithful and expressive information than traditional 2D-rendered pose images derived from the driving video.

Method

Our MTVCrafter comprises a 4D motion tokenizer (4DMoT) and a motion-aware video diffusion transformer (MV-DiT). The 4DMoT encodes raw 4D motion into compact and expressive motion tokens, while the MV-DiT integrates these tokens into a powerful video DiT backbone via 4D motion attention and 4D positional encodings.
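
For intuition, below is a minimal pseudocode sketch of the two-stage pipeline at inference time. The names (`MotionTokenizer`, `mv_dit.sample`, `animate`, ...) are placeholders assumed for illustration, not the actual API of this repository:

```python
# High-level sketch of the two-stage MTVCrafter pipeline at inference time.
# All class and method names below are illustrative placeholders.
import torch

def animate(reference_image: torch.Tensor,   # (3, H, W) source character
            driving_motion: torch.Tensor,    # (T, J, 3) raw SMPL joint sequence
            tokenizer, mv_dit):
    # Stage 1: quantize the raw 4D motion into discrete motion tokens.
    with torch.no_grad():
        _, motion_tokens = tokenizer(driving_motion.unsqueeze(0))   # (1, T, J)

    # Stage 2: the motion-aware video DiT denoises a latent video conditioned on
    # the reference image and the motion tokens (via 4D motion attention).
    video = mv_dit.sample(image=reference_image.unsqueeze(0),
                          motion_tokens=motion_tokens)
    return video                                                    # (1, T, 3, H, W)
```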

(Method figure 1)

(1) Our 4D motion tokenizer consists of an encoder-decoder framework that learns spatio-temporal latent representations of SMPL motion sequences, and a vector quantizer that learns discrete tokens in a unified space. All operations are performed in 2D space along the frame and joint axes.
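
Below is a minimal PyTorch sketch of such a tokenizer: 2D convolutions over the (frame, joint) axes, a nearest-neighbour vector quantizer, and a mirrored decoder. The module names and hyper-parameters (`hidden_dim`, `codebook_size`, ...) are illustrative assumptions, not the released implementation:

```python
# Illustrative sketch of a 4D motion tokenizer (encoder + vector quantizer + decoder).
# Hyper-parameters and layer choices are assumptions for exposition only.
import torch
import torch.nn as nn


class VectorQuantizer(nn.Module):
    """Nearest-neighbour lookup into a learned codebook (standard VQ-VAE style)."""

    def __init__(self, codebook_size=8192, dim=256):
        super().__init__()
        self.codebook = nn.Embedding(codebook_size, dim)

    def forward(self, z):                                   # z: (B, T, J, D)
        flat = z.reshape(-1, z.shape[-1])                   # (B*T*J, D)
        dists = torch.cdist(flat, self.codebook.weight)     # pairwise L2 distances
        indices = dists.argmin(dim=-1)                      # discrete motion tokens
        z_q = self.codebook(indices).view_as(z)
        # straight-through estimator so gradients flow back to the encoder
        z_q = z + (z_q - z).detach()
        return z_q, indices.view(z.shape[:-1])


class MotionTokenizer(nn.Module):
    """Encoder-decoder over an SMPL joint sequence, with 2D convs along (frame, joint) axes."""

    def __init__(self, in_dim=3, hidden_dim=256):
        super().__init__()
        # input motion: (B, T, J, 3) -> channels-first (B, 3, T, J)
        self.encoder = nn.Sequential(
            nn.Conv2d(in_dim, hidden_dim, kernel_size=3, padding=1),
            nn.SiLU(),
            nn.Conv2d(hidden_dim, hidden_dim, kernel_size=3, padding=1),
        )
        self.quantizer = VectorQuantizer(dim=hidden_dim)
        self.decoder = nn.Sequential(
            nn.Conv2d(hidden_dim, hidden_dim, kernel_size=3, padding=1),
            nn.SiLU(),
            nn.Conv2d(hidden_dim, in_dim, kernel_size=3, padding=1),
        )

    def forward(self, motion):                               # motion: (B, T, J, 3)
        x = motion.permute(0, 3, 1, 2)                       # (B, 3, T, J)
        z = self.encoder(x).permute(0, 2, 3, 1)              # (B, T, J, D)
        z_q, tokens = self.quantizer(z)
        recon = self.decoder(z_q.permute(0, 3, 1, 2)).permute(0, 2, 3, 1)
        return recon, tokens                                 # reconstruction + token ids
```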

(Method figure 2)

(2) Based on the video DiT architecture, we design a 4D motion attention module to combine motion tokens with vision tokens. Since patchification disrupts positional information, we introduce 4D RoPE to recover the spatio-temporal relationships. To further improve generation quality and generalization, we use learnable unconditional tokens for motion classifier-free guidance.
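
A hedged sketch of what such a motion attention layer could look like is given below: the channel dimension is split into four groups, each rotated by one coordinate axis (a common axis-wise RoPE factorization), and learned unconditional tokens replace the motion condition when it is dropped for classifier-free guidance. The exact factorization, shapes, and names here are assumptions for exposition, not the released design:

```python
# Illustrative sketch of motion attention with axis-wise 4D RoPE and learnable
# unconditional tokens for classifier-free guidance. Not the released MTVCrafter code.
import torch
import torch.nn as nn
import torch.nn.functional as F


def rope_rotate(x, pos, base=10000.0):
    """Apply a 1D rotary embedding to x (..., d) using positions pos (...,)."""
    d = x.shape[-1]
    freqs = base ** (-torch.arange(0, d, 2, device=x.device).float() / d)  # (d/2,)
    angles = pos[..., None] * freqs                                         # (..., d/2)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]
    return torch.stack([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1).flatten(-2)


def rope_4d(x, coords):
    """Split channels into 4 groups and rotate each by one coordinate axis.

    x:      (B, N, d) queries or keys
    coords: (B, N, 4) per-token 4D positions (assumed here, e.g. (t, x, y, z))
    """
    chunks = x.chunk(4, dim=-1)
    return torch.cat([rope_rotate(c, coords[..., i]) for i, c in enumerate(chunks)], dim=-1)


class MotionAttention(nn.Module):
    """Vision tokens attend to motion tokens; unconditional tokens enable motion CFG."""

    def __init__(self, dim=1024, num_motion_tokens=256):
        super().__init__()
        self.to_q = nn.Linear(dim, dim)
        self.to_kv = nn.Linear(dim, 2 * dim)
        self.uncond_tokens = nn.Parameter(torch.randn(1, num_motion_tokens, dim) * 0.02)

    def forward(self, vision_tokens, motion_tokens, vision_coords, motion_coords,
                drop_motion=False):
        q = rope_4d(self.to_q(vision_tokens), vision_coords)
        if drop_motion:
            # replace the motion condition with learned unconditional tokens (for CFG)
            null = self.uncond_tokens.expand(vision_tokens.shape[0], -1, -1)
            k, v = self.to_kv(null).chunk(2, dim=-1)
        else:
            k, v = self.to_kv(motion_tokens).chunk(2, dim=-1)
            k = rope_4d(k, motion_coords)
        # single-head attention for brevity; a real block would be multi-head
        out = F.scaled_dot_product_attention(q.unsqueeze(1), k.unsqueeze(1), v.unsqueeze(1))
        return out.squeeze(1)
```

Rotary embeddings inject relative position directly into the attention logits, which is why an axis-wise 4D variant is a natural way to restore the spatio-temporal structure lost during patchification; the coordinate convention above is only an assumed example.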

Animation