[CVPR 2026] 4D-RGPT: Toward Region-level 4D Understanding via Perceptual Distillation

This is the official repository for the CVPR'26 paper 4D-RGPT: Toward Region-level 4D Understanding via Perceptual Distillation.

Chiao-An Yang*, Ryo Hachiuma, Sifei Liu, Subhashree Radhakrishnan, Raymond A. Yeh, Yu-Chiang Frank Wang, Min-Hung Chen
(*Work done during an internship at NVIDIA Research)

📖 Introduction

Despite advances in Multimodal LLMs (MLLMs), their ability to reason over 3D structures and temporal dynamics remains limited, constrained by weak 4D perception and temporal understanding. Furthermore, existing 3D and 4D Video Question Answering (VQA) benchmarks emphasize static scenes and lack region-level prompting.

To tackle these issues, we introduce:

4D-RGPT: A specialized MLLM designed to capture 4D representations from video inputs with enhanced temporal and spatial perception.
Perceptual 4D Distillation (P4D): A training-only framework that transfers 4D representations (e.g., depth, optical flow) from a frozen expert model into 4D-RGPT for comprehensive 4D perception, without adding any inference cost.
R4D-Bench: A rigorous benchmark for depth-aware dynamic scenes featuring region-level prompting, built via a hybrid automated and human-verified pipeline.
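To make the training-only idea behind P4D concrete, here is a minimal sketch of a generic distillation objective. It is not the loss used in the paper. The function names, the L1 distance, the `distill_weight` value, and the toy numbers are all assumptions chosen for illustration. The teacher signals stand in for outputs of a frozen expert (e.g., depth, optical flow). Because the distillation term appears only in the training loss, nothing extra has to run at inference.

```python
from typing import Dict, Sequence


def l1_distance(student: Sequence[float], teacher: Sequence[float]) -> float:
    """Mean absolute difference between two equally sized, flattened signals."""
    if len(student) != len(teacher):
        raise ValueError("student and teacher signals must have the same length")
    if len(student) == 0:
        raise ValueError("signals must be non-empty")
    return sum(abs(s - t) for s, t in zip(student, teacher)) / len(student)


def training_loss(
    task_loss: float,
    student_signals: Dict[str, Sequence[float]],
    teacher_signals: Dict[str, Sequence[float]],
    distill_weight: float = 0.5,  # hypothetical weighting, not from the paper
) -> float:
    """Task loss plus a weighted distillation term over every teacher signal.

    The teacher is treated as frozen: its signals are plain constants here,
    so only the student side would receive gradients in a real framework.
    """
    distill = sum(
        l1_distance(student_signals[name], teacher_signals[name])
        for name in teacher_signals
    )
    return task_loss + distill_weight * distill


# Toy, made-up values standing in for flattened depth and flow predictions.
student = {"depth": [1.0, 2.0, 4.0], "flow": [0.0, 0.5]}
teacher = {"depth": [1.0, 2.5, 3.0], "flow": [0.5, 0.5]}

loss = training_loss(task_loss=2.0, student_signals=student, teacher_signals=teacher)
print(loss)
```

At inference time only the student model is called, so the teacher and the distillation term can be dropped entirely.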

Our experiments demonstrate that 4D-RGPT achieves notable improvements over strong baselines on existing 3D/4D benchmarks (+5.3% on average across 6 benchmarks) as well as our proposed region-based R4D-Bench (+4.3%).

💥 News 💥

[Upcoming] Training and Inference Code release.
[Upcoming] 4D-RGPT Model Weights release.
[Apr. 2026] R4D-Bench Dataset release.
[Feb. 2026] 🔥🔥 4D-RGPT is accepted to CVPR 2026! 🎉🎉
[Dec. 2025] Paper, Project Page, and Hugging Face page released.

(Please watch/star this repository to stay updated on the code and model releases!)

Dataset Preparation

R4D-Bench

Please follow our HF dataset instructions here: https://huggingface.co/datasets/nvidia/R4D-Bench.
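As a convenience, the dataset repository can likely be fetched with the `huggingface_hub` client. The sketch below assumes `huggingface_hub` is installed and that the dataset is accessible to your account (it may require accepting terms or logging in first). The helper name `download_r4d_bench` and the default `local_dir` are hypothetical, and the dataset card remains the authoritative source for the expected layout.

```python
REPO_ID = "nvidia/R4D-Bench"  # dataset repository named in the link above


def download_r4d_bench(local_dir: str = "data/R4D-Bench") -> str:
    """Download a snapshot of the dataset repository and return its local path.

    The import is done lazily so that this module can be loaded even when
    huggingface_hub is not installed.
    """
    from huggingface_hub import snapshot_download

    return snapshot_download(
        repo_id=REPO_ID,
        repo_type="dataset",  # required because this is a dataset, not a model
        local_dir=local_dir,
    )


if __name__ == "__main__":
    path = download_r4d_bench()
    print(f"R4D-Bench files downloaded to: {path}")
```

Any further preparation steps (for example, fetching source videos) should follow the instructions on the dataset page.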

Standard 4D/Spatial Benchmarks

Please follow the official instructions of the datasets used in the paper: STI-Bench, VLM4D, OmniSpatial, MMSI-Bench, SAT, and VSTI-Bench.

Citation

If you find our work useful, please consider starring this repository and citing our paper:

@article{yang20254d,
  title={4D-RGPT: Toward Region-level 4D Understanding via Perceptual Distillation},
  author={Yang, Chiao-An and Hachiuma, Ryo and Liu, Sifei and Radhakrishnan, Subhashree and Yeh, Raymond A and Wang, Yu-Chiang Frank and Chen, Min-Hung},
  journal={arXiv preprint arXiv:2512.17012},
  year={2025}
}

Licenses

This work is made available under the NVIDIA Source Code License-NC. See the LICENSE file in this repository for a copy of the license.
