This is the official repository for the CVPR'26 paper 4D-RGPT: Toward Region-level 4D Understanding via Perceptual Distillation.
Chiao-An Yang*, Ryo Hachiuma, Sifei Liu, Subhashree Radhakrishnan, Raymond A. Yeh, Yu-Chiang Frank Wang, Min-Hung Chen
(*Work done during the internship at NVIDIA Research)
Despite advances in Multimodal LLMs (MLLMs), their ability to reason over 3D structures and temporal dynamics remains limited, constrained by weak 4D perception and temporal understanding. Furthermore, existing 3D and 4D Video Question Answering (VQA) benchmarks emphasize static scenes and lack region-level prompting.
To tackle these issues, we introduce:
- 4D-RGPT: A specialized MLLM designed to capture 4D representations from video inputs with enhanced temporal and spatial perception.
- Perceptual 4D Distillation (P4D): A training-only framework that transfers 4D representations (e.g., depth, optical flow) from a frozen expert model into 4D-RGPT for comprehensive 4D perception—without introducing any additional inference cost.
- R4D-Bench: A rigorous benchmark for depth-aware dynamic scenes featuring region-level prompting, built via a hybrid automated and human-verified pipeline.
Our experiments demonstrate that 4D-RGPT achieves notable improvements over strong baselines on existing 3D/4D benchmarks (+5.3% on average across 6 benchmarks) as well as our proposed region-based R4D-Bench (+4.3%).
- [Upcoming] Training and Inference Code release.
- [Upcoming] 4D-RGPT Model Weights release.
- [Apr. 2026] R4D-Bench Dataset release.
- [Feb. 2026] 🔥🔥 4D-RGPT is accepted to CVPR 2026! 🎉🎉
- [Dec. 2025] Paper, Project Page, and Hugging Face page released.
(Please watch/star this repository to stay updated on the code and model releases!)
Please follow our HF dataset instructions here: https://huggingface.co/datasets/nvidia/R4D-Bench.
Please follow the official instructions of the datasets used in the paper: STI-Bench, VLM4D, OmniSpatial, MMSI-Bench, SAT, and VSTI-Bench.
If you find our work useful, please consider giving a star and citation:
@article{yang20254d,
title={4D-RGPT: Toward Region-level 4D Understanding via Perceptual Distillation},
author={Yang, Chiao-An and Hachiuma, Ryo and Liu, Sifei and Radhakrishnan, Subhashree and Yeh, Raymond A and Wang, Yu-Chiang Frank and Chen, Min-Hung},
journal={arXiv preprint arXiv:2512.17012},
year={2025}
}Copyright © 2026, NVIDIA Corporation. All rights reserved.
This work is made available under the NVIDIA Source Code License-NC. Click here to view a copy of this license.
