🔮 Spaeing the Unseen
Haoyi Jiang1,
Liu Liu2, Xinjie Wang2, Yonghao He3,
Wei Sui3, Zhizhong Su2,
Wenyu Liu1, Xinggang Wang1
1Huazhong University of Science & Technology,
2Horizon Robotics,
3D-Robotics
Please clone this project with the `--recursive` flag so that the submodules under `submodules/` are fetched as well, then install the dependencies:
```shell
pip install -r requirements.txt
pip install flash-attn --no-build-isolation
pip install submodules/vggt
pip install -e submodules/lmms-eval
```
We train on a combination of large-scale indoor scene datasets, ScanNet and ScanNet++:
- Video-centric VSI-Bench: We fine-tune our model on the VSI-590K dataset.
- Image-based benchmarks: We use a composite training set aligned with VG-LLM.
Our processed annotations are available here. Please configure the local dataset and annotation paths in `data/__init__.py` before starting training.
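As a concrete illustration, the path configuration in `data/__init__.py` might look like the sketch below. Every variable name and path here is a placeholder assumed for this example, not the project's actual settings:

```python
# Hypothetical path configuration for data/__init__.py.
# All names and paths below are placeholders: point them at your local
# copies of the datasets and the processed annotations before training.
DATA_ROOTS = {
    "scannet": "/path/to/scannet",      # ScanNet scenes
    "scannetpp": "/path/to/scannetpp",  # ScanNet++ scenes
}
ANNOTATION_ROOT = "/path/to/annotations"  # downloaded processed annotations
```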
To train the Predictive Spatial Field Modeling (PSFM) framework from scratch:
```shell
export PYTHONPATH=.
python scripts/train_spa3r.py
```
Then set the pre-trained Spa3R checkpoint path in the script (`geometry_encoder_path=/path/to/spa3r.ckpt`) and launch supervised fine-tuning of the VLM:

```shell
bash scripts/train_vlm_sft.sh
```
To evaluate Spa3-VLM on spatial reasoning benchmarks:
```shell
bash scripts/eval_vlm.sh
```
If you find our work helpful for your research, please consider starring this repository ⭐ and citing our work:
```bibtex
@article{Spa3R,
  title={Spa3R: Predictive Spatial Field Modeling for 3D Visual Reasoning},
  author={Haoyi Jiang and Liu Liu and Xinjie Wang and Yonghao He and Wei Sui and Zhizhong Su and Wenyu Liu and Xinggang Wang},
  journal={arXiv preprint arXiv:2602.21186},
  year={2026}
}
```

This project is released under the MIT License.