Hi, thank you for sharing this great work and releasing the resources.
I noticed that during training, BEV images and metadata are constructed from ground-truth 3D annotations, while during testing they are generated from reconstructed 3D scenes (which may be noisy).
Does this train-test discrepancy introduce a distribution gap? How did you account for or mitigate this potential mismatch?
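To make the concern concrete, here is a minimal, purely illustrative sketch (NumPy only; the noise magnitude, box representation, and rasterization scheme are my own assumptions, not taken from the paper) of how reconstruction noise on box centers could shift the BEV inputs seen at test time relative to the GT-derived ones used in training:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy GT box centers: (x, y) in meters within a 40 m x 40 m scene.
gt_centers = rng.uniform(-20, 20, size=(30, 2))

# Simulated reconstruction noise: ~0.3 m std on centers (assumed magnitude).
noisy_centers = gt_centers + rng.normal(0.0, 0.3, size=gt_centers.shape)

def bev_occupancy(centers, res=0.5, extent=20.0):
    """Rasterize box centers into a coarse BEV occupancy grid."""
    size = int(2 * extent / res)
    grid = np.zeros((size, size), dtype=bool)
    ij = np.clip(((centers + extent) / res).astype(int), 0, size - 1)
    grid[ij[:, 1], ij[:, 0]] = True
    return grid

gt_bev = bev_occupancy(gt_centers)
noisy_bev = bev_occupancy(noisy_centers)

# Overlap between the two grids -- a crude proxy for how far the
# test-time BEV inputs drift from the training distribution.
iou = (gt_bev & noisy_bev).sum() / (gt_bev | noisy_bev).sum()
print(f"BEV occupancy IoU between GT and noisy boxes: {iou:.2f}")
```

If even moderate noise noticeably lowers this overlap, that would suggest the model sees meaningfully different BEV inputs at test time, which is the gap I am asking about.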
Thanks in advance!