Overview of Contrastive Tensor Pre-training (CTP). We propose a framework that simultaneously aligns multiple modalities in a similarity tensor.
For the “car” class, we project features from three modalities (500 samples) onto a 2D plane.
To run a quick inference demo, clone the repository, set up the environment, and execute the notebook located at demo/inference.ipynb.
The provided script can be used to set up the environment:
```shell
bash ./scripts/setup_env.sh
```

We can create a conda environment named ctp:

```shell
conda create -n ctp python=3.10
```

Then activate the environment and install the required libraries:

```shell
conda activate ctp
```

Install PyTorch based on your GPU:

```shell
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu126
```

Install the other libraries:

```shell
pip install transformers==5.1.0 nuscenes-devkit==1.2.0 pandas==2.3.3 open3d==0.19.0 wandb==0.25.1 tensorboard==2.20.0 git+https://github.com/openai/CLIP.git matplotlib==3.9.4 huggingface_hub==1.8.0 umap-learn==0.5.11 accelerate==1.13.0 beautifulsoup4==4.14.3 typeguard==4.4.4 pyyaml==6.0.3 tqdm==4.67.1 idna==3.11 ipykernel==7.2.0 ipywidgets==8.1.8 pickleshare==0.7.5 jmespath==1.1.0 pyrootutils==1.0.4
```

Triplet data preparation is divided into two steps. First, we extract annotations, cropped images, and the corresponding point clouds. Then, the images and annotations are fed into a VLM (Qwen3-VL-8B-Instruct) to generate pseudo captions.
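Conceptually, each line of a generated triplet `.jsonl` file pairs one cropped image with its point cloud and a pseudo caption. The record below is a purely hypothetical sketch of such a line; the actual field names and schema are defined by `TripletBuilder.py`:

```python
import json

# Hypothetical triplet record; the real schema is defined by TripletBuilder.py.
line = json.dumps({
    "image_path": "dataset/nuscenes_triplets/nuscenes_image/000001.jpg",
    "lidar_path": "dataset/nuscenes_triplets/nuscenes_lidar/000001.bin",
    "caption": "A white car parked on the right side of the road.",
    "label": "car",
})

# Each .jsonl file is then just one such JSON object per line.
record = json.loads(line)
print(record["label"])  # car
```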
```
dataset/
├── nuscenes_triplets/
│   ├── nuscenes_image/
│   ├── nuscenes_lidar/
│   ├── nuscenes_triplet_train.jsonl
│   └── nuscenes_triplet_val.jsonl
├── kitti_triplets/
│   ├── kitti_image/
│   ├── kitti_lidar/
│   └── kitti_triplet_train.jsonl
└── waymo_triplets/
    ├── waymo_image/
    ├── waymo_lidar/
    └── waymo_triplet_val.jsonl
```
The dataset is available on Hugging Face. For more details, please refer to the dataset page.
- NuScenes

```shell
# Train split
python3 ./TripletBuilder.py --dataset nuscenes --data_path /PATH/TO/NUSCENES/DATASET --split train
python3 ./CaptionGen.py --jsonl_path dataset/nuscenes_triplets/nuscenes_triplet_train.jsonl

# Val split
python3 ./TripletBuilder.py --dataset nuscenes --data_path /PATH/TO/NUSCENES/DATASET --split val
python3 ./CaptionGen.py --jsonl_path dataset/nuscenes_triplets/nuscenes_triplet_val.jsonl
```

- KITTI

```shell
python3 ./TripletBuilder.py --dataset kitti --data_path /PATH/TO/KITTI/DATASET
python3 ./CaptionGen.py --jsonl_path dataset/kitti_triplets/kitti_triplet_train.jsonl
```

- Waymo
To generate the Waymo triplet dataset, first create a separate environment named waymo:
```shell
conda create -n waymo python=3.10
conda activate waymo
```

Install the required dependencies:

```shell
pip install numpy pandas pyarrow pillow tqdm scipy open3d waymo-open-dataset-tf-2-12-0
```

Then generate the triplet data and pseudo captions:

```shell
python3 ./TripletBuilder_waymo.py --data_path /PATH/TO/WMOD/DATASET --segment_filter {0..49}
python3 ./CaptionGen.py --jsonl_path dataset/waymo_triplets/waymo_triplet_val.jsonl
```

After finishing the data generation, we can switch back to the ctp environment:

```shell
conda activate ctp
```

To train a model, simply provide a configuration file. The configuration files can be modified in the ./configs folder.
```shell
python3 ./train.py --config configs/default.yaml
```

Configuration Options:
- `masked` (True/False): Whether to use the masking strategy.
- `pc_only` (True/False): Whether to train only the point cloud encoder or all encoders.
- `use_tb` (True/False): Whether to enable TensorBoard logging.
- `use_wandb` (True/False): Whether to enable Weights & Biases logging. Run `wandb login` first to authenticate.
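For reference, these flags might combine in `configs/default.yaml` as sketched below. Only the four options above and `checkpoint_path` (used later for evaluation) come from this README; the value shown for `checkpoint_path` is a placeholder, and the real file may contain additional keys:

```yaml
# Sketch of configs/default.yaml based on the options documented above.
masked: true        # use the masking strategy
pc_only: false      # train all encoders, not only the point cloud encoder
use_tb: true        # TensorBoard logging
use_wandb: false    # Weights & Biases logging (run `wandb login` first)
checkpoint_path: checkpoints/ctp_best.pt   # placeholder path, set before evaluation
```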
Checkpoints are available on Hugging Face.
To evaluate a trained model, first set `checkpoint_path` in the configuration file. Then choose an evaluation dataset from the following options:

- dataset/nuscenes_triplets/nuscenes_triplet_val.jsonl
- dataset/kitti_triplets/kitti_triplet_train.jsonl
- dataset/waymo_triplets/waymo_triplet_val.jsonl
For example:
```shell
python3 ./eval_acc.py --config configs/default.yaml --eval_path dataset/nuscenes_triplets/nuscenes_triplet_val.jsonl --tau 0.5
```

The parameter `tau` controls modality usage during evaluation:
- `tau = 0`: Only the point cloud modality is used.
- `tau = 1`: Only the image modality is used.
- `tau = 0.5`: Both modalities are jointly evaluated.
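One natural reading of `tau` is a linear blend between the two modalities' similarity scores. The sketch below illustrates that interpretation; the interpolation formula is an assumption, and `eval_acc.py` defines the actual behavior:

```python
def combined_similarity(pc_sim, img_sim, tau):
    """Linear blend of per-modality similarity scores (illustrative only).

    tau = 0   -> point cloud similarity only
    tau = 1   -> image similarity only
    tau = 0.5 -> both modalities weighted equally
    """
    return (1.0 - tau) * pc_sim + tau * img_sim

pc_sim, img_sim = 0.8, 0.2
print(combined_similarity(pc_sim, img_sim, 0.0))  # 0.8 (point cloud only)
print(combined_similarity(pc_sim, img_sim, 1.0))  # 0.2 (image only)
print(combined_similarity(pc_sim, img_sim, 0.5))  # equal-weight blend
```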
To evaluate the alignment effect, high-dimensional features are projected onto a 2D plane to compare representations before and after alignment.
Run the example command:
```shell
python3 ./eval_align.py --config configs/default.yaml --eval_path dataset/nuscenes_triplets/nuscenes_triplet_train.jsonl --after_ckpt PATH/TO/CHECKPOINT --label car
```

Arguments:
- `--eval_path`: Supported evaluation datasets:
  - dataset/nuscenes_triplets/nuscenes_triplet_val.jsonl
  - dataset/kitti_triplets/kitti_triplet_train.jsonl
  - dataset/waymo_triplets/waymo_triplet_val.jsonl
- `--after_ckpt`: Path to the checkpoint file you want to evaluate.
- `--label`: Object category used to visualize alignment effects. Supported labels include `"car"`, `"truck"`, and `"pedestrian"`.
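The before/after comparison reduces to projecting each modality's high-dimensional features onto 2D. Below is a minimal sketch of that projection step using plain PCA via SVD in NumPy; the function name and shapes are illustrative, and since the environment installs `umap-learn`, the actual `eval_align.py` may use UMAP instead:

```python
import numpy as np

def project_2d(features):
    """Project features onto their top-2 principal components
    (a simple PCA stand-in for the script's 2D projection step)."""
    centered = features - features.mean(axis=0)
    # Rows of vt are the principal directions of the centered data.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T

rng = np.random.default_rng(0)
feats = rng.normal(size=(500, 512))   # e.g. 500 "car" samples, 512-d features
coords = project_2d(feats)
print(coords.shape)  # (500, 2)
```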
If you find this work useful for your research, please cite our work:
```bibtex
@misc{tao2026ctp,
      title={Toward Unified Multimodal Representation Learning for Autonomous Driving},
      author={Ximeng Tao and Dimitar Filev and Gaurav Pandey},
      year={2026},
      eprint={2603.07874},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2603.07874},
}
```
