
TAMU-CVRL/CTP


Toward Unified Multimodal Representation Learning for Autonomous Driving

arXiv Hugging Face Model Hugging Face Dataset GitHub

Overview of Contrastive Tensor Pre-training (CTP). We propose a framework that simultaneously aligns multiple modalities within a single similarity tensor.
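The paper's exact tensor construction is not reproduced in this README; as a hedged sketch only, a third-order similarity tensor over three modality embeddings could be formed as below. The function name, the einsum formulation, and the toy shapes are all illustrative assumptions, not the repository's code:

```python
import numpy as np

def similarity_tensor(img, pc, txt):
    """Form a third-order similarity tensor S[i, j, k] from L2-normalized
    image, point-cloud, and text embeddings (illustrative sketch only)."""
    def l2norm(x):
        return x / np.linalg.norm(x, axis=1, keepdims=True)
    img, pc, txt = l2norm(img), l2norm(pc), l2norm(txt)
    # S[i, j, k] = sum_d img[i, d] * pc[j, d] * txt[k, d]
    return np.einsum("id,jd,kd->ijk", img, pc, txt)

# Toy batch: 4 samples per modality, 8-dimensional embeddings.
rng = np.random.default_rng(0)
S = similarity_tensor(rng.normal(size=(4, 8)),
                      rng.normal(size=(4, 8)),
                      rng.normal(size=(4, 8)))
print(S.shape)  # (4, 4, 4)
```

A contrastive objective over such a tensor would then pull the matched entries S[i, i, i] above the mismatched ones; the details of that loss are left to the paper.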

Figure: CTP pipeline overview.

For the “car” class, we project features from three modalities (500 samples) onto a 2D plane.

Figure: 2D projection of “car” features from the three modalities.

To run a quick inference demo, clone the repository, set up the environment, and execute the notebook located at demo/inference.ipynb.

Requirements

Option 1: Automatic Setup

The provided script can be used to set up the environment:

bash ./scripts/setup_env.sh

Option 2: Manual Setup

Create a conda environment named ctp:

conda create -n ctp python=3.10

Then activate the environment and install required libraries:

conda activate ctp

Install PyTorch based on your GPU:

pip install torch torchvision --index-url https://download.pytorch.org/whl/cu126

Install other libraries:

pip install transformers==5.1.0 nuscenes-devkit==1.2.0 pandas==2.3.3 open3d==0.19.0 wandb==0.25.1 tensorboard==2.20.0 git+https://github.com/openai/CLIP.git matplotlib==3.9.4 huggingface_hub==1.8.0 umap-learn==0.5.11 accelerate==1.13.0 beautifulsoup4==4.14.3 typeguard==4.4.4 pyyaml==6.0.3 tqdm==4.67.1 idna==3.11 ipykernel==7.2.0 ipywidgets==8.1.8 pickleshare==0.7.5 jmespath==1.1.0 pyrootutils==1.0.4

Triplet Data Preparation

Triplet data preparation can be divided into two steps. First, we extract annotations, cropped images, and the corresponding point clouds. Then, the images and annotations are fed into a VLM (Qwen3-VL-8B-Instruct) to generate pseudo captions.
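The exact JSONL schema is not documented in this README; as an illustrative sketch (the field names image_path, lidar_path, and caption are assumptions, not the repo's schema), each line of a triplet file holds one record and can be loaded like this:

```python
import json

def load_triplets(jsonl_path):
    """Read one triplet record per line from a JSONL file.
    The field names below are illustrative assumptions, not the repo's schema."""
    with open(jsonl_path) as f:
        return [json.loads(line) for line in f if line.strip()]

# Hypothetical record layout:
# {"image_path": "...", "lidar_path": "...", "caption": "..."}
```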

Dataset Structure

dataset/
├── nuscenes_triplets/
│   ├── nuscenes_image/
│   ├── nuscenes_lidar/
│   ├── nuscenes_triplet_train.jsonl
│   └── nuscenes_triplet_val.jsonl
├── kitti_triplets/
│   ├── kitti_image/
│   ├── kitti_lidar/
│   └── kitti_triplet_train.jsonl
└── waymo_triplets/
    ├── waymo_image/
    ├── waymo_lidar/
    └── waymo_triplet_val.jsonl

The dataset is available on Hugging Face. For more details, please refer to the dataset page.

Training Triplet Dataset

python3 ./TripletBuilder.py --dataset nuscenes --data_path /PATH/TO/NUSCENES/DATASET --split train
python3 ./CaptionGen.py --jsonl_path dataset/nuscenes_triplets/nuscenes_triplet_train.jsonl

Test Triplet Dataset

  • NuScenes
python3 ./TripletBuilder.py --dataset nuscenes --data_path /PATH/TO/NUSCENES/DATASET --split val
python3 ./CaptionGen.py --jsonl_path dataset/nuscenes_triplets/nuscenes_triplet_val.jsonl
  • KITTI
python3 ./TripletBuilder.py --dataset kitti --data_path /PATH/TO/KITTI/DATASET
python3 ./CaptionGen.py --jsonl_path dataset/kitti_triplets/kitti_triplet_train.jsonl
  • Waymo

To generate the Waymo triplet dataset, first create a separate environment named waymo:

conda create -n waymo python=3.10
conda activate waymo

Install the required dependencies:

pip install numpy pandas pyarrow pillow tqdm scipy open3d waymo-open-dataset-tf-2-12-0

Then generate the triplet data and pseudo captions:

python3 ./TripletBuilder_waymo.py --data_path /PATH/TO/WMOD/DATASET --segment_filter {0..49}
python3 ./CaptionGen.py --jsonl_path dataset/waymo_triplets/waymo_triplet_val.jsonl

After data generation finishes, switch back to the ctp environment:

conda activate ctp

Training Models

To train a model, provide a configuration file. The configuration files live in the ./configs folder and can be modified there.

python3 ./train.py --config configs/default.yaml

Configuration Options:

  • masked (True / False): Whether to use the masking strategy.
  • pc_only (True / False): Whether to train only the point cloud encoder or all encoders.
  • use_tb (True / False): Whether to enable TensorBoard logging.
  • use_wandb (True / False): Whether to enable Weights & Biases logging. Run wandb login first to authenticate.
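Putting the options above together, a hypothetical configs/default.yaml might look like the sketch below. Only the four documented flags and checkpoint_path (referenced in the Evaluation section) are grounded in this README; any other keys the real file contains are not shown:

```yaml
# Illustrative sketch only; the real configs/default.yaml may contain
# additional keys not documented in this README.
masked: true
pc_only: false
use_tb: false
use_wandb: true        # run `wandb login` first
# checkpoint_path: PATH/TO/CHECKPOINT   # set before running evaluation
```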

Checkpoints are available on Hugging Face.

Evaluation

Zero-shot Classification Accuracy

To evaluate a trained model, first set checkpoint_path in the configuration file. Then choose an evaluation dataset from the following options:

  • dataset/nuscenes_triplets/nuscenes_triplet_val.jsonl
  • dataset/kitti_triplets/kitti_triplet_train.jsonl
  • dataset/waymo_triplets/waymo_triplet_val.jsonl

For example:

python3 ./eval_acc.py --config configs/default.yaml --eval_path dataset/nuscenes_triplets/nuscenes_triplet_val.jsonl --tau 0.5

The parameter tau controls modality usage during evaluation:

  • tau = 0: Only the point cloud modality is used.
  • tau = 1: Only the image modality is used.
  • tau = 0.5: Both modalities are jointly evaluated.
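How tau combines the two modalities is not specified here; one natural reading, shown as a hedged sketch, is a convex combination of per-modality similarity scores (an assumption, not the repository's implementation):

```python
def blend_scores(sim_pc, sim_img, tau):
    """Convex combination of point-cloud and image similarity scores.
    tau = 0 -> point cloud only; tau = 1 -> image only (illustrative sketch)."""
    if not 0.0 <= tau <= 1.0:
        raise ValueError("tau must lie in [0, 1]")
    return [(1.0 - tau) * p + tau * i for p, i in zip(sim_pc, sim_img)]

# With tau = 0.5, each class score is the average of the two modalities.
print(blend_scores([0.8, 0.2], [0.4, 0.6], 0.5))
```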

Alignment

To evaluate the alignment effect, high-dimensional features are projected onto a 2D plane to compare representations before and after alignment.

Run the example command:

python3 ./eval_align.py --config configs/default.yaml --eval_path dataset/nuscenes_triplets/nuscenes_triplet_train.jsonl --after_ckpt PATH/TO/CHECKPOINT --label car

Arguments:

  • --eval_path
    Supported evaluation datasets:

    • dataset/nuscenes_triplets/nuscenes_triplet_val.jsonl
    • dataset/kitti_triplets/kitti_triplet_train.jsonl
    • dataset/waymo_triplets/waymo_triplet_val.jsonl
  • --after_ckpt: Path to the checkpoint file you want to evaluate.

  • --label: Object category used to visualize alignment effects. Supported labels include:

    • "car"
    • "truck"
    • "pedestrian"
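The projection method used by eval_align.py is not shown here (the dependency list includes umap-learn, so UMAP is plausible); as a self-contained stand-in, a PCA-style 2D projection of high-dimensional features can be sketched with plain NumPy:

```python
import numpy as np

def project_2d(features):
    """Project (n, d) features onto 2D via PCA (SVD of centered data).
    A stand-in sketch; the repository may use UMAP instead."""
    X = features - features.mean(axis=0, keepdims=True)
    # Right singular vectors are the principal directions.
    _, _, vt = np.linalg.svd(X, full_matrices=False)
    return X @ vt[:2].T

# 500 samples of 256-dim features, matching the 500-sample figure above.
emb = np.random.default_rng(0).normal(size=(500, 256))
xy = project_2d(emb)
print(xy.shape)  # (500, 2)
```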

BibTeX

If you find this work useful for your research, please cite our work:

@misc{tao2026ctp,
      title={Toward Unified Multimodal Representation Learning for Autonomous Driving}, 
      author={Ximeng Tao and Dimitar Filev and Gaurav Pandey},
      year={2026},
      eprint={2603.07874},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2603.07874}, 
}
