NaviDriveVLM: Decoupling High-Level Reasoning and Motion Planning for Autonomous Driving


Overview

Overview of NaviDriveVLM. The system is decoupled into two modules: the Navigator and the Driver. The Navigator is a large-scale VLM responsible for scene understanding and high-level reasoning. The Driver is a lightweight VLM whose small size enables efficient supervised fine-tuning (SFT) as a driving expert for future waypoint prediction.

Examples

To run a quick inference demo, clone the repository, set up the environment, and execute the notebook located at demo/inference.ipynb.

Requirements

Option 1: Automatic Setup

The provided script can be used to set up the environment:

bash ./scripts/setup_env.sh

Option 2: Manual Setup

Create a new Conda environment named navidrive:

conda create -n navidrive python=3.10

Then activate the environment and install required libraries:

conda activate navidrive

Install PyTorch based on your GPU:

pip install torch torchvision --index-url https://download.pytorch.org/whl/cu126

Install the remaining dependencies:

pip install transformers==5.1.0 datasets==4.5.0 accelerate==1.12.0 peft==0.18.1 "bitsandbytes>=0.46.1" opencv-python==4.11.0.86 nuscenes-devkit==1.2.0 qwen-vl-utils==0.0.14 beautifulsoup4==4.14.3 typeguard==4.5.1 wandb==0.25.1 tensorboard==2.20.0 ipykernel==7.2.0 ipywidgets==8.1.8 pickleshare==0.7.5 pyrootutils==1.0.4 jmespath==1.1.0
  • FlashAttention is optional and is toggled in the configuration YAML file. If you would like to enable it, please follow the FlashAttention README for installation instructions.
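After installation, a quick stdlib-only sanity check can confirm that the pinned packages above are actually present in the active environment (a convenience sketch, not part of the repository):

```python
import importlib.metadata as md

# Returns {package: installed version, or None if missing} for the
# packages pinned in the pip command above.
def check_versions(packages):
    versions = {}
    for pkg in packages:
        try:
            versions[pkg] = md.version(pkg)
        except md.PackageNotFoundError:
            versions[pkg] = None
    return versions

print(check_versions(["transformers", "datasets", "accelerate", "peft"]))
```

Any `None` entry points to a package that still needs installing before training or inference will work.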

Dataset

The data folder already includes several pairs of .jsonl files used for training and evaluation, generated from the nuScenes dataset.

  • nuscenes_reasons_Qwen_32B and nuscenes_reasons_val_Qwen_32B are generated by Qwen3-VL-32B-Instruct.
  • nuscenes_reasons_Qwen_8B and nuscenes_reasons_val_Qwen_8B are generated by Qwen3-VL-8B-Instruct.
  • nuscenes_reasons_Gemini and nuscenes_reasons_val_Gemini are generated by Gemini-2.5-Flash.

If you would like to generate reasoning data from the nuScenes dataset, run the following command:

python3 naviGen_Qwen.py --model_id Qwen/Qwen3-VL-32B-Instruct --output_file data/nuscenes_reasons_Qwen_32B.jsonl --data_path /PATH/TO/NUSCENES/DATASET --version v1.0-trainval --is_train 0

Arguments:

  • --model_id: Model ID from Hugging Face.
  • --output_file: Path to the output .jsonl file.
  • --data_path: Path to your nuScenes dataset.
  • --version: nuScenes dataset version (v1.0-trainval or v1.0-mini).
  • --is_train: Dataset split selector
    • 0: training set
    • 1: validation set
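Each generated output file is a .jsonl file, i.e. one JSON object per line, so it can be inspected line by line. A minimal sketch is shown below; note that the `scene_token` and `reasoning` field names are illustrative assumptions, and the actual schema written by naviGen_Qwen.py may differ:

```python
import json
import tempfile
from pathlib import Path

# Hypothetical record layout -- "scene_token" and "reasoning" are
# illustrative only; the real fields may differ.
sample = {"scene_token": "abc123", "reasoning": "Slow down for the pedestrian ahead."}

# .jsonl stores one JSON object per line, so files can be streamed line by line.
path = Path(tempfile.mkdtemp()) / "sample_reasons.jsonl"
path.write_text(json.dumps(sample) + "\n")

records = [json.loads(line) for line in path.read_text().splitlines()]
print(len(records), records[0]["scene_token"])
```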

The dataset is available on Hugging Face and can be loaded using the load_dataset function. For more details, please refer to the dataset page.

Training Models

Run train.py to train a model. It only requires one argument: --config.

python3 train.py --config configs/default.yaml 

Configuration files are stored in the configs folder. Key arguments include:

  • model_id: Model ID from Hugging Face.
  • attention: Attention implementation. flash_attention_2 requires installing FlashAttention separately.
  • quantization: Whether to use a quantized model.
  • enable_action: Whether to convert waypoints (x, y) to control actions (α, κ).
  • enable_image: Whether to include image inputs.
  • enable_reason: Whether to include reasoning inputs.
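For orientation, the keys above could be combined into a minimal configuration file along these lines (a hedged sketch; the actual configs/default.yaml in the configs folder may use different values and additional keys):

```yaml
# Hypothetical sketch of a training config -- key names follow the list
# above, but the values shown are illustrative assumptions.
model_id: Qwen/Qwen3-VL-8B-Instruct
attention: flash_attention_2   # requires FlashAttention to be installed
quantization: false
enable_action: true            # convert waypoints (x, y) to actions (α, κ)
enable_image: true
enable_reason: true
```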

The fine-tuned model is available on Hugging Face and can be loaded directly. For quick inference, please refer to the model page.

Inference

After training a model, we can run the following command to perform inference. The predicted waypoints will be saved in results/inference for further evaluation.

python3 eval.py --config configs/default.yaml --inference_path data/nuscenes_reasons_val_Qwen_32B.jsonl

Evaluation

L2 Error

To evaluate the L2 error:

python3 eval.py --config configs/default.yaml --eval_L2 True

The results will be stored in results.
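Conceptually, the L2 error is the mean Euclidean distance between predicted and ground-truth waypoints, the standard open-loop planning metric on nuScenes. The sketch below illustrates the computation; the exact horizons and averaging used in eval.py may differ:

```python
import math

# Hedged sketch of the open-loop L2 metric: mean Euclidean distance
# between predicted and ground-truth (x, y) waypoints.
def l2_error(pred, gt):
    """pred, gt: equal-length lists of (x, y) waypoints."""
    dists = [math.hypot(px - gx, py - gy) for (px, py), (gx, gy) in zip(pred, gt)]
    return sum(dists) / len(dists)

pred = [(1.0, 0.0), (2.0, 0.0)]
gt   = [(1.0, 0.0), (2.0, 1.0)]
print(l2_error(pred, gt))  # (0.0 + 1.0) / 2 = 0.5
```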

Video Generation

To generate a visualization video, specify the start and end indices of the frames using --start_idx and --end_idx:

python3 eval.py --config configs/default.yaml --eval_video True --start_idx 0 --end_idx 2000

The generated video will be stored in results/videos.

BibTeX

If you find this work useful for your research, please cite our work:

@misc{tao2026navidrive,
      title={NaviDriveVLM: Decoupling High-Level Reasoning and Motion Planning for Autonomous Driving}, 
      author={Ximeng Tao and Pardis Taghavi and Dimitar Filev and Reza Langari and Gaurav Pandey},
      year={2026},
      eprint={2603.07901},
      archivePrefix={arXiv},
      primaryClass={cs.RO},
      url={https://arxiv.org/abs/2603.07901}, 
}
