Overview of NaviDriveVLM. The system is decoupled into two modules: the Navigator and the Driver. The Navigator is a large-scale VLM responsible for scene understanding and high-level reasoning. The Driver is a lightweight VLM, small enough for efficient supervised fine-tuning (SFT) as a driving expert for future waypoint prediction.

To run a quick inference demo, clone the repository, set up the environment, and execute the notebook located at `demo/inference.ipynb`.
The provided script can be used to set up the environment:
```bash
bash ./scripts/setup_env.sh
```

Alternatively, set up the environment manually. Create a new Conda environment named `navidrive`:

```bash
conda create -n navidrive python=3.10
```

Then activate the environment and install the required libraries:

```bash
conda activate navidrive
```

Install PyTorch based on your GPU:

```bash
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu126
```

Install the remaining dependencies:

```bash
pip install transformers==5.1.0 datasets==4.5.0 accelerate==1.12.0 peft==0.18.1 "bitsandbytes>=0.46.1" opencv-python==4.11.0.86 nuscenes-devkit==1.2.0 qwen-vl-utils==0.0.14 beautifulsoup4==4.14.3 typeguard==4.5.1 wandb==0.25.1 tensorboard==2.20.0 ipykernel==7.2.0 ipywidgets==8.1.8 pickleshare==0.7.5 pyrootutils==1.0.4 jmespath==1.1.0
```

- FlashAttention is optional and is toggled in the configuration YAML file. If you would like to enable it, please follow the FlashAttention README for installation instructions.
The `data` folder already includes several pairs of `.jsonl` files used for training and evaluation, generated from the nuScenes dataset.
- `nuscenes_reasons_Qwen_32B` and `nuscenes_reasons_val_Qwen_32B` are generated by `Qwen3-VL-32B-Instruct`.
- `nuscenes_reasons_Qwen_8B` and `nuscenes_reasons_val_Qwen_8B` are generated by `Qwen3-VL-8B-Instruct`.
- `nuscenes_reasons_Gemini` and `nuscenes_reasons_val_Gemini` are generated by `Gemini-2.5-Flash`.
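Each of these files is standard JSON Lines (one JSON object per line), so they can be inspected without any of the training code. A minimal reader sketch (the record fields depend on the generation script and are not assumed here):

```python
import json

def read_jsonl(path, limit=3):
    """Read up to `limit` records from a JSON Lines file."""
    records = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:  # skip blank lines
                continue
            records.append(json.loads(line))
            if len(records) >= limit:
                break
    return records
```

For example, `read_jsonl("data/nuscenes_reasons_Qwen_32B.jsonl")` returns the first three records for a quick sanity check.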
If you would like to generate reasoning data from the nuScenes dataset, run the following command:
```bash
python3 naviGen_Qwen.py --model_id Qwen/Qwen3-VL-32B-Instruct --output_file data/nuscenes_reasons_Qwen_32B.jsonl --data_path /PATH/TO/NUSCENES/DATASET --version v1.0-trainval --is_train 0
```

Arguments:

- `--model_id`: Model ID from Hugging Face.
- `--output_file`: Path to the output `.jsonl` file.
- `--data_path`: Path to your nuScenes dataset.
- `--version`: nuScenes dataset version (`v1.0-trainval` or `v1.0-mini`).
- `--is_train`: Dataset split selector (`0`: training set, `1`: validation set).
The dataset is available on Hugging Face and can be loaded using the `load_dataset` function. For more details, please refer to the dataset page.
Run `train.py` to train a model. It only requires one argument: `--config`.

```bash
python3 train.py --config configs/default.yaml
```

Configuration files are stored in the `configs` folder. Key arguments include:

- `model_id`: Model ID from Hugging Face.
- `attention`: Attention implementation. `flash_attention_2` requires installing FlashAttention separately.
- `quantization`: Whether to use a quantized model.
- `enable_action`: Whether to convert waypoints `(x, y)` to control actions `(α, κ)`.
- `enable_image`: Whether to include image inputs.
- `enable_reason`: Whether to include reasoning inputs.
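The `enable_action` conversion can be pictured with a small, self-contained sketch. The definitions below are an assumption for illustration only (α taken as the per-segment heading angle, κ as heading change per unit arc length); the repository's exact formulation may differ:

```python
import math

def waypoints_to_actions(waypoints):
    """Convert (x, y) waypoints to per-segment (alpha, kappa) pairs.

    alpha: heading angle of each segment (radians);
    kappa: approximate curvature, i.e. heading change divided by
    segment length. Illustrative definitions, not the repo's exact ones.
    """
    actions = []
    prev_alpha = None
    for (x0, y0), (x1, y1) in zip(waypoints, waypoints[1:]):
        dx, dy = x1 - x0, y1 - y0
        seg_len = math.hypot(dx, dy)
        alpha = math.atan2(dy, dx)
        if prev_alpha is None or seg_len == 0.0:
            kappa = 0.0  # no previous heading (or zero-length segment)
        else:
            kappa = (alpha - prev_alpha) / seg_len
        actions.append((alpha, kappa))
        prev_alpha = alpha
    return actions
```

For a straight path along the x-axis every α and κ is zero; a left turn yields positive κ under this convention.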
The fine-tuned model is available on Hugging Face and can be loaded directly. For quick inference, please refer to the model page.
After training a model, we can run the following command to perform inference.
The predicted waypoints will be saved in `results/inference` for further evaluation.

```bash
python3 eval.py --config configs/default.yaml --inference_path data/nuscenes_reasons_val_Qwen_32B.jsonl
```

To evaluate the L2 error:

```bash
python3 eval.py --config configs/default.yaml --eval_L2 True
```

The results will be stored in `results`.
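As a rough mental model of the metric (the exact averaging inside `eval.py`, e.g. per-horizon 1s/2s/3s breakdowns, is not shown here), the L2 error between a predicted and a ground-truth trajectory can be sketched as the mean Euclidean distance over waypoints:

```python
import math

def l2_error(pred, gt):
    """Mean Euclidean distance between predicted and ground-truth
    (x, y) waypoints. Illustrative sketch of the metric only."""
    assert len(pred) == len(gt) and pred
    dists = [math.hypot(px - gx, py - gy)
             for (px, py), (gx, gy) in zip(pred, gt)]
    return sum(dists) / len(dists)
```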
To generate a visualization video, specify the start and end indices of the frames using --start_idx and --end_idx:
```bash
python3 eval.py --config configs/default.yaml --eval_video True --start_idx 0 --end_idx 2000
```

The generated video will be stored in `results/videos`.
If you find this work useful for your research, please cite our work:
```bibtex
@misc{tao2026navidrive,
      title={NaviDriveVLM: Decoupling High-Level Reasoning and Motion Planning for Autonomous Driving},
      author={Ximeng Tao and Pardis Taghavi and Dimitar Filev and Reza Langari and Gaurav Pandey},
      year={2026},
      eprint={2603.07901},
      archivePrefix={arXiv},
      primaryClass={cs.RO},
      url={https://arxiv.org/abs/2603.07901},
}
```
