Overview of NaviDriveVLM. The system is decoupled into two modules: the Navigator and the Driver. The Navigator is a large-scale VLM responsible for scene understanding and high-level reasoning. The Driver is a lightweight VLM, small enough for efficient supervised fine-tuning (SFT) as a driving expert for future waypoint prediction.

To run a quick inference demo, clone the repository, set up the environment, and execute the notebook located at `demo/inference.ipynb`.
The provided script can be used to set up the environment:
```bash
bash ./scripts/setup_env.sh
```

Alternatively, set up the environment manually. Create a new Conda environment named `navidrive`:

```bash
conda create -n navidrive python=3.10
```

Then activate the environment and install the required libraries:

```bash
conda activate navidrive
```

Install PyTorch based on your GPU:

```bash
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu126
```

Install the remaining dependencies:

```bash
pip install transformers==5.1.0 datasets==4.5.0 accelerate==1.12.0 peft==0.18.1 "bitsandbytes>=0.46.1" opencv-python==4.11.0.86 nuscenes-devkit==1.2.0 qwen-vl-utils==0.0.14 beautifulsoup4==4.14.3 typeguard==4.5.1 wandb==0.25.1 tensorboard==2.20.0 ipykernel==7.2.0 ipywidgets==8.1.8 pickleshare==0.7.5 pyrootutils==1.0.4 jmespath==1.1.0
```

- FlashAttention is optional and is toggled in the configuration YAML file. If you would like to enable it, please follow the FlashAttention README for installation instructions.
The `data` folder already includes several pairs of `.jsonl` files used for training and evaluation, generated from the nuScenes dataset.
- `nuscenes_reasons_Qwen_32B` and `nuscenes_reasons_val_Qwen_32B` are generated by `Qwen3-VL-32B-Instruct`.
- `nuscenes_reasons_Qwen_8B` and `nuscenes_reasons_val_Qwen_8B` are generated by `Qwen3-VL-8B-Instruct`.
- `nuscenes_reasons_Gemini` and `nuscenes_reasons_val_Gemini` are generated by `Gemini-2.5-Flash`.
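Each of these files is standard JSON Lines (one JSON object per line), so they can be inspected without any of the training code. A minimal reader sketch (the record fields depend on the generation script and are not assumed here):

```python
import json

def read_jsonl(path, limit=3):
    """Read up to `limit` records from a JSON Lines file."""
    records = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:  # skip blank lines
                continue
            records.append(json.loads(line))
            if len(records) >= limit:
                break
    return records
```

For example, `read_jsonl("data/nuscenes_reasons_Qwen_32B.jsonl")` returns the first three records for a quick sanity check.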
If you would like to generate reasoning data from the nuScenes dataset, run the following command:
```bash
python3 naviGen_Qwen.py --model_id Qwen/Qwen3-VL-32B-Instruct --output_file data/nuscenes_reasons_Qwen_32B.jsonl --data_path /PATH/TO/NUSCENES/DATASET --version v1.0-trainval --is_train 0
```

Arguments:

- `--model_id`: Model ID from Hugging Face.
- `--output_file`: Path to the output `.jsonl` file.
- `--data_path`: Path to your nuScenes dataset.
- `--version`: nuScenes dataset version (`v1.0-trainval` or `v1.0-mini`).
- `--is_train`: Dataset split selector (`0`: training set, `1`: validation set).
The dataset is available on Hugging Face and can be loaded using the `load_dataset` function. For more details, please refer to the dataset page.
Run `train.py` to train a model. It only requires one argument: `--config`.

```bash
python3 train.py --config configs/default.yaml
```

Configuration files are stored in the `configs` folder. Key arguments include:

- `model_id`: Model ID from Hugging Face.
- `attention`: Attention implementation. `flash_attention_2` requires installing FlashAttention separately.
- `quantization`: Whether to use a quantized model.
- `enable_action`: Whether to convert waypoints `(x, y)` to control actions `(α, κ)`.
- `enable_image`: Whether to include image inputs.
- `enable_reason`: Whether to include reasoning inputs.
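The `enable_action` conversion can be pictured with a small, self-contained sketch. The definitions below are an assumption for illustration only (α taken as the per-segment heading angle, κ as heading change per unit arc length); the repository's exact formulation may differ:

```python
import math

def waypoints_to_actions(waypoints):
    """Convert (x, y) waypoints to per-segment (alpha, kappa) pairs.

    alpha: heading angle of each segment (radians);
    kappa: approximate curvature, i.e. heading change divided by
    segment length. Illustrative definitions, not the repo's exact ones.
    """
    actions = []
    prev_alpha = None
    for (x0, y0), (x1, y1) in zip(waypoints, waypoints[1:]):
        dx, dy = x1 - x0, y1 - y0
        seg_len = math.hypot(dx, dy)
        alpha = math.atan2(dy, dx)
        if prev_alpha is None or seg_len == 0.0:
            kappa = 0.0  # no previous heading (or zero-length segment)
        else:
            kappa = (alpha - prev_alpha) / seg_len
        actions.append((alpha, kappa))
        prev_alpha = alpha
    return actions
```

For a straight path along the x-axis every α and κ is zero; a left turn yields positive κ under this convention.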
The fine-tuned model is available on Hugging Face and can be loaded directly. For quick inference, please refer to the model page.
After training a model, we can run the following command to perform inference.
The predicted waypoints will be saved in `results/inference` for further evaluation.

```bash
python3 eval.py --config configs/default.yaml --inference_path data/nuscenes_reasons_val_Qwen_32B.jsonl
```

To evaluate the L2 error:

```bash
python3 eval.py --config configs/default.yaml --eval_L2 True
```

The results will be stored in `results`.
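As a rough mental model of the metric (the exact averaging inside `eval.py`, e.g. per-horizon 1s/2s/3s breakdowns, is not shown here), the L2 error between a predicted and a ground-truth trajectory can be sketched as the mean Euclidean distance over waypoints:

```python
import math

def l2_error(pred, gt):
    """Mean Euclidean distance between predicted and ground-truth
    (x, y) waypoints. Illustrative sketch of the metric only."""
    assert len(pred) == len(gt) and pred
    dists = [math.hypot(px - gx, py - gy)
             for (px, py), (gx, gy) in zip(pred, gt)]
    return sum(dists) / len(dists)
```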
To generate a visualization video, specify the start and end indices of the frames using --start_idx and --end_idx:
```bash
python3 eval.py --config configs/default.yaml --eval_video True --start_idx 0 --end_idx 2000
```

The generated video will be stored in `results/videos`.
If you find this work useful for your research, please cite our work:
```bibtex
@misc{tao2026navidrive,
      title={NaviDriveVLM: Decoupling High-Level Reasoning and Motion Planning for Autonomous Driving},
      author={Ximeng Tao and Pardis Taghavi and Dimitar Filev and Reza Langari and Gaurav Pandey},
      year={2026},
      eprint={2603.07901},
      archivePrefix={arXiv},
      primaryClass={cs.RO},
      url={https://arxiv.org/abs/2603.07901},
}
```
