This is the official PyTorch implementation of QTrack:
"QTrack: Query-Driven Reasoning for Multi-modal MOT" by Tajamul Ashraf, Tavaheed Tariq, Sonia Yadav, Abrar Ul Riyaz, Wasif Tak, Moloud Abdar, and Janibul Bashir.
- [03/25/2026] 💥 QTrack achieves a new state-of-the-art on the RMOT26 benchmark with 0.30 MCP and 0.75 MOTP! Check out our project page for demos.
- [03/18/2026] We released the RMOT26 benchmark and QTrack codebase. See more details in our arXiv paper!
- [03/10/2026] Dataset and model checkpoints are now publicly available.
QTrack performs query-driven multi-object tracking based on natural language instructions, tracking only the specified targets while maintaining temporal coherence.
Multi-object tracking (MOT) has traditionally focused on estimating trajectories of all objects in a video, without selectively reasoning about user-specified targets under semantic instructions. In this work, we introduce a query-driven tracking paradigm that formulates tracking as a spatiotemporal reasoning problem conditioned on natural language queries. Given a reference frame, a video sequence, and a textual query, the goal is to localize and track only the target(s) specified in the query while maintaining temporal coherence and identity consistency.
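As a concrete illustration of this formulation (not the released API — `Detection` and `query_driven_track` are illustrative names), the sketch below shows the task's input/output contract: given per-frame detections and a language query, keep only the matching targets while preserving identity and temporal order.

```python
from dataclasses import dataclass

@dataclass
class Detection:
    frame_idx: int   # index of the video frame
    track_id: int    # identity kept consistent across frames
    bbox: tuple      # (x1, y1, x2, y2) in pixels
    label: str       # semantic description of the object

def query_driven_track(detections, query_terms):
    """Keep only detections whose label matches the query,
    ordered by identity and then by time."""
    matched = [d for d in detections if d.label in query_terms]
    return sorted(matched, key=lambda d: (d.track_id, d.frame_idx))

dets = [
    Detection(0, 1, (10, 10, 50, 50), "red car"),
    Detection(0, 2, (60, 10, 90, 40), "pedestrian"),
    Detection(1, 1, (12, 11, 52, 51), "red car"),
]
# Query "red car" keeps only identity 1 across both frames.
tracked = query_driven_track(dets, {"red car"})
```

In the actual model this matching is learned end-to-end from vision-language features rather than from discrete labels; the sketch only fixes the interface.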
Key Contributions:
- RMOT26 Benchmark: A large-scale benchmark with grounded queries and sequence-level splits to prevent identity leakage and enable robust generalization evaluation
- QTrack Model: An end-to-end vision-language model that integrates multi-modal reasoning with tracking-oriented localization
- Temporal Perception-Aware Policy Optimization (TPA-PO): A structured reward strategy to encourage motion-aware reasoning
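The exact TPA-PO reward is defined in the paper; purely to illustrate what a structured, motion-aware reward can look like, the sketch below combines a per-frame localization term (IoU) with a frame-to-frame motion-consistency term. All names and the weights `w_loc`, `w_temp` are assumptions, not the paper's formulation.

```python
import math

def _center(b):
    """Center (cx, cy) of an (x1, y1, x2, y2) box."""
    return ((b[0] + b[2]) / 2.0, (b[1] + b[3]) / 2.0)

def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def tpa_reward(pred_boxes, gt_boxes, w_loc=1.0, w_temp=0.5):
    """Illustrative structured reward: mean per-frame IoU plus a
    bonus for predicting the right frame-to-frame motion."""
    loc = sum(iou(p, g) for p, g in zip(pred_boxes, gt_boxes)) / len(gt_boxes)
    temp = 0.0
    for t in range(1, len(pred_boxes)):
        px, py = _center(pred_boxes[t]); qx, qy = _center(pred_boxes[t - 1])
        gx, gy = _center(gt_boxes[t]);   hx, hy = _center(gt_boxes[t - 1])
        # Error between predicted and ground-truth center displacement.
        err = math.hypot((px - qx) - (gx - hx), (py - qy) - (gy - hy))
        temp += 1.0 / (1.0 + err)  # high when predicted motion matches
    temp /= max(1, len(pred_boxes) - 1)
    return w_loc * loc + w_temp * temp
```

A perfect trajectory scores `w_loc + w_temp`; degraded localization or inconsistent motion lowers each term independently, which is the structural property a motion-aware reward needs.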
🔥 Check out our project website for an overview and demos!
QTrack achieves state-of-the-art performance on the RMOT26 benchmark, significantly outperforming both open-source and closed-source models.
| Model | Params | MCP↑ | MOTP↑ | CLE (px)↓ | NDE↓ |
|---|---|---|---|---|---|
| GPT-5.2 | - | 0.25 | 0.61 | 94.2 | 0.55 |
| Qwen3-VL-Instruct | 8B | 0.25 | 0.64 | 96.0 | 0.97 |
| Gemma 3 | 27B | 0.24 | 0.56 | 58.4 | 0.88 |
| Gemma 3 | 12B | 0.18 | 0.73 | 172.9 | 0.95 |
| VisionReasoner | 7B | 0.23 | 0.24 | 428.9 | 2.24 |
| Qwen2.5-VL-Instruct | 7B | 0.24 | 0.48 | 289.2 | 2.07 |
| InternVL | 8B | 0.21 | 0.66 | 117.44 | 0.64 |
| gpt-4o-mini | - | 0.20 | 0.57 | 130.48 | 0.67 |
| QTrack (Ours) | 3B | 0.30 | 0.75 | 44.61 | 0.39 |
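CLE in the table above is center location error, the pixel distance between predicted and ground-truth box centers averaged over a sequence (the benchmark's exact evaluation protocol follows the paper). A minimal sketch of such a metric:

```python
import math

def center(box):
    """Center (cx, cy) of an (x1, y1, x2, y2) box."""
    return ((box[0] + box[2]) / 2.0, (box[1] + box[3]) / 2.0)

def cle(pred_boxes, gt_boxes):
    """Mean center location error (pixels) over a sequence of
    matched predicted and ground-truth boxes."""
    dists = [math.dist(center(p), center(g))
             for p, g in zip(pred_boxes, gt_boxes)]
    return sum(dists) / len(dists)
```

Lower is better: QTrack's 44.61 px means its predicted centers stay roughly twice as close to the ground truth as the next-best model's.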
| Model | MOTA (MOT17) | MOTP (MOT17) | HOTA (MOT17) | MCP (MOT17) | MOTA (DanceTrack) | MOTP (DanceTrack) | HOTA (DanceTrack) | MCP (DanceTrack) |
|---|---|---|---|---|---|---|---|---|
| MOTR | 0.61 | 0.81 | 0.22 | 0.44 | 0.42 | 0.70 | 0.35 | 0.51 |
| BoostTrack++ | 0.63 | 0.76 | 0.38 | 0.44 | - | - | - | - |
| MOTRv2 | - | - | - | - | 0.49 | 0.73 | 0.37 | 0.52 |
| TrackTrack | 0.75 | 0.50 | 0.23 | 0.29 | 0.36 | 0.73 | 0.40 | 0.55 |
| VisionReasoner | 0.64 | 0.86 | 0.60 | 0.21 | 0.59 | 0.85 | 0.61 | 0.26 |
| QTrack (Ours) | 0.69 | 0.87 | 0.69 | 0.26 | 0.63 | 0.83 | 0.66 | 0.35 |
| Model | Params | MCP↑ | MOTP↑ | MOTA↑ | NDE↓ |
|---|---|---|---|---|---|
| VisionReasoner | 3B | 0.22 | 0.65 | 0.01 | 0.76 |
| Gemma 3 | 4B | 0.18 | 0.73 | -0.16 | 0.95 |
| Qwen2.5-VL | 3B | 0.14 | 0.76 | -0.51 | 3.41 |
| QTrack (Ours) | 3B | 0.30 | 0.75 | 0.21 | 0.39 |
- Python ≥ 3.12
- PyTorch ≥ 2.6
- CUDA ≥ 12.1
- Transformers ≥ 4.51.3
```bash
# Create conda environment
conda create -n qtrack python=3.12
conda activate qtrack

# Install PyTorch
pip install torch==2.6.0 torchvision==0.21.0 torchaudio==2.6.0 --index-url https://download.pytorch.org/whl/cu124

# Install QTrack and dependencies
git clone https://github.com/gaash-lab/QTrack.git
cd QTrack
pip install -r requirements.txt
pip install -e .
```

Training details and scripts will be provided soon.
Evaluation settings and benchmarks will be released soon.
If you find QTrack useful for your research, please cite:
```bibtex
@misc{ashraf2026qtrackquerydrivenreasoningmultimodal,
      title={QTrack: Query-Driven Reasoning for Multi-modal MOT},
      author={Tajamul Ashraf and Tavaheed Tariq and Sonia Yadav and Abrar Ul Riyaz and Wasif Tak and Moloud Abdar and Janibul Bashir},
      year={2026},
      eprint={2603.13759},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2603.13759},
}
```

