This is the official PyTorch implementation of QTrack:
"QTrack: Query-Driven Reasoning for Multi-modal MOT" by Tajamul Ashraf, Tavaheed Tariq, Sonia Yadav, Abrar Ul Riyaz, Wasif Tak, Moloud Abdar, and Janibul Bashir.
- [03/25/2026] 💥 QTrack achieves a new state-of-the-art on the RMOT26 benchmark with 0.30 MCP and 0.75 MOTP! Check out our project page for demos.
- [03/18/2026] We released the RMOT26 benchmark and QTrack codebase. See more details in our arXiv paper!
- [03/10/2026] Dataset and model checkpoints are now publicly available.
QTrack performs query-driven multi-object tracking based on natural language instructions, tracking only the specified targets while maintaining temporal coherence.
Multi-object tracking (MOT) has traditionally focused on estimating trajectories of all objects in a video, without selectively reasoning about user-specified targets under semantic instructions. In this work, we introduce a query-driven tracking paradigm that formulates tracking as a spatiotemporal reasoning problem conditioned on natural language queries. Given a reference frame, a video sequence, and a textual query, the goal is to localize and track only the target(s) specified in the query while maintaining temporal coherence and identity consistency.
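As a concrete illustration of this formulation (not the released API — `Detection` and `query_driven_track` are illustrative names), the sketch below shows the task's input/output contract: given per-frame detections and a language query, keep only the matching targets while preserving identity and temporal order.

```python
from dataclasses import dataclass

@dataclass
class Detection:
    frame_idx: int   # index of the video frame
    track_id: int    # identity kept consistent across frames
    bbox: tuple      # (x1, y1, x2, y2) in pixels
    label: str       # semantic description of the object

def query_driven_track(detections, query_terms):
    """Keep only detections whose label matches the query,
    ordered by identity and then by time."""
    matched = [d for d in detections if d.label in query_terms]
    return sorted(matched, key=lambda d: (d.track_id, d.frame_idx))

dets = [
    Detection(0, 1, (10, 10, 50, 50), "red car"),
    Detection(0, 2, (60, 10, 90, 40), "pedestrian"),
    Detection(1, 1, (12, 11, 52, 51), "red car"),
]
# Query "red car" keeps only identity 1 across both frames.
tracked = query_driven_track(dets, {"red car"})
```

In the actual model this matching is learned end-to-end from vision-language features rather than from discrete labels; the sketch only fixes the interface.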
Key Contributions:
- RMOT26 Benchmark: A large-scale benchmark with grounded queries and sequence-level splits to prevent identity leakage and enable robust generalization evaluation
- QTrack Model: An end-to-end vision-language model that integrates multi-modal reasoning with tracking-oriented localization
- Temporal Perception-Aware Policy Optimization (TPA-PO): A structured reward strategy to encourage motion-aware reasoning
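The exact TPA-PO reward is defined in the paper; purely to illustrate what a structured, motion-aware reward can look like, the sketch below combines a per-frame localization term (IoU) with a frame-to-frame motion-consistency term. All names and the weights `w_loc`, `w_temp` are assumptions, not the paper's formulation.

```python
import math

def _center(b):
    """Center (cx, cy) of an (x1, y1, x2, y2) box."""
    return ((b[0] + b[2]) / 2.0, (b[1] + b[3]) / 2.0)

def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def tpa_reward(pred_boxes, gt_boxes, w_loc=1.0, w_temp=0.5):
    """Illustrative structured reward: mean per-frame IoU plus a
    bonus for predicting the right frame-to-frame motion."""
    loc = sum(iou(p, g) for p, g in zip(pred_boxes, gt_boxes)) / len(gt_boxes)
    temp = 0.0
    for t in range(1, len(pred_boxes)):
        px, py = _center(pred_boxes[t]); qx, qy = _center(pred_boxes[t - 1])
        gx, gy = _center(gt_boxes[t]);   hx, hy = _center(gt_boxes[t - 1])
        # Error between predicted and ground-truth center displacement.
        err = math.hypot((px - qx) - (gx - hx), (py - qy) - (gy - hy))
        temp += 1.0 / (1.0 + err)  # high when predicted motion matches
    temp /= max(1, len(pred_boxes) - 1)
    return w_loc * loc + w_temp * temp
```

A perfect trajectory scores `w_loc + w_temp`; degraded localization or inconsistent motion lowers each term independently, which is the structural property a motion-aware reward needs.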
🔥 Check out our project website for an overview and demos!
QTrack achieves state-of-the-art performance on the RMOT26 benchmark, significantly outperforming both open-source and closed-source models.
| Model | Params | MCP↑ | MOTP↑ | CLE (px)↓ | NDE↓ |
|---|---|---|---|---|---|
| GPT-5.2 | - | 0.25 | 0.61 | 94.2 | 0.55 |
| Qwen3-VL-Instruct | 8B | 0.25 | 0.64 | 96.0 | 0.97 |
| Gemma 3 | 27B | 0.24 | 0.56 | 58.4 | 0.88 |
| Gemma 3 | 12B | 0.18 | 0.73 | 172.9 | 0.95 |
| VisionReasoner | 7B | 0.23 | 0.24 | 428.9 | 2.24 |
| Qwen2.5-VL-Instruct | 7B | 0.24 | 0.48 | 289.2 | 2.07 |
| InternVL | 8B | 0.21 | 0.66 | 117.44 | 0.64 |
| gpt-4o-mini | - | 0.20 | 0.57 | 130.48 | 0.67 |
| QTrack (Ours) | 3B | 0.30 | 0.75 | 44.61 | 0.39 |
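CLE in the table above is center location error, the pixel distance between predicted and ground-truth box centers averaged over a sequence (the benchmark's exact evaluation protocol follows the paper). A minimal sketch of such a metric:

```python
import math

def center(box):
    """Center (cx, cy) of an (x1, y1, x2, y2) box."""
    return ((box[0] + box[2]) / 2.0, (box[1] + box[3]) / 2.0)

def cle(pred_boxes, gt_boxes):
    """Mean center location error (pixels) over a sequence of
    matched predicted and ground-truth boxes."""
    dists = [math.dist(center(p), center(g))
             for p, g in zip(pred_boxes, gt_boxes)]
    return sum(dists) / len(dists)
```

Lower is better: QTrack's 44.61 px means its predicted centers stay roughly twice as close to the ground truth as the next-best model's.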
| Model | MOTA (MOT17) | MOTP (MOT17) | HOTA (MOT17) | MCP (MOT17) | MOTA (DanceTrack) | MOTP (DanceTrack) | HOTA (DanceTrack) | MCP (DanceTrack) |
|---|---|---|---|---|---|---|---|---|
| MOTR | 0.61 | 0.81 | 0.22 | 0.44 | 0.42 | 0.70 | 0.35 | 0.51 |
| BoostTrack++ | 0.63 | 0.76 | 0.38 | 0.44 | - | - | - | - |
| MOTRv2 | - | - | - | - | 0.49 | 0.73 | 0.37 | 0.52 |
| TrackTrack | 0.75 | 0.50 | 0.23 | 0.29 | 0.36 | 0.73 | 0.40 | 0.55 |
| VisionReasoner | 0.64 | 0.86 | 0.60 | 0.21 | 0.59 | 0.85 | 0.61 | 0.26 |
| QTrack (Ours) | 0.69 | 0.87 | 0.69 | 0.26 | 0.63 | 0.83 | 0.66 | 0.35 |
| Model | Params | MCP↑ | MOTP↑ | MOTA↑ | NDE↓ |
|---|---|---|---|---|---|
| VisionReasoner | 3B | 0.22 | 0.65 | 0.01 | 0.76 |
| Gemma 3 | 4B | 0.18 | 0.73 | -0.16 | 0.95 |
| Qwen2.5-VL | 3B | 0.14 | 0.76 | -0.51 | 3.41 |
| QTrack (Ours) | 3B | 0.30 | 0.75 | 0.21 | 0.39 |
- Python ≥ 3.12
- PyTorch ≥ 2.6
- CUDA ≥ 12.1
- Transformers ≥ 4.51.3
```bash
# Create conda environment
conda create -n qtrack python=3.12
conda activate qtrack

# Install PyTorch
pip install torch==2.6.0 torchvision==0.21.0 torchaudio==2.6.0 --index-url https://download.pytorch.org/whl/cu124

# Install QTrack and dependencies
git clone https://github.com/gaash-lab/QTrack.git
cd QTrack
pip install -r requirements.txt
pip install -e .
```

Training details and scripts will be provided soon.
Evaluation settings and benchmarks will be released soon.
If you find QTrack useful for your research, please cite:
```bibtex
@misc{ashraf2026qtrackquerydrivenreasoningmultimodal,
      title={QTrack: Query-Driven Reasoning for Multi-modal MOT},
      author={Tajamul Ashraf and Tavaheed Tariq and Sonia Yadav and Abrar Ul Riyaz and Wasif Tak and Moloud Abdar and Janibul Bashir},
      year={2026},
      eprint={2603.13759},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2603.13759},
}
```

