LongVideo-R1: Smart Navigation for Low-cost Long Video Understanding

This is the official implementaion of paper 'LongVideo-R1: Smart Navigation for Low-cost Long Video Understanding', which is accepted in CVPR 2026.

[📖 Paper] [🤗 LongVideo-R1-Qwen2.5] [🤗 LongVideo-R1-Qwen3] [🤗 LongVideo-R1-Data]

👀 Abstract

This paper addresses the critical and underexplored challenge of long video understanding with low computational budgets. We propose LongVideo-R1, an active, reasoning-equipped multimodal large language model (MLLM) agent designed for efficient video context navigation, avoiding the redundancy of exhaustive search. At the core of LongVideo-R1 lies a reasoning module that leverages high-level visual cues to infer the most informative video clip for subsequent processing. During inference, the agent initiates traversal from top-level visual summaries and iteratively refines its focus, immediately halting the exploration process upon acquiring sufficient knowledge to answer the query. To facilitate training, we first extract hierarchical video captions from CGBench, a video corpus with grounding annotations, and guide GPT-5 to generate 33K high-quality chain-of-thought-with-tool trajectories. The LongVideo-R1 agent is fine-tuned upon the Qwen-3-8B model through a two-stage paradigm: supervised fine-tuning (SFT) followed by reinforcement learning (RL), where RL employs a specifically designed reward function to maximize selective and efficient clip navigation. Experiments on multiple long video benchmarks validate the effectiveness of name, which enjoys superior tradeoff between QA accuracy and efficiency.

🏆 Performance

The performance of LongVideo-R1.

🔗 Model Download

We provide two versions of the model:

LongVideo-R1-Qwen2.5: obtained by performing SFT and RL using Qwen3-8B as the reasoning model, Qwen2.5-VL-72B as the caption model, and Qwen2.5-VL-32B as the video_qa model.
LongVideo-R1-Qwen3: obtained by performing SFT and RL using Qwen3-8B as the reasoning model, and Qwen3-VL-32B as both the caption model and video_qa model.

LongVideo-R1-Qwen3 delivers better performance.

📚 Contents Link

🔧 Cli Demo

Use cli.py for online testing (tool use + multi-round reasoning).

LongVideo-R1, caption model, and video_qa model should be deployed in vllm serve mode.

Example command:

#Deploy the reasoning model.
MODEL_PATH="path/to/LongVideo-R1"
PORT=25600

echo "Starting GPU 0 vLLM serve (Port $PORT)..."
CUDA_VISIBLE_DEVICES=0 vllm serve $MODEL_PATH \
--tensor-parallel-size 1 \
--max-model-len 32768 \
--gpu-memory-utilization 0.85 \
--host 127.0.0.1 \
--port $PORT \
--served-model-name longvideor1 

#Deploy the caption and video_qa model.
MODEL_PATH="path/to/Qwen3-VL-32B-Instruct"
PORT=9081
GPU_PAIR="6,7"

echo "Starting GPU ${GPU_PAIR} vLLM serve (Port $PORT)..."
CUDA_VISIBLE_DEVICES=$GPU_PAIR vllm serve $MODEL_PATH \
--tensor-parallel-size 2 \
--max-model-len 16384 \
--gpu-memory-utilization 0.85 \
--host 127.0.0.1 \
--port $PORT \
--mm-processor-cache-gb 0 \
--served-model-name Qwen3-VL-32B

python cli.py \
  --video_path /path/to/video.mp4 \
  --question "What is the man doing in this video?" \
  --reasoning_base_url http://127.0.0.1:25600/v1 \
  --caption_base_url http://127.0.0.1:9081/v1 \
  --videoqa_base_url http://127.0.0.1:9081/v1

Tips

We found that the evaluation results on H800 and A800 differ: the results on A800 are about 1 point lower on average than those on H800.
Our time calculation is obtained by computing each model individually without considering network latency and video frame extraction latency, and then combining the number of calls for each model.

Acknowledgement

This project is based on LLaMA-Factory (paper, code), verl-tool(paper, code), verl(code), thanks for their excellent works.

Citation

If you find LongVideo-R1 useful for your research and applications, please consider giving us a star ⭐ and citing it by the following BibTeX entry:

@article{qiu2026longvideo,
  title={LongVideo-R1: Smart Navigation for Low-cost Long Video Understanding},
  author={Qiu, Jihao and Xie, Lingxi and Huo, Xinyue and Tian, Qi and Ye, Qixiang},
  journal={arXiv preprint arXiv:2602.20913},
  year={2026}
}

Name		Name	Last commit message	Last commit date
Latest commit History 26 Commits
LLaMA-Factory		LLaMA-Factory
data		data
eval		eval
figs		figs
verl-tool		verl-tool
.gitignore		.gitignore
README.md		README.md
cli.py		cli.py
data_generation.md		data_generation.md
evaluation.md		evaluation.md
training.md		training.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

LongVideo-R1: Smart Navigation for Low-cost Long Video Understanding

👀 Abstract

🏆 Performance

🔗 Model Download

📚 Contents Link

🔧 Cli Demo

Tips

Acknowledgement

Citation

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

LongVideo-R1: Smart Navigation for Low-cost Long Video Understanding

👀 Abstract

🏆 Performance

🔗 Model Download

📚 Contents Link

🔧 Cli Demo

Tips

Acknowledgement

Citation

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages