Zican Hu^{1,2}*, Shilin Zhang^{1,2}*, Yafu Li^{2}†✉, Jianhao Yan^{4,2}, Xuyang Hu^{2}, Leyang Cui^{4}, Xiaoye Qu^{2}, Chunlin Chen^{1}, Yu Cheng^{3}✉, Zhi Wang^{1,2}✉
1Nanjing University 2Shanghai AI Laboratory 3The Chinese University of Hong Kong 4Westlake University
*Equal contributions. Zican Hu and Shilin Zhang are listed alphabetically by last name. †Project lead. ✉ Corresponding authors.
Contact: zicanhu@smail.nju.edu.cn, shilinzhang@smail.nju.edu.cn, yafuly@gmail.com, chengyu@cse.cuhk.edu.hk, zhiwang@nju.edu.cn
We first conduct a preliminary empirical study that reveals a strong positive correlation between global diversity and reasoning capacity, and propose DIVER, a framework that highlights the pivotal role of global sequence-level diversity in incentivizing deep exploration for versatile reasoning. Building on this insight, we introduce global diversity incentives as an intrinsic reward to promote deep exploration in a semantically structured space. To incorporate the intrinsic reward, we develop a potential-based reward shaping mechanism that preserves optimal-policy invariance, and design simple heuristics to mitigate possible reward hacking.
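The shaping mechanism above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the `diversity_potential` function is a hypothetical stand-in for the sequence-level diversity measure (here, one minus the maximum cosine similarity between a rollout's embedding and previously seen rollouts), and the shaping term follows the standard potential-based form F = γΦ(s') − Φ(s), which is known to preserve the optimal policy (Ng et al., 1999).

```python
import numpy as np

def diversity_potential(emb, memory):
    """Hypothetical potential: higher when a rollout's embedding is far
    from previously seen rollout embeddings (sequence-level diversity)."""
    if not memory:
        return 1.0
    sims = [float(emb @ m / (np.linalg.norm(emb) * np.linalg.norm(m) + 1e-8))
            for m in memory]
    return 1.0 - max(sims)

def shaped_reward(r_verifier, phi_prev, phi_curr, gamma=1.0):
    """Add the potential-based shaping term F = gamma * Phi(s') - Phi(s)
    to the extrinsic (verifier) reward; this form preserves the
    optimal policy."""
    return r_verifier + gamma * phi_curr - phi_prev
```

Note that when the potential is unchanged between steps, the shaping term vanishes and the verifier reward passes through untouched, which is exactly the invariance property the mechanism relies on.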
- Sequence-Level vs. Token-Level Diversity in RLVR
- Metrics for Quantifying Sequence-Level Diversity
- Promoting Global Diversity for Deep Exploration
- Mitigating Reward Hacking
conda create -n diver python=3.10 -y
conda activate diver
pip install -r requirements.txt
cd dataset
huggingface-cli download --resume-download huzican/DIVER-Training-Openr1-Math-46k --local-dir openr1
huggingface-cli download --resume-download huzican/DIVER-Test --local-dir valid.all
cd model
huggingface-cli download --resume-download huzican/Qwen2.5-Math-7B-16k-think --local-dir Qwen2.5-Math-7B-16k-think
export MODEL_PATH="Qwen2.5-Math-7B-16k-think"
bash scripts/train/train_diver.sh --model $MODEL_PATH
export CHECKPOINT_PATH="checkpoints/diver"
bash scripts/eval/eval_checkpoint.sh --model $CHECKPOINT_PATH
DIVER builds upon veRL and deepscaler, and uses vLLM for inference. We use Math-Verify as the RLVR reward model. We thank the open-source community for the datasets and backbones, including OpenR1-Math-220k, OpenR1-Math-46k, and the Qwen-2.5 and Llama-3.1 models.
If you find our paper useful, please consider starring this repository and citing it:
@article{hu2025diversity,
title={Diversity-Incentivized Exploration for Versatile Reasoning},
author={Zican Hu and Shilin Zhang and Yafu Li and Jianhao Yan and Xuyang Hu and Leyang Cui and Xiaoye Qu and Chunlin Chen and Yu Cheng and Zhi Wang},
journal={arXiv preprint arXiv:2509.26209},
year={2025}
}


