Diversity-Incentivized Exploration for Versatile Reasoning


Zican Hu1,2*, Shilin Zhang1,2*, Yafu Li2†, Jianhao Yan4,2, Xuyang Hu2, Leyang Cui4, Xiaoye Qu2, Chunlin Chen1, Yu Cheng3, Zhi Wang1,2

1Nanjing University 2Shanghai AI Laboratory 3The Chinese University of Hong Kong 4Westlake University

*Equal contribution; Zican Hu and Shilin Zhang are listed alphabetically by last name. †Project lead. Corresponding authors.

Contact: zicanhu@smail.nju.edu.cn, shilinzhang@smail.nju.edu.cn, yafuly@gmail.com, chengyu@cse.cuhk.edu.hk, zhiwang@nju.edu.cn

Overview

*(Overview figure of the DIVER framework.)*


📖Introduction

We first conduct an empirical study that reveals a strong positive correlation between global diversity and reasoning capacity. Building on this insight, we propose DIVER, a framework that highlights the pivotal role of global, sequence-level diversity in incentivizing deep exploration for versatile reasoning. We introduce global diversity incentives as an intrinsic reward to promote deep exploration in a semantically structured space. Incorporating this intrinsic reward, we develop a potential-based reward shaping mechanism that preserves optimal-policy invariance, and design simple heuristics to mitigate possible reward hacking.
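To make the shaping mechanism concrete, here is a minimal sketch of standard potential-based reward shaping with a diversity-derived potential. All names, the coefficient `beta`, and the example potential values are illustrative assumptions, not DIVER's actual implementation:

```python
# Minimal sketch of potential-based reward shaping, assuming a
# diversity-based potential Phi. Names and coefficients are
# illustrative, not DIVER's actual API.

def shaped_reward(r_extrinsic: float, phi: float, phi_next: float,
                  gamma: float = 1.0, beta: float = 0.1) -> float:
    """r' = r + beta * (gamma * Phi(s') - Phi(s)).
    Shaping of this potential-based form leaves the optimal policy
    unchanged for any choice of Phi."""
    return r_extrinsic + beta * (gamma * phi_next - phi)

# Along a trajectory the shaping terms telescope, so (with gamma = 1)
# the total added reward depends only on the first and last potentials.
potentials = [0.2, 0.5, 0.9, 0.9]  # hypothetical per-step diversity potentials
total_shaping = sum(
    shaped_reward(0.0, potentials[t], potentials[t + 1])
    for t in range(len(potentials) - 1)
)
print(round(total_shaping, 6))  # → 0.07, i.e. beta * (0.9 - 0.2)
```

The telescoping property is why this shaping cannot change which policy is optimal: the intrinsic bonus only redistributes reward along the trajectory.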

Key Highlights:

  • Sequence-level vs. token-level diversity in RLVR

  • Metrics for quantifying sequence-level diversity

  • Promoting global diversity for deep exploration

  • Mitigating reward hacking

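One simple way to quantify global, sequence-level diversity is the mean pairwise cosine distance between sequence embeddings of a group of rollouts. The sketch below illustrates that idea only; the paper's actual metric may differ, and the embeddings shown are toy inputs:

```python
import numpy as np

def sequence_diversity(embeddings: np.ndarray) -> float:
    """Mean pairwise cosine distance over a group of sequence embeddings
    (shape: [num_rollouts, dim]). Higher means the rollouts are more
    semantically spread out. Illustrative metric only, not DIVER's exact one."""
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = normed @ normed.T
    n = embeddings.shape[0]
    # Average similarity, excluding the diagonal (self-similarity).
    mean_sim = (sims.sum() - np.trace(sims)) / (n * (n - 1))
    return float(1.0 - mean_sim)

# Identical rollouts -> diversity ~0; mutually orthogonal rollouts -> diversity ~1.
print(round(sequence_diversity(np.ones((4, 8))), 6))
print(round(sequence_diversity(np.eye(4)), 6))
```

A group-level score like this can then serve as the potential in the reward shaping above.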

🚀Usage

Installation

```bash
conda create -n diver python=3.10 -y
conda activate diver
pip install -r requirements.txt
```

Preparation

```bash
cd dataset
huggingface-cli download --resume-download huzican/DIVER-Training-Openr1-Math-46k --local-dir openr1
huggingface-cli download --resume-download huzican/DIVER-Test --local-dir valid.all
```

```bash
cd model
huggingface-cli download --resume-download huzican/Qwen2.5-Math-7B-16k-think --local-dir Qwen2.5-Math-7B-16k-think
```

Training

```bash
export MODEL_PATH="Qwen2.5-Math-7B-16k-think"
bash scripts/train/train_diver.sh --model $MODEL_PATH
```

Evaluation

```bash
export CHECKPOINT_PATH="checkpoints/diver"
bash scripts/eval/eval_checkpoint.sh --model $CHECKPOINT_PATH
```

📊Main Results

Zero-RLVR performance of DIVER vs. baselines, based on Qwen2.5-Math-7B


Comparison of Pass@k performance

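Pass@k in comparisons like the one above is typically computed with the standard unbiased estimator (given n generations per problem, of which c are correct). This is the common estimator, not anything DIVER-specific:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Standard unbiased Pass@k estimator: the probability that at
    least one of k samples drawn (without replacement) from n
    generations with c correct ones is correct."""
    if n - c < k:
        # Fewer than k incorrect generations: some draw must be correct.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

print(round(pass_at_k(n=16, c=4, k=1), 4))  # → 0.25, i.e. c / n for k = 1
```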

✨Acknowledgement

DIVER builds upon veRL and DeepScaleR, and uses vLLM for inference. We use Math-Verify as the RLVR reward verifier. We thank the open-source community for datasets and backbones, including OpenR1-Math-220k, OpenR1-Math-46k, and the Qwen2.5 and Llama-3.1 models.


📝Citation

If you find our paper useful, please consider starring this repository and citing it:

```bibtex
@article{hu2025diversity,
  title={Diversity-Incentivized Exploration for Versatile Reasoning},
  author={Zican Hu and Shilin Zhang and Yafu Li and Jianhao Yan and Xuyang Hu and Leyang Cui and Xiaoye Qu and Chunlin Chen and Yu Cheng and Zhi Wang},
  journal={arXiv preprint arXiv:2509.26209},
  year={2025}
}
```

About

[ICLR 2026] The Official Implementation of DIVER
