MASPO: Unifying Gradient Utility, Probability Mass, and Signal Asymmetry for Robust and Sample-Efficient LLM Reasoning

Mass-Adaptive Soft Policy Optimization (MASPO)
Official Implementation

Paper · Framework: verl · Domain: LLM Reasoning · PRs Welcome

Overview

MASPO (Mass-Adaptive Soft Policy Optimization) is a unified RLVR framework for improving the robustness and sample efficiency of LLM reasoning.

Existing RLVR methods such as GRPO rely on rigid, uniform, and symmetric trust-region mechanisms, which are often misaligned with the optimization dynamics of large language models. MASPO addresses three key limitations:

  1. Inefficient gradient utilization caused by hard clipping
  2. Probability mass insensitivity under uniform ratio constraints
  3. Asymmetric signal reliability between positive and negative samples

To address these issues, MASPO combines:

  • a soft Gaussian gating mechanism,
  • a mass-adaptive limiter, and
  • an asymmetric risk controller.

Extensive experiments show that MASPO consistently improves both reasoning accuracy and sample efficiency across multiple model scales.

📖 Abstract

Current Reinforcement Learning with Verifiable Rewards (RLVR) paradigms, such as GRPO, rely on rigid, uniform, and symmetric trust region mechanisms that are fundamentally misaligned with the complex optimization dynamics of Large Language Models (LLMs). In this paper, we identify three critical disconnects in existing methods: (1) inefficient gradient utilization caused by the binary cutoff of hard clipping, (2) probability mass insensitivity arising from uniform ratio constraints that ignore the token distribution, and (3) asymmetric signal reliability stemming from the disparate credit assignment ambiguity between positive and negative samples. To bridge these gaps, we propose Mass-Adaptive Soft Policy Optimization (MASPO), a unified framework designed to harmonize these three dimensions. MASPO integrates a differentiable soft Gaussian gating to maximize gradient utility, a mass-adaptive limiter to balance exploration across the probability spectrum, and an asymmetric risk controller to align update magnitudes with signal confidence. Extensive evaluations demonstrate that MASPO serves as a robust, all-in-one RLVR solution, significantly outperforming strong baselines in sample efficiency and reasoning accuracy across diverse LLM scales (1.5B/7B/14B).

🎯 Key Contributions

  • Unified Perspective: We systematically identify inherent challenges in current trust region paradigms and propose a holistic perspective that aligns RLVR optimization by addressing three fundamental misalignments: inefficient gradient utilization, probability mass insensitivity, and asymmetric signal reliability.

  • All-in-One Framework: We propose a comprehensive solution that unifies soft Gaussian gating for continuous updates, a mass-adaptive limiter for targeted long-tail exploration, and an asymmetric risk controller for signal-aware optimization into a single framework.

  • Superior Performance: Comprehensive evaluations on diverse mathematical benchmarks consistently demonstrate that MASPO achieves superior sample efficiency and reasoning performance. Further experiments confirm its robustness across varying LLM scales and effectiveness in stabilizing long-chain reasoning.

🎨 Overview

MASPO Overview

Figure 1: Three core limitations of GRPO

🚀 Quick Start

Installation

This implementation is built on top of verl, a flexible and efficient RLHF / RLVR framework.

Please install verl first by following its official setup instructions, and then install the dependencies for this repository:

# Clone the repository
git clone https://github.com/VenomRose-Juri/MASPO-RL
cd MASPO-RL

# Install verl dependencies (see verl documentation for details)
pip install -r requirements.txt

Running MASPO

We provide a complete example script for training with MASPO on the GSM8K dataset:

bash examples/maspo_trainer/run_deepseek-r1-distill-qwen-7b.sh

Enable MASPO in Config

To use MASPO, set the following configuration fields:

actor_rollout_ref.actor.policy_loss.ratio_clip.ratio_mode: maspo
actor_rollout_ref.actor.policy_loss.ratio_clip.maspo_sigma_base: 1      # Base variance parameter
actor_rollout_ref.actor.policy_loss.ratio_clip.maspo_alpha: 0.3         # Mass-adaptive scaling factor
actor_rollout_ref.actor.policy_loss.ratio_clip.maspo_beta_pos: 0.03     # Positive risk control parameter
actor_rollout_ref.actor.policy_loss.ratio_clip.maspo_beta_neg: 0.03     # Negative risk control parameter
actor_rollout_ref.actor.policy_loss.ratio_clip.maspo_sigma_high: 10     # Upper bound for variance
actor_rollout_ref.actor.policy_loss.ratio_clip.maspo_adv_low: 0.1        # Lower bound for advantage scaling
actor_rollout_ref.actor.policy_loss.ratio_clip.maspo_adv_high: 10       # Upper bound for advantage scaling

Hyperparameter Guidelines:

  • Default recommendation: maspo_sigma_base=1, maspo_alpha=0.3, maspo_beta_pos=0.03, maspo_beta_neg=0.03 (robust baseline)
  • For smaller models (1.5B): Can use maspo_alpha=0.5 for better exploration in long-tail tokens
  • For larger models (7B+): Use maspo_alpha=0.3 to maintain stability while preserving exploration
  • Mass-adaptive scaling (maspo_alpha): Controls how strongly the trust region adapts to token probability. Recommended range: [0.3, 0.5]
  • Risk control (maspo_beta_pos, maspo_beta_neg): Modulates update magnitude based on signal confidence. Recommended: 0.03 for both
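For scripting sweeps over the guidelines above, it can be convenient to keep the recommended presets in one place. This is an illustrative helper of our own (not part of verl); it just renders the config fields listed above as dotted CLI overrides:

```python
# Suggested MASPO presets by model scale, taken from the guidelines above.
# MASPO_PRESETS and cli_overrides are convenience helpers, not verl APIs.
MASPO_PRESETS = {
    "1.5B": dict(maspo_sigma_base=1, maspo_alpha=0.5,
                 maspo_beta_pos=0.03, maspo_beta_neg=0.03),
    "7B+":  dict(maspo_sigma_base=1, maspo_alpha=0.3,
                 maspo_beta_pos=0.03, maspo_beta_neg=0.03),
}

def cli_overrides(scale: str) -> list[str]:
    """Render a preset as verl-style dotted CLI override strings."""
    prefix = "actor_rollout_ref.actor.policy_loss.ratio_clip."
    return [f"{prefix}{k}={v}" for k, v in MASPO_PRESETS[scale].items()]
```

Each returned string can be appended directly to the `python3 -m verl.trainer.main_ppo` invocation shown below.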

Example Configuration

Here's a minimal example showing how to configure MASPO:

python3 -m verl.trainer.main_ppo \
    algorithm.adv_estimator=grpo \
    data.train_files=$HOME/data/gsm8k/train.parquet \
    data.val_files=$HOME/data/gsm8k/test.parquet \
    actor_rollout_ref.model.path=deepseek-ai/DeepSeek-R1-Distill-Qwen-7B \
    actor_rollout_ref.actor.policy_loss.ratio_clip.ratio_mode=maspo \
    actor_rollout_ref.actor.policy_loss.ratio_clip.maspo_sigma_base=1 \
    actor_rollout_ref.actor.policy_loss.ratio_clip.maspo_alpha=0.3 \
    actor_rollout_ref.actor.policy_loss.ratio_clip.maspo_beta_pos=0.03 \
    actor_rollout_ref.actor.policy_loss.ratio_clip.maspo_beta_neg=0.03 \
    actor_rollout_ref.actor.use_kl_loss=False \
    algorithm.use_kl_in_reward=False \
    trainer.total_epochs=15

Full Training Script

For a complete training example, see examples/maspo_trainer/run_deepseek-r1-distill-qwen-7b.sh. This script includes:

  • GRPO advantage estimator configuration
  • MASPO policy loss configuration
  • KL loss settings
  • FSDP configuration
  • vLLM rollout settings
  • Wandb logging setup

📊 Experimental Results

Main Results

We evaluate MASPO on multiple mathematical reasoning benchmarks, including AIME24, AIME25, AMC23, MATH500, Minerva, and OlympiadBench. For the evaluation datasets, refer to 🤗 xuekai/flowrl-data-collection. The following tables show a comprehensive comparison with competitive baselines:

DeepSeek-R1-Distill-Qwen-1.5B Results

All cells report A@32 / P@32.

| Method | AIME24 | AIME25 | AMC23 | MATH500 | Minerva | Olympiad | Avg. |
|--------|--------|--------|-------|---------|---------|----------|------|
| GRPO | 33.2 / 71.8 | 27.7 / 49.9 | 79.5 / 94.8 | 77.6 / 90.8 | 26.1 / 48.8 | 46.3 / 64.7 | 48.4 / 70.1 |
| Clip Higher | 36.6 / 70.1 | 30.1 / 55.8 | 82.8 / 94.9 | 71.7 / 88.6 | 24.8 / 49.3 | 42.0 / 61.6 | 48.0 / 70.1 |
| DAC | 40.0 / 73.6 | 28.7 / 51.2 | 80.2 / 95.0 | 78.1 / 92.5 | 27.4 / 54.0 | 46.3 / 64.5 | 50.1 / 71.8 |
| Entropy Adv. | 29.6 / 67.8 | 23.9 / 44.2 | 77.1 / 94.1 | 78.5 / 89.7 | 25.2 / 48.6 | 46.4 / 64.3 | 46.8 / 68.1 |
| BAPO | 38.3 / 72.3 | 28.0 / 55.7 | 80.1 / 94.9 | 74.9 / 91.0 | 25.1 / 44.7 | 44.0 / 61.3 | 48.4 / 70.0 |
| SAPO | 39.3 / 73.7 | 28.0 / 49.9 | 82.7 / 94.8 | 79.1 / 91.4 | 29.6 / 52.7 | 48.2 / 66.7 | 51.2 / 71.5 |
| MASPO (Ours) | 41.0 / 74.8 | 28.4 / 58.0 | 82.2 / 95.0 | 78.0 / 89.7 | 30.7 / 54.1 | 47.8 / 65.7 | 51.4 / 72.9 |

DeepSeek-R1-Distill-Qwen-7B Results

All cells report A@32 / P@32.

| Method | AIME24 | AIME25 | AMC23 | MATH500 | Minerva | Olympiad | Avg. |
|--------|--------|--------|-------|---------|---------|----------|------|
| GRPO | 48.2 / 82.5 | 37.4 / 60.5 | 88.1 / 96.6 | 84.8 / 92.4 | 37.4 / 57.2 | 57.2 / 73.9 | 58.9 / 77.2 |
| Clip Higher | 47.4 / 82.4 | 37.4 / 67.2 | 89.0 / 96.6 | 85.3 / 95.4 | 38.8 / 58.5 | 57.5 / 75.5 | 59.2 / 79.3 |
| DAC | 50.2 / 81.2 | 36.2 / 61.7 | 88.1 / 99.6 | 85.1 / 94.8 | 36.6 / 58.7 | 57.9 / 76.0 | 59.0 / 78.7 |
| Entropy Adv. | 49.1 / 81.5 | 34.4 / 58.1 | 87.7 / 96.6 | 84.8 / 92.5 | 36.5 / 54.0 | 56.4 / 73.7 | 58.2 / 76.1 |
| BAPO | 47.1 / 80.3 | 37.7 / 58.4 | 89.2 / 97.2 | 85.0 / 94.5 | 38.3 / 57.5 | 57.1 / 74.5 | 59.1 / 77.1 |
| SAPO | 47.5 / 79.2 | 35.3 / 58.0 | 88.7 / 95.0 | 85.5 / 92.9 | 39.4 / 58.1 | 56.6 / 74.6 | 58.8 / 76.3 |
| MASPO (Ours) | 53.2 / 82.4 | 42.9 / 73.2 | 91.4 / 95.0 | 86.0 / 94.7 | 39.3 / 58.6 | 58.0 / 74.9 | 61.8 / 79.8 |

Main Takeaways

  • 1.5B Model: MASPO outperforms GRPO by +3.0 points (48.4 → 51.4) and the strongest baseline, SAPO, by +0.2 points (51.2 → 51.4) in Avg@32
  • 7B Model: MASPO outperforms GRPO by +2.9 points (58.9 → 61.8) and the strongest baseline, Clip Higher, by +2.6 points (59.2 → 61.8) in Avg@32
  • MASPO achieves the best score on the majority of benchmarks at both scales

Scalability Analysis

| Method | 1.5B (A@32 / P@32) | 7B (A@32 / P@32) | 14B (A@32 / P@32) |
|--------|--------------------|------------------|-------------------|
| GRPO | 48.4 / 70.1 | 58.9 / 77.2 | 53.6 / 67.4 |
| MASPO | 51.4 / 72.9 | 61.8 / 79.8 | 56.4 / 71.1 |
| Improvement | +3.0 / +2.8 | +2.9 / +2.6 | +2.8 / +3.7 |

MASPO demonstrates consistent improvements across all model scales, confirming its scalability and robustness.

🔬 Algorithm Overview

Core Idea

MASPO addresses three fundamental misalignments in current RLVR paradigms by proposing a unified framework:

  1. Inefficient Gradient Utilization: Hard clipping imposes a binary cutoff that discards valuable directional gradients from exploratory samples exceeding the boundary, thereby significantly diminishing the effective utilization of informative gradient signals.

  2. Probability Mass Insensitivity: Uniform ratio constraints ignore the vast disparity in token probabilities, failing to account for the massive mass displacement in head tokens versus the negligible shift in tail tokens.

  3. Asymmetric Signal Reliability: Symmetric advantage handling ignores the disparate signal-to-noise ratios between verified positive solutions and ambiguous negative ones.

Mathematical Formulation

The MASPO framework integrates three key components:

Soft Gaussian Gating

MASPO replaces hard clipping with a differentiable soft Gaussian gating mechanism:

$$ \mathcal{F}^{\mathrm{MASPO}}_{i,t} = \begin{cases} \exp \left( -\frac{(\rho_{i,t}(\theta)-1)^{2}}{2 \sigma^{2}_{\mathrm{pos}}} \right) & \text{if } \hat{A}_{i,t}>0 \land \rho_{i,t}(\theta)>1 \\ \exp \left( -\frac{(\rho_{i,t}(\theta)-1)^{2}}{2 \sigma^{2}_{\mathrm{neg}}} \right) & \text{if } \hat{A}_{i,t}<0 \land \rho_{i,t}(\theta)<1 \\ 1, & \text{otherwise} \end{cases} $$
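As a concrete reference, the piecewise gate above can be written as a per-token scalar function. This is an illustrative sketch of our own (the repository's actual implementation in core_algos.py operates on PyTorch tensors), and the name maspo_gate is ours:

```python
import math

def maspo_gate(ratio: float, adv: float,
               sigma_pos: float, sigma_neg: float) -> float:
    """Soft Gaussian gating factor F^MASPO for one token (scalar sketch).

    Only aggressive overshoots are attenuated: ratios above 1 when the
    advantage is positive, and ratios below 1 when it is negative.
    All other updates pass through with a gate of exactly 1.
    """
    if adv > 0 and ratio > 1.0:
        return math.exp(-((ratio - 1.0) ** 2) / (2.0 * sigma_pos ** 2))
    if adv < 0 and ratio < 1.0:
        return math.exp(-((ratio - 1.0) ** 2) / (2.0 * sigma_neg ** 2))
    return 1.0
```

Unlike hard clipping, the gate decays smoothly toward zero as the ratio leaves the trust region, so samples that marginally overshoot still contribute attenuated gradients.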

Dual-Variable Adaptive Variance

The variance $\sigma$ is dynamically determined by combining mass-adaptive scaling and asymmetric risk control:

$$ \sigma_{\mathrm{pos}} = \underbrace{\frac{\sigma_{\mathrm{base}}}{\pi_{\theta_{\mathrm{old}}}^{\alpha}}}_{\mathrm{Mass-Adaptive}} \cdot \underbrace{\left( 1+\beta_{\mathrm{pos}} \hat{A}_{i,t} \right)}_{\mathrm{Risk\ Controller}} $$

$$ \sigma_{\mathrm{neg}} = \underbrace{\frac{\sigma_{\mathrm{base}}}{\pi_{\theta_{\mathrm{old}}}^{\alpha}}}_{\mathrm{Mass-Adaptive}} \cdot \underbrace{\left(1-\beta_{\mathrm{neg}} \hat{A}_{i,t} \right)^{-1}}_{\mathrm{Risk\ Controller}} $$

Where:

  • Mass-Adaptive Limiter: The term $\frac{\sigma_{\mathrm{base}}}{\pi_{\theta_{\mathrm{old}}}^{\alpha}}$ inversely scales the trust region width with token probability, expanding exploration budget for low-probability tokens while enforcing strict constraints for high-probability tokens.
  • Asymmetric Risk Controller: The second component modulates the trust region based on signal confidence, expanding for high-confidence positive signals and constraining for ambiguous negative signals.
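Under the same per-token scalar simplification, the two variances can be sketched as follows. The defaults mirror the config above (β here corresponds to maspo_beta_pos / maspo_beta_neg); the helper name is ours, and the clamping implied by maspo_sigma_high and the maspo_adv_* bounds is omitted:

```python
def maspo_sigma(pi_old: float, adv: float,
                sigma_base: float = 1.0, alpha: float = 0.3,
                beta_pos: float = 0.03, beta_neg: float = 0.03) -> float:
    """Dual-variable adaptive variance for one token (scalar sketch).

    pi_old is the token's probability under the behavior policy.
    """
    # Mass-adaptive limiter: rarer tokens get a wider trust region.
    base = sigma_base / (pi_old ** alpha)
    if adv > 0:
        # Risk controller (positive): expand for confident positive signals.
        return base * (1.0 + beta_pos * adv)
    # Risk controller (negative): contract for ambiguous negative signals
    # (adv <= 0, so 1 - beta_neg * adv >= 1).
    return base / (1.0 - beta_neg * adv)
```

The resulting sigma feeds directly into the Gaussian gate above: low-probability tokens and confident positive signals get a wider gate, while high-probability tokens and ambiguous negative signals get a tighter one.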

MASPO Method

Figure 2: MASPO method illustration

Key Advantages

  1. Continuous Optimization Landscape: Soft Gaussian gating converts the disjoint cliff (characteristic of binary clipping) into a smooth and continuous manifold, ensuring samples that marginally exceed the trust region contribute attenuated but non-zero gradients.

  2. Targeted Exploration: Mass-adaptive scaling allocates exploration budget dynamically based on token probability mass, enabling efficient exploration in long-tail regions while maintaining stability in high-probability regions.

  3. Signal-Aware Updates: Asymmetric risk control aligns update magnitudes with signal confidence, maximizing utilization of verified positive signals while preventing catastrophic unlearning from ambiguous negative signals.

πŸ“ Project Structure

maspo/
β”œβ”€β”€ verl/
β”‚   β”œβ”€β”€ trainer/ppo/
β”‚   β”‚   └── core_algos.py          # MASPO implementation
β”‚   β”œβ”€β”€ workers/actor/
β”‚   β”‚   └── dp_actor.py            # Actor with MASPO support
β”‚   └── trainer/config/actor/
β”‚       └── actor.yaml             # Configuration file
β”œβ”€β”€ examples/
β”‚   └── maspo_trainer/
β”‚       └── run_deepseek-r1-distill-qwen-7b.sh        # Example training script
└── README.md

🔧 Implementation Details

Code Location

  • Core Algorithm: verl/trainer/ppo/core_algos.py (lines 907-922)
  • Actor Integration: verl/workers/actor/dp_actor.py (uses compute_policy_loss with MASPO mode)
  • Configuration: verl/trainer/config/actor/actor.yaml (ratio_clip section)

Key Implementation Points

  1. MASPO is integrated into the existing PPO framework via the ratio_clip_config.ratio_mode parameter
  2. When ratio_mode="maspo", the algorithm applies the soft Gaussian gating with dual-variable adaptive variance
  3. The implementation supports both FSDP and Megatron backends
  4. The unilateral design selectively attenuates aggressive overshoots without hindering conservative updates

πŸ” Related Work

This work builds upon and improves several existing methods:

  • GRPO (Group Relative Policy Optimization): The baseline hard-clipping method
  • SAPO (Soft Advantage Policy Optimization): Soft clipping with global gating mechanism
  • DAC (Dynamic Adaptive Clipping): Adapts bounds based on policy probabilities
  • BAPO (Balanced Advantage Policy Optimization): Dynamically adjusts clipping bounds to ensure positive sample contribution
  • Clip Higher: Relaxes upper bounds globally but ignores token-specific probabilities
  • Entropy Advantage: Adds entropy terms to the advantage for implicit rebalancing

For more details on the theoretical analysis and comparison, please refer to our paper.

πŸ› Troubleshooting

Common Issues

  1. Training Instability: If you encounter training collapse, try reducing maspo_alpha or maspo_beta_pos/maspo_beta_neg values. For larger models, use more conservative settings (e.g., alpha=0.3, beta=0.03).

  2. Memory Issues: Ensure you have sufficient GPU memory. Consider using gradient checkpointing and parameter offloading:

    actor_rollout_ref.model.enable_gradient_checkpointing: True
    actor_rollout_ref.actor.fsdp_config.param_offload: True
  3. Convergence Speed: If training is too slow, you can increase the learning rate slightly, but monitor for stability. MASPO's soft gating mechanism generally allows for more aggressive learning rates compared to hard clipping.

Performance Tips

  • Use use_remove_padding=True for better memory efficiency
  • Enable use_fused_kernels=True if your model supports it
  • Adjust ppo_mini_batch_size and ppo_micro_batch_size_per_gpu based on your GPU memory
  • For long-tail exploration, consider increasing maspo_alpha to 0.5 for smaller models
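Assuming verl's dotted CLI override syntax (as in the Quick Start example), the tips above might translate into flags like the following. The exact config paths and the batch-size values are illustrative placeholders — check your actor.yaml for the authoritative field names:

```shell
python3 -m verl.trainer.main_ppo \
    actor_rollout_ref.model.use_remove_padding=True \
    actor_rollout_ref.model.use_fused_kernels=True \
    actor_rollout_ref.actor.ppo_mini_batch_size=64 \
    actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=4 \
    actor_rollout_ref.actor.policy_loss.ratio_clip.maspo_alpha=0.5
```

The last override (maspo_alpha=0.5) is the smaller-model setting; keep 0.3 for 7B and above.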

Hyperparameter Tuning

  • Mass-Adaptive Scaling (maspo_alpha):

    • Lower values (0.1-0.2): More conservative, similar to GRPO behavior
    • Recommended range: 0.3-0.5 for balanced exploration and stability
    • Higher values (0.8+): May cause instability, use with caution
  • Risk Control (maspo_beta_pos, maspo_beta_neg):

    • Default: 0.03 for both parameters
    • Higher values: More aggressive signal-aware updates
    • Lower values: More conservative, closer to uniform handling

📚 Citation

If you find MASPO helpful in your research or applications, please consider citing our paper:

@article{maspo,
  title={MASPO: Unifying Gradient Utilization, Probability Mass, and Signal Reliability for Robust and Sample-Efficient LLM Reasoning},
  author={Fu, Xiaoliang and Lin, Jiaye and Fang, Yangyi and Zheng, Binbin and Hu, Chaowen and Shao, Zekai and Qin, Cong and Pan, Lu and Zeng, Ke and Cai, Xunliang},
  journal={arXiv preprint arXiv:2602.17550},
  year={2026}
}

πŸ™ Acknowledgments

This implementation is built on top of verl, a flexible and efficient RLHF framework. We thank the verl community for their excellent infrastructure.

πŸ“ License

This project follows the same license as verl. Please refer to the verl repository for license details.


⭐ If MASPO helps your research or applications, please give us a star! ⭐

Note: This is the official implementation of MASPO. For more details, please refer to our paper ❤️.
