# ProxMO · Official Implementation
Practical credit assignment for multi-turn reinforcement learning with large language models
ProxMO is a lightweight, practical framework for multi-turn LLM agent training that addresses a fundamental challenge: context-dependent credit assignment.
Existing group-based policy optimization methods struggle with multi-turn tasks for two reasons:

- 📈 **Episode-level**: identical statistical deviations carry vastly different informational value
  - A failure at a 90% success rate likely reflects noise 🔴
  - A success at a 10% success rate represents a breakthrough 🟢
  - Yet both receive identical gradient magnitudes!
- 🔍 **Step-level**: exact state matching fragments semantically similar states into singletons
  - Strict matching → singleton groups, where normalization is undefined
  - Loose matching → equal weighting of dissimilar states
  - Cannot distinguish action quality within trajectories
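The singleton failure mode can be made concrete: group-standardized advantages are undefined for a group of size one, because its standard deviation is zero. A minimal sketch (illustrative helper, not the repository's code):

```python
def standardized_advantages(returns):
    """Group-standardized advantages, as used by group-based methods (GRPO-style).

    Illustrative helper, not the repository's implementation.
    """
    n = len(returns)
    mean = sum(returns) / n
    std = (sum((r - mean) ** 2 for r in returns) / n) ** 0.5
    if std == 0.0:
        # Singleton group (or all-identical returns): (R - mean) / std is undefined.
        return None
    return [(r - mean) / std for r in returns]

# Strict state matching often yields singleton groups:
print(standardized_advantages([1.0]))       # None — normalization breaks down
print(standardized_advantages([1.0, 0.0]))  # [1.0, -1.0]
```

Loosening the match avoids the `None` case but then averages over dissimilar states, which is the opposite failure mode.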
ProxMO introduces two lightweight mechanisms:

- 🎚️ **Success-rate-aware modulation** (episode-level)
  - Adapts gradient intensity to task difficulty
  - Amplifies rare successes on hard tasks (low success rate)
  - Attenuates noisy failures on easy tasks (high success rate)
- 🔗 **Proximity-based soft aggregation** (step-level)
  - Replaces hard group boundaries with continuous weighting
  - All states contribute in proportion to semantic proximity
  - Eliminates singleton degeneracy and enables robust baseline estimation
ProxMO adapts gradient intensity to task difficulty using the empirical success rate p:

| Regime | Modulation | Effect |
|---|---|---|
| Low success (p → 0) | Rare successes amplified (w ≈ 1.05) | ✓ Consolidates breakthroughs |
| Medium success (p ≈ 0.5) | Minimal modulation (w ≈ 1.0) | ✓ Balanced learning |
| High success (p → 1) | Noisy failures attenuated (w ≈ 0.95) | ✓ Reduces over-correction |

Mathematical formulation:

```
w(R, p) = 1 + β · f(R, p)
```

where f(R, p) uses sigmoid-based functions to detect task difficulty and adjust gradient strength asymmetrically.
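One way such a modulation weight could look is sketched below. The sigmoid choice for f and the values of β and k are illustrative assumptions, not the paper's actual functional form or hyperparameters:

```python
import math

def modulation_weight(reward: float, p: float, beta: float = 0.05, k: float = 10.0) -> float:
    """Sketch of success-rate-aware modulation w(R, p) = 1 + beta * f(R, p).

    beta, k, and the sigmoid form of f are illustrative assumptions.
    Successes are boosted most when the empirical success rate p is low;
    failures are damped most when p is high.
    """
    if reward > 0:
        # Sigmoid in p: f ~ 1 as p -> 0 (rare success), f ~ 0 as p -> 1
        f = 1.0 / (1.0 + math.exp(k * (p - 0.5)))
        return 1.0 + beta * f
    else:
        # Mirrored sigmoid: f ~ 1 as p -> 1 (noisy failure), f ~ 0 as p -> 0
        f = 1.0 / (1.0 + math.exp(-k * (p - 0.5)))
        return 1.0 - beta * f

print(round(modulation_weight(1.0, 0.01), 2))  # rare success amplified, w ≈ 1.05
print(round(modulation_weight(0.0, 0.99), 2))  # noisy failure attenuated, w ≈ 0.95
```

The asymmetry is the point: the same |f| is applied additively for successes and subtractively for failures, so gradient strength depends on both the outcome and the difficulty regime.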
Instead of hard state matching, ProxMO computes baselines via continuous similarity weighting:
```
Traditional (GiGPO)               ProxMO
┌─────────┐                       ┌──────────┐
│  State  │  Exact match?         │  State   │  TF-IDF similarity
│ Cluster │  ↓ Yes/No             │ Weights  │  ↓ Continuous
└─────────┘                       └──────────┘
Hard boundaries                   Soft weighting
Singleton degeneracy              Robust estimation
```
Key advantages:
- ✅ All states contribute proportionally to semantic proximity
- ✅ Eliminates singleton groups in high-dimensional spaces
- ✅ Smooth gradient flow via continuous weighting
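A minimal pure-Python sketch of the idea follows. The actual implementation in `proxmo/core_proxmo.py` is vectorized and will differ in weighting details; the tokenization and TF-IDF scheme here are simplifying assumptions:

```python
import math
from collections import Counter

def tfidf(states):
    """Simple whitespace-token TF-IDF vectors (illustrative, not the repo's scheme)."""
    tokens = [s.lower().split() for s in states]
    df = Counter(t for toks in tokens for t in set(toks))
    n = len(states)
    return [{t: (c / len(toks)) * (math.log((1 + n) / (1 + df[t])) + 1.0)
             for t, c in Counter(toks).items()}
            for toks in tokens]

def cosine(u, v):
    """Cosine similarity between two sparse (dict) vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def soft_baselines(states, returns):
    """Baseline per state: similarity-weighted average of all states' returns.

    Self-similarity is 1, so the normalizer is always >= 1: no singleton
    degeneracy, and every state contributes in proportion to its proximity.
    """
    vecs = tfidf(states)
    out = []
    for vi in vecs:
        w = [cosine(vi, vj) for vj in vecs]
        out.append(sum(wi * r for wi, r in zip(w, returns)) / sum(w))
    return out

b = soft_baselines(["open the drawer", "open the drawer", "go to kitchen"],
                   [1.0, 0.0, 0.5])
print([round(x, 2) for x in b])  # similar states share a baseline; dissimilar ones don't
```

Note that a strict-matching baseline would put `"go to kitchen"` in a singleton group; here it still gets a well-defined baseline (its own return), while the two identical states pool theirs.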
ProxMO consistently outperforms existing methods on challenging multi-turn interactive benchmarks:
**1.5B model** — ALFWorld per-task and overall success rates, plus WebShop Score and Success:
| Method | Pick | Look | Clean | Heat | Cool | Pick2 | All | Score | Succ. |
|---|---|---|---|---|---|---|---|---|---|
| Closed-Source Models | |||||||||
| GPT-4o | 75.3 | 60.8 | 31.2 | 56.7 | 21.6 | 49.8 | 48.0 | 31.8 | 23.7 |
| Gemini-2.5-Pro | 92.8 | 63.3 | 62.1 | 69.0 | 26.6 | 58.7 | 60.3 | 42.5 | 35.9 |
| Open-Source Baselines | |||||||||
| Base | 5.9 | 5.5 | 3.3 | 9.7 | 4.2 | 0.0 | 4.1 | 25.1 | 6.3 |
| ReAct | 17.4 | 20.5 | 15.7 | 6.2 | 7.7 | 2.0 | 12.8 | 42.1 | 14.3 |
| Reflexion | 37.8 | 24.0 | 23.3 | 14.5 | 20.7 | 3.9 | 23.5 | 58.6 | 23.5 |
| RL Methods | |||||||||
| GRPO | 80.0 | 50.0 | 75.0 | 88.9 | 63.2 | 50.0 | 70.3 | 73.1 | 52.2 |
| GiGPO | 95.3 | 80.2 | 92.9 | 92.7 | 70.6 | 78.5 | 85.2 | 81.7 | 62.3 |
| ProxMO (Ours) | 94.3 | 92.9 ✨ | 89.3 | 92.2 | 89.5 ✨ | 87.0 ✨ | 90.6 ✨ | 85.3 ✨ | 67.1 ✨ |
| Δ vs GRPO | +17.9% | +85.8% | +19.1% | +3.7% | +41.7% | +74.0% | +28.9% | +16.7% | +28.6% |
**7B model** — ALFWorld per-task and overall success rates, plus WebShop Score and Success:
| Method | Pick | Look | Clean | Heat | Cool | Pick2 | All | Score | Succ. |
|---|---|---|---|---|---|---|---|---|---|
| Closed-Source Models | |||||||||
| GPT-4o | 75.3 | 60.8 | 31.2 | 56.7 | 21.6 | 49.8 | 48.0 | 31.8 | 23.7 |
| Gemini-2.5-Pro | 92.8 | 63.3 | 62.1 | 69.0 | 26.6 | 58.7 | 60.3 | 42.5 | 35.9 |
| Open-Source Baselines | |||||||||
| Base | 34.8 | 22.9 | 18.1 | 7.3 | 2.5 | 3.6 | 16.2 | 25.1 | 8.4 |
| ReAct | 50.1 | 33.8 | 35.7 | 12.5 | 17.3 | 18.9 | 29.8 | 47.8 | 21.0 |
| Reflexion | 63.4 | 40.2 | 46.5 | 29.7 | 37.9 | 22.6 | 44.1 | 56.3 | 30.2 |
| RL Methods | |||||||||
| GRPO | 90.7 | 66.2 | 94.1 | 91.2 | 78.9 | 70.5 | 79.8 | 79.2 | 67.2 |
| GiGPO | 97.5 | 81.3 | 88.5 | 85.7 | 90.0 | 83.5 | 89.5 | 85.5 | 74.8 |
| ProxMO (Ours) | 98.4 ✨ | 88.6 ✨ | 95.7 ✨ | 93.8 ✨ | 91.3 ✨ | 89.8 ✨ | 94.5 ✨ | 87.2 ✨ | 76.5 ✨ |
| Δ vs GRPO | +8.5% | +33.8% | +1.7% | +2.8% | +15.6% | +27.5% | +18.3% | +10.1% | +13.8% |
| Metric | 1.5B | 7B | Win Rate |
|---|---|---|---|
| ALFWorld Overall | 70.3 → 90.6 ✨ | 79.8 → 94.5 ✨ | 2/2 |
| WebShop Score | 73.1 → 85.3 ✨ | 79.2 → 87.2 ✨ | 2/2 |
| WebShop Success | 52.2 → 67.1 ✨ | 67.2 → 76.5 ✨ | 2/2 |
| vs GiGPO | 90.6 vs 85.2 | 94.5 vs 89.5 | 2/2 |
🏆 Absolute Winner: ProxMO beats both GRPO baseline and GiGPO across all major metrics
Create the conda environment and install ProxMO:

```bash
conda create -n proxmo python=3.10 -y
conda activate proxmo
pip install -e .
```

📦 Install dependencies:

```bash
pip install gymnasium==0.29.1 stable-baselines3==2.6.0 alfworld
```

⬇️ Download required assets (stored in `~/.cache/alfworld/`):

```bash
alfworld-download -f
```

✅ Test the installation:

```bash
alfworld-play-tw
```
⚠️ Important: WebShop requires Python ≤3.10. Create a separate environment if needed.

📥 Install WebShop:

```bash
cd ./agent_system/environments/env_package/webshop/webshop
./setup.sh -d all
```

💡 Note: If you encounter issues with Google Drive downloads, manually download the required files or use an alternative download method.
After installing the environments and dependencies, run ProxMO training with:

🎮 Train on ALFWorld:

```bash
bash examples/proxmo_trainer/run_alfworld.sh
```

🛍️ Train on WebShop:

```bash
bash examples/proxmo_trainer/run_webshop.sh
```

These scripts handle all configuration and setup automatically. Refer to the scripts for customization options.
```
├── proxmo/
│   └── core_proxmo.py           # ProxMO algorithm (episode/step advantage)
├── verl/
│   ├── trainer/
│   │   ├── main_ppo.py          # Training entry point
│   │   └── ppo/
│   │       └── ray_trainer.py   # Ray training; ProxMO via AdvantageEstimator.ProxMO
│   └── workers/
│       └── reward_manager/      # Other reward managers (e.g. DAPO, Prime)
├── agent_system/
│   └── environments/
│       ├── env_package/
│       │   ├── alfworld/        # ALFWorld environment
│       │   └── webshop/         # WebShop environment
│       └── env_manager.py       # Environment management
└── examples/
    └── proxmo_trainer/          # Training scripts
        ├── run_alfworld.sh
        └── run_webshop.sh
```
- ✅ Step-wise interaction loops with environment feedback integration
- ✅ Customizable memory modules for history management and context tracking
- ✅ Flexible per-step input structures supporting diverse observation types
- 📌 Works with any sequential decision-making task (embodied, web, text-based)
- ✅ Parallelized environment rollouts (no speed degradation)
- ✅ Efficient group-based advantage estimation (no critic networks)
- ✅ Minimal computational overhead (+1.09% vs GRPO)
- 📊 Vectorized TF-IDF computation, O(N²) complexity fully parallelizable
- ✅ Plug-and-play integration with existing GRPO pipelines
- ✅ Hyperparameter-robust design (stable across domains & scales)
- ✅ No architectural changes required to existing models
- 🛡️ Drop-in replacement with immediate deployment capability
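To illustrate the drop-in claim: in verl-style pipelines the advantage estimator is typically selected via a Hydra-style override on the training entry point. The key and value below are assumptions; the provided run scripts are the authoritative configuration:

```shell
# Hypothetical override (key/value names are assumptions; see
# examples/proxmo_trainer/run_alfworld.sh for the real configuration).
python3 -m verl.trainer.main_ppo algorithm.adv_estimator=proxmo
```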
- ✅ Extensive experiments on 2 benchmarks (ALFWorld, WebShop)
- ✅ Multiple model scales tested (1.5B, 7B parameters)
- ✅ Full ablation studies showing that each mechanism contributes independently
- ✅ Hyperparameter sensitivity analysis demonstrating robustness
- 📋 Consistency across diverse task categories and difficulty regimes
If you find ProxMO helpful in your research or applications, please consider citing our paper:
```bibtex
@article{proxmo,
  title={Proximity-Based Multi-Turn Optimization: Practical Credit Assignment for LLM Agent Training},
  author={Fang, Yangyi and Lin, Jiaye and Fu, Xiaoliang and Qin, Cong and Shi, Haolin and Liu, Chang and Zhao, Peilin},
  journal={arXiv preprint arXiv:2602.19225},
  year={2026}
}
```

⭐ If ProxMO helps your research or applications, please give us a star! ⭐
Note: This is the official implementation of ProxMO. For more details, please refer to our paper ❤️.
