Gymnasium-style RL framework for LLM agent training — MDP environments, three-layer process reward & SFT/DPO/GRPO policy optimization. CLI + MCP ready.
Updated Mar 15, 2026 - Python
CRYSTAL: Beyond Final Answers: Benchmark for Transparent Multimodal Reasoning Evaluation | arXiv 2603.13099