CRYSTAL: Beyond Final Answers: Benchmark for Transparent Multimodal Reasoning Evaluation | arXiv 2603.13099
benchmark reinforcement-learning computer-vision deep-learning evaluation vqa dartmouth reasoning multimodal chain-of-thought mllm llm-evaluation vision-language-models multimodal-reasoning grpo process-reward step-level-evaluation
Updated Mar 18, 2026