We build and test quantitative reasoning abilities in small generative models, skipping the SFT phase and going directly to an RL phase so that reasoning is learned without supervised data.
reinforcement-learning chain-of-thought vision-language-models qwen2-vl deepseek-r1 grpo deepseek-r1-zero smolvlm no-sft
Updated Jan 6, 2026 - Jupyter Notebook