Hi! Thanks for your great work, but I have some questions about the GRPO reward. While the paper describes a process reward scored across 8 dimensions, the code actually provides doctor_reward (5 dimensions) and doctor_reward_v2 (7 dimensions). Furthermore, the reward weights do not match those in the paper, and the code defaults to calling doctor_reward. Additionally, only the final process reward of the whole trajectory (not turn-by-turn rewards) is used in the loss calculation, which seems to deviate from the core logic proposed in the paper.
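To make the last point concrete, here is a minimal sketch (not the repo's actual code; turn_scores is a hypothetical list of per-turn judge scores) contrasting the two credit-assignment schemes:

```python
def trajectory_level(turn_scores):
    """Observed behavior: broadcast the final trajectory score to every turn."""
    final = turn_scores[-1]
    return [final] * len(turn_scores)

def turn_level(turn_scores):
    """Paper's description as I read it: each turn keeps its own process reward."""
    return list(turn_scores)

scores = [0.2, 0.5, 0.9]
print(trajectory_level(scores))  # [0.9, 0.9, 0.9]
print(turn_level(scores))        # [0.2, 0.5, 0.9]
```

With trajectory-level broadcasting, early turns receive no distinct credit signal, which is why I would expect the turn-by-turn variant if the paper's process-reward formulation is intended. Could you clarify which behavior is correct, and whether doctor_reward_v2 with the paper's weights is the version that should be used?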