MAPLE: Elevating Medical Reasoning from Statistical Consensus to Process-Led Alignment

Description

Recent advances in medical large language models have explored Test-Time Reinforcement Learning (TTRL) to enhance reasoning. However, standard TTRL often relies on majority voting (MV) as a heuristic supervision signal, which can be unreliable in complex medical scenarios where the most frequent reasoning path is not necessarily the clinically correct one. In this work, we propose a novel and unified training paradigm that integrates medical process reward models with TTRL to bridge the gap betw
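To make the contrast in the description concrete, here is a minimal Python sketch of how a majority-vote pseudo-label can disagree with a label chosen by process-level scores. It is not the paper's method: the function names, the toy per-step scores, and the mean-then-sum aggregation are all assumptions made for illustration.

```python
from collections import Counter


def majority_vote_label(answers):
    """Return the most frequent final answer among the sampled rollouts."""
    return Counter(answers).most_common(1)[0][0]


def majority_vote_rewards(answers):
    """Give reward 1.0 to rollouts that match the majority answer, else 0.0."""
    label = majority_vote_label(answers)
    return [1.0 if a == label else 0.0 for a in answers]


def process_weighted_label(answers, step_scores):
    """Pick the answer whose rollouts accumulate the highest process score.

    step_scores[i] holds hypothetical per-step scores in [0, 1] for rollout i.
    A rollout's score is the mean of its step scores (an assumed aggregation),
    and scores are summed over rollouts that share the same final answer.
    """
    totals = {}
    for answer, scores in zip(answers, step_scores):
        totals[answer] = totals.get(answer, 0.0) + sum(scores) / len(scores)
    return max(totals, key=totals.get)


# Toy data: three weakly reasoned rollouts answer "A",
# two well-reasoned rollouts answer "B".
answers = ["A", "A", "A", "B", "B"]
step_scores = [[0.2, 0.3], [0.1, 0.3], [0.2, 0.2], [0.9, 0.8], [0.9, 1.0]]

mv_label = majority_vote_label(answers)
mv_rewards = majority_vote_rewards(answers)
prm_label = process_weighted_label(answers, step_scores)
```

On this toy input the majority label is "A" while the process-weighted label is "B", which is the kind of disagreement the description attributes to cases where the most frequent reasoning path is not the clinically correct one.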

Source

http://arxiv.org/abs/2603.08987v1