
From the Inside Out: Progressive Distribution Refinement for Confidence Calibration

Description

Using a model's internal information as a self-reward signal in Reinforcement Learning (RL) has received extensive attention because it requires no labels. While prior work has made significant progress in applying Test-Time Scaling (TTS) strategies to RL, the discrepancy in internal information between test time and training remains inadequately addressed. Moreover, Test-Time Training built on voting-based TTS strategies often suffers from reward hacking. To address these issu

Source

http://arxiv.org/abs/2603.16500v1