Test-time RL alignment exposes task familiarity artifacts in LLM benchmarks

Description

Direct evaluation of LLMs on benchmarks can be misleading because comparatively strong performance may reflect task familiarity rather than capability. The train-before-test approach controls for task familiarity by giving each model task-relevant training before evaluation, originally through supervised finetuning. However, suitable training data is often hard to come by, and evaluation results vary with the data chosen. In this paper, we propose a two-stage test-time reinforcement learning (RL

Source

http://arxiv.org/abs/2603.12875v1