sim · medium · rl · metric · varies

Examining Reasoning LLMs-as-Judges in Non-Verifiable LLM Post-Training

Description

Reasoning LLMs-as-Judges, which can benefit from inference-time scaling, provide a promising path for extending the success of reasoning models to non-verifiable domains where the output correctness/quality cannot be directly checked. However, while reasoning judges have shown better performance on static evaluation benchmarks, their effectiveness in actual policy training has not been systematically examined. Therefore, we conduct a rigorous study to investigate the actual impact of non-reasoni

Source

http://arxiv.org/abs/2603.12246v1