Reference-guided Policy Optimization for Molecular Optimization via LLM Reasoning

Description

Large language models (LLMs) benefit substantially from supervised fine-tuning (SFT) and reinforcement learning with verifiable rewards (RLVR) in reasoning tasks. However, these recipes perform poorly in instruction-based molecular optimization, where each data point typically provides only a single optimized reference molecule and no step-by-step optimization trajectory. We reveal that answer-only SFT on the reference molecules collapses reasoning, and RLVR provides sparse feedback under simila

Source

http://arxiv.org/abs/2603.05900v1