sim · medium · offline-rl · metric · varies

floq: Training Critics via Flow-Matching for Scaling Compute in Value-Based RL

Description

A hallmark of modern large-scale machine learning techniques is the use of training objectives that provide dense supervision to intermediate computations, such as teacher forcing the next token in language models or denoising step-by-step in diffusion models. This enables models to learn complex functions in a generalizable manner. Motivated by this observation, we investigate the benefits of iterative computation for temporal difference (TD) methods in reinforcement learning (RL). Typically th

Source

http://arxiv.org/abs/2509.06863v2