sim · medium · vision-robot · metric: varies

SVLL: Staged Vision-Language Learning for Physically Grounded Embodied Task Planning

Description

Embodied task planning requires vision-language models to generate action sequences that are both visually grounded and causally coherent over time. However, existing training paradigms face a critical trade-off: joint end-to-end training often leads to premature temporal binding, while standard reinforcement learning methods suffer from optimization instability. To bridge this gap, we present Staged Vision-Language Learning (SVLL), a unified three-stage framework for robust, physically grounded embodied task planning.

Source

http://arxiv.org/abs/2603.11563v1