sim · medium · imitation · metric · varies

Fine-Tuning Vision-Language-Action Models: Optimizing Speed and Success

Description

Recent vision-language-action models (VLAs) build upon pretrained vision-language models and leverage diverse robot datasets to demonstrate strong task execution, language-following ability, and semantic generalization. Despite these successes, VLAs struggle with novel robot setups and require fine-tuning to achieve good performance, yet it is unclear how to fine-tune them most effectively given the many possible strategies. In this work, we study key VLA adaptation design choices such as different act

Source

http://arxiv.org/abs/2502.19645v2