sim · medium · robotics · metric · varies

SG-VLA: Learning Spatially-Grounded Vision-Language-Action Models for Mobile Manipulation

Description

Vision-Language-Action (VLA) models show promise for robotic control, yet performance in complex household environments remains sub-optimal. Mobile manipulation requires reasoning about global scene layout, fine-grained geometry, and high-dimensional continuous actions, making standard imitation learning insufficient. We introduce a framework for learning spatially-grounded VLA models that strengthens perception and representation through auxiliary task co-training and multi-modal input enhancement.

Source

http://arxiv.org/abs/2603.22760v1