sim · medium · robotics · metric: varies
SG-VLA: Learning Spatially-Grounded Vision-Language-Action Models for Mobile Manipulation
Description
Vision-Language-Action (VLA) models show promise for robotic control, yet performance in complex household environments remains sub-optimal. Mobile manipulation requires reasoning about global scene layout, fine-grained geometry, and high-dimensional continuous actions, making standard imitation learning insufficient. We introduce a framework for learning spatially-grounded VLA models that strengthens perception and representation through auxiliary task co-training and multi-modal input enhancem