sim · medium · manipulation-data · metric: varies
FutureVLA: Joint Visuomotor Prediction for Vision-Language-Action Model
Description
Predictive foresight is important to intelligent embodied agents. Since the motor execution of a robot is intrinsically constrained by its visual perception of environmental geometry, effectively anticipating the future requires capturing this tightly coupled visuomotor interplay. While recent vision-language-action models attempt to incorporate future guidance, they struggle with this joint modeling. Existing explicit methods divert capacity to task-irrelevant visual details, whereas implicit m