sim · medium · manipulation-data · metric · varies

ST-VLA: Enabling 4D-Aware Spatiotemporal Understanding for General Robot Manipulation

Description

Robotic manipulation in open-world environments requires reasoning across semantics, geometry, and long-horizon action dynamics. Existing hierarchical Vision-Language-Action (VLA) frameworks typically use 2D representations to connect high-level reasoning with low-level control, but lack depth awareness and temporal consistency, limiting robustness in complex 3D scenes. We propose ST-VLA, a hierarchical VLA framework using a unified 3D-4D representation to bridge perception and action. ST-VLA co

Source

http://arxiv.org/abs/2603.13788v1