sim · medium · vision-robot · metric · varies

Efficient Long-Horizon Vision-Language-Action Models via Static-Dynamic Disentanglement

Description

Vision-Language-Action (VLA) models have recently emerged as a promising paradigm for generalist robotic control. Built upon vision-language model (VLM) architectures, VLAs predict actions conditioned on visual observations and language instructions, achieving strong performance and generalization across tasks. However, VLAs face two major challenges: limited long-horizon context and inefficient inference due to the quadratic attention complexity and large parameter counts. Our work is motivated

Source

http://arxiv.org/abs/2602.03983v2