sim · medium · vision-robot · metric · varies
Efficient Long-Horizon Vision-Language-Action Models via Static-Dynamic Disentanglement
Description
Vision-Language-Action (VLA) models have recently emerged as a promising paradigm for generalist robotic control. Built upon vision-language model (VLM) architectures, VLAs predict actions conditioned on visual observations and language instructions, achieving strong performance and generalization across tasks. However, VLAs face two major challenges: limited long-horizon context and inefficient inference due to the quadratic attention complexity and large parameter counts. Our work is motivated