sim · medium · robotics · metric · varies

AerialVLA: A Vision-Language-Action Model for UAV Navigation via Minimalist End-to-End Control

Description

Vision-Language Navigation (VLN) for Unmanned Aerial Vehicles (UAVs) demands complex visual interpretation and continuous control in dynamic 3D environments. Existing hierarchical approaches rely on dense oracle guidance or auxiliary object detectors, creating semantic gaps and limiting genuine autonomy. We propose AerialVLA, a minimalist end-to-end Vision-Language-Action framework mapping raw visual observations and fuzzy linguistic instructions directly to continuous physical control signals.

Source

http://arxiv.org/abs/2603.14363v1